"Nothing is better than the wordless teaching and the advantage of non-action"
The cluster consists of the front-end and a number of computational nodes. The user submits a job from the front-end, and the queueing system takes care of reserving the requested resources and running the job when the time is optimal. Advanced job scheduling assures fair access for everybody and optimizes overall system efficiency.
Please note that the cluster is built almost entirely from grant funds. Scientists at CAMK agreed to contribute to the shared computational cluster instead of individually managing hardware purchased from grants. Please consider such a contribution when applying for grants.
To use the cluster one has to `ssh` to its frontend node, chuck. It is accessible to each user from the internal network at Bartycka - there are no separate accounts on the cluster. SSH keys are required. Still, to use the cluster you have to contact cluster@camk.edu.pl in order to get a /work/chuck/<user> directory and, optionally, a quota (please indicate whether you are an employee, guest, student or member of some group, and how much space you need - by default employees get 1 TB).
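For example, from any linux machine on the internal network (a sketch, assuming the short hostname chuck resolves there):

ssh chuck   # your usual CAMK username is used automatically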
The Rocky 9 system on chuck enforces higher security standards, and old, less secure ssh keys have been deprecated. If you can't access chuck with your old RSA keys, please generate new ED25519 keys; this will not affect the old RSA keys. If you don't have ED25519 keys, then on any linux machine execute:
ssh-keygen -t ed25519 # accept the default paths and set non-empty passphrase
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys # to authorize the keys on any linux machine in the CAMK network
The frontend can be used to:
- edit and compile your codes,
- submit jobs and monitor their status,
- run short, light tasks within the limits described below.
The frontend is equipped with 40 cpu cores and 128 GB of memory. Please read the messages displayed after login - they contain current, important announcements. Do not run long-running or memory-hungry codes on the frontend. There are certain limits set on the frontend, like 4 GB RAM per user and 3 h cpu time per process (see `ulimit -a`), and the system will kill processes which violate them. Please remember that this machine is shared by many users - be kind to others.
There is a special high performance cluster filesystem (BeeGFS) attached to the cluster:

/work/chuck - 837 TB volume (a previous-day mirror backup exists)

It should be used for all data-intensive activities on the cluster because it is much faster than the other 'works' at CAMK. It is also visible to all workstations (at lower throughput).
Please note that performance has priority over data safety on /work/chuck! Use it to store simulation/analysis results, but not for the only copy of codes, papers, etc.
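For instance, to check the overall volume and the size of your own directory there (standard linux tools; `<user>` is your login):

df -h /work/chuck
du -sh /work/chuck/<user>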
Please take into account:
Aside from the standard linux packages there is some additional software available in a few ways:

- Software collections: `scl list-collections` lists the installed collections, `scl enable gcc-toolset-13 bash` starts a shell with a given collection enabled (see `man scl` for other options).
- Environment modules (see `man module` for details): `module av` lists available modules, `module add <module name>` loads a specific module, `module purge` removes all modules from the current shell.
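A minimal sketch of a session using both mechanisms (the collection name below is an example - check `scl list-collections` and `module av` for what is actually installed):

scl enable gcc-toolset-13 bash   # subshell with a newer gcc toolchain
gcc --version                    # now reports the toolset compiler
module av                        # list available modules
module add <module name>         # load one of them
module purge                     # unload all modules again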
Important remarks:
In the simplest words, the cluster consists of the frontend, which is used to submit jobs, and computational nodes on which the jobs run. The software facilitating job and resource management is called a queueing system; chuck uses the SLURM queueing system. Its role is to queue jobs, manage resources and schedule jobs for execution. The basic principle is that the user's job gets exclusive access to the requested resources. This means that users have to specify e.g. how many cpus, how much memory and how much time they need, and the scheduler will then decide when and where to run the job. There are many rules and factors determining scheduling, but in general they should assure equal access to the available computational resources. Jobs are submitted to partitions, which already provide coarse-grained limits on jobs. The currently existing partitions are:
Name | Max. mem | Default mem | Max. time | Default time | Notes |
---|---|---|---|---|---|
short | 8 GB | 1 GB | 2 days | 2 days | only for serial jobs (1 CPU), default partition |
long | 3 GB | 1 GB | 14 days | 7 days | |
bigmem | 60 GB | 4 GB | 7 days | 7 days | max. 126 GB used by all running jobs of a given user |
para | NONE | 1 GB | 7 days | 7 days | only for parallel jobs (>1 CPU, can use multiple nodes) |
gpu | NONE | 8 GB | 7 days | 7 days | only for jobs using gpus |
interactive | 3 GB | 2 GB | 2 days | 2 days | for interactive, serial jobs (e.g. compilation/debugging); interactive jobs can, however, be started on all other partitions |
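For example, an interactive shell on a compute node can be requested with standard SLURM syntax (a sketch - adjust the resources to your needs):

srun -p interactive --mem=2GB --time=02:00:00 --pty bash -l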
You can get more parameters of a partition using the command `scontrol show partition <partition name>`.
There are also certain global limits, like max. 256 cpu cores in total per user for all running jobs (32 cores for guests; this does not apply to jobs waiting in the queue).
The standard workflow on the cluster consists of compiling the code on the frontend and then submitting a job.
Warning! Be careful when compiling on chuck (the frontend) with the option -march=native. The frontend is newer than most of the nodes, and code compiled this way will fail on them. To be able to use all nodes, use -march=sandybridge. You can optimize for newer architectures (see the hardware inventory table), e.g. -march=broadwell, but then you have to request appropriate nodes in the job script using the option '-C broadwell'. If you use neither -march nor -mtune, the code will run everywhere.
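For illustration, a compile line that runs on all nodes might look like this (gcc and the file name are placeholders for your own toolchain and code):

gcc -O2 -march=sandybridge my_code.c -o my_code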
Here is an example job script:
#! /bin/bash -l
## job name
#SBATCH -J testjob
## number of nodes
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1GB
#SBATCH --time=01:00:00
## partition to use
#SBATCH -p short
#SBATCH --output="stdout.txt"
#SBATCH --error="stderr.txt"

## commands/steps to execute
# go to the submission directory
cd $SLURM_SUBMIT_DIR
hostname > out.txt
my_code >> out.txt
and it is submitted using the command `sbatch <script_name>`. Please note that all the `#SBATCH` directives can also be passed directly to `sbatch` as options (see `man sbatch`).
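As a further sketch, a script for a parallel (MPI) job on the para partition could look like the following - the node and task counts as well as the program name are placeholders:

#! /bin/bash -l
#SBATCH -J paratest
#SBATCH -N 2                    ## two nodes
#SBATCH --ntasks-per-node=16    ## 16 tasks (MPI ranks) per node
#SBATCH --mem-per-cpu=1GB
#SBATCH --time=01:00:00
#SBATCH -p para                 ## parallel partition

cd $SLURM_SUBMIT_DIR
srun ./my_mpi_code              ## srun launches all 32 ranks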
After submission you will get a job ID - an important number used as a parameter for many other SLURM commands, e.g. to cancel a job. Please give this number when asking for support.
Some other important SLURM commands (for details see relevant man pages):
- `sinfo` - list available partitions and their status
- `squeue` - list all jobs; `squeue -u <username>` will list only the jobs of a given user
- `scancel <job id>` - remove a job
- `sacct` - accounting info about completed and running jobs
- `scontrol show partition` - details of partitions
- `sshare -la` - fairshare records per account and per user
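For example, to inspect the resources actually used by a finished job (the field list is illustrative; see `man sacct` for all available fields):

sacct -j <job id> --format=JobID,JobName,State,Elapsed,MaxRSS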
Email: cluster@camk.edu.pl
Please provide the job ID and job script location when applicable.