official documentation for details and tutorials.

Status commands
sinfo [-p partition]
        List all partitions you’re allowed to use, or a specific partition if you’re allowed there, with information about which hosts are up and which are in use.

squeue [-p partition]
        List all current sessions (whether batch jobs started with sbatch or interactive sessions via srun or salloc) on any partition you’re allowed to use, or on the named partition. The listing includes the user responsible, the real time the session has been running, and which nodes it is using. If a session is waiting for resources before running, the node list will say (Resources).

Running batch jobs
This is the best way to run long computations.

sbatch -p partition [-N nodecount] [-c ncpus] [--gres gpu] [-o outfile] [-e errfile] [script]
        Queue a batch job, running the named script (default: the contents of standard input). The options are:

        -N nodecount    Allocate nodecount nodes (computers) to this job. The default is 1; that is usually also the maximum.
        -c ncpus        Allocate ncpus CPU cores to this job; default 1.
        --gres gpu      Allocate a GPU for this job; default none. No other job will be allowed to use that GPU while yours is running.
        -o outfile      Store standard output in outfile; default ./slurm-nnn.out, where nnn is the job number reported when the job is queued.
        -e errfile      Store standard error in errfile; default the same place as standard output.

If enough resources (e.g. enough nodes) are available, the job will be started right away; otherwise it will wait.
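For example, a small batch run might be queued like this; the partition name, the script name, and the program it runs are all illustrative:

$ cat myjob.sh
#!/bin/sh
# Placeholder computation: replace with whatever your job actually runs.
./my_program input.dat
$ sbatch -p coral -c 4 myjob.sh
Submitted batch job 123
$

Since no -o option was given, standard output from the run ends up in ./slurm-123.out, 123 being the job number sbatch reported; squeue will show the job while it is queued or running.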
Beware that your job will be restricted to a single CPU core on its host unless you use the -c option, and a GPU will be available only if you use --gres gpu. If a system has two GPUs, two jobs may run at the same time, each using one; arrangements are made to point your CUDA code at the GPU allocated for your job.

Running commands interactively
To avoid tying up nodes when others have work to do, interactive commands should be used only for short tests and debugging, not for long computations. An interactive session may be automatically interrupted after a certain amount of real time (ten minutes, say).
If enough resources (e.g. enough nodes) are available, the interactive session will be started right away; otherwise it will print a message like
        srun: job 208 queued and waiting for resources
and wait. Hit the interrupt key to give up on waiting.

srun -p partition [-N nodecount] [-c ncpus] [--gres gpu] [-o outfile] [-e errfile] [--pty] [-l] [-I] command
        Allocate nodes, cores, and GPUs as for sbatch from partition, and run command on each. Give each node a copy of srun’s standard input; send standard output and error to those of srun unless redirected with the -o and -e options. Outfile and errfile must be in directories accessible from any Teaching Labs system, not just the host where srun is called; in particular it won’t work to use system directories like /tmp.

If -l is given, label each line of standard output or error with a decimal task number (different on each node) followed by a colon.

Normally, if there aren’t enough nodes, GPUs, or cores, srun will wait until enough are available. If -I is given, srun will give up immediately instead.

For example:
$ srun -p coral -N 3 -l hostname
0: coral01
1: coral02
2: coral03
$
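The --pty option gives the command a pseudo-terminal, which is what you want when that command is an interactive shell. A short debugging session with two cores and a GPU might be started like this (the partition name and choice of shell are illustrative, and --gres gpu is needed only if you actually want the GPU):

$ srun -p coral -c 2 --gres gpu --pty bash

Exit the shell as soon as you are done so the cores and GPU are released for other users.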
salloc -p partition [-N nodecount] [-c ncpus] [--gres gpu] [-I] [command]
        Allocate nodes, cores, and GPUs as for sbatch from partition, then run command (default: your login shell) with environment variables describing the allocation. Calling srun under the control of salloc will run tasks within the allocation:

        - srun -N nodecount command runs command on each of nodecount hosts chosen from within the allocation;
        - srun command runs it on every allocated host.

Option -I is like that of srun: don’t wait if too few hosts are available.

For example:
$ salloc -N 5 -p prawn
salloc: Granted job allocation 192
$ srun hostname
prawn05
prawn02
prawn04
prawn03
prawn01
$ srun -N 3 hostname
prawn02
prawn01
prawn03
$ exit
salloc: Relinquishing job allocation 192
$
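While any of these jobs or sessions is queued or running, the status commands described above show where things stand; the partition name is again illustrative:

$ sinfo -p prawn
$ squeue -p prawn

sinfo shows which prawn hosts are idle or allocated, and squeue lists each session with its owner, elapsed time, and node list, or (Resources) if it is still waiting.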