official documentation for details and tutorials.

Status commands
sinfo [-p partition]
        List all partitions you’re allowed to use, or a specific partition if you’re allowed there, with information about which hosts are up and which are in use.

squeue [-p partition]
        List all current sessions (whether batch jobs started with sbatch or interactive sessions via srun or salloc) on any partition you’re allowed to use, or on the named partition. The listing includes the user responsible, the real time the session has been running, and which nodes it is using. If a session is waiting for resources before running, the node list will say (Resources).

Running batch jobs
This is the best way to run long computations.

sbatch -p partition [-N nodecount] [-c ncpus] [--gres gpu] [-o outfile] [-e errfile] [script]
        Queue a batch job, running the named script (default: the contents of standard input). The options are:

        -N nodecount    Allocate nodecount nodes (computers) to this job. The default is 1; that is usually also the maximum.
        -c ncpus        Allocate ncpus CPU cores to this job; default 1.
        --gres gpu      Allocate a GPU for this job; default none. No other job will be allowed to use that GPU while yours is running.
        -o outfile      Store standard output in outfile; default ./slurm-nnn.out, where nnn is the job number reported when the job is queued.
        -e errfile      Store standard error in errfile; default the same place as standard output.

If enough resources (e.g. enough nodes) are available, the job will be started right away; otherwise it will wait.
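For example, a small batch run might be queued like this; the partition name, the script name, and the program it runs are all illustrative:

$ cat myjob.sh
#!/bin/sh
# Placeholder computation: replace with whatever your job actually runs.
./my_program input.dat
$ sbatch -p coral -c 4 myjob.sh
Submitted batch job 123
$

Since no -o option was given, standard output from the run ends up in ./slurm-123.out, 123 being the job number sbatch reported; squeue will show the job while it is queued or running.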
Beware that your job will be restricted to a single CPU core on its host unless you use the -c option, and a GPU will be available only if you use --gres gpu. If a system has two GPUs, two jobs may run at the same time, each using one; arrangements are made to point your CUDA code at the GPU allocated for your job.

Running commands interactively
To avoid tying up nodes when others have work to do, interactive commands should be used only for short tests and debugging, not for long computations. An interactive session may be automatically interrupted after a certain amount of real time (ten minutes, say).
If enough resources (e.g. enough nodes) are available, the interactive session will be started right away; otherwise it will print a message like
        srun: job 208 queued and waiting for resources
and wait. Hit the interrupt key to give up on waiting.

srun -p partition [-N nodecount] [-c ncpus] [--gres gpu] [-o outfile] [-e errfile] [--pty] [-l] [-I] command
        Allocate nodes, cores, and GPUs as for sbatch from partition, and run command on each. Give each node a copy of srun’s standard input; send standard output and error to those of srun unless redirected with the -o and -e options. Outfile and errfile must be in directories accessible from any Teaching Labs system, not just the host where srun is called; in particular it won’t work to use system directories like /tmp.

If -l is given, label each line of standard output or error with a decimal task number (different on each node) followed by a colon.

Normally, if there aren’t enough nodes, GPUs, or cores, srun will wait until enough are available. If -I is given, srun will give up immediately instead.

For example:
$ srun -p coral -N 3 -l hostname
0: coral01
1: coral02
2: coral03
$
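The --pty option gives the command a pseudo-terminal, which is what you want when that command is an interactive shell. A short debugging session with two cores and a GPU might be started like this (the partition name and choice of shell are illustrative, and --gres gpu is needed only if you actually want the GPU):

$ srun -p coral -c 2 --gres gpu --pty bash

Exit the shell as soon as you are done so the cores and GPU are released for other users.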
salloc -p partition [-N nodecount] [-c ncpus] [--gres gpu] [-I] [command]
        Allocate nodes, cores, and GPUs as for sbatch from partition, then run command (default: your login shell) with environment variables describing the allocation. Calling srun under the control of salloc will run tasks within the allocation:

        - srun -N nodecount command runs command on each of nodecount hosts chosen from within the allocation;
        - srun command runs it on every allocated host.

Option -I is like that of srun: don’t wait if too few hosts are available.

For example:
$ salloc -N 5 -p prawn
salloc: Granted job allocation 192
$ srun hostname
prawn05
prawn02
prawn04
prawn03
prawn01
$ srun -N 3 hostname
prawn02
prawn01
prawn03
$ exit
salloc: Relinquishing job allocation 192
$
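While any of these jobs or sessions is queued or running, the status commands described above show where things stand; the partition name is again illustrative:

$ sinfo -p prawn
$ squeue -p prawn

sinfo shows which prawn hosts are idle or allocated, and squeue lists each session with its owner, elapsed time, and node list, or (Resources) if it is still waiting.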