Chris,

First, I’m not sure how well GRES works in the 2.6.x series, and I’d encourage an upgrade to a later code base. I think there may have been some GRES issues in the 2.6 series, but I’m not certain. I know we’ve seen a few small bugs in 14.11 that have since been fixed. With that, I’ll mention how we handle GPUs on two of our systems.

We’ve redefined the default salloc command to handle interactive sessions, partly for this reason. Our SallocDefaultCommand in slurm.conf looks like the following:

SallocDefaultCommand="srun --mpi=none -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=gpu:0 $SHELL"

Then, when you would like to use GPUs through salloc, request them as:

$ salloc --gres=gpu:1

If the application doesn’t require srun to start up, it should launch as usual. If you use srun (to launch steps / parallel jobs), it will look like it hangs, when really srun is blocking because the GPU is already in use by another job step (i.e., the original srun command).

So basically, during the salloc the user requests resources (nodes, CPUs, GPUs, etc.) and those are used to build the allocation, but the interactive shell is actually spawned without using any of the resources.

Hopefully this gives you some insight.

Best,

Jared

From: John Desantis [mailto:[email protected]]
Sent: Friday, January 22, 2016 7:43 AM
To: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: problem using srun to start an interactive job with GPU gres

Chris,

Ok, at least we got the obvious out of the way!

What does your gres.conf look like? Do you have one per node on the GPU-enabled nodes, or a single system-wide gres.conf?

Here is an example of what we're using (some unneeded content removed; sanitized hostnames):

# slurm.conf
GresTypes=gpu
NodeName=racka-[1-8] CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48258 Feature="..." Gres=gpu:1 Weight=1000
NodeName=rackb-[1-10,19-28] CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32076 Feature="..." Gres=gpu:2 Weight=1000
NodeName=rackb-29 CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32073 Feature="...." Gres=gpu:2 Weight=1000

# gres.conf
NodeName=racka-[1-8] Name=gpu File=/dev/nvidia0
NodeName=rackb-[1-10,19-29] Name=gpu File=/dev/nvidia[0-1]

John DeSantis

2016-01-21 23:21 GMT-05:00 Chris Paciorek <[email protected]>:

Whoops, there was a bug in my posting - I actually was using --pty. The invocation leading to the error message is:

srun --gres=gpu:1 --pty /bin/bash

On Thu, Jan 21, 2016 at 3:23 PM, John Desantis <[email protected]> wrote:
> Chris,
>
> Try using "--pty /bin/bash" to get a shell, and see if that helps.
>
> John DeSantis
>
> On Jan 21, 2016 5:47 PM, "Chris Paciorek" <[email protected]> wrote:
>>
>>
>> We've been trying out the use of gres to control access to our GPU. It
>> works fine for a batch submission, but when submitting via srun to get
>> an interactive session we get the following error:
>>
>> paciorek@machine:~/> srun --gres=gpu:1 /bin/bash
>> srun: error: gres_plugin_job_state_unpack: no plugin configured to unpack data type 7696487 from job 10884
>> srun: gres_plugin_step_state_unpack: no plugin configured to unpack data type 7696487 from step 10884.0
>> srun: error: Task launch for 10884.0 failed on node scf-sm20: Invalid job credential
>> srun: error: Application launch failed: Invalid job credential
>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> We're running on a set of Ubuntu 14.04 machines, with SLURM v. 2.6.5
>> (i.e., the slurm-llnl 2.6.5-1 Ubuntu package that is the latest for
>> 14.04).
>>
>> We set up gres in the way suggested in the SLURM documentation (here
>> are the relevant lines from slurm.conf):
>> GresTypes=gpu
>> NodeName=our_gpu_nodename CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128908 TmpDisk=469325 Gres=gpu:1
>>
>> Any ideas?
>>
>> Thanks,
>> Chris
>>
>>
>> ----------------------------------------------------------------------------------------------
>> Chris Paciorek
>>
>> Statistical Computing Consultant
>> Statistical Computing Facility, Econometrics Laboratory, Berkeley Research Computing
>>
>> Office: 495 Evans Hall                  Email: [email protected]
>> Mailing Address:                        Voice: 510-842-6670
>> Department of Statistics                Fax: 510-642-7892
>> 367 Evans Hall                          Skype: cjpaciorek
>> University of California, Berkeley      WWW: www.stat.berkeley.edu/~paciorek
>> Berkeley, CA 94720 USA                  Permanent forward: [email protected]
