Chris,

Ok, at least we got the obvious out of the way!

What does your gres.conf look like? Do you have one per node on the GPU enabled nodes, or a single system wide gres.conf?

Here is an example of what we're using (some unneeded content removed; sanitized hostnames):

# slurm.conf
GresTypes=gpu
NodeName=racka-[1-8] CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48258 Feature="..." Gres=gpu:1 Weight=1000
NodeName=rackb-[1-10,19-28] CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32076 Feature="..." Gres=gpu:2 Weight=1000
NodeName=rackb-29 CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32073 Feature="...." Gres=gpu:2 Weight=1000

# gres.conf
NodeName=racka-[1-8] Name=gpu File=/dev/nvidia0
NodeName=rackb-[1-10,19-29] Name=gpu File=/dev/nvidia[0-1]

John DeSantis

2016-01-21 23:21 GMT-05:00 Chris Paciorek <[email protected]>:
>
> Whoops, there was a bug in my posting - I actually was using --pty.
> The invocation leading to the error message is:
>
> srun --gres=gpu:1 --pty /bin/bash
>
> On Thu, Jan 21, 2016 at 3:23 PM, John Desantis <[email protected]>
> wrote:
> > Chris,
> >
> > Try using "--pty /bin/bash" to get a shell, and see if that helps.
> >
> > John DeSantis
> >
> > On Jan 21, 2016 5:47 PM, "Chris Paciorek" <[email protected]> wrote:
> >>
> >>
> >> We've been trying out the use of gres to control access to our GPU. It
> >> works fine for a batch submission but when submitting via srun to get
> >> an interactive session we get the following error:
> >>
> >> paciorek@machine:~/> srun --gres=gpu:1 /bin/bash
> >> srun: error: gres_plugin_job_state_unpack: no plugin configured to
> >> unpack data type 7696487 from job 10884
> >> srun: gres_plugin_step_state_unpack: no plugin configured to unpack
> >> data type 7696487 from step 10884.0
> >> srun: error: Task launch for 10884.0 failed on node scf-sm20: Invalid
> >> job credential
> >> srun: error: Application launch failed: Invalid job credential
> >> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> >> srun: error: Timed out waiting for job step to complete
> >>
> >> We're running on set of Ubuntu 14.04 machines, with SLURM v. 2.6.5
> >> (i.e., the slurm-llnl 2.6.5-1 Ubuntu package that is the latest for
> >> 14.04).
> >>
> >> We set up gconf in the way suggested in the SLURM documentation (here
> >> are the relevant lines from slurm.conf):
> >> GresTypes=gpu
> >> NodeName=our_gpu_nodename CPUs=24 SocketsPerBoard=2 CoresPerSocket=6
> >> ThreadsPerCore=2 RealMemory=128908 TmpDisk=469325 Gres=gpu:1
> >>
> >> Any ideas?
> >>
> >> Thanks,
> >> Chris
> >>
> >>
> >>
> >> ----------------------------------------------------------------------------------------------
> >> Chris Paciorek
> >>
> >> Statistical Computing Consultant
> >> Statistical Computing Facility, Econometrics Laboratory, Berkeley
> >> Research Computing
> >>
> >> Office: 495 Evans Hall                 Email: [email protected]
> >> Mailing Address:                       Voice: 510-842-6670
> >> Department of Statistics               Fax: 510-642-7892
> >> 367 Evans Hall                         Skype: cjpaciorek
> >> University of California, Berkeley     WWW: www.stat.berkeley.edu/~paciorek
> >> Berkeley, CA 94720 USA                 Permanent forward: [email protected]
>
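
For reference, here is a minimal sketch of what a per-node gres.conf might look like for the node in the original post. It is an untested example resting on two assumptions: that the single GPU shows up as /dev/nvidia0 on our_gpu_nodename, and that the file sits in the same directory as slurm.conf on that node. It leaves out the NodeName= prefix used in the system-wide example above, because an older release such as 2.6.5 may not accept it (not verified here).

```
# gres.conf on our_gpu_nodename (hypothetical; the device path is an assumption)
# One line per GPU; the number of devices should match Gres=gpu:1 in slurm.conf.
Name=gpu File=/dev/nvidia0
```

After a change like this, slurmd on the node and slurmctld would typically need a restart before the new GRES definition is picked up.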
