Actually, I just realized that in my testing I was not syncing the slurm.conf file to all the worker nodes, only to the server running slurmctld and to the worker that has the GPU on it. Once I synced the slurm.conf containing the gres information to all the nodes, the problem went away.
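For anyone who hits the same error, here is a rough sketch of one way to spot nodes whose slurm.conf has drifted from the controller's copy. It is only an illustration: the function names (`config_digest`, `find_out_of_sync`), the node names, and the file contents are all made up, and collecting each node's slurm.conf (for example over ssh) is left out.

```python
import hashlib


def config_digest(text: str) -> str:
    """Return the SHA-256 hex digest of a config file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_out_of_sync(configs: dict[str, str], reference: str) -> list[str]:
    """List the nodes whose slurm.conf differs from the reference node's copy."""
    ref_digest = config_digest(configs[reference])
    return sorted(
        node
        for node, text in configs.items()
        if config_digest(text) != ref_digest
    )


if __name__ == "__main__":
    # Hypothetical contents; in practice each string would be read from
    # that node's own copy of slurm.conf.
    with_gres = "GresTypes=gpu\nNodeName=gpu-node CPUs=24 Gres=gpu:1\n"
    without_gres = "NodeName=gpu-node CPUs=24\n"
    configs = {
        "controller": with_gres,
        "gpu-node": with_gres,
        "worker-1": without_gres,
        "worker-2": without_gres,
    }
    # Prints the nodes that still have the old file without the gres lines.
    print(find_out_of_sync(configs, reference="controller"))
```

With the made-up data above, this reports `worker-1` and `worker-2`, which mirrors my situation: only the controller and the GPU node had the updated file.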
Thanks for the responses.

On Fri, Jan 22, 2016 at 10:13 AM, John Desantis <[email protected]> wrote:
> Chris,
>
> Could you enable the Gres debugging via the DebugFlags and post the relevant
> output?
>
> It would be interesting to see what the logs state concerning what Gres
> types have been found on the node in question.
>
> John DeSantis
>
>
> 2016-01-22 12:31 GMT-05:00 Chris Paciorek <[email protected]>:
>>
>> Hi John, we have one gres.conf per node on the GPU node. It's a
>> one-line file containing this line:
>>
>> Name=gpu File=/dev/nvidia0
>>
>> On Fri, Jan 22, 2016 at 6:42 AM, John Desantis <[email protected]> wrote:
>> > Chris,
>> >
>> > Ok, at least we got the obvious out of the way!
>> >
>> > What does your gres.conf look like? Do you have one per node on the GPU
>> > enabled nodes, or a single system wide gres.conf?
>> >
>> > Here is an example of what we're using (some unneeded content removed;
>> > sanitized hostnames):
>> >
>> > # slurm.conf
>> > GresTypes=gpu
>> > NodeName=racka-[1-8] CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48258 Feature="..." Gres=gpu:1 Weight=1000
>> > NodeName=rackb-[1-10,19-28] CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32076 Feature="..." Gres=gpu:2 Weight=1000
>> > NodeName=rackb-29 CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32073 Feature="...." Gres=gpu:2 Weight=1000
>> >
>> > # gres.conf
>> > NodeName=racka-[1-8] Name=gpu File=/dev/nvidia0
>> > NodeName=rackb-[1-10,19-29] Name=gpu File=/dev/nvidia[0-1]
>> >
>> > John DeSantis
>> >
>> >
>> > 2016-01-21 23:21 GMT-05:00 Chris Paciorek <[email protected]>:
>> >>
>> >> Whoops, there was a bug in my posting - I actually was using --pty.
>> >> The invocation leading to the error message is:
>> >>
>> >> srun --gres=gpu:1 --pty /bin/bash
>> >>
>> >> On Thu, Jan 21, 2016 at 3:23 PM, John Desantis <[email protected]> wrote:
>> >> > Chris,
>> >> >
>> >> > Try using "--pty /bin/bash" to get a shell, and see if that helps.
>> >> >
>> >> > John DeSantis
>> >> >
>> >> > On Jan 21, 2016 5:47 PM, "Chris Paciorek" <[email protected]> wrote:
>> >> >>
>> >> >> We've been trying out the use of gres to control access to our GPU.
>> >> >> It works fine for a batch submission but when submitting via srun to
>> >> >> get an interactive session we get the following error:
>> >> >>
>> >> >> paciorek@machine:~/> srun --gres=gpu:1 /bin/bash
>> >> >> srun: error: gres_plugin_job_state_unpack: no plugin configured to unpack data type 7696487 from job 10884
>> >> >> srun: gres_plugin_step_state_unpack: no plugin configured to unpack data type 7696487 from step 10884.0
>> >> >> srun: error: Task launch for 10884.0 failed on node scf-sm20: Invalid job credential
>> >> >> srun: error: Application launch failed: Invalid job credential
>> >> >> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> >> >> srun: error: Timed out waiting for job step to complete
>> >> >>
>> >> >> We're running on set of Ubuntu 14.04 machines, with SLURM v. 2.6.5
>> >> >> (i.e., the slurm-llnl 2.6.5-1 Ubuntu package that is the latest for
>> >> >> 14.04).
>> >> >>
>> >> >> We set up gconf in the way suggested in the SLURM documentation
>> >> >> (here are the relevant lines from slurm.conf):
>> >> >> GresTypes=gpu
>> >> >> NodeName=our_gpu_nodename CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128908 TmpDisk=469325 Gres=gpu:1
>> >> >>
>> >> >> Any ideas?
>> >> >>
>> >> >> Thanks,
>> >> >> Chris
>> >> >>
>> >> >> ----------------------------------------------------------------------------------------------
>> >> >> Chris Paciorek
>> >> >>
>> >> >> Statistical Computing Consultant
>> >> >> Statistical Computing Facility, Econometrics Laboratory, Berkeley
>> >> >> Research Computing
>> >> >>
>> >> >> Office: 495 Evans Hall                Email: [email protected]
>> >> >> Mailing Address:                      Voice: 510-842-6670
>> >> >> Department of Statistics              Fax: 510-642-7892
>> >> >> 367 Evans Hall                        Skype: cjpaciorek
>> >> >> University of California, Berkeley    WWW: www.stat.berkeley.edu/~paciorek
>> >> >> Berkeley, CA 94720 USA                Permanent forward: [email protected]
