Chris,

Could you enable the Gres debugging via the DebugFlags and post the relevant output?
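
For reference, a minimal sketch of what that could look like in slurm.conf. It assumes the installed version accepts the Gres debug flag, which is worth checking against the slurm.conf man page for 2.6.5. Only the DebugFlags line is new; the GresTypes line is the one already quoted in this thread.

```
# slurm.conf (sketch, not a complete file)
GresTypes=gpu
# Assumption: "Gres" is accepted as a DebugFlags value by this Slurm version
DebugFlags=Gres
```

The daemons have to re-read the configuration afterwards (for example with `scontrol reconfigure`, or by restarting slurmctld and the slurmd on the GPU node). The Gres messages should then appear in the slurmctld and slurmd logs.
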
It would be interesting to see what the logs state concerning what Gres types have been found on the node in question.

John DeSantis

2016-01-22 12:31 GMT-05:00 Chris Paciorek <[email protected]>:
>
> Hi John, we have one gres.conf per node on the GPU node. It's a
> one-line file containing this line:
>
> Name=gpu File=/dev/nvidia0
>
> On Fri, Jan 22, 2016 at 6:42 AM, John Desantis <[email protected]> wrote:
> > Chris,
> >
> > Ok, at least we got the obvious out of the way!
> >
> > What does your gres.conf look like? Do you have one per node on the GPU
> > enabled nodes, or a single system wide gres.conf?
> >
> > Here is an example of what we're using (some unneeded content removed;
> > sanitized hostnames):
> >
> > # slurm.conf
> > GresTypes=gpu
> > NodeName=racka-[1-8] CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48258
> > Feature="..." Gres=gpu:1 Weight=1000
> > NodeName=rackb-[1-10,19-28] CPUs=16 CoresPerSocket=8 Sockets=2
> > RealMemory=32076 Feature="..." Gres=gpu:2 Weight=1000
> > NodeName=rackb-29 CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32073
> > Feature="...." Gres=gpu:2 Weight=1000
> >
> > # gres.conf
> > NodeName=racka-[1-8] Name=gpu File=/dev/nvidia0
> > NodeName=rackb-[1-10,19-29] Name=gpu File=/dev/nvidia[0-1]
> >
> > John DeSantis
> >
> >
> > 2016-01-21 23:21 GMT-05:00 Chris Paciorek <[email protected]>:
> >>
> >>
> >> Whoops, there was a bug in my posting - I actually was using --pty.
> >> The invocation leading to the error message is:
> >>
> >> srun --gres=gpu:1 --pty /bin/bash
> >>
> >> On Thu, Jan 21, 2016 at 3:23 PM, John Desantis <[email protected]>
> >> wrote:
> >> > Chris,
> >> >
> >> > Try using "--pty /bin/bash" to get a shell, and see if that helps.
> >> >
> >> > John DeSantis
> >> >
> >> > On Jan 21, 2016 5:47 PM, "Chris Paciorek" <[email protected]> wrote:
> >> >>
> >> >>
> >> >> We've been trying out the use of gres to control access to our GPU. It
> >> >> works fine for a batch submission but when submitting via srun to get
> >> >> an interactive session we get the following error:
> >> >>
> >> >> paciorek@machine:~/> srun --gres=gpu:1 /bin/bash
> >> >> srun: error: gres_plugin_job_state_unpack: no plugin configured to
> >> >> unpack data type 7696487 from job 10884
> >> >> srun: gres_plugin_step_state_unpack: no plugin configured to unpack
> >> >> data type 7696487 from step 10884.0
> >> >> srun: error: Task launch for 10884.0 failed on node scf-sm20: Invalid
> >> >> job credential
> >> >> srun: error: Application launch failed: Invalid job credential
> >> >> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> >> >> srun: error: Timed out waiting for job step to complete
> >> >>
> >> >> We're running on set of Ubuntu 14.04 machines, with SLURM v. 2.6.5
> >> >> (i.e., the slurm-llnl 2.6.5-1 Ubuntu package that is the latest for
> >> >> 14.04).
> >> >>
> >> >> We set up gconf in the way suggested in the SLURM documentation (here
> >> >> are the relevant lines from slurm.conf):
> >> >> GresTypes=gpu
> >> >> NodeName=our_gpu_nodename CPUs=24 SocketsPerBoard=2 CoresPerSocket=6
> >> >> ThreadsPerCore=2 RealMemory=128908 TmpDisk=469325 Gres=gpu:1
> >> >>
> >> >> Any ideas?
> >> >>
> >> >> Thanks,
> >> >> Chris
> >> >>
> >> >>
> >> >>
> >> >> ----------------------------------------------------------------------------------------------
> >> >> Chris Paciorek
> >> >>
> >> >> Statistical Computing Consultant
> >> >> Statistical Computing Facility, Econometrics Laboratory, Berkeley
> >> >> Research Computing
> >> >>
> >> >> Office: 495 Evans Hall                 Email:
> >> >> [email protected]
> >> >> Mailing Address:                       Voice: 510-842-6670
> >> >> Department of Statistics               Fax: 510-642-7892
> >> >> 367 Evans Hall                         Skype: cjpaciorek
> >> >> University of California, Berkeley     WWW:
> >> >> www.stat.berkeley.edu/~paciorek
> >> >> Berkeley, CA 94720 USA                 Permanent forward:
> >> >> [email protected]
