Chris,

Ok, at least we got the obvious out of the way!

What does your gres.conf look like?  Do you have one per node on the
GPU-enabled nodes, or a single system-wide gres.conf?

Here is an example of what we're using (some unneeded content removed;
sanitized hostnames):

# slurm.conf
GresTypes=gpu
NodeName=racka-[1-8] CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48258 Feature="..." Gres=gpu:1 Weight=1000
NodeName=rackb-[1-10,19-28] CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32076 Feature="..." Gres=gpu:2 Weight=1000
NodeName=rackb-29 CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32073 Feature="...." Gres=gpu:2 Weight=1000

# gres.conf
NodeName=racka-[1-8] Name=gpu File=/dev/nvidia0
NodeName=rackb-[1-10,19-29] Name=gpu File=/dev/nvidia[0-1]
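
One thing worth checking is that the GPU count declared with Gres= in
slurm.conf matches the number of device files listed in gres.conf for every
node. Below is a rough Python sketch of that check. It is not Slurm's own
parser: the helper names (expand, fields, mismatches, ...) are made up for
illustration, it assumes a system-wide gres.conf with NodeName= on each line,
one entry per line, and only simple [a-b,c-d] ranges without zero padding.
The sample data is the config above, trimmed to the fields the check reads.

```python
import re

SLURM_CONF = """\
GresTypes=gpu
NodeName=racka-[1-8] CPUs=12 Gres=gpu:1 Weight=1000
NodeName=rackb-[1-10,19-28] CPUs=16 Gres=gpu:2 Weight=1000
NodeName=rackb-29 CPUs=16 Gres=gpu:2 Weight=1000
"""

GRES_CONF = """\
NodeName=racka-[1-8] Name=gpu File=/dev/nvidia0
NodeName=rackb-[1-10,19-29] Name=gpu File=/dev/nvidia[0-1]
"""


def expand(expr):
    """Expand 'rackb-[1-10,19-29]' into individual names (one [...] group only)."""
    m = re.fullmatch(r"(.*)\[([^\]]+)\](.*)", expr)
    if not m:
        return [expr]
    prefix, body, suffix = m.groups()
    names = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        for i in range(int(lo), int(hi or lo) + 1):
            names.append(f"{prefix}{i}{suffix}")
    return names


def fields(line):
    """Split 'Key=Value Key=Value' into a dict (no quoting support)."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def expected_gpus(slurm_conf):
    """Node name -> GPU count declared via Gres=gpu[:type]:N in slurm.conf."""
    counts = {}
    for line in slurm_conf.splitlines():
        f = fields(line)
        if "NodeName" not in f or "Gres" not in f:
            continue
        for gres in f["Gres"].split(","):
            parts = gres.split(":")
            if parts[0] == "gpu":
                n = int(parts[-1]) if len(parts) > 1 else 1
                for node in expand(f["NodeName"]):
                    counts[node] = n
    return counts


def configured_gpus(gres_conf):
    """Node name -> number of GPU device files listed in gres.conf."""
    counts = {}
    for line in gres_conf.splitlines():
        f = fields(line)
        if f.get("Name") != "gpu" or "File" not in f or "NodeName" not in f:
            continue
        ndev = len(expand(f["File"]))
        for node in expand(f["NodeName"]):
            counts[node] = counts.get(node, 0) + ndev
    return counts


def mismatches(slurm_conf, gres_conf):
    """Nodes whose (slurm.conf count, gres.conf count) differ."""
    want = expected_gpus(slurm_conf)
    have = configured_gpus(gres_conf)
    return {
        n: (want.get(n, 0), have.get(n, 0))
        for n in sorted(set(want) | set(have))
        if want.get(n, 0) != have.get(n, 0)
    }


if __name__ == "__main__":
    bad = mismatches(SLURM_CONF, GRES_CONF)
    for node, (want, have) in bad.items():
        print(f"{node}: slurm.conf says {want} GPU(s), gres.conf lists {have}")
    if not bad:
        print("slurm.conf and gres.conf agree on GPU counts")
```

To try it against a real setup, read the two files into strings and pass
them to mismatches(); an empty result means the counts line up.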

John DeSantis


2016-01-21 23:21 GMT-05:00 Chris Paciorek <[email protected]>:

>
> Whoops, there was a bug in my posting - I actually was using --pty.
> The invocation leading to the error message is:
>
> srun --gres=gpu:1 --pty /bin/bash
>
> On Thu, Jan 21, 2016 at 3:23 PM, John Desantis <[email protected]>
> wrote:
> > Chris,
> >
> > Try using "--pty /bin/bash" to get a shell, and see if that helps.
> >
> > John DeSantis
> >
> > On Jan 21, 2016 5:47 PM, "Chris Paciorek" <[email protected]> wrote:
> >>
> >>
> >> We've been trying out the use of gres to control access to our GPU. It
> >> works fine for a batch submission but when submitting via srun to get
> >> an interactive session we get the following error:
> >>
> >> paciorek@machine:~/> srun --gres=gpu:1 /bin/bash
> >> srun: error: gres_plugin_job_state_unpack: no plugin configured to
> >> unpack data type 7696487 from job 10884
> >> srun: gres_plugin_step_state_unpack: no plugin configured to unpack
> >> data type 7696487 from step 10884.0
> >> srun: error: Task launch for 10884.0 failed on node scf-sm20: Invalid
> >> job credential
> >> srun: error: Application launch failed: Invalid job credential
> >> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> >> srun: error: Timed out waiting for job step to complete
> >>
> >> We're running on a set of Ubuntu 14.04 machines, with SLURM v. 2.6.5
> >> (i.e., the slurm-llnl 2.6.5-1 Ubuntu package that is the latest for
> >> 14.04).
> >>
> >> We set up gres in the way suggested in the SLURM documentation (here
> >> are the relevant lines from slurm.conf):
> >> GresTypes=gpu
> >> NodeName=our_gpu_nodename CPUs=24 SocketsPerBoard=2 CoresPerSocket=6
> >> ThreadsPerCore=2 RealMemory=128908 TmpDisk=469325 Gres=gpu:1
> >>
> >> Any ideas?
> >>
> >> Thanks,
> >> Chris
> >>
> >>
> >>
> >> ----------------------------------------------------------------------
> >> Chris Paciorek
> >>
> >> Statistical Computing Consultant
> >> Statistical Computing Facility, Econometrics Laboratory, Berkeley
> >> Research Computing
> >>
> >> Office: 495 Evans Hall
> >> Email: [email protected]
> >> Voice: 510-842-6670
> >> Fax: 510-642-7892
> >> Skype: cjpaciorek
> >> WWW: www.stat.berkeley.edu/~paciorek
> >> Permanent forward: [email protected]
> >> Mailing Address:
> >> Department of Statistics
> >> 367 Evans Hall
> >> University of California, Berkeley
> >> Berkeley, CA 94720 USA
>
