Chris,

First, I’m not sure how well GRES works in the 2.6.x series, and I’d 
encourage an upgrade to a later code base. I think there may have been some 
gres issues in the 2.6 series, but I’m not quite sure. I know we’ve seen a few 
small bugs in 14.11 that have since been fixed. With that said, here is how we 
handle GPUs on two of our systems.

We’ve redefined the default salloc command so that it handles interactive 
sessions, partly for this reason. Our SallocDefaultCommand in slurm.conf looks 
like the following:

SallocDefaultCommand="srun --mpi=none -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=gpu:0 $SHELL"

Then, when you would like to use GPUs through salloc, request them as:

$ salloc --gres=gpu:1

If the application doesn’t require srun to start up, it should launch 
normally. If you use srun (to launch steps / parallel jobs) and the GPU is 
already held by another job step (i.e., the original srun command), it will 
look like it hangs, but srun is really just blocking until the GPU is free. 
So what happens during the salloc is that the user requests resources 
(nodes, cpus, gpus, etc.), which are used to build the allocation, but the 
interactive shell is spawned without using any of those resources (hence 
--gres=gpu:0 and --mem-per-cpu=0 in the default command above).
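
As a rough illustration of that workflow, here is a minimal sketch of a wrapper for launching a GPU job step from inside the salloc shell. The function name run_gpu_step is hypothetical (it is not part of Slurm); the sketch assumes salloc has exported SLURM_JOB_ID into the shell and that the shell's own step was started with --gres=gpu:0 as in the SallocDefaultCommand above.

```shell
# run_gpu_step: hypothetical helper, not part of Slurm.
# Launches a command as a job step that requests one GPU, but only when the
# current shell is inside an allocation (salloc exports SLURM_JOB_ID).
run_gpu_step() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "run_gpu_step: no allocation found; start one with: salloc --gres=gpu:1" >&2
        return 1
    fi
    # Assumption: the interactive shell's step was started with --gres=gpu:0,
    # so the GPU in the allocation is still free for this step.
    srun --gres=gpu:1 "$@"
}
```

Inside the shell opened by "salloc --gres=gpu:1", something like "run_gpu_step nvidia-smi" should then start without blocking; outside an allocation it only prints a hint and returns non-zero.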

Hopefully this gives you some insight.

Best, Jared

From: John Desantis [mailto:[email protected]]
Sent: Friday, January 22, 2016 7:43 AM
To: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: problem using srun to start an interactive job with 
GPU gres

Chris,

Ok, at least we got the obvious out of the way!

What does your gres.conf look like?  Do you have one per node on the 
GPU-enabled nodes, or a single system-wide gres.conf?

Here is an example of what we're using (some unneeded content removed; 
sanitized hostnames):

# slurm.conf
GresTypes=gpu
NodeName=racka-[1-8] CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48258 Feature="..." Gres=gpu:1 Weight=1000
NodeName=rackb-[1-10,19-28] CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32076 Feature="..." Gres=gpu:2 Weight=1000
NodeName=rackb-29 CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32073 Feature="...." Gres=gpu:2 Weight=1000

# gres.conf
NodeName=racka-[1-8] Name=gpu File=/dev/nvidia0
NodeName=rackb-[1-10,19-29] Name=gpu File=/dev/nvidia[0-1]
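
In the example above the counts line up: Gres=gpu:1 goes with a single device file, and Gres=gpu:2 goes with /dev/nvidia[0-1]. A minimal sketch for sanity-checking that by hand follows; count_gres_devices is a hypothetical helper (not a Slurm tool), and it assumes the File= value is either a plain path or contains exactly one bracketed numeric range.

```shell
# count_gres_devices: hypothetical helper, not part of Slurm.
# Prints how many device files a gres.conf File= value names.
# Assumes a plain path (/dev/nvidia0) or a single range (/dev/nvidia[0-1]).
count_gres_devices() {
    spec=$1
    case $spec in
        *\[*-*\]*)
            range=${spec#*\[}     # strip everything up to "[", e.g. "0-1]"
            range=${range%%\]*}   # strip "]" and anything after, e.g. "0-1"
            lo=${range%-*}        # lower bound of the range
            hi=${range#*-}        # upper bound of the range
            echo $(( ${hi} - ${lo} + 1 ))
            ;;
        *)
            echo 1
            ;;
    esac
}
```

For example, count_gres_devices '/dev/nvidia[0-1]' can be compared against the N in Gres=gpu:N for the same nodes in slurm.conf.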

John DeSantis


2016-01-21 23:21 GMT-05:00 Chris Paciorek <[email protected]>:

Whoops, there was a bug in my posting - I actually was using --pty.
The invocation leading to the error message is:

srun --gres=gpu:1 --pty /bin/bash

On Thu, Jan 21, 2016 at 3:23 PM, John Desantis <[email protected]> wrote:
> Chris,
>
> Try using "--pty /bin/bash" to get a shell, and see if that helps.
>
> John DeSantis
>
> On Jan 21, 2016 5:47 PM, "Chris Paciorek" <[email protected]> wrote:
>>
>>
>> We've been trying out the use of gres to control access to our GPU. It
>> works fine for a batch submission but when submitting via srun to get
>> an interactive session we get the following error:
>>
>> paciorek@machine:~/> srun --gres=gpu:1 /bin/bash
>> srun: error: gres_plugin_job_state_unpack: no plugin configured to
>> unpack data type 7696487 from job 10884
>> srun: gres_plugin_step_state_unpack: no plugin configured to unpack
>> data type 7696487 from step 10884.0
>> srun: error: Task launch for 10884.0 failed on node scf-sm20: Invalid
>> job credential
>> srun: error: Application launch failed: Invalid job credential
>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> We're running on a set of Ubuntu 14.04 machines, with SLURM v. 2.6.5
>> (i.e., the slurm-llnl 2.6.5-1 Ubuntu package that is the latest for
>> 14.04).
>>
>> We set up gres in the way suggested in the SLURM documentation (here
>> are the relevant lines from slurm.conf):
>> GresTypes=gpu
>> NodeName=our_gpu_nodename CPUs=24 SocketsPerBoard=2 CoresPerSocket=6
>> ThreadsPerCore=2 RealMemory=128908 TmpDisk=469325 Gres=gpu:1
>>
>> Any ideas?
>>
>> Thanks,
>> Chris
>>
>>
>> ----------------------------------------------------------------------------------------------
>> Chris Paciorek
>>
>> Statistical Computing Consultant
>> Statistical Computing Facility, Econometrics Laboratory, Berkeley
>> Research Computing
>>
>> Office: 495 Evans Hall                      Email: [email protected]
>> Mailing Address:                            Voice: 510-842-6670
>> Department of Statistics                    Fax:   510-642-7892
>> 367 Evans Hall                              Skype: cjpaciorek
>> University of California, Berkeley          WWW: www.stat.berkeley.edu/~paciorek
>> Berkeley, CA 94720 USA                      Permanent forward: [email protected]
