Actually, I just realized that in my testing, I was not sync'ing the
slurm.conf file to all the worker nodes, just to the server running
slurmctld and to the worker that has the GPU on it. When I sync'ed the
slurm.conf containing the gres information to all the nodes, the
problem went away.
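
For anyone hitting the same error, here is a minimal sketch of how one
might check that every node's slurm.conf matches the controller's copy.
It assumes each node's file has already been fetched to a local path
(for example with scp); the function names and the paths in the usage
note are hypothetical and not part of SLURM.

```python
import hashlib
from pathlib import Path


def config_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_out_of_sync(reference, copies):
    """Return the sorted node names whose copy differs from `reference`.

    `reference` is the path to the controller's slurm.conf.
    `copies` maps a node name to the local path of that node's copy
    (hypothetical layout; fetching the copies is left to the caller).
    """
    ref_digest = config_digest(reference)
    return sorted(
        node
        for node, path in copies.items()
        if config_digest(path) != ref_digest
    )
```

Calling find_out_of_sync("ctl/slurm.conf", {"node1": "node1/slurm.conf"})
returns an empty list when the copies are byte-for-byte identical, and
otherwise the names of the nodes that need to be re-synced.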

Thanks for the responses.



On Fri, Jan 22, 2016 at 10:13 AM, John Desantis <[email protected]> wrote:
> Chris,
>
> Could you enable the Gres debugging via the DebugFlags and post the relevant
> output?
>
> It would be interesting to see what the logs state concerning what Gres
> types have been found on the node in question.
>
> John DeSantis
>
>
> 2016-01-22 12:31 GMT-05:00 Chris Paciorek <[email protected]>:
>>
>>
>> Hi John, we have one gres.conf per node on the GPU node. It's a
>> one-line file containing this line:
>>
>> Name=gpu File=/dev/nvidia0
>>
>> On Fri, Jan 22, 2016 at 6:42 AM, John Desantis <[email protected]>
>> wrote:
>> > Chris,
>> >
>> > Ok, at least we got the obvious out of the way!
>> >
>> > What does your gres.conf look like?  Do you have one per node on the GPU
>> > enabled nodes, or a single system wide gres.conf?
>> >
>> > Here is an example of what we're using (some unneeded content removed;
>> > sanitized hostnames):
>> >
>> > # slurm.conf
>> > GresTypes=gpu
>> > NodeName=racka-[1-8] CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48258 Feature="..." Gres=gpu:1 Weight=1000
>> > NodeName=rackb-[1-10,19-28] CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32076 Feature="..." Gres=gpu:2 Weight=1000
>> > NodeName=rackb-29 CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32073 Feature="...." Gres=gpu:2 Weight=1000
>> >
>> > # gres.conf
>> > NodeName=racka-[1-8] Name=gpu File=/dev/nvidia0
>> > NodeName=rackb-[1-10,19-29] Name=gpu File=/dev/nvidia[0-1]
>> >
>> > John DeSantis
>> >
>> >
>> > 2016-01-21 23:21 GMT-05:00 Chris Paciorek <[email protected]>:
>> >>
>> >>
>> >> Whoops, there was a bug in my posting - I actually was using --pty.
>> >> The invocation leading to the error message is:
>> >>
>> >> srun --gres=gpu:1 --pty /bin/bash
>> >>
>> >> On Thu, Jan 21, 2016 at 3:23 PM, John Desantis <[email protected]>
>> >> wrote:
>> >> > Chris,
>> >> >
>> >> > Try using "--pty /bin/bash" to get a shell, and see if that helps.
>> >> >
>> >> > John DeSantis
>> >> >
>> >> > On Jan 21, 2016 5:47 PM, "Chris Paciorek" <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >>
>> >> >> We've been trying out the use of gres to control access to our GPU. It
>> >> >> works fine for a batch submission but when submitting via srun to get
>> >> >> an interactive session we get the following error:
>> >> >>
>> >> >> paciorek@machine:~/> srun --gres=gpu:1 /bin/bash
>> >> >> srun: error: gres_plugin_job_state_unpack: no plugin configured to unpack data type 7696487 from job 10884
>> >> >> srun: gres_plugin_step_state_unpack: no plugin configured to unpack data type 7696487 from step 10884.0
>> >> >> srun: error: Task launch for 10884.0 failed on node scf-sm20: Invalid job credential
>> >> >> srun: error: Application launch failed: Invalid job credential
>> >> >> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> >> >> srun: error: Timed out waiting for job step to complete
>> >> >>
>> >> >> We're running on a set of Ubuntu 14.04 machines, with SLURM v. 2.6.5
>> >> >> (i.e., the slurm-llnl 2.6.5-1 Ubuntu package that is the latest for
>> >> >> 14.04).
>> >> >>
>> >> >> We set up gres in the way suggested in the SLURM documentation (here
>> >> >> are the relevant lines from slurm.conf):
>> >> >> GresTypes=gpu
>> >> >> NodeName=our_gpu_nodename CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128908 TmpDisk=469325 Gres=gpu:1
>> >> >>
>> >> >> Any ideas?
>> >> >>
>> >> >> Thanks,
>> >> >> Chris
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> ----------------------------------------------------------------------------------------------
>> >> >> Chris Paciorek
>> >> >>
>> >> >> Statistical Computing Consultant
>> >> >> Statistical Computing Facility, Econometrics Laboratory, Berkeley
>> >> >> Research Computing
>> >> >>
>> >> >> Office: 495 Evans Hall                      Email: [email protected]
>> >> >> Mailing Address:                            Voice: 510-842-6670
>> >> >> Department of Statistics                    Fax:   510-642-7892
>> >> >> 367 Evans Hall                              Skype: cjpaciorek
>> >> >> University of California, Berkeley          WWW: www.stat.berkeley.edu/~paciorek
>> >> >> Berkeley, CA 94720 USA                      Permanent forward: [email protected]
>> >
>> >
>
>
