Kilian, when you specify your CPU bindings in gres.conf, are you using the same 
IDs that show up in nvidia-smi?

We noticed that SLURM, following hwloc, renumbers the CPU IDs relative to the 
values nvidia-smi reports, so to get affinity working we had to use the 
remapped values in gres.conf.
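
To illustrate the translation, here is a rough Python sketch. It is not taken 
from SLURM itself. It assumes a hypothetical table mapping OS (physical) CPU 
indices, as nvidia-smi shows them, to hwloc logical indices; on a real node 
that table would have to be read from hwloc, for example from the L#/P# pairs 
in lstopo output. The function names and the 8-CPU layout are made up for 
illustration.

```python
def expand_cpu_list(spec):
    """Expand a range list such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def compress_cpu_list(cpus):
    """Collapse CPU indices back into range-list form, e.g. [0, 1, 2, 8] -> "0-2,8"."""
    cpus = sorted(set(cpus))
    ranges = []
    start = prev = None
    for c in cpus:
        if start is None:
            start = prev = c
        elif c == prev + 1:
            prev = c
        else:
            ranges.append((start, prev))
            start = prev = c
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def remap_cpu_list(spec, physical_to_logical):
    """Translate a physical-index range list into logical indices."""
    return compress_cpu_list(physical_to_logical[p] for p in expand_cpu_list(spec))


# Hypothetical two-socket, 8-CPU node where the OS interleaves CPU numbers
# across sockets while the logical numbering keeps each socket contiguous.
HYPOTHETICAL_MAP = {0: 0, 2: 1, 4: 2, 6: 3, 1: 4, 3: 5, 5: 6, 7: 7}

if __name__ == "__main__":
    # Physical CPUs on the second socket, as an affinity column might list them.
    print(remap_cpu_list("1,3,5,7", HYPOTHETICAL_MAP))
```

With this made-up table, the physical list "1,3,5,7" becomes the logical range 
"4-7", which is the form we would then put in gres.conf.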

I'm wondering whether --accel-bind=g fails to apply the same remapping, because 
when our jobs hang with that option, slurmd.log reports "fatal: Invalid gres 
data for gpu, CPUs=16-31". When we omit the option, we get no such error 
and everything seems to work fine, including GPU affinity.

Thanks
Dave

-----Original Message-----
From: Kilian Cavalotti [mailto:[email protected]] 
Sent: Friday, October 27, 2017 2:44 PM
To: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: CPU/GPU Affinity Not Working


On Fri, Oct 27, 2017 at 12:45 PM, Dave Sizer <[email protected]> wrote:
> Also, supposedly adding the "--accel-bind=g" option to srun will do this, 
> though we are observing that this is broken and causes jobs to hang.
>
> Can anyone confirm this?

Not really, it doesn't seem to be hanging for us:

-- 8< -----------------------------------------------------------------------
$ srun  --gres=gpu:1  --accel-bind=g --pty bash
srun: job 2682093 queued and waiting for resources
srun: job 2682093 has been allocated resources
[kilian@sh-113-01 ~]$
[kilian@sh-113-01 ~]$ nvidia-smi topo -m
       GPU0    mlx5_0  CPU Affinity
GPU0     X      PHB     10-10
mlx5_0  PHB      X
[kilian@sh-113-01 ~]$
-- 8< -----------------------------------------------------------------------

How do you submit your job? You can try with "srun -vvv" to display some more 
information about the submission process.

Cheers,
--
Kilian
