Kilian, when you specify your CPU bindings in gres.conf, are you using the same IDs that show up in nvidia-smi?
We noticed that our CPU IDs were being remapped from their nvidia-smi values by SLURM according to hwloc, so to get affinity working we needed to use these remapped values. I'm wondering if --accel-bind=g is not using these same remappings, because when our jobs hang with the option, slurmd.log reports "fatal: Invalid gres data for gpu, CPUs=16-31". But when we omit the option, we get no such error and everything seems to work fine, including GPU affinity.

Thanks,
Dave

-----Original Message-----
From: Kilian Cavalotti [mailto:[email protected]]
Sent: Friday, October 27, 2017 2:44 PM
To: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: CPU/GPU Affinity Not Working

On Fri, Oct 27, 2017 at 12:45 PM, Dave Sizer <[email protected]> wrote:
> Also, supposedly adding the "--accel-bind=g" option to srun will do this,
> though we are observing that this is broken and causes jobs to hang.
>
> Can anyone confirm this?

Not really, it doesn't seem to be hanging for us:

-- 8< -----------------------------------------------------------------------
$ srun --gres=gpu:1 --accel-bind=g --pty bash
srun: job 2682093 queued and waiting for resources
srun: job 2682093 has been allocated resources
[kilian@sh-113-01 ~]$
[kilian@sh-113-01 ~]$ nvidia-smi topo -m
        GPU0    mlx5_0  CPU Affinity
GPU0     X      PHB     10-10
mlx5_0  PHB      X
[kilian@sh-113-01 ~]$
-- 8< -----------------------------------------------------------------------

How do you submit your job? You can try with "srun -vvv" to display some more information about the submission process.

Cheers,
--
Kilian
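To make the remapping Dave describes concrete, here is a minimal sketch of translating an OS (physical) CPU range, such as the one nvidia-smi reports, into hwloc logical indices. It assumes the mapping is available as text in the style of `lstopo-no-graphics --only pu` (lines like `PU L#1 (P#16)`). The sample text and the helper names `physical_to_logical`, `expand_range`, and `remap` are hypothetical and for illustration only; the real mapping on any given node will differ, and this is not a description of what SLURM does internally.

```python
import re

# Hypothetical sample in the style of `lstopo-no-graphics --only pu`.
# L# is hwloc's logical index, P# is the OS (physical) index.
SAMPLE_LSTOPO = """\
PU L#0 (P#0)
PU L#1 (P#16)
PU L#2 (P#1)
PU L#3 (P#17)
"""


def physical_to_logical(lstopo_text):
    """Build a {physical_id: logical_id} dict from lstopo-style text."""
    mapping = {}
    for m in re.finditer(r"PU L#(\d+) \(P#(\d+)\)", lstopo_text):
        logical, physical = int(m.group(1)), int(m.group(2))
        mapping[physical] = logical
    return mapping


def expand_range(spec):
    """Expand a CPU list such as '0-1,16' into [0, 1, 16]."""
    ids = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def remap(spec, mapping):
    """Translate a physical CPU list into sorted logical indices."""
    return sorted(mapping[p] for p in expand_range(spec))


if __name__ == "__main__":
    mapping = physical_to_logical(SAMPLE_LSTOPO)
    # Physical CPUs 16-17 map to logical indices 1 and 3 in the sample.
    print(remap("16-17", mapping))
```

With a mapping like this, a physical range that looks contiguous in nvidia-smi can become non-contiguous, or land on different numbers entirely, once expressed in logical indices, which is one way the two tools can appear to disagree.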
