Dear Community,

We are trying to activate GPU sharding.
Our compute nodes are configured with 64 cores, 4 physical MI250X GPUs (8 
logical GPUs), and 4 NUMA domains: 1 physical GPU (2 logical GPUs) per NUMA 
domain, and 1 logical GPU per L3 cache domain.

gres.conf
AutoDetect=rsmi
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD128 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD129 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD130 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD131 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD132 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD133 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD134 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD135 Count=4
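For reference, the matching slurm.conf entries look roughly like this (a 
sketch, not our exact config; the shard count of 32 assumes 4 shards on each 
of the 8 logical GPUs, as declared above):

GresTypes=gpu,shard
NodeName=c6n[3339-3348] Gres=gpu:8,shard:32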

If I ask for 2 cores with block:cyclic distribution, I get the expected result:
srun -N1 -n2 -c1 --cpu-bind=cores -m block:cyclic --pty bash
cpuset cgroup is 1,17

But if I add 2 shards to the request, I get an unexpected result:
srun -N1 -n2 -c1 --cpu-bind=cores --gres=shard:2 -m block:cyclic --pty bash
cpuset cgroup is 1-2 
ROCR_VISIBLE_DEVICES=0
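To see what each task actually receives, I check the affinity and GPU 
visibility per task with something like this (a debugging sketch; the 
taskset output format varies by distribution):

srun -N1 -n2 -c1 --cpu-bind=cores --gres=shard:2 -m block:cyclic \
  bash -c 'echo "task $SLURM_PROCID: ROCR_VISIBLE_DEVICES=$ROCR_VISIBLE_DEVICES $(taskset -cp $$)"'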


Is it possible to request 2 shards in a round-robin fashion, so that a 
multi-GPU job runs on different GPUs?
srun -N1 -n2 -c1 --cpu-bind=cores --gres=shard:2 -m block:cyclic --pty bash

In practice, I would like to get this result:
cpuset cgroup is 1,17
ROCR_VISIBLE_DEVICES=0,1
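One variant I have considered, but am not sure about (it assumes Slurm 23.02 
or newer, where --tres-per-task accepts gres types), is requesting one shard 
per task instead of two per job:

srun -N1 -n2 -c1 --cpu-bind=cores --tres-per-task=gres/shard:1 -m block:cyclic --pty bash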

Thank you in advance,
Alessandro

-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
