On Mon, 23 Jan 2012 18:27:09 -0700, Moe Jette <je...@schedmd.com> wrote:
> What do you mean by "slurm will submit jobs to /dev/nvidia1"?
> 
> If you have task/cgroup configured, then those specific pathnames  
> should be in the job's cgroup and other device files should not,  
> although I have not personally verified cgroup support for gres files.
> 
> If you mean that SLURM sets the CUDA_VISIBLE_DEVICES environment  
> variable to 1 to represent the second GPU in SLURM's table (starting  
> the counting at zero) and the CUDA software treats that as  
> representing /dev/nvidia1, I could see that happening. If that is the  
> problem, the environment variable should probably be set based upon  
> the device file name rather than its index number within the SLURM  
> gres.conf file. Not a trivial change. The relevant logic is the call  
> to gres_plugin_job_set_env() from slurmd/slurmstepd/slurmstepd.c


CUDA_VISIBLE_DEVICES is a CUDA API variable, and CUDA doesn't know
anything about "SLURM's table" of GPUs. If you are going to set
this variable from SLURM, it needs to be set relative to the device
numbering known to CUDA, not SLURM's own device numbering.
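
For illustration only (this is not SLURM code, and the helper names are
made up): a minimal Python sketch of the mismatch, assuming CUDA's default
enumeration follows the /dev/nvidiaN minor numbers. With Nicolas's
gres.conf below, SLURM's table index 1 means /dev/nvidia2, but exporting
"1" makes CUDA pick /dev/nvidia1:

```python
import re

# Order as listed in gres.conf; /dev/nvidia1 (the monitor card) is absent.
gres_files = ["/dev/nvidia0", "/dev/nvidia2"]

def slurm_index_env(gres_index):
    # What SLURM currently exports: its own gres-table index.
    return str(gres_index)

def device_minor_env(gres_index):
    # The suggested fix: derive the number from the device file name,
    # which matches CUDA's enumeration of /dev/nvidiaN.
    return re.search(r"(\d+)$", gres_files[gres_index]).group(1)

print(slurm_index_env(1))   # "1" -> CUDA would use /dev/nvidia1 (monitor card)
print(device_minor_env(1))  # "2" -> CUDA would use /dev/nvidia2 (intended)
```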

mark

 
> Quoting Nicolas Bigaouette <nbigaoue...@gmail.com>:
> 
> > Hi all,
> >
> > I'm using slurm to submit jobs on workstations which have 3 GPUs. Two are
> > used for GPGPU and one is used for the monitor. Clearly, I don't want jobs
> > submitted to the monitor card.
> >
> > Thus, my /etc/slurm/gres.conf contains the following:
> >
> >> Name=gpu File=/dev/nvidia0
> >> Name=gpu File=/dev/nvidia2
> >>
> > Note that /dev/nvidia1 is not present there: this dev entry is the
> > underpowered card that drives the user's monitor.
> >
> > Unfortunately, it seems slurm will submit jobs to /dev/nvidia1 when
> > /dev/nvidia0 is busy (or when a job asks for 2 GPUs).
> >
> > I've checked the source to see where the decision was taken. In
> > src/plugins/gres/gpu/gres_gpu.c, the function job_set_env() sets
> > "CUDA_VISIBLE_DEVICES", used to identify the device chosen by slurm for
> > cuda to run on.
> >
> > I have trouble understanding where the choice of GPU device is made.
> > I think this information is encoded in gres_job_ptr->gres_bit_alloc[0] and
> > that job_set_env() is fine, but I can't find the logic where this is set.
> >
> > Anyone could provide a clue?
> >
> > Thanks a lot.
> >
> > Nicolas
> >
> 
> 
> 
