The gres.conf file is read by the slurmd daemon, while task launch and the claiming of GRES are done by slurmstepd, the job step shepherd. Here are some options:

1. Move the gres_plugin_step_set_env() and gres_plugin_job_set_env() calls from slurmstepd to slurmd (this is probably the most efficient solution, but it could break gres plugins that other people have developed).
2. Add a new gres plugin call for slurmd to set the CUDA environment variables based upon file names that it already has, and leave the other function calls in slurmstepd (more work, but still very efficient, and it will not break any gres plugins developed by other people).
3. Modify slurmstepd to read gres.conf to get the nvidia device numbers (the simplest solution, but it adds overhead to each job launch).
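
Options 2 and 3 both depend on turning a device file name into a device number. A minimal sketch of that step, assuming the files follow the usual /dev/nvidiaN naming; device_number_from_path() is a hypothetical helper for illustration, not an existing Slurm function:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Slurm): return the trailing decimal
 * number of a device path, e.g. "/dev/nvidia2" -> 2.
 * Returns -1 if the path is NULL or does not end in a digit.
 */
int device_number_from_path(const char *path)
{
    size_t len, start;

    if (path == NULL)
        return -1;

    len = strlen(path);
    start = len;

    /* Walk backwards over the trailing run of digits. */
    while (start > 0 && isdigit((unsigned char)path[start - 1]))
        start--;

    if (start == len)   /* no trailing digits at all */
        return -1;

    return (int)strtol(path + start, NULL, 10);
}
```

Whichever daemon ends up doing this, the parsed number rather than the position in gres.conf is what would go into CUDA_VISIBLE_DEVICES.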

Quoting Nicolas Bigaouette <nbigaoue...@gmail.com>:

Hi Moe, thanks for the indications.

On Tue, Jan 24, 2012 at 4:15 PM, Moe Jette <je...@schedmd.com> wrote:

In the case of gres/gpu, it just sets the CUDA_VISIBLE_DEVICES environment
variable based upon the position(s) in the bitmap allocated to the job or
step.


Probably the simplest way to get the correct environment variable would be
to modify node_config_load() to cache device numbers and then use those
device numbers rather than the bitmap index to set CUDA_VISIBLE_DEVICES
values in job_set_env() and step_set_env().
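
A rough sketch of the difference between the two approaches, not the actual gres/gpu code. The function name, the use of an unsigned long as the allocation bitmap, and the dev_nums table are all assumptions made for illustration. With dev_nums set to NULL the bitmap index itself is emitted (the behaviour described above); with a cached table the device number is emitted instead:

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical sketch: build a comma-separated CUDA_VISIBLE_DEVICES value
 * from the bits set in alloc_mask.
 *   dev_nums == NULL : emit the bitmap index
 *   dev_nums != NULL : emit dev_nums[index], the cached device number
 * Returns 0 on success, -1 on bad arguments or if buf is too small.
 */
int build_visible_devices(unsigned long alloc_mask, const int *dev_nums,
                          int dev_cnt, char *buf, size_t buf_len)
{
    size_t used = 0;
    int max_bits = (int)(sizeof(alloc_mask) * 8);
    int i;

    if (buf == NULL || buf_len == 0)
        return -1;
    buf[0] = '\0';

    for (i = 0; i < dev_cnt && i < max_bits; i++) {
        int value, n;

        if (!(alloc_mask & (1UL << i)))
            continue;   /* this device is not allocated */

        value = dev_nums ? dev_nums[i] : i;
        n = snprintf(buf + used, buf_len - used, "%s%d",
                     used ? "," : "", value);
        if (n < 0 || (size_t)n >= buf_len - used)
            return -1;  /* output would have been truncated */
        used += (size_t)n;
    }
    return 0;
}
```

For a node whose gres.conf lists only /dev/nvidia2 and /dev/nvidia3, a job allocated both GPUs would get "0,1" from the index-based form and "2,3" from the cached-number form.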

This is exactly what I'm trying to do. I'm familiarizing myself with the
code and experimenting with a few things. Unfortunately, I don't see how
information can be "transferred" from node_config_load() to job_set_env().
node_config_load() only takes the file entries as input arguments and does
not have any output variables. Also, a variable global to the gres_gpu.c
file does not work because, it seems, job_set_env() is executed in a different
process than node_config_load() and as such does not share memory. It might
not be exactly this situation, but the memory is definitely not shared and
thus job_set_env() cannot access variables set by node_config_load().
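
The general point can be shown with a small POSIX sketch: a file-scope variable written in one process is not seen by another. This uses a plain fork() and made-up names (cached_device, parent_view_after_child_write); it only illustrates separate address spaces and does not reproduce how slurmd actually launches slurmstepd.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stands in for a file-scope variable in a plugin source file. */
static int cached_device = -1;

/*
 * Fork a child that writes child_value into cached_device, wait for it,
 * then return the value the parent still sees.
 * Returns -2 if fork() or waitpid() fails.
 */
int parent_view_after_child_write(int child_value)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -2;

    if (pid == 0) {
        /* Child: this write lands in the child's own copy of memory. */
        cached_device = child_value;
        _exit(0);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -2;

    /* Parent: its copy of the variable is unchanged. */
    return cached_device;
}
```

The parent's copy keeps its initial value, so anything one process caches has to reach the other through an explicit channel such as a file, a pipe, or a message.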

So either there is a simple way to share that information that I have
not found, or the information will have to be passed through function
arguments. But then that would change the API...

I hope I'm just missing something obvious somewhere! How is
node_config_load() supposed to configure anything if job_set_env() can't
have access to that information?

Thanks

Nicolas



