Hi Carles, thanks for reporting this. I will check it out when I'm back from holidays, after the 16th of August.
In the meanwhile, don't hesitate to report any other problems you find.

Thanks,
Yiannis Georgiou

[email protected] wrote:
----- To: [email protected]
From: Carles Fenoy
Sent by: [email protected]
Date: 28/07/2011 18:18
Subject: Re: [slurm-dev] Status of cgroups implementation

Hi Yiannis,

I will not attend the User's Group Meeting, but my colleague will.

Thank you very much for the latest updates to the cgroups plugin. I have been testing it since Monday and found some problems when trying to use the devices subsystem.

I submit a job with:

    sbatch --gres=gpu:1 --ntasks=1 --cpus-per-task=1 --wrap='env; srun env | grep CUDA;'

When it starts, I get this in slurm-20256.out:

    256]: bitstring.c:238: bit_nclear: Assertion `(start) < ((b)[1])' failed.

I found this is a problem with the devices subsystem. When the task/cgroup plugin tries to create the devices cgroup for the task (task_cgroup_devices_create, task_cgroup_devices.c:243), it calls gres_plugin_node_config_devices_path(...), which requires the gres plugin to have already loaded its configuration.

I have worked around it by adding this line before the gres_plugin_node_config_devices_path call (line 239):

    gres_plugin_node_config_load(8); // the node has 8 cpus

Obviously this is not portable: it should take the slurmd configured CPU count as a parameter, but I did not find out how to get that value. I hope you can find a solution for this. I'll keep testing this plugin, as it will be very useful for us.

Carles Fenoy
Barcelona Supercomputing Center

On Wed, Jul 27, 2011 at 5:23 PM, Jerry Smith <[email protected]> wrote:

This is great news, both the cgroups code being finalized and a tutorial being offered at the User's Group. I will be pulling the latest version and running some tests on our side, and am now going to sign up for the User's Group Meeting.

--Jerry

[email protected] wrote:

Hi Carles,

We have just finalized the code, and you are welcome to test it and send us back your comments!

You can either pull the latest development version from https://github.com/SchedMD/slurm/ (some last patches were added on Monday, so make sure you pull the latest version) or wait for the next release of version 2.3.0, which should be out shortly, as Moe announced some days ago.

By the way, the devices part of cgroups is still considered experimental on the kernel side, but I hope it will become stable by the end of the year.

I don't know if you will be attending the User Group in September, but there will be a tutorial session dedicated to the cgroups support in SLURM, and there will be discussions about future developments on the subject.

Concerning your observation, I think this is the expected behaviour, but perhaps Moe could answer this one better.

Regards,
Yiannis Georgiou

From: Carles Fenoy <[email protected]>
To: [email protected]
Date: 07/22/2011 03:35 PM
Subject: [slurm-dev] Status of cgroups implementation
Sent by: [email protected]

Hi all,

We are considering using cgroups in a new GPU cluster, and I want to know the current status of the devices part of the cgroups plugin.

We have also observed that the tasks of a job requesting gres that don't request generic resources explicitly are not assigned any resources. Example:

A job requests 2 gpus with:

    sbatch --gres=gpu:1 --ntasks=2 --cpus-per-task=2 --wrap="env; srun env | grep CUDA"

The first env shows:

    CUDA_VISIBLE_DEVICES=0

although "srun env" shows:

    CUDA_VISIBLE_DEVICES=NoDevFiles
    CUDA_VISIBLE_DEVICES=NoDevFiles

Is this the expected behavior? Maybe if a job requests gres and its steps don't, slurmstepd should not overwrite the job environment in gres_gpu.c (line 211):

    } else {
        /* The gres.conf file must identify specific device files
         * in order to set the CUDA_VISIBLE_DEVICES env var */
        env_array_overwrite(job_env_ptr, "CUDA_VISIBLE_DEVICES",
                            "NoDevFiles");
    }

--
Carles Fenoy
