[slurm-dev] Re: CUDA_VISIBLE_DEVICES always set to 0 and mismatch with cgroups

2015-06-08 Thread Kilian Cavalotti
Hi Ewan, On Mon, Jun 8, 2015 at 2:39 AM, Roche Ewan ewan.ro...@epfl.ch wrote: The underlying problem seems to be that SLURM isn’t correctly setting CUDA_VISIBLE_DEVICES to match the device allowed by the cgroup. Slurm actually does the right thing. The real culprit here is the NVML. So for

[slurm-dev] Submitting to test slurmctl

2015-06-08 Thread Jordan Willis
Hello, I have successfully run the production and test slurmctld on our submission node. How do you actually specify which controller daemon to submit to? By default it’s using the production controller on the default ports, but I want to submit to my test controller that is using

[slurm-dev] Re: Limit user to use not more than N number of ntasks/cpu's on specific partition

2015-06-08 Thread Igor Chebotar
Hi, I saw that, but how do I set the limit on a partition? Do I have to use sacctmgr to set a GrpCPUs limit for each user, or can I just edit the PartitionName= line in slurm.conf? Is there an example showing how to configure it? Thanks, igor. On 08/06/15 17:44, Moe Jette wrote: See:

[slurm-dev] GDB Help

2015-06-08 Thread Dinesh Kumar
Dear All, I am using GDB to debug a segmentation fault. When I use the backtrace (bt) command, I get the following results. How can I see the values shown as item=&lt;optimized out&gt;? Any other suggestions are also appreciated. ... at layouts_mgr.c:1975 1975 if (keydef->flags

[slurm-dev] CUDA_VISIBLE_DEVICES always set to 0 and mismatch with cgroups

2015-06-08 Thread Roche Ewan
Hello, we’re seeing some odd behaviour with version 14.11.4 regarding the interaction between cgroups and GPUs allocated via GRES. The underlying problem seems to be that SLURM isn’t correctly setting CUDA_VISIBLE_DEVICES to match the device allowed by the cgroup. On one node we run two jobs

[slurm-dev] Memory leak options and usage in SLURM

2015-06-08 Thread Dinesh Kumar
Dear All, I want to use Valgrind to check for memory leaks. I found the option --enable-memory-leak-debug, but I would like to know more about how to use it to find and fix memory leaks. Thanks in advance for your help and suggestions. Regards Dineshkumar RAJAGOPAL *Grenoble

[slurm-dev] Limit user to use not more than N number of ntasks/cpu's on specific partition

2015-06-08 Thread Igor Chebotar
Hello, I was searching for a slurm.conf partition option that limits each user to a specific number of CPUs on that partition. Is it possible? I want to configure it so that there will be no situation where one user is using all resources in the

[slurm-dev] Restated: slurmctld makes odd decisions about jobs that completed while it was down, was: State of the accounting database after a controller failure

2015-06-08 Thread Andy Riebs
Upon reflection, the NODE_FAIL state reported by sacct, which I mentioned earlier, is really just a symptom; the problem (as noted further down) is that slurmctld reports a node failure for any job that was running when slurmctld went offline, regardless of the state of the job when slurmctld comes

[slurm-dev] Re: Limit user to use not more than N number of ntasks/cpu's on specific partition

2015-06-08 Thread Moe Jette
See: http://slurm.schedmd.com/resource_limits.html Quoting Igor Chebotar ichebo...@univ.haifa.ac.il: Hello, I was searching for an option to configure in slurm.conf partition that will limit each user to use not more than specific number of cpu's per on partition. Is it possible? i want

[slurm-dev] Re: GDB Help

2015-06-08 Thread Bob Moench
If the segfault will still occur when you compile with -g, you should be good. You may also need -O0 to turn off optimizations. Bob On Mon, 8 Jun 2015, Dinesh Kumar wrote: Dear All, I am using GDB to debug segment fault. While I am using backtrace (bt) option. I am getting forllowing

[slurm-dev] Re: GDB Help

2015-06-08 Thread Dinesh Kumar
Resolved the problem. Thanks, Bob. Regards Dineshkumar RAJAGOPAL *Grenoble Institute Of Technology* *Grenoble,France* On Mon, Jun 8, 2015 at 5:25 PM, Bob Moench r...@cray.com wrote: If the segfault will still occur when you compile with -g, you should be good. You may also need -O0 to

[slurm-dev] Re: Messing with job checkpointing

2015-06-08 Thread Manuel Rodríguez Pascual
That's exactly what I was looking for, thanks very much. 2015-06-02 16:30 GMT+02:00 Moe Jette je...@schedmd.com: See the MinJobAge configuration option: http://slurm.schedmd.com/slurm.conf.html Quoting Manuel Rodríguez Pascual manuel.rodriguez.pasc...@gmail.com: Hi all, I have been