[slurm-dev] Re: scch not found !!

2016-02-10 Thread David Roman
Yes, I did it. BLCR installation: in blcr-0.8.5 I did not find any file named scch.

    tar xzf blcr-0.8.5.tar.gz
    cd blcr-0.8.5
    ./configure --enable-multilib=no
    make rpms
    cd rpm/RPMS/x86_64
    rpm -ivh blcr-0.8.5-1.x86_64.rpm blcr-modules_2.6.32_504.30.3.el6.x86_64-0.8.5-1.x86_64.rpm

[slurm-dev] Re: scch not found !!

2016-02-10 Thread Manuel Rodríguez Pascual
Hi David, As the slurm.conf man page says: checkpoint/blcr Berkeley Lab Checkpoint Restart (BLCR). NOTE: If a file is found at sbin/scch (relative to the Slurm installation location), it will be executed upon completion of the checkpoint. This can be a
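The man page note quoted in this thread describes sbin/scch as a file that is run only if it is found under the Slurm installation location. A minimal sketch for checking whether such a file is in place could look like the following. The function name `check_scch` and the `SLURM_PREFIX` variable are hypothetical names introduced here for illustration; the default prefix `/usr` is assumed only because the error reported in this thread mentions /usr/sbin/scch.

```shell
#!/bin/sh
# Report whether an executable sbin/scch exists under a given Slurm prefix.
# check_scch and SLURM_PREFIX are illustrative names, not part of Slurm or BLCR.
check_scch() {
    # Assumed default: /usr, matching the /usr/sbin/scch path from the error report.
    prefix="${1:-/usr}"
    hook="$prefix/sbin/scch"
    if [ -x "$hook" ]; then
        echo "present: $hook"
    else
        echo "absent: $hook"
    fi
}

# Check the prefix from the environment, falling back to the assumed default.
check_scch "${SLURM_PREFIX:-/usr}"
```

If the site installs Slurm somewhere else, passing that prefix as the first argument (or setting `SLURM_PREFIX`) checks the corresponding location instead.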

[slurm-dev] GRES for both K80 GPU's

2016-02-10 Thread Michael Senizaiz
I have a couple of nodes with 4xK80 GPUs in them (nvidia0-7). Is there a way to either request peer-to-peer GPUs, or force allocation to 2 GPUs at a time? We'd prefer the former (run when peer-to-peer is available, unless you don't care) so we can fit more users onto the machine. However,

[slurm-dev] EL6 clusters with cgroups enabled

2016-02-10 Thread Christopher B Coffey
Hi, I’m curious which kernel you are running on your el6 clusters that have cgroups enabled in slurm. I have an issue where some workloads cause 100’s-1000’s of flocks to occur relating to the memory cleanup portion in the cgroup. On the schedmd slurm site, I see the mention of this: *

[slurm-dev] Re: slurmd ownership

2016-02-10 Thread Jonathon A Anderson
I suspect the slurm init script doesn’t use the `status` command because it is specific to upstart, and not necessarily available in all init systems (e.g., in sysvinit). The slurmd pid file on one of our compute nodes appears to be non-root-readable.[1] What error were you having when trying

[slurm-dev] Re: AllowGroups and AD

2016-02-10 Thread Janne Blomqvist
On 2016-02-10 15:12, Diego Zuccato wrote: Hello all. I think I'm doing something wrong, but I don't understand what. I'm trying to limit users allowed to use a partition (that, coming from Torque, I think is the equivalent of a queue), but obviously I'm failing. :( Frontend and work nodes

[slurm-dev] Re: EL6 clusters with cgroups enabled

2016-02-10 Thread Christopher Samuel
On 11/02/16 06:15, Christopher B Coffey wrote: > I’m curious which kernel you are running on your el6 clusters that > have cgroups enabled in slurm. I have an issue where some workloads > cause 100’s-1000’s of flocks to occur relating to the memory cleanup > portion in the cgroup. This is

[slurm-dev] Re: slurmd ownership

2016-02-10 Thread jupiter
Hi Jonathon, I figured it out. The problem is not the root ownership but the way /etc/init.d/slurm implements "service slurm status": it checks the pid file, which causes a permission issue. Why doesn't it simply run "status slurmd", which works perfectly? I've modified the status and it works fine now,

[slurm-dev] Re: slurmd ownership

2016-02-10 Thread Christopher Samuel
On 11/02/16 12:31, jupiter wrote: > I am running slurm on CentOS 6. One thing I've just noticed is that the > slurmctld is running under the user slurm but the slurmd is running > under the root. slurmd has to run as root as it must be able to start processes as the users whose jobs are to run.

[slurm-dev] Setting up SLURM for a single multi-core node

2016-02-10 Thread Rohan Garg
Hello, I'm trying to set up SLURM-15.08.1 on a single multi-core node to manage multi-threaded jobs. The machine has 16 physical cores on 2 sockets with HyperThreading enabled. I'm using the EASY scheduling algorithm with backfilling. The goal is to fully utilize all the available cores at all

[slurm-dev] Re: slurm-dev Job not using cores from different nodes

2016-02-10 Thread Pierre Schneeberger
Dear Benjamin, Many thanks for your answer. The blades all have 32 cores, but I left one free since there is a VM running on each of them. I don't really know if that helps, but I wanted to be on the safe side :) I submitted the job with sbatch and the following script: #!/bin/bash #SBATCH -n 80 #

[slurm-dev] scch not found !!

2016-02-10 Thread David Roman
Hello, I am trying to use checkpointing in Slurm, but I get an error saying that /usr/sbin/scch was not found. I don't know what I need to do to install this file. Is it part of BLCR, of SLURM, or of something else? Thanks a lot for your reply. David

[slurm-dev] Re: scch not found !!

2016-02-10 Thread Manuel Rodríguez Pascual
Hi, You have to install BLCR first. It is quite straightforward, though. You can download it from the Berkeley Lab website at http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/berkeley-lab-checkpoint-restart-for-linux-blcr-downloads/ 2016-02-10 11:21 GMT+01:00 David Roman

[slurm-dev] AllowGroups and AD

2016-02-10 Thread Diego Zuccato
Hello all. I think I'm doing something wrong, but I don't understand what. I'm trying to limit users allowed to use a partition (that, coming from Torque, I think is the equivalent of a queue), but obviously I'm failing. :( Frontend and work nodes are all Debians joined to AD via Winbind (that

[slurm-dev] Re: slurmd ownership

2016-02-10 Thread Jonathon A Anderson
slurmd must run as root because it forks and execs processes on behalf of other users using the job owner’s uid. I don’t understand what trouble you’re having monitoring slurm with nagios. Could you give an example of what you’re trying to do, what you expect it to do, and what it does in

[slurm-dev] slurmd ownership

2016-02-10 Thread jupiter
Hi, I am running slurm on CentOS 6. One thing I've just noticed is that slurmctld is running as the user slurm but slurmd is running as root. I'm not quite sure why those daemons run under different owners; any insight, please? Further looked at both slurm daemon configure