Yes, I did it.
BLCR installation: in blcr-0.8.5 I did not find any file named scch
tar xzf blcr-0.8.5.tar.gz
cd blcr-0.8.5
./configure --enable-multilib=no
make rpms
cd rpm/RPMS/x86_64
rpm -ivh blcr-0.8.5-1.x86_64.rpm \
    blcr-modules_2.6.32_504.30.3.el6.x86_64-0.8.5-1.x86_64.rpm
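To sanity-check the install afterwards, something like this should work (a
sketch; module and tool names are those of a standard BLCR 0.8.5 build):

modprobe blcr        # pulls in blcr_imports as a dependency
lsmod | grep blcr    # both kernel modules should be listed
cr_run --help        # userland tools installed by the blcr RPM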
Hi David,
As can be read in the slurm.conf man page:
checkpoint/blcr    Berkeley Lab Checkpoint Restart (BLCR). NOTE: If a file
is found at sbin/scch (relative to the Slurm installation location), it
will be executed upon completion of the checkpoint. This can be a
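From the man text above, scch appears to be an optional site-provided hook
rather than a file shipped by BLCR or Slurm. The plugin itself is selected
in slurm.conf (a minimal sketch, assuming Slurm was configured with
--prefix=/usr, so the hook would be looked for at /usr/sbin/scch):

CheckpointType=checkpoint/blcr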
I have a couple of nodes with 4xK80 GPUs in them (nvidia0-7).
Is there a way to either request peer-to-peer GPUs, or force allocation of
2 GPUs at a time? We'd prefer the former (run when peer-to-peer is
available, unless you don't care) so we can fit more users onto the
machine. However,
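A pairwise request itself is easy to express, though it does not by itself
guarantee that the two GPUs are peers on the same K80 board (a sketch; the
gres.conf device and CPU bindings are illustrative):

#SBATCH --gres=gpu:2           # allocate GPUs two at a time

# gres.conf: binding devices to CPUs can steer pairs onto one socket
Name=gpu File=/dev/nvidia0 CPUs=0-7
Name=gpu File=/dev/nvidia1 CPUs=0-7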
Hi,
I’m curious which kernel you are running on your el6 clusters that have
cgroups enabled in Slurm. I have an issue where some workloads cause
100s-1000s of flocks relating to the memory-cleanup portion of the
cgroup code. On the SchedMD Slurm site, I see a mention of this:
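For reference, "cgroups enabled in Slurm" typically means a setup roughly
like this (a sketch; exact plugin choices vary by site):

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes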
I suspect the slurm init script doesn’t use the `status` command because it is
specific to upstart, and not necessarily available in all init systems (e.g.,
in sysvinit).
The slurmd pid file on one of our compute nodes appears to be
non-root-readable.[1] What error were you having when trying
On 2016-02-10 15:12, Diego Zuccato wrote:
> Hello all.
> I think I'm doing something wrong, but I don't understand what.
> I'm trying to limit users allowed to use a partition (that, coming from
> Torque, I think is the equivalent of a queue), but obviously I'm failing. :(
> Frontend and work nodes
On 11/02/16 06:15, Christopher B Coffey wrote:
> I’m curious which kernel you are running on your el6 clusters that
> have cgroups enabled in Slurm. I have an issue where some workloads
> cause 100s-1000s of flocks relating to the memory-cleanup portion
> of the cgroup code.
This is
Hi Jonathon,
I figured it out: the problem is not the root ownership but the way
/etc/init.d/slurm implements "service slurm status". It checks the pid
file, which caused the permission issue. Why doesn't it simply run
"status slurmd", which works perfectly?
I've modified the status action and it works fine now,
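The idea behind the fix, sketched rather than the exact init script change:
query the process table instead of reading the root-only pid file.

# readable by unprivileged users, unlike /var/run/slurmd.pid
pidof slurmd >/dev/null && echo "slurmd is running" || echo "slurmd is stopped"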
On 11/02/16 12:31, jupiter wrote:
> I am running Slurm on CentOS 6. One thing I've just noticed is that
> slurmctld is running under the user slurm but slurmd is running as
> root.
slurmd has to run as root because it must be able to start processes as
the users whose jobs it runs.
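You can confirm the split on any node (a quick check; output shown is
typical, not exact):

ps -eo user,comm | egrep 'slurmctld|slurmd'
#   slurm  slurmctld    (on the controller)
#   root   slurmd       (on each compute node)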
Hello,
I'm trying to set up SLURM-15.08.1 on a single multi-core node to
manage multi-threaded jobs. The machine has 16 physical cores
on 2 sockets with HyperThreading enabled. I'm using the EASY
scheduling algorithm with backfilling. The goal is to fully utilize all
the available cores at all
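For a box like that, the slurm.conf topology and consumable-resource lines
would look roughly like this (a sketch; the hostname is illustrative, and
CR_Core schedules the 16 physical cores rather than the 32 hardware threads):

SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
NodeName=node01 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN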
Dear Benjamin,
Many thanks for your answer.
The blades all have 32 cores, but I left one free since there is a VM
running on each of them; I don't really know if that helps, but I wanted
to be on the safe side :)
I submitted the job with sbatch and the following script:
#!/bin/bash
#SBATCH -n 80 #
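(Illustration only, not part of the original script: one way to enforce the
one-core-free policy on 32-core blades is a per-node task cap.)

#SBATCH -n 80
#SBATCH --ntasks-per-node=31   # leave one of the 32 cores for the VM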
Hello,
I'm trying to use checkpointing in Slurm, but I get an error that
/usr/sbin/scch is not found.
I don't know what to do to install this file. Is it part of BLCR, Slurm,
or something else?
Thanks a lot for your reply.
David
hi,
You have to install BLCR first. It is quite straightforward, though. You can
download it from the Berkeley Lab website at
http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/berkeley-lab-checkpoint-restart-for-linux-blcr-downloads/
2016-02-10 11:21 GMT+01:00 David Roman
Hello all.
I think I'm doing something wrong, but I don't understand what.
I'm trying to limit users allowed to use a partition (that, coming from
Torque, I think is the equivalent of a queue), but obviously I'm failing. :(
Frontend and work nodes are all Debian machines joined to AD via Winbind (that
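One common way to do this is a per-partition group restriction in slurm.conf;
with Winbind, AD groups resolve like local Unix groups (a sketch; partition,
node, and group names are illustrative):

PartitionName=restricted Nodes=wn[01-08] AllowGroups=mygroup State=UP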
slurmd must run as root because it forks and execs processes on behalf of
other users, using the job owner’s uid.
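Conceptually, slurmd does the equivalent of the following, which only root
is allowed to do (a sketch, not actual slurmd code; user and path are
illustrative):

su - alice -c '/var/spool/slurm/job_script.sh'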
I don’t understand what trouble you’re having monitoring slurm with nagios.
Could you give an example of what you’re trying to do, what you expect it to
do, and what it does in
Hi,
I am running Slurm on CentOS 6. One thing I've just noticed is that
slurmctld is running under the user slurm but slurmd is running as
root. I'm not quite sure why those daemons have different ownerships;
any insight, please? Further looked at both slurm daemon configure