Hi!
I just discovered a possible way for a user to abuse the priority system in Slurm.
This is the scenario:
1. A user has not run any jobs in a long time and therefore has a
high fairshare priority. Let's say: 1.
2. The user submits 1000 jobs into the queue that is far above his
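For context, fairshare weighting of this sort is governed by Slurm's multifactor priority plugin; a minimal slurm.conf sketch (the parameter values here are illustrative, not recommendations):

```
# slurm.conf -- fairshare-related settings (illustrative values)
PriorityType=priority/multifactor
PriorityWeightFairshare=10000   # weight of the fairshare factor in job priority
PriorityDecayHalfLife=7-0       # past usage decays with a 7-day half-life
PriorityMaxAge=7-0              # age factor saturates after 7 days in queue
```

With a long decay half-life, a user who has been idle for weeks ends up with a fairshare factor near 1.0, which matches the scenario described above.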
Hi all,
just a quick question regarding Slurm API.
In the following call,
(checkpoint.c)
/*
 * slurm_checkpoint_complete - note the completion of a job step's
 *	checkpoint operation.
 * IN job_id - job on which to perform operation
 * IN step_id - job step on which to perform operation
 * IN
Chris and David,
Thanks for the help! I'm still trying to find out why the
compute nodes are down or not responding. Any tips
on where to start?
How about open ports? Right now I have 6817 and
6818 open as per my slurm.conf. I also have 22 and 80
open as well as 111, 2049, and 32806. I'm using
Actually I don't have all the ports open :( I can do that
though (I thought that might be a problem).
Thanks!
Jeff
Do you have all the ports open between all the compute nodes as well?
Since Slurm builds a tree to communicate, all the nodes need to talk to
every other node on those ports, and do so without a huge amount of
latency. You might want to try upping your timeouts.
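On the timeout point, these are the slurm.conf knobs I'd look at (a sketch; values are illustrative, not recommendations):

```
# slurm.conf -- raise these if tree fan-out over slow links is timing out
MessageTimeout=30     # seconds to wait for an RPC reply (default is 10)
SlurmdTimeout=300     # seconds without contact before a node is marked DOWN
```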
-Paul Edmon-
On 03/31/2015
That's what I've done. Everything is in NFSv4 except for a
few bits:
/etc/slurm.conf
/etc/init.d/slurm
/var/log/slurm
/var/run/slurm
/var/spool/slurm
These bits are local to the node.
Will slurm have trouble in this case?
Thanks!
Jeff
Yes! There are problems if the clean-up scripts for cgroups reside on NFSv4.
Nodes will lock up when they try to remove a job's cgroup.
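A sketch of keeping those bits local in cgroup.conf, for Slurm of this vintage (the path is illustrative; the point is that it must be on local disk on every node, not an NFS mount):

```
# cgroup.conf -- release-agent scripts must live on local disk, not NFS
CgroupAutomount=yes
CgroupReleaseAgentDir=/etc/slurm/cgroup   # local path on every node
ConstrainCores=yes
```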
On 31.03.2015 at 17:06, Jeff Layton wrote:
Hi Magnus,
Unfortunately you found a bug. Here is a patch that will prevent users
from making persistent job priority changes. We should probably return
an error for this condition, but I would like to defer that change to
the next major release, v15.08.
Thanks for the reply.
But I don't think it is a network problem, because I start this only on the head
controller.
Can you see my config? I don't have iptables installed, or anything like that.
Config:
ControlMachine=JGSLURMHC
ControlAddr=172.16.40.42
#BackupController=
#BackupAddr=
AuthType=auth/munge
Put the slurmd and slurmctld in debug mode and retry the submission.
Then provide the logs.
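For reference, the debug levels can be raised persistently in slurm.conf (a sketch; the log paths are illustrative):

```
# slurm.conf -- verbose logging while troubleshooting
SlurmctldDebug=debug3
SlurmdDebug=debug3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
```

After changing these, restart the daemons (or run scontrol reconfigure) and retry the submission.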
On 31/03/2015 16:28, Jeff Layton wrote:
The problem mentioned with NFSv4 is keeping your SLURM installation on NFS, e.g.
exported to, but not physically residing on, your nodes.
--
*Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS |-*O*-
||_// Biomedical | Ryan