[slurm-dev] A way of abuse the priority option in Slurm?

2015-03-31 Thread Magnus Jonsson
Hi! I just discovered a possible way for a user to abuse the priority in Slurm. This is the scenario: 1. A user has not run any jobs in a long time and therefore has a high fairshare priority. Let's say: 1. 2. The user submits 1000 jobs into the queue, which is far above his
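The truncated scenario seems to turn on priorities being fixed at submission time while the user's fairshare keeps changing underneath. A toy Python sketch of that staleness (this is not Slurm's actual multifactor plugin; the function name and all numbers are invented for illustration):

```python
# Toy model only: real Slurm recomputes job priority periodically via the
# multifactor plugin. Here we assume (hypothetically) that each job simply
# keeps the fairshare factor it had at submission time.
def submit_jobs(fairshare_at_submit, n_jobs):
    """Every job carries the submit-time fairshare as its priority."""
    return [fairshare_at_submit] * n_jobs

queue = submit_jobs(fairshare_at_submit=1.0, n_jobs=1000)

# Even after the user's fairshare decays (invented value), the queued
# jobs still hold the stale high priority:
current_fairshare = 0.1
stale = sum(1 for p in queue if p > current_fairshare)
print(stale)  # 1000
```

Under this (assumed) model, a user who banks a high fairshare can flood the queue and keep jumping ahead of users whose shares are computed honestly.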

[slurm-dev] error code in slurm_checkpoint_complete

2015-03-31 Thread Manuel Rodríguez Pascual
Hi all, just a quick question regarding the Slurm API. In the following call (checkpoint.c):
/*
 * slurm_checkpoint_complete - note the completion of a job step's checkpoint operation.
 * IN job_id - job on which to perform operation
 * IN step_id - job step on which to perform operation
 * IN
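Based on the header comment quoted above, the call takes a job id and step id plus the remaining checkpoint-result arguments. A hedged sketch of how one might bind it from Python via ctypes; the prototype below is inferred from that comment, mapping `time_t` to `c_long` is an assumption, and the library path is site-specific:

```python
import ctypes

# Inferred prototype (check against your slurm.h before relying on it):
# int slurm_checkpoint_complete(uint32_t job_id, uint32_t step_id,
#                               time_t begin_time, uint32_t error_code,
#                               char *error_msg);
CHECKPOINT_COMPLETE = ctypes.CFUNCTYPE(
    ctypes.c_int,      # return code, 0 on success
    ctypes.c_uint32,   # job_id - job on which to perform operation
    ctypes.c_uint32,   # step_id - job step on which to perform operation
    ctypes.c_long,     # begin_time (time_t) - assumed to be long here
    ctypes.c_uint32,   # error_code reported for the checkpoint operation
    ctypes.c_char_p,   # error_msg
)

# Binding would look like this (commented out: needs libslurm installed):
# libslurm = ctypes.CDLL("libslurm.so")  # path is site-specific
# fn = CHECKPOINT_COMPLETE(("slurm_checkpoint_complete", libslurm))
# rc = fn(1234, 0, 0, 0, b"")
```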

[slurm-dev] Re: Problems running job

2015-03-31 Thread Jeff Layton
Chris and David, Thanks for the help! I'm still trying to find out why the compute nodes are down or not responding. Any tips on where to start? How about open ports? Right now I have 6817 and 6818 open as per my slurm.conf. I also have 22 and 80 open as well as 111, 2049, and 32806. I'm using

[slurm-dev] Re: Problems running job

2015-03-31 Thread Jeff Layton
Actually I don't have all the ports open :( I can do that though (I thought that might be a problem). Thanks! Jeff Do you have all the ports open between all the compute nodes as well? Since slurm builds a tree to communicate, all the nodes need to talk to every other node on those ports

[slurm-dev] Re: Problems running job

2015-03-31 Thread Paul Edmon
Do you have all the ports open between all the compute nodes as well? Since slurm builds a tree to communicate, all the nodes need to talk to every other node on those ports and do so without a huge amount of latency. You might want to try upping your timeouts. -Paul Edmon- On 03/31/2015
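Since every node needs to reach every other node on the slurmctld/slurmd ports, a quick reachability probe can narrow down which links are blocked. A small Python sketch (the node names in the commented usage are placeholders; 6817/6818 are the ports mentioned earlier in the thread):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical usage across the cluster (node names are placeholders):
# for node in ("node01", "node02"):
#     for port in (6817, 6818):   # slurmctld / slurmd ports from slurm.conf
#         print(node, port, port_open(node, port))
```

Running this from each node against all the others would show whether the tree fan-out has any blocked hops.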

[slurm-dev] Re: Problems running job

2015-03-31 Thread Jeff Layton
That's what I've done. Everything is in NFSv4 except for a few bits: /etc/slurm.conf /etc/init.d/slurm /var/log/slurm /var/run/slurm /var/spool/slurm These bits are local to the node. Will slurm have trouble in this case? Thanks! Jeff The problem mentioned with NFSv4 is keeping your SLURM

[slurm-dev] Re: Problems running job

2015-03-31 Thread Uwe Sauter
Yes! There are problems if the clean-up scripts for cgroups reside on NFSv4. Nodes will lock up when they try to remove a job's cgroup. On 31.03.2015 at 17:06, Jeff Layton wrote: That's what I've done. Everything is in NFSv4 except for a few bits: /etc/slurm.conf /etc/init.d/slurm
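If lock-ups come from cgroup clean-up scripts living on NFS, it helps to verify where a given path actually resides. A best-effort Linux-only sketch; `fs_type` is a hypothetical helper that reads the `/proc/mounts` format, and the sample mount table below is invented:

```python
import os

def fs_type(path, mounts_text=None):
    """Best-effort filesystem type of the mount holding `path` (Linux).

    Reads /proc/mounts unless a mount table in the same format is passed in.
    """
    if mounts_text is None:
        with open("/proc/mounts") as f:
            mounts_text = f.read()
    # normpath keeps this deterministic; real code may prefer realpath.
    real = os.path.normpath(path)
    best, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        mnt, fstype = parts[1], parts[2]
        # Longest mount point that is a prefix of the path wins.
        if (real == mnt or real.startswith(mnt.rstrip("/") + "/")) and len(mnt) >= len(best):
            best, best_type = mnt, fstype
    return best_type

sample = "rootfs / ext4 rw 0 0\nsrv:/export /var/spool/slurm nfs4 rw 0 0\n"
print(fs_type("/var/spool/slurm", sample))  # nfs4
```

A type starting with "nfs" for the directory holding the clean-up scripts would flag the situation Uwe describes.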

[slurm-dev] A way of abuse the priority option in Slurm?

2015-03-31 Thread Moe Jette
Hi Magnus, Unfortunately you found a bug. Here is a patch that will prevent users from making persistent job priority changes. We should probably return an error for this condition, but I would like to defer that change to the next major release, v15.08.

[slurm-dev] Re: Error connecting slurm stream socket at IP:6817: Connection refused

2015-03-31 Thread Jorge Gois
Thanks for the reply. But I think it is not a network problem, because I start this only on the head controller. Can you see my config? I don't have iptables installed or anything like that. Config: ControlMachine=JGSLURMHC ControlAddr=172.16.40.42 #BackupController= #BackupAddr= AuthType=auth/munge
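"Connection refused" on 6817 usually means slurmctld is not listening where clients expect it, so it is worth double-checking what the config actually resolves to. A minimal sketch of pulling Key=Value pairs out of a slurm.conf fragment (`parse_slurm_conf` is a hypothetical helper; the sample reuses the values quoted above):

```python
def parse_slurm_conf(text):
    """Collect Key=Value tokens from slurm.conf-style lines, skipping comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        for token in line.split():
            key, sep, val = token.partition("=")
            if sep:
                conf[key] = val
    return conf

sample = """\
ControlMachine=JGSLURMHC
ControlAddr=172.16.40.42
#BackupController=
AuthType=auth/munge
"""
conf = parse_slurm_conf(sample)
print(conf["ControlAddr"])  # 172.16.40.42
```

Comparing the parsed ControlAddr against where slurmctld is actually bound (e.g. with `ss -ltnp` on the controller) is a quick sanity check.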

[slurm-dev] Re: Problems running job

2015-03-31 Thread Mehdi Denou
Put the slurmd and slurmctld in debug mode and retry the submission. Then provide the logs. On 31/03/2015 16:28, Jeff Layton wrote: Chris and David, Thanks for the help! I'm still trying to find out why the compute nodes are down or not responding. Any tips on where to start? How

[slurm-dev] Re: Problems running job

2015-03-31 Thread Novosielski, Ryan
The problem mentioned with NFSv4 is keeping your SLURM installation on NFS, e.g. exported to but not physically residing on your nodes. -- Ryan