[slurm-dev] NodeName and PartitionName format in slurm.conf

2016-01-19 Thread Andrus, Brian Contractor
All, I am testing our slurm to replace our torque/moab setup here. The issue I have is to try and put all our node names in the NodeName and PartitionName entries. In our cluster, we name our nodes compute-- That seems to be problem enough with the abilities to use ranges in slurm, but it is co

[slurm-dev] Re: NodeName and PartitionName format in slurm.conf

2016-01-20 Thread Andrus, Brian Contractor
PUs, MIC cards, Infiniband, etc). Brian Andrus -----Original Message----- From: Benjamin Redling [mailto:benjamin.ra...@uni-jena.de] Sent: Wednesday, January 20, 2016 2:00 AM To: slurm-dev Subject: [slurm-dev] Re: NodeName and PartitionName format in slurm.conf Am 19.01.2016 um 20:37 schrie

[slurm-dev] Adjust an array job's maximum simultaneous running tasks

2016-01-20 Thread Andrus, Brian Contractor
All, Is there a way to change the maximum simultaneous running tasks of an array job that is currently running? For example I have sbatch --array=1-100%2 and I want to change it to effectively be: sbatch --array=1-100%5 to cause slurm to start running 5 at a time right away. And be able to do
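A note on the syntax in the question above: the number after `%` in an `--array` specification is the cap on simultaneously running tasks. The Python sketch below only makes the two parts explicit and shows how a command line could be assembled. `parse_array_spec` and `throttle_command` are hypothetical helpers, not part of Slurm, and the job ID is a placeholder. The scontrol man page documents an `ArrayTaskThrottle` field for `scontrol update`; whether it is available depends on the installed Slurm version, and the excerpt above does not confirm which answer the thread settled on.

```python
import re


def parse_array_spec(spec):
    """Split 'start-end%limit' into (start, end, limit); limit is None if absent.

    Only the simple form used in the post is handled (no lists or steps).
    """
    m = re.fullmatch(r"(\d+)-(\d+)(?:%(\d+))?", spec)
    if m is None:
        raise ValueError(f"unsupported array spec: {spec!r}")
    start, end, limit = m.groups()
    return int(start), int(end), int(limit) if limit is not None else None


def throttle_command(job_id, limit):
    """Build (but do not run) an scontrol command line; job_id is a placeholder."""
    return ["scontrol", "update", f"JobId={job_id}", f"ArrayTaskThrottle={limit}"]


print(parse_array_spec("1-100%2"))
print(" ".join(throttle_command(12345, 5)))
```

The command is only printed here, so the sketch can be tried on a machine without Slurm installed.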

[slurm-dev] Re: Adjust an array job's maximum simultaneous running tasks

2016-01-21 Thread Andrus, Brian Contractor
ional Corporation<http://www.decisionsciencescorp.com/> On Wed, Jan 20, 2016 at 6:49 PM, Andrus, Brian Contractor <bdand...@nps.edu> wrote: All, Is there a way to change the maximum simultaneous running tasks of an array job that is currently running? For example I have sbatch --a

[slurm-dev] Update job and partition for shared jobs

2016-01-26 Thread Andrus, Brian Contractor
All, I am in the process of transitioning from Torque to Slurm. So far it is doing very well, especially handling arrays. Now I have one array job that is running across several nodes, but only using some of the node resources. I would like to have slurm start sharing the nodes so some of the a

[slurm-dev] Re: Update job and partition for shared jobs

2016-01-26 Thread Andrus, Brian Contractor
ohn DeSantis 2016-01-26 15:20 GMT-05:00 Andrus, Brian Contractor <bdand...@nps.edu>: All, I am in the process of transitioning from Torque to Slurm. So far it is doing very well, especially handling arrays. Now I have one array job that is running across several nodes, but only using

[slurm-dev] distribution for array jobs

2016-01-27 Thread Andrus, Brian Contractor
016-01-26 20:05 GMT-05:00 Andrus, Brian Contractor <bdand...@nps.edu>: John, Thanks. That seemed to help; a job started on a node that had a job on it once the job that had been on it (‘using’ all the memory) completed. But now all my jobs won’t start and have a status of ‘JobHoldMaxRe

[slurm-dev] Re: distribution for array jobs

2016-01-28 Thread Andrus, Brian Contractor
there are some other creative things you can do with them. Ryan On 01/27/2016 06:47 PM, Andrus, Brian Contractor wrote: I ended up just doing ‘scancel’ on all the jobs and resubmitting them. I seem to be making progress. Now I am having trouble figuring out the --distribution option. I want to ha

[slurm-dev] List resources used/available

2016-02-04 Thread Andrus, Brian Contractor
All, I am trying to find a way to see what resources are used/remaining on a per node basis. In particular memory and sockets/cpus/cores/threads Not seeing anything in the sinfo or scontrol man pages that show specifically that.. Any insight is appreciated. Brian Andrus

[slurm-dev] Fully utilizing nodes

2016-08-09 Thread Andrus, Brian Contractor
All, I am trying to figure out the bits required to allow users to use part of a node and not block others from using remaining resources. It looks like the "OverSubscribe" option is what I need, but that doesn't seem to quite be all of it. I would like users to be able to request --exclusive

[slurm-dev] Re: Fully utilizing nodes

2016-08-09 Thread Andrus, Brian Contractor
ug 9, 2016 at 11:06 AM, Andrus, Brian Contractor <bdand...@nps.edu> wrote: All, I am trying to figure out the bits required to allow users to use part of a node and not block others from using remaining resources. It looks like the “OverSubscribe” option is what I need, but that doe

[slurm-dev] Re: Fully utilizing nodes

2016-08-15 Thread Andrus, Brian Contractor
Ok, I am still having trouble here and am not sure where to look. Slurm is configured with:
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY,CR_ONE_TASK_PER_CORE
I have a node which has 64 cores:
NodeName=compute-2-1 Arch=x86_64 CoresPerSoc

[slurm-dev] Re: Fully utilizing nodes

2016-08-16 Thread Andrus, Brian Contractor
nodes Hi Brian, Looks like your default memory allocation for jobs is 258307 MB, which is just how much memory you have on the node. Try to request less memory with --mem. Best wishes, Marius 16. aug. 2016 kl. 01.44 skrev Andrus, Brian Contractor <bdand...@nps.edu>: NodeName=com
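The arithmetic behind Marius's reply can be sketched in a few lines of Python. This is a simplified model, not Slurm's actual selection logic: it assumes identical jobs and that both cores and memory are consumable resources, as with the CR_CORE_MEMORY setting quoted earlier in the thread. The 64-core count and the 258307 MB figure come from the posts; `jobs_that_fit` and the 4000 MB request are illustrative assumptions.

```python
def jobs_that_fit(node_mem_mb, node_cores, job_mem_mb, job_cores=1):
    """Upper bound on identical jobs per node when cores and memory are both consumed."""
    return min(node_mem_mb // job_mem_mb, node_cores // job_cores)


NODE_MEM_MB = 258307  # memory on the node, per the reply
NODE_CORES = 64       # core count, per the earlier post

# Default allocation equal to the whole node's memory: one job fills the node.
print(jobs_that_fit(NODE_MEM_MB, NODE_CORES, 258307))  # 1

# Hypothetical request of --mem=4000: memory no longer limits the node.
print(jobs_that_fit(NODE_MEM_MB, NODE_CORES, 4000))    # 64
```

With the default, the first single-core job reserves all of the node's memory, so the remaining 63 cores stay idle until it finishes.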

[slurm-dev] PIDfile on CentOS7 and compute nodes

2016-11-25 Thread Andrus, Brian Contractor
All, I have been having an issue where if I try to run the slurm daemon under systemd, it hangs for some time and then errors out with:
systemd[1]: Starting LSB: slurm daemon management...
systemd[1]: PID file /var/run/slurmctld.pid not readable (yet?) after start.
systemd[1]: slurm.service: cont

[slurm-dev] squeue returns "invalid user" for a user that has jobs running

2016-11-25 Thread Andrus, Brian Contractor
All, Don't quite get this:
# squeue|head
JOBID               PARTITION  NAME      USER      ST  TIME        NODES  NODELIST(REASON)
751071_17703        primary    PARAMEIG  clwalton  CG  3-00:00:19  1      compute-3-87
751071_[36752-6220  primary    PARAMEIG  clwalton  PD  0:00        1      (Resources)

[slurm-dev] Re: PIDfile on CentOS7 and compute nodes

2016-11-28 Thread Andrus, Brian Contractor
iki https://wiki.fysik.dtu.dk/niflheim/SLURM /Ole On 11/25/2016 05:04 PM, Andrus, Brian Contractor wrote: > All, > > I have been having an issue where if I try to run the slurm daemon > under systemd, it hangs for some time and then errors out with: > > > > syst

[slurm-dev] Re: squeue returns "invalid user" for a user that has jobs running

2016-11-28 Thread Andrus, Brian Contractor
: [slurm-dev] Re: squeue returns "invalid user" for a user that has jobs running Hi Brian, Is the actual username longer than 8 characters? The default squeue format includes "%.8u" for the username. Paddy On Fri, Nov 25, 2016 at 08:26:36PM -0800, Andrus, Brian Cont

[slurm-dev] Re: squeue returns "invalid user" for a user that has jobs running

2016-11-28 Thread Andrus, Brian Contractor
"invalid user" for a user that has jobs running Hi, Is the user defined on all the compute nodes? Does it have the same UID on all the hosts? Regards, Carlos On Mon, Nov 28, 2016 at 6:54 PM, Andrus, Brian Contractor <bdand...@nps.edu> wrote: Paddy, Nope, it is exac

[slurm-dev] Re: squeue returns "invalid user" for a user that has jobs running

2016-11-28 Thread Andrus, Brian Contractor
I take that back. It was indeed the issue. User name is clwalton1... Doh! Thanks for pointing me in the right direction. Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 -----Original Message----- From: Andrus, Brian Contractor
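The resolution of this thread can be illustrated with a small Python sketch. It is a simplified model of a right-justified, fixed-width field such as the "%.8u" Paddy mentions, not squeue's actual formatting code, and `squeue_user_field` is a hypothetical name. It shows why the listing displayed "clwalton" when the real account was "clwalton1": the ninth character is cut off, and looking up the truncated name then fails.

```python
def squeue_user_field(username, width=8):
    """Mimic a right-justified, fixed-width field: truncate, then pad on the left."""
    return username[:width].rjust(width)


print(repr(squeue_user_field("clwalton1")))  # 'clwalton'
print(repr(squeue_user_field("bob")))        # '     bob'
```

Widening the user field in a custom squeue format string is the usual way to see the full name.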

[slurm-dev] RE: Restrict access for a user group to certain nodes

2016-12-01 Thread Andrus, Brian Contractor
The way we did that was to put the nodes in their own partition which is only accessible by that group. PartitionName=beardq Nodes=compute-8-[1,5,9,13,17] AllowGroups=beards DefaultTime=01:00:00 MaxTime=INFINITE State=UP So here is a partition "beardq" which is only available to folks in the gr
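For readers unfamiliar with the bracket notation in the Nodes= value, the following Python sketch expands an expression such as compute-8-[1,5,9,13,17] into individual host names. It is an illustrative approximation under stated assumptions: `expand_hostlist` is a hypothetical helper that handles a single bracket group with comma-separated numbers and simple ranges, and it does not preserve zero padding. On a system with Slurm installed, `scontrol show hostnames` performs the real expansion.

```python
import re


def expand_hostlist(expr):
    """Expand 'prefix[a,b-c]suffix' into a list of host names (single bracket group only)."""
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if m is None:
        return [expr]  # no bracket group: treat as a single host name
    prefix, body, suffix = m.groups()
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # a bare number is a range of one
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{n}{suffix}")
    return hosts


print(expand_hostlist("compute-8-[1,5,9,13,17]"))
```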

[slurm-dev] Backup controller not responding to requests

2017-01-30 Thread Andrus, Brian Contractor
All, I have configured a backup slurmctld system and it appears to work at first, but not in practice. In particular, when I start it, it says it is running in background mode: [2017-01-25T14:23:37.648] slurmctld version 16.05.6 started on cluster hamming [2017-01-25T14:23:37.650] slurmctld runni

[slurm-dev] Re: Backup controller not responding to requests

2017-01-30 Thread Andrus, Brian Contractor
On Mon, Jan 30, 2017 at 08:21:59AM -0800, Andrus, Brian Contractor wrote: > All, > > I have configured a backup slurmctld system and it appears to work at first, > but not in practice. > In particular, when I start it, it says it is running in background mode: > [2017-01-25T1

[slurm-dev] Re: Backup controller not responding to requests

2017-01-30 Thread Andrus, Brian Contractor
: slurm-dev Subject: [slurm-dev] Re: Backup controller not responding to requests Does it work if you use "scontrol takeover" to shut down the primary controller and switch immediately to the backup controller? 2017-01-30 19:41 GMT+01:00 Andrus, Brian Contractor: > Paddy, > >

[slurm-dev] Re: Backup controller not responding to requests

2017-01-31 Thread Andrus, Brian Contractor
controller not responding to requests What is the output of scontrol show config | grep SlurmctldTimeout ? 2017-01-31 6:57 GMT+01:00 Andrus, Brian Contractor: > Yes, if I do scontrol takeover, it successfully goes to the backup. > > > Brian Andrus > ITACS/Research Computing >