All,
I am testing slurm to replace our torque/moab setup here.
The issue I have is trying to put all our node names into the NodeName and
PartitionName entries.
In our cluster, we name our nodes compute-<rack>-<node>.
That seems to be problem enough with the ability to use ranges in slurm, but
it is compounded by our mix of hardware (GPUs, MIC cards, Infiniband, etc).
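For what it's worth, Slurm's hostlist syntax allows up to two bracketed
numeric ranges in one name, which fits a rack-node scheme. A minimal
slurm.conf sketch, assuming hypothetical racks 1-4 with 32 nodes each and
made-up hardware sizes:

    NodeName=compute-[1-4]-[1-32] CPUs=16 RealMemory=64000 State=UNKNOWN
    PartitionName=primary Nodes=compute-[1-4]-[1-32] Default=YES MaxTime=INFINITE State=UP

Heterogeneous hardware (GPU vs. MIC nodes) can simply be split across several
NodeName lines, each with its own Gres= settings.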
Brian Andrus
-----Original Message-----
From: Benjamin Redling [mailto:benjamin.ra...@uni-jena.de]
Sent: Wednesday, January 20, 2016 2:00 AM
To: slurm-dev
Subject: [slurm-dev] Re: NodeName and PartitionName format in slurm.conf
On 19.01.2016 at 20:37, Andrus, Brian Contractor wrote:
All,
Is there a way to change the maximum simultaneous running tasks of an array job
that is currently running?
For example I have
sbatch --array=1-100%2
and I want to change it to effectively be:
sbatch --array=1-100%5
to cause slurm to start running 5 at a time right away.
And be able to do
Decision Sciences International Corporation <http://www.decisionsciencescorp.com/>
On Wed, Jan 20, 2016 at 6:49 PM, Andrus, Brian Contractor
<bdand...@nps.edu> wrote:
All,
Is there a way to change the maximum simultaneous running tasks of an array job
that is currently running?
For example I have
sbatch --array=1-100%2
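The throttle is exposed on the job record as ArrayTaskThrottle, so it should
be adjustable on a live job with scontrol; a sketch with a made-up job id:

    scontrol update JobId=12345 ArrayTaskThrottle=5

Setting it to 0 should remove the limit entirely.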
All,
I am in the process of transitioning from Torque to Slurm.
So far it is doing very well, especially handling arrays.
Now I have one array job that is running across several nodes, but only using
some of the node resources. I would like to have slurm start sharing the nodes
so some of the a
John DeSantis
2016-01-26 15:20 GMT-05:00 Andrus, Brian Contractor
<bdand...@nps.edu>:
All,
I am in the process of transitioning from Torque to Slurm.
So far it is doing very well, especially handling arrays.
Now I have one array job that is running across several nodes, but only using
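Judging from the rest of the thread, the blocker was memory: under
CR_Core_Memory a job that requests no memory can be handed a node's entire
RealMemory, so nothing else fits. Hedged knobs, with illustrative values:

    # slurm.conf: default per-CPU memory for jobs that don't ask
    DefMemPerCPU=2048
    # job script side: request only what each array task needs
    sbatch --array=1-100 --ntasks=1 --mem=2048 job.sh

Here job.sh and the sizes are made up; the point is that each task must carry
an explicit (or default) memory request before nodes can be shared.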
2016-01-26 20:05 GMT-05:00 Andrus, Brian Contractor
<bdand...@nps.edu>:
John,
Thanks. That seemed to help; a job started on a node once the job that had
been on it (‘using’ all the memory) completed.
But now all my jobs won’t start and have a status of ‘JobHoldMaxRequeue’.
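For the archives: ‘JobHoldMaxRequeue’ means a job was requeued up to the
limit and then held. Once the underlying failure is fixed, held jobs can be
released; a sketch with a made-up job id:

    scontrol release 12345
    # or sweep all of your own pending jobs
    squeue -u $USER -t PD -h -o %i | xargs -r -n1 scontrol release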
there are some other creative things you can do with them.
Ryan
On 01/27/2016 06:47 PM, Andrus, Brian Contractor wrote:
I ended up just doing ‘scancel’ on all the jobs and resubmitting them.
I seem to be making progress.
Now I am having trouble figuring out the --distribution option.
I want to ha
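On --distribution: it controls how tasks are mapped to nodes (and, with
further qualifiers, to sockets and cores). Two hedged examples of the common
values, with ./app standing in for the real executable:

    srun --ntasks=8 --distribution=cyclic ./app   # round-robin tasks across nodes
    srun --ntasks=8 --distribution=block ./app    # fill each node before the next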
All,
I am trying to find a way to see what resources are used/remaining on a per
node basis. In particular memory and sockets/cpus/cores/threads
Not seeing anything in the sinfo or scontrol man pages that show specifically
that..
Any insight is appreciated.
Brian Andrus
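Two hedged ways to get close to this with stock tools:

    # per node: allocated/idle/other/total CPUs (%C) and free vs. total memory
    sinfo -N -o "%N %C %e %m"
    # full detail for one node, including CPUAlloc and memory fields
    scontrol show node compute-2-1

The exact memory-accounting fields shown depend on the Slurm version.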
All,
I am trying to figure out the bits required to allow users to use part of a
node and not block others from using remaining resources.
It looks like the "OverSubscribe" option is what I need, but that doesn't seem
to quite be all of it.
I would like users to be able to request --exclusive
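For what it's worth, with select/cons_res node sharing between jobs happens
at the core level even without OverSubscribe; OverSubscribe only governs
whether the same cores may be oversubscribed. A sketch with illustrative
names and sizes:

    # slurm.conf
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    PartitionName=primary Nodes=compute-[1-4]-[1-32] OverSubscribe=NO State=UP

    # part of a node:
    sbatch --ntasks=4 --mem=8G job.sh
    # whole node on request:
    sbatch --exclusive job.sh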
On Aug 9, 2016 at 11:06 AM, Andrus, Brian Contractor
<bdand...@nps.edu> wrote:
All,
I am trying to figure out the bits required to allow users to use part of a
node and not block others from using remaining resources.
It looks like the “OverSubscribe” option is what I need, but that doe
Ok, I am still having trouble here and am not sure where to look.
Slurm is configured with:
SelectType           = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY,CR_ONE_TASK_PER_CORE
I have a node which has 64 cores:
NodeName=compute-2-1 Arch=x86_64 CoresPerSoc
Hi Brian,
Looks like your default memory allocation for jobs is 258307 MB, which is just
how much memory you have on the node. Try to request less memory with --mem.
Best wishes,
Marius
On 16 Aug 2016 at 01:44, Andrus, Brian Contractor
<bdand...@nps.edu> wrote:
NodeName=com
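Concretely, that would be something like (size illustrative):

    sbatch --mem=4G job.sh

or a cluster-wide DefMemPerCPU in slurm.conf so unqualified jobs stop
inheriting the whole node's memory.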
All,
I have been having an issue where if I try to run the slurm daemon under
systemd, it hangs for some time and then errors out with:
systemd[1]: Starting LSB: slurm daemon management...
systemd[1]: PID file /var/run/slurmctld.pid not readable (yet?) after start.
systemd[1]: slurm.service: cont
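That "LSB: slurm daemon management" line says systemd is running the legacy
init script. A hedged check, assuming the common config paths: compare the
PID file the script advertises to systemd with the one slurmctld actually
writes, and consider the native unit file shipped with Slurm instead.

    grep -i SlurmctldPidFile /etc/slurm/slurm.conf
    grep -i pid /etc/init.d/slurm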
All,
Don't quite get this:
# squeue | head
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      751071_17703   primary PARAMEIG clwalton CG 3-00:00:19      1 compute-3-87
751071_[36752-6220   primary PARAMEIG clwalton PD       0:00      1 (Resources)
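If the puzzle is the cut-off array job id, that is just squeue's default
JOBID field width; a wider format string shows the whole thing:

    squeue -o "%.24i %.9P %.8j %.8u %.2t %.10M %.6D %R" | head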
See our wiki:
https://wiki.fysik.dtu.dk/niflheim/SLURM
/Ole
On 11/25/2016 05:04 PM, Andrus, Brian Contractor wrote:
> All,
>
> I have been having an issue where if I try to run the slurm daemon
> under systemd, it hangs for some time and then errors out with:
>
>
>
> syst
Subject: [slurm-dev] Re: squeue returns "invalid user" for a user that has jobs running
Hi Brian,
Is the actual username longer than 8 characters? The default squeue format
includes "%.8u" for the username.
Paddy
On Fri, Nov 25, 2016 at 08:26:36PM -0800, Andrus, Brian Contractor wrote:
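A hedged illustration of widening that field (it can also be made the default
via the SQUEUE_FORMAT environment variable):

    squeue -o "%.18i %.9P %.8j %.12u %.2t %.10M %.6D %R"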
"invalid user" for a user that has jobs
running
Hi,
Is the user defined on all the compute nodes? Does it have the same UID on
all the hosts?
Regards,
Carlos
On Mon, Nov 28, 2016 at 6:54 PM, Andrus, Brian Contractor
<bdand...@nps.edu> wrote:
Paddy,
Nope, it is exac
I take that back. It was indeed the issue. User name is clwalton1...
Doh!
Thanks for pointing me in the right direction.
Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238
-----Original Message-----
From: Andrus, Brian Contractor
The way we did that was to put the nodes in their own partition which is only
accessible by that group.
PartitionName=beardq Nodes=compute-8-[1,5,9,13,17] AllowGroups=beards DefaultTime=01:00:00 MaxTime=INFINITE State=UP
So here is a partition "beardq" which is only available to folks in the group "beards".
All,
I have configured a backup slurmctld system and it appears to work at first,
but not in practice.
In particular, when I start it, it says it is running in background mode:
[2017-01-25T14:23:37.648] slurmctld version 16.05.6 started on cluster hamming
[2017-01-25T14:23:37.650] slurmctld running in background mode
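For context, the 16.05-era failover knobs in slurm.conf (hostnames and paths
illustrative); the backup only takes over after SlurmctldTimeout expires, and
both controllers must see the same StateSaveLocation:

    ControlMachine=head1
    BackupController=head2
    StateSaveLocation=/shared/slurm/state
    SlurmctldTimeout=120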
On Mon, Jan 30, 2017 at 08:21:59AM -0800, Andrus, Brian Contractor wrote:
> All,
>
> I have configured a backup slurmctld system and it appears to work at first,
> but not in practice.
> In particular, when I start it, it says it is running in background mode:
> [2017-01-25T1
To: slurm-dev
Subject: [slurm-dev] Re: Backup controller not responding to requests
Does it work if you use "scontrol takeover" to shut down the primary controller
and switch immediately to the backup controller?
2017-01-30 19:41 GMT+01:00 Andrus, Brian Contractor:
> Paddy,
>
>
Subject: [slurm-dev] Re: Backup controller not responding to requests
What is the output of
scontrol show config | grep SlurmctldTimeout
?
2017-01-31 6:57 GMT+01:00 Andrus, Brian Contractor:
> Yes, if I do scontrol takeover, it successfully goes to the backup.
>
>
> Brian Andrus
> ITACS/Research Computing
>