[slurm-dev] options for relocating the primary slurm controller, slurmctld?

2015-10-26 Thread Chris Harwell
Hi,

I have some new hardware I am hoping to move slurmctld to, and I am
wondering whether I need to haul along the IP address or whether there is
some other sneaky method.

Has anyone done this recently and been able to keep jobs from going into CG
state?

I was testing this out on a dev cluster: I started some jobs, shut down
slurmctld/slurmd, updated the config file, and started things back up.  I then
noticed squeue showing jobs in CG state, and upon logging in to the
corresponding compute nodes I found slurmstepd processes still in the
process list. An strace on them showed they were trying to contact the old
controller to deliver the job-complete RPC.  Maybe I messed up the order of
operations, or maybe I really should haul along the IP.  Would
slurmstepd fall back to the backup controller in the slurm.conf file it
read when starting that job (if I had one defined)? E.g., could I define the
new IP as the backup, roll that out, wait until all jobs that were running
prior to the change have completed, and then swap the primary and
backup controllers in the slurm.conf file?
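
For illustration, that two-step swap might look like this in slurm.conf
(the hostnames and addresses below are placeholders, not taken from the post,
and both hosts would need access to the StateSaveLocation):

# Step 1: add the new machine as the backup controller and push slurm.conf out
ControlMachine=oldctl
ControlAddr=192.168.1.10
BackupController=newctl
BackupAddr=192.168.1.20

# Step 2: once every job started under the old config has finished,
# swap the roles (or drop the old host) and push slurm.conf out again
ControlMachine=newctl
ControlAddr=192.168.1.20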


Is this up to date?
http://slurm.schedmd.com/faq.html#controller

*2. How should I relocate the primary or backup controller?*
If the cluster's computers used for the primary or backup controller will
be out of service for an extended period of time, it may be desirable to
relocate them. In order to do so, follow this procedure:

   1. Stop all Slurm daemons
   2. Modify the *ControlMachine*, *ControlAddr*, *BackupController*,
   and/or *BackupAddr* in the *slurm.conf* file
   3. Distribute the updated *slurm.conf* file to all nodes
   4. Copy the *StateSaveLocation* directory to the new host and make sure
   the permissions allow the *SlurmUser* to read and write it.
   5. Restart all Slurm daemons

There should be no loss of any running or pending jobs. Ensure that any
nodes added to the cluster have a current *slurm.conf* file installed.
*CAUTION:* If two nodes are simultaneously configured as the primary
controller (two nodes on which *ControlMachine* specifies the local host and
on which the *slurmctld* daemon is executing), system behavior will be
destructive. If a compute node has an incorrect *ControlMachine* or
*BackupController* parameter, that node may be rendered unusable, but no
other harm will result.
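
A minimal shell sketch of that FAQ procedure for a small cluster (the
hostnames, paths, and the systemctl/pdsh commands are assumptions, not part of
the FAQ; adapt to your init system and StateSaveLocation):

# 1. Stop all Slurm daemons
pdsh -w oldctl,newctl systemctl stop slurmctld
pdsh -w compute[01-10] systemctl stop slurmd

# 2-3. Edit ControlMachine/ControlAddr in slurm.conf, then push it everywhere
pdcp -w oldctl,newctl,compute[01-10] slurm.conf /etc/slurm/slurm.conf

# 4. Copy the StateSaveLocation to the new host and fix ownership for SlurmUser
ssh oldctl rsync -a /var/spool/slurmctld/ newctl:/var/spool/slurmctld/
ssh newctl chown -R slurm:slurm /var/spool/slurmctld

# 5. Restart the daemons, controller first
ssh newctl systemctl start slurmctld
pdsh -w compute[01-10] systemctl start slurmd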


[slurm-dev] login node configuration?

2015-10-26 Thread Liam Forbes

Hello,

I’m in the process of setting up SLURM 15.08.x for the first time. I’ve got a 
head node and ten compute nodes working fine for serial and parallel jobs. I’m 
struggling to figure out how to configure a pair of login nodes so they can 
get status and submit jobs. I’ve been searching through the documentation 
and googling, but I’ve not found a reference beyond brief mentions of login 
nodes or submit hosts.

Our pair of login nodes share an ethernet network with the head node. They do 
not, currently, share a network, ethernet or IB, with the compute nodes. When I 
execute SLURM commands like sinfo or squeue, strace shows them attempting to 
communicate with port 6817 on the head node. However, the head node doesn’t 
respond, and lsof on the head node indicates slurmctld is only listening on the 
private network shared with the compute nodes.

Is there a reference for configuring login nodes to be able to query slurmctld 
and submit jobs? If not, would somebody with a similar configuration be willing 
to share some guidance?

Do the login nodes need to be on the same network as the compute nodes, or at 
least the same network as slurmctld is listening on?

Apologies if these questions have been answered somewhere else; I just haven’t 
found that documentation.

Regards,
-liam

-There are uncountably more irrational fears than rational ones. -P. Dolan
Liam Forbes lofor...@alaska.edu ph: 907-450-8618 fax: 907-450-8601
UAF Research Computing Systems Senior HPC Engineer, LPIC1, CISSP


[slurm-dev] Re: login node configuration?

2015-10-26 Thread Paul Edmon


For clarity, the login nodes should not need to talk to the compute nodes 
unless you intend to do interactive work.  They only need to talk to the 
master to submit jobs.


-Paul Edmon-


[slurm-dev] Re: login node configuration?

2015-10-26 Thread Paul Edmon


What we did was just open up port 6817 between the two 
VLANs.  As long as the traffic is routable and both sides see the same 
slurm.conf, that should work.  All the login node needs is slurm.conf 
and the slurm, slurm-munge, and slurm-plugins rpms.  You don't need to 
run the slurm service to submit, since all that is required is for the 
login node to be able to talk to the master.


-Paul Edmon-
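
A minimal sketch of that setup for a RHEL-style login node (the package names,
the firewalld command, and the hostname "head" are assumptions; 6817 is the
default SlurmctldPort):

# on the login node: config and client packages only, no slurmd needed
yum install slurm slurm-munge slurm-plugins
scp head:/etc/slurm/slurm.conf /etc/slurm/slurm.conf
scp head:/etc/munge/munge.key /etc/munge/munge.key
systemctl start munge

# between the VLANs (or on the head node's firewall): allow the slurmctld port
firewall-cmd --permanent --add-port=6817/tcp
firewall-cmd --reload

# quick test from the login node; slurmctld must be listening on an address
# that is reachable from the login VLAN
sinfo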


[slurm-dev] Re: Slurm, job preemption, and swap.

2015-10-26 Thread Daniel Letai


It would be easy if there were a way to force a TRES
allocation/reconfiguration, e.g. add the swap as GRES/swap and, on suspend,
transfer the allocation from TRES=mem=64,GRES/swap=0 to
TRES=mem=0,GRES/swap=64. Then you could start the new job, which requires
available mem.


Would it be possible to add such a mechanism to scontrol update?
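
For context, defining swap as a generic resource today would look roughly like
the sketch below (placeholder node names and counts; the suspend-time transfer
between mem and GRES/swap being asked about does not exist yet):

# slurm.conf
GresTypes=swap
NodeName=node[01-10] CPUs=32 RealMemory=64000 Gres=swap:64

# gres.conf on each node (count-only GRES, no device file)
Name=swap Count=64

# jobs could then request it explicitly
sbatch --gres=swap:8 job.sh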

On 10/23/2015 03:14 AM, Bill Broadley wrote:

I've been using the example documented at:
   http://slurm.schedmd.com/preempt.html

Specifically, the "Excerpt from slurm.conf":
PartitionName=low  Nodes=linux Default=YES Shared=NO      Priority=10 PreemptMode=requeue
PartitionName=med  Nodes=linux Default=NO  Shared=FORCE:1 Priority=20 PreemptMode=suspend
PartitionName=high Nodes=linux Default=NO  Shared=FORCE:1 Priority=30 PreemptMode=off

All my compute nodes have at least as much swap as ram.  This works quite well:
high priority jobs can suspend medium priority jobs, and if there's memory
pressure on the node the suspended jobs can be pushed to swap.  I enforce the
memory limits, so jobs using more ram than they ask for get killed.  With the
slurm 2.6.5 to 14.11 upgrade, slurm added the ability to manage memory limits
as well as CPUs.

So I started adding GrpMemory to users, so that if they purchase 4 nodes they
can allocate a total of 4 nodes' worth of CPUs or 4 nodes' worth of memory in
the high priority queue.  I have entries like:
User-'test':Partition='high':DefaultAccount='testgrp':GrpCPUs=128:GrpMemory=256000

I also set DefMemPerCPU=2000, so that users who do not ask for a specific
memory allocation get 2GB per CPU.  My nodes have 64GB ram and 32 CPUs.  This
works quite well, but it broke preemption.

Now if I'm running 32 2GB jobs in the medium queue, no high priority jobs can
run because all ram is allocated.  That seems quite weird to me: if a job is
SIGSTOP'd to suspend it, any memory pressure should force the suspended pages
into swap.  Given that the suspended job isn't running, that shouldn't cause
too much I/O, since each page is written just once with no churning.

Is there any way to get slurm to not count suspended jobs' memory allocations
towards the node's memory-used total?

Any suggestions on how to get the old behavior back, where high priority jobs
could still preempt (suspend) lower priority jobs?
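
For reference, the memory-related pieces of the setup described above would
look roughly like this in slurm.conf (DefMemPerCPU and the node size come from
the message; the SelectType, PreemptType, and NodeName lines are assumptions
about a typical configuration of this kind):

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory     # memory is a consumable resource
DefMemPerCPU=2000                      # 2GB per CPU when no --mem is requested
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG               # suspend preemption needs gang scheduling
NodeName=linux[1-32] CPUs=32 RealMemory=64000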


[slurm-dev] Re: salloc timeout question

2015-10-26 Thread Deric

Kyle Stern writes:

> When I set a time limit for my salloc, why do I still have to exit from it
> once the allocation is revoked? For example:
>
>   > salloc -t1
>   salloc: Granted job allocation 112645
>   salloc: Job 112645 has exceeded its time limit and its allocation has
>   been revoked.
>   ^C
>   > echo $SLURM_JOBID
>   112645
>   > exit
>   > echo $SLURM_JOBID
>
> I'm using v14.11.8.

Hi Kyle,
Maybe the following can help answer your question.  This is just my guess; I
haven't tested it.
From the salloc man page, --time (-t) option:
"When the time limit is reached, each task in each job step is sent
SIGTERM followed by SIGKILL."
Notice that it says "each task in each job step".  The shell that gets
created as a result of your salloc command is not a task in a job step.

If you used the method described in the FAQ to launch your shell then I
would guess the salloc shell would also be terminated.  See:
http://slurm.schedmd.com/faq.html#salloc_default_command
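
Assuming that FAQ entry refers to the SallocDefaultCommand option, the
controller-side setting would look something like this (the exact srun flags
vary by site and Slurm version; treat this as a sketch):

# slurm.conf
SallocDefaultCommand="srun -n1 -N1 --pty --preserve-env --mpi=none $SHELL"

With the shell launched as a job step this way, it should receive the
SIGTERM/SIGKILL described in the man page when the time limit expires.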

The man page for scancel might also help explain the difference between
terminating a job step and terminating the batch command, especially the
"--batch" and "--signal" options and the whole "ARGUMENTS" section.

Deric


[slurm-dev] Re: Slurm, job preemption, and swap.

2015-10-26 Thread Trey Dockendorf
We ran into this swap issue when using SelectTypeParameters=CR_CPU_Memory.
We are still on 14.03.10 and have a very ugly hack that adds a
SchedulerParameters option, "assume_swap", which basically forces SLURM to
ignore the memory allocations of swapped jobs.  The patch was very rushed, so
we likely just ended up making SLURM behave as if it were configured with
CR_CPU instead of CR_CPU_Memory.  When we upgrade to 15.08.x we will use
CR_CPU without our patch, since we define MaxMemPerCPU on all partitions.  So
far in testing, CR_CPU with MaxMemPerCPU results in behavior where a 64GB
node can have 64GB worth of suspended jobs and still run 64GB worth of
active jobs.  If a user requests 1 CPU and 64GB with MaxMemPerCPU=2000,
they end up with 32 CPUs, which we use for QOS resource limits and
accounting.
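
A sketch of the 15.08-style configuration described above (node and partition
names are placeholders):

# slurm.conf: CPUs are the only consumable resource; memory is capped per CPU
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
# on every partition; a 1-CPU, 64GB request at 2000 MB/CPU is scaled up to 32 CPUs
PartitionName=high Nodes=node[01-32] MaxMemPerCPU=2000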

Attached are the patches.  They likely only work on 14.03.x releases.  I
wouldn't recommend using the patches, but they may give an idea of how to
implement a proper solution that is worthy of being submitted for inclusion
in SLURM.

- Trey

=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu


0001-select_cons_res.c.patch
Description: Binary data


0002-add-assume_swap-config-option.patch
Description: Binary data


[slurm-dev] job_submit/lua plugin: Fix DefaultQOS lookup

2015-10-26 Thread Trey Dockendorf
Attached is a patch to resolve an issue where a job_submit.lua script that uses
the default QOS lookup will segfault when the submitting user does not exist in
the accounting database (which is intended to prevent job submission).  I was
unsure whether checking 'accounting_enforce' was the correct method, or whether
just checking if assoc_ptr was NULL would be enough.

Thanks,
- Trey

=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu


job_submit-default_qos-fix-15.08.patch
Description: Binary data