[slurm-dev] Multinode setup trouble

2017-05-16 Thread Ben Mann
Hello Slurm dev,

I just set up a small test cluster on two Ubuntu 14.04 machines and installed
SLURM 17.02 from source. I started slurmctld, slurmdbd and slurmd on the
master and just slurmd on a slave. When I run a job on two nodes, it
completes instantly on the master, but never on the slave.

Here are my .conf files, which are on a NAS and symlinked from
/usr/local/etc/, as well as log files for the srun commands below:
https://gist.github.com/8enmann/0637ee2cbb6e6f5aaedef6b3c3f24a1d

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  2   idle [91-92]

$ srun -l hostname
0: 91.cirrascale.sci.openai.org

$ srun -l -N2 hostname
0: 91.cirrascale.sci.openai.org
$ srun -N2 -l hostname
0: 91.cirrascale.sci.openai.org
srun: error: timeout waiting for task launch, started 1 of 2 tasks
srun: Job step 36.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

$ squeue
 JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
36 debug hostname  ben  R   8:42  2 [91-92]
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  2  alloc [91-92]

I'm guessing I misconfigured something, but I don't see anything in the
logs suggesting what it might be. I've also tried cranking up verbosity and
didn't see anything. I know it's not recommended to run everything as root,
but doesn't at least slurmd need root to manage cgroups?
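
For what it's worth, my understanding is that only slurmd has to run as root,
while slurmctld and slurmdbd run as an unprivileged SlurmUser; here's a minimal
sketch of that split, with placeholder values:

SlurmUser=slurm                  # slurmctld and slurmdbd run as this account
ProctrackType=proctrack/cgroup   # slurmd stays root so it can manage cgroups
TaskPlugin=task/cgroup

$ sudo slurmd -D -vvv            # slurmd itself is still started as root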

Thanks in advance!!
Ben


[slurm-dev] Re: KNL node down after reboot

2017-05-16 Thread Costin Caramarcu
Hi,

A few suggestions:
   1) Try increasing the timeouts:
SlurmctldTimeout=600
SlurmdTimeout=600
ResumeTimeout=600
   2) Make sure that by the time slurmd starts, the node has finished mounting its
file systems and the rest of the boot procedure is complete (see the sketch below).
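
For example, on a systemd-based node one way to enforce that ordering is a
drop-in for the slurmd unit; a minimal sketch, assuming the file systems are
covered by remote-fs.target (the drop-in path is the one "systemctl edit slurmd"
creates):

# /etc/systemd/system/slurmd.service.d/override.conf
[Unit]
After=network-online.target remote-fs.target
Wants=network-online.target

$ systemctl daemon-reload    # reload units so the drop-in takes effect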

Regards,
Costin

On Tue, May 16, 2017 at 10:40 AM, Ryan Novosielski 
wrote:

>
> SLURM has worked this way as long as I can remember. If you don't use
> scontrol reboot_nodes, nodes are "down" when they come back because SLURM
> wasn't notified about the reboot. This is configurable in slurm.conf.
> 
> From: nico.faer...@id.unibe.ch 
> Sent: Tuesday, May 16, 2017 10:32:14 AM
> To: slurm-dev
> Subject: [slurm-dev] KNL node down after reboot
>
> Hi,
>
> We want to introduce Intel Knights Landing (KNL) nodes into our cluster,
> and observed the following problem: Node reboots successfully with desired
> NUMA and MCDRAM modes, but remains in state down.
>
> From slurmctld.log:
>
> (…)
> [2017-05-16T15:16:10.437] _update_node_avail_features: nodes knlnode01
> available features set to: a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto,knights_landing
> [2017-05-16T15:16:10.437] _update_node_active_features: nodes knlnode01
> active features set to: a2a,flat,knights_landing
> [2017-05-16T15:16:10.437] Node knlnode01 now responding
> [2017-05-16T15:16:10.437] validate_node_specs: Node knlnode01 unexpectedly
> rebooted boot_time=1494940521 last response=1494940189
> [2017-05-16T15:16:10.437] requeue job 21433841 due to failure of node
> knlnode01
> [2017-05-16T15:16:10.437] Requeuing JobID=21433841 State=0x0 NodeCnt=0
> (…)
>
> From  slurmd.log:
>
> (…)
> [2017-05-16T15:16:07.152] CPUs=256 Boards=1 Sockets=1 Cores=64 Threads=4
> Memory=112514 TmpDisk=174986 Uptime=49 CPUSpecList=(null)
> FeaturesAvail=a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto
> FeaturesActive=a2a,flat
> [2017-05-16T15:18:05.944] syscfg /d BIOSSETTINGS Cluster Mode
> [2017-05-16T15:18:05.944]
> Cluster Mode
> 
> Current Value : All2All
> ---
> Possible Values
> ---
> All2All : 00
> SNC-2 : 01
> SNC-4 : 02
> Hemisphere : 03
> Quadrant : 04
> Auto : 05
>
> [2017-05-16T15:18:06.158] syscfg /d BIOSSETTINGS Memory Mode
> [2017-05-16T15:18:06.158]
> Memory Mode
> ===
> Current Value : Flat
> 
> Possible Values
> ---
> Cache : 00
> Flat : 01
> Hybrid : 02
> Auto : 03
> (…)
>
>
> Here is some additional information:
>
> Slurm version:  17.02.2 (slurmctld and slurmd)
>
> slurm.conf:
> (…)
> # KNL
> NodeFeaturesPlugins=knl_generic
> DebugFlags=NodeFeatures
> RebootProgram=/sbin/reboot
> (…)
> NodeName=DEFAULT CPUs=256 RealMemory=109568 Sockets=1 CoresPerSocket=64
> ThreadsPerCore=4 TmpDisk=172032
> NodeName=knlnode01 NodeAddr=10.1.12.1 Feature=knights_landing State=UNKNOWN
> (…)
>
> knl_generic.conf:
> SyscfgPath=/usr/bin/syscfg/syscfg
> DefaultNUMA=a2a
> AllowNUMA=a2a,snc2
> DefaultMCDRAM=cache
>
>
> sbatch Command:
> sbatch -p phi -C flat,a2a -N1 --exclusive job.slurm
>
>
>
> Any suggestions are welcome.
>
> Cheers,
> Nico
>


[slurm-dev] Re: KNL node down after reboot

2017-05-16 Thread Ryan Novosielski

SLURM has worked this way as long as I can remember. If you don't use scontrol 
reboot_nodes, nodes are "down" when they come back because SLURM wasn't 
notified about the reboot. This is configurable in slurm.conf.
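
A minimal sketch of the two relevant knobs (values are examples, not taken from
the poster's config):

# slurm.conf: allow a DOWN node (e.g. one that rebooted unexpectedly) to return
# to service once it registers again with a valid configuration
ReturnToService=2

# or tell the controller about the reboot up front so it is not "unexpected":
$ scontrol reboot_nodes knlnode01    # "scontrol reboot" in newer releases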

From: nico.faer...@id.unibe.ch 
Sent: Tuesday, May 16, 2017 10:32:14 AM
To: slurm-dev
Subject: [slurm-dev] KNL node down after reboot

Hi,

We want to introduce Intel Knights Landing (KNL) nodes into our cluster, and 
observed the following problem: Node reboots successfully with desired NUMA and 
MCDRAM modes, but remains in state down.

From slurmctld.log:

(…)
[2017-05-16T15:16:10.437] _update_node_avail_features: nodes knlnode01 available 
features set to: a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto,knights_landing
[2017-05-16T15:16:10.437] _update_node_active_features: nodes knlnode01 active 
features set to: a2a,flat,knights_landing
[2017-05-16T15:16:10.437] Node knlnode01 now responding
[2017-05-16T15:16:10.437] validate_node_specs: Node knlnode01 unexpectedly 
rebooted boot_time=1494940521 last response=1494940189
[2017-05-16T15:16:10.437] requeue job 21433841 due to failure of node knlnode01
[2017-05-16T15:16:10.437] Requeuing JobID=21433841 State=0x0 NodeCnt=0
(…)

From  slurmd.log:

(…)
[2017-05-16T15:16:07.152] CPUs=256 Boards=1 Sockets=1 Cores=64 Threads=4 
Memory=112514 TmpDisk=174986 Uptime=49 CPUSpecList=(null) 
FeaturesAvail=a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto 
FeaturesActive=a2a,flat
[2017-05-16T15:18:05.944] syscfg /d BIOSSETTINGS Cluster Mode
[2017-05-16T15:18:05.944]
Cluster Mode

Current Value : All2All
---
Possible Values
---
All2All : 00
SNC-2 : 01
SNC-4 : 02
Hemisphere : 03
Quadrant : 04
Auto : 05

[2017-05-16T15:18:06.158] syscfg /d BIOSSETTINGS Memory Mode
[2017-05-16T15:18:06.158]
Memory Mode
===
Current Value : Flat

Possible Values
---
Cache : 00
Flat : 01
Hybrid : 02
Auto : 03
(…)


Here is some additional information:

Slurm version:  17.02.2 (slurmctld and slurmd)

slurm.conf:
(…)
# KNL
NodeFeaturesPlugins=knl_generic
DebugFlags=NodeFeatures
RebootProgram=/sbin/reboot
(…)
NodeName=DEFAULT CPUs=256 RealMemory=109568 Sockets=1 CoresPerSocket=64 
ThreadsPerCore=4 TmpDisk=172032
NodeName=knlnode01 NodeAddr=10.1.12.1 Feature=knights_landing State=UNKNOWN
(…)

knl_generic.conf:
SyscfgPath=/usr/bin/syscfg/syscfg
DefaultNUMA=a2a
AllowNUMA=a2a,snc2
DefaultMCDRAM=cache


sbatch Command:
sbatch -p phi -C flat,a2a -N1 --exclusive job.slurm



Any suggestions are welcome.

Cheers,
Nico


[slurm-dev] KNL node down after reboot

2017-05-16 Thread nico.faerber
Hi,

We want to introduce Intel Knights Landing (KNL) nodes into our cluster, and 
observed the following problem: Node reboots successfully with desired NUMA and 
MCDRAM modes, but remains in state down.

From slurmctld.log:

(…)
[2017-05-16T15:16:10.437] _update_node_avail_features: nodes knlnode01 available 
features set to: a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto,knights_landing
[2017-05-16T15:16:10.437] _update_node_active_features: nodes knlnode01 active 
features set to: a2a,flat,knights_landing
[2017-05-16T15:16:10.437] Node knlnode01 now responding
[2017-05-16T15:16:10.437] validate_node_specs: Node knlnode01 unexpectedly 
rebooted boot_time=1494940521 last response=1494940189
[2017-05-16T15:16:10.437] requeue job 21433841 due to failure of node knlnode01
[2017-05-16T15:16:10.437] Requeuing JobID=21433841 State=0x0 NodeCnt=0
(…)

From  slurmd.log:

(…)
[2017-05-16T15:16:07.152] CPUs=256 Boards=1 Sockets=1 Cores=64 Threads=4 
Memory=112514 TmpDisk=174986 Uptime=49 CPUSpecList=(null) 
FeaturesAvail=a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto 
FeaturesActive=a2a,flat
[2017-05-16T15:18:05.944] syscfg /d BIOSSETTINGS Cluster Mode
[2017-05-16T15:18:05.944]
Cluster Mode

Current Value : All2All
---
Possible Values
---
All2All : 00
SNC-2 : 01
SNC-4 : 02
Hemisphere : 03
Quadrant : 04
Auto : 05

[2017-05-16T15:18:06.158] syscfg /d BIOSSETTINGS Memory Mode
[2017-05-16T15:18:06.158]
Memory Mode
===
Current Value : Flat

Possible Values
---
Cache : 00
Flat : 01
Hybrid : 02
Auto : 03
(…)


Here is some additional information:

Slurm version:  17.02.2 (slurmctld and slurmd)

slurm.conf:
(…)
# KNL
NodeFeaturesPlugins=knl_generic
DebugFlags=NodeFeatures
RebootProgram=/sbin/reboot
(…)
NodeName=DEFAULT CPUs=256 RealMemory=109568 Sockets=1 CoresPerSocket=64 
ThreadsPerCore=4 TmpDisk=172032
NodeName=knlnode01 NodeAddr=10.1.12.1 Feature=knights_landing State=UNKNOWN
(…)

knl_generic.conf:
SyscfgPath=/usr/bin/syscfg/syscfg
DefaultNUMA=a2a
AllowNUMA=a2a,snc2
DefaultMCDRAM=cache


sbatch Command:
sbatch -p phi -C flat,a2a -N1 --exclusive job.slurm



Any suggestions are welcome.

Cheers,
Nico


[slurm-dev] Re: Is there any way to submit a job as a different user?

2017-05-16 Thread John Hearns
Sun,
As the others have responded, you should make sure your user IDs are the
same across the cluster.
It is really worth putting in the effort to do that.


However, SGE does have a user-mapping feature:
https://linux.die.net/man/5/sge_usermapping
I do not know if there is something similar in Slurm.




On 16 May 2017 at 14:26, E.S. Rosenberg 
wrote:

> On Tue, May 16, 2017 at 11:39 AM, Sun Chenggen 
> wrote:
>
>> Yes, the users on my cluster are synchronized, but I want to submit jobs from my
>> client machine, not from the cluster.
>>
> That only works if you synchronize them, for instance by making sure your UID/GID on
> your client matches your UID/GID on the cluster.
>
>>
>> From: Felip Moll 
>> Reply-To: slurm-dev 
>> Date: Tuesday, May 16, 2017, 4:25 PM
>> To: slurm-dev 
>> Subject: [slurm-dev] Re: Is there any way to submit a job as a different user?
>>
>> It is not possible, at least not in a supported way.
>>
>> The first requirement in the admin guide states:
>>
>>1. Make sure the clocks, users and groups (UIDs and GIDs) are
>>synchronized across the cluster.
>>
>> From:
>>
>> https://slurm.schedmd.com/quickstart_admin.html
>>
>>
>>
>>
>>
>>
>> -- Felip Moll Marquès
>> Computer Science Engineer
>> E-Mail - lip...@gmail.com
>> WebPage - http://lipix.ciutadella.es
>>
>> 2017-05-16 9:24 GMT+02:00 Sun Chenggen :
>>
>>> Hi everyone:
>>> Is there anyway to commit job with different user? My slurm cluster
>>> doesn’t have the same user config as my local  slurm-client machine. If I
>>> commit job on my local machine , it failed with the message “srun: error:
>>> Application launch failed: User not found on host”.
>>> Do I have to distribute my local /etc/passwd to cluster? I don’t want to
>>> do this way. Is there a better way to commit srun job with different user
>>> account?
>>>
>>> Thanks for your help,
>>> Sun
>>>
>>
>>
>


[slurm-dev] Re: Is there any way to submit a job as a different user?

2017-05-16 Thread E.S. Rosenberg
On Tue, May 16, 2017 at 11:39 AM, Sun Chenggen 
wrote:

> Yes, the users on my cluster are synchronized, but I want to submit jobs from my
> client machine, not from the cluster.
>
That only works if you synchronize them, for instance by making sure your UID/GID on
your client matches your UID/GID on the cluster.
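
A quick way to check (usernames and hostnames here are placeholders):

$ id sun                     # on the client machine
$ ssh compute-node id sun    # on a cluster node; UID and GID should match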

>
> From: Felip Moll 
> Reply-To: slurm-dev 
> Date: Tuesday, May 16, 2017, 4:25 PM
> To: slurm-dev 
> Subject: [slurm-dev] Re: Is there any way to submit a job as a different user?
>
> It is not possible, at least not in a supported way.
>
> The first requirement in the admin guide states:
>
>1. Make sure the clocks, users and groups (UIDs and GIDs) are
>synchronized across the cluster.
>
> From:
>
> https://slurm.schedmd.com/quickstart_admin.html
>
>
>
>
>
>
> -- Felip Moll Marquès
> Computer Science Engineer
> E-Mail - lip...@gmail.com
> WebPage - http://lipix.ciutadella.es
>
> 2017-05-16 9:24 GMT+02:00 Sun Chenggen :
>
>> Hi everyone:
>> Is there anyway to commit job with different user? My slurm cluster
>> doesn’t have the same user config as my local  slurm-client machine. If I
>> commit job on my local machine , it failed with the message “srun: error:
>> Application launch failed: User not found on host”.
>> Do I have to distribute my local /etc/passwd to cluster? I don’t want to
>> do this way. Is there a better way to commit srun job with different user
>> account?
>>
>> Thanks for your help,
>> Sun
>>
>
>


[slurm-dev] Re: Adjusting MaxJobCount and SlurmctldPort settings

2017-05-16 Thread gilles
 Mark,



Changing SlurmctldPort will not help with the queue limit anyway, so I 
suggest you update it (if really needed) during a maintenance window when no 
jobs are running.



Cheers,



Gilles

- Original Message -

Hello,

Changing the slurmctld port should probably wait until all jobs have stopped 
running.  Running jobs won't fail outright in this case, but there is a good 
chance they will fail to complete properly, and the compute nodes running 
them might get stuck in the completing state (since the slurmstepd processes 
managing the job would fail to communicate properly, owing to the change in 
port).

Changing a parameter like MaxJobCount should be safe in my opinion, 
possibly even with just a change to slurm.conf followed by `scontrol 
reconfigure`; though if that doesn't work you might need to restart 
slurmctld.

-Doug




Doug Jacobsen, Ph.D. 
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center 
dmjacob...@lbl.gov 

- __o 
-- _ '\<,_ 
--(_)/  (_)__ 


On Tue, May 16, 2017 at 2:22 AM, Mark S. Holliman  wrote:

Hi everyone,

Does anyone know if changing the slurmctld settings for MaxJobCount and 
SlurmctldPort will cause jobs already running/waiting to fail?  My users 
have hit the default 10,000 queue limit, and I'd like to increase that, 
but not if it's going to kill everything that's running.  I know most 
settings can be changed (and slurmctld/slurmd restarted) without issue.  
But given that scheduler changes can cause existing jobs to get killed I'm 
uncertain about these parameters...

Cheers,
  Mark

---
Mark Holliman
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
---



 


[slurm-dev] Re: Adjusting MaxJobCount and SlurmctldPort settings

2017-05-16 Thread Douglas Jacobsen
Hello,

Changing the slurmctld port should probably wait until all jobs have stopped
running.  Running jobs won't fail outright in this case, but there is a good chance
they will fail to complete properly, and the compute nodes running them
might get stuck in the completing state (since the slurmstepd processes managing the
job would fail to communicate properly, owing to the change in port).

Changing a parameter like MaxJobCount should be safe in my opinion,
possibly even with just a change to slurm.conf followed by `scontrol
reconfigure`; though if that doesn't work you might need to restart
slurmctld.
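
A minimal sketch of the MaxJobCount change (the value is only an example):

# slurm.conf
MaxJobCount=50000

$ scontrol reconfigure                       # pick up the new value
$ scontrol show config | grep MaxJobCount    # verify it took effect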

-Doug




Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center 
dmjacob...@lbl.gov

- __o
-- _ '\<,_
--(_)/  (_)__


On Tue, May 16, 2017 at 2:22 AM, Mark S. Holliman  wrote:

>
> Hi everyone,
>
> Does anyone know if changing the slurmctld settings for MaxJobCount and
> SlurmctldPort will cause jobs already running/waiting to fail?  My users
> have hit the default 10,000 queue limit, and I'd like to increase that, but
> not if it's going to kill everything that's running.  I know most settings
> can be changed (and slurmctld/slurmd restarted) without issue.  But given
> that scheduler changes can cause existing jobs to get killed I'm uncertain
> about these parameters...
>
> Cheers,
>   Mark
>
> ---
> Mark Holliman
> Wide Field Astronomy Unit
> Institute for Astronomy
> University of Edinburgh
> ---
>
>
>


[slurm-dev] Adjusting MaxJobCount and SlurmctldPort settings

2017-05-16 Thread Mark S. Holliman

Hi everyone,

Does anyone know if changing the slurmctld settings for MaxJobCount and 
SlurmctldPort will cause jobs already running/waiting to fail?  My users have 
hit the default 10,000 queue limit, and I'd like to increase that, but not if 
it's going to kill everything that's running.  I know most settings can be 
changed (and slurmctld/slurmd restarted) without issue.  But given that 
scheduler changes can cause existing jobs to get killed I'm uncertain about 
these parameters...

Cheers,
  Mark

---
Mark Holliman
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
---




[slurm-dev] Re: How to get pids of a job

2017-05-16 Thread Bjørn-Helge Mevik
GHui  writes:

> The command "scontrol show jobs jobid" will show the nodes. And then
> the command "ssh nodes scontrol listpids jobid" will show the pids.
>
> But this is a little complex. Is there a simpler command, like LSF's bjobs,
> that shows the nodes and the pids?

Not that I know of, but it should be possible to script.

> And how do I parse a nodelist like "cn[11033,11069],gn[1103-1120]"?

scontrol show hostnames cn[11033,11069],gn[1103-1120]
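
So a small wrapper along those lines might look like this (JOBID is a placeholder):

JOBID=12345
for host in $(scontrol show hostnames "$(squeue -h -j "$JOBID" -o %N)"); do
    echo "== $host =="
    ssh "$host" scontrol listpids "$JOBID"
done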

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




[slurm-dev] Re: Is there any way to submit a job as a different user?

2017-05-16 Thread Felip Moll
It is not possible, at least not in a supported way.

The first requirement in the admin guide states:

   1. Make sure the clocks, users and groups (UIDs and GIDs) are
   synchronized across the cluster.

From:

https://slurm.schedmd.com/quickstart_admin.html






-- Felip Moll Marquès
Computer Science Engineer
E-Mail - lip...@gmail.com
WebPage - http://lipix.ciutadella.es

2017-05-16 9:24 GMT+02:00 Sun Chenggen :

> Hi everyone:
> Is there any way to submit a job as a different user? My Slurm cluster
> doesn’t have the same user configuration as my local slurm-client machine. If I
> submit a job from my local machine, it fails with the message “srun: error:
> Application launch failed: User not found on host”.
> Do I have to distribute my local /etc/passwd to the cluster? I don’t want to
> do it that way. Is there a better way to submit an srun job as a different
> user account?
>
> Thanks for your help,
> Sun
>


[slurm-dev] Is there any way to submit a job as a different user?

2017-05-16 Thread Sun Chenggen
Hi everyone:
Is there any way to submit a job as a different user? My Slurm cluster doesn’t 
have the same user configuration as my local slurm-client machine. If I submit a job 
from my local machine, it fails with the message “srun: error: Application launch 
failed: User not found on host”.
Do I have to distribute my local /etc/passwd to the cluster? I don’t want to do 
it that way. Is there a better way to submit an srun job as a different user account?

Thanks for your help,
Sun