[slurm-dev] A way to abuse the priority option in Slurm?

2015-03-31 Thread Magnus Jonsson


Hi!

I just discovered a possible way for a user to abuse the priority in Slurm.

This is the scenario:

1. A user has not run any jobs in a long time and therefore has a high 
fairshare priority. Let's say: 1.0.


2. The user submits 1000 jobs into the queue, far more than his 
fairshare target.


3. The user changes the priority of his jobs (it's OK for a user to lower 
the priority of jobs as long as he is the owner) to a value only slightly 
lower, which is still a high priority (+/-1 is in practice nothing): 
(scontrol update jobid=1 priority=...)


4. The user's jobs start and his fairshare priority drops. But here is 
the big _BUT_: the priority of the jobs he changed does not seem to be 
recalculated, leaving the user's jobs at maximum priority until all of 
the jobs are completed.


Have I missed something in this scenario?

If this is true, what do we do about it? Should users be able to change 
the priority at all?


The user can instead use the 'nice' option to alter the priority of a job 
within a small limit; that does not override the priority in the way 
described above.
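
For reference, a minimal sketch of the two paths being compared here (the 
job id, the priority and nice values, and job.sh are made-up examples, 
not taken from the report above):

# Persistent override: the owner explicitly sets a new priority on his own job.
scontrol update jobid=1 priority=12345

# Bounded alternative: a 'nice' adjustment at submission time, which only
# shifts the priority by a small amount.
sbatch --nice=100 job.sh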


Please let me be wrong :-)

/Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] error code in slurm_checkpoint_complete

2015-03-31 Thread Manuel Rodríguez Pascual
Hi all,

Just a quick question regarding the Slurm API.

In the following call,

(checkpoint.c)
/*
 * slurm_checkpoint_complete - note the completion of a job step's
checkpoint operation.
 * IN job_id  - job on which to perform operation
 * IN step_id - job step on which to perform operation
 * IN begin_time - time at which checkpoint began
 * IN error_code - error code, highest value for all complete calls is
preserved
 * IN error_msg - error message, preserved for highest error_code
 * RET 0 or a slurm error code
 */
extern int slurm_checkpoint_complete (uint32_t job_id, uint32_t step_id,
time_t begin_time, uint32_t error_code, char *error_msg);


What is error_code used for? The man page says:

Error code for checkpoint operation. Only the highest value is preserved.

But given that it is an input parameter, I don't really see the point,
especially since the function itself also returns an error code. Can
anyone enlighten me?

Thanks for your attention. Best regards,


Manuel




-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN


[slurm-dev] Re: Problems running job

2015-03-31 Thread Jeff Layton


Chris and David,

Thanks for the help! I'm still trying to find out why the
compute nodes are down or not responding. Any tips
on where to start?

How about open ports? Right now I have 6817 and
6818 open as per my slurm.conf. I also have 22 and 80
open as well as 111, 2049, and 32806. I'm using NFSv4
but don't know if that is causing the problem or not
(I REALLY want to stick to NFSv4).

Thanks!

Jeff


On 31/03/15 07:31, Jeff Layton wrote:


Good afternoon!

Hiya Jeff,

[...]

But it doesn't seem to run. Here is the output of sinfo
and squeue:

[...]

Actually it does appear to get started (at least), but..


[ec2-user@ip-10-0-1-72 ec2-user]$ squeue
  JOBID PARTITION NAME USER ST TIME  NODES
NODELIST(REASON)
  2 debug slurmtes ec2-user CG 0:00  1 ip-10-0-2-101

...the CG state you see there is the completing state, i.e. the state
when a job is finishing up.


The system logs on the master node (controller node) don't show too much:

Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
_slurm_rpc_submit_batch_job JobId=2 usec=239
Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
NodeList=ip-10-0-2-101 #CPUs=1

OK, node allocated.


Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0

Job finishes.


Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
JobID=2 State=0x8000 NodeCnt=1 per user/system request
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
State=0x8000 NodeCnt=1 done

Not sure of the implication of that requeue there, unless it's the
transition to the CG state?


Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
ip-10-0-2-[101-102] not responding
Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
not responding, setting DOWN
Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
not responding, setting DOW

Now the nodes stop responding (not before).


 From these logs, it looks like the compute nodes are not
responding to the control node (master node).

Not sure how to debug this - any tips?

I would suggest looking at the slurmd logs on the compute nodes to see
if they report any problems, and check to see what state the processes
are in - especially if they're stuck in a 'D' state waiting on some form
of device I/O.
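
As a hedged sketch of what that check might look like on a compute node 
(the log path below is only an example; the real one is whatever 
SlurmdLogFile points to):

# Find out where slurmd logs on this node.
scontrol show config | grep -i SlurmdLogFile

# Look for recent errors in that log (example path).
grep -i error /var/log/slurm/slurmd.log | tail -n 20

# List processes stuck in uninterruptible sleep ('D' state).
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'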

I know some people have reported strange interactions when Slurm itself
is on an NFSv4 mount (NFSv3 is fine).

Good luck!
Chris


[slurm-dev] Re: Problems running job

2015-03-31 Thread Jeff Layton


Actually I don't have all the ports open :(  I can do that
though (I thought that might be a problem).

Thanks!

Jeff



Do you have all the ports open between all the compute nodes as well? 
Since Slurm builds a tree to communicate, all the nodes need to talk to 
every other node on those ports, and do so without a huge amount of 
latency. You might want to try increasing your timeouts.


-Paul Edmon-

On 03/31/2015 10:28 AM, Jeff Layton wrote:


Chris and David,

Thanks for the help! I'm still trying to find out why the
compute nodes are down or not responding. Any tips
on where to start?

How about open ports? Right now I have 6817 and
6818 open as per my slurm.conf. I also have 22 and 80
open as well as 111, 2049, and 32806. I'm using NFSv4
but don't know if that is causing the problem or not
(I REALLY want to stick to NFSv4).

Thanks!

Jeff


On 31/03/15 07:31, Jeff Layton wrote:


Good afternoon!

Hiya Jeff,

[...]

But it doesn't seem to run. Here is the output of sinfo
and squeue:

[...]

Actually it does appear to get started (at least), but..


[ec2-user@ip-10-0-1-72 ec2-user]$ squeue
  JOBID PARTITION NAME USER ST TIME NODES
NODELIST(REASON)
  2 debug slurmtes ec2-user CG 0:00 1 
ip-10-0-2-101

...the CG state you see there is the completing state, i.e. the state
when a job is finishing up.

The system logs on the master node (controller node) don't show too much:


Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
_slurm_rpc_submit_batch_job JobId=2 usec=239
Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
NodeList=ip-10-0-2-101 #CPUs=1

OK, node allocated.


Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0

Job finishes.


Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
JobID=2 State=0x8000 NodeCnt=1 per user/system request
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
State=0x8000 NodeCnt=1 done

Not sure of the implication of that requeue there, unless it's the
transition to the CG state?


Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
ip-10-0-2-[101-102] not responding
Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
not responding, setting DOWN
Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
not responding, setting DOW

Now the nodes stop responding (not before).


 From these logs, it looks like the compute nodes are not
responding to the control node (master node).

Not sure how to debug this - any tips?

I would suggest looking at the slurmd logs on the compute nodes to see
if they report any problems, and check to see what state the processes
are in - especially if they're stuck in a 'D' state waiting on some form
of device I/O.

I know some people have reported strange interactions when Slurm itself
is on an NFSv4 mount (NFSv3 is fine).

Good luck!
Chris




[slurm-dev] Re: Problems running job

2015-03-31 Thread Paul Edmon


Do you have all the ports open between all the compute nodes as well? 
Since Slurm builds a tree to communicate, all the nodes need to talk to 
every other node on those ports, and do so without a huge amount of 
latency. You might want to try increasing your timeouts.
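
For example, a rough way to sanity-check reachability of the Slurm ports 
between nodes (the controller name below is a placeholder; nc and ss are 
assumed to be available):

# From a compute node: is slurmctld reachable on the controller?
nc -zv controller-host 6817

# From one compute node to another: is slurmd reachable?
nc -zv ip-10-0-2-102 6818

# On a node itself: confirm slurmd is listening on 6818.
ss -tlnp | grep 6818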


-Paul Edmon-

On 03/31/2015 10:28 AM, Jeff Layton wrote:


Chris and David,

Thanks for the help! I'm still trying to find out why the
compute nodes are down or not responding. Any tips
on where to start?

How about open ports? Right now I have 6817 and
6818 open as per my slurm.conf. I also have 22 and 80
open as well as 111, 2049, and 32806. I'm using NFSv4
but don't know if that is causing the problem or not
(I REALLY want to stick to NFSv4).

Thanks!

Jeff


On 31/03/15 07:31, Jeff Layton wrote:


Good afternoon!

Hiya Jeff,

[...]

But it doesn't seem to run. Here is the output of sinfo
and squeue:

[...]

Actually it does appear to get started (at least), but..


[ec2-user@ip-10-0-1-72 ec2-user]$ squeue
  JOBID PARTITION NAME USER ST TIME  NODES
NODELIST(REASON)
  2 debug slurmtes ec2-user CG 0:00  1 
ip-10-0-2-101

...the CG state you see there is the completing state, i.e. the state
when a job is finishing up.

The system logs on the master node (controller node) don't show too much:


Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
_slurm_rpc_submit_batch_job JobId=2 usec=239
Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
NodeList=ip-10-0-2-101 #CPUs=1

OK, node allocated.


Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0

Job finishes.


Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
JobID=2 State=0x8000 NodeCnt=1 per user/system request
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
State=0x8000 NodeCnt=1 done

Not sure of the implication of that requeue there, unless it's the
transition to the CG state?


Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
ip-10-0-2-[101-102] not responding
Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
not responding, setting DOWN
Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
not responding, setting DOW

Now the nodes stop responding (not before).


 From these logs, it looks like the compute nodes are not
responding to the control node (master node).

Not sure how to debug this - any tips?

I would suggest looking at the slurmd logs on the compute nodes to see
if they report any problems, and check to see what state the processes
are in - especially if they're stuck in a 'D' state waiting on some form
of device I/O.

I know some people have reported strange interactions when Slurm itself
is on an NFSv4 mount (NFSv3 is fine).

Good luck!
Chris


[slurm-dev] Re: Problems running job

2015-03-31 Thread Jeff Layton


That's what I've done. Everything is in NFSv4 except for a
few bits:

/etc/slurm.conf
/etc/init.d/slurm
/var/log/slurm
/var/run/slurm
/var/spool/slurm

These bits are local to the node.

Will slurm have trouble in this case?

Thanks!

Jeff


The problem mentioned with NFSv4 is keeping your SLURM installation on NFS, e.g. 
exported to but not physically residing on your nodes.

--
 *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
  || \\UTGERS  |-*O*-
  ||_// Biomedical | Ryan Novosielski - Senior Technologist
  || \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922)
  ||  \\  Sciences | OIRT/High Perf  Res Comp - MSB C630, Newark
   `'

From: Jeff Layton [layto...@att.net]
Sent: Tuesday, March 31, 2015 10:28 AM
To: slurm-dev
Subject: [slurm-dev] Re: Problems running job

Chris and David,

Thanks for the help! I'm still trying to find out why the
compute nodes are down or not responding. Any tips
on where to start?

How about open ports? Right now I have 6817 and
6818 open as per my slurm.conf. I also have 22 and 80
open as well as 111, 2049, and 32806. I'm using NFSv4
but don't know if that is causing the problem or not
(I REALLY want to stick to NFSv4).

Thanks!

Jeff


On 31/03/15 07:31, Jeff Layton wrote:


Good afternoon!

Hiya Jeff,

[...]

But it doesn't seem to run. Here is the output of sinfo
and squeue:

[...]

Actually it does appear to get started (at least), but..


[ec2-user@ip-10-0-1-72 ec2-user]$ squeue
   JOBID PARTITION NAME USER ST TIME  NODES
NODELIST(REASON)
   2 debug slurmtes ec2-user CG 0:00  1 ip-10-0-2-101

...the CG state you see there is the completing state, i.e. the state
when a job is finishing up.


The system logs on the master node (controller node) don't show too much:

Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
_slurm_rpc_submit_batch_job JobId=2 usec=239
Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
NodeList=ip-10-0-2-101 #CPUs=1

OK, node allocated.


Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0

Job finishes.


Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
JobID=2 State=0x8000 NodeCnt=1 per user/system request
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
State=0x8000 NodeCnt=1 done

Not sure of the implication of that requeue there, unless it's the
transition to the CG state?


Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
ip-10-0-2-[101-102] not responding
Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
not responding, setting DOWN
Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
not responding, setting DOW

Now the nodes stop responding (not before).


  From these logs, it looks like the compute nodes are not
responding to the control node (master node).

Not sure how to debug this - any tips?

I would suggest looking at the slurmd logs on the compute nodes to see
if they report any problems, and check to see what state the processes
are in - especially if they're stuck in a 'D' state waiting on some form
of device I/O.

I know some people have reported strange interactions when Slurm itself
is on an NFSv4 mount (NFSv3 is fine).

Good luck!
Chris


[slurm-dev] Re: Problems running job

2015-03-31 Thread Uwe Sauter

Yes! There are problems if the clean-up scripts for cgroups reside on NFSv4. 
Nodes will lock up when they try to remove a job's cgroup.
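
A rough way to check whether the Slurm pieces a node executes actually 
live on NFS (the paths below are only examples; adjust them to wherever 
your site installs Slurm):

# What filesystem type does the slurmd binary sit on?
df -T "$(command -v slurmd)"

# Same check for the plugin/script directory, if you know where it is.
df -T /usr/lib64/slurm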


On 31.03.2015 at 17:06, Jeff Layton wrote:
 
 That's what I've done. Everything is in NFSv4 except for a
 few bits:
 
 /etc/slurm.conf
 /etc/init.d/slurm
 /var/log/slurm
 /var/run/slurm
 /var/spool/slurm
 
 These bits are local to the node.
 
 Will slurm have trouble in this case?
 
 Thanks!
 
 Jeff
 
 The problem mentioned with NFSv4 is keeping your SLURM installation on NFS, 
 e.g. exported to but not physically residing on your nodes.

 -- 
  *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
   || \\UTGERS  |-*O*-
   ||_// Biomedical | Ryan Novosielski - Senior Technologist
   || \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922)
   ||  \\  Sciences | OIRT/High Perf  Res Comp - MSB C630, Newark
`'
 
 From: Jeff Layton [layto...@att.net]
 Sent: Tuesday, March 31, 2015 10:28 AM
 To: slurm-dev
 Subject: [slurm-dev] Re: Problems running job

 Chris and David,

 Thanks for the help! I'm still trying to find out why the
 compute nodes are down or not responding. Any tips
 on where to start?

 How about open ports? Right now I have 6817 and
 6818 open as per my slurm.conf. I also have 22 and 80
 open as well as 111, 2049, and 32806. I'm using NFSv4
 but don't know if that is causing the problem or not
 (I REALLY want to stick to NFSv4).

 Thanks!

 Jeff

 On 31/03/15 07:31, Jeff Layton wrote:

 Good afternoon!
 Hiya Jeff,

 [...]
 But it doesn't seem to run. Here is the output of sinfo
 and squeue:
 [...]

 Actually it does appear to get started (at least), but..

 [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
JOBID PARTITION NAME USER ST TIME  NODES
 NODELIST(REASON)
2 debug slurmtes ec2-user CG 0:00  1 
 ip-10-0-2-101
 ...the CG state you see there is the completing state, i.e. the state
 when a job is finishing up.

 The system logs on the master node (controller node) don't show too much:

 Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
 _slurm_rpc_submit_batch_job JobId=2 usec=239
 Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
 NodeList=ip-10-0-2-101 #CPUs=1
 OK, node allocated.

 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
 Job finishes.

 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
 JobID=2 State=0x8000 NodeCnt=1 per user/system request
 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
 State=0x8000 NodeCnt=1 done
 Not sure of the implication of that requeue there, unless it's the
 transition to the CG state?

 Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
 ip-10-0-2-[101-102] not responding
 Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
 not responding, setting DOWN
 Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
 not responding, setting DOW
 Now the nodes stop responding (not before).

   From these logs, it looks like the compute nodes are not
 responding to the control node (master node).

 Not sure how to debug this - any tips?
 I would suggest looking at the slurmd logs on the compute nodes to see
 if they report any problems, and check to see what state the processes
 are in - especially if they're stuck in a 'D' state waiting on some form
 of device I/O.

 I know some people have reported strange interactions when Slurm itself
 is on an NFSv4 mount (NFSv3 is fine).

 Good luck!
 Chris


[slurm-dev] A way to abuse the priority option in Slurm?

2015-03-31 Thread Moe Jette


Hi Magnus,

Unfortunately you found a bug. Here is a patch that will prevent users  
from making persistent job priority changes. We should probably return  
an error for this condition, but I would like to defer that change to  
the next major release, v15.08.


https://github.com/SchedMD/slurm/commit/4454316ef527b8700743d94c958811a39609e7d5.patch
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support


[slurm-dev] Re: Error connecting slurm stream socket at IP:6817: Connection refused

2015-03-31 Thread Jorge Gois
Thanks for the reply.

But I think it is not a network problem, because I am starting this only on the head 
controller.

Can you take a look at my config? I don't have iptables installed or anything like that.

Config:

ControlMachine=JGSLURMHC
ControlAddr=172.16.40.42
#BackupController=
#BackupAddr=

AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge

MpiDefault=none

ProctrackType=proctrack/pgid

ReturnToService=1

SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=root

StateSaveLocation=/tmp
SwitchType=switch/none

TaskPlugin=task/none

InactiveLimit=0
KillWait=30

MinJobAge=300

SlurmctldTimeout=120
SlurmdTimeout=300

Waittime=0

FastSchedule=1

SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear

AccountingStorageType=accounting_storage/none

AccountingStoreJobComment=YES
ClusterName=cluster

JobCompType=jobcomp/none

JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmdLogFile=/var/log/slurmd
SlurmctldLogFile=/var/log/slurm
SlurmdDebug=3

# COMPUTE NODES
NodeName=JGSLURMHC CPUs=1 State=UNKNOWN
#PartitionName=debug Nodes=JGNODE[1-1] Default=YES MaxTime=INFINITE State=UP


Bests,
JG
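
A few generic checks that might help narrow down a "connection refused" on 
port 6817 (these are only examples, not taken from the configuration above):

# Is slurmctld running and listening on the controller?
ss -tlnp | grep 6817

# Does the controller answer at all?
scontrol ping

# Is munge authentication working locally?
munge -n | unmunge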



[slurm-dev] Re: Problems running job

2015-03-31 Thread Mehdi Denou

Put the slurmd and slurmctld in debug mode and retry the submission.
Then provide the logs.
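
For example, one way that might look (exact options depend on your Slurm 
version; these are the common ones):

# Raise slurmctld's log verbosity at runtime on the controller.
scontrol setdebug debug

# On a compute node, run slurmd in the foreground with verbose logging.
slurmd -D -vvv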

On 31/03/2015 at 16:28, Jeff Layton wrote:

 Chris and David,

 Thanks for the help! I'm still trying to find out why the
 compute nodes are down or not responding. Any tips
 on where to start?

 How about open ports? Right now I have 6817 and
 6818 open as per my slurm.conf. I also have 22 and 80
 open as well as 111, 2049, and 32806. I'm using NFSv4
 but don't know if that is causing the problem or not
 (I REALLY want to stick to NFSv4).

 Thanks!

 Jeff

 On 31/03/15 07:31, Jeff Layton wrote:

 Good afternoon!
 Hiya Jeff,

 [...]
 But it doesn't seem to run. Here is the output of sinfo
 and squeue:
 [...]

 Actually it does appear to get started (at least), but..

 [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
   JOBID PARTITION NAME USER ST TIME  NODES
 NODELIST(REASON)
   2 debug slurmtes ec2-user CG 0:00  1
 ip-10-0-2-101
 ...the CG state you see there is the completing state, i.e. the state
 when a job is finishing up.

 The system logs on the master node (controller node) don't show too much:

 Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
 _slurm_rpc_submit_batch_job JobId=2 usec=239
 Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
 NodeList=ip-10-0-2-101 #CPUs=1
 OK, node allocated.

 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
 Job finishes.

 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
 JobID=2 State=0x8000 NodeCnt=1 per user/system request
 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
 State=0x8000 NodeCnt=1 done
 Not sure of the implication of that requeue there, unless it's the
 transition to the CG state?

 Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
 ip-10-0-2-[101-102] not responding
 Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes
 ip-10-0-2-102
 not responding, setting DOWN
 Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes
 ip-10-0-2-101
 not responding, setting DOW
 Now the nodes stop responding (not before).

  From these logs, it looks like the compute nodes are not
 responding to the control node (master node).

 Not sure how to debug this - any tips?
 I would suggest looking at the slurmd logs on the compute nodes to see
 if they report any problems, and check to see what state the processes
 are in - especially if they're stuck in a 'D' state waiting on some form
 of device I/O.

 I know some people have reported strange interactions when Slurm itself
 is on an NFSv4 mount (NFSv3 is fine).

 Good luck!
 Chris

-- 
---
Mehdi Denou
International HPC support
+336 45 57 66 56


[slurm-dev] Re: Problems running job

2015-03-31 Thread Novosielski, Ryan

The problem mentioned with NFSv4 is keeping your SLURM installation on NFS, e.g. 
exported to but not physically residing on your nodes. 

--
 *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
 || \\UTGERS  |-*O*-
 ||_// Biomedical | Ryan Novosielski - Senior Technologist
 || \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922)
 ||  \\  Sciences | OIRT/High Perf  Res Comp - MSB C630, Newark
  `'

From: Jeff Layton [layto...@att.net]
Sent: Tuesday, March 31, 2015 10:28 AM
To: slurm-dev
Subject: [slurm-dev] Re: Problems running job

Chris and David,

Thanks for the help! I'm still trying to find out why the
compute nodes are down or not responding. Any tips
on where to start?

How about open ports? Right now I have 6817 and
6818 open as per my slurm.conf. I also have 22 and 80
open as well as 111, 2049, and 32806. I'm using NFSv4
but don't know if that is causing the problem or not
(I REALLY want to stick to NFSv4).

Thanks!

Jeff

 On 31/03/15 07:31, Jeff Layton wrote:

 Good afternoon!
 Hiya Jeff,

 [...]
 But it doesn't seem to run. Here is the output of sinfo
 and squeue:
 [...]

 Actually it does appear to get started (at least), but..

 [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
   JOBID PARTITION NAME USER ST TIME  NODES
 NODELIST(REASON)
   2 debug slurmtes ec2-user CG 0:00  1 ip-10-0-2-101
 ...the CG state you see there is the completing state, i.e. the state
 when a job is finishing up.

 The system logs on the master node (controller node) don't show too much:

 Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
 _slurm_rpc_submit_batch_job JobId=2 usec=239
 Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
 NodeList=ip-10-0-2-101 #CPUs=1
 OK, node allocated.

 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
 Job finishes.

 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
 JobID=2 State=0x8000 NodeCnt=1 per user/system request
 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
 State=0x8000 NodeCnt=1 done
 Not sure of the implication of that requeue there, unless it's the
 transition to the CG state?

 Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
 ip-10-0-2-[101-102] not responding
 Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
 not responding, setting DOWN
 Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
 not responding, setting DOW
 Now the nodes stop responding (not before).

  From these logs, it looks like the compute nodes are not
 responding to the control node (master node).

 Not sure how to debug this - any tips?
 I would suggest looking at the slurmd logs on the compute nodes to see
 if they report any problems, and check to see what state the processes
 are in - especially if they're stuck in a 'D' state waiting on some form
 of device I/O.

 I know some people have reported strange interactions when Slurm itself
 is on an NFSv4 mount (NFSv3 is fine).

 Good luck!
 Chris