[slurm-dev] Re: Problems running job

Novosielski, Ryan Tue, 31 Mar 2015 07:57:07 -0700

The problem mentioned with NFSv4 concerns keeping your SLURM installation 
itself on NFS, i.e. exported to your nodes rather than physically residing on 
them.
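
To tell whether a given install actually lives on NFS, something like the following minimal sketch may help. It matches a path against the contents of /proc/mounts and reports the filesystem type of the longest matching mount point. The install path in the usage comment is hypothetical (it is not given in this thread), and the sketch does not handle mount points containing escaped spaces.

```python
import posixpath


def fs_type_for_path(path, mounts_text):
    """Return (mount_point, fs_type) for the mount that contains `path`.

    `mounts_text` is the text of /proc/mounts. The longest matching mount
    point wins; on a tie, the later entry (mounted on top) wins.
    """
    path = posixpath.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_on_nfs(path, mounts_text):
    """True if `path` sits on an NFS mount ("nfs" or "nfs4" in /proc/mounts)."""
    match = fs_type_for_path(path, mounts_text)
    return match is not None and match[1] in ("nfs", "nfs4")


# Usage on a compute node (the slurmd path below is an assumption):
#   with open("/proc/mounts") as fh:
#       print(fs_type_for_path("/opt/slurm/sbin/slurmd", fh.read()))
```

An "nfs4" result for the directory holding slurmd would suggest the setup described above; a local type such as "ext4" would suggest it does not apply.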


--
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
 || \\UTGERS      |---------------------*O*---------------------
 ||_// Biomedical | Ryan Novosielski - Senior Technologist
 || \\ and Health | [email protected] - 973/972.0922 (2x0922)
 ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
      `'
________________________________________
From: Jeff Layton [[email protected]]
Sent: Tuesday, March 31, 2015 10:28 AM
To: slurm-dev
Subject: [slurm-dev] Re: Problems running job

Chris and David,

Thanks for the help! I'm still trying to find out why the
compute nodes are down or not responding. Any tips
on where to start?

How about open ports? Right now I have 6817 and
6818 open as per my slurm.conf. I also have 22 and 80
open as well as 111, 2049, and 32806. I'm using NFSv4
but don't know if that is causing the problem or not
(I REALLY want to stick to NFSv4).

Thanks!

Jeff
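
On the open-ports question, a rough way to confirm that the Slurm ports are reachable between nodes is a plain TCP connect test. The sketch below is only that: it assumes 6817 is the slurmctld port and 6818 the slurmd port (the Slurm defaults; the slurm.conf itself is not shown in this thread), and the helper names are made up for illustration. A successful connect shows only that something is listening and that no firewall or security group blocks the path, not that the daemon is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(host, ports, timeout=2.0):
    """Return a {port: reachable} mapping for the given host."""
    return {port: port_is_open(host, port, timeout) for port in ports}


# Usage, with hostnames taken from the logs quoted below:
#   from the controller:    check_ports("ip-10-0-2-101", [6818])
#   from a compute node:    check_ports("ip-10-0-1-72", [6817])
```

Running it in both directions matters, since the controller has to reach slurmd on each compute node and each compute node has to reach slurmctld on the controller.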

> On 31/03/15 07:31, Jeff Layton wrote:
>
>> Good afternoon!
> Hiya Jeff,
>
> [...]
>> But it doesn't seem to run. Here is the output of sinfo
>> and squeue:
> [...]
>
> Actually it does appear to get started (at least), but..
>
>> [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
>>  JOBID PARTITION     NAME     USER ST TIME  NODES NODELIST(REASON)
>>      2     debug slurmtes ec2-user CG 0:00      1 ip-10-0-2-101
> ...the CG state you see there is the completing state, i.e. the state
> when a job is finishing up.
>
>> The system logs on the master node (controller node) don't show too much:
>>
>> Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
>> _slurm_rpc_submit_batch_job JobId=2 usec=239
>> Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
>> NodeList=ip-10-0-2-101 #CPUs=1
> OK, node allocated.
>
>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>> State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
> Job finishes.
>
>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
>> JobID=2 State=0x8000 NodeCnt=1 per user/system request
>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>> State=0x8000 NodeCnt=1 done
> Not sure of the implication of that "requeue" there, unless it's the
> transition to the CG state?
>
>> Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
>> ip-10-0-2-[101-102] not responding
>> Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
>> not responding, setting DOWN
>> Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
>> not responding, setting DOW
> Now the nodes stop responding (not before).
>
>> From these logs, it looks like the compute nodes are not
>> responding to the control node (master node).
>>
>> Not sure how to debug this - any tips?
> I would suggest looking at the slurmd logs on the compute nodes to see
> if they report any problems, and check to see what state the processes
> are in - especially if they're stuck in a 'D' state waiting on some form
> of device I/O.
>
> I know some people have reported strange interactions when Slurm
> is on an NFSv4 mount (NFSv3 is fine).
>
> Good luck!
> Chris
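
For the suggestion above about processes stuck in 'D' state, a minimal sketch of one way to list them on a compute node follows. It reads the state field from /proc/<pid>/stat on Linux; the function names are illustrative only. `ps` reports the same state letter in its STAT column, so this is just a scriptable equivalent.

```python
import os


def parse_stat_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """Return [(pid, comm)] for processes in uninterruptible sleep ('D')."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable; skip it
        if state == "D":
            stuck.append((pid, comm))
    return stuck


# Usage on a compute node:
#   for pid, comm in find_uninterruptible():
#       print(pid, comm)
```

A slurmd or slurmstepd process that stays in this list across repeated runs would be consistent with it waiting on I/O, for example on a hung NFS mount.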
