[slurm-dev] Re: Problems running job

Mehdi Denou Tue, 31 Mar 2015 07:32:14 -0700

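Put slurmd and slurmctld in debug mode (for example, raise SlurmdDebug
and SlurmctldDebug in slurm.conf, or run the daemons in the foreground
with -D -vvv), retry the submission, and then send the logs.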


On 31/03/2015 16:28, Jeff Layton wrote:
>
> Chris and David,
>
> Thanks for the help! I'm still trying to find out why the
> compute nodes are down or not responding. Any tips
> on where to start?
>
> How about open ports? Right now I have 6817 and
> 6818 open as per my slurm.conf. I also have 22 and 80
> open as well as 111, 2049, and 32806. I'm using NFSv4
> but don't know if that is causing the problem or not
> (I REALLY want to stick to NFSv4).
>
> Thanks!
>
> Jeff
>
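
On the ports question: one quick way to rule out a firewall or security
group problem is to test plain TCP reachability in both directions
(compute node to controller on 6817, controller to compute nodes on
6818). Below is a minimal sketch, assuming the port numbers Jeff quotes
from his slurm.conf; port_is_open and check_nodes are hypothetical helper
names, not part of Slurm. A successful connect only shows the port is
reachable, not that the daemon behind it is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_nodes(nodes, port, timeout=3.0):
    """Map each host name to True/False depending on whether port is reachable."""
    return {node: port_is_open(node, port, timeout) for node in nodes}
```

For example, check_nodes(["ip-10-0-2-101", "ip-10-0-2-102"], 6818) run on
the controller reports whether slurmd can be reached on each compute
node, and port_is_open("ip-10-0-1-72", 6817) run on a compute node does
the same for slurmctld.
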
>> On 31/03/15 07:31, Jeff Layton wrote:
>>
>>> Good afternoon!
>> Hiya Jeff,
>>
>> [...]
>>> But it doesn't seem to run. Here is the output of sinfo
>>> and squeue:
>> [...]
>>
>> Actually it does appear to get started (at least), but..
>>
>>> [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
>>>   JOBID PARTITION     NAME     USER ST TIME  NODES NODELIST(REASON)
>>>       2     debug slurmtes ec2-user CG 0:00      1 ip-10-0-2-101
>> ...the CG state you see there is the completing state, i.e. the state
>> when a job is finishing up.
>>
>>> The system logs on the master node (controller node) don't show too
>>> much:
>>>
>>> Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
>>> _slurm_rpc_submit_batch_job JobId=2 usec=239
>>> Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
>>> NodeList=ip-10-0-2-101 #CPUs=1
>> OK, node allocated.
>>
>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>>> State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
>> Job finishes.
>>
>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
>>> JobID=2 State=0x8000 NodeCnt=1 per user/system request
>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>>> State=0x8000 NodeCnt=1 done
>> Not sure of the implication of that "requeue" there, unless it's the
>> transition to the CG state?
>>
>>> Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
>>> ip-10-0-2-[101-102] not responding
>>> Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes
>>> ip-10-0-2-102
>>> not responding, setting DOWN
>>> Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes
>>> ip-10-0-2-101
>>> not responding, setting DOWN
>> Now the nodes stop responding (not before).
>>
>>> From these logs, it looks like the compute nodes are not
>>> responding to the control node (master node).
>>>
>>> Not sure how to debug this - any tips?
>> I would suggest looking at the slurmd logs on the compute nodes to see
>> if they report any problems, and check to see what state the processes
>> are in - especially if they're stuck in a 'D' state waiting on some form
>> of device I/O.
>>
>> I know some people have reported strange interactions when Slurm is
>> on an NFSv4 mount (NFSv3 is fine).
>>
>> Good luck!
>> Chris
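
To follow up on Chris's suggestion about processes stuck in 'D' state:
here is a minimal sketch that lists them on a compute node, assuming a
Linux procps-style ps that accepts "-eo pid,stat,comm". The function
names are hypothetical, and the PIDs in the example further down are
made up for illustration.

```python
import subprocess


def parse_d_state(ps_output):
    """Return (pid, stat, command) for every process whose state starts with 'D'."""
    stuck = []
    for line in ps_output.splitlines():
        fields = line.strip().split(None, 2)
        # Skip the header row and any malformed line: the first column must be a PID.
        if len(fields) == 3 and fields[0].isdigit() and fields[1].startswith("D"):
            stuck.append((int(fields[0]), fields[1], fields[2]))
    return stuck


def find_d_state():
    """Run ps on this node and return the processes in uninterruptible sleep."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_d_state(result.stdout)
```

If ps printed a line such as " 4321 D    slurmd", find_d_state() would
return [(4321, "D", "slurmd")]. A slurmd or slurmstepd that stays in
that list across several runs would point at blocked I/O, for instance
on the NFS mount.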

-- 
---
Mehdi Denou
International HPC support
+336 45 57 66 56
