[slurm-dev] Re: Problems running job

Uwe Sauter Tue, 31 Mar 2015 08:12:44 -0700

Yes! There are problems if the clean-up scripts for cgroups reside on NFSv4.
Nodes will lock up when they try to remove a job's cgroup.
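
One way to check whether a given file sits on an NFS mount is to compare its path against the mount table. The following is only a minimal sketch, assuming a Linux node where /proc/mounts is readable. The helper names (load_mounts, parse_mounts, fs_type_for, on_nfs) are made up for illustration and are not part of Slurm, and symlinks are not resolved.

```python
import os


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fs_type) pairs."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # Field 1 is the mount point, field 2 the filesystem type.
            mounts.append((fields[1], fields[2]))
    return mounts


def load_mounts(path="/proc/mounts"):
    """Read the kernel mount table (Linux only)."""
    with open(path) as f:
        return parse_mounts(f.read())


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point containing path."""
    path = os.path.normpath(path)
    best_mp, best_fs = "", None
    for mount_point, fs_type in mounts:
        mp = mount_point.rstrip("/") or "/"
        inside = path == mp or mp == "/" or path.startswith(mp + "/")
        # ">=" lets a later mount on the same point shadow an earlier one.
        if inside and len(mp) >= len(best_mp):
            best_mp, best_fs = mp, fs_type
    return best_fs


def on_nfs(path, mounts):
    """True if path lives on an nfs or nfs4 mount."""
    fs = fs_type_for(path, mounts)
    return fs is not None and fs.startswith("nfs")
```

On a node, something like `on_nfs("/var/spool/slurm", load_mounts())` would then report whether that directory is NFS-backed. Substitute the directory where your cgroup clean-up scripts actually live.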



Am 31.03.2015 um 17:06 schrieb Jeff Layton:
> 
> That's what I've done. Everything is in NFSv4 except for a
> few bits:
> 
> /etc/slurm.conf
> /etc/init.d/slurm
> /var/log/slurm
> /var/run/slurm
> /var/spool/slurm
> 
> These bits are local to the node.
> 
> Will slurm have trouble in this case?
> 
> Thanks!
> 
> Jeff
> 
>> The NFSv4 problem mentioned arises when your SLURM installation itself is
>> kept on NFS, i.e. exported to but not physically residing on your nodes.
>>
>> -- 
>> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
>>   || \\UTGERS      |---------------------*O*---------------------
>>   ||_// Biomedical | Ryan Novosielski - Senior Technologist
>>   || \\ and Health | [email protected] - 973/972.0922 (2x0922)
>>   ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>>        `'
>> ________________________________________
>> From: Jeff Layton [[email protected]]
>> Sent: Tuesday, March 31, 2015 10:28 AM
>> To: slurm-dev
>> Subject: [slurm-dev] Re: Problems running job
>>
>> Chris and David,
>>
>> Thanks for the help! I'm still trying to find out why the
>> compute nodes are down or not responding. Any tips
>> on where to start?
>>
>> How about open ports? Right now I have 6817 and
>> 6818 open as per my slurm.conf. I also have 22 and 80
>> open as well as 111, 2049, and 32806. I'm using NFSv4
>> but don't know if that is causing the problem or not
>> (I REALLY want to stick to NFSv4).
>>
>> Thanks!
>>
>> Jeff
>>
>>> On 31/03/15 07:31, Jeff Layton wrote:
>>>
>>>> Good afternoon!
>>> Hiya Jeff,
>>>
>>> [...]
>>>> But it doesn't seem to run. Here is the output of sinfo
>>>> and squeue:
>>> [...]
>>>
>>> Actually it does appear to get started (at least), but..
>>>
>>>> [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
>>>> JOBID PARTITION     NAME     USER ST TIME  NODES NODELIST(REASON)
>>>>     2     debug slurmtes ec2-user CG 0:00      1 ip-10-0-2-101
>>> ...the CG state you see there is the completing state, i.e. the state
>>> when a job is finishing up.
>>>
>>>> The system logs on the master node (controller node) don't show too much:
>>>>
>>>> Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
>>>> _slurm_rpc_submit_batch_job JobId=2 usec=239
>>>> Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
>>>> NodeList=ip-10-0-2-101 #CPUs=1
>>> OK, node allocated.
>>>
>>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>>>> State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
>>> Job finishes.
>>>
>>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
>>>> JobID=2 State=0x8000 NodeCnt=1 per user/system request
>>>> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
>>>> State=0x8000 NodeCnt=1 done
>>> Not sure of the implication of that "requeue" there, unless it's the
>>> transition to the CG state?
>>>
>>>> Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
>>>> ip-10-0-2-[101-102] not responding
>>>> Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
>>>> not responding, setting DOWN
>>>> Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
>>>> not responding, setting DOW
>>> Now the nodes stop responding (not before).
>>>
>>>> From these logs, it looks like the compute nodes are not
>>>> responding to the control node (master node).
>>>>
>>>> Not sure how to debug this - any tips?
>>> I would suggest looking at the slurmd logs on the compute nodes to see
>>> if they report any problems, and check to see what state the processes
>>> are in - especially if they're stuck in a 'D' state waiting on some form
>>> of device I/O.
>>>
>>> I know some people have reported strange interactions when Slurm
>>> is on an NFSv4 mount (NFSv3 is fine).
>>>
>>> Good luck!
>>> Chris
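
Chris's suggestion to look for processes stuck in the 'D' (uninterruptible sleep) state can be sketched roughly as below. This assumes a Linux compute node with /proc mounted. The function names parse_stat and uninterruptible are hypothetical helpers, not anything shipped with Slurm, and the sketch only reports what it finds; it does not diagnose the cause.

```python
import os


def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first "(" and the last ")".
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or unreadable entry
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Running `uninterruptible()` on a hung compute node and seeing slurmd or slurmstepd in the result would be consistent with the process waiting on I/O, such as an unresponsive NFS mount.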
