It took me about four weeks to track that down, because the job runtime and
the cgroup usage seemed to affect whether the node actually froze. Short jobs
were OK, but longer ones reliably caused the kernel to hang with those annoying
"task didn't react for more than 120 sec"
messages. The effect was that Slurm could no longer communicate with the node
or drain it.

        Uwe


On 24.03.2015 at 21:12, Paul Edmon wrote:
> 
> Interesting.  Yeah we use v3 here.  Hadn't tried out v4, and good thing we 
> didn't then.
> 
> -Paul Edmon-
> 
> On 03/24/2015 04:05 PM, Uwe Sauter wrote:
>> And if you are planning on using cgroups, don't use NFSv4. There are 
>> problems that cause the NFS client process to freeze (and
>> with that freeze the node) when the cgroup removal script is called.
>>
>> Regards,
>>
>>     Uwe Sauter
>>
>> On 24.03.2015 at 20:50, Paul Edmon wrote:
>>> Yup, that's exactly what we do.  We make sure to export it read only and 
>>> make sure that it is synced and hard mounted.  Not much
>>> else to it.
>>>
>>> -Paul Edmon-
>>>
>>> On 03/24/2015 03:43 PM, Jeff Layton wrote:
>>>> Good afternoon,
>>>>
>>>> I apologize for the newbie question, but I'm setting up slurm
>>>> for the first time in a very long time. I've got a small cluster
>>>> of a master node and 4 compute nodes. I'd like to install
>>>> slurm on an NFS file system that is exported from the master
>>>> node and mounted on the compute nodes. I've been reading
>>>> a bit about this but does anyone have recommendations on
>>>> what to watch out for?
>>>>
>>>> Thanks!
>>>>
>>>> Jeff
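
The advice in this thread (use NFSv3 rather than NFSv4 when cgroups are in
use, and hard-mount the export) could be checked on a compute node with
something like the sketch below. It is only an illustration, not something
posted by the participants: the function name check_nfs_mounts is
hypothetical, and it assumes the usual Linux /proc/mounts layout (device,
mount point, filesystem type, options) with NFS mounts reported as type
"nfs" or "nfs4" and options such as "vers=3" and "hard".

```python
def check_nfs_mounts(mounts_text):
    """Return (mountpoint, problem) pairs for NFS mounts that go against
    the advice in this thread. Hypothetical helper, for illustration only.

    mounts_text is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    problems = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mountpoint, fstype, options = fields[1], fields[2], fields[3]
        if fstype not in ("nfs", "nfs4"):
            continue  # only NFS mounts are of interest here
        opts = options.split(",")
        # Pick up the protocol version from a "vers=..." option, if present.
        version = ""
        for opt in opts:
            if opt.startswith("vers="):
                version = opt.split("=", 1)[1]
        if fstype == "nfs4" or version.startswith("4"):
            problems.append((mountpoint, "NFSv4"))
        if "hard" not in opts:
            problems.append((mountpoint, "not hard-mounted"))
    return problems
```

On a node it could be called as check_nfs_mounts(open("/proc/mounts").read());
an empty list means no NFS mount was flagged.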
