Thank you very much. I will try the patches from git.

Best regards,
Andrej

On 10/23/2012 10:31 PM, Matthieu Hautreux wrote:
> Andrej,
>
> a set of patches was applied to the current dev branch of Slurm (2.5,
> current master git branch) and should correct the issue you reported
> concerning the behavior of the task/cgroup memory subsystem logic.
>
> According to Moe, the official 2.5 version should be available within a
> month. If you want to try the fix against slurm-2.4.3, I can send you
> the patches, or you can get them from
> https://github.com/SchedMD/slurm/commits/master using the commit range
>   66e80a49ff...9a548ec199.
>
> Regards,
> Matthieu
>
> 2012/10/10 Andrej Filipcic <[email protected]>:
>>
>> Thanks for the extensive info. In the meantime, I have disabled
>> task/affinity and am using only task/cgroup, which results in far
>> fewer release_agent calls. Waiting for the new development then...
>>
>> Best regards,
>> Andrej
>>
>> On 10/10/2012 02:39 PM, Matthieu Hautreux wrote:
>>> Hi,
>>>
>>> the locking that you have removed is necessary to ensure the proper
>>> behavior of cgroup directory creation. Removing it could result in
>>> the memory cgroup plugin no longer working as expected, with some
>>> jobs or job steps not being run in a memory cgroup at all.
>>>
>>> This is mostly due to the fact that the cgroup directory hierarchy
>>> (uid/job_id/step_id) is removed automatically by the cgroup release
>>> agent mechanism rather than directly by the cgroup logic of SLURM.
>>> As a result, when creating a new step you can hit a race: you check
>>> that the job directory is present and then add the step directory,
>>> but in the meantime a release agent has removed the job directory,
>>> so the creation fails. To avoid that, the flock of the cgroup
>>> subsystem root directory was introduced. This logic was not designed
>>> with "high throughput" computing in mind, so it does not cope well
>>> with your workload.
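The check-then-create race and its flock-based fix can be sketched in miniature. This is a hedged illustration, not the actual Slurm release agent: the paths (`uid_1000/job_42/step_0`) and the helper names `create_step`/`release_job` are hypothetical stand-ins for the real hierarchy and logic.

```shell
#!/bin/sh
# Sketch: serialize cgroup-style mkdir/rmdir with an exclusive lock on
# the subsystem root. Without the lock, a release agent could remove
# the job directory between the "is it there?" check and the mkdir of
# the step directory. Paths and helpers are illustrative only.

root=$(mktemp -d)                  # stands in for the cgroup subsystem root
mkdir -p "$root/uid_1000/job_42"

# Slurm side: hold the lock while checking for the job directory and
# creating the step directory, so the two act as one atomic step.
create_step() {
    flock -x "$root" -c \
        "[ -d '$root/uid_1000/job_42' ] && mkdir '$root/uid_1000/job_42/step_0'"
}

# Release-agent side: take the same lock before tearing down the
# hierarchy, so teardown can never run between the check and the mkdir.
release_job() {
    flock -x "$root" -c \
        "rmdir '$root/uid_1000/job_42/step_0' '$root/uid_1000/job_42' 2>/dev/null"
}

create_step
release_job
```

The cost of this scheme is exactly what the thread describes: every agent invocation contends for one lock on the subsystem root, which serializes all teardown on the node.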
>>>
>>> Mark Grondona added the ability to remove the step-level cgroup
>>> directory directly in the SLURM logic in slurm-2.4.x, and I have
>>> also worked on applying the same logic at the job and user levels of
>>> the hierarchy, but that is not yet included in any official version
>>> of SLURM. I will work on this again and hope to have something that
>>> works better for slurm-2.5 (most probably in November, according to
>>> SchedMD). I hope the speedup will be sufficient for you.
>>>
>>> In the meantime, I would suggest no longer using the cgroup memory
>>> logic if you experience the issue I mentioned at the beginning of
>>> this email.
>>>
>>> Best regards,
>>> Matthieu
>>>
>>>
>>>
>>>
>>> 2012/10/1 Andrej Filipcic <[email protected]>:
>>>> I found out that release_memory is called many times for the same
>>>> path, unlike the other agents (e.g. cpuset): about 4k calls for 100 jobs.
>>>>
>>>> It seems to work much better if I replace this line:
>>>>           flock -x ${mountdir} -c "$0 sync $@"
>>>> with
>>>>           flock -x -w 2 ${rmcg} -c "$0 sync $@"
>>>>
>>>> So the lock is taken on the directory to be removed instead. I am
>>>> not sure whether this has any side effects... but at least no
>>>> excessive number of processes is created, and the memory cgroup
>>>> tree is cleaned up properly after all the jobs finish.
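The replacement line changes two things: the lock target shrinks from ${mountdir} (the whole subsystem mount) to ${rmcg} (just the cgroup being removed), and `-w 2` bounds the wait so queued agents give up instead of accumulating as blocked processes. A small, self-contained demonstration of the `-w` timeout behavior, with a scratch directory standing in for ${rmcg}:

```shell
#!/bin/sh
# Demonstrate flock's -w timeout, as used in the modified release-agent
# line: a contender that cannot get the lock within the window fails
# (exit status 1) instead of blocking indefinitely.

lockdir=$(mktemp -d)               # stands in for ${rmcg}

# Background holder: grab the lock and keep it for 5 seconds.
flock -x "$lockdir" -c "sleep 5" &
sleep 1                            # let the holder acquire the lock first

# Contender with -w 2: wait at most 2 seconds, then give up.
if flock -x -w 2 "$lockdir" -c "true"; then
    result="acquired"
else
    result="timed out"
fi
echo "$result"

wait                               # reap the background holder
rmdir "$lockdir"
```

With the timeout, agent processes that lose the race simply exit after two seconds rather than piling up by the thousands, which matches the behavior reported above.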
>>>>
>>>> Cheers,
>>>> Andrej
>>>>
>>>> On 09/30/2012 01:19 PM, Andrej Filipcic wrote:
>>>>> Hi,
>>>>>
>>>>> On 64-core nodes, while submitting many short jobs, the number of
>>>>> calls to the release_memory agent (a symlink to release_common in
>>>>> the slurm 2.4.3 release) can be extremely high. The script seems to
>>>>> be too slow for the memory subsystem, which results in a few tens
>>>>> of thousands of agent processes being spawned in a short time after
>>>>> job completion, and those processes stay alive for a long time. In
>>>>> extreme cases, pid numbers can be exhausted, preventing new
>>>>> processes from being spawned. As a partial fix, I commented out the
>>>>> "sleep 1" in the sync part of the script, but there can still be up
>>>>> to a few thousand processes after 64 jobs complete at roughly the
>>>>> same time.
>>>>>
>>>>> Each job has about 10 processes, so the number of agent calls can be high.
>>>>>
>>>>> I did not notice this on nodes with a lower number of cores/jobs,
>>>>> and the problem is not present for the other cgroup subsystems.
>>>>>
>>>>> Any advice on how to fix this problem?
>>>>>
>>>>> Cheers,
>>>>> Andrej
>>>>>
>>>> --
>>>> _____________________________________________________________
>>>>      prof. dr. Andrej Filipcic,   E-mail: [email protected]
>>>>      Department of Experimental High Energy Physics - F9
>>>>      Jozef Stefan Institute, Jamova 39, P.o.Box 3000
>>>>      SI-1001 Ljubljana, Slovenia
>>>>      Tel.: +386-1-477-3674    Fax: +386-1-477-3166
>>>> -------------------------------------------------------------
>>


