On Fri, 8 Jun 2012, Rayson Ho wrote:
> ...
> I am working with other software that uses cgroups, and someone
> pointed me to this email yesterday:
> http://www.spinics.net/lists/cgroups/msg02622.html
> I don't think the per-cgroup AS limit is that hard - in the end, the
> "memory.memsw.usage_in_bytes" file already shows the recorded
> memory+swap usage. I believe the kernel does its own internal
> accounting when memory is allocated, so it may be a matter of
> enforcing the limit in a different way.
> http://www.kernel.org/doc/Documentation/cgroups/memory.txt
> Of course, the difficult part is that the kernel overcommits memory,
> so it's likely that actual memory is not allocated & accounted by
> "memory.memsw.usage_in_bytes" until page faults occur - which is what
> we discussed previously...
Hi Rayson,
Thanks for the thorough response - it's much appreciated :)
I guess it's less an issue of whether a per-cgroup AS limit is hard to
put into cgroups, and more about whether it's generally useful to people
in more cases than just maintaining backwards compatibility with older
versions of gridengine.
If I could wave a wand and get what I wished for from the kernel fairy,
I'd rather have a per-cgroup memory overcommit setting (that actually
works: I read the link you quoted above - yikes!) and drop the PDC per-job
AS behaviour.
It might not play well with maintaining compatibility, but I've always
thought the per-job AS stuff to be a bit of a kludge (though I'm sure it
was the best option available at the time it was written).
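As an aside, the page-fault accounting behaviour you describe is easy to
see from userspace. Here's a rough sketch (the cgroup mount point and
group name are assumptions - adjust for however your distro mounts the
v1 memory controller, and run it from inside the cgroup):

  /* Sketch: memory.memsw.usage_in_bytes only grows when pages are
   * touched, not at malloc() time.  The cgroup path below is an
   * assumption - adjust to wherever your memory controller lives.
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define USAGE "/sys/fs/cgroup/memory/mygroup/memory.memsw.usage_in_bytes"

  static long read_usage(void)
  {
      FILE *f = fopen(USAGE, "r");
      long v = -1;
      if (f) {
          if (fscanf(f, "%ld", &v) != 1)
              v = -1;
          fclose(f);
      }
      return v;
  }

  int main(void)
  {
      size_t sz = 256UL << 20;       /* reserve 256 MiB of address space */
      printf("before malloc: %ld\n", read_usage());
      char *p = malloc(sz);          /* AS grows; usage barely moves */
      printf("after malloc:  %ld\n", read_usage());
      memset(p, 1, sz);              /* page faults: now it gets charged */
      printf("after touch:   %ld\n", read_usage());
      free(p);
      return 0;
  }

On an overcommitting kernel, the first two reads are much the same and
only the third jumps - which is exactly why summed AS and memsw usage
can diverge so badly.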
> ...
> Maybe I should also add a test case to test for it. But can you think
> of a case where:
>     ( h_vmem >= memory.memsw.usage_in_bytes )
> is FALSE?
I am not a kernel hacker, but I worry about how anonymous stuff like I/O
caches is dealt with - and about interactions with some of the more
exotic stuff like InfiniBand, Lustre, GPFS, etc.
I've seen some wacky behaviour with memory and MPI over InfiniBand, for
example.
> ...
> IMO, if we still poll the /proc filesystem for the h_vmem (ie. sum of
> h_vmem of all processes in a job) periodically but less frequently,
> then it should not be a real issue. If a process exceeds the h_vmem
> limit, then it also means that it exceeds the limit imposed by
> setrlimit(2), which is also set even when OGS/GE is using cgroups. So
> with procfs PDC or cgroups PDC, the process would get the same
> treatment from the kernel... But if the sum of h_vmem of all processes
> of a job exceeds the job's h_vmem, then the periodic procfs poll would
> still catch this case, and the action taken would be the same in both
> cases.
> ...
> As long as innocent jobs don't get killed, and system performance is
> not hurt by the cgroups integration, then everyone is happy...
> Not sure if I have covered all cases... or am I still missing
> something?
I think that what you've said is true, until the last two paragraphs ;)
When I read that you set the AS setrlimit "even when OGS/GE is using
cgroups" and "if the sum of h_vmem of all processes", I suspect you're not
fixing my main issue with gridengine's existing h_vmem mechanism
(apologies if I've misunderstood).
h_vmem is NOT a good proxy for memory usage (in the sense of the
memory.memsw.usage_in_bytes definition), as it can vastly over-estimate
it in various common modern cases - as I've described in an earlier
message. It should therefore not be used for this.
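To be concrete about the numbers I mean, here's a quick sketch (the
per-job summing and most error handling are left out; field names are
as per proc(5)):

  /* Sketch: pull VmRSS and VmSwap (kB) out of /proc/<pid>/status and
   * compare them with VmSize, which is what h_vmem polling sums today.
   */
  #include <stdio.h>
  #include <string.h>

  static long field_kb(const char *path, const char *key)
  {
      FILE *f = fopen(path, "r");
      char line[256];
      long kb = 0;                   /* 0 if the field is absent */
      if (!f)
          return 0;
      while (fgets(line, sizeof line, f))
          if (strncmp(line, key, strlen(key)) == 0)
              sscanf(line + strlen(key), " %ld", &kb);
      fclose(f);
      return kb;
  }

  int main(int argc, char **argv)
  {
      char path[64];
      snprintf(path, sizeof path, "/proc/%s/status",
               argc > 1 ? argv[1] : "self");
      printf("VmSize:       %ld kB (what gets summed today)\n",
             field_kb(path, "VmSize:"));
      printf("VmRSS+VmSwap: %ld kB (what I'd rather sum per job)\n",
             field_kb(path, "VmRSS:") + field_kb(path, "VmSwap:"));
      return 0;
  }

Run it against a process with a big sparse heap and the two lines can
differ wildly.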
In summary:
* I do NOT believe that AS should be summed when the processes of a job
are being polled. Instead it should be RSS+SWAP (or similar).
* I DO believe that the per-process AS setrlimit should be settable to a
value specified by the user (but defaulting to unlimited), to maintain the
existing functionality described in earlier messages.
This would revert s_vmem and h_vmem to being treated by gridengine just
the same as ordinary setrlimit resources, in exactly the same way as s_rss
/ h_rss and s_data / h_data and friends are, without any per-job
interpretation.
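On the per-process side, something like the following is all I'm after
(a sketch only - apply_as_limits and its arguments are names I've made
up for illustration, not anything from the gridengine source):

  /* Sketch: apply the user's requested s_vmem/h_vmem as a plain
   * per-process RLIMIT_AS before exec'ing the job, passing
   * RLIM_INFINITY for any limit the user didn't request (the default).
   */
  #include <sys/resource.h>

  static int apply_as_limits(rlim_t s_vmem, rlim_t h_vmem)
  {
      /* s_ -> soft limit, h_ -> hard limit, as for h_data etc. */
      struct rlimit rl = { .rlim_cur = s_vmem, .rlim_max = h_vmem };
      return setrlimit(RLIMIT_AS, &rl);
  }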
I appreciate that this will cause more compatibility breakage than I
think your approach would, but it's a much-needed break from the past
which takes advantage of cgroups to greatly improve utilisation of the
resources available.
How does that sound?
Cheers,
Mark
--
-----------------------------------------------------------------
Mark Dixon Email : [email protected]
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------