On Fri, 8 Jun 2012, Rayson Ho wrote:
> ...
> I am working with other software that uses cgroups, and someone
> pointed me to this email yesterday:
> http://www.spinics.net/lists/cgroups/msg02622.html
> I don't think the per-cgroup AS limit is that hard - in the end, the
> "memory.memsw.usage_in_bytes" file already shows the recorded
> memory+swap usage. I believe the kernel does its own internal
> accounting when memory is allocated, so it may be a matter of
> enforcing the limit in a different way.
> http://www.kernel.org/doc/Documentation/cgroups/memory.txt
> Of course, the difficult part is that the kernel overcommits memory,
> so it's likely that actual memory is not allocated & accounted by
> "memory.memsw.usage_in_bytes" until page faults occur - which is what
> we discussed previously...
Hi Rayson,
Thanks for the thorough response - it's much appreciated :)
I guess it's less an issue of whether a per-cgroup AS limit is hard to
put into cgroups, and more about whether it's generally useful to people
in more cases than just maintaining backwards compatibility with older
versions of gridengine.
If I could wave a wand and get what I wished for from the kernel fairy,
I'd rather have a per-cgroup memory overcommit setting (that actually
works: I read the link you quoted above - yikes!) and drop the PDC per-job
AS behaviour.
It might not play well with maintaining compatibility, but I've always
thought the per-job AS stuff to be a bit of a kludge (though I'm sure it
was the best option available at the time it was written).
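As an aside, the page-fault accounting behaviour you describe is easy to
see from userspace. Here's a rough sketch (the cgroup mount point and
group name are assumptions - adjust for however your distro mounts the
v1 memory controller, and run it from inside the cgroup):

  /* Sketch: memory.memsw.usage_in_bytes only grows when pages are
   * touched, not at malloc() time.  The cgroup path below is an
   * assumption - adjust to wherever your memory controller lives.
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define USAGE "/sys/fs/cgroup/memory/mygroup/memory.memsw.usage_in_bytes"

  static long read_usage(void)
  {
      FILE *f = fopen(USAGE, "r");
      long v = -1;
      if (f) {
          if (fscanf(f, "%ld", &v) != 1)
              v = -1;
          fclose(f);
      }
      return v;
  }

  int main(void)
  {
      size_t sz = 256UL << 20;       /* reserve 256 MiB of address space */
      printf("before malloc: %ld\n", read_usage());
      char *p = malloc(sz);          /* AS grows; usage barely moves */
      printf("after malloc:  %ld\n", read_usage());
      memset(p, 1, sz);              /* page faults: now it gets charged */
      printf("after touch:   %ld\n", read_usage());
      free(p);
      return 0;
  }

On an overcommitting kernel, the first two reads are much the same and
only the third jumps - which is exactly why summed AS and memsw usage
can diverge so badly.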
> ...
> Maybe I should also add a test case to test for it. But can you think
> of a case where:
>     ( h_vmem >= memory.memsw.usage_in_bytes )
> is FALSE?
I am not a kernel hacker, but I worry about how anonymous stuff like I/O
caches is dealt with - and about interactions with some of the more
exotic stuff like InfiniBand, Lustre, GPFS, etc.
I've seen some wacky behaviour with memory and MPI over InfiniBand, for
example.
> ...
> IMO, if we still poll the /proc filesystem for the h_vmem (ie. sum of
> h_vmem of all processes in a job) periodically but less frequently,
> then it should not be a real issue. If a process exceeds the h_vmem
> limit, then it also means that it exceeds the limit imposed by
> setrlimit(2), which is also set even when OGS/GE is using cgroups. So
> with procfs PDC or cgroups PDC, the process would get the same
> treatment from the kernel... But if the sum of h_vmem of all processes
> of a job exceeds the job's h_vmem, then the periodic procfs poll would
> still catch this case, and the action taken would be the same in both
> cases.
> ...
> As long as innocent jobs don't get killed, and system performance is
> not hurt by the cgroups integration, then everyone is happy...
> Not sure if I have covered all cases... or am I still missing
> something?
I think that what you've said is true, until the last two paragraphs ;)
When I read that you set the AS setrlimit "even when OGS/GE is using
cgroups" and "if the sum of h_vmem of all processes", I suspect you're not
fixing my main issue with gridengine's existing h_vmem mechanism
(apologies if I've misunderstood).
h_vmem is NOT a good proxy for memory usage (in the sense of the
memory.memsw.usage_in_bytes definition), as it can vastly over-estimate
it in various common modern cases - as I've described in an earlier
message. It should therefore not be used for this.
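To be concrete about the numbers I mean, here's a quick sketch (the
per-job summing and most error handling are left out; field names are
as per proc(5)):

  /* Sketch: pull VmRSS and VmSwap (kB) out of /proc/<pid>/status and
   * compare them with VmSize, which is what h_vmem polling sums today.
   */
  #include <stdio.h>
  #include <string.h>

  static long field_kb(const char *path, const char *key)
  {
      FILE *f = fopen(path, "r");
      char line[256];
      long kb = 0;                   /* 0 if the field is absent */
      if (!f)
          return 0;
      while (fgets(line, sizeof line, f))
          if (strncmp(line, key, strlen(key)) == 0)
              sscanf(line + strlen(key), " %ld", &kb);
      fclose(f);
      return kb;
  }

  int main(int argc, char **argv)
  {
      char path[64];
      snprintf(path, sizeof path, "/proc/%s/status",
               argc > 1 ? argv[1] : "self");
      printf("VmSize:       %ld kB (what gets summed today)\n",
             field_kb(path, "VmSize:"));
      printf("VmRSS+VmSwap: %ld kB (what I'd rather sum per job)\n",
             field_kb(path, "VmRSS:") + field_kb(path, "VmSwap:"));
      return 0;
  }

Run it against a process with a big sparse heap and the two lines can
differ wildly.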
In summary:
* I do NOT believe that AS should be summed when the processes of a job
are being polled. Instead it should be RSS+SWAP (or similar).
* I DO believe that the per-process AS setrlimit should be settable to a
value specified by the user (but defaulting to unlimited), to maintain the
existing functionality described in earlier messages.
This would revert s_vmem and h_vmem to being treated by gridengine just
the same as ordinary setrlimit resources, in exactly the same way as s_rss
/ h_rss and s_data / h_data and friends are, without any per-job
interpretation.
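On the per-process side, something like the following is all I'm after
(a sketch only - apply_as_limits and its arguments are names I've made
up for illustration, not anything from the gridengine source):

  /* Sketch: apply the user's requested s_vmem/h_vmem as a plain
   * per-process RLIMIT_AS before exec'ing the job, passing
   * RLIM_INFINITY for any limit the user didn't request (the default).
   */
  #include <sys/resource.h>

  static int apply_as_limits(rlim_t s_vmem, rlim_t h_vmem)
  {
      /* s_ -> soft limit, h_ -> hard limit, as for h_data etc. */
      struct rlimit rl = { .rlim_cur = s_vmem, .rlim_max = h_vmem };
      return setrlimit(RLIMIT_AS, &rl);
  }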
I appreciate that this will cause more compatibility breakage than I
think your approach would, but it's a much-needed break from the past
which takes advantage of cgroups to greatly improve utilisation of the
resources available.
How does that sound?
Cheers,
Mark
--
-----------------------------------------------------------------
Mark Dixon Email : [email protected]
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------