I've had exactly the same experience. The JVM sizes its default heap (and the address space it reserves) from the host's total memory rather than from any job limit, so you have to set h_vmem to much more than the Java application's real memory use before the job will run at all. Reuti, have you dealt with this problem? Brian, could you share the memkiller script you use? Below the quote I've sketched the consumable setup Reuti describes, my guess at what memkiller looks like, and an oom_adj snippet for Ben's original question.
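To make that concrete, here is roughly what we end up submitting for a job whose real heap need is about 2 GB. The 3x headroom factor, the stack size, and app.jar are just illustrations of our rule of thumb, nothing principled:

    $ cat run_java.sh
    #!/bin/sh
    # Cap the heap and per-thread stack explicitly; otherwise the JVM
    # derives its defaults from the host's total RAM, not from h_vmem.
    exec java -Xmx2g -Xss1m -jar app.jar

    # Request roughly 3x the heap as h_vmem, since the JVM's virtual
    # footprint also covers thread stacks, the JIT code cache, permgen,
    # and mmap'd jars:
    $ qsub -cwd -l h_vmem=6G run_java.sh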
Thanks,
Peter

On 08/29/2012 06:09 PM, Brian Smith wrote:
> We found h_vmem to be highly unpredictable, especially with java-based
> applications. Stack settings were screwed up, certain applications
> wouldn't launch (segfaults), and hard limits were hard to determine
> for things like MPI applications. When your master has to launch 1024
> MPI sub-tasks (qrsh), it generally eats up more VMEM than the slave
> tasks do. It was just hard to get right.
>
> -Brian
>
> Brian Smith
> Sr. System Administrator
> Research Computing, University of South Florida
> 4202 E. Fowler Ave. SVC4010
> Office Phone: +1 813 974-1467
> Organization URL: http://rc.usf.edu
>
> On 08/29/2012 11:33 AM, Reuti wrote:
>> Am 29.08.2012 um 17:21 schrieb Brian Smith:
>>
>>> We use the mem_free variable as a consumable. Then, we use a cronjob
>>> called memkiller that terminates jobs if they go over their
>>> requested (or default) memory allocation and
>>
>> It would be more straightforward to use h_vmem directly. This is
>> controlled by SGE, and a job exceeding the limit will be killed by
>> SGE. If you consume it as a consumable on an exechost level, it can
>> be set to the installed physical memory.
>>
>> Was there any reason to use mem_free?
>>
>> -- Reuti
>>
>>> 1. Swap space on the node is used
>>> 2. Swap rate is greater than 100 I/Os per second
>>>
>>> The user gets emailed a report if this happens.
>>>
>>> This has made dealing with the oom killer a thing of the past in our
>>> shop.
>>>
>>> We manage memory on the principle that swap should NEVER be used.
>>> If you're hitting the oom killer, you're pretty far beyond that in
>>> terms of memory utilization; if performance is a consideration, IMHO
>>> you should be scheduling your memory usage accordingly. The oom
>>> killer shouldn't be a factor if memory is handled as a scheduler
>>> consideration.
>>>
>>> -Brian
>>>
>>> Brian Smith
>>> Sr. System Administrator
>>> Research Computing, University of South Florida
>>> 4202 E. Fowler Ave. SVC4010
>>> Office Phone: +1 813 974-1467
>>> Organization URL: http://rc.usf.edu
>>>
>>> On 08/29/2012 11:02 AM, Ben De Luca wrote:
>>>> I was wondering how people deal with oom conditions on their
>>>> cluster. We constantly have machines that die because the oom
>>>> killer takes out critical system services.
>>>>
>>>> Does anyone have experience with the oom_adj proc value, or a patch
>>>> to Grid Engine to support it?
>>>>
>>>> /proc/[pid]/oom_adj (since Linux 2.6.11)
>>>>        This file can be used to adjust the score used to select
>>>>        which process should be killed in an out-of-memory (OOM)
>>>>        situation.  The kernel uses this value for a bit-shift
>>>>        operation of the process's oom_score value: valid values are
>>>>        in the range -16 to +15, plus the special value -17, which
>>>>        disables OOM-killing altogether for this process.  A
>>>>        positive score increases the likelihood of this process
>>>>        being killed by the OOM-killer; a negative score decreases
>>>>        the likelihood.  The default value for this file is 0; a new
>>>>        process inherits its parent's oom_adj setting.  A process
>>>>        must be privileged (CAP_SYS_RESOURCE) to update this file.
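For anyone who wants to try Reuti's suggestion, my understanding of the setup is the following. This is a sketch only; node01 and 48G are placeholders for your own hosts and their installed RAM:

    # 1. Make h_vmem consumable: "qconf -mc" opens the complex list in
    #    $EDITOR; flip the CONSUMABLE column for h_vmem to YES:
    #
    #    #name    shortcut  type    relop  requestable  consumable  default  urgency
    #    h_vmem   h_vmem    MEMORY  <=     YES          YES         0        0
    #
    # 2. Book each exec host's physical memory as the amount SGE may
    #    hand out on that host:
    qconf -mattr exechost complex_values h_vmem=48G node01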
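And since I asked about memkiller: until Brian posts the real thing, here is a guess at its shape, purely a sketch with made-up thresholds. It implements the two swap conditions quoted above but, unlike the real script presumably does, it skips the per-job mem_free accounting and just goes after the job owning the fattest process:

    #!/bin/sh
    # memkiller sketch (NOT Brian's actual script); run from cron on
    # each exec host.
    ADMIN=root@localhost    # placeholder address

    # Condition 1: any swap in use at all.
    SWAP_USED_KB=$(awk '/SwapTotal/{t=$2} /SwapFree/{f=$2} END{print t-f}' /proc/meminfo)

    # Condition 2: paging rate above ~100 I/Os per second. The last
    # line of "vmstat 1 2" is a live sample; si/so are fields 7 and 8.
    set -- $(vmstat 1 2 | tail -1)
    SWAP_RATE=$(($7 + $8))

    [ "$SWAP_USED_KB" -le 0 ] && [ "$SWAP_RATE" -le 100 ] && exit 0

    # Walk up from the fattest process to its shepherd; shepherds show
    # up in ps as "sge_shepherd-<jobid>".
    PID=$(ps -eo pid --sort=-rss | sed -n 2p)
    JOBID=""
    while [ "$PID" -gt 1 ] 2>/dev/null; do
        ARGS=$(ps -o args= -p "$PID")
        case "$ARGS" in
            sge_shepherd-*)
                JOBID=$(printf '%s\n' "$ARGS" | sed 's/^sge_shepherd-\([0-9]*\).*/\1/')
                break ;;
        esac
        PID=$(ps -o ppid= -p "$PID" | tr -d ' ')
    done

    if [ -n "$JOBID" ]; then
        qdel "$JOBID"
        echo "memkiller: killed job $JOBID on $(hostname)" | mail -s "memkiller report" "$ADMIN"
    fi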
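Finally, on Ben's original question: even without patching Grid Engine, you can exempt the critical daemons themselves at boot or from cron. A sketch follows; the service list is just an example, and kernels from 2.6.36 on use /proc/[pid]/oom_score_adj instead:

    #!/bin/sh
    # Exempt critical daemons from the OOM killer; -17 disables it for
    # the process (needs root / CAP_SYS_RESOURCE).
    for DAEMON in sshd syslogd sge_execd; do    # example list only
        for PID in $(pidof "$DAEMON"); do
            echo -17 > "/proc/$PID/oom_adj"
        done
    done
    # Caveat: children inherit oom_adj, so shepherds and jobs forked by
    # sge_execd after this runs are exempt too. If you protect
    # sge_execd this way, reset the value back to 0 in a queue prolog.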
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
