On 30.08.2012, at 11:48, Peter van Heusden wrote:

> I've had exactly the same experience. Java seems to do some kind of
> calculation based on total system memory, and you need to size h_vmem to
> much more than the Java application's real memory use in order to make
> the job run.
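What Peter describes is consistent with the JVM deriving its default maximum heap from the node's total RAM and reserving that address space up front, so the reservation alone can exceed an h_vmem request sized after the application's real memory use. The usual workaround is to cap the heap explicitly and leave some headroom under h_vmem for the JVM's native overhead; a minimal sketch, in which the 6G/4G figures and the script and jar names are only illustrative:

    #!/bin/sh
    # java_job.sh (hypothetical) -- submitted with e.g.:
    #   qsub -l h_vmem=6G java_job.sh
    # Cap the heap explicitly instead of letting the JVM default it to a
    # fraction of the node's total RAM, and leave headroom under h_vmem
    # for thread stacks, the JIT and other native allocations.
    java -Xmx4g -Xms512m -jar myapp.jar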
> Reuti, have you dealt with this problem?

No, we don't run Java applications.

> Brian, could you share the memkiller script you use?
>
> Thanks,
> Peter
>
> On 08/29/2012 06:09 PM, Brian Smith wrote:
>> We found h_vmem to be highly unpredictable, especially with Java-based
>> applications. Stack settings were screwed up, certain applications
>> wouldn't launch (segfaults), and hard limits were hard to determine
>> for things like MPI applications. When your master has to launch 1024
>> MPI sub-tasks (qrsh),

Nowadays sub-tasks are forked locally, i.e. only one `qrsh` per slave
host is necessary.

-- Reuti

>> it generally eats up more VMEM than the slave
>> tasks do. It was just hard to get right.
>>
>> -Brian
>>
>> Brian Smith
>> Sr. System Administrator
>> Research Computing, University of South Florida
>> 4202 E. Fowler Ave. SVC4010
>> Office Phone: +1 813 974-1467
>> Organization URL: http://rc.usf.edu
>>
>> On 08/29/2012 11:33 AM, Reuti wrote:
>>> On 29.08.2012, at 17:21, Brian Smith wrote:
>>>
>>>> We use the mem_free variable as a consumable. Then, we use a cronjob
>>>> called memkiller that terminates jobs if they go over their
>>>> requested (or default) memory allocation and
>>>
>>> It would be more straightforward to use h_vmem directly. This is
>>> controlled by SGE, and a job exceeding the limit will be killed by
>>> SGE. If you consume it as a consumable on the exechost level, it can
>>> be set to the installed physical memory.
>>>
>>> Was there any reason to use mem_free?
>>>
>>> -- Reuti
>>>
>>>> 1. Swap space on the node is used
>>>> 2. Swap rate is greater than 100 I/Os per second
>>>>
>>>> The user gets emailed a report if this happens.
>>>>
>>>> This has made dealing with the OOM killer a thing of the past in
>>>> our shop.
>>>>
>>>> We manage memory on the principle that swap should NEVER be used.
>>>> If you're hitting the OOM killer, you're pretty far beyond that in
>>>> terms of memory utilization; if performance is a consideration,
>>>> IMHO you should be looking to schedule your memory usage
>>>> accordingly. The OOM killer shouldn't be a factor if memory is
>>>> handled as a scheduler consideration.
>>>>
>>>> -Brian
>>>>
>>>> Brian Smith
>>>> Sr. System Administrator
>>>> Research Computing, University of South Florida
>>>> 4202 E. Fowler Ave. SVC4010
>>>> Office Phone: +1 813 974-1467
>>>> Organization URL: http://rc.usf.edu
>>>>
>>>> On 08/29/2012 11:02 AM, Ben De Luca wrote:
>>>>> I was wondering how people deal with OOM conditions on their
>>>>> clusters. We constantly have machines that die because the OOM
>>>>> killer takes out critical system services.
>>>>>
>>>>> Has anyone experience with the oom_adj proc value, or a patch to
>>>>> Grid Engine to support it?
>>>>>
>>>>> /proc/[pid]/oom_adj (since Linux 2.6.11)
>>>>>     This file can be used to adjust the score used to select which
>>>>>     process should be killed in an out-of-memory (OOM) situation.
>>>>>     The kernel uses this value for a bit-shift operation of the
>>>>>     process's oom_score value: valid values are in the range -16
>>>>>     to +15, plus the special value -17, which disables OOM-killing
>>>>>     altogether for this process. A positive score increases the
>>>>>     likelihood of this process being killed by the OOM-killer; a
>>>>>     negative score decreases the likelihood. The default value for
>>>>>     this file is 0; a new process inherits its parent's oom_adj
>>>>>     setting. A process must be privileged (CAP_SYS_RESOURCE) to
>>>>>     update this file.
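Regarding Reuti's suggestion of letting SGE enforce the limit itself: h_vmem can be made a consumable and sized per execution host, so the scheduler also accounts for it. A minimal sketch, in which the 2G default, the 48G host value, and the names node01 and job.sh are hypothetical:

    # Via "qconf -mc", mark h_vmem as consumable and give it a default
    # request (fields: name shortcut type relop requestable consumable
    # default urgency), e.g.:
    #
    #   h_vmem   h_vmem   MEMORY   <=   YES   YES   2G   0
    #
    # Publish each exec host's installed RAM so the scheduler never
    # oversubscribes it:
    qconf -mattr exechost complex_values h_vmem=48G node01

    # Jobs then request their limit explicitly; SGE kills any job that
    # exceeds it:
    qsub -l h_vmem=4G job.sh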
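Brian's actual memkiller script is not posted in the thread, so purely as an illustration of the policy he describes (terminate and notify once a node uses swap or its swap traffic passes a threshold), a hypothetical cron job might look roughly like the sketch below. It deliberately omits the per-job check against the requested mem_free that his version performs, since that needs per-job memory accounting.

    #!/bin/sh
    # Hypothetical sketch only -- NOT the memkiller script asked about in
    # the thread. Runs from cron on each exec host and terminates all
    # local SGE jobs, notifying their owners, once the node starts
    # swapping.
    THRESHOLD=100    # rough swap traffic per second treated as fatal

    # Swap-in/swap-out rate over a one-second sample (vmstat columns 7/8).
    set -- $(vmstat 1 2 | tail -1 | awk '{print $7, $8}')
    swap_rate=$(( $1 + $2 ))
    swap_used=$(awk '/SwapTotal/{t=$2} /SwapFree/{f=$2} END{print t-f}' /proc/meminfo)

    if [ "$swap_used" -gt 0 ] || [ "$swap_rate" -gt "$THRESHOLD" ]; then
        # Column positions of "qhost -j" can differ between SGE versions;
        # verify them on your installation before relying on this.
        qhost -j -h "$(hostname)" |
        awk '/^[[:space:]]+[0-9]+[[:space:]]/{print $1, $4}' |
        while read jobid owner; do
            qdel "$jobid"
            echo "Job $jobid was killed: $(hostname) was swapping." |
                mail -s "memkiller: job $jobid terminated" "$owner"
        done
    fi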
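On Ben's question: even without a Grid Engine patch, the oom_adj interface quoted above can be used to protect critical system services, since the special value -17 exempts a process from the OOM killer entirely. A minimal sketch, run as root from an init script or cron job; the daemon names are examples only:

    #!/bin/sh
    # Hypothetical sketch: exempt critical daemons from the OOM killer by
    # writing the special value -17 (see the man page text quoted above)
    # to their oom_adj. Requires root (CAP_SYS_RESOURCE).
    for daemon in sge_execd sshd; do
        for pid in $(pgrep -x "$daemon"); do
            echo -17 > "/proc/$pid/oom_adj"
        done
    done

Note that, per the quoted text, children inherit the setting, so anything sge_execd spawns would also be exempt unless its oom_adj is reset; that is presumably why support at the Grid Engine level, as Ben asks about, would be the cleaner solution.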
