Yes, I have hit this. Reservations need to be off for all jobs.

I found the section of the code that allocates the memory and, as far as I can
tell, commenting it out does nothing. If you look through the past emails on the
list, you will see me writing about it at this time (almost exactly, + 2 weeks)
2 years ago.

I will send my patch for an earlier grid on Monday.
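
For anyone checking their own setup in the meantime, here is a minimal sketch
(not part of the patch) that reads scheduler settings in the "name value" layout
printed by `qconf -ssconf` and reports whether reservations are off globally.
It assumes that `max_reservation 0` means the scheduler makes no reservations,
which is how sched_conf(5) describes the default. The helper names and the
sample text are hypothetical, not taken from the cluster discussed here.

```python
def parse_sched_conf(text):
    """Parse 'name value' lines (the layout of `qconf -ssconf`) into a dict."""
    conf = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def reservations_disabled(conf):
    """True if max_reservation is 0 (assumed to mean: no reservations at all).

    A missing key is treated as 0, on the assumption that 0 is the default.
    """
    return int(conf.get("max_reservation", "0")) == 0


# Hypothetical sample; on a real cluster, feed in the output of `qconf -ssconf`.
SAMPLE = """\
algorithm                         default
schedule_interval                 0:0:15
max_pending_tasks_per_job         50
max_reservation                   32
"""

conf = parse_sched_conf(SAMPLE)
print("max_pending_tasks_per_job:", conf["max_pending_tasks_per_job"])
print("reservations disabled:", reservations_disabled(conf))
```

If this reports that reservations are still enabled, individual jobs submitted
with `-R y` can still ask for them, so switching them off for only a few jobs
may not be enough.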




On Tue, Mar 4, 2014 at 5:16 AM, Joshua Baker-LePain <[email protected]> wrote:

> On Mon, 3 Mar 2014 at 2:35pm, Reuti wrote
>
>
>>> I'm back with what feels like another bug.  Our environment is OGS
>>> 2011.11p1 on 600+ nodes (of widely varying vintage) with 4000+ slots. Our
>>> queue setup is a bit odd, with 3 queues on each node (with each queue
>>> having slots=cores) -- one for high priority jobs, one for low priority
>>> jobs, and one for short jobs.
>>>
>>> Over the weekend, the scheduler was whacked by the OOM killer (on a
>>> machine with 48GB of RAM).  I tracked the issue down to 3 array jobs (each
>>> with 100 tasks).  My first thought was that the combination of
>>> array/parallel/reservations was too memory hungry, but turning reservations
>>> off for these jobs didn't help.  I then had the user re-submit one array
>>> job as 100 individual jobs.  If I enabled (read: released the hold on) them
>>> a few at a time, they ran just fine.  But as soon as I hit a certain number
>>> (which I *think* correlated with SGE not being able to launch them all in
>>> the first scheduling run), things blew up again.  Limiting the jobs to a
>>> single queue didn't help either.
>>>
>>
>> The setting of "max_pending_tasks_per_job" in the scheduler setting was
>> still the default 50? Maybe a smaller value is better in your case.
>>
>
> It is still at 50.  But lowering it won't help the case where a user (or
> users) submits individual jobs with flexible slot requests.  Also, in my
> testing today, as few as 10 jobs were able to trigger the memory explosion
> (and I suspect that fewer could do so if the queues were more full).  And
> I'd rather not limit the throughput of jobs to get around what really
> smells like a bug.
>
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
