Yes, I have hit this; reservation needs to be off for all jobs. I found the section of the code that allocates the memory, and as far as I can tell, commenting it out does nothing. If you look through the past emails on the list, you will see me writing about it at this time of year (almost exactly, plus 2 weeks) 2 years ago.
Will send my patch for an earlier grid on Monday.

On Tue, Mar 4, 2014 at 5:16 AM, Joshua Baker-LePain <[email protected]> wrote:

> On Mon, 3 Mar 2014 at 2:35pm, Reuti wrote
>
>>> I'm back with what feels like another bug. Our environment is OGS
>>> 2011.11p1 on 600+ nodes (of widely varying vintage) with 4000+ slots. Our
>>> queue setup is a bit odd, with 3 queues on each node (with each queue
>>> having slots=cores) -- one for high priority jobs, one for low priority
>>> jobs, and one for short jobs.
>>>
>>> Over the weekend, the scheduler was whacked by the OOM killer (on a
>>> machine with 48GB of RAM). I tracked the issue down to 3 array jobs (each
>>> with 100 tasks). My first thought was that the combination of
>>> array/parallel/reservations was too memory hungry, but turning reservations
>>> off for these jobs didn't help. I then had the user re-submit one array
>>> job as 100 individual jobs. If I enabled (read: released the hold on) them
>>> a few at a time, they ran just fine. But as soon as I hit a certain number
>>> (which I *think* correlated with SGE not being able to launch them all in
>>> the first scheduling run), things blew up again. Limiting the jobs to a
>>> single queue didn't help either.
>>>
>>
>> The setting of "max_pending_tasks_per_job" in the scheduler setting was
>> still the default 50? Maybe a smaller value is better in your case.
>>
>
> It is still at 50. But lowering it won't help the case where a user (or
> users) submits individual jobs with flexible slot requests. Also, in my
> testing today, as few as 10 jobs were able to trigger the memory explosion
> (and I suspect that fewer could do so if the queues were more full). And
> I'd rather not limit the throughput of jobs to get around what really
> smells like a bug.
>
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
