On Mon, 3 Mar 2014 at 2:35pm, Reuti wrote:

I'm back with what feels like another bug. Our environment is OGS 2011.11p1 on 600+ nodes (of widely varying vintage) with 4000+ slots. Our queue setup is a bit odd, with 3 queues on each node (with each queue having slots=cores) -- one for high priority jobs, one for low priority jobs, and one for short jobs.

Over the weekend, the scheduler was whacked by the OOM killer (on a machine with 48GB of RAM). I tracked the issue down to 3 array jobs (each with 100 tasks). My first thought was that the combination of array/parallel/reservations was too memory hungry, but turning reservations off for these jobs didn't help. I then had the user re-submit one array job as 100 individual jobs. If I enabled (read: released the hold on) them a few at a time, they ran just fine. But as soon as I hit a certain number (which I *think* correlated with SGE not being able to launch them all in the first scheduling run), things blew up again. Limiting the jobs to a single queue didn't help either.

Is the setting of "max_pending_tasks_per_job" in the scheduler configuration still at the default of 50? Maybe a smaller value would be better in your case.
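For reference, this parameter lives in the SGE/OGS scheduler configuration and can be checked and changed with `qconf` (a sketch, run from an admin host; assumes `qconf` is on the PATH):

```shell
# Print the scheduler configuration and look for the parameter
qconf -ssconf | grep max_pending_tasks_per_job

# Open the scheduler configuration in $EDITOR to change the value
qconf -msconf
```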

It is still at 50. But lowering it won't help the case where a user (or users) submits individual jobs with flexible slot requests. Also, in my testing today, as few as 10 jobs were able to trigger the memory explosion (and I suspect that fewer could do so if the queues were more full). And I'd rather not limit the throughput of jobs to get around what really smells like a bug.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
