I'm back with what feels like another bug. Our environment is OGS 2011.11p1 on 600+ nodes (of widely varying vintage) with 4000+ slots. Our queue setup is a bit odd: 3 queues on each node, each with slots=cores -- one for high priority jobs, one for low priority jobs, and one for short jobs.
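
For concreteness, the per-node queue instances look roughly like this (the queue names and the slot count here are placeholders -- the real values vary by node):

    $ qconf -sq high.q | grep slots
    slots                 8
    $ qconf -sq low.q | grep slots
    slots                 8
    $ qconf -sq short.q | grep slots
    slots                 8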

Over the weekend, the scheduler was whacked by the OOM killer (on a machine with 48GB of RAM). I tracked the issue down to 3 array jobs (each with 100 tasks). My first thought was that the combination of array/parallel/reservations was too memory-hungry, but turning reservations off for these jobs didn't help. I then had the user resubmit one array job as 100 individual jobs. If I enabled (read: released the hold on) them a few at a time, they ran just fine. But as soon as I hit a certain number (which I *think* correlated with SGE not being able to launch them all in the first scheduling run), things blew up again. Limiting the jobs to a single queue didn't help either.
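
To make the test concrete, here's a rough reconstruction of the commands involved (the script name, job IDs, and batch sizes are made up; the -pe request is the actual one from the jobs):

    # original form: array job of 100 tasks, submitted on hold
    $ qsub -h -t 1-100 -pe smp 2-16 job.sh
    # turning the reservation off didn't help
    $ qsub -h -R n -t 1-100 -pe smp 2-16 job.sh
    # resubmitted as 100 individual jobs; releasing a few at a time was fine
    $ qrls 1000001 1000002 1000003
    # releasing more than fit in one scheduling run blew up the scheduler
    $ qrls 1000004 1000005 ... 1000040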

It turns out that the magic bullet was the PE request. The jobs were submitted with "-pe smp 2-16". If I changed that to "-pe smp N" (the value of N doesn't really matter, as long as it's within those limits), then the scheduler could handle the original array jobs just fine.
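
In other words (the job script and the particular N here are just for illustration):

    # triggers the scheduler memory blowup:
    $ qsub -t 1-100 -pe smp 2-16 job.sh
    # fine -- any fixed N within the 2-16 range behaves the same:
    $ qsub -t 1-100 -pe smp 8 job.sh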

According to the qsub man page, a slot range in the PE request is perfectly valid. Has anyone else seen this before? Any idea what kind of config could be triggering this?

Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF