On Mon, 3 Mar 2014 at 2:35pm, Reuti wrote:
>> I'm back with what feels like another bug. Our environment is OGS
>> 2011.11p1 on 600+ nodes (of widely varying vintage) with 4000+ slots.
>> Our queue setup is a bit odd, with 3 queues on each node (with each
>> queue having slots=cores) -- one for high priority jobs, one for low
>> priority jobs, and one for short jobs.
>>
>> Over the weekend, the scheduler was whacked by the OOM killer (on a
>> machine with 48GB of RAM). I tracked the issue down to 3 array jobs
>> (each with 100 tasks). My first thought was that the combination of
>> array/parallel/reservations was too memory hungry, but turning
>> reservations off for these jobs didn't help. I then had the user
>> re-submit one array job as 100 individual jobs. If I enabled (read:
>> released the hold on) them a few at a time, they ran just fine. But as
>> soon as I hit a certain number (which I *think* correlated with SGE not
>> being able to launch them all in the first scheduling run), things blew
>> up again. Limiting the jobs to a single queue didn't help either.
> Is the setting of "max_pending_tasks_per_job" in the scheduler
> configuration still at the default of 50? Maybe a smaller value would
> be better in your case.
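(For anyone following the thread: the parameter Reuti mentions lives in the scheduler configuration and can be checked or changed with the standard qconf calls, roughly like this:)

```shell
# Print the scheduler configuration and look for the parameter:
qconf -ssconf | grep max_pending_tasks_per_job

# Edit the scheduler configuration interactively (opens in $EDITOR):
qconf -msconf
```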
It is still at 50. But lowering it won't help the case where a user (or
users) submits individual jobs with flexible slot requests. Also, in my
testing today, as few as 10 jobs were able to trigger the memory explosion
(and I suspect that fewer could do so if the queues were more full). And
I'd rather not limit the throughput of jobs to get around what really
smells like a bug.
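For anyone who wants to reproduce the test I described, the submit-and-release sequence looked roughly like this (the PE name "smp", the slot range, "job.sh", and the job IDs below are all placeholders, not our real setup):

```shell
# Submit 100 individual jobs in a user hold state (-h), each with a
# "flexible slot request", i.e. a PE slot range instead of a fixed count:
for i in $(seq 1 100); do
    qsub -h -pe smp 2-8 job.sh
done

# Then release the holds a few jobs at a time (placeholder job IDs)
# and watch the scheduler's memory use:
qrls 1001 1002 1003 1004 1005
```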
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users