On Mon, 3 Mar 2014 at 2:35pm, Reuti wrote:

I'm back with what feels like another bug. Our environment is OGS 2011.11p1 on 600+ nodes (of widely varying vintage) with 4000+ slots. Our queue setup is a bit odd, with 3 queues on each node (with each queue having slots=cores) -- one for high priority jobs, one for low priority jobs, and one for short jobs.

Over the weekend, the scheduler was whacked by the OOM killer (on a machine with 48GB of RAM). I tracked the issue down to 3 array jobs (each with 100 tasks). My first thought was that the combination of array/parallel/reservations was too memory hungry, but turning reservations off for these jobs didn't help. I then had the user re-submit one array job as 100 individual jobs. If I enabled (read: released the hold on) them a few at a time, they ran just fine. But as soon as I hit a certain number (which I *think* correlated with SGE not being able to launch them all in the first scheduling run), things blew up again. Limiting the jobs to a single queue didn't help either.

Is the setting of "max_pending_tasks_per_job" in the scheduler configuration still at the default of 50? Maybe a smaller value would be better in your case.
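For reference, this parameter lives in the SGE/OGS scheduler configuration and can be checked and changed with `qconf` (a sketch, run from an admin host; assumes `qconf` is on the PATH):

```shell
# Print the scheduler configuration and look for the parameter
qconf -ssconf | grep max_pending_tasks_per_job

# Open the scheduler configuration in $EDITOR to change the value
qconf -msconf
```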

It is still at 50. But lowering it won't help the case where a user (or users) submits individual jobs with flexible slot requests. Also, in my testing today, as few as 10 jobs were able to trigger the memory explosion (and I suspect that fewer could do so if the queues were more full). And I'd rather not limit the throughput of jobs to get around what really smells like a bug.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
