I'm back with what feels like another bug. Our environment is OGS 2011.11p1 on 600+ nodes (of widely varying vintage) with 4000+ slots. Our queue setup is a bit odd: 3 queues on each node, each with slots=cores -- one for high priority jobs, one for low priority jobs, and one for short jobs.
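
For concreteness, the per-node queue instances look roughly like this (the queue names and the slot count here are placeholders -- the real values vary by node):

    $ qconf -sq high.q | grep slots
    slots                 8
    $ qconf -sq low.q | grep slots
    slots                 8
    $ qconf -sq short.q | grep slots
    slots                 8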

Over the weekend, the scheduler was whacked by the OOM killer (on a machine with 48GB of RAM). I tracked the issue down to 3 array jobs (each with 100 tasks). My first thought was that the combination of array/parallel/reservations was too memory-hungry, but turning reservations off for these jobs didn't help. I then had the user resubmit one array job as 100 individual jobs. If I enabled (read: released the hold on) them a few at a time, they ran just fine. But as soon as I hit a certain number (which I *think* correlated with SGE not being able to launch them all in the first scheduling run), things blew up again. Limiting the jobs to a single queue didn't help either.
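
To make the test concrete, here's a rough reconstruction of the commands involved (the script name, job IDs, and batch sizes are made up; the -pe request is the actual one from the jobs):

    # original form: array job of 100 tasks, submitted on hold
    $ qsub -h -t 1-100 -pe smp 2-16 job.sh
    # turning the reservation off didn't help
    $ qsub -h -R n -t 1-100 -pe smp 2-16 job.sh
    # resubmitted as 100 individual jobs; releasing a few at a time was fine
    $ qrls 1000001 1000002 1000003
    # releasing more than fit in one scheduling run blew up the scheduler
    $ qrls 1000004 1000005 ... 1000040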

It turns out that the magic bullet was the PE request. The jobs were submitted with "-pe smp 2-16". If I changed that to "-pe smp N" (the value of N doesn't really matter, as long as it's within those limits), then the scheduler could handle the original array jobs just fine.
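
In other words (the job script and the particular N here are just for illustration):

    # triggers the scheduler memory blowup:
    $ qsub -t 1-100 -pe smp 2-16 job.sh
    # fine -- any fixed N within the 2-16 range behaves the same:
    $ qsub -t 1-100 -pe smp 8 job.sh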

According to the qsub man page, a slot range in the PE request is perfectly valid. Has anyone else seen this before? Any idea what kind of config could be triggering this?

Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF