I'm back with what feels like another bug. Our environment is OGS
2011.11p1 on 600+ nodes (of widely varying vintage) with 4000+ slots. Our
queue setup is a bit odd: 3 queues on each node, each with slots equal
to the node's core count -- one for high priority jobs, one for low
priority jobs, and one for short jobs.
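For context, here's a rough sketch of what one of those per-node queue
definitions looks like (queue names and the core count are my invention;
only the slots=cores pattern is from our actual setup):

```
# qconf -sq high.q  (excerpt, hypothetical names/values)
qname    high.q
slots    8          # = cores on the node; low.q and short.q follow the
                    # same pattern, so each node shows 3*cores slots total
```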
Over the weekend, the scheduler was whacked by the OOM killer (on a
machine with 48GB of RAM). I tracked the issue down to 3 array jobs (each
with 100 tasks). My first thought was that the combination of
array/parallel/reservations was too memory hungry, but turning
reservations off for these jobs didn't help. I then had the user
re-submit one array job as 100 individual jobs. If I enabled (read:
released the hold on) them a few at a time, they ran just fine. But as
soon as I hit a certain number (which I *think* correlated with SGE not
being able to launch them all in the first scheduling run), things blew up
again. Limiting the jobs to a single queue didn't help either.
It turns out that the magic bullet was the PE request. The jobs were
submitted with "-pe smp 2-16". If I changed that to a fixed "-pe smp N"
(the exact value of N doesn't matter, as long as it falls within those
limits), the scheduler handled the original array jobs just fine.
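To make the difference concrete, here are the two submission forms (job
script name is made up, and the choice of 8 for the fixed request is
arbitrary):

```
# Slot-range PE request -- this is what triggered the scheduler blowup:
qsub -t 1-100 -pe smp 2-16 job.sh

# Fixed PE request -- any single N in [2,16] worked fine:
qsub -t 1-100 -pe smp 8 job.sh
```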
According to the man page, the flexible slot request is valid. Has anyone
else seen this before? Any idea what kind of configs could be triggering
this?
Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users