On 03.03.2014, at 23:20, Joshua Baker-LePain wrote:

> I'm back with what feels like another bug.  Our environment is OGS 2011.11p1 
> on 600+ nodes (of widely varying vintage) with 4000+ slots.  Our queue setup 
> is a bit odd, with 3 queues on each node (with each queue having slots=cores) 
> -- one for high priority jobs, one for low priority jobs, and one for short 
> jobs.
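In other words, per host something roughly like this, with hypothetical queue names and an example core count (only the relevant attributes shown):

  # qconf -sq high.q    -- likewise for low.q and short.q on the same hosts
  qname                 high.q
  hostlist              @allhosts
  slots                 8        # set per host to match its core count
  [...]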
> 
> Over the weekend, the scheduler was whacked by the OOM killer (on a machine 
> with 48GB of RAM).  I tracked the issue down to 3 array jobs (each with 100 
> tasks).  My first thought was that the combination of 
> array/parallel/reservations was too memory hungry, but turning reservations 
> off for these jobs didn't help.  I then had the user re-submit one array job 
> as 100 individual jobs.  If I enabled (read: released the hold on) them a few 
> at a time, they ran just fine.  But as soon as I hit a certain number (which 
> I *think* correlated with SGE not being able to launch them all in the first 
> scheduling run), things blew up again.  Limiting the jobs to a single queue 
> didn't help either.

Is the setting of "max_pending_tasks_per_job" in the scheduler configuration still
at the default of 50? A smaller value might work better in your case.
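
You can check the current value and change it with qconf, roughly like this (the
target value of 10 below is only an example):

  # show the current scheduler configuration and check the parameter
  qconf -ssconf | grep max_pending_tasks_per_job

  # edit the scheduler configuration in $EDITOR and lower
  # max_pending_tasks_per_job, e.g. from 50 to 10
  qconf -msconf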

-- Reuti


> It turns out that the magic bullet was the PE request.  The jobs were 
> submitted with "-pe smp 2-16".  If I changed that to "-pe smp N" (the value 
> of N doesn't really matter, just somewhere within those limits), then the 
> scheduler could handle the original array jobs just fine.
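Just to make the two submission forms concrete (the script name and the fixed slot
count below are placeholders; the 100-task range is from the report above):

  # flexible slot range that made the scheduler blow up
  qsub -t 1-100 -pe smp 2-16 job.sh

  # fixed slot count within the same limits, which scheduled fine
  qsub -t 1-100 -pe smp 8 job.sh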
> 
> According to the man page, the flexible slot request is valid.  Has anyone 
> else seen this before?  Any idea what kind of configs could be triggering 
> this?
> 
> Thanks.
> 
> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
