On 15.03.2012, at 18:14, Joshua Baker-LePain wrote:

> On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote
> 
>> PS: In your example you also had the case of 2 slots in the low-priority
>> queue; what is the actual setup in your cluster?
> 
> Our actual setup is:
> 
> o lab.q, slots=numprocs, load_thresholds=np_load_avg=1.5, labs (=SGE
>   projects) limited by RQS to a number of slots equal to their "share" of
>   the cluster, seq_no=0, priority=0.
> 
> o long.q, slots=numprocs, load_thresholds=np_load_avg=0.9, seq_no=1,
>   priority=19
> 
> o short.q, slots=numprocs, load_thresholds=np_load_avg=1.25, users
>   limited by RQS to 200 slots, runtime limited to 30 minutes, seq_no=2,
>   priority=10
> 
> Users are instructed to not select a queue when submitting jobs.  The theory 
> is that even if non-contributing users have filled the cluster with long.q 
> jobs, contributing users will still have instant access to "their" lab.q 
> slots, overloading nodes with jobs running at a higher priority than the 
> long.q jobs.  long.q jobs won't start on nodes full of lab.q jobs. And 
> short.q is for quick, high priority jobs regardless of cluster status (the 
> main use case being processing MRI data into images while a patient is 
> physically in the scanner).
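
A per-project resource quota of the shape described above might look like
the following sketch (the rule name and the slot count are illustrative,
not the actual values):

   {
      name         lab_shares
      description  "cap each lab project at its share of lab.q"
      enabled      TRUE
      limit        projects {*} queues lab.q to slots=128
   }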

Thanks for posting the information. Preventing a job from getting slots from
different queues isn't complex:

1. Define each PE three times, e.g. "orte_lab", "orte_long" and "orte_short".
Attach each PE to its corresponding queue and to that queue only, i.e. "long.q"
gets "orte_long" and so on (see the sketch after this list).

2. The `qsub` command then needs a wildcard PE request like "-pe orte* 64"
instead of the plain "orte" which I guess is used right now.
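
A minimal sketch of step 1 with `qconf`, assuming the current PE is simply
named "orte" (the names and the temp path are placeholders):

   # clone the existing "orte" PE under a new name; repeat for
   # "orte_lab" and "orte_short"
   qconf -sp orte | sed 's/^pe_name .*/pe_name    orte_long/' > /tmp/orte_long
   qconf -Ap /tmp/orte_long

   # set the queue's pe_list to the new PE only, which at the same
   # time drops the old shared "orte" from this queue
   qconf -mattr queue pe_list orte_long long.q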

Once SGE has selected a PE for the job, the job will stay in this PE, and as
each PE is attached to only one queue, slots from foreign queues will no
longer be assigned. Jobs may have to wait a little longer to start though,
since right now slots are collected from all queues at once.
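
To double-check after the fact: the accounting record of a finished job
shows the PE it was actually granted, and hence the queue it was confined
to (<jobid> is a placeholder):

   qacct -j <jobid> | egrep 'qname|granted_pe|slots'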

NB: Do you already use "-R y" (resource reservation) together with a set h_rt
to avoid starvation of parallel jobs?
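
A submission combining the wildcard PE, reservation and a runtime limit
might then look like this (the 64 slots and the 24 h limit are just
placeholders):

   qsub -pe "orte*" 64 -R y -l h_rt=24:00:00 job.sh

Note that "-R y" only has an effect if max_reservation in the scheduler
configuration (`qconf -msconf`) is greater than zero.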

-- Reuti


> The truth is our cluster is primarily used for, and thus SGE is tuned for, 
> large numbers of serial jobs.  We do have *some* folks running parallel code, 
> and it *is* starting to get to the point where I need to reconfigure things 
> to make that part work better.
> 
> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF

