Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

Joshua Baker-LePain Wed, 14 Mar 2012 13:30:39 -0400

On Wed, 14 Mar 2012 at 9:33am, Reuti wrote

I can run as many threads as I like on a single system with no problems, even if those threads are running at different nice levels.
How do they get different nice levels - you renice them? I would assume that all start at the same level as the parent. In the test program you posted there are no threads.

Ah, thanks for pointing this out. Yes, when a job runs on a single host (even if SGE has assigned it to multiple queues), there's no qrsh involved. There's just a simple mpirun and all the threads run at the same priority. I did try renicing half the threads, and the job didn't fail.

The problem seems to arise when I'm both a) running across multiple machines and b) running threads at differing nice levels (which often happens as a result of our queueing setup).
This sounds like you are getting slots from different queues assigned to one and the same job. My experience: don't do it, unless you need it.

You are correct -- the problem is specific to a parallel job getting slots from different queues. Our cluster is used by a combination of folks who've financially supported it and those who haven't. Our high priority queue, lab.q, runs un-niced and is available only to those who have donated money and/or machines to us. Our low priority queue, long.q, runs at nice 19 and is available to all. The goal is to ensure instant access by a lab to its "share" of the cluster while letting both those users and non-supporting users use as many cores as they can in long.q. We explicitly allow overloading to further support our goal of keeping the usage both full and fair.

The setup is a bit convoluted, but it has kept the users (and, more importantly, the PIs) happy. Until the recent upgrade to CentOS 6 and concomitant switch from MPICH2 to Open MPI, we've had no issues with parallel jobs and this queue setup. And the test jobs I've tried with our old MPICH2 install (and the MPICH tight integration) running under CentOS 6 don't fail either.

Do you face the same if you stay in one and the same queue across the machines?


Jobs don't crash if they either:

a) all run in the same queue, or

b) run in multiple queues all on one machine
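
For anyone trying to confirm which case a given job falls into, something like the following might help. It is only a rough sketch, not something from our actual setup: each process reports the host it is on and the nice level it is running at. The function names are made up for illustration, and it assumes Python 3.3 or later on a Unix system.

```python
import os
import socket


def describe_process():
    """Return (hostname, pid, nice level) for the calling process."""
    # os.getpriority only reads the nice value; it does not change it.
    # Passing 0 as the id means "the calling process". Unix only.
    nice = os.getpriority(os.PRIO_PROCESS, 0)
    return socket.gethostname(), os.getpid(), nice


def format_report(host, pid, nice):
    """Format one line of output per process."""
    return f"host={host} pid={pid} nice={nice}"


if __name__ == "__main__":
    print(format_report(*describe_process()))
```

Launched through mpirun inside the same SGE job in place of the real binary, this should print one line per slot, so a mix of hosts together with a mix of nice values would point to the failing combination. That assumes the processes inherit the nice level of whatever started them on each node.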

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
