On Wed, 14 Mar 2012 at 9:33am, Reuti wrote:

I can run as many threads as I like on a single system with no problems, even if those threads are running at different nice levels.

How do they get different nice levels? Did you renice them? I would assume that they all start at the same nice level as the parent. The test program you posted has no threads in it.

Ah, thanks for pointing this out. Yes, when a job runs on a single host (even if SGE has assigned it to multiple queues), there's no qrsh involved. There's just a simple mpirun and all the threads run at the same priority. I did try renicing half the threads, and the job didn't fail.
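For anyone who wants to repeat the single-host renice check, here is a minimal sketch in Python. It is an illustration only, not the test I actually ran: the idle child processes stand in for MPI ranks, and the function name `spawn_and_renice` and its parameters are made up for this example.

```python
import os
import subprocess
import sys


def spawn_and_renice(n_workers=4, low_nice=19):
    """Start n_workers idle children, renice every other one, return nice levels."""
    # Each child just sleeps; it is a stand-in for one rank of a parallel job.
    children = [
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])
        for _ in range(n_workers)
    ]
    try:
        for i, child in enumerate(children):
            if i % 2 == 1:
                # Raising the nice value (lowering priority) needs no privileges.
                os.setpriority(os.PRIO_PROCESS, child.pid, low_nice)
        # Read the nice level of every child back from the kernel.
        return [os.getpriority(os.PRIO_PROCESS, c.pid) for c in children]
    finally:
        # Always clean up the stand-in processes.
        for c in children:
            c.kill()
            c.wait()


if __name__ == "__main__":
    print(spawn_and_renice())
```

This only shows processes on one host ending up at mixed nice levels; it does not involve qrsh or Open MPI at all.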

The problem seems to arise when I'm both a) running across multiple machines and b) running threads at differing nice levels (which often happens as a result of our queueing setup).

This sounds like you are getting slots from different queues assigned to one and the same job. My experience: don't do it unless you need it.

You are correct -- the problem is specific to a parallel job getting slots from different queues. Our cluster is used by a combination of folks who've financially supported it and those who haven't. Our high priority queue, lab.q, runs un-niced and is available only to those who have donated money and/or machines to us. Our low priority queue, long.q, runs at nice 19 and is available to all. The goal is to ensure instant access by a lab to its "share" of the cluster while letting both those users and non-supporting users use as many cores as they can in long.q. We explicitly allow overloading to further support our goal of keeping the usage both full and fair.

The setup is a bit convoluted, but it has kept the users (and, more importantly, the PIs) happy. Until the recent upgrade to CentOS 6 and concomitant switch from MPICH2 to Open MPI, we've had no issues with parallel jobs and this queue setup. And the test jobs I've tried with our old MPICH2 install (and the MPICH tight integration) running under CentOS 6 don't fail either.

Do you face the same if you stay in one and the same queue across the machines?

Jobs don't crash if they either:

a) all run in the same queue, or

b) run in multiple queues all on one machine
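
Put another way, the failures so far line up with one condition: the job's slots span more than one host and more than one queue at the same time. A small Python sketch of that check follows. It only encodes the pattern observed above and is not a diagnosis; the function name and the hostnames `node1`/`node2` are hypothetical.

```python
def spans_hosts_and_queues(slots):
    """Return True if the slots cover more than one host AND more than one queue.

    slots is a list of (host, queue) pairs, one per slot granted to the job.
    """
    hosts = {host for host, _queue in slots}
    queues = {queue for _host, queue in slots}
    return len(hosts) > 1 and len(queues) > 1


# Same queue across two hosts: matches case a), no crash seen.
same_queue = [("node1", "long.q"), ("node2", "long.q")]
# Two queues on one host: matches case b), no crash seen.
one_host = [("node1", "lab.q"), ("node1", "long.q")]
# Two queues across two hosts: the combination where jobs have crashed.
mixed = [("node1", "lab.q"), ("node2", "long.q")]

print(spans_hosts_and_queues(same_queue))
print(spans_hosts_and_queues(one_host))
print(spans_hosts_and_queues(mixed))
```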

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
