On Wed, 14 Mar 2012 at 9:33am, Reuti wrote:
I can run as many threads as I like on a single system with no
problems, even if those threads are running at different nice levels.
How do they get different nice levels - you renice them? I would assume
that they all start at the same nice level as the parent. In the test
program you posted there are no threads.
Ah, thanks for pointing this out. Yes, when a job runs on a single host
(even if SGE has assigned it to multiple queues), there's no qrsh
involved. There's just a simple mpirun and all the threads run at the
same priority. I did try renicing half the threads, and the job didn't
fail.
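
For reference, a minimal sketch of that kind of single-host renice check.
This is not the actual test that was run: it assumes plain idle child
processes standing in for the MPI ranks, and the helper name renice_half
is made up for illustration. It starts a few children, renices every
other one to 19, and reports the resulting nice levels.

```python
import os
import subprocess
import sys


def renice_half(n_procs=4, nice_level=19):
    """Start n_procs idle children, renice every other one, return nice values.

    Hypothetical helper: the children are stand-ins for MPI ranks on one host.
    """
    procs = [
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
        for _ in range(n_procs)
    ]
    try:
        # Raising the nice value of our own children needs no privileges.
        for p in procs[::2]:
            os.setpriority(os.PRIO_PROCESS, p.pid, nice_level)
        # Read back the nice level of every child, reniced or not.
        return [os.getpriority(os.PRIO_PROCESS, p.pid) for p in procs]
    finally:
        # Always clean up the stand-in processes.
        for p in procs:
            p.kill()
            p.wait()


if __name__ == "__main__":
    print(renice_half())
```

The children that were not reniced keep the nice level they inherited from
the parent, which is the "all start at the same level" behavior noted above.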
The problem seems to arise when I'm both a) running across multiple
machines and b) running threads at differing nice levels (which often
happens as a result of our queueing setup).
This sounds like you are getting slots from different queues assigned to
one and the same job. My experience: don't do it, unless you need it.
You are correct -- the problem is specific to a parallel job getting slots
from different queues. Our cluster is used by a combination of folks
who've financially supported it, and those that haven't. Our high
priority queue, lab.q, runs un-niced and is available only to those who
have donated money and/or machines to us. Our low priority queue, long.q,
runs nice 19 and is available to all. The goal is to ensure instant
access by a lab to its "share" of the cluster while letting both those
users and non-supporting users use as many cores as they can in long.q.
We explicitly allow overloading to further support our goal of keeping the
usage both full and fair.
The setup is a bit convoluted, but it has kept the users (and, more
importantly, the PIs) happy. Until the recent upgrade to CentOS 6 and
concomitant switch from MPICH2 to Open MPI, we've had no issues with
parallel jobs and this queue setup. And the test jobs I've tried with our
old MPICH2 install (and the MPICH tight integration) running under CentOS
6 don't fail either.
Do you face the same if you stay in one and the same queue across the
machines?
Jobs don't crash if they either:
a) all run in the same queue, or
b) run in multiple queues all on one machine
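
Put another way, the crashes have only shown up when a job spans more than
one machine and more than one queue at the same time. A small sketch of
that observation, purely as a restatement and not a diagnosis: the function
name and the host names node1/node2 are hypothetical, and a job is assumed
to be described as a list of (host, queue) slot assignments.

```python
def matches_failure_pattern(slots):
    """Return True if a job's slots span several hosts AND several queues.

    slots: iterable of (host, queue) pairs granted to one parallel job.
    This only encodes the pattern observed so far, not a root cause.
    """
    slots = list(slots)
    hosts = {host for host, _ in slots}
    queues = {queue for _, queue in slots}
    return len(hosts) > 1 and len(queues) > 1


if __name__ == "__main__":
    # Same queue across machines: has not crashed.
    print(matches_failure_pattern([("node1", "long.q"), ("node2", "long.q")]))
    # Multiple queues on one machine: has not crashed.
    print(matches_failure_pattern([("node1", "lab.q"), ("node1", "long.q")]))
    # Multiple queues across machines: the failing case.
    print(matches_failure_pattern([("node1", "lab.q"), ("node2", "long.q")]))
```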
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF