Hi
I don't use Slurm, and our clusters are fairly small (few tens of nodes,
few hundred cores).
Having said that, I know that Torque, which we use here,
requires specific system configuration changes on large clusters,
like increasing the maximum number of open files,
increasing the ARP cache size, etc.
Apparently Slurm also needs some system tweaking on large clusters:
https://computing.llnl.gov/linux/slurm/big_sys.html
Could this be the problem?
Anyway, just a thought.
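
A quick way to see where a node currently stands on the two settings mentioned above is sketched below. It only reads the values; it does not change anything. The function names are made up for illustration, and the /proc paths assume a Linux kernel that exposes the IPv4 neighbour-table thresholds. Which values are adequate for a given cluster size is not something this sketch knows; see the Slurm page for guidance.

```python
import resource
from pathlib import Path

# Linux ARP (neighbour) cache garbage-collection thresholds.
# Assumption: the node exposes them under this /proc directory.
ARP_KEYS = ("gc_thresh1", "gc_thresh2", "gc_thresh3")
ARP_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def open_file_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def arp_cache_thresholds(base=ARP_DIR):
    """Read the ARP cache thresholds; a value is None where it cannot be read."""
    values = {}
    for key in ARP_KEYS:
        try:
            values[key] = int((Path(base) / key).read_text().strip())
        except (OSError, ValueError):
            values[key] = None
    return values


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={soft} hard={hard}")
    for key, value in arp_cache_thresholds().items():
        print(f"{key}: {value}")
```

Note that the open-file limit reported here is the one inherited by the shell running the script; daemons started by the resource manager may run with different limits.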
Gus Correa
On 10/12/2012 09:27 AM, Michael Di Domenico wrote:
what isn't working is when i fire off an MPI job with over 800 ranks,
they don't all actually start up a process
e.g., if i do srun -n 1024 --ntasks-per-node 12 xhpl
and then do a 'pgrep xhpl | wc -l', on all of the allocated nodes, not
all of them have actually started xhpl
most will read 12 started processes, but an inconsistent list of nodes
will fail to actually start xhpl and stall the whole job
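
The per-node check described above can be sketched as a small script. It assumes the 'pgrep xhpl | wc -l' results have already been gathered into "hostname count" lines by whatever parallel shell is at hand; that input format and the function names are hypothetical, not something from this thread.

```python
def parse_counts(text):
    """Parse 'hostname count' lines into a dict.

    Lines that do not have exactly two fields are skipped; a trailing
    colon on the hostname is dropped.
    """
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        host, count = parts
        counts[host.rstrip(":")] = int(count)
    return counts


def stalled_nodes(counts, expected=12):
    """Return the sorted hosts that started fewer than `expected` processes."""
    return sorted(host for host, n in counts.items() if n < expected)


if __name__ == "__main__":
    # Made-up sample data, only to show the input shape.
    sample = """
    node001: 12
    node002: 0
    node003: 12
    node004: 0
    """
    print(stalled_nodes(parse_counts(sample)))
```

Since 1024 is not a multiple of 12, at least one node may legitimately hold fewer than 12 ranks, depending on how the ranks are laid out, so a short count on a single node is not necessarily a failure.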
if i look at all the nodes allocated to my job, it does start the orte
process though
what i need to figure out, is why the orte process starts, but fails
to actually start xhpl on some of the nodes
unfortunately, the list of nodes that don't start xhpl during my runs
changes each time and no hardware errors are being detected. if i
cancel the job and restart the job over and over, eventually one will
actually kick off and run to completion.
if i run the process outside of slurm just using openmpi, it seems to
behave correctly, so i'm leaning towards a problem in how slurm
interacts with openmpi.
what i'd like to do is instrument a debug in openmpi that will tell me
what openmpi is waiting on in order to kick off the xhpl binary
i'm testing to see whether it's a psm related problem now, i'll check
back if i can narrow the scope a little more
On Thu, Oct 11, 2012 at 10:21 PM, Ralph Castain <r...@open-mpi.org> wrote:
I'm afraid I'm confused - I don't understand what is and isn't working. What
"next process" isn't starting?
On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico
<mdidomeni...@gmail.com> wrote:
adding some additional info
did an strace on an orted process where xhpl failed to start, i did
this after the mpirun execution, so i probably missed some output, but
it keeps scrolling this:
poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
events=POLLIN}], 9, 1000) = 0 (Timeout)
i didn't see anything useful in /proc under those file descriptors,
but perhaps i missed something i don't know to look for
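
One way to make the /proc lookup less manual is sketched here: pull the descriptor numbers out of the strace poll() line, then resolve each one through /proc/<pid>/fd. The function names are hypothetical, and the demo inspects its own PID only so that it runs anywhere; the orted PID would go in its place.

```python
import os
import re

# Matches each "{fd=N," entry in an strace poll() line.
FD_PATTERN = re.compile(r"\{fd=(\d+),")


def polled_fds(strace_line):
    """Extract the descriptor numbers from an strace poll() line."""
    return [int(fd) for fd in FD_PATTERN.findall(strace_line)]


def describe_fds(pid, fds):
    """Map each descriptor to its /proc/<pid>/fd link target, or None if unreadable."""
    targets = {}
    for fd in fds:
        try:
            targets[fd] = os.readlink(f"/proc/{pid}/fd/{fd}")
        except OSError:
            targets[fd] = None
    return targets


if __name__ == "__main__":
    line = (
        "poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8, events=POLLIN},"
        "{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13, events=POLLIN},"
        "{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16, events=POLLIN}], "
        "9, 1000) = 0 (Timeout)"
    )
    # Own PID used as a stand-in; substitute the PID of the stuck orted.
    for fd, target in describe_fds(os.getpid(), polled_fds(line)).items():
        print(fd, target)
```

Link targets of the form socket:[inode] or pipe:[inode] show what kind of object each descriptor is; for sockets, the inode can then be looked up in the per-protocol tables under /proc/net to find the peer.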
On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
<mdidomeni...@gmail.com> wrote:
to add a little more detail, it looks like xhpl is not actually
starting on all nodes when i kick off the mpirun
each time i cancel and restart the job, the nodes that do not start
change, so i can't call it a bad node
if i disable infiniband with --mca btl self,sm,tcp, on occasion i can
get xhpl to actually run, but it's not consistent
i'm going to check my ethernet network and make sure there's no
problems there (could this be an OOB error with mpirun?). on the nodes
that fail to start xhpl, i do see the orte process, but nothing in the
logs about why it failed to launch xhpl
On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
<mdidomeni...@gmail.com> wrote:
I'm trying to diagnose an MPI job (in this case xhpl) that fails to
start when the rank count gets fairly high, into the thousands.
The symptom is that the job fires up via slurm, and I can see all the xhpl
processes on the nodes, but it never kicks over to the next process.
My question is, what debugs should I turn on to tell me what the
system might be waiting on?
I've checked a bunch of things, but I'm probably overlooking something
trivial (which is par for me).
I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with
InfiniBand/PSM
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users