adding some additional info: i ran strace on an orted process on a node where xhpl failed to start. i did this after kicking off the mpirun, so i probably missed some output, but this line keeps scrolling:

poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8, events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13, events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16, events=POLLIN}], 9, 1000) = 0 (Timeout)

i didn't see anything useful in /proc under those file descriptors,
but perhaps i missed something i don't know to look for

On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
<mdidomeni...@gmail.com> wrote:
> to add a little more detail, it looks like xhpl is not actually
> starting on all nodes when i kick off the mpirun
>
> each time i cancel and restart the job, the nodes that do not start
> change, so i can't call it a bad node
>
> if i disable infiniband with --mca btl self,sm,tcp on occasion i can
> get xhpl to actually run, but it's not consistent
>
> i'm going to check my ethernet network and make sure there's no
> problems there (could this be an OOB error with mpirun?), on the nodes
> that fail to start xhpl, i do see the orte process, but nothing in the
> logs about why it failed to launch xhpl
>
>
>
> On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
> <mdidomeni...@gmail.com> wrote:
>> I'm trying to diagnose an MPI job (in this case xhpl), that fails to
>> start when the rank count gets fairly high into the thousands.
>>
>> My symptom is the job fires up via slurm, and I can see all the xhpl
>> processes on the nodes, but it never kicks over to the next process.
>>
>> My question is, what debugs should I turn on to tell me what the
>> system might be waiting on?
>>
>> I've checked a bunch of things, but I'm probably overlooking something
>> trivial (which is par for me).
>>
>> I'm using Open MPI 1.6.1, Slurm 2.4.2 on CentOS 6.3, with Infiniband/PSM
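
for the /proc check mentioned above, here is a minimal sketch (Python, Linux only) of listing what each descriptor in the poll() set points to. the helper names describe_fds and socket_inodes are made up for illustration and are not part of Open MPI or Slurm. it only assumes that /proc/<pid>/fd holds one symlink per open descriptor and that sockets show up as socket:[inode]; you need to run it as the process owner or root.

```python
import os
import sys


def describe_fds(pid, proc_root="/proc"):
    """Return {fd_number: link_target} for each open descriptor of pid."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except (OSError, ValueError):
            # The descriptor was closed in the meantime, or the entry
            # is not a numeric fd name; skip it.
            continue
    return targets


def socket_inodes(targets):
    """Pick out socket descriptors and return {fd_number: inode}."""
    prefix = "socket:["
    inodes = {}
    for fd, target in targets.items():
        if target.startswith(prefix) and target.endswith("]"):
            inodes[fd] = int(target[len(prefix):-1])
    return inodes


if __name__ == "__main__":
    # Hypothetical usage: pass the orted pid as the first argument.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    fds = describe_fds(pid)
    for fd in sorted(fds):
        print(fd, fds[fd])
```

the socket inodes it reports can then be matched by hand against the inode column of /proc/net/tcp or /proc/net/unix to see which peer, if any, each polled descriptor is connected to.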