To add a little more detail: it looks like xhpl is not actually starting on all nodes when I kick off the mpirun.
Each time I cancel and restart the job, the set of nodes that fail to start changes, so I can't call it a bad node. If I disable InfiniBand with --mca btl self,sm,tcp, on occasion I can get xhpl to actually run, but it's not consistent. I'm going to check my Ethernet network and make sure there are no problems there (could this be an out-of-band (OOB) error with mpirun?). On the nodes that fail to start xhpl, I do see the ORTE daemon process, but nothing in the logs about why it failed to launch xhpl.

On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> I'm trying to diagnose an MPI job (in this case xhpl) that fails to
> start when the rank count gets fairly high, into the thousands.
>
> My symptom is that the job fires up via SLURM, and I can see all the xhpl
> processes on the nodes, but it never kicks over to the next process.
>
> My question is: what debugs should I turn on to tell me what the
> system might be waiting on?
>
> I've checked a bunch of things, but I'm probably overlooking something
> trivial (which is par for me).
>
> I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with InfiniBand/PSM.
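One way to see why the daemons launch but xhpl doesn't might be to raise Open MPI's MCA verbosity on a failing run. A sketch of such an invocation is below; the rank count and binary path are placeholders for the real job parameters, and the specific verbosity levels are just a reasonable starting point, not a prescription:

```shell
# Sketch for Open MPI 1.6: turn up launch-time debugging so the
# launcher reports why xhpl never starts on some nodes.
#   plm_base_verbose         - traces the launch of the remote ORTE daemons
#   odls_base_verbose        - traces each daemon's local spawn of xhpl
#   --leave-session-attached - keeps daemon stderr visible at the console
#   --display-map            - prints the process-to-node map before launch
# "-np 4096" and "./xhpl" are placeholders for the actual job.
mpirun -np 4096 \
       --mca plm_base_verbose 5 \
       --mca odls_base_verbose 5 \
       --leave-session-attached \
       --display-map \
       ./xhpl
```

If the OOB/Ethernet theory holds, --mca oob_base_verbose could be raised the same way to watch the daemon wire-up traffic.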