adding some additional info

i ran strace on an orted process on a node where xhpl failed to start.
i attached after mpirun had already launched, so i probably missed some
earlier output, but this poll call keeps repeating:

poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
events=POLLIN}], 9, 1000) = 0 (Timeout)

i didn't see anything useful in /proc under those file descriptors,
but perhaps i missed something i don't know to look for
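
For reference, a minimal sketch of one way to see what the polled
descriptors point at, assuming a Linux /proc filesystem and permission to
read the target process's fd directory. The helper name describe_fds is
hypothetical and not part of Open MPI or Slurm; it only resolves the
symlinks under /proc/<pid>/fd.

```python
import os


def describe_fds(pid, fds=None):
    """Return {fd: target} for a process by reading /proc/<pid>/fd.

    pid: process id to inspect (e.g. the orted pid; assumption: Linux).
    fds: optional set of descriptor numbers to restrict the result to.
    """
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        if not name.isdigit():
            continue
        fd = int(name)
        if fds is not None and fd not in fds:
            continue
        try:
            # Each entry is a symlink such as "socket:[12345]",
            # "pipe:[678]" or a regular file path.
            result[fd] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor may have been closed between listdir and readlink.
            continue
    return result


# Demo on the current process; replace os.getpid() with the orted pid and
# pass the descriptors from the poll() call, e.g. {4, 7, 8, 10, 12, 13, 14, 15, 16}.
for fd, target in sorted(describe_fds(os.getpid()).items()):
    print(fd, target)
```

Entries that show up as socket:[inode] can be matched by inode number
against /proc/net/tcp or /proc/net/unix to find the peer, which may help
tell an out-of-band TCP socket from a local pipe.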

On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
<mdidomeni...@gmail.com> wrote:
> to add a little more detail, it looks like xhpl is not actually
> starting on all nodes when i kick off the mpirun
>
> each time i cancel and restart the job, the nodes that do not start
> change, so i can't call it a bad node
>
> if i disable infiniband with --mca btl self,sm,tcp, i can occasionally
> get xhpl to actually run, but it's not consistent
>
> i'm going to check my ethernet network and make sure there are no
> problems there (could this be an OOB error with mpirun?). on the nodes
> that fail to start xhpl, i do see the orted process, but nothing in
> the logs about why it failed to launch xhpl
>
>
>
> On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
> <mdidomeni...@gmail.com> wrote:
>> I'm trying to diagnose an MPI job (in this case xhpl), that fails to
>> start when the rank count gets fairly high into the thousands.
>>
>> My symptom is that the job fires up via slurm, and I can see all the
>> xhpl processes on the nodes, but it never kicks over to the next process.
>>
>> My question is, what debugging output should I turn on to tell me
>> what the system might be waiting on?
>>
>> I've checked a bunch of things, but I'm probably overlooking something
>> trivial (which is par for me).
>>
>> I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with InfiniBand/PSM
