The test program is available here:
http://code.google.com/p/pypar/source/browse/source/mpi_test.c

Hopefully, someone can help us troubleshoot why communications stop when
multiple nodes are involved and CPU usage goes to 100% for as long as we
leave the program running.
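
Would it be a reasonable experiment to pin Open MPI to its TCP transport on a
single interface, along the lines below? (eth0 is a placeholder for whichever
NIC the cluster actually uses.)

mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 \
    --mca btl tcp,self --mca btl_tcp_if_include eth0 a.out

If that run completes, the problem presumably involves transport or interface
selection rather than the test program itself.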

Many thanks
Ole Nielsen


---------- Forwarded message ----------
From: Ole Nielsen <ole.moller.niel...@gmail.com>
Date: Mon, Sep 19, 2011 at 3:39 PM
Subject: Re: MPI hangs on multiple nodes
To: us...@open-mpi.org


Further to the posting below, I can report that the test program (attached -
this time correctly) keeps chewing up CPU time on both compute nodes for as
long as I care to let it run.
It would appear that the processes are stuck in MPI_Recv, which is the next
call after the print statements in the test program.
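
For anyone who cannot fetch the link above, the test is essentially the ring
pass sketched below. Note this is my reconstruction from the output, not the
actual mpi_test.c: the buffer size, tag and exact message contents are
assumptions.

/* Minimal sketch of a ring-pass test like mpi_test.c.
   Reconstructed from the output; BUF_LEN and TAG are assumptions.
   Assumes at least 2 processes. */
#include <mpi.h>
#include <stdio.h>

#define NUM_RUNS 3   /* "Test repeated 3 times for reliability" */
#define BUF_LEN 64   /* assumed message size */
#define TAG 0

int main(int argc, char *argv[])
{
    int rank, size, run, namelen;
    char buf[BUF_LEN];
    char node[MPI_MAX_PROCESSOR_NAME];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(node, &namelen);

    if (rank == 0) {
        printf("Number of processes = %d\n", size);
        printf("Test repeated %d times for reliability\n", NUM_RUNS);
    }
    printf("I am process %d on node %s\n", rank, node);

    for (run = 1; run <= NUM_RUNS; run++) {
        if (rank == 0) {
            /* Rank 0 starts the ring, then waits for it to come back. */
            printf("Run %d of %d\n", run, NUM_RUNS);
            snprintf(buf, BUF_LEN, "Message from P%d", rank);
            printf("P0: Sending to P1\n");
            MPI_Send(buf, BUF_LEN, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
            printf("P0: Waiting to receive from P%d\n", size - 1);
            MPI_Recv(buf, BUF_LEN, MPI_CHAR, size - 1, TAG,
                     MPI_COMM_WORLD, &status);
            printf("P0: Received from P%d\n", size - 1);
        } else {
            /* Receive from the previous rank, pass on to the next. */
            printf("P%d: Waiting to receive from P%d\n", rank, rank - 1);
            MPI_Recv(buf, BUF_LEN, MPI_CHAR, rank - 1, TAG,
                     MPI_COMM_WORLD, &status);
            printf("P%d: Sending to P%d\n", rank, (rank + 1) % size);
            MPI_Send(buf, BUF_LEN, MPI_CHAR, (rank + 1) % size, TAG,
                     MPI_COMM_WORLD);
        }
    }
    printf("P%d: Done\n", rank);
    MPI_Finalize();
    return 0;
}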

Has anyone else seen this behavior, or can anyone give me a hint on how to
troubleshoot it?
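
In case it is useful: attaching gdb to one of the busy a.out processes on a
compute node should show where each rank is stuck. If I understand correctly,
100% CPU on its own may be expected, since Open MPI busy-polls its progress
engine while blocked in MPI_Recv:

node17$ gdb -p $(pgrep -n a.out)
(gdb) bt        # backtrace should bottom out in MPI_Recv / opal_progress
(gdb) detach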

Cheers and thanks
Ole Nielsen

Output:

nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host
node17,node18 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node18
P2: Waiting to receive from to P1
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 3 on node node18
P3: Waiting to receive from to P2
P0: Waiting to receive from P3

P1: Sending to to P2
P1: Waiting to receive from to P0
P2: Sending to to P3

P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P3: Sending to to P0

P3: Waiting to receive from to P2
P2: Waiting to receive from to P1
P1: Sending to to P2
P0: Waiting to receive from P3

On Mon, Sep 19, 2011 at 11:04 AM, Ole Nielsen
<ole.moller.niel...@gmail.com> wrote:

>
> Hi all
>
> We have been using Open MPI for many years with Ubuntu on our 20-node
> cluster. Each node has two quad-core CPUs, so we usually run up to 8
> processes per node, for a maximum of 160 processes.
>
> However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3
> and have come across a strange behavior where MPI programs run perfectly
> well when confined to one node but hang during communication across
> multiple nodes. We have no idea why and would like some help in debugging
> this. A small MPI test program is attached and typical output is shown below.
>
> Hope someone can help us
> Cheers and thanks
> Ole Nielsen
>
> -------------------- Test output across two nodes (This one hangs)
> --------------------------
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts
> --host node17,node18 --npernode 2 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 2 on node node18
> P2: Waiting to receive from to P1
> I am process 3 on node node18
> P3: Waiting to receive from to P2
> P1: Sending to to P2
>
>
> -------------------- Test output within one node (This one is OK)
> --------------------------
> nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts
> --host node17 --npernode 4 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 2 on node node17
> P2: Waiting to receive from to P1
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 3 on node node17
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P2: Waiting to receive from to P1
> P3: Sending to to P0
> P0: Received from to P3
> Run 2 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P3: Sending to to P0
> P2: Waiting to receive from to P1
> P0: Received from to P3
> Run 3 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Done
> P2: Done
> P3: Sending to to P0
> P0: Received from to P3
> P0: Done
> P3: Done
