Hi,

Sorry for the confusion; that was indeed the trunk version I was running.

Here's the same problem using

http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.bz2

command-line:

../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 11
./bin/mpi_test

back trace on the receiving node (rank 0):

(gdb) bt
#0  0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
#1  0x00007fa004f43a4b in epoll_dispatch ()
   from
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#2  0x00007fa004f4b5fa in opal_event_base_loop ()
   from
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#3  0x00007fa004f1ce69 in opal_progress ()
   from
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#4  0x00007f9ffe69be95 in mca_pml_ob1_recv ()
   from
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
#5  0x00007fa004ebb35c in PMPI_Recv ()
   from
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800,
buf=0x7fff8f5cbb50, count=1, datatype=..., source=29,
    tag=100, status=...)
    at
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
#7  0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
    at
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
(gdb)

back trace on the sending node:

(gdb) bt
#0  0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
#1  0x00007fcce2f1ea4b in epoll_dispatch ()
   from
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#2  0x00007fcce2f265fa in opal_event_base_loop ()
   from
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#3  0x00007fcce2ef7e69 in opal_progress ()
   from
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#4  0x00007fccdc677b1d in mca_pml_ob1_send ()
   from
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
#5  0x00007fcce2e9874f in PMPI_Send ()
   from
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#6  0x000000000040adda in MPI::Comm::Send (this=0x612800,
buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
    at
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
#7  0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
    at
/wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
(gdb)

My mpi_test source is attached (appended at the end of this message) for reference.

thanks,
John


On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <r...@open-mpi.org> wrote:

> You clearly have an issue with version confusion. The file cited in your
> warning:
>
> > [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
>
> does not exist in 1.5rc5. It only exists in the developer's trunk at this
> time. Check to ensure you have the right paths set, blow away the install
> area (in case you have multiple versions installed on top of each other),
> etc.
>
>
>
> On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
>
> > Hi All,
> > I am new to Open MPI and have encountered an issue using pre-release
> 1.5rc5 with a simple MPI test code (see attached).  In this test, nodes 1 to
> n each send a random number to node 0, and node 0 sums all numbers received.
> >
> > This code works fine on 1 machine with any number of nodes, and on 3
> machines running 10 nodes per machine, but when we try to run 11 nodes per
> machine this warning appears:
> >
> > [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
> >
> > Node 0 (the master summing node) then hangs on its receive, and another
> random node hangs on its send indefinitely.  Below are the back traces:
> >
> > (gdb) bt
> > #0  0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
> > #1  0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0, arg=0x22f91f0,
> tv=0x7fff90f623e0) at epoll.c:215
> > #2  0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0, flags=2)
> at event.c:838
> > #3  0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
> > #4  0x00007f0c604ebb5a in opal_progress () at runtime/opal_progress.c:189
> > #5  0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0,
> m=0x7f0c60800400) at ../../../../opal/threads/
> > condition.h:99
> > #6  0x00007f0c59b79dff in ompi_request_wait_completion (req=0x2538d80) at
> ../../../../ompi/request/request.h:377
> > #7  0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0, count=1,
> datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40,
> >     status=0x7fff90f62668) at pml_ob1_irecv.c:104
> > #8  0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1,
> type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40,
> status=0x7fff90f62668)
> >     at precv.c:78
> > #9  0x000000000040ae14 in MPI::Comm::Recv (this=0x612800,
> buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100, status=...)
> >     at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
> >     at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > (gdb)
> >
> > and for sender is:
> >
> > (gdb) bt
> > #0  0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
> > #1  0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880, arg=0x216c6e0,
> tv=0x7ffffa8a4130) at epoll.c:215
> > #2  0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880, flags=2)
> at event.c:838
> > #3  0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
> > #4  0x00007fedba59c43a in opal_progress () at runtime/opal_progress.c:189
> > #5  0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0,
> m=0x7fedba8ba740)
> >     at ../../../../opal/threads/condition.h:99
> > #6  0x00007fedb279742e in ompi_request_wait_completion (req=0x2392d80) at
> ../../../../ompi/request/request.h:377
> > #7  0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210, count=100,
> datatype=0x612600, dst=0, tag=100,
> >     sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at
> pml_ob1_isend.c:125
> > #8  0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100,
> type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80)
> >     at psend.c:75
> > #9  0x000000000040ae52 in MPI::Comm::Send (this=0x612800, buf=0x23b6210,
> count=100, datatype=..., dest=0, tag=100)
> >     at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
> >     at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:42
> > (gdb)
> >
> > The "deadlock" appears to be a machine dependent race condition,
> different machines fails with different combinations of nodes / machine.
> >
> > Below is my command line for reference:
> >
> > $ ../openmpi_devel/bin/mpirun -x PATH -hostfile hostfiles/hostfile.wgsgX
> -npernode 11 -mca btl tcp,sm,self -mca orte_base_help_aggregate 0 -mca
> opal_debug_locks 1  ./bin/mpi_test
> >
> > The problem does not exist in release 1.4.2 or earlier.  We are testing
> unreleased code for potential knem benefits, but can fall back to 1.4.2 if
> necessary.
> >
> > My apologies in advance if I've missed something basic, thanks for any
> help :)
> >
> > regards,
> > John
> > <test.cpp>
>
>
>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI::Init(argc, argv);
  //MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  //MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  rank = MPI::COMM_WORLD.Get_rank();
  numprocs = MPI::COMM_WORLD.Get_size();
  MPI::Get_processor_name(processor_name, namelen);

  //printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

  // init random seed
  srand(time(NULL)+rank);

  // simple summing test
  if (rank == 0)
  {
    double buf;
    double sum = 0;
    MPI::Status status;
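    // rank 0 collects one value from each of the other ranks in turn;
    // each blocking Recv waits for that specific source and tag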
    for (int i=1; i < MPI::COMM_WORLD.Get_size(); i++)
    {
      MPI::COMM_WORLD.Recv(&buf, 1, MPI_DOUBLE , i, 100, status);
      sum += buf;
    }
    printf("sum: %f\n",sum);
  }
  else
  {
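    // every other rank sends a single random double to rank 0 (blocking standard-mode send)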
    double x = (double)rand()/(double)RAND_MAX;
    MPI::COMM_WORLD.Send(&x, 1, MPI_DOUBLE , 0, 100);
    printf("node: %d x: %f send complete\n",rank,x);
  }

  MPI::Finalize();
}
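
For reference, the MPI::Comm::Recv and MPI::Comm::Send frames in the back
traces above are just the inline C++ wrappers from comm_inln.h around
PMPI_Recv and PMPI_Send, so the same test written against the plain C
bindings would look roughly like the sketch below (an untested, illustrative
equivalent, not the code I actually ran):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Untested sketch: C-bindings equivalent of mpi_test.cpp above, included
   only to map the PMPI_Recv / PMPI_Send frames in the back traces to
   source-level calls. */
int main(int argc, char *argv[])
{
  int rank, numprocs;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

  srand(time(NULL) + rank);

  if (rank == 0)
  {
    double buf, sum = 0;
    MPI_Status status;
    int i;
    for (i = 1; i < numprocs; i++)
    {
      /* blocking receive from a specific source and tag; this is the
         PMPI_Recv frame seen in the receiver's back trace */
      MPI_Recv(&buf, 1, MPI_DOUBLE, i, 100, MPI_COMM_WORLD, &status);
      sum += buf;
    }
    printf("sum: %f\n", sum);
  }
  else
  {
    double x = (double)rand() / (double)RAND_MAX;
    /* blocking standard-mode send; this is the PMPI_Send frame seen in
       the sender's back trace */
    MPI_Send(&x, 1, MPI_DOUBLE, 0, 100, MPI_COMM_WORLD);
    printf("node: %d x: %f send complete\n", rank, x);
  }

  MPI_Finalize();
  return 0;
}

Either version uses the same point-to-point pattern: rank 0 posts one
blocking receive per sender, in rank order, while every other rank issues a
single blocking send to rank 0.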
