We found a locking error in vader - this has been fixed in the OMPI master and will be in the 1.8.5 nightly tarball tomorrow.
Thanks!
Ralph

> On Apr 9, 2015, at 1:26 PM, Thomas Klimpel <jacques.gent...@gmail.com> wrote:
>
> I tried 1.8.5rc1 now. It behaves very similarly to 1.8.4 from my point of
> view (and completely differently from 1.6.5). The warning
>
>   [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one
>   event_base_loop can run on each event_base at once.
>
> is still there.
>
> It's easy for me to reproduce a deadlock with both 1.8.4 and 1.8.5rc1. With
> 1.8.5rc1, I sometimes even get the deadlock without the warning. The
> following seems crucial for reproducing the deadlock:
>
> 1) start a worker on the same node as the master
> 2) chop big messages into 1k blocks. With 2k blocks, the deadlocks become
>    rarer, and with 4k blocks (or no chopping at all), the deadlocks seem to
>    be gone.
>
> The deadlock happens even with a single worker:
>
> #0 0x000000363f20e054 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1 0x000000363f209388 in _L_lock_854 () from /lib64/libpthread.so.0
> #2 0x000000363f209257 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3 0x00007f9901d47343 in mca_btl_vader_component_progress () from /homes/data/public/Development/3rdParty/install/openmpi-1.8.5rc1/Linux-x86_64-redhat.6.3/M64/lib/openmpi/mca_btl_vader.so
> #4 0x00007f9910a9b49a in opal_progress () from /homes/data/public/Development/3rdParty/install/openmpi-1.8.5rc1/Linux-x86_64-redhat.6.3/M64/lib/libopen-pal.so.6
> #5 0x00007f990170594d in mca_pml_ob1_send () from /homes/data/public/Development/3rdParty/install/openmpi-1.8.5rc1/Linux-x86_64-redhat.6.3/M64/lib/openmpi/mca_pml_ob1.so
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26662.php