We found a locking error in vader - this has been fixed in the OMPI master and 
will be in the 1.8.5 nightly tarball tomorrow.

Thanks!
Ralph

> On Apr 9, 2015, at 1:26 PM, Thomas Klimpel <jacques.gent...@gmail.com> wrote:
> 
> I tried 1.8.5rc1 now. It behaves very similarly to 1.8.4 from my point of 
> view (and completely differently from 1.6.5). The warning
> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only one 
> event_base_loop can run on each event_base at once.
> is still there.
> 
> It's easy for me to reproduce a deadlock with both 1.8.4 and 1.8.5rc1. With 
> 1.8.5rc1, I sometimes even get the deadlock without the warning. The 
> following seems crucial for reproducing the deadlock:
> 
> 1) start a worker on the same node as the master
> 2) chop big messages into 1k blocks. With 2k blocks, the deadlocks become 
> rarer, and with 4k blocks (or no chopping at all), the deadlocks seem to be 
> gone.
> 
> The deadlock happens even with a single worker:
> 
> #0  0x000000363f20e054 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x000000363f209388 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x000000363f209257 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00007f9901d47343 in mca_btl_vader_component_progress () from 
> /homes/data/public/Development/3rdParty/install/openmpi-1.8.5rc1/Linux-x86_64-redhat.6.3/M64/lib/openmpi/mca_btl_vader.so
> #4  0x00007f9910a9b49a in opal_progress () from 
> /homes/data/public/Development/3rdParty/install/openmpi-1.8.5rc1/Linux-x86_64-redhat.6.3/M64/lib/libopen-pal.so.6
> #5  0x00007f990170594d in mca_pml_ob1_send () from 
> /homes/data/public/Development/3rdParty/install/openmpi-1.8.5rc1/Linux-x86_64-redhat.6.3/M64/lib/openmpi/mca_pml_ob1.so
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26662.php
