Re: [OMPI users] libevent hangs on app finalize stage

Leonid Thu, 15 Jan 2015 13:26:53 -0500 (EST)

Hi Ralph.

Of course that may indicate an issue with the custom compiler, but given that it also fails with gcc once a delay is inserted, I still think it is an OMPI bug, since such a delay could be caused by the operating system at that exact point.

For me, simply commenting out "base->event_gotterm = base->event_break = 0;" seems to do the trick, but I am not completely sure that it won't cause other problems.

I've tried to update my master branch to the latest version (including your fix), but now it just crashes for me on *all* benchmarks that I am trying (both with gcc and our compiler).


On 15.01.2015 18:57, Ralph Castain wrote:

Thought about this some more and realized that the orte progress engine wasn’t 
using the opal_progress_thread support functions, which include a “break” event 
to kick us out of just such problems. So I changed it on the master. From your 
citing of libevent 2.0.22, I believe that must be where you are working, yes?

If so, give the changed version a try and see if your problem is resolved.

On Jan 15, 2015, at 12:55 AM, Ralph Castain <r...@open-mpi.org> wrote:

Given that you could only reproduce it with either your custom compiler or by 
forcibly introducing a delay, is this indicating an issue with the custom 
compiler? It does seem strange that we don't see this anywhere else, given the 
number of times that code gets run.

Only alternative solution I can think of would be to push the finalize request 
into the event loop, and thus execute the loopbreak from within an event. You 
might try and see if that solves the problem.

On Jan 14, 2015, at 11:54 PM, Leonid <lchis...@pathscale.com> wrote:

Hi all.

I believe there is a bug in event_base_loop() function from file event.c 
(opal/mca/event/libevent2022/libevent/).

Consider the case when the application is about to be finalized and both 
event_base_loop() and event_base_loopbreak() are called at the same time in 
parallel threads.

Then if event_base_loopbreak() happens to acquire the lock first, it will set 
"event_base->event_break = 1", but won't send any signal to the event loop, because 
the loop has not started yet.

After that, event_base_loop() will acquire the lock and will clear event_break flag with the 
following statement: "base->event_gotterm = base->event_break = 0;". Then it 
will go into polling with timeout = -1 and thus block forever.
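
The interleaving above can be replayed with a small single-threaded model. This is only a sketch of the flag handling as described in this message, not libevent code: struct model_base and the model_* functions are hypothetical stand-ins, and the lock and the actual polling are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two flags kept in struct event_base. */
struct model_base {
    int event_break;
    int event_gotterm;
};

/* Models event_base_loopbreak() called while no loop is running:
 * it records the request, but there is nothing to wake up. */
static void model_loopbreak(struct model_base *b)
{
    b->event_break = 1;
}

/* Models the entry of event_base_loop(). Returns true when the loop
 * would go on to poll with timeout = -1, i.e. block indefinitely. */
static bool model_loop_would_block(struct model_base *b, bool reset_on_entry)
{
    if (reset_on_entry) {
        /* The statement quoted above. */
        b->event_gotterm = b->event_break = 0;
    }
    return !(b->event_break || b->event_gotterm);
}

/* Replays the interleaving: loopbreak first, loop entry second. */
static void model_demo(void)
{
    struct model_base with_reset = {0, 0};
    model_loopbreak(&with_reset);
    /* The break request is wiped on entry, so the loop would block. */
    assert(model_loop_would_block(&with_reset, true));

    struct model_base without_reset = {0, 0};
    model_loopbreak(&without_reset);
    /* Without the reset the request is still visible on entry. */
    assert(!model_loop_would_block(&without_reset, false));
}
```

Calling model_demo() passes both assertions: with the reset on entry the earlier break request is lost, and without it the request is still seen. The model says nothing about whether dropping the reset is safe for other callers of event_base_loop().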

This issue was reproduced with a custom compiler (using the Lulesh benchmark on an x86 
4-core PC), but I can also reproduce it with GCC (on almost 
any benchmark, on the same hardware) by adding a delay to the 
orte_progress_thread_engine() function:

static void* orte_progress_thread_engine(opal_object_t *obj)
{
   while (orte_event_base_active) {
     /* add sleep to give orte_ess_base_app_finalize() time to set the
        orte_event_base_active flag to false */
     usleep(1000);
     opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
   }
   return OPAL_THREAD_CANCELLED;
}

I am not completely sure what the best fix for the described problem would be.



_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/01/26181.php

Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/01/26185.php
