Thought about this some more and realized that the orte progress engine wasn’t 
using the opal_progress_thread support functions, which include a “break” event 
to kick us out of just such problems. So I changed it on the master. From your 
citing of libevent 2.0.22, I believe that must be where you are working, yes?

If so, give the changed version a try and see if your problem is resolved.
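
For anyone following along, here is a minimal standalone model of the race described in the quoted report below. It uses only pthreads; mini_loop and its functions are hypothetical names for illustration, not libevent or Open MPI APIs, and the timed wait merely stands in for the "poll with timeout = -1" that would otherwise block forever. It is a sketch under those assumptions, not the change made on master.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-in for an event base: a break flag guarded by a lock. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    int             break_requested;
} mini_loop;

void mini_loop_init(mini_loop *l)
{
    int rc = pthread_mutex_init(&l->lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&l->wake, NULL);
    assert(rc == 0);
    (void)rc;
    l->break_requested = 0;
}

/* Analogue of a loopbreak request: set the flag and wake any waiter. */
void mini_loop_break(mini_loop *l)
{
    pthread_mutex_lock(&l->lock);
    l->break_requested = 1;
    pthread_cond_signal(&l->wake);
    pthread_mutex_unlock(&l->lock);
}

/*
 * Analogue of one loop iteration.
 * clear_on_entry != 0 models the reported behaviour: the flag is reset
 * before waiting, so a break requested earlier is lost.
 * Returns 1 if a break was observed, 0 if the wait timed out
 * (the timeout stands in for blocking forever).
 */
int mini_loop_run(mini_loop *l, int clear_on_entry, long timeout_ms)
{
    struct timespec deadline;
    int rc = 0, saw_break;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&l->lock);
    if (clear_on_entry)
        l->break_requested = 0;          /* the lost-wakeup step */
    while (!l->break_requested && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(&l->wake, &l->lock, &deadline);
    saw_break = l->break_requested;
    l->break_requested = 0;              /* consume the request */
    pthread_mutex_unlock(&l->lock);
    return saw_break;
}
```

In this model, calling mini_loop_break() before mini_loop_run(&l, 1, 50) makes the run time out, because the earlier request was wiped on entry. The same sequence with clear_on_entry set to 0 returns immediately with the break observed.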


> On Jan 15, 2015, at 12:55 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Given that you could only reproduce it with either your custom compiler or by 
> forcibly introducing a delay, does this indicate an issue with the custom 
> compiler? It does seem strange that we don't see this anywhere else, given 
> the number of times that code gets run.
> 
> Only alternative solution I can think of would be to push the finalize 
> request into the event loop, and thus execute the loopbreak from within an 
> event. You might try and see if that solves the problem.
> 
> 
>> On Jan 14, 2015, at 11:54 PM, Leonid <lchis...@pathscale.com> wrote:
>> 
>> Hi all.
>> 
>> I believe there is a bug in the event_base_loop() function in event.c 
>> (opal/mca/event/libevent2022/libevent/).
>> 
>> Consider the case where the application is being finalized and 
>> event_base_loop() and event_base_loopbreak() are called at the same time 
>> from parallel threads.
>> 
>> If event_base_loopbreak() happens to acquire the lock first, it sets 
>> "event_base->event_break = 1" but does not send any signal to the event 
>> loop, because the loop has not started yet.
>> 
>> After that, event_base_loop() acquires the lock and clears the 
>> event_break flag with the following statement: "base->event_gotterm = 
>> base->event_break = 0;". It then goes into polling with timeout = -1 and 
>> thus blocks forever.
>> 
>> This issue was reproduced with a custom compiler (using the Lulesh 
>> benchmark on a 4-core x86 PC), but I can also reproduce it with GCC (on 
>> almost any benchmark, on the same hardware) by adding a delay to the 
>> orte_progress_thread_engine() function:
>> 
>> static void* orte_progress_thread_engine(opal_object_t *obj)
>> {
>>   while (orte_event_base_active) {
>>     /* added sleep: gives orte_ess_base_app_finalize() time to set
>>      * the orte_event_base_active flag to false */
>>     usleep(1000);
>>     opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
>>   }
>>   return OPAL_THREAD_CANCELLED;
>> }
>> 
>> I am not sure what the best fix for the described problem would be.
>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/01/26181.php
> 
