Hi Ralph.
Of course that may indicate an issue with the custom compiler, but given
that it also fails with gcc once a delay is inserted, I still think it is an
OMPI bug, since the operating system could cause such a delay at that
exact point.
For me, simply commenting out "base->event_gotterm = base->event_break =
0;" seems to do the trick, but I am not completely sure that it won't
cause other problems.
I've tried to update my master branch to the latest version (including
your fix) but now it just crashes for me on *all* benchmarks that I am
trying (both with gcc and our compiler).
On 15.01.2015 18:57, Ralph Castain wrote:
Thought about this some more and realized that the orte progress engine wasn’t
using the opal_progress_thread support functions, which include a “break” event
to kick us out of just such problems. So I changed it on the master. From your
citing of libevent 2.0.22, I believe that must be where you are working, yes?
If so, give the changed version a try and see if your problem is resolved.
On Jan 15, 2015, at 12:55 AM, Ralph Castain <r...@open-mpi.org> wrote:
Given that you could only reproduce it with either your custom compiler or by
forcibly introducing a delay, is this indicating an issue with the custom
compiler? It does seem strange that we don't see this anywhere else, given the
number of times that code gets run.
The only alternative solution I can think of would be to push the finalize request
into the event loop, and thus execute the loopbreak from within an event. You
might try that and see if it solves the problem.
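A rough way to picture that suggestion is the toy model below. It is not ORTE or libevent code: toy_base, toy_post(), toy_finalize_cb() and toy_loop() are hypothetical names invented for this sketch, and the only thing it assumes is what the thread describes, namely that the loop resets the break flag on entry. Because the finalize request is queued as an event, the reset cannot erase it, and the break is raised from inside the loop.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

struct toy_base;
typedef void (*toy_cb)(struct toy_base *);

/* Hypothetical stand-in for the parts of an event base that matter here. */
struct toy_base {
    bool   event_break;               /* break flag, reset on loop entry */
    toy_cb pending[TOY_MAX_PENDING];  /* callbacks queued for the loop */
    size_t npending;
};

/* Queue a callback to run inside the loop; returns false if the queue is full. */
static bool toy_post(struct toy_base *b, toy_cb cb)
{
    if (b->npending == TOY_MAX_PENDING)
        return false;
    b->pending[b->npending++] = cb;
    return true;
}

/* The finalize request, executed as an event: break from within the loop. */
static void toy_finalize_cb(struct toy_base *b)
{
    b->event_break = true;
}

/* One pass of the loop. Returns true if it exits because of a break,
 * false if it would go on to a blocking poll. */
static bool toy_loop(struct toy_base *b)
{
    b->event_break = false;           /* reset on entry, as described in the thread */
    for (size_t i = 0; i < b->npending && !b->event_break; i++)
        b->pending[i](b);
    b->npending = 0;                  /* toy simplification: drop anything left over */
    return b->event_break;
}
```

In this model, calling toy_post(&b, toy_finalize_cb) before toy_loop(&b) makes the loop return true, whereas setting b.event_break = true directly before toy_loop(&b) is lost to the reset and the loop returns false.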
On Jan 14, 2015, at 11:54 PM, Leonid <lchis...@pathscale.com> wrote:
Hi all.
I believe there is a bug in the event_base_loop() function in event.c
(opal/mca/event/libevent2022/libevent/).
Consider the case where the application is being finalized and
event_base_loop() and event_base_loopbreak() are called at the same time in
parallel threads.
If event_base_loopbreak() happens to acquire the lock first, it will set
"event_base->event_break = 1", but won't send any signal to the event loop,
because the loop has not started yet.
After that, event_base_loop() will acquire the lock and clear the event_break
flag with the following statement: "base->event_gotterm = base->event_break = 0;".
Then it will go into polling with timeout = -1 and thus block forever.
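The ordering described above can be replayed deterministically with a small single-threaded model. This is only a sketch: toy_base, toy_loopbreak() and toy_loop() are hypothetical names, not libevent functions, and the model assumes nothing beyond the behaviour described in this message (the flag is reset on loop entry, and a break request can only wake a loop that is already running).

```c
#include <stdbool.h>

enum toy_result { TOY_EXITED, TOY_BLOCKED };

/* Hypothetical stand-in for the two pieces of state involved in the race. */
struct toy_base {
    bool event_break;   /* set by a break request */
    bool loop_running;  /* true while the loop is polling */
};

/* Models the break request: set the flag. Returns true if a running loop
 * was there to be woken up, false if the request arrived too early. */
static bool toy_loopbreak(struct toy_base *b)
{
    b->event_break = true;
    return b->loop_running;
}

/* Models one call of the loop. break_during_poll says whether another
 * thread issues the break request while the loop is already polling. */
static enum toy_result toy_loop(struct toy_base *b, bool break_during_poll)
{
    b->event_break = false;        /* the reset on entry */
    b->loop_running = true;
    if (break_during_poll)
        toy_loopbreak(b);          /* arrives after the reset, so it is seen */
    enum toy_result r = b->event_break ? TOY_EXITED : TOY_BLOCKED;
    b->loop_running = false;
    return r;
}
```

Calling toy_loopbreak(&b) first and then toy_loop(&b, false) yields TOY_BLOCKED, which corresponds to the hang; toy_loop(&b, true) on a fresh base yields TOY_EXITED, which is the usual ordering.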
This issue was reproduced with a custom compiler (using the Lulesh benchmark
on an x86 4-core PC), but I can also reproduce it with GCC (on almost any
benchmark and with the same hardware) by adding a delay to the
orte_progress_thread_engine() function:
static void* orte_progress_thread_engine(opal_object_t *obj)
{
    while (orte_event_base_active) {
        /* added sleep: allows orte_ess_base_app_finalize() to set the
         * orte_event_base_active flag to false */
        usleep(1000);
        opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
    }
    return OPAL_THREAD_CANCELLED;
}
I am not completely sure what the best fix for the described problem would be.
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/01/26181.php
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/01/26185.php