Thanks!
On Fri, Sep 26, 2014 at 12:56 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:
> Ralph,
>
> i just committed r32799 in order to fix this issue.
> i CMR'ed it (#4923) and set the target for 1.8.4
>
> Cheers,
>
> Gilles
>
>
On 2014/09/23 22:55, Ralph Castain wrote:
Thanks! I won't have time to work on it this week, but appreciate your effort.
Also, thanks for clarifying the race condition vis-à-vis 1.8 - I agree it is not a
blocker for that release.
Ralph
On Sep 22, 2014, at 4:49 PM, Gilles Gouaillardet wrote:
Ralph,
here is the patch i am using so far.
i will resume working on this from Wednesday (there is at least one
remaining race condition) unless you have the time to take care of it
today.
so far, the race condition has only been observed in real life with the
grpcomm/rcd module, and this is
Gilles - please let me know if/when you think you'll do this. I'm debating
about adding it to 1.8.3, but don't want to delay that release too long.
Alternatively, I can take care of it if you don't have time (I'm asking if you
can do it solely because you have the reproducer).
On Sep 21, 2014,
Sounds fine with me - please go ahead, and thanks
On Sep 20, 2014, at 10:26 PM, Gilles Gouaillardet
wrote:
Thanks for the pointer, George!
On Sat, Sep 20, 2014 at 5:46 AM, George Bosilca wrote:
> Or copy the handshake protocol design of the TCP BTL...
>
>
the main difference between oob/tcp and btl/tcp is the way we resolve the
situation in which two processes send their first message to each other at
the same time.
Or copy the handshake protocol design of the TCP BTL...
George.
On Fri, Sep 19, 2014 at 6:23 PM, Ralph Castain wrote:
You know, I'm almost beginning to dread opening my email in the morning for
fear of seeing another "race condition" subject line! :-)
I think the correct answer here is that orted 3 should be entering "retry" when
it sees the peer state change to "closed", regardless of what happened in the
sen
Ralph,
let me detail the new race condition.
orted 2 and orted 3 are not connected to each other, and each sends a
message to the other.
both orteds call send_process (which sets peer->state =
MCA_OOB_TCP_PEER_CONNECTING)
they both end up calling mca_oob_tcp_peer_try_connect
now if orted 3 calls mca_oob_tcp
Ralph,
i found another race condition.
in a very specific scenario, vpid 3 is in the MCA_OOB_TCP_CLOSED state,
and processes data from the socket received from vpid 2
vpid 3 is in the MCA_OOB_TCP_CLOSED state because vpid 2 called retry()
and closed both of its sockets to vpid 3
vpid 3 read the ack
The patch looks fine to me - please go ahead and apply it. Thanks!
On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet
wrote:
Ralph,
yes and no ...
mpi hello world with four nodes can be used to reproduce the issue,
you can increase the likelihood of producing the race condition by hacking
./opal/mca/event/libevent2021/libevent/poll.c
and replacing
    i = random() % nfds;
with
    if (nfds < 2) {
        i =
Do you have a reproducer you can share for testing this? I'm unable to get it
to happen on my machine, but maybe you have a test code that triggers it so I
can continue debugging.
Ralph
On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
wrote:
Thanks Ralph,
this is much better but there is still a bug:
with the very same scenario i described earlier, vpid 2 does not send
its message to vpid 3 once the connection has been established.
i tried to debug it but i have been pretty unsuccessful so far...
vpid 2 calls tcp_peer_connected and
Hi Gilles
I took a crack at solving this in r32744 - CMR'd it for 1.8.3 and assigned it to
you for review. Give it a try and let me know if I (hopefully) got it.
The approach we have used in the past is to have both sides close their
connections, and then have the higher vpid retry while the low
Ralph,
here is the full description of a race condition in oob/tcp that i very briefly
mentioned in a previous post:
the race condition can occur when two unconnected orteds try to send a
message to each other for the first time and at the same time.
that can occur when running mpi helloworld on 4