[ 
https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873908#comment-16873908
 ] 

Benno Evers commented on MESOS-9867:
------------------------------------

This was fixed upstream with libevent-2.1.10: 
https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a

> Libevent fd cleanup failure may cause hangs in subsequent tests
> ---------------------------------------------------------------
>
>                 Key: MESOS-9867
>                 URL: https://issues.apache.org/jira/browse/MESOS-9867
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benno Evers
>            Priority: Major
>              Labels: libevent, libprocess, ssl, tls
>
> A listening LibeventSSLSocket will check cryptographic certificate validity 
> during the OpenSSL handshake and afterwards call the `openssl::verify()` 
> function to perform hostname validation and other checks on the client 
> certificate. If these checks fail, the bufferevent is deleted and the 
> connection closed:
> {noformat}
> // libevent_ssl_socket.cpp, accept_SSL_callback()
>           if (verify.isError()) {
>             VLOG(1) << "Failed accept, verification error: " << 
> verify.error();
>             request->promise.fail(verify.error());
>             SSL_free(ssl);
>             bufferevent_free(bev);
>             // TODO(jmlvanre): Clean up for readability. Consider RAII
>             // or constructing the impl earlier.
>             CHECK(request->socket >= 0);
>             Try<Nothing> close = os::close(request->socket);
>             if (close.isError()) {
>               LOG(FATAL)
>                 << "Failed to close socket " << stringify(request->socket)
>                 << ": " << close.error();
>             }
>             delete request;
>             return;
>           }
> {noformat}
> However, when we close the socket fd in the above code, libevent had already 
> registered that file descriptor with epoll() to watch for read and write 
> events on that socket. Since the socket is closed, attempts to remove the 
> corresponding fd from the epoll() structs will fail: (See also: 
> https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)
> {noformat}
> [warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 
> (del); write change was 0 (none): Bad file descriptor
> [warn] Epoll MOD(1) on fd 9 failed.  Old events were 6; read change was 0 
> (none); write change was 2 (del): Bad file descriptor
> {noformat}
> However, that in itself is harmless since the kernel will remove the kernel 
> object that was associated with fd 9 from the data structure associated with 
> that epoll instance in the kernel. So while we get an error attempting to 
> remove fd 9, there is actually nothing left to remove. However, in a case of 
> epoll failure, libprocess does not adjust the number of readers and writers 
> on that file descriptor:
> {noformat}
>         // evmap.c, evmap_io_del()
>         [...]
>         if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
>                return (-1);
>         [...]
>         ctx->nread = nread;
>         ctx->nwrite = nwrite;
> {noformat}
> In the above, ctx is part of an array collecting information for each file 
> descriptor. That still wouldn't be so bad, however libevent also only adds 
> file descriptors to `epoll()` struct the *first* time we attempt to create a 
> read or write event on that file descriptor:
> {noformat}
>         // evmap.c, evmap_io_del()
>         if (ev->ev_events & EV_READ) {
>                 if (++nread == 1)
>                         res |= EV_READ;
>         }
>         if (ev->ev_events & EV_WRITE) {
>                 if (++nwrite == 1)
>                         res |= EV_WRITE;
>         }
>         [...]
>         if (res) {
>                 [...]
>                 if (evsel->add(base, ev->ev_fd,
>                         old, (ev->ev_events & EV_ET) | res, extra) == -1)
>                         return (-1);
>                 [...]
>         }
> {noformat}
> So when the same file descriptor is attempted to be used again by libevent 
> for epoll() polling, the process will hang because reads or writes to that 
> file descriptor are never noticed.
> This can be reproduced for example by running a test where the 
> `verify()`-callback fails on the server side twice in a row: (note, the 
> LIBPROCESS_IP below is set in order to induce a test failure, result may vary 
> on your local network and ssl configuration)
> {noformat}
> LIBPROCESS_IP=127.0.1.1 ./libprocess-tests 
> --gtest_filter="*VerifyCertificate*" --gtest_repeat=2
> {noformat}
> There is a chance that the issue described here is the same as the ominous 
> "issues" described in MESOS-3008, 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to