Re: accept socket occasional read error

2020-02-18 Thread Bo Lorentsen
On 18.02.2020 12.24, Marc Lehmann wrote:
> On Sun, Feb 16, 2020 at 08:43:20AM +0100, Bo Lorentsen  wrote:
>> I have a callback that gets called on every accept (where I loop on EAGAIN
>> to empty the backlog), and for each new socket I make a protocol structure
> You mean loop while accept is successful?

Sorry, I guess I meant the other way around. This is the central part of
the handler for the listen socket's read event, sent by libev:

    struct sockaddr saddr;

    while( true ) {
        socklen_t len = sizeof( saddr ); // accept overwrites len, so reset it every time
        int sock = ::accept( sfd, &saddr, &len );

        if( sock < 0 ) {
            if( errno == ECONNABORTED ) // just get rid of this backlog entry
                continue;

            if( errno == EWOULDBLOCK || errno == EAGAIN || errno == EINTR )
                break; // let's wait for the next read event

            throw system_error( error_code( errno, std::generic_category() ) );
        }

        // ... make the protocol structure for sock ...
    }

So, of course, on EAGAIN++ we just wait for libev to send another read
event; that is what the break here does (see the test code for more
context if needed).

>> up to about 20k connections, and concurrency of 500 I get a few sockets
>> that I can't read from, at all. recv returns -1 and EAGAIN, but it never
>> gets any data.
> You should not get readiness events from libev for sockets and then have recv
> return EAGAIN.

No, and this is fortunately not what I experience :-) But I get a socket
from accept, and I try to read from it (a single time) right away,
just to see if it already contains data. Sometimes it does, at
least a bit of it, and sometimes it just gives me EAGAIN++, as expected.
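
A minimal sketch of such a single optimistic read, assuming a POSIX system
that supports MSG_DONTWAIT. The names ReadResult and try_read_once are made
up for illustration and are not taken from the test code. The point is that
a return value of 0 (the client closed the connection) is kept apart from
EAGAIN, since both look alike when all you notice is that no data arrives:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>

// Outcome of one non-blocking read attempt (illustrative names).
enum class ReadResult { Data, WouldBlock, PeerClosed, Error };

// Try to read once without blocking. On Data, *got holds the byte count.
ReadResult try_read_once( int fd, char *buf, std::size_t cap, std::size_t *got ) {
    assert( cap > 0 ); // with cap == 0, recv would return 0 and look like EOF
    *got = 0;

    for( ;; ) {
        ssize_t n = ::recv( fd, buf, cap, MSG_DONTWAIT );

        if( n > 0 ) {
            *got = static_cast<std::size_t>( n );
            return ReadResult::Data;
        }
        if( n == 0 )
            return ReadResult::PeerClosed; // orderly shutdown by the peer
        if( errno == EINTR )
            continue; // interrupted by a signal, just retry
        if( errno == EAGAIN || errno == EWOULDBLOCK )
            return ReadResult::WouldBlock; // nothing yet, wait for the read event
        return ReadResult::Error;
    }
}
```

Only WouldBlock means "start the read watcher and wait"; PeerClosed and Error
both mean the socket will never deliver data and can be cleaned up.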

The problem is that sometimes I get a socket that never gives me any data
(or so it seems), so I ended up writing a small ev timer that "pokes"
this socket if no data was received for more than 10 sec (this is purely
for benchmark purposes, not for real-life usage). It did indeed
still return EAGAIN++ as expected, and no data at all.

I am quite sure libev is not the problem, it is doing a really nice job
and is blisteringly fast. There may be some kind of strange shadow effect,
but to me at least it seems to boil down to me getting a socket from
accept that is valid, but on which no data ever arrives.

The client end is also confused, and simply complains about missing data
(data length error). I may need to reconsider my blind faith in this
benchmark tool if this persists, as in some rare error cases the client
could easily open a socket to my server and not send any data (as a
result of another type of error), and I would end up hunting ghosts :-)

>> Is there something I am not aware of, or have others seen something like
>> this ? I really appreciate any comments, I have tried google and found
>> nothing useful on this specific subject.
> Well, this is not a very specific problem - for example, your code could
> simply be buggy and confuse fds (i.e. read from the wrong fd, move fds
> around and so on). Or it could corrupt data structures (e.g. by modifying
> or freeing in-use watchers). There could also be some network problem,
> caused by an external setup - tcpdump'ing everything and later looking at
> the actual traffic for the socket that failed to work can help find out
> what is going on. netstat/ss can display socket buffers, which can tell
> you where the missing data is, or if there actually is any data waiting
> (for example, you could have wrongly read the data already).

You are absolutely right, it is very likely my code that has a bug,
and this may be a cascading effect I don't understand. I just wrote
here to see if anyone has experience with something like this, or
could maybe spot the error in the test code. It is hard to find people
in the C/C++ world who do things like this :-)

> You could try recompiling libev with e.g. -DEV_VERIFY=2 or even 3 and see
> if that maybe catches an issue (although if what you describe is correct,
> it will probably not catch anything).
I think I will try to make an even simpler implementation, just to pin
down exactly where the problem is located. libev is not the problem here;
my handling is the most likely error, if only I could spot it.
>
> Personally, I'd start with whichever is easier. There should be at least some
> insight to be gained from tcpdump/netstat/EV_VERIFY.
>
Makes sense, and thanks for your insight and time.

Regards

/BL



___
libev mailing list
libev@lists.schmorp.de
http://lists.schmorp.de/mailman/listinfo/libev


Re: accept socket occasional read error

2020-02-18 Thread Marc Lehmann
On Sun, Feb 16, 2020 at 08:43:20AM +0100, Bo Lorentsen  wrote:
> I have a callback that gets called on every accept (where I loop on EAGAIN
> to empty the backlog), and for each new socket I make a protocol structure

You mean loop while accept is successful?

> up to about 20k connections, and concurrency of 500 I get a few sockets
> that I can't read from, at all. recv returns -1 and EAGAIN, but it never
> gets any data.

You should not get readiness events from libev for sockets and then have recv
return EAGAIN.

> Is there something I am not aware of, or have others seen something like
> this ? I really appreciate any comments, I have tried google and found
> nothing useful on this specific subject.

Well, this is not a very specific problem - for example, your code could
simply be buggy and confuse fds (i.e. read from the wrong fd, move fds
around and so on). Or it could corrupt data structures (e.g. by modifying
or freeing in-use watchers). There could also be some network problem,
caused by an external setup - tcpdump'ing everything and later looking at
the actual traffic for the socket that failed to work can help find out
what is going on. netstat/ss can display socket buffers, which can tell
you where the missing data is, or if there actually is any data waiting
(for example, you could have wrongly read the data already).
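
As an in-process complement to netstat/ss, the number of bytes sitting
unread in a socket's receive buffer can also be queried with the FIONREAD
ioctl. This is only a sketch: pending_bytes is a hypothetical helper name,
and it assumes Linux or a BSD, where FIONREAD works on sockets:

```cpp
#include <cassert>
#include <sys/ioctl.h>
#include <sys/socket.h>

// Return the number of bytes queued in the receive buffer of fd,
// or -1 if the ioctl fails (errno is then set by ioctl).
int pending_bytes( int fd ) {
    assert( fd >= 0 );

    int n = 0;
    if( ::ioctl( fd, FIONREAD, &n ) < 0 )
        return -1;

    return n;
}
```

Calling this from a debug timer on a socket that seems stuck gives a quick
hint: a nonzero result means data arrived but the read watcher never fired
or was never started, while zero points towards the client or the network.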

You could try recompiling libev with e.g. -DEV_VERIFY=2 or even 3 and see
if that maybe catches an issue (although if what you describe is correct,
it will probably not catch anything).

Personally, I'd start with whichever is easier. There should be at least some
insight to be gained from tcpdump/netstat/EV_VERIFY.
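
To find the failing connection in a tcpdump capture, it helps to know the
client's address and port for the fd in question. A small sketch that
could be logged right after accept, assuming IPv4 only; peer_endpoint is a
hypothetical helper, not part of libev or of the test code:

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>

// Format the remote IPv4 endpoint of a connected socket as "a.b.c.d:port".
// Returns an empty string on failure or for sockets that are not IPv4.
std::string peer_endpoint( int fd ) {
    assert( fd >= 0 );

    sockaddr_storage ss{};
    socklen_t len = sizeof( ss );
    if( ::getpeername( fd, reinterpret_cast<sockaddr *>( &ss ), &len ) < 0 )
        return "";
    if( ss.ss_family != AF_INET )
        return "";

    const auto *in = reinterpret_cast<const sockaddr_in *>( &ss );
    char ip[INET_ADDRSTRLEN];
    if( !::inet_ntop( AF_INET, &in->sin_addr, ip, sizeof( ip ) ) )
        return "";

    return std::string( ip ) + ":" + std::to_string( ntohs( in->sin_port ) );
}
```

Logging this once at accept time and again when the socket is found to be
stuck would also reveal fd confusion: if the endpoint changed in between,
the fd was closed and reused for another connection.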

-- 
Marc Lehmann  schm...@schmorp.de
The choice of a GNU generation
Deliantra, the free code+content MORPG  http://www.deliantra.net
