On Wed, Jan 22, 2014 at 2:01 PM, Roger Meier <[email protected]> wrote:
> You need to catch the IOException thrown by TNonblockingSocket during read
> within your application, see here:
>
> https://git-wip-us.apache.org/repos/asf/thrift/repo?p=thrift.git;a=blob;f=lib/java/src/org/apache/thrift/transport/TNonblockingSocket.java;h=482bd149ab0a993e90315e4f719d0903c89ac1f0;hb=HEAD#l140
>
> The Thrift library does not know what to do on network issues or similar
> issues that can cause a read to fail within your environment.

i don't know how i can catch the exception on the server side since the error
is thrown outside any code path i have control over. the try/catch blocks i
have in my remote methods never see these network/timeout errors.

the clients are too numerous and in too many different languages to fix it on
their side, and i submit that the server should be able to recover from a
misbehaving client.
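(to illustrate -- "Lookup", LookupHandler and doLookup() below are made-up
placeholder names, not our actual service -- a try/catch inside a handler
method only covers code that runs after the request frame has already been
read and decoded; the read-side IOException in the quoted stack trace below
is raised in a selector thread before the handler is ever invoked, so it can
never land in this block:)

    import org.apache.thrift.TException;

    // hypothetical handler for a made-up generated service "Lookup"
    public class LookupHandler implements Lookup.Iface {
        @Override
        public String get(String key) throws TException {
            try {
                return doLookup(key);   // application logic
            } catch (Exception e) {
                // only failures raised inside this method body arrive here;
                // socket read errors ("Connection reset by peer") happen in
                // the server's selector thread, outside this code path
                return "";
            }
        }

        private String doLookup(String key) {
            return key;   // stand-in for the real lookup logic
        }
    }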
~j

> ;-r
>
> -----Original Message-----
> From: Jules Cisek [mailto:[email protected]]
> Sent: Wednesday, 22 January 2014 21:10
> To: [email protected]
> Subject: Re: non-blocking servers are leaking sockets
>
> this service actually needs to respond in under 100ms (and usually does in
> less than 20) so a short delay is just not possible.
>
> on the server, i see a lot of this in the logs:
>
> 14/01/22 19:15:27 WARN Thread-3 server.TThreadedSelectorServer: Got an IOException in internalRead!
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>         at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.select(TThreadedSelectorServer.java:576)
>         at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.run(TThreadedSelectorServer.java:536)
>
> (note that these resets happen when the async client doesn't get a response
> from the server in the time set using client.setTimeout(m), which in our
> case can be quite often, and we're ok with that)
>
> i'm not sure why the thrift library feels it's necessary to log this stuff,
> since clients drop connections all the time and should be expected to, and
> frankly it makes me think that somehow this common error is not being
> properly handled (although looking through the code it does look like
> eventually the SocketChannel gets close()'ed)
>
> ~j
>
> On Mon, Jan 20, 2014 at 12:15 PM, Sammons, Mark <[email protected]> wrote:
>
> > Hi, Jules.
> >
> > I'm not sure my problems are completely analogous to yours, but I had
> > a situation where a client program making many short calls to a remote
> > thrift server was getting a "no route to host" exception after some
> > number of calls, and it appeared to be due to slow release of closed
> > sockets. I found that adding a short (20ms) delay between calls
> > resolved the problem.
> >
> > I realize this is not exactly a solution, but it has at least allowed
> > me to keep working...
> >
> > Regards,
> >
> > Mark
> >
> > ________________________________________
> > From: Jules Cisek [[email protected]]
> > Sent: Monday, January 20, 2014 12:39 PM
> > To: [email protected]
> > Subject: non-blocking servers are leaking sockets
> >
> > i'm running java TThreadedSelectorServer and THsHaServer based servers
> > and both seem to be leaking sockets (thrift 0.9.0)
> >
> > googling around for answers i keep running into
> > https://issues.apache.org/jira/browse/THRIFT-1653 which puts the blame
> > on the TCP config on the server while acknowledging that perhaps a
> > problem in the application layer does exist (see the last entry)
> >
> > i prefer not to mess with the TCP config on the machine because it is
> > used for various tasks; also, i did not have these issues with a
> > TThreadPoolServer and a TSocket (blocking + TBufferedTransport) or any
> > non-thrift server on the same machine.
> >
> > what happens is i get a bunch of TCP connections in a CLOSE_WAIT state
> > and these remain in that state indefinitely. but what is even more
> > concerning, i get many sockets that don't show up in netstat at all
> > and only lsof can show me that they exist. on Linux lsof shows them
> > as "can't identify protocol". according to
> > https://idea.popcount.org/2012-12-09-lsof-cant-identify-protocol/
> > these sockets are in a "half closed state" and the linux kernel has no
> > idea what to do with them.
> >
> > i'm pretty sure there's a problem with misbehaving clients, but the
> > server should not leak resources because of a client-side bug.
> >
> > my only recourse is to run a cronjob that looks at the lsof output and
> > restarts the server whenever the socket count gets dangerously close
> > to "too many open files" (8192 in my case)
> >
> > any ideas?
> >
> > --
> > jules cisek | [email protected]
>
> --
> jules cisek | [email protected]

--
jules cisek | [email protected]
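(for context, a minimal sketch of how a TThreadedSelectorServer like the one
described above is typically wired up; "Lookup", LookupHandler, the port and
the thread counts are placeholders, not the actual configuration from this
thread:)

    import org.apache.thrift.TProcessor;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.server.TServer;
    import org.apache.thrift.server.TThreadedSelectorServer;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TNonblockingServerSocket;
    import org.apache.thrift.transport.TNonblockingServerTransport;
    import org.apache.thrift.transport.TTransportException;

    public class SelectorServerMain {
        public static void main(String[] argv) throws TTransportException {
            // "Lookup" / LookupHandler are placeholder names for the
            // thrift-generated service and its handler implementation
            TProcessor processor = new Lookup.Processor<Lookup.Iface>(new LookupHandler());

            TNonblockingServerTransport transport = new TNonblockingServerSocket(9090);
            TThreadedSelectorServer.Args serverArgs = new TThreadedSelectorServer.Args(transport)
                    .processor(processor)
                    .transportFactory(new TFramedTransport.Factory())  // nonblocking servers use framed transport
                    .protocolFactory(new TBinaryProtocol.Factory())
                    .selectorThreads(2)
                    .workerThreads(16);

            // socket reads/writes (where "Got an IOException in internalRead!"
            // is logged and the offending channel is eventually closed) run in
            // the selector threads, not in handler code
            TServer server = new TThreadedSelectorServer(serverArgs);
            server.serve();
        }
    }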
