Hi Jules,

Are you able to provide a test case for this?
Did you test with a try/catch around the socketChannel_.read() call in
TNonblockingSocket.java, line 141? You can catch a NotYetConnectedException
there and do proper error handling.
http://openjdk.java.net/projects/nio/javadoc/java/nio/channels/SocketChannel.html

Could you try this?

-roger
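For illustration, a rough sketch of the kind of try/catch meant above. This is
hypothetical code, not the actual Thrift source: the socketChannel_ field name
is taken from the suggestion, and the surrounding class is made up for the
example.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.NotYetConnectedException;
    import java.nio.channels.SocketChannel;

    // Hypothetical sketch only -- not the actual TNonblockingSocket code.
    class NonblockingReadSketch {
        private final SocketChannel socketChannel_;  // name borrowed from the suggestion above

        NonblockingReadSketch(SocketChannel channel) {
            socketChannel_ = channel;
        }

        // Returns the number of bytes read, or -1 on end-of-stream.
        int read(ByteBuffer buffer) throws IOException {
            try {
                return socketChannel_.read(buffer);
            } catch (NotYetConnectedException e) {
                // The channel was registered with the selector but never finished
                // connecting; close it so the descriptor is released instead of
                // letting the exception escape the selector thread.
                socketChannel_.close();
                throw new IOException("read on a channel that never connected", e);
            } catch (IOException e) {
                // "Connection reset by peer" and friends land here; close the
                // channel before rethrowing so the socket is not leaked.
                socketChannel_.close();
                throw e;
            }
        }
    }

Whether this is enough depends on where the frame buffer cleanup happens, so
treat it as a starting point rather than a fix.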
-----Original Message-----
From: Jules Cisek [mailto:[email protected]]
Sent: Mittwoch, 22. Januar 2014 23:33
To: [email protected]
Subject: Re: non-blocking servers are leaking sockets

On Wed, Jan 22, 2014 at 2:01 PM, Roger Meier <[email protected]> wrote:

> You need to catch the IOException that was thrown by TNonblockingSocket
> during read within your application, see here:
>
> https://git-wip-us.apache.org/repos/asf/thrift/repo?p=thrift.git;a=blob;f=lib/java/src/org/apache/thrift/transport/TNonblockingSocket.java;h=482bd149ab0a993e90315e4f719d0903c89ac1f0;hb=HEAD#l140
>
> Thrift library does not know what to do on network issues or similar
> issues that can cause a read to fail within your environment.

i don't know how i can catch the exception on the server side since the
error is thrown outside any code path i have control over. the try/catch
blocks i have in my remote methods never see these network/timeout errors.

the clients are too numerous and in too many different languages to fix it
on their side. and i submit that the server should be able to recover from
a misbehaving client.

~j

> ;-r
>
> -----Original Message-----
> From: Jules Cisek [mailto:[email protected]]
> Sent: Mittwoch, 22. Januar 2014 21:10
> To: [email protected]
> Subject: Re: non-blocking servers are leaking sockets
>
> this service actually needs to respond in under 100ms (and usually does
> in less than 20) so a short delay is just not possible.
>
> on the server, i see a lot of this in the logs:
>
> 14/01/22 19:15:27 WARN Thread-3 server.TThreadedSelectorServer: Got an
> IOException in internalRead!
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>         at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.select(TThreadedSelectorServer.java:576)
>         at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.run(TThreadedSelectorServer.java:536)
>
> (note that these resets happen when the async client doesn't get a
> response from the server in the time set using client.setTimeout(m)
> which in our case can be quite often and we're ok with that)
>
> i'm not sure why the thrift library feels it's necessary to log this
> stuff since clients drop connections all the time and should be expected
> to and frankly it makes me think that somehow this common error is not
> being properly handled (although looking through the code it does look
> like eventually the SocketChannel gets close()'ed)
>
> ~j
>
>
> On Mon, Jan 20, 2014 at 12:15 PM, Sammons, Mark <[email protected]> wrote:
>
> > Hi, Jules.
> >
> > I'm not sure my problems are completely analogous to yours, but I had a
> > situation where a client program making many short calls to a remote
> > thrift server was getting a "no route to host" exception after some
> > number of calls, and it appeared to be due to slow release of closed
> > sockets. I found that adding a short (20ms) delay between calls
> > resolved the problem.
> >
> > I realize this is not exactly a solution, but it has at least allowed
> > me to keep working...
> >
> > Regards,
> >
> > Mark
> >
> > ________________________________________
> > From: Jules Cisek [[email protected]]
> > Sent: Monday, January 20, 2014 12:39 PM
> > To: [email protected]
> > Subject: non-blocking servers are leaking sockets
> >
> > i'm running java TThreadedSelectorServer and THsHaServer based servers
> > and both seem to be leaking sockets (thrift 0.9.0)
> >
> > googling around searching for answers i keep running into
> > https://issues.apache.org/jira/browse/THRIFT-1653 which puts the blame
> > on the TCP config on the server while acknowledging that perhaps a
> > problem in the application layer does exist (see last entry)
> >
> > i prefer not to mess with the TCP config on the machine because it is
> > used for various tasks, also i did not have these issues with a
> > TThreadPoolServer and a TSocket (blocking + TBufferedTransport) or any
> > non-thrift server on the same machine.
> >
> > what happens is i get a bunch of TCP connections in a CLOSE_WAIT state
> > and these remain in that state indefinitely. but what is even more
> > concerning, i get many sockets that don't show up in netstat at all and
> > only lsof can show me that they exist. on Linux lsof shows them as
> > "can't identify protocol".
> > according to
> > https://idea.popcount.org/2012-12-09-lsof-cant-identify-protocol/ these
> > sockets are in a "half closed state" and the linux kernel has no idea
> > what to do with them.
> >
> > i'm pretty sure there's a problem with misbehaving clients, but the
> > server should not leak resources because of a client-side bug.
> >
> > my only recourse is to run a cronjob that looks at the lsof output and
> > restarts the server whenever the socket count gets dangerously close to
> > "too many open files" (8192 in my case)
> >
> > any ideas?
> >
> > --
> > jules cisek | [email protected]
>
>
> --
> jules cisek | [email protected]


--
jules cisek | [email protected]
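(Aside: as an alternative to parsing lsof output from a cronjob, the same
check can be done from inside the JVM. The following is a rough, untested
sketch, not part of Thrift; it assumes a HotSpot/OpenJDK JVM on Linux, where
the platform OperatingSystemMXBean implements
com.sun.management.UnixOperatingSystemMXBean.)

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    import com.sun.management.UnixOperatingSystemMXBean;

    // Rough sketch: periodically check how close the process is to its
    // file-descriptor limit, as an in-process stand-in for the lsof cronjob.
    public class FdWatch implements Runnable {
        @Override
        public void run() {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            if (!(os instanceof UnixOperatingSystemMXBean)) {
                System.err.println("open descriptor count not available on this JVM/OS");
                return;
            }
            UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
            long max = unixOs.getMaxFileDescriptorCount();  // e.g. the 8192 limit mentioned above
            while (!Thread.currentThread().isInterrupted()) {
                long open = unixOs.getOpenFileDescriptorCount();
                if (open > max * 9 / 10) {
                    // Past ~90% of the limit: log loudly here, or trigger whatever
                    // restart/alerting the cronjob would otherwise do.
                    System.err.println("open fds: " + open + " of " + max);
                }
                try {
                    Thread.sleep(60 * 1000L);  // check once a minute
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

One thread running this alongside the selector threads would at least surface
the leak before the server hits "too many open files".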
