On Wed, Jan 22, 2014 at 2:01 PM, Roger Meier <[email protected]> wrote:
> You need to catch the IOException thrown by TNonblockingSocket during read
> within your application, see here:
>
> https://git-wip-us.apache.org/repos/asf/thrift/repo?p=thrift.git;a=blob;f=lib/java/src/org/apache/thrift/transport/TNonblockingSocket.java;h=482bd149ab0a993e90315e4f719d0903c89ac1f0;hb=HEAD#l140
>
> The Thrift library does not know what to do on network issues or similar
> issues that can cause a read to fail within your environment.

i don't know how i can catch the exception on the server side since the error
is thrown outside any code path i have control over. the try/catch blocks i
have in my remote methods never see these network/timeout errors.

the clients are too numerous and in too many different languages to fix it on
their side, and i submit that the server should be able to recover from a
misbehaving client.
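(to illustrate -- "Lookup", LookupHandler and doLookup() below are made-up
placeholder names, not our actual service -- a try/catch inside a handler
method only covers code that runs after the request frame has already been
read and decoded; the read-side IOException in the quoted stack trace below
is raised in a selector thread before the handler is ever invoked, so it can
never land in this block:)

    import org.apache.thrift.TException;

    // hypothetical handler for a made-up generated service "Lookup"
    public class LookupHandler implements Lookup.Iface {
        @Override
        public String get(String key) throws TException {
            try {
                return doLookup(key);   // application logic
            } catch (Exception e) {
                // only failures raised inside this method body arrive here;
                // socket read errors ("Connection reset by peer") happen in
                // the server's selector thread, outside this code path
                return "";
            }
        }

        private String doLookup(String key) {
            return key;   // stand-in for the real lookup logic
        }
    }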
~j

> ;-r
>
> -----Original Message-----
> From: Jules Cisek [mailto:[email protected]]
> Sent: Wednesday, 22 January 2014 21:10
> To: [email protected]
> Subject: Re: non-blocking servers are leaking sockets
>
> this service actually needs to respond in under 100ms (and usually does in
> less than 20) so a short delay is just not possible.
>
> on the server, i see a lot of this in the logs:
>
> 14/01/22 19:15:27 WARN Thread-3 server.TThreadedSelectorServer: Got an IOException in internalRead!
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>         at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.select(TThreadedSelectorServer.java:576)
>         at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.run(TThreadedSelectorServer.java:536)
>
> (note that these resets happen when the async client doesn't get a response
> from the server in the time set using client.setTimeout(m), which in our
> case can be quite often, and we're ok with that)
>
> i'm not sure why the thrift library feels it's necessary to log this stuff,
> since clients drop connections all the time and should be expected to, and
> frankly it makes me think that somehow this common error is not being
> properly handled (although looking through the code it does look like
> eventually the SocketChannel gets close()'ed)
>
> ~j
>
> On Mon, Jan 20, 2014 at 12:15 PM, Sammons, Mark <[email protected]> wrote:
>
> > Hi, Jules.
> >
> > I'm not sure my problems are completely analogous to yours, but I had
> > a situation where a client program making many short calls to a remote
> > thrift server was getting a "no route to host" exception after some
> > number of calls, and it appeared to be due to slow release of closed
> > sockets. I found that adding a short (20ms) delay between calls
> > resolved the problem.
> >
> > I realize this is not exactly a solution, but it has at least allowed
> > me to keep working...
> >
> > Regards,
> >
> > Mark
> >
> > ________________________________________
> > From: Jules Cisek [[email protected]]
> > Sent: Monday, January 20, 2014 12:39 PM
> > To: [email protected]
> > Subject: non-blocking servers are leaking sockets
> >
> > i'm running java TThreadedSelectorServer and THsHaServer based servers
> > and both seem to be leaking sockets (thrift 0.9.0)
> >
> > googling around for answers i keep running into
> > https://issues.apache.org/jira/browse/THRIFT-1653 which puts the blame
> > on the TCP config on the server while acknowledging that perhaps a
> > problem in the application layer does exist (see the last entry)
> >
> > i prefer not to mess with the TCP config on the machine because it is
> > used for various tasks; also, i did not have these issues with a
> > TThreadPoolServer and a TSocket (blocking + TBufferedTransport) or any
> > non-thrift server on the same machine.
> >
> > what happens is i get a bunch of TCP connections in a CLOSE_WAIT state
> > and these remain in that state indefinitely. but what is even more
> > concerning, i get many sockets that don't show up in netstat at all
> > and only lsof can show me that they exist. on Linux lsof shows them
> > as "can't identify protocol". according to
> > https://idea.popcount.org/2012-12-09-lsof-cant-identify-protocol/
> > these sockets are in a "half closed state" and the linux kernel has no
> > idea what to do with them.
> >
> > i'm pretty sure there's a problem with misbehaving clients, but the
> > server should not leak resources because of a client-side bug.
> >
> > my only recourse is to run a cronjob that looks at the lsof output and
> > restarts the server whenever the socket count gets dangerously close
> > to "too many open files" (8192 in my case)
> >
> > any ideas?
> >
> > --
> > jules cisek | [email protected]
>
> --
> jules cisek | [email protected]

--
jules cisek | [email protected]
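(for context, a minimal sketch of how a TThreadedSelectorServer like the one
described above is typically wired up; "Lookup", LookupHandler, the port and
the thread counts are placeholders, not the actual configuration from this
thread:)

    import org.apache.thrift.TProcessor;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.server.TServer;
    import org.apache.thrift.server.TThreadedSelectorServer;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TNonblockingServerSocket;
    import org.apache.thrift.transport.TNonblockingServerTransport;
    import org.apache.thrift.transport.TTransportException;

    public class SelectorServerMain {
        public static void main(String[] argv) throws TTransportException {
            // "Lookup" / LookupHandler are placeholder names for the
            // thrift-generated service and its handler implementation
            TProcessor processor = new Lookup.Processor<Lookup.Iface>(new LookupHandler());

            TNonblockingServerTransport transport = new TNonblockingServerSocket(9090);
            TThreadedSelectorServer.Args serverArgs = new TThreadedSelectorServer.Args(transport)
                    .processor(processor)
                    .transportFactory(new TFramedTransport.Factory())  // nonblocking servers use framed transport
                    .protocolFactory(new TBinaryProtocol.Factory())
                    .selectorThreads(2)
                    .workerThreads(16);

            // socket reads/writes (where "Got an IOException in internalRead!"
            // is logged and the offending channel is eventually closed) run in
            // the selector threads, not in handler code
            TServer server = new TThreadedSelectorServer(serverArgs);
            server.serve();
        }
    }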
