Hey Thomas: You are using hadoop 0.20.2 or something? And hbase 0.20.5 or so?
You might try http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/. In particular, it has HDFS-1118 "Fix socketleak on DFSClient".

St.Ack

On Tue, Jul 20, 2010 at 1:58 AM, Thomas Downing <[email protected]> wrote:
> Yes, I did try the timeout of 0. As expected, I did not see sockets
> in FIN_WAIT2 or TIME_WAIT for very long.
>
> I still leak sockets at the ingest rates I need - the FIN_WAIT1
> problem. Also, with the more careful observations this time around,
> I noted that even before the FIN_WAIT1 problem starts to crop
> up (at around 1600M inserts) there is already a slower socket
> leakage with timeout=0 and no FIN_WAIT1 problem. At 100M
> sockets were hovering around 50-60, by 800M they were around
> 200, and at 1600M they were at 400. This is slower than without
> the timeout set to 0 (about half the rate), but it is still ultimately
> fatal.
>
> This socket increase is all between hbase and hadoop, none
> between test client and hbase.
>
> While the FIN_WAIT1 problem is triggered by an hbase side
> issue, I have no indication of which side causes this other leak.
>
> thanks
>
> thomas downing
>
> On 7/19/2010 4:31 PM, Ryan Rawson wrote:
>>
>> Did you try the setting I suggested? There is/was a known bug in HDFS
>> which can cause issues which may include "abandoned" sockets such as
>> you are describing.
>>
>> -ryan
>>
>> On Mon, Jul 19, 2010 at 2:13 AM, Thomas Downing
>> <[email protected]> wrote:
>>
>>> Thanks for the response, but my problem is not with FIN_WAIT2, it
>>> is with FIN_WAIT1.
>>>
>>> If it was FIN_WAIT2, the only concern would be socket leakage,
>>> and if setting the time out solved the issue, that would be great.
>>>
>>> The problem with FIN_WAIT1 is twofold - first, it is incumbent on
>>> the application to notice and handle this problem; from the TCP stack
>>> point of view, there is nothing wrong. It is just a special case of slow
>>> consumer.
>>>
>>> The other problem is that it implies that something will be
>>> lost if the socket is abandoned, there is data in the send queue of the
>>> socket in FIN_WAIT1 that has not yet been delivered to the peer.
>>>
>>> On 7/16/2010 3:56 PM, Ryan Rawson wrote:
>>>
>>>> I've been running with this setting on both the HDFS side and the
>>>> HBase side for over a year now, it's a bit of voodoo but you might be
>>>> running into well known suckage of HDFS. Try this one and restart
>>>> your hbase & hdfs.
>>>>
>>>> The FIN_WAIT2/TIME_WAIT happens more on large concurrent gets, not so
>>>> much for inserts.
>>>>
>>>> <property>
>>>> <name>dfs.datanode.socket.write.timeout</name>
>>>> <value>0</value>
>>>> </property>
>>>>
>>>> -ryan
>>>>
>>>> On Fri, Jul 16, 2010 at 9:33 AM, Thomas Downing
>>>> <[email protected]> wrote:
>>>>
>>>>> Thanks for the response.
>>>>>
>>>>> My understanding is that TCP_FIN_TIMEOUT affects only FIN_WAIT2,
>>>>> my problem is with FIN_WAIT1.
>>>>>
>>>>> While I do see some sockets in TIME_WAIT, they are only a few, and the
>>>>> number is not growing.
>>>>>
>>>>> On 7/16/2010 12:07 PM, Hegner, Travis wrote:
>>>>>
>>>>>> Hi Thomas,
>>>>>>
>>>>>> I ran into a very similar issue when running slony-I on postgresql to
>>>>>> replicate 15-20 databases.
>>>>>>
>>>>>> Adjusting the TCP_FIN_TIMEOUT parameters for the kernel may help to
>>>>>> slow
>>>>>> (or hopefully stop), the leaking sockets. I found some notes about
>>>>>> adjusting
>>>>>> TCP parameters here:
>>>>>> http://www.hikaro.com/linux/tweaking-tcpip-syctl-conf.html
>>>>>>
>>>
>>> [snip]
>>>
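To track the per-state socket counts discussed in this thread (FIN_WAIT1, FIN_WAIT2, TIME_WAIT) over the course of an ingest run, a small script along these lines may help. This is only a sketch and not something anyone in the thread posted: the names `count_tcp_states` and `SAMPLE` are made up for illustration, and it assumes the Linux `/proc/net/tcp` text layout, where the fourth column is the hex state code and the fifth is `tx_queue:rx_queue`. It also reports how many FIN_WAIT1 sockets still have bytes in the send queue, since that is the case where abandoning the socket would lose data.

```python
from collections import Counter

# Kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Return (Counter of sockets per state, FIN_WAIT1 sockets with unsent data)."""
    counts = Counter()
    fin_wait1_with_unsent = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        state = TCP_STATES.get(fields[3].upper(), fields[3])
        counts[state] += 1
        # Fifth column is "tx_queue:rx_queue", both in hex.
        tx_queue = int(fields[4].split(":")[0], 16)
        if state == "FIN_WAIT1" and tx_queue > 0:
            fin_wait1_with_unsent += 1
    return counts, fin_wait1_with_unsent


# Hypothetical sample in /proc/net/tcp layout (addresses and inodes are invented).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:C396 0100007F:C35A 01 00000000:00000000 00:00000000 00000000  1000        0 11111
   1: 0100007F:C397 0100007F:C35A 04 00000400:00000000 01:00000014 00000000     0        0 0
   2: 0100007F:C398 0100007F:C35A 04 00000000:00000000 01:00000014 00000000     0        0 0
   3: 0100007F:C399 0100007F:C35A 06 00000000:00000000 03:00000BB8 00000000     0        0 0
"""
```

On a Linux box, passing the contents of `/proc/net/tcp` (and `/proc/net/tcp6`, which uses the same columns and is where JVM sockets often show up) to `count_tcp_states` at intervals during the load would give the same kind of numbers quoted above, broken out by state.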
