On Tue, Jul 20, 2010 at 10:15 AM, Thomas Downing <[email protected]> wrote:
> Meanwhile, thanks to all who have responded to my posts.
>

Thanks for persisting with this, Thomas.
You might also take a look at Cloudera CDH3b2. It'll have the above
fixes and then some. I've not looked too closely at what the 'then some'
consists of recently -- and mighty Todd, our CDH-er, is holidaying
himself these times, else he'd tell you himself -- but it might be worth
checking it out.

Yours,
St.Ack

> thomas downing
>
> On 7/20/2010 1:06 PM, Stack wrote:
>>
>> Hey Thomas:
>>
>> You are using hadoop 0.20.2 or something? And hbase 0.20.5 or so?
>>
>> You might try
>> http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/.
>> In particular, it has HDFS-1118 "Fix socketleak on DFSClient".
>>
>> St.Ack
>>
>> On Tue, Jul 20, 2010 at 1:58 AM, Thomas Downing
>> <[email protected]> wrote:
>>
>>>
>>> Yes, I did try the timeout of 0. As expected, I did not see sockets
>>> in FIN_WAIT2 or TIME_WAIT for very long.
>>>
>>> I still leak sockets at the ingest rates I need - the FIN_WAIT1
>>> problem. Also, with the more careful observations this time around,
>>> I noted that even before the FIN_WAIT1 problem starts to crop
>>> up (at around 1600M inserts) there is already a slower socket
>>> leakage with timeout=0 and no FIN_WAIT1 problem. At 100M
>>> sockets were hovering around 50-60, by 800M they were around
>>> 200, and at 1600M they were at 400. This is slower than without
>>> the timeout set to 0 (about half the rate), but it is still ultimately
>>> fatal.
>>>
>>> This socket increase is all between hbase and hadoop, none
>>> between test client and hbase.
>>>
>>> While the FIN_WAIT1 problem is triggered by an hbase side
>>> issue, I have no indication of which side causes this other leak.
>>>
>>> thanks
>>>
>>> thomas downing
>>>
>>> On 7/19/2010 4:31 PM, Ryan Rawson wrote:
>>>
>>>>
>>>> Did you try the setting I suggested? There is/was a known bug in HDFS
>>>> which can cause issues which may include "abandoned" sockets such as
>>>> you are describing.
>>>>
>>>> -ryan
>>>>
>>>> On Mon, Jul 19, 2010 at 2:13 AM, Thomas Downing
>>>> <[email protected]> wrote:
>>>>
>>>>>
>>>>> Thanks for the response, but my problem is not with FIN_WAIT2, it
>>>>> is with FIN_WAIT1.
>>>>>
>>>>> If it was FIN_WAIT2, the only concern would be socket leakage,
>>>>> and if setting the time out solved the issue, that would be great.
>>>>>
>>>>> The problem with FIN_WAIT1 is twofold - first, it is incumbent on
>>>>> the application to notice and handle this problem; from the TCP stack
>>>>> point of view, there is nothing wrong. It is just a special case of
>>>>> slow consumer. The other problem is that it implies that something
>>>>> will be lost if the socket is abandoned, there is data in the send
>>>>> queue of the socket in FIN_WAIT1 that has not yet been delivered to
>>>>> the peer.
>>>>>
>>>>> On 7/16/2010 3:56 PM, Ryan Rawson wrote:
>>>>>
>>>>>>
>>>>>> I've been running with this setting on both the HDFS side and the
>>>>>> HBase side for over a year now, it's a bit of voodoo but you might be
>>>>>> running into well known suckage of HDFS. Try this one and restart
>>>>>> your hbase & hdfs.
>>>>>>
>>>>>> The FIN_WAIT2/TIME_WAIT happens more on large concurrent gets, not so
>>>>>> much for inserts.
>>>>>>
>>>>>> <property>
>>>>>> <name>dfs.datanode.socket.write.timeout</name>
>>>>>> <value>0</value>
>>>>>> </property>
>>>>>>
>>>>>> -ryan
>>>>>>
>>>>>> On Fri, Jul 16, 2010 at 9:33 AM, Thomas Downing
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thanks for the response.
>>>>>>>
>>>>>>> My understanding is that TCP_FIN_TIMEOUT affects only FIN_WAIT2,
>>>>>>> my problem is with FIN_WAIT1.
>>>>>>>
>>>>>>> While I do see some sockets in TIME_WAIT, they are only a few, and
>>>>>>> the number is not growing.
>>>>>>>
>>>>>>> On 7/16/2010 12:07 PM, Hegner, Travis wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hi Thomas,
>>>>>>>>
>>>>>>>> I ran into a very similar issue when running slony-I on postgresql
>>>>>>>> to replicate 15-20 databases.
>>>>>>>>
>>>>>>>> Adjusting the TCP_FIN_TIMEOUT parameters for the kernel may help to
>>>>>>>> slow (or hopefully stop) the leaking sockets. I found some notes
>>>>>>>> about adjusting TCP parameters here:
>>>>>>>> http://www.hikaro.com/linux/tweaking-tcpip-syctl-conf.html
>>>>>
>>>>> [snip]
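
For anyone wanting to track the per-state socket counts discussed in this thread (FIN_WAIT1 vs. FIN_WAIT2 vs. TIME_WAIT), here is a minimal sketch, assuming a Linux host where the kernel exposes its socket table at /proc/net/tcp. It is not the tooling anyone in the thread used; the names `TCP_STATES`, `count_tcp_states` and `read_proc` are made up for illustration. Sockets opened over IPv6 would appear in /proc/net/tcp6 instead, which may matter for a JVM-based process.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of Linux /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines):
    """Tally sockets per TCP state from lines in /proc/net/tcp format."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and anything malformed: a data row has a
        # hex "address:port" pair in its second column.
        if len(fields) < 4 or ":" not in fields[1]:
            continue
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


def read_proc(path="/proc/net/tcp"):
    """Hypothetical helper: tally the live socket table (Linux only)."""
    with open(path) as handle:
        return count_tcp_states(handle)
```

Calling `read_proc()` at intervals during an ingest run and logging the FIN_WAIT1 and ESTABLISHED counts would give the same kind of growth curve described above (sockets at 100M, 800M, 1600M inserts) without counting by hand.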
