Re: High ingest rate and FIN_WAIT1 problems

Stack Tue, 20 Jul 2010 10:28:54 -0700

On Tue, Jul 20, 2010 at 10:15 AM, Thomas Downing
<[email protected]> wrote:
> Meanwhile, thanks to all who have responded to my posts.
>
Thanks for persisting with this Thomas.


You might also take a look at Cloudera CDH3b2.  It'll have the above
fixes and then some.  I've not looked too closely at what the 'then
some' consists of recently -- and mighty Todd, our CDH-er is
holidaying himself these times else he'd tell you himself -- but it
might be worth checking it out.

Yours,
St.Ack


> thomas downing
>
> On 7/20/2010 1:06 PM, Stack wrote:
>>
>> Hey Thomas:
>>
>> You are using hadoop 0.20.2 or something?  And hbase 0.20.5 or so?
>>
>> You might try
>> http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/.
>>  In particular, it has HDFS-1118 "Fix socketleak on DFSClient".
>>
>> St.Ack
>>
>> On Tue, Jul 20, 2010 at 1:58 AM, Thomas Downing
>> <[email protected]> wrote:
>>
>>>
>>> Yes, I did try the timeout of 0.  As expected, I did not see sockets
>>> in FIN_WAIT2 or TIME_WAIT for very long.
>>>
>>> I still leak sockets at the ingest rates I need - the FIN_WAIT1
>>> problem.  Also, with the more careful observations this time around,
>>> I noted that even before the FIN_WAIT1 problem starts to crop
>>> up (at around 1600M inserts) there is already a slower socket
>>> leakage with timeout=0 and no FIN_WAIT1 problem.  At 100M inserts,
>>> sockets were hovering around 50-60; by 800M they were around
>>> 200, and at 1600M they were at 400.  This is slower than without
>>> the timeout set to 0 (about half the rate), but it is still ultimately
>>> fatal.
>>>
>>> This socket increase is all between hbase and hadoop, none
>>> between test client and hbase.
>>>
>>> While the FIN_WAIT1 problem is triggered by an hbase side
>>> issue, I have no indication of which side causes this other leak.
>>>
>>> thanks
>>>
>>> thomas downing
>>>
>>> On 7/19/2010 4:31 PM, Ryan Rawson wrote:
>>>
>>>>
>>>> Did you try the setting I suggested?  There is/was a known bug in HDFS
>>>> which can cause issues which may include "abandoned" sockets such as
>>>> you are describing.
>>>>
>>>> -ryan
>>>>
>>>> On Mon, Jul 19, 2010 at 2:13 AM, Thomas Downing
>>>> <[email protected]> wrote:
>>>>
>>>>
>>>>>
>>>>> Thanks for the response, but my problem is not with FIN_WAIT2, it
>>>>> is with FIN_WAIT1.
>>>>>
>>>>> If it were FIN_WAIT2, the only concern would be socket leakage,
>>>>> and if setting the timeout solved the issue, that would be great.
>>>>>
>>>>> The problem with FIN_WAIT1 is twofold.  First, it is incumbent on
>>>>> the application to notice and handle this problem; from the TCP stack's
>>>>> point of view, there is nothing wrong.  It is just a special case of a
>>>>> slow consumer.  Second, it implies that something will be lost if the
>>>>> socket is abandoned: there is data in the send queue of the socket in
>>>>> FIN_WAIT1 that has not yet been delivered to the peer.
>>>>>
>>>>> On 7/16/2010 3:56 PM, Ryan Rawson wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> I've been running with this setting on both the HDFS side and the
>>>>>> HBase side for over a year now.  It's a bit of voodoo, but you might be
>>>>>> running into well-known suckage of HDFS.  Try this one and restart
>>>>>> your hbase & hdfs.
>>>>>>
>>>>>> The FIN_WAIT2/TIME_WAIT happens more on large concurrent gets, not so
>>>>>> much for inserts.
>>>>>>
>>>>>> <property>
>>>>>> <name>dfs.datanode.socket.write.timeout</name>
>>>>>> <value>0</value>
>>>>>> </property>
>>>>>>
>>>>>> -ryan
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 16, 2010 at 9:33 AM, Thomas Downing
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks for the response.
>>>>>>>
>>>>>>> My understanding is that TCP_FIN_TIMEOUT affects only FIN_WAIT2;
>>>>>>> my problem is with FIN_WAIT1.
>>>>>>>
>>>>>>> While I do see some sockets in TIME_WAIT, they are only a few, and
>>>>>>> the number is not growing.
>>>>>>>
>>>>>>> On 7/16/2010 12:07 PM, Hegner, Travis wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Hi Thomas,
>>>>>>>>
>>>>>>>> I ran into a very similar issue when running Slony-I on PostgreSQL
>>>>>>>> to replicate 15-20 databases.
>>>>>>>>
>>>>>>>> Adjusting the TCP_FIN_TIMEOUT parameters for the kernel may help to
>>>>>>>> slow (or hopefully stop) the leaking sockets. I found some notes
>>>>>>>> about adjusting TCP parameters here:
>>>>>>>> http://www.hikaro.com/linux/tweaking-tcpip-syctl-conf.html
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>> [snip]
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
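
The per-state socket counts discussed in this thread (FIN_WAIT1, FIN_WAIT2, TIME_WAIT) can be tracked over time with a small script. What follows is only a minimal sketch, assuming a Linux host where /proc/net/tcp is readable; the function name count_tcp_states is made up for illustration and is not part of HBase or Hadoop. It also sums the tx_queue column for FIN_WAIT1 sockets, which should roughly reflect bytes the peer has not yet acknowledged.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of Linux's /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Return (Counter of state names, tx_queue bytes summed over FIN_WAIT1 sockets).

    Hypothetical helper: takes the text of /proc/net/tcp as a string so it
    can be exercised without touching the real file.
    """
    counts = Counter()
    fin_wait1_queued = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore blank or truncated rows
        state = TCP_STATES.get(fields[3].upper(), "UNKNOWN")
        counts[state] += 1
        if state == "FIN_WAIT1":
            # Column 5 is "tx_queue:rx_queue", both in hex.
            fin_wait1_queued += int(fields[4].split(":")[0], 16)
    return counts, fin_wait1_queued
```

On a datanode or region server one might call count_tcp_states(open("/proc/net/tcp").read()) once a minute and log the result; IPv6 sockets live in /proc/net/tcp6, which this sketch does not read.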
