Hey Thomas:

You are using hadoop 0.20.2 or something, and hbase 0.20.5 or so?

You might try
http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/.
In particular, it has HDFS-1118 "Fix socket leak on DFSClient".

St.Ack

On Tue, Jul 20, 2010 at 1:58 AM, Thomas Downing
<[email protected]> wrote:
> Yes, I did try the timeout of 0.  As expected, I did not see sockets
> in FIN_WAIT2 or TIME_WAIT for very long.
>
> I still leak sockets at the ingest rates I need - the FIN_WAIT1
> problem.  Also, with the more careful observations this time around,
> I noted that even before the FIN_WAIT1 problem starts to crop up
> (at around 1600M inserts), there is already a slower socket leak
> with timeout=0, even though no sockets are stuck in FIN_WAIT1.  At
> 100M inserts, sockets were hovering around 50-60; by 800M they were
> around 200, and at 1600M they were at 400.  That is about half the
> leak rate seen without the timeout set to 0, but it is still
> ultimately fatal.
>
> This socket increase is all between hbase and hadoop, none
> between test client and hbase.
>
> While the FIN_WAIT1 problem is triggered by an hbase side
> issue, I have no indication of which side causes this other leak.
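>
> (For reference: one quick way to watch these per-state socket counts
> on Linux is to tally the state column of /proc/net/tcp.  A rough
> Python sketch, IPv4 sockets only, state codes as in the kernel's
> tcp_states.h:)
>
>   from collections import Counter
>
>   STATES = {"01": "ESTABLISHED", "04": "FIN_WAIT1", "05": "FIN_WAIT2",
>             "06": "TIME_WAIT", "08": "CLOSE_WAIT", "0A": "LISTEN"}
>
>   counts = Counter()
>   with open("/proc/net/tcp") as f:
>       next(f)                        # skip the header line
>       for line in f:
>           code = line.split()[3]     # 4th column is the TCP state code
>           counts[STATES.get(code, code)] += 1
>
>   for state, n in counts.most_common():
>       print(state, n)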
>
> thanks
>
> thomas downing
>
> On 7/19/2010 4:31 PM, Ryan Rawson wrote:
>>
>> Did you try the setting I suggested?  There is/was a known bug in HDFS
>> that can cause issues, including "abandoned" sockets such as the ones
>> you are describing.
>>
>> -ryan
>>
>> On Mon, Jul 19, 2010 at 2:13 AM, Thomas Downing
>> <[email protected]>  wrote:
>>
>>>
>>> Thanks for the response, but my problem is not with FIN_WAIT2, it
>>> is with FIN_WAIT1.
>>>
>>> If it were FIN_WAIT2, the only concern would be socket leakage,
>>> and if setting the timeout solved the issue, that would be great.
>>>
>>> The problem with FIN_WAIT1 is twofold.  First, it is incumbent on
>>> the application to notice and handle it; from the TCP stack's point
>>> of view, there is nothing wrong - it is just a special case of a slow
>>> consumer.  Second, it implies that something will be lost if the
>>> socket is abandoned: there is data in the send queue of a socket in
>>> FIN_WAIT1 that has not yet been delivered to the peer.
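>>>
>>> (On Linux you can see that undelivered data directly: the tx_queue
>>> field in /proc/net/tcp is the number of bytes still queued to send
>>> on each socket.  A rough Python sketch - state code 04 is FIN_WAIT1,
>>> and the addresses are left hex-encoded as the kernel prints them:)
>>>
>>>   with open("/proc/net/tcp") as f:
>>>       next(f)                                 # skip header
>>>       for line in f:
>>>           cols = line.split()
>>>           if cols[3] != "04":                 # 04 == FIN_WAIT1
>>>               continue
>>>           unsent = int(cols[4].split(":")[0], 16)   # tx_queue, bytes
>>>           print(cols[1], "->", cols[2], "unsent bytes:", unsent)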
>>>
>>> On 7/16/2010 3:56 PM, Ryan Rawson wrote:
>>>
>>>>
>>>> I've been running with this setting on both the HDFS side and the
>>>> HBase side for over a year now; it's a bit of voodoo, but you might be
>>>> running into well-known suckage of HDFS.  Try this one and restart
>>>> your hbase & hdfs.
>>>>
>>>> The FIN_WAIT2/TIME_WAIT buildup happens more with large concurrent
>>>> gets, not so much with inserts.
>>>>
>>>> <property>
>>>>   <name>dfs.datanode.socket.write.timeout</name>
>>>>   <value>0</value>
>>>> </property>
>>>>
>>>> -ryan
>>>>
>>>>
>>>> On Fri, Jul 16, 2010 at 9:33 AM, Thomas Downing
>>>> <[email protected]>    wrote:
>>>>
>>>>
>>>>>
>>>>> Thanks for the response.
>>>>>
>>>>> My understanding is that TCP_FIN_TIMEOUT affects only FIN_WAIT2;
>>>>> my problem is with FIN_WAIT1.
>>>>>
>>>>> While I do see some sockets in TIME_WAIT, there are only a few, and
>>>>> the number is not growing.
>>>>>
>>>>> On 7/16/2010 12:07 PM, Hegner, Travis wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Hi Thomas,
>>>>>>
>>>>>> I ran into a very similar issue when running Slony-I on PostgreSQL to
>>>>>> replicate 15-20 databases.
>>>>>>
>>>>>> Adjusting the TCP_FIN_TIMEOUT parameter for the kernel may help to
>>>>>> slow (or hopefully stop) the leaking sockets. I found some notes
>>>>>> about adjusting TCP parameters here:
>>>>>> http://www.hikaro.com/linux/tweaking-tcpip-syctl-conf.html
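>>>>>>
>>>>>> (The relevant knob is net.ipv4.tcp_fin_timeout, i.e.
>>>>>> /proc/sys/net/ipv4/tcp_fin_timeout; note that it only bounds how
>>>>>> long an orphaned socket stays in FIN_WAIT2.  A minimal Python
>>>>>> sketch for checking, and optionally lowering, it - the value 30 is
>>>>>> just an example, and writing requires root:)
>>>>>>
>>>>>>   with open("/proc/sys/net/ipv4/tcp_fin_timeout") as f:
>>>>>>       print("tcp_fin_timeout:", f.read().strip(), "seconds")
>>>>>>   # to lower it (as root):
>>>>>>   # open("/proc/sys/net/ipv4/tcp_fin_timeout", "w").write("30\n")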
>>>>>>
>>>>>>
>>>>>>
>>>
>>> [snip]
>>>
>>>
>>
>
>
