Yes, hadoop 0.20.2 and hbase 0.20.5.

I will get the branch you suggest, and give it a whirl.  I am leaving on
vacation Thursday, so I may not have any results to report till I get
back.

When I do get back, I will catch up with versions/fixes and try some
more.

Meanwhile, thanks to all who have responded to my posts.

thomas downing

On 7/20/2010 1:06 PM, Stack wrote:
Hey Thomas:

You are using hadoop 0.20.2 or something?  And hbase 0.20.5 or so?

You might try
http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/.
In particular, it has HDFS-1118 "Fix socketleak on DFSClient".

St.Ack

On Tue, Jul 20, 2010 at 1:58 AM, Thomas Downing
<tdown...@proteus-technologies.com> wrote:
Yes, I did try the timeout of 0.  As expected, I did not see sockets
in FIN_WAIT2 or TIME_WAIT for very long.

I still leak sockets at the ingest rates I need - the FIN_WAIT1
problem.  Also, with the more careful observations this time around,
I noted that even before the FIN_WAIT1 problem starts to crop
up (at around 1600M inserts) there is already a slower socket
leakage with timeout=0 and no FIN_WAIT1 problem.  At 100M inserts,
sockets were hovering around 50-60; by 800M they were around
200, and at 1600M they were at 400.  This is slower than without
the timeout set to 0 (about half the rate), but it is still ultimately
fatal.
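
For anyone who wants to track per-state socket counts like the ones above, here is a minimal sketch (not part of the original measurements). It assumes a Linux host and the usual /proc/net/tcp layout, where the fourth whitespace-separated column is the hex TCP state code. The function names are hypothetical.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state, given the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts

def sample_tcp_states(path="/proc/net/tcp"):
    """Read the live table (IPv4 only) and return the per-state counts."""
    with open(path) as f:
        return count_tcp_states(f.read())
```

Calling sample_tcp_states() at intervals during an ingest run and logging the FIN_WAIT1, FIN_WAIT2 and TIME_WAIT entries would give a growth curve comparable to the figures quoted above. IPv6 sockets live in /proc/net/tcp6 and are not covered by this sketch.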

This socket increase is all between hbase and hadoop, none
between test client and hbase.

While the FIN_WAIT1 problem is triggered by an hbase side
issue, I have no indication of which side causes this other leak.

thanks

thomas downing

On 7/19/2010 4:31 PM, Ryan Rawson wrote:
Did you try the setting I suggested?  There is/was a known bug in HDFS
which can cause issues which may include "abandoned" sockets such as
you are describing.

-ryan

On Mon, Jul 19, 2010 at 2:13 AM, Thomas Downing
<tdown...@proteus-technologies.com> wrote:

Thanks for the response, but my problem is not with FIN_WAIT2, it
is with FIN_WAIT1.

If it were FIN_WAIT2, the only concern would be socket leakage,
and if setting the timeout solved the issue, that would be great.

The problem with FIN_WAIT1 is twofold.  First, it is incumbent on
the application to notice and handle this problem; from the TCP stack
point of view, there is nothing wrong.  It is just a special case of slow
consumer.  Second, it implies that something will be lost if the socket
is abandoned: there is data in the send queue of the socket in
FIN_WAIT1 that has not yet been delivered to the peer.
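
One way to see the pending data is to look at the send-queue column for FIN_WAIT1 sockets. The following is a minimal sketch, not something used in this thread. It assumes Linux, IPv4, a little-endian host (e.g. x86) for the address decoding, and the usual /proc/net/tcp layout in which the fifth column is tx_queue:rx_queue in hex. The function names are hypothetical.

```python
import socket
import struct

FIN_WAIT1 = "04"  # hex state code for FIN_WAIT1 in /proc/net/tcp

def decode_endpoint(hex_endpoint):
    """Turn 'HEXADDR:HEXPORT' into 'a.b.c.d:port' (little-endian host assumed)."""
    addr_hex, port_hex = hex_endpoint.split(":")
    addr = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    return "%s:%d" % (addr, int(port_hex, 16))

def fin_wait1_with_pending_data(proc_net_tcp_text):
    """List (local, remote, tx_queue_bytes) for FIN_WAIT1 sockets whose
    send queue still holds data the peer has not acknowledged."""
    pending = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5 or fields[3].upper() != FIN_WAIT1:
            continue
        tx_hex, _rx_hex = fields[4].split(":")
        tx_queue = int(tx_hex, 16)
        if tx_queue > 0:
            pending.append(
                (decode_endpoint(fields[1]), decode_endpoint(fields[2]), tx_queue)
            )
    return pending
```

Feeding it the contents of /proc/net/tcp on the affected node would show which connections are stuck with a non-empty send queue and how many bytes each still holds.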

On 7/16/2010 3:56 PM, Ryan Rawson wrote:

I've been running with this setting on both the HDFS side and the
HBase side for over a year now, it's a bit of voodoo but you might be
running into well known suckage of HDFS.  Try this one and restart
your hbase & hdfs.

The FIN_WAIT2/TIME_WAIT happens more on large concurrent gets, not so
much for inserts.

<property>
<name>dfs.datanode.socket.write.timeout</name>
<value>0</value>
</property>
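
Since the setting has to be present on both the HDFS side and the HBase side, a quick check of each config file can help. This is a minimal sketch only. It assumes the standard Hadoop-style layout of <property> elements with <name> and <value> children. The helper name is hypothetical, and which files to check (typically hdfs-site.xml and hbase-site.xml) depends on the installation.

```python
import xml.etree.ElementTree as ET

def get_property(config_xml_text, name):
    """Return the value of a named property from Hadoop-style config XML,
    or None if the property is not present."""
    root = ET.fromstring(config_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None

def write_timeout_disabled(config_xml_text):
    """True if dfs.datanode.socket.write.timeout is explicitly set to 0."""
    return get_property(config_xml_text, "dfs.datanode.socket.write.timeout") == "0"
```

Reading each site file and passing its text to write_timeout_disabled() confirms the property is set in both places before restarting.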

-ryan


On Fri, Jul 16, 2010 at 9:33 AM, Thomas Downing
<tdown...@proteus-technologies.com> wrote:


Thanks for the response.

My understanding is that TCP_FIN_TIMEOUT affects only FIN_WAIT2;
my problem is with FIN_WAIT1.

While I do see some sockets in TIME_WAIT, they are only a few, and the
number is not growing.

On 7/16/2010 12:07 PM, Hegner, Travis wrote:


Hi Thomas,

I ran into a very similar issue when running slony-I on postgresql to
replicate 15-20 databases.

Adjusting the TCP_FIN_TIMEOUT parameters for the kernel may help to slow
(or hopefully stop) the leaking sockets. I found some notes about adjusting
TCP parameters here:
http://www.hikaro.com/linux/tweaking-tcpip-syctl-conf.html



[snip]

