I've upgraded to the HotSpot 64-bit Server VM, with HBase 0.90.4 and all 
recommended config changes (100 region server handlers, MSLAB enabled, etc.). 
No change; if anything it dies faster. The count of sockets in CLOSE_WAIT on 
port 50010 increases linearly. I periodically logged netstat output from a 
random node in the cluster, then dumped it into Excel and used a pivot table 
to look at the TCP behavior. The number of connections from the given node to 
others on 50010 was relatively uniform (no hotspot). The number of connections 
on 50010 from the given node to *itself* was much higher than to other nodes, 
but that's probably a good thing. My guess is that it's HBase leveraging the 
locality of the region server's files. Just a guess.
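
For anyone who wants to repeat the analysis without Excel, here's a minimal 
sketch of a loop that does the same per-peer tally in awk instead of a pivot 
table (assumes Linux netstat/awk output with IPv4 addresses; the log path and 
sampling interval are arbitrary):

    # Every 30s, count CLOSE_WAIT sockets per remote peer on the
    # datanode port (50010) and append the tallies to a log.
    while true; do
      date >> /tmp/closewait-50010.log
      netstat -an | awk '$6 == "CLOSE_WAIT" && $5 ~ /:50010$/ {
          split($5, a, ":"); count[a[1]]++
        } END { for (peer in count) print peer, count[peer] }' \
        >> /tmp/closewait-50010.log
      sleep 30
    done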

The next step will be to test JD Cryans's suggestion:
" In order to completely rule out at least one thing, can you set 
ipc.server.max.queue.size to 1 and hbase.regionserver.handler.count to a low 
number (let's say 10)? If payload is putting too much memory pressure, we'll 
know."

...though I'm not sure what I'm supposed to observe with these settings, but 
I'll try it and report on the outcome.
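
For reference, assuming those settings go in hbase-site.xml on each 
regionserver (property names exactly as J-D gave them), the change would look 
something like this sketch:

    <!-- test settings per J-D's suggestion; assumes a regionserver
         restart is needed for them to take effect -->
    <property>
      <name>ipc.server.max.queue.size</name>
      <value>1</value>
    </property>
    <property>
      <name>hbase.regionserver.handler.count</name>
      <value>10</value>
    </property>

Presumably the idea is that a queue of 1 plus only 10 handlers caps how many 
request payloads the regionserver can hold in memory at once, so if payload 
memory pressure is the culprit, the behavior should change noticeably.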

-geoff

-----Original Message-----
From: Geoff Hendrey [mailto:[email protected]] 
Sent: Tuesday, September 13, 2011 4:50 PM
To: [email protected]; Andrew Purtell
Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
Subject: RE: scanner deadlock?

1019 sockets on 50010 in CLOSE_WAIT state.

-geoff

-----Original Message-----
From: Andrew Purtell [mailto:[email protected]] 
Sent: Tuesday, September 13, 2011 4:00 PM
To: [email protected]
Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
Subject: Re: scanner deadlock?



> My current working theory is that
> too many sockets are in CLOSE_WAIT state (leading to
> ClosedChannelException?). We're going to try to adjust some OS
> parameters.

How many sockets are in that state? netstat -an | grep CLOSE_WAIT | wc -l

CDH3U1 contains HDFS-1836... https://issues.apache.org/jira/browse/HDFS-1836

Best regards,

       - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)


>________________________________
>From: Geoff Hendrey <[email protected]>
>To: [email protected]
>Cc: Tony Wang <[email protected]>; Rohit Nigam <[email protected]>; Parmod 
>Mehta <[email protected]>; James Ladd <[email protected]>
>Sent: Tuesday, September 13, 2011 9:49 AM
>Subject: RE: scanner deadlock?
>
>Thanks Stack - 
>
>Answers to all your questions below. My current working theory is that
>too many sockets are in CLOSE_WAIT state (leading to
>ClosedChannelException?). We're going to try to adjust some OS
>parameters.
>
>" I'm asking if regionservers are bottlenecking on a single network
>resource; a particular datanode, dns?"
>
>Gotcha. I'm gathering some tools now to collect and analyze netstat
>output.
>
>" the regionserver is going slow getting data out of
>hdfs.  Whats iowait like at the time of slowness?  Has it changed from
>when all was running nicely?"
>
>iowait is high (about 20% above CPU), but it's not increasing. I'll try to
>quantify that better.
>
>" You talk to hbase in the reducer?   Reducers don't start writing hbase
>until job is 66% complete IIRC.    Perhaps its slowing as soon as it
>starts writing hbase?  Is that so?"
>
>My statement about "running fine" applies to after the reducer has
>completed its sort. We have metrics produced by the reducer that log the
>results of scans and Puts, so we know that scans and Puts proceed
>without issue for hours.
> 
