I've upgraded to the HotSpot 64-bit Server VM, with HBase 0.90.4 and all the recommended config changes (100 region server handlers, MSLAB enabled, etc.). No change; if anything it dies faster. The count of sockets in CLOSE_WAIT on 50010 still increases linearly.

I periodically logged netstat output from a random node in the cluster, then dumped it into Excel and used a pivot table to look at the TCP behavior. The number of connections from the given node to other nodes on 50010 was relatively uniform (no hotspot). Connections on 50010 from the given node to *itself* were much higher than to other nodes, but that's probably a good thing; my guess is it's HBase leveraging locality of files for the region server. Just a guess.
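For anyone who wants to repeat this, something like the following does the capture and the per-peer count (a sketch only; the 60s interval, log path, and netstat column positions assume Linux netstat and IPv4 addresses, and aren't the exact commands I used):

#!/bin/sh
# One-off pivot-table substitute: count CLOSE_WAIT sockets per remote
# peer on the datanode port (on Linux netstat, $5 is the foreign
# address and $6 is the state).
netstat -an \
    | awk '$6 == "CLOSE_WAIT" && $5 ~ /:50010$/ { split($5, a, ":"); print a[1] }' \
    | sort | uniq -c | sort -rn

# Periodic capture: tag every connection involving 50010 with a
# timestamp so the log can be pivoted later (runs until interrupted).
while true; do
    ts=$(date +%Y-%m-%dT%H:%M:%S)
    netstat -an | awk -v ts="$ts" '/:50010/ { print ts, $0 }' >> /var/tmp/netstat-50010.log
    sleep 60
done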
The next step will be to test J-D Cryans' suggestion: "In order to completely rule out at least one thing, can you set ipc.server.max.queue.size to 1 and hbase.regionserver.handler.count to a low number (let's say 10)? If payload is putting too much memory pressure, we'll know." ...though I'm not sure what I'm supposed to observe with these settings, but I'll try it and report on the outcome (a sketch of the exact overrides appears at the end of this message).

-geoff

-----Original Message-----
From: Geoff Hendrey [mailto:[email protected]]
Sent: Tuesday, September 13, 2011 4:50 PM
To: [email protected]; Andrew Purtell
Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
Subject: RE: scanner deadlock?

1019 sockets on 50010 in CLOSE_WAIT state.

-geoff

-----Original Message-----
From: Andrew Purtell [mailto:[email protected]]
Sent: Tuesday, September 13, 2011 4:00 PM
To: [email protected]
Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
Subject: Re: scanner deadlock?

> My current working theory is that too many sockets are in CLOSE_WAIT
> state (leading to ClosedChannelException?). We're going to try to
> adjust some OS parameters.

How many sockets are in that state?

netstat -an | grep CLOSE_WAIT | wc -l

CDH3U1 contains HDFS-1836:
https://issues.apache.org/jira/browse/HDFS-1836

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

>________________________________
>From: Geoff Hendrey <[email protected]>
>To: [email protected]
>Cc: Tony Wang <[email protected]>; Rohit Nigam <[email protected]>; Parmod Mehta <[email protected]>; James Ladd <[email protected]>
>Sent: Tuesday, September 13, 2011 9:49 AM
>Subject: RE: scanner deadlock?
>
>Thanks Stack -
>
>Answers to all your questions below. My current working theory is that
>too many sockets are in CLOSE_WAIT state (leading to
>ClosedChannelException?). We're going to try to adjust some OS
>parameters.
>
>"I'm asking if regionservers are bottlenecking on a single network
>resource; a particular datanode, DNS?"
>
>Gotcha. I'm gathering some tools now to collect and analyze netstat
>output.
>
>"The regionserver is going slow getting data out of hdfs. What's
>iowait like at the time of slowness? Has it changed from when all was
>running nicely?"
>
>iowait is high (20% above CPU), but not increasing. I'll try to
>quantify that better.
>
>"You talk to hbase in the reducer? Reducers don't start writing hbase
>until the job is 66% complete IIRC. Perhaps it's slowing as soon as it
>starts writing hbase? Is that so?"
>
>My statement about "running fine" applies to after the reducer has
>completed its sort. We have metrics produced by the reducer that log
>the results of scans and Puts, so we know that scans and Puts proceed
>without issue for hours.
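For reference, J-D's experiment amounts to two overrides in hbase-site.xml on each region server, followed by a rolling restart. A sketch (the property names and values are from his message; the staging path and the merge/restart steps are assumptions about a typical install, and the settings should be reverted after the test):

#!/bin/sh
# Sketch: stage J-D's suggested overrides. The XML below must be merged
# inside the <configuration> element of hbase-site.xml on every region
# server (conf path varies by install), followed by a rolling restart.
cat > /tmp/jd-test-overrides.xml <<'EOF'
<property>
  <name>ipc.server.max.queue.size</name>
  <value>1</value>
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>10</value>
</property>
EOF

The point of the 1-deep call queue, presumably, is that it bounds how much request payload can sit queued in region server memory at once: if the behavior changes under these settings, payload memory pressure is implicated; if CLOSE_WAIT counts keep growing linearly, the leak is elsewhere.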
