> My current working theory is that
> too many sockets are in CLOSE_WAIT state (leading to
> ClosedChannelException?). We're going to try to adjust some OS
> parameters.

How many sockets are in that state?

    netstat -an | grep CLOSE_WAIT | wc -l

CDH3U1 contains HDFS-1836...

https://issues.apache.org/jira/browse/HDFS-1836

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

>________________________________
>From: Geoff Hendrey <[email protected]>
>To: [email protected]
>Cc: Tony Wang <[email protected]>; Rohit Nigam <[email protected]>; Parmod Mehta <[email protected]>; James Ladd <[email protected]>
>Sent: Tuesday, September 13, 2011 9:49 AM
>Subject: RE: scanner deadlock?
>
>Thanks Stack -
>
>Answers to all your questions below. My current working theory is that
>too many sockets are in CLOSE_WAIT state (leading to
>ClosedChannelException?). We're going to try to adjust some OS
>parameters.
>
>"I'm asking if regionservers are bottlenecking on a single network
>resource; a particular datanode, dns?"
>
>Gotcha. I'm gathering some tools now to collect and analyze netstat
>output.
>
>"the regionserver is going slow getting data out of
>hdfs. What's iowait like at the time of slowness? Has it changed from
>when all was running nicely?"
>
>iowait is high (20% above cpu), but not increasing. I'll try to quantify
>that better.
>
>"You talk to hbase in the reducer? Reducers don't start writing hbase
>until job is 66% complete IIRC. Perhaps it's slowing as soon as it
>starts writing hbase? Is that so?"
>
>My statement about "running fine" applies to after the reducer has
>completed sort. We have metrics produced by the reducer that log the
>results of scans and Puts. So we know that scans and puts proceed
>without issue for hours.
>
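
Beyond the raw count, it may help to see whether the CLOSE_WAIT sockets pile up against a single peer (a particular datanode, say). The following is only a rough sketch: it assumes the Linux `netstat -an` column layout (foreign address in column 5, state in column 6), and the function name `close_wait_by_peer` is made up for illustration.

```shell
#!/bin/sh
# Summarize CLOSE_WAIT sockets by remote endpoint.
# Reads `netstat -an`-style lines on stdin. Assumes the Linux layout:
#   proto  recv-q  send-q  local-address  foreign-address  state
# close_wait_by_peer is a hypothetical helper name, not an existing tool.
close_wait_by_peer() {
    awk '$6 == "CLOSE_WAIT" { count[$5]++ }
         END { for (peer in count) print count[peer], peer }' |
    sort -k1,1nr -k2,2   # highest count first, ties ordered by peer
}

# Typical use on a regionserver host:
#   netstat -an | close_wait_by_peer | head
```

If one remote address dominates the output, that points at a single network resource rather than a host-wide socket problem. On other platforms the columns may differ, so check the field numbers first.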
