On Tue, Sep 13, 2011 at 10:25 PM, Geoff Hendrey <[email protected]> wrote:
> I've upgraded to the HotSpot 64-bit Server VM, with HBase 0.90.4 and all
> recommended config changes (100 region server handlers, MSLAB enabled, etc.).
> No change; if anything it dies faster. The count of sockets in CLOSE_WAIT on
> 50010 increases linearly. I logged netstat from a random node in the cluster,
> periodically, then dumped the output into Excel and used a pivot table to look
> at the behavior of TCP. The number of connections from the given node to others
> on 50010 was relatively uniform (no hotspot). Connections on 50010 from the
> given node to *self* were way higher than to other nodes, but that's probably a
> good thing. My guess is it's HBase leveraging locality of files for the
> region server. Just a guess.
>

Yes. You have good locality. So maybe you are not bound up on a single
network resource.
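If you want to keep an eye on that distribution without the Excel step, something
like the below gives a per-peer, per-state tally of connections to 50010 (an
untested sketch; it assumes Linux netstat output with the foreign address in
column 5 and the TCP state in column 6, and IPv4 addresses):

  netstat -tan | awk '$5 ~ /:50010$/ {split($5, a, ":"); print a[1], $6}' \
      | sort | uniq -c | sort -rn

Run it from cron or a while/sleep loop and you can diff the counts over time
instead of re-pivoting in Excel.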
So when you jstack and you see that the regionserver has its threads all
stuck in next -- are they? -- then we are likely going to the local
datanode. Anything in its logs when the regionserver slows down?

> The next step will be to test J-D Cryans' suggestion:
> "In order to completely rule out at least one thing, can you set
> ipc.server.max.queue.size to 1 and hbase.regionserver.handler.count to a low
> number (let's say 10)? If payload is putting too much memory pressure, we'll
> know."
>
> ...though I'm not sure what I'm supposed to observe with these settings... but
> I'll try it and report on the outcome.
>

Well, you have GC logging already. If you look at the output, do you see big
pauses? I think J-D was thinking that the regionservers would be using less
memory if you make the queues smaller. You could try that. Maybe when the
queues are big, it's taking a while to process them and the client times out.
(A rough hbase-site.xml sketch of those two overrides is at the bottom of this
mail.) What size are these rows?

St.Ack

> -geoff
>
> -----Original Message-----
> From: Geoff Hendrey [mailto:[email protected]]
> Sent: Tuesday, September 13, 2011 4:50 PM
> To: [email protected]; Andrew Purtell
> Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
> Subject: RE: scanner deadlock?
>
> 1019 sockets on 50010 in CLOSE_WAIT state.
>
> -geoff
>
> -----Original Message-----
> From: Andrew Purtell [mailto:[email protected]]
> Sent: Tuesday, September 13, 2011 4:00 PM
> To: [email protected]
> Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
> Subject: Re: scanner deadlock?
>
>
>> My current working theory is that
>> too many sockets are in CLOSE_WAIT state (leading to
>> ClosedChannelException?). We're going to try to adjust some OS
>> parameters.
>
> How many sockets are in that state? netstat -an | grep CLOSE_WAIT | wc -l
>
> CDH3U1 contains HDFS-1836... https://issues.apache.org/jira/browse/HDFS-1836
>
> Best regards,
>
> - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via
> Tom White)
>
>
>>________________________________
>>From: Geoff Hendrey <[email protected]>
>>To: [email protected]
>>Cc: Tony Wang <[email protected]>; Rohit Nigam <[email protected]>; Parmod
>>Mehta <[email protected]>; James Ladd <[email protected]>
>>Sent: Tuesday, September 13, 2011 9:49 AM
>>Subject: RE: scanner deadlock?
>>
>>Thanks Stack -
>>
>>Answers to all your questions below. My current working theory is that
>>too many sockets are in CLOSE_WAIT state (leading to
>>ClosedChannelException?). We're going to try to adjust some OS
>>parameters.
>>
>>"I'm asking if regionservers are bottlenecking on a single network
>>resource; a particular datanode, DNS?"
>>
>>Gotcha. I'm gathering some tools now to collect and analyze netstat
>>output.
>>
>>"The regionserver is going slow getting data out of
>>HDFS. What's iowait like at the time of slowness? Has it changed from
>>when all was running nicely?"
>>
>>iowait is high (20% above CPU), but not increasing. I'll try to quantify
>>that better.
>>
>>"You talk to HBase in the reducer? Reducers don't start writing to HBase
>>until the job is 66% complete, IIRC. Perhaps it's slowing as soon as it
>>starts writing to HBase? Is that so?"
>>
>>My statement about "running fine" applies to after the reducer has
>>completed its sort. We have metrics produced by the reducer that log the
>>results of scans and Puts, so we know that scans and Puts proceed
>>without issue for hours.
>>
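For reference, here is a rough sketch of the overrides J-D suggested, written
out as hbase-site.xml entries (my assumption about where they go; push the file
to the regionservers and restart them for the change to take effect):

  <property>
    <name>ipc.server.max.queue.size</name>
    <value>1</value>
  </property>
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>10</value>
  </property>

The idea, as above, is that smaller queues mean the regionservers hold less
request payload in memory at once, so if memory pressure is the culprit the
slowdown should ease, or at least change shape.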
