17 MR nodes, 8 reducers per machine = 136 concurrent reducers. (The machines are 12-core, and I've found 8 reducers with 1GB of allocated heap each to be a happy medium that doesn't freeze out the datanodes or the region servers - or so I think :-).

-geoff
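[A minimal sketch of how the per-node reducer cap and the 1GB per-task heap described above are typically expressed, assuming classic MRv1 (CDH3-era) property names; the values simply mirror the numbers above and are not taken from the actual cluster configuration.]

// Sketch only: MRv1 properties corresponding to "8 reducers per machine,
// 1GB heap each, 136 reducers total". In practice the first two live in
// mapred-site.xml on the tasktrackers; they are set programmatically here
// purely for illustration.
import org.apache.hadoop.conf.Configuration;

public class ReducerTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // At most 8 reduce tasks running concurrently on each tasktracker.
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 8);

        // ~1GB of heap for each map/reduce child JVM.
        conf.set("mapred.child.java.opts", "-Xmx1024m");

        // Job-level reducer count: 17 nodes x 8 slots = 136.
        conf.setInt("mapred.reduce.tasks", 136);

        System.out.println("reduce slots per node = "
                + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", -1));
    }
}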
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: Wednesday, September 14, 2011 8:39 AM
To: [email protected]
Subject: Re: scanner deadlock?

Also, how many nodes, and how many reducers?

St.Ack

On Wed, Sep 14, 2011 at 8:38 AM, Stack <[email protected]> wrote:
> Yeah. Ten handlers and no queue for the RPCs; it'll just reject calls
> when it's over its handler count.
>
> Also send a listing of your hbase rootdir: hadoop fs -lsr /hbase
>
> St.Ack
>
> On Tue, Sep 13, 2011 at 11:40 PM, Geoff Hendrey <[email protected]> wrote:
>> As expected, J-D's suggestion basically causes the system to collapse
>> almost immediately. All the region servers show stack traces such as
>> these:
>>
>> 2011-09-13 23:34:43,151 WARN [IPC Reader 0 on port 60020]
>> ipc.HBaseServer$Listener(526): IPC Server listener on 60020: readAndProcess
>> threw exception java.io.IOException: Connection reset by peer. Count of
>> bytes read: 0
>> java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(Unknown Source)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
>>     at sun.nio.ch.IOUtil.read(Unknown Source)
>>     at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>     at java.util.concurrent.ThreadPoolExe
>>
>> 2011-09-13 23:34:43,293 WARN [IPC Server handler 0 on 60020]
>> ipc.HBaseServer$Handler(1100): IPC Server handler 0 on 60020 caught:
>> java.nio.channels.ClosedChannelException
>>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
>>     at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>>
>> -----Original Message-----
>> From: Geoff Hendrey [mailto:[email protected]]
>> Sent: Tuesday, September 13, 2011 10:56 PM
>> To: [email protected]
>> Cc: Andrew Purtell; Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
>> Subject: RE: scanner deadlock?
>>
>> answers below
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>> Sent: Tuesday, September 13, 2011 10:41 PM
>> To: [email protected]
>> Cc: Andrew Purtell; Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
>> Subject: Re: scanner deadlock?
>>
>> On Tue, Sep 13, 2011 at 10:25 PM, Geoff Hendrey <[email protected]> wrote:
>>> I've upgraded to the HotSpot 64-bit Server VM, with HBase 0.90.4 and all
>>> recommended config changes (100 region server handlers, MSLAB enabled,
>>> etc.). No change; if anything it dies faster. The count of sockets in
>>> CLOSE_WAIT on 50010 increases linearly. I logged netstat output from a
>>> random node in the cluster periodically, then dumped it into Excel and
>>> used a pivot table to look at the TCP behavior. The number of connections
>>> from the given node to others on 50010 was relatively uniform (no hot
>>> spot). The number of connections on 50010 from the given node to *itself*
>>> was much higher than to other nodes, but that's probably a good thing.
>>> My guess is it's HBase leveraging locality of files for the region
>>> server. Just a guess.
>>>
>> Yes. You have good locality. So maybe you are not bound up on a
>> single network resource.
>>
>> So when you jstack and you see that the regionserver has its threads all
>> stuck in next -- are they? -- then we are likely going to the local
>> datanode.
>>
>> ANSWER: Yes, I believe they are stuck in next. I can see from logs on the
>> client that the call to next periodically takes very long to return. You
>> had earlier commented that my 5-second regionserver lease was low. I set
>> it low intentionally to get a fail-fast on the call to next. If I don't
>> set it low, it will just take whatever the timeout is to fail (60 seconds
>> by default).
>>
>> Anything in its logs when the regionserver slows down?
>>
>> ANSWER: Yes. I see ScannerTimeoutException, and unknown scanner, and then
>> ClosedChannelException. The stack trace shows the ClosedChannelException
>> occurs when the server tries to write the response to the scanner. This
>> seems like a bug to me. Once you close the channel you cannot write to
>> it, no? If you try to write to it after you close it, you will get
>> ClosedChannelException.
>>
>>> The next step will be to test with J-D Cryans's suggestion:
>>> "In order to completely rule out at least one thing, can you set
>>> ipc.server.max.queue.size to 1 and hbase.regionserver.handler.count to a
>>> low number (let's say 10)? If payload is putting too much memory
>>> pressure, we'll know."
>>>
>>> ...though I'm not sure what I'm supposed to observe with these
>>> settings... but I'll try it and report on the outcome.
>>>
>> Well, you have GC logging already. If you look at the output do you
>> see big pauses?
>>
>> ANSWER: Nope, no long pauses at all. I've periodically run a few greps
>> with a regex to try to find pauses of one second or longer, and I haven't
>> seen any of late. HOWEVER, one thing I don't understand at all is why
>> Ganglia reports HUGE GC pauses (like 1000 seconds!), when in reality I
>> can *never* find such a pause in the GC log. Is there any known issue
>> with Ganglia graphs being on the wrong vertical scale for GC pauses? I
>> know that sounds odd, but I just can't correlate the Ganglia graphs for
>> GC to reality.
>>
>> I think J-D was thinking that regionservers would be using less memory
>> if you make the queues smaller. You could try that. Maybe when queues
>> are big, it's taking a while to process them and the client times out.
>> What size are these rows?
>>
>> ANSWER: Rows are about 100KB-750KB. Trying J-D's suggestion now.
>>
>> St.Ack
>>
>>> -geoff
>>>
>>> -----Original Message-----
>>> From: Geoff Hendrey [mailto:[email protected]]
>>> Sent: Tuesday, September 13, 2011 4:50 PM
>>> To: [email protected]; Andrew Purtell
>>> Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
>>> Subject: RE: scanner deadlock?
>>>
>>> 1019 sockets on 50010 in CLOSE_WAIT state.
>>>
>>> -geoff
>>>
>>> -----Original Message-----
>>> From: Andrew Purtell [mailto:[email protected]]
>>> Sent: Tuesday, September 13, 2011 4:00 PM
>>> To: [email protected]
>>> Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
>>> Subject: Re: scanner deadlock?
>>>
>>>> My current working theory is that too many sockets are in CLOSE_WAIT
>>>> state (leading to ClosedChannelException?). We're going to try to
>>>> adjust some OS parameters.
>>>
>>> How many sockets are in that state? netstat -an | grep CLOSE_WAIT | wc -l
>>>
>>> CDH3U1 contains HDFS-1836... https://issues.apache.org/jira/browse/HDFS-1836
>>>
>>> Best regards,
>>>
>>>    - Andy
>>>
>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>> (via Tom White)
>>>
>>>> ________________________________
>>>> From: Geoff Hendrey <[email protected]>
>>>> To: [email protected]
>>>> Cc: Tony Wang <[email protected]>; Rohit Nigam <[email protected]>; Parmod
>>>> Mehta <[email protected]>; James Ladd <[email protected]>
>>>> Sent: Tuesday, September 13, 2011 9:49 AM
>>>> Subject: RE: scanner deadlock?
>>>>
>>>> Thanks Stack -
>>>>
>>>> Answers to all your questions below. My current working theory is that
>>>> too many sockets are in CLOSE_WAIT state (leading to
>>>> ClosedChannelException?). We're going to try to adjust some OS
>>>> parameters.
>>>>
>>>> "I'm asking if regionservers are bottlenecking on a single network
>>>> resource; a particular datanode, dns?"
>>>>
>>>> Gotcha. I'm gathering some tools now to collect and analyze netstat
>>>> output.
>>>>
>>>> "The regionserver is going slow getting data out of hdfs. What's iowait
>>>> like at the time of slowness? Has it changed from when all was running
>>>> nicely?"
>>>>
>>>> iowait is high (20% above cpu), but not increasing. I'll try to
>>>> quantify that better.
>>>>
>>>> "You talk to hbase in the reducer? Reducers don't start writing hbase
>>>> until the job is 66% complete IIRC. Perhaps it's slowing as soon as it
>>>> starts writing hbase? Is that so?"
>>>>
>>>> My statement about "running fine" applies to after the reducer has
>>>> completed its sort. We have metrics produced by the reducer that log
>>>> the results of scans and Puts, so we know that scans and Puts proceed
>>>> without issue for hours.
>>>>
>>>
>>
>
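[A note on the scanner-lease behavior discussed in the thread above: in 0.90 the regionserver scanner lease is renewed on each scanner RPC, so a client that fetches a cached batch and then spends longer than hbase.regionserver.lease.period processing it (or whose next() call itself stalls) can have its lease expire; the client then sees ScannerTimeoutException / unknown scanner, and the server can log ClosedChannelException when it tries to write a response to a connection the timed-out client has already dropped. Below is a minimal sketch of the 0.90-era client-side scan pattern involved; the table name, column family, and caching value are assumptions for illustration only.]

// Minimal sketch of the 0.90-era scan pattern discussed above. The table
// name, column family, and caching value are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");   // hypothetical table
        try {
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("cf"));       // hypothetical family
            // With 100KB-750KB rows, keep caching small: each next() RPC
            // ships roughly caching x rowSize bytes, and the scanner lease
            // (hbase.regionserver.lease.period; 60s default, 5s in the test
            // described above) must not expire between successive RPCs.
            scan.setCaching(10);
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result result : scanner) {
                    // Process the row; slow per-batch processing here is
                    // what risks an expired scanner lease.
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}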

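[On the narrower point raised in the thread, that writing to a channel after it has been closed yields ClosedChannelException: a tiny standalone demonstration, independent of HBase.]

// Standalone demonstration that writing to a closed SocketChannel throws
// ClosedChannelException, the same exception seen in the
// HBaseServer$Responder stack trace above.
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.SocketChannel;

public class ClosedChannelDemo {
    public static void main(String[] args) throws Exception {
        SocketChannel channel = SocketChannel.open(); // never connected
        channel.close();                              // closed before writing
        try {
            channel.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));
        } catch (ClosedChannelException e) {
            // A closed channel fails the write immediately, even though it
            // was never connected in the first place.
            System.out.println("write after close -> " + e);
        }
    }
}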