17 MR nodes, 8 reducers per machine = 136 concurrent reducers. (The machines are 12-core, and I've found 8 reducers with 1GB of allocated heap each to be a happy medium that doesn't freeze out the datanodes or the region servers - or so I think :-).

-geoff
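[A minimal sketch of how the per-node reducer cap and the 1GB per-task heap described above are typically expressed, assuming classic MRv1 (CDH3-era) property names; the values simply mirror the numbers above and are not taken from the actual cluster configuration.]

// Sketch only: MRv1 properties corresponding to "8 reducers per machine,
// 1GB heap each, 136 reducers total". In practice the first two live in
// mapred-site.xml on the tasktrackers; they are set programmatically here
// purely for illustration.
import org.apache.hadoop.conf.Configuration;

public class ReducerTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // At most 8 reduce tasks running concurrently on each tasktracker.
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 8);

        // ~1GB of heap for each map/reduce child JVM.
        conf.set("mapred.child.java.opts", "-Xmx1024m");

        // Job-level reducer count: 17 nodes x 8 slots = 136.
        conf.setInt("mapred.reduce.tasks", 136);

        System.out.println("reduce slots per node = "
                + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", -1));
    }
}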
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: Wednesday, September 14, 2011 8:39 AM
To: [email protected]
Subject: Re: scanner deadlock?

Also, how many nodes, and how many reducers?

St.Ack

On Wed, Sep 14, 2011 at 8:38 AM, Stack <[email protected]> wrote:
> Yeah. Ten handlers and no queue for the RPCs; it'll just reject calls
> when it's over its handler count.
>
> Also send a listing of your hbase rootdir: hadoop fs -lsr /hbase
>
> St.Ack
>
> On Tue, Sep 13, 2011 at 11:40 PM, Geoff Hendrey <[email protected]> wrote:
>> As expected, J-D's suggestion basically causes the system to collapse
>> almost immediately. All the region servers show stack traces such as
>> these:
>>
>> 2011-09-13 23:34:43,151 WARN [IPC Reader 0 on port 60020]
>> ipc.HBaseServer$Listener(526): IPC Server listener on 60020: readAndProcess
>> threw exception java.io.IOException: Connection reset by peer. Count of
>> bytes read: 0
>> java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(Unknown Source)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
>>     at sun.nio.ch.IOUtil.read(Unknown Source)
>>     at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>     at java.util.concurrent.ThreadPoolExe
>>
>> 2011-09-13 23:34:43,293 WARN [IPC Server handler 0 on 60020]
>> ipc.HBaseServer$Handler(1100): IPC Server handler 0 on 60020 caught:
>> java.nio.channels.ClosedChannelException
>>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
>>     at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>>
>> -----Original Message-----
>> From: Geoff Hendrey [mailto:[email protected]]
>> Sent: Tuesday, September 13, 2011 10:56 PM
>> To: [email protected]
>> Cc: Andrew Purtell; Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
>> Subject: RE: scanner deadlock?
>>
>> answers below
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>> Sent: Tuesday, September 13, 2011 10:41 PM
>> To: [email protected]
>> Cc: Andrew Purtell; Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
>> Subject: Re: scanner deadlock?
>>
>> On Tue, Sep 13, 2011 at 10:25 PM, Geoff Hendrey <[email protected]> wrote:
>>> I've upgraded to the HotSpot 64-bit Server VM, with HBase 0.90.4 and all
>>> recommended config changes (100 region server handlers, MSLAB enabled,
>>> etc.). No change; if anything it dies faster. The count of sockets in
>>> CLOSE_WAIT on 50010 increases linearly. I logged netstat output from a
>>> random node in the cluster periodically, then dumped it into Excel and
>>> used a pivot table to look at the TCP behavior. The number of connections
>>> from the given node to others on 50010 was relatively uniform (no hot
>>> spot). The number of connections on 50010 from the given node to *itself*
>>> was much higher than to other nodes, but that's probably a good thing.
>>> My guess is it's HBase leveraging locality of files for the region
>>> server. Just a guess.
>>>
>> Yes. You have good locality. So maybe you are not bound up on a
>> single network resource.
>>
>> So when you jstack and you see that the regionserver has its threads all
>> stuck in next -- are they? -- then we are likely going to the local
>> datanode.
>>
>> ANSWER: Yes, I believe they are stuck in next. I can see from logs on the
>> client that the call to next periodically takes very long to return. You
>> had earlier commented that my 5-second regionserver lease was low. I set
>> it low intentionally to get a fail-fast on the call to next. If I don't
>> set it low, it will just take whatever the timeout is to fail (60 seconds
>> by default).
>>
>> Anything in its logs when the regionserver slows down?
>>
>> ANSWER: Yes. I see ScannerTimeoutException, and unknown scanner, and then
>> ClosedChannelException. The stack trace shows the ClosedChannelException
>> occurs when the server tries to write the response to the scanner. This
>> seems like a bug to me. Once you close the channel you cannot write to
>> it, no? If you try to write to it after you close it, you will get
>> ClosedChannelException.
>>
>>> The next step will be to test with J-D Cryans's suggestion:
>>> "In order to completely rule out at least one thing, can you set
>>> ipc.server.max.queue.size to 1 and hbase.regionserver.handler.count to a
>>> low number (let's say 10)? If payload is putting too much memory
>>> pressure, we'll know."
>>>
>>> ...though I'm not sure what I'm supposed to observe with these
>>> settings... but I'll try it and report on the outcome.
>>>
>> Well, you have GC logging already. If you look at the output do you
>> see big pauses?
>>
>> ANSWER: Nope, no long pauses at all. I've periodically run a few greps
>> with a regex to try to find pauses of one second or longer, and I haven't
>> seen any of late. HOWEVER, one thing I don't understand at all is why
>> Ganglia reports HUGE GC pauses (like 1000 seconds!), when in reality I
>> can *never* find such a pause in the GC log. Is there any known issue
>> with Ganglia graphs being on the wrong vertical scale for GC pauses? I
>> know that sounds odd, but I just can't correlate the Ganglia graphs for
>> GC to reality.
>>
>> I think J-D was thinking that regionservers would be using less memory
>> if you make the queues smaller. You could try that. Maybe when queues
>> are big, it's taking a while to process them and the client times out.
>> What size are these rows?
>>
>> ANSWER: Rows are about 100KB-750KB. Trying J-D's suggestion now.
>>
>> St.Ack
>>
>>> -geoff
>>>
>>> -----Original Message-----
>>> From: Geoff Hendrey [mailto:[email protected]]
>>> Sent: Tuesday, September 13, 2011 4:50 PM
>>> To: [email protected]; Andrew Purtell
>>> Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
>>> Subject: RE: scanner deadlock?
>>>
>>> 1019 sockets on 50010 in CLOSE_WAIT state.
>>>
>>> -geoff
>>>
>>> -----Original Message-----
>>> From: Andrew Purtell [mailto:[email protected]]
>>> Sent: Tuesday, September 13, 2011 4:00 PM
>>> To: [email protected]
>>> Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
>>> Subject: Re: scanner deadlock?
>>>
>>>> My current working theory is that too many sockets are in CLOSE_WAIT
>>>> state (leading to ClosedChannelException?). We're going to try to
>>>> adjust some OS parameters.
>>>
>>> How many sockets are in that state? netstat -an | grep CLOSE_WAIT | wc -l
>>>
>>> CDH3U1 contains HDFS-1836... https://issues.apache.org/jira/browse/HDFS-1836
>>>
>>> Best regards,
>>>
>>>    - Andy
>>>
>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>> (via Tom White)
>>>
>>>> ________________________________
>>>> From: Geoff Hendrey <[email protected]>
>>>> To: [email protected]
>>>> Cc: Tony Wang <[email protected]>; Rohit Nigam <[email protected]>; Parmod
>>>> Mehta <[email protected]>; James Ladd <[email protected]>
>>>> Sent: Tuesday, September 13, 2011 9:49 AM
>>>> Subject: RE: scanner deadlock?
>>>>
>>>> Thanks Stack -
>>>>
>>>> Answers to all your questions below. My current working theory is that
>>>> too many sockets are in CLOSE_WAIT state (leading to
>>>> ClosedChannelException?). We're going to try to adjust some OS
>>>> parameters.
>>>>
>>>> "I'm asking if regionservers are bottlenecking on a single network
>>>> resource; a particular datanode, dns?"
>>>>
>>>> Gotcha. I'm gathering some tools now to collect and analyze netstat
>>>> output.
>>>>
>>>> "The regionserver is going slow getting data out of hdfs. What's iowait
>>>> like at the time of slowness? Has it changed from when all was running
>>>> nicely?"
>>>>
>>>> iowait is high (20% above cpu), but not increasing. I'll try to
>>>> quantify that better.
>>>>
>>>> "You talk to hbase in the reducer? Reducers don't start writing hbase
>>>> until the job is 66% complete IIRC. Perhaps it's slowing as soon as it
>>>> starts writing hbase? Is that so?"
>>>>
>>>> My statement about "running fine" applies to after the reducer has
>>>> completed its sort. We have metrics produced by the reducer that log
>>>> the results of scans and Puts, so we know that scans and Puts proceed
>>>> without issue for hours.
>>>>
>>>
>>
>
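[A note on the scanner-lease behavior discussed in the thread above: in 0.90 the regionserver scanner lease is renewed on each scanner RPC, so a client that fetches a cached batch and then spends longer than hbase.regionserver.lease.period processing it (or whose next() call itself stalls) can have its lease expire; the client then sees ScannerTimeoutException / unknown scanner, and the server can log ClosedChannelException when it tries to write a response to a connection the timed-out client has already dropped. Below is a minimal sketch of the 0.90-era client-side scan pattern involved; the table name, column family, and caching value are assumptions for illustration only.]

// Minimal sketch of the 0.90-era scan pattern discussed above. The table
// name, column family, and caching value are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");   // hypothetical table
        try {
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("cf"));       // hypothetical family
            // With 100KB-750KB rows, keep caching small: each next() RPC
            // ships roughly caching x rowSize bytes, and the scanner lease
            // (hbase.regionserver.lease.period; 60s default, 5s in the test
            // described above) must not expire between successive RPCs.
            scan.setCaching(10);
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result result : scanner) {
                    // Process the row; slow per-batch processing here is
                    // what risks an expired scanner lease.
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}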

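[On the narrower point raised in the thread, that writing to a channel after it has been closed yields ClosedChannelException: a tiny standalone demonstration, independent of HBase.]

// Standalone demonstration that writing to a closed SocketChannel throws
// ClosedChannelException, the same exception seen in the
// HBaseServer$Responder stack trace above.
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.SocketChannel;

public class ClosedChannelDemo {
    public static void main(String[] args) throws Exception {
        SocketChannel channel = SocketChannel.open(); // never connected
        channel.close();                              // closed before writing
        try {
            channel.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));
        } catch (ClosedChannelException e) {
            // A closed channel fails the write immediately, even though it
            // was never connected in the first place.
            System.out.println("write after close -> " + e);
        }
    }
}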