This is a hang with the timeout set to 0, but on CDH2?
St.Ack
On Sun, Jul 18, 2010 at 1:36 PM, Luke Forehand <[email protected]> wrote:
> I experienced the hang on my second job attempt. I will be pastebinning
> stacktraces and logs of all three servers tonight. The datanode log of one
> of the servers is way bigger than the rest, and that's all the analysis I've
> done so far. Meeting with Cloudera on Monday; they'll probably want me to
> migrate to CDH3. Need to mow the lawn... I'll report back soon.
>
> -Luke
>
> On 7/16/10 6:34 PM, "Ryan Rawson" <[email protected]> wrote:
>
> According to Todd, there is some kind of weird thread coordination
> issue which is worked around by setting the timeout to 0, even though
> we aren't actually hitting any timeouts in the failure case.
>
> And it might have been fixed in CDH3. I haven't had a chance to run it
> yet, so I can't say.
>
> -ryan
>
> On Fri, Jul 16, 2010 at 3:32 PM, Stack <[email protected]> wrote:
>> So, it seems like you are bypassing the issue by having no timeout on
>> the socket. I would for sure be interested, though, if you still have
>> the issue on CDH3b2. Most folks will not be running with no socket
>> timeout.
>>
>> Thanks Luke.
>> St.Ack
>>
>>
>> On Fri, Jul 16, 2010 at 3:01 PM, Luke Forehand
>> <[email protected]> wrote:
>>> Using Ryan Rawson's suggested config tweaks, we have just completed a
>>> successful job run with a 15GB sequence file, no hang. I'm setting up to
>>> have multiple files processed this weekend with the new settings. :-) I
>>> believe making the DFS socket write timeout indefinite was the trick.
>>>
>>> I'll post my results on Monday. Thanks for the support thus far!
>>>
>>> -Luke
>>>
>>> On 7/15/10 10:17 PM, "Ryan Rawson" <[email protected]> wrote:
>>>
>>> I'm not seeing anything in that logfile: you are seeing compactions
>>> for various regions, but I'm not seeing flushes (which are typical
>>> during insert loads), and nothing else stands out. One thing we look
>>> for is the log message "Blocking updates", which indicates that a
>>> particular region has decided to hold up writes to keep from taking
>>> on too many inserts.
>>>
>>> Like I said, you could be seeing this on a different regionserver: if
>>> all the clients are blocked on one regionserver and can't get to the
>>> others, then most servers will look idle and only one will actually
>>> show anything interesting in its log.
>>>
>>> Can you check for this behaviour?
>>>
>>> Also, if you want to tweak the config with the values I pasted, that
>>> should help.
>>>
>>> On Thu, Jul 15, 2010 at 7:25 PM, Luke Forehand
>>> <[email protected]> wrote:
>>>> It looks like we are going straight from the default config, no explicit
>>>> setting of anything.
>>>>
>>>> On 7/15/10 9:03 PM, "Ryan Rawson" <[email protected]> wrote:
>>>>
>>>> In this case the regionserver isn't actually doing anything: all the
>>>> IPC handler threads are waiting on their queue handoff (which is how
>>>> they get sockets/work to do).
>>>>
>>>> Something elsewhere, perhaps? Check the logs of your jobs; there might
>>>> be something interesting there.
>>>>
>>>> One thing that frequently happens is that you overrun one regionserver
>>>> with edits and it isn't flushing fast enough, so it pauses updates and
>>>> all clients end up stuck on it.
>>>>
>>>> What was that config again?
>>>> I use these settings:
>>>>
>>>> <property>
>>>>   <name>hbase.hstore.blockingStoreFiles</name>
>>>>   <value>15</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>dfs.datanode.socket.write.timeout</name>
>>>>   <value>0</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>hbase.hregion.memstore.block.multiplier</name>
>>>>   <value>8</value>
>>>> </property>
>>>>
>>>> perhaps try these ones?
>>>>
>>>> -ryan
>>>
>>
>
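For anyone trying Ryan's values, here is a sketch of how they would typically be laid out in hbase-site.xml on the regionservers. The property names and values are exactly as pasted above; the placement and the explanatory comments are a gloss on what each setting does, not something confirmed in the thread, and the defaults mentioned are the usual 0.20-era ones:

<!-- hbase-site.xml (typical placement: regionservers; this is an assumption, not stated in the thread) -->
<configuration>

  <!-- Allow more StoreFiles to accumulate per Store before the region
       blocks updates while waiting for compactions to catch up
       (the era's default was much lower). -->
  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>15</value>
  </property>

  <!-- Disable the DFS socket write timeout used by the DFSClient that the
       regionserver embeds; 0 means wait indefinitely rather than abort a
       slow pipeline write. -->
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>

  <!-- Let a region's memstore grow to multiplier x
       hbase.hregion.memstore.flush.size before updates are blocked,
       instead of the default multiplier of 2. -->
  <property>
    <name>hbase.hregion.memstore.block.multiplier</name>
    <value>8</value>
  </property>

</configuration>

The regionservers need a restart to pick these up.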
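The dfs.datanode.socket.write.timeout property is also read on the datanode side of the write pipeline, so if hangs persist with only the HBase-side change, the same value can be set in hdfs-site.xml on the datanodes as well (again an assumption about this cluster's setup, not something the thread confirms), followed by a datanode restart:

<!-- hdfs-site.xml on the datanodes (assumed placement) -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>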
