I am pretty sure this is related to
https://issues.apache.org/jira/browse/HBASE-3285: a datanode is dead,
but the master tries to create a pipeline to it when splitting logs. I
will be upgrading one of our clusters to 0.90.2 and will test it there.

-Jack

On Fri, May 20, 2011 at 9:15 AM, Stack <[email protected]> wrote:
> Are you running at INFO level logging, Jack? Can you pastebin more log
> context? I'd like to take a look.
> Thanks,
> St.Ack
>
> On Thu, May 19, 2011 at 11:36 PM, Jack Levin <[email protected]> wrote:
>> Thanks. Now, with that value set to "2", we still get slow master
>> recovery of logs after a DN death:
>>
>> 2011-05-19 23:34:55,109 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>> recovery attempt #3 from primary datanode 10.103.7.21:50010
>> java.net.ConnectException: Call to /10.103.7.21:50020 failed on
>> connection exception: java.net.ConnectException: Connection refused
>>
>> It keeps trying to contact a datanode that is not alive. Isn't it
>> supposed to mark the DN as dead-do-not-try-again?
>>
>> -Jack
>>
>> On Thu, May 19, 2011 at 2:22 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>> The config and the retries you pasted are unrelated.
>>>
>>> The former controls the number of retries when regions are moving and
>>> the client must query .META. or -ROOT-.
>>>
>>> The latter is the Hadoop RPC client retry; looking at the code,
>>> the config is ipc.client.connect.max.retries from
>>> https://github.com/apache/hadoop/blob/branch-0.20/src/core/org/apache/hadoop/ipc/Client.java#L631
>>>
>>> J-D
>>>
>>> On Thu, May 19, 2011 at 11:46 AM, Jack Levin <[email protected]> wrote:
>>>> Hello, we have a situation where, when an RS/DN crashes hard, the
>>>> master is very slow to recover. We notice that it waits on these log
>>>> lines:
>>>>
>>>> 2011-05-19 11:20:57,766 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: /10.103.7.22:50020. Already tried 0 time(s).
>>>> 2011-05-19 11:20:58,767 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: /10.103.7.22:50020. Already tried 1 time(s).
>>>> 2011-05-19 11:20:59,768 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: /10.103.7.22:50020. Already tried 2 time(s).
>>>> 2011-05-19 11:21:00,768 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: /10.103.7.22:50020. Already tried 3 time(s).
>>>> 2011-05-19 11:21:01,769 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: /10.103.7.22:50020. Already tried 4 time(s).
>>>> 2011-05-19 11:21:02,769 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: /10.103.7.22:50020. Already tried 5 time(s).
>>>> 2011-05-19 11:21:03,770 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: /10.103.7.22:50020. Already tried 6 time(s).
>>>> 2011-05-19 11:21:04,771 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: /10.103.7.22:50020. Already tried 7 time(s).
>>>> 2011-05-19 11:21:05,771 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: /10.103.7.22:50020. Already tried 8 time(s).
>>>> 2011-05-19 11:21:06,772 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: /10.103.7.22:50020. Already tried 9 time(s).
>>>>
>>>> This set repeats multiple times for log splits. So I looked around
>>>> and set this config:
>>>>
>>>> <property>
>>>>   <name>hbase.client.retries.number</name>
>>>>   <value>2</value>
>>>>   <description>Maximum retries. Used as maximum for all retryable
>>>>   operations such as fetching of the root region from root region
>>>>   server, getting a cell's value, starting a row update, etc.
>>>>   Default: 10.
>>>>   </description>
>>>> </property>
>>>>
>>>> Unfortunately, the next time a server died, it made no difference. Is
>>>> this a known issue in 0.89? If so, was it resolved in 0.90.2?
>>>>
>>>> -Jack
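
[Editor's note: per J-D's pointer above, the ten "Already tried N time(s)" attempts come from the Hadoop IPC client's connect-retry count (default 10), not from hbase.client.retries.number. A sketch of the corresponding override, which would go in the site config read by the master's DFS client (e.g. core-site.xml or hbase-site.xml on its classpath); the value 2 is illustrative, not a tested recommendation:]

```xml
<!-- Hadoop IPC connect retries, read by org.apache.hadoop.ipc.Client.
     Default is 10, matching the "Already tried 0..9 time(s)" lines in
     the log excerpt. Lowering it (2 here is only an example) would
     shorten how long the master waits on a dead datanode's port 50020
     during log splitting. -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>2</value>
</property>
```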
