Hello, we have a situation where, when an RS/DN crashes hard, the master
is very slow to recover. We notice that it waits on these log lines:
2011-05-19 11:20:57,766 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: /10.103.7.22:50020. Already tried 0 time(s).
2011-05-19 11:20:58,767 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: /10.103.7.22:50020. Already tried 1 time(s).
2011-05-19 11:20:59,768 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: /10.103.7.22:50020. Already tried 2 time(s).
2011-05-19 11:21:00,768 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: /10.103.7.22:50020. Already tried 3 time(s).
2011-05-19 11:21:01,769 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: /10.103.7.22:50020. Already tried 4 time(s).
2011-05-19 11:21:02,769 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: /10.103.7.22:50020. Already tried 5 time(s).
2011-05-19 11:21:03,770 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: /10.103.7.22:50020. Already tried 6 time(s).
2011-05-19 11:21:04,771 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: /10.103.7.22:50020. Already tried 7 time(s).
2011-05-19 11:21:05,771 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: /10.103.7.22:50020. Already tried 8 time(s).
2011-05-19 11:21:06,772 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: /10.103.7.22:50020. Already tried 9 time(s).
This set of retries repeats multiple times during log splitting. So I
looked around and set this config:
<property>
  <name>hbase.client.retries.number</name>
  <value>2</value>
  <description>Maximum retries. Used as maximum for all retryable
  operations such as fetching of the root region from root region
  server, getting a cell's value, starting a row update, etc.
  Default: 10.
  </description>
</property>
Unfortunately, the next time a server died, it made no difference. Is
this a known issue in 0.89? If so, was it resolved in 0.90.2?
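In case it helps to quantify the delay, here is a rough Python sketch for counting the retry lines per server and measuring how long they span. It assumes the master log uses exactly the message format pasted above; the function name summarize_retries and the command-line usage are only illustrative and not part of HBase or Hadoop.

```python
import re
import sys
from datetime import datetime

# Matches the "Retrying connect to server" lines shown above.
# Assumption: the timestamp and message layout are exactly as in the pasted log.
RETRY_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.ipc\.Client: Retrying connect to server: "
    r"(\S+)\. Already tried (\d+) time\(s\)\."
)


def summarize_retries(log_text):
    """Return {server: (retry_line_count, seconds_from_first_to_last_line)}."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    stats = {}
    for match in RETRY_RE.finditer(flat):
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        server = match.group(2)
        entry = stats.setdefault(
            server, {"count": 0, "first": stamp, "last": stamp}
        )
        entry["count"] += 1
        entry["first"] = min(entry["first"], stamp)
        entry["last"] = max(entry["last"], stamp)
    return {
        server: (e["count"], (e["last"] - e["first"]).total_seconds())
        for server, e in stats.items()
    }


if __name__ == "__main__":
    # Hypothetical usage: python retry_summary.py /path/to/master.log
    with open(sys.argv[1]) as handle:
        for server, (count, seconds) in summarize_retries(handle.read()).items():
            print(f"{server}: {count} retry lines over {seconds:.3f}s")
```

Running it over the master log from a crash should show how much of the recovery time is spent in these connect retries, as opposed to the rest of the log split.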
-Jack