Are you running at INFO level logging, Jack? Can you pastebin more log
context? I'd like to take a look.

Thanks,
St.Ack

On Thu, May 19, 2011 at 11:36 PM, Jack Levin <[email protected]> wrote:
> Thanks. Even with that value set to "2", master recovery of logs is
> still slow after a DN death:
>
> 2011-05-19 23:34:55,109 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> recovery attempt #3 from primary datanode 10.103.7.21:50010
> java.net.ConnectException: Call to /10.103.7.21:50020 failed on
> connection exception: java.net.ConnectException: Connection refused
>
> It keeps trying to contact a datanode that is not alive. Isn't it
> supposed to mark the DN as dead-do-not-try-again?
>
> -Jack
>
> On Thu, May 19, 2011 at 2:22 PM, Jean-Daniel Cryans <[email protected]>
> wrote:
>> The config and the retries you pasted are unrelated.
>>
>> The former controls the number of retries when regions are moving and
>> the client must query .META. or -ROOT-
>>
>> The latter is the Hadoop RPC client timeout and looking at the code
>> the config is ipc.client.connect.max.retries from
>> https://github.com/apache/hadoop/blob/branch-0.20/src/core/org/apache/hadoop/ipc/Client.java#L631
>>
>> J-D
>>
>> On Thu, May 19, 2011 at 11:46 AM, Jack Levin <[email protected]> wrote:
>>> Hello, we have a situation where, when an RS/DN crashes hard, the
>>> master is very slow to recover. We notice that it waits on these
>>> log lines:
>>> 2011-05-19 11:20:57,766 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: /10.103.7.22:50020. Already tried 0 time(s).
>>> 2011-05-19 11:20:58,767 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: /10.103.7.22:50020. Already tried 1 time(s).
>>> 2011-05-19 11:20:59,768 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: /10.103.7.22:50020. Already tried 2 time(s).
>>> 2011-05-19 11:21:00,768 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: /10.103.7.22:50020. Already tried 3 time(s).
>>> 2011-05-19 11:21:01,769 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: /10.103.7.22:50020. Already tried 4 time(s).
>>> 2011-05-19 11:21:02,769 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: /10.103.7.22:50020. Already tried 5 time(s).
>>> 2011-05-19 11:21:03,770 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: /10.103.7.22:50020. Already tried 6 time(s).
>>> 2011-05-19 11:21:04,771 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: /10.103.7.22:50020. Already tried 7 time(s).
>>> 2011-05-19 11:21:05,771 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: /10.103.7.22:50020. Already tried 8 time(s).
>>> 2011-05-19 11:21:06,772 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: /10.103.7.22:50020. Already tried 9 time(s).
>>>
>>> This set repeats multiple times for log splits. So I looked around
>>> and set this config:
>>>
>>> <property>
>>>   <name>hbase.client.retries.number</name>
>>>   <value>2</value>
>>>   <description>Maximum retries. Used as maximum for all retryable
>>>   operations such as fetching of the root region from root region
>>>   server, getting a cell's value, starting a row update, etc.
>>>   Default: 10.
>>>   </description>
>>> </property>
>>>
>>> Unfortunately, the next time a server died, it made no difference.
>>> Is this a known issue for 0.89? If so, was it resolved in 0.90.2?
>>>
>>> -Jack
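
For reference, the setting J-D names (ipc.client.connect.max.retries)
could be overridden with a property block shaped like the one Jack
pasted. This is only a sketch: the value 2 is illustrative, not a
recommendation, and which file it belongs in (the master's
hbase-site.xml or Hadoop's core-site.xml) is an assumption that
depends on the deployment. The ten "Already tried N time(s)" lines in
the quoted log are consistent with Hadoop's default of 10.

```xml
<!-- Assumed placement: the config file read by the process that logs
     "Retrying connect to server" (here, the HBase master). -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <!-- Illustrative value only; the Hadoop default is 10. -->
  <value>2</value>
</property>
```

Lowering this only shortens each round of connect attempts. It does
not by itself stop the client from going back to the same dead
datanode on a later recovery attempt.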
