One other question, we get this:

2014-02-13 02:46:12,768 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.101.5.5:50010 for file /hbase/img32/b97657bfcbf922045d96315a4ada0782/att/4890606694307129591 for block -9099107892773428976:java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.101.5.5:50010]
Why can't the RS do this instead:

hbase-root-regionserver-mtab5.prod.imageshack.com.log.2014-02-10:2014-02-10 22:05:11,763 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.103.8.109:50010, add to deadNodes and continue

"add to deadNodes and continue" specifically?

-Jack

On Thu, Feb 13, 2014 at 8:55 PM, Jack Levin <[email protected]> wrote:

> I meant to say, I can't upgrade now; it's a petabyte storage system. A
> little hard to keep a copy of something like that.
>
>
> On Thu, Feb 13, 2014 at 3:20 PM, Jack Levin <[email protected]> wrote:
>
>> Can upgrade now, but I would take suggestions on how to deal with this.
>> On Feb 13, 2014 2:02 PM, "Stack" <[email protected]> wrote:
>>
>>> Can you upgrade, Jack? This stuff is better in later versions (the
>>> dfsclient keeps a running list of bad datanodes...)
>>> St.Ack
>>>
>>>
>>> On Thu, Feb 13, 2014 at 1:41 PM, Jack Levin <[email protected]> wrote:
>>>
>>> > As far as I can tell, I am hitting this issue:
>>> >
>>> > http://grepcode.com/search/usages?type=method&id=repository.cloudera.com%24content%24repositories%[email protected]%[email protected]@org%24apache%24hadoop%24hdfs%24protocol@LocatedBlocks@findBlock%28long%29&k=u
>>> >
>>> > http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/hdfs/DFSClient.java#1581
>>> >
>>> > 1581   // search cached blocks first
>>> > 1582   int targetBlockIdx = locatedBlocks.findBlock(offset);
>>> > 1583   if (targetBlockIdx < 0) { // block is not cached
>>> >
>>> > Our RS DFSClient is asking for a block on a dead datanode because the
>>> > block location is somehow cached in the DFSClient. It seems that after
>>> > a DN dies, DFSClients in HBase 0.90.5 do not drop the cached reference
>>> > to where those blocks are. Seems like a problem. It would be good if
>>> > there were a way for that cache to expire, because our dead DN has been
>>> > down since Sunday.
>>> >
>>> > -Jack
>>> >
>>> >
>>> > On Thu, Feb 13, 2014 at 11:23 AM, Stack <[email protected]> wrote:
>>> >
>>> > > The RS opens files and then keeps them open as long as the RS is
>>> > > alive. We're failing the read of this replica and then we succeed in
>>> > > getting the block elsewhere? Do you get that exception every time?
>>> > > What hadoop version, Jack? Do you have short-circuit reads on?
>>> > > St.Ack
>>> > >
>>> > >
>>> > > On Thu, Feb 13, 2014 at 10:41 AM, Jack Levin <[email protected]> wrote:
>>> > >
>>> > > > I meant it's in the 'dead' list on the HDFS namenode page. hadoop
>>> > > > fsck / shows no issues.
>>> > > >
>>> > > >
>>> > > > On Thu, Feb 13, 2014 at 10:38 AM, Jack Levin <[email protected]> wrote:
>>> > > >
>>> > > > > Good morning --
>>> > > > > I had a question: we have had a datanode go down, and it has been
>>> > > > > down for a few days, yet hbase is still trying to talk to that
>>> > > > > dead datanode:
>>> > > > >
>>> > > > > 2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient:
>>> > > > > Failed to connect to /10.101.5.5:50010 for file
>>> > > > > /hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544
>>> > > > > for block 805865
>>> > > > >
>>> > > > > So the question is, how come the RS is trying to talk to a dead
>>> > > > > datanode? It is even in the HDFS dead-node list.
>>> > > > >
>>> > > > > Isn't the RS just an HDFS client? It should not talk to an
>>> > > > > offlined HDFS datanode that went down. This caused a lot of
>>> > > > > issues in our cluster.
>>> > > > >
>>> > > > > Thanks,
>>> > > > > -Jack
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>
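[Editor's note] The behaviour Jack is asking for, "add to deadNodes and continue", amounts to the per-stream reader remembering which replicas have already failed and refreshing its cached block locations, rather than retrying the same dead host with a fresh 60-second connect timeout each time. Below is a minimal, self-contained Java sketch of that idea under stated assumptions: the class and method names (ReplicaReaderSketch, NamenodeClient, chooseDatanode, reportFailure) are illustrative stand-ins and are not the real Hadoop API; the actual logic in the 0.20.x line lives inside DFSClient's input stream, which is what the quoted lines 1581-1583 point at.

import java.util.*;

public class ReplicaReaderSketch {

    /** Stand-in for the namenode RPC that returns replica locations for a block (hypothetical). */
    interface NamenodeClient {
        List<String> getReplicaLocations(long blockId);
    }

    private final NamenodeClient namenode;
    // Per-stream cache of block locations, analogous to the cached locatedBlocks
    // discussed in the thread.
    private final Map<Long, List<String>> cachedLocations = new HashMap<>();
    // Per-stream set of datanodes that have already failed ("deadNodes").
    private final Set<String> deadNodes = new HashSet<>();

    ReplicaReaderSketch(NamenodeClient namenode) {
        this.namenode = namenode;
    }

    /** Pick a replica for the block, skipping nodes already marked dead. */
    String chooseDatanode(long blockId) {
        List<String> locations = cachedLocations.computeIfAbsent(
                blockId, namenode::getReplicaLocations);
        for (String node : locations) {
            if (!deadNodes.contains(node)) {
                return node;
            }
        }
        // Every cached replica has failed: drop the stale cache entry and ask
        // the namenode again, in case replication has moved the block.
        cachedLocations.remove(blockId);
        deadNodes.clear();
        List<String> fresh = cachedLocations.computeIfAbsent(
                blockId, namenode::getReplicaLocations);
        return fresh.isEmpty() ? null : fresh.get(0);
    }

    /** Called when a connect or read against a replica times out or fails. */
    void reportFailure(String node) {
        // "add to deadNodes and continue": remember the bad node so the next
        // chooseDatanode() call falls through to another replica instead of
        // retrying the same dead host.
        deadNodes.add(node);
    }
}

The point of the sketch is the interaction Jack describes: if the location cache is never invalidated and failed nodes are not tracked per stream, a long-dead datanode keeps being chosen and every read eats the full connect timeout before falling back.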
