One other question, we get this:

2014-02-13 02:46:12,768 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.101.5.5:50010 for file /hbase/img32/b97657bfcbf922045d96315a4ada0782/att/4890606694307129591 for block -9099107892773428976:java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.101.5.5:50010]
Why can't the RS do this instead:

hbase-root-regionserver-mtab5.prod.imageshack.com.log.2014-02-10:2014-02-10 22:05:11,763 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.103.8.109:50010, add to deadNodes and continue

"add to deadNodes and continue" specifically?

-Jack

On Thu, Feb 13, 2014 at 8:55 PM, Jack Levin <[email protected]> wrote:

> I meant to say, I can't upgrade now; it's a petabyte storage system. A
> little hard to keep a copy of something like that.
>
>
> On Thu, Feb 13, 2014 at 3:20 PM, Jack Levin <[email protected]> wrote:
>
>> Can upgrade now, but I would take suggestions on how to deal with this.
>> On Feb 13, 2014 2:02 PM, "Stack" <[email protected]> wrote:
>>
>>> Can you upgrade, Jack? This stuff is better in later versions (the
>>> dfsclient keeps a running list of bad datanodes...)
>>> St.Ack
>>>
>>>
>>> On Thu, Feb 13, 2014 at 1:41 PM, Jack Levin <[email protected]> wrote:
>>>
>>> > As far as I can tell, I am hitting this issue:
>>> >
>>> > http://grepcode.com/search/usages?type=method&id=repository.cloudera.com%24content%24repositories%[email protected]%[email protected]@org%24apache%24hadoop%24hdfs%24protocol@LocatedBlocks@findBlock%28long%29&k=u
>>> >
>>> > http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/hdfs/DFSClient.java#1581
>>> >
>>> > 1581   // search cached blocks first
>>> > 1582   int targetBlockIdx = locatedBlocks.findBlock(offset);
>>> > 1583   if (targetBlockIdx < 0) { // block is not cached
>>> >
>>> > Our RS DFSClient is asking for a block on a dead datanode because the
>>> > block location is somehow cached in the DFSClient. It seems that after
>>> > a DN dies, DFSClients in HBase 0.90.5 do not drop the cached reference
>>> > to where those blocks are. Seems like a problem. It would be good if
>>> > there were a way for that cache to expire, because our dead DN has been
>>> > down since Sunday.
>>> >
>>> > -Jack
>>> >
>>> >
>>> > On Thu, Feb 13, 2014 at 11:23 AM, Stack <[email protected]> wrote:
>>> >
>>> > > The RS opens files and then keeps them open as long as the RS is
>>> > > alive. We're failing the read of this replica and then we succeed in
>>> > > getting the block elsewhere? Do you get that exception every time?
>>> > > What hadoop version, Jack? Do you have short-circuit reads on?
>>> > > St.Ack
>>> > >
>>> > >
>>> > > On Thu, Feb 13, 2014 at 10:41 AM, Jack Levin <[email protected]> wrote:
>>> > >
>>> > > > I meant it's in the 'dead' list on the HDFS namenode page. hadoop
>>> > > > fsck / shows no issues.
>>> > > >
>>> > > >
>>> > > > On Thu, Feb 13, 2014 at 10:38 AM, Jack Levin <[email protected]> wrote:
>>> > > >
>>> > > > > Good morning --
>>> > > > > I had a question: we have had a datanode go down, and it has been
>>> > > > > down for a few days, yet hbase is still trying to talk to that
>>> > > > > dead datanode:
>>> > > > >
>>> > > > > 2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient:
>>> > > > > Failed to connect to /10.101.5.5:50010 for file
>>> > > > > /hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544
>>> > > > > for block 805865
>>> > > > >
>>> > > > > So the question is, how come the RS is trying to talk to a dead
>>> > > > > datanode? It is even in the HDFS dead-node list.
>>> > > > >
>>> > > > > Isn't the RS just an HDFS client? It should not talk to an
>>> > > > > offlined HDFS datanode that went down. This caused a lot of
>>> > > > > issues in our cluster.
>>> > > > >
>>> > > > > Thanks,
>>> > > > > -Jack
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>
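[Editor's note] The behaviour Jack is asking for, "add to deadNodes and continue", amounts to the per-stream reader remembering which replicas have already failed and refreshing its cached block locations, rather than retrying the same dead host with a fresh 60-second connect timeout each time. Below is a minimal, self-contained Java sketch of that idea under stated assumptions: the class and method names (ReplicaReaderSketch, NamenodeClient, chooseDatanode, reportFailure) are illustrative stand-ins and are not the real Hadoop API; the actual logic in the 0.20.x line lives inside DFSClient's input stream, which is what the quoted lines 1581-1583 point at.

import java.util.*;

public class ReplicaReaderSketch {

    /** Stand-in for the namenode RPC that returns replica locations for a block (hypothetical). */
    interface NamenodeClient {
        List<String> getReplicaLocations(long blockId);
    }

    private final NamenodeClient namenode;
    // Per-stream cache of block locations, analogous to the cached locatedBlocks
    // discussed in the thread.
    private final Map<Long, List<String>> cachedLocations = new HashMap<>();
    // Per-stream set of datanodes that have already failed ("deadNodes").
    private final Set<String> deadNodes = new HashSet<>();

    ReplicaReaderSketch(NamenodeClient namenode) {
        this.namenode = namenode;
    }

    /** Pick a replica for the block, skipping nodes already marked dead. */
    String chooseDatanode(long blockId) {
        List<String> locations = cachedLocations.computeIfAbsent(
                blockId, namenode::getReplicaLocations);
        for (String node : locations) {
            if (!deadNodes.contains(node)) {
                return node;
            }
        }
        // Every cached replica has failed: drop the stale cache entry and ask
        // the namenode again, in case replication has moved the block.
        cachedLocations.remove(blockId);
        deadNodes.clear();
        List<String> fresh = cachedLocations.computeIfAbsent(
                blockId, namenode::getReplicaLocations);
        return fresh.isEmpty() ? null : fresh.get(0);
    }

    /** Called when a connect or read against a replica times out or fails. */
    void reportFailure(String node) {
        // "add to deadNodes and continue": remember the bad node so the next
        // chooseDatanode() call falls through to another replica instead of
        // retrying the same dead host.
        deadNodes.add(node);
    }
}

The point of the sketch is the interaction Jack describes: if the location cache is never invalidated and failed nodes are not tracked per stream, a long-dead datanode keeps being chosen and every read eats the full connect timeout before falling back.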
