Very often the "cannot open filename" error happens when the region in
question was reopened somewhere else and then compacted there, so the
old store file the client is pointing at no longer exists. As to why
the region was reassigned, most of the time it's because garbage
collection pauses took too long. The master log should have all the
required evidence, and the region server should have printed some
"slept for Xms" (where X is some number of milliseconds) messages right
before everything went bad.
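
To make that "slept for Xms" hint concrete: roughly speaking, the region
server runs a small internal loop that asks to sleep for a fixed interval
and measures how long it actually slept, so a stop-the-world GC pause shows
up as a sleep far longer than requested. Here is a minimal illustrative
sketch of that kind of check (plain Java, not HBase's actual code; the
interval and threshold below are invented):

  public class PauseDetector implements Runnable {
    private static final long INTERVAL_MS = 1000;        // requested sleep per loop (hypothetical)
    private static final long WARN_THRESHOLD_MS = 4000;  // hypothetical warning threshold

    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        long start = System.currentTimeMillis();
        try {
          Thread.sleep(INTERVAL_MS);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        long slept = System.currentTimeMillis() - start;
        if (slept > WARN_THRESHOLD_MS) {
          // A long GC pause makes this fire; in HBase this kind of message
          // shows up shortly before the master decides the region server is
          // dead and reassigns its regions elsewhere.
          System.err.println("Slept for " + slept + "ms instead of " + INTERVAL_MS
              + "ms, likely a long garbage collection pause");
        }
      }
    }
  }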

Here are some general tips on debugging problems in HBase
http://hbase.apache.org/book/trouble.html

J-D

On Sat, May 7, 2011 at 2:10 AM, Stanley Xu <[email protected]> wrote:
> Dear all,
>
> We have been using HBase 0.20.6 in our environment, and it was pretty stable
> for the last couple of months, but we ran into some reliability issues last
> week. Our situation is very similar to the one described in the following link.
> http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
>
> When we use an HBase client to connect to the HBase table, it appears to get
> stuck. And we can find log entries like the following
>
> WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to
> /10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169
> for block -4841840178880951849:java.io.IOException: Got error in response to
> OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for
> block -4841840178880951849
>
> INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call
> get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
> from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename
> /hbase/users/73382377/data/312780071564432169
> java.io.IOException: Cannot open filename
> /hbase/users/73382377/data/312780071564432169
>
>
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> infoPort=50075, ipcPort=50020):
> Got exception while serving blk_-4841840178880951849_50277 to
> /10.25.119.113:
> java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
>
> on the server side.
>
> And if we do a flush and then a major compaction on ".META.", the
> problem goes away, but it comes back again some time later.
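
That flush + major compact workaround can also be scripted instead of run by
hand; here is a minimal sketch, assuming the 0.20-era HBaseAdmin API (the
flush/majorCompact method names are taken from that release and not verified
against 0.20.6):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class FlushAndCompactMeta {
    public static void main(String[] args) throws Exception {
      // Uses the cluster settings found in hbase-site.xml on the classpath.
      HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
      // Force the in-memory .META. edits out to HDFS...
      admin.flush(".META.");
      // ...then rewrite its store files so stale file references go away.
      admin.majorCompact(".META.");
    }
  }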
>
> At first we guessed it might be an xceiver problem, so we raised the
> xceiver limit to 4096 as described in the link here.
> http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
>
> But we still get the same problem. It looks like a restart of the whole
> HBase cluster fixes the problem for a while, but we obviously cannot keep
> restarting the servers.
>
> I am waiting online and will really appreciate any reply.
>
>
> Best wishes,
> Stanley Xu
>
