Thanks J-D. We are using Hadoop 0.20.2 with quite a few patches. Could you please tell me which patches the WAL requires? Do we need all the patches in branch-0.20-append? I thought we had only applied the patch that adds support for the append function.
Thanks.

On Wed, May 11, 2011 at 12:50 AM, Jean-Daniel Cryans wrote:

> Data cannot be corrupted at all, since the files in HDFS are immutable
> and CRC'ed (unless you are able to lose all 3 copies of every block).
>
> Corruption would happen at the metadata level, where the .META.
> table, which contains the regions for the tables, would lose rows. This
> is a likely scenario if the region server holding that region dies of
> GC, since the Hadoop version you are using alongside HBase 0.20.6 doesn't
> support appends, meaning that the write-ahead log would be missing
> data that, obviously, cannot be replayed.
>
> The best advice I can give you is to upgrade.
>
> J-D
>
> On Tue, May 10, 2011 at 5:44 AM, Stanley Xu wrote:
> > Thanks J-D. I am still a little confused: it looks like when we have a
> > corrupt HBase table or some inconsistent data, we get lots of messages
> > like that. But when the HBase table is fine, we also get a few lines of
> > messages like that.
> >
> > How can I tell whether they come from corrupted data or just from a
> > miss in the scenario you mentioned?
> >
> > On Tue, May 10, 2011 at 6:23 AM, Jean-Daniel Cryans wrote:
> >
> >> Very often the "cannot open filename" happens when the region in
> >> question was reopened somewhere else and that region was compacted. As
> >> to why it was reassigned, most of the time it's because of garbage
> >> collections taking too long. The master log should have all the
> >> required evidence, and the region server should print some "slept for
> >> Xms" (where X is some number of ms) messages before everything goes
> >> bad.
> >>
> >> Here are some general tips on debugging problems in HBase
> >> http://hbase.apache.org/book/trouble.html
> >>
> >> J-D
> >>
> >> On Sat, May 7, 2011 at 2:10 AM, Stanley Xu wrote:
> >> > Dear all,
> >> >
> >> > We were using HBase 0.20.6 in our environment, and it was pretty stable
> >> > over the last couple of months, but we have hit some reliability issues
> >> > since last week. Our situation is very similar to the following link.
> >> >
> >> > http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
> >> >
> >> > When we use an HBase client to connect to the HBase table, it appears
> >> > to hang, and we find logs like
> >> >
> >> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169 for block -4841840178880951849:java.io.IOException: Got error in response to OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for block -4841840178880951849
> >> >
> >> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1, timeRange=[0,9223372036854775807), families={(family=data, columns=ALL}) from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename /hbase/users/73382377/data/312780071564432169
> >> > java.io.IOException: Cannot open filename /hbase/users/73382377/data/312780071564432169
> >> >
> >> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211, infoPort=50075, ipcPort=50020):
> >> > Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113:
> >> > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
> >> >
> >> > on the server side.
> >> >
> >> > If we do a flush and then a major compaction on ".META.", the
> >> > problem goes away, but it appears again some time later.
> >> >
> >> > At first we guessed it might be an xceiver problem, so we set the
> >> > xceiver limit to 4096 as described in the link here.
> >> > http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
> >> >
> >> > But we still get the same problem. It looks like a restart of the whole
> >> > HBase cluster fixes the problem for a while, but we cannot keep
> >> > restarting the servers all the time.
> >> >
> >> > I am waiting online and would really appreciate any reply.
> >> >
> >> > Best wishes,
> >> > Stanley Xu
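
For the log check suggested in the thread (looking for "slept for Xms" messages in the region server log before things go bad), something like the following Python sketch might help. It is only an illustration: the regular expression is an assumption based on the "slept for Xms" wording quoted above, the exact message text may differ between HBase versions, and the function name and the 10000 ms threshold are made up for this example.

```python
import re

# Assumed pattern: the thread only says the region server prints
# "slept for Xms"-style messages; the real wording may differ by version.
PAUSE_RE = re.compile(r"slept\s+(?:for\s+)?(\d+)\s*ms", re.IGNORECASE)


def find_long_pauses(lines, threshold_ms=10000):
    """Return (line_number, pause_ms, text) for each pause >= threshold_ms.

    `lines` is any iterable of log lines, for example an open file.
    The default threshold is an arbitrary illustrative value.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PAUSE_RE.search(line)
        if match is None:
            continue
        pause_ms = int(match.group(1))
        if pause_ms >= threshold_ms:
            hits.append((number, pause_ms, line.rstrip()))
    return hits


def scan_log(path, threshold_ms=10000):
    """Scan one log file and return the long pauses found in it."""
    with open(path, errors="replace") as log:
        return find_long_pauses(log, threshold_ms)
```

Calling scan_log() on a region server log file (the path depends on your installation) returns the matching lines with their line numbers, which can then be compared against the time the region was reassigned according to the master log.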
