Thanks for the pointers.

The damage manifested as scanners skipping over a range in our time series 
data.  We knew from other systems that there should be records in that 
region that weren't being returned.  When we looked closely, we saw an extremely 
improbable jump in rowkeys that should be evenly distributed UUIDs beneath an 
hourly prefix.  We checked the region listing and start/end keys in the 
regionserver UI, and found a region listed that wasn't being served.  We traced 
it back to a couple of possible locations under /hbase, and got some odd 
results when we tried to point the HFile main method at those files.
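
Incidentally, a jump like that is easy to flag mechanically: beneath an hourly 
prefix, uniformly distributed 128-bit UUIDs should have roughly even spacing, so 
a gap orders of magnitude larger than the mean spacing is a red flag.  Here's a 
rough sketch of that kind of check (illustrative only - the key format and 
threshold are assumptions, not our actual tooling):

```python
# Illustrative sketch: flag improbable gaps between consecutive UUID rowkeys
# sharing an hourly prefix.  The key layout ('<hour-prefix>:<32-hex-uuid>')
# and the gap threshold are assumptions for the example.

def find_suspicious_gaps(rowkeys, factor=1000):
    """rowkeys: keys shaped like '2010-09-28-05:<32-hex-uuid>'.
    Returns (prefix, low, high) triples where the gap between two
    consecutive UUIDs is more than `factor` times the spacing you'd
    expect if the UUIDs under that prefix were uniformly distributed."""
    suspicious = []
    by_prefix = {}
    for key in rowkeys:
        prefix, _, uuid_hex = key.partition(":")
        by_prefix.setdefault(prefix, []).append(int(uuid_hex, 16))
    for prefix, values in sorted(by_prefix.items()):
        values.sort()
        if len(values) < 2:
            continue
        # Mean spacing of n uniform draws over the 128-bit UUID space.
        expected = (2 ** 128) // (len(values) + 1)
        for a, b in zip(values, values[1:]):
            if b - a > factor * expected:
                suspicious.append((prefix, format(a, "032x"), format(b, "032x")))
    return suspicious
```

In our case we just eyeballed the jump, but something like this could be run 
against a periodic scan as a canary for holes.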

Here's the region we found missing, along with the previous and next regions:

Previous:
ets.derived.events.pb,2010-09-28-02:dcba1a8d00d945e6a90442c9561e8ac4,1285667269423
      ets-lax-prod-hadoop-10.corp.adobe.com:60030     
2010-09-28-02:dcba1a8d00d945e6a90442c9561e8ac4  
2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d

Affected region:
ets.derived.events.pb,2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d,1285684268773
      ets-lax-prod-hadoop-04.corp.adobe.com:60030     
2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d  
2010-09-28-11:29664000a226486e9ecb7547a738d101

Next:
ets.derived.events.pb,2010-09-28-11:29664000a226486e9ecb7547a738d101,1285687842817
      ets-lax-prod-hadoop-07.corp.adobe.com:60030     
2010-09-28-11:29664000a226486e9ecb7547a738d101  
2010-09-28-12:f8fa9dc21bfe4091a4864d0adc655b4d


The affected region on RS UI:
ets.derived.events.pb,2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d,1285684268773.1836172434
2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d  
2010-09-28-11:29664000a226486e9ecb7547a738d101  stores=1, storefiles=1, 
storefileSizeMB=45, memstoreSizeMB=0, storefileIndexSizeMB=0


Directory for region on hdfs (guessing based on suffix from RS UI):
/hbase/ets.derived.events.pb/1836172434


Here's what happened when we ran the HFile main method on those files.

Checked with HFile by region name:

[hadoop@ets-lax-prod-hadoop-01 ~]$ hbase org.apache.hadoop.hbase.io.hfile.HFile -r \
  'ets.derived.events.pb,2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d,1285684268773.1836172434' -v
cat: /opt/hadoop/hbase/target/cached_classpath.txt: No such file or directory
region dir -> hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/107531684
Number of region files found -> 0

Note that it resolved to a different directory on HDFS than I would have 
thought.  Pointing HFile at that path directly, it doesn't like it:

[hadoop@ets-lax-prod-hadoop-01 ~]$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f \
  hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/107531684 -v -k
cat: /opt/hadoop/hbase/target/cached_classpath.txt: No such file or directory
Scanning -> hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/107531684
ERROR, file doesnt exist: hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/107531684

So I put in the path I thought it was, and although it's there on HDFS, HFile 
can't find it:
[hadoop@ets-lax-prod-hadoop-01 ~]$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f \
  hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/1836172434 -v -k
cat: /opt/hadoop/hbase/target/cached_classpath.txt: No such file or directory
Scanning -> hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/1836172434
java.io.FileNotFoundException: File does not exist: /hbase/ets.derived.events.pb/1836172434
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1586)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1577)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:428)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:185)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:431)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.<init>(HFile.java:742)
        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1870)

[hadoop@ets-lax-prod-hadoop-01 ~]$ hadoop dfs -ls /hbase/ets.derived.events.pb/1836172434
Found 2 items
-rw-r--r--   3 hadoop hadoop        862 2010-09-28 07:31 /hbase/ets.derived.events.pb/1836172434/.regioninfo
drwxr-xr-x   - hadoop hadoop          0 2011-05-06 16:36 /hbase/ets.derived.events.pb/1836172434/f1


We ran hbase hbck, which came back clean.  After stopping and restarting HBase, 
hbck gave errors (not sure why it was OK before and not after - maybe a split 
happened in the interim - but we are running durable now, so hopefully a change 
to META would not get lost).  After that I made a backup and tried 
add_table.rb, which seemed to make the problem worse.  We eventually concluded 
that we must have lost a write to META last year, when we were running Hadoop 
0.20.1 and HBase 0.20.3 without durability (we currently run CDH3b3).  This is 
supported by the fact that other environments running the same code are OK, and 
hadoop fsck / is also healthy.

My solution is to create a broadly similar table and read the HFiles from the 
old one directly into it.  This would be an MR job with an HFileInputFormat I 
wrote using the HFile API, and a TableOutputFormat writing into the new table 
(I didn't want to put writing directly to HFiles on my plate at this time).  
Once that's done and verified, I'll drop the old table and move on.
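
As Stack notes in the quoted message below, an HFile reader returns every entry 
in the file, so the map side has to collapse multiple versions itself before 
writing Puts.  Here's the gist of that collapsing step, modeled in Python for 
brevity - the tuples stand in for KeyValues, and this is not the real HBase API:

```python
# Illustrative model of the version-collapsing the mapper must do:
# an HFileScanner yields every KeyValue in the file, so to mimic a
# default HBase Scan (VERSIONS=1) we keep only the newest timestamp
# per (row, family, qualifier).  Tuples stand in for KeyValue objects.

def latest_versions(keyvalues):
    """keyvalues: iterable of (row, family, qualifier, timestamp, value).
    Returns {(row, family, qualifier): (timestamp, value)}, keeping the
    highest timestamp seen for each cell."""
    latest = {}
    for row, fam, qual, ts, value in keyvalues:
        cell = (row, fam, qual)
        if cell not in latest or ts > latest[cell][0]:
            latest[cell] = (ts, value)
    return latest
```

A production mapper would also need to honor delete markers across storefiles; 
I'm ignoring those here.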

Because of the version of HBase we're running, we don't have hbck -fix 
available, and since I assume the damage happened months ago, we may also have 
some regions overlapping by now.  It could be hard to manually stitch them back 
together, so this holistic approach seemed like the best bet.
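
For what it's worth, holes and overlaps like this can be spotted mechanically 
from the region listing: sorted by start key, each region's end key should 
equal the next region's start key.  A quick sketch of that invariant check (a 
hypothetical helper, not something hbck provides in our version):

```python
# Illustrative check of region-chain continuity: for regions sorted by
# start key, each end key should equal the next region's start key.
# Anything else is a hole (gap in the keyspace) or an overlap.

def check_region_chain(regions):
    """regions: list of (start_key, end_key) strings, sorted by start_key.
    Returns a list of (kind, end_key, next_start_key) problems, where
    kind is 'hole' or 'overlap'."""
    problems = []
    for (_, end), (next_start, _) in zip(regions, regions[1:]):
        if end < next_start:
            problems.append(("hole", end, next_start))
        elif end > next_start:
            problems.append(("overlap", end, next_start))
    return problems
```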

One thing I can put in the win column for HBase is that the damaged table still 
functions fine in the parts that don't have holes, which is most of the table.  
So we can keep serving the majority of our dataset (and workload) and take the 
time to fix the damage carefully.

Sandy

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> Sent: Tuesday, May 31, 2011 13:10
> To: [email protected]
> Subject: Re: HFile.Reader scans return latest version?
> 
> On Tue, May 31, 2011 at 11:05 AM, Sandy Pratt <[email protected]> wrote:
> > Hi all,
> >
> > I'm doing some work to read records directly from the HFiles of a damaged
> table.  When I scan through the records in the HFile using
> org.apache.hadoop.hbase.io.hfile.HFileScanner, will I get only the latest
> version of the record as with a default HBase Scan?  Or do I need to do some
> work to pull out the latest version from several?
> >
> 
> It looks like it just returns all entries in the hfile.  See tests -- e.g. 
> TestHFile --
> for how to make an HFile Reader instance and pull the values.  The tail
> of HFile has some examples too?
> 
> Tell us about the 'damaged table'.
> 
> St.Ack.
