I'm in the midst of putting together an HBase backup/restore (and/or cluster-to-cluster 
copy) process built around exporting one table at a time with 
org.apache.hadoop.hbase.mapreduce.Export from HBASE-1684.

Just a reminder:
Usage: Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

In the pseudocode below:

persistant_store is some kind of non-HBase store in the cloud that you can just 
push stuff onto.
all_my_Hbase_tables_to_be_backedup is a list of table names.
create_table is a function that would properly create a new HBase table with the 
given name, based on the schema passed in as an argument.
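
For concreteness, here's roughly how I picture those helpers as shell functions. 
This is just a sketch; the s3n:// layout under somebucket/meta/, the use of 
"hbase shell" and "hadoop fs", and the exact schema format are assumptions on my 
part, not something we've actually built:

get_schema_from_HBase() {
        # print the table's column family descriptors via the hbase shell
        echo "describe '$1'" | $HBASE_HOME/bin/hbase shell
}

store_schema_for_table_in_persistant_store() {
        # $1 = table name, $2 = schema text; stash it next to the exports
        echo "$2" > /tmp/$1.schema
        $HADOOP_HOME/bin/hadoop fs -put /tmp/$1.schema s3n://somebucket/meta/$1.schema
}

store_times_for_table_in_persistant_store() {
        # $1 = table name, $2 = starttime, $3 = endtime
        # (in real life you'd deal with overwriting the previous object)
        echo "$2 $3" > /tmp/$1.times
        $HADOOP_HOME/bin/hadoop fs -put /tmp/$1.times s3n://somebucket/meta/$1.times
}

get_last_endtime_from_persistant_store() {
        # echo the endtime recorded by the most recent run for table $1
        $HADOOP_HOME/bin/hadoop fs -cat s3n://somebucket/meta/$1.times | awk '{print $2}'
}

create_table() {
        # $1 = table name, $2 = stored schema; glossing over turning the stored
        # schema back into a real "create '$1', {NAME => ..., VERSIONS => ...}" statement
        echo "create '$1', ..." | $HBASE_HOME/bin/hbase shell
}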

Can I assume that if I do the following (pseudocode) on HBase 0.20.3 or 0.90.x, 
I'll get an initial full backup to S3:

starttime = beginning_of_time
endtime = NOW_Minus_60_seconds
versions = 100000 (the largest number of versions we keep; we do some weird 
things with versions in some tables)

for table in all_my_Hbase_tables_to_be_backedup
do
        $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.20.3.jar export \
                $table \
                s3n://somebucket/$table/ \
                $versions \
                $starttime \
                $endtime

        store_times_for_table_in_persistant_store( $table $starttime $endtime )
        store_schema_for_table_in_persistant_store( $table get_schema_from_HBase($table) )
done
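
For what it's worth, in the real script I'd compute those times as milliseconds 
since the epoch, since that's what Export's starttime/endtime (and HBase 
timestamps in general) are; something like:

starttime=0                                   # beginning_of_time
endtime=$(( ( $(date +%s) - 60 ) * 1000 ))    # NOW_Minus_60_seconds, in epoch ms
versions=100000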

Then do incremental backups from that point on:

endtime = NOW_Minus_60_seconds
versions = 100000

for table in all_my_Hbase_tables_to_be_backedup
do
        starttime = get_last_endtime_from_persistant_store( $table )

        $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.20.3.jar export \
                $table \
                s3n://somebucket/$table/ \
                $versions \
                $starttime \
                $endtime

        store_times_for_table_in_persistant_store( $table $starttime $endtime )
        store_schema_for_table_in_persistant_store( $table get_schema_from_HBase($table) )
done
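
In practice I'd just drive that incremental loop out of cron, e.g. (the script 
path here is made up):

# hypothetical crontab entry: run the incremental export nightly at 01:00
0 1 * * *  /opt/backup/hbase_incremental_export.sh >> /var/log/hbase_backup.log 2>&1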

The Import usage:
Usage: Import <tablename> <inputdir>

If I wanted to restore a backed-up table (table_foo) from the exports on S3 to a 
destination table (table_bar) in the HBase cluster running this command, which 
may or may not be the same cluster the table was originally backed up from, I 
can do:

create_table( table_bar get_schema_from_persistant_store(table_foo) )

$HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.20.3.jar import \
        table_bar \
        s3n://somebucket/table_foo/


If I wanted to do a full restore, I would just loop through all the tables, 
running the above import process on an HBase cluster that didn't yet have those 
tables.
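
That is, something like this, where all_my_backedup_tables_in_persistant_store 
is a hypothetical list of the table names we stored schemas/times for:

for table in all_my_backedup_tables_in_persistant_store
do
        create_table( $table get_schema_from_persistant_store($table) )

        $HADOOP_HOME/bin/hadoop jar $HBASE_HOME/hbase-0.20.3.jar import \
                $table \
                s3n://somebucket/$table/
done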

Would I pretty much be guaranteed to get a proper backup snapshotted at the 
specified endtime of each run? 

Could this be used to copy the data from one HBase cluster to another (in 
particular, to go from a production HBase 0.20.3 to a fresh new 0.90.1)?

One normal backup/restore feature that is missing is an easy way to restore to 
an arbitrary point in time, as opposed to the last backup. I presume the worst 
case would be to restore everything and then delete the rows with timestamps 
after the point in time one wanted?

Please let me know what I might be missing, or what the downsides would be to 
doing backups this way.

Thanks!
Rob

__________________
Robert J Berger - CTO
Runa Inc.       
520 San Antonio Rd Suite 210, Mountain View, CA 94040
+1 408-838-8896
http://blog.ibd.com

http://workatruna.com


