I guess you could say "snapshot" as in a point-in-time M/R job that exports all of the rows in the table written before a specified time X, which could default to the start time of the job.
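To make that concrete, here's a minimal sketch of a time-bounded export job, assuming the HBase 0.92-era client API. The table name, output path, and the PointInTimeExport class name are placeholders, not the stock tool. (The stock Export also accepts optional <versions> <starttime> <endtime> arguments that bound the scan in a similar way.)

```
// A minimal sketch of the point-in-time idea, assuming HBase 0.92-era
// APIs. "mytable", "/backup/mytable", and the class name are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class PointInTimeExport {
  public static void main(String[] args) throws Exception {
    long cutoff = System.currentTimeMillis();  // X defaults to the job start time

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "point-in-time-export");
    job.setJarByClass(PointInTimeExport.class);

    Scan scan = new Scan();
    scan.setTimeRange(0L, cutoff);   // only cells with timestamps before X
    scan.setCacheBlocks(false);      // full scan; don't churn the block cache

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, IdentityTableMapper.class,
        ImmutableBytesWritable.class, Result.class, job);

    // Map-only, like the stock Export: one output file per region.
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Result.class);
    SequenceFileOutputFormat.setOutputPath(job, new Path("/backup/mytable"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```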
Since you're running your export to the same cluster (but to a different directory from /hbase), you don't really have to worry about the number of mappers. However, since it's a backup, you may want to reduce the number of region files: you could reduce the data set to 10, 100, etc. files depending on the size of the underlying table, and then as you write out from the reducer you could write to S3 directly.

If you want more control, you reduce to the local HDFS, then in a separate job or single-threaded program you could open up a file at a time and trickle it in. (Or write a map-only job that has a set number of mappers defined to run in parallel.) The only caveat is that you need to make sure you have enough disk space to store the local copy until you complete the S3 write.

Of course there are other permutations... like if you have a NAS/SAN, you could move the export there. (Hot == HBase table, Warm == HDFS outside of HBase, Lukewarm == locally attached disks, Cold == S3...)

Again, it depends on the resources available to you and your enterprise. YMMV.

On Jun 5, 2014, at 9:15 AM, Ted Yu <[email protected]> wrote:

> bq. take a snapshot and write the file(s)
>
> Is the above referring to an hbase snapshot?
> hbase 0.92.x doesn't support snapshots.
>
> FYI
>
>
> On Thu, Jun 5, 2014 at 5:11 AM, Michael Segel <[email protected]>
> wrote:
>
>> Ok...
>>
>> So when the basic tools don't work...
>> How about rolling your own?
>>
>> Step 1: take a snapshot and write the file(s) to a different location
>> outside of /hbase.
>> (Export to local disk on the cluster.)
>>
>> Step 2: write your own M/R job and control the number of mappers that
>> read from HDFS and write to S3.
>> That assumes you want a block-for-block match. If you want to change the
>> number of files, since each region would be a separate file, you could do
>> the write to S3 in the reduce phase.
>> (Which is what you want.)
>>
>>
>> On Jun 4, 2014, at 7:39 AM, Damien Hardy <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> We are trying to export an HBase table to S3 for backup purposes.
>>> By default the export tool runs a map per region, and we want to limit
>>> output bandwidth to the internet (to Amazon S3).
>>>
>>> We were thinking of adding some reducers to limit the number of writers,
>>> but this is explicitly hardcoded to 0 in the Export class:
>>> ```
>>> // No reducers. Just write straight to output files.
>>> job.setNumReduceTasks(0);
>>> ```
>>>
>>> Is there another way (a property?) in hadoop to limit output bandwidth?
>>>
>>> --
>>> Damien
>>>
>>
>> The opinions expressed here are mine; while they may reflect a cognitive
>> thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com

The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com
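To make Step 2 of the roll-your-own approach concrete, here is a hedged sketch of re-shuffling the exported SequenceFiles through a fixed number of reducers, so only that many tasks write to S3 at once. It assumes Hadoop 1.x-era MapReduce with the s3n:// filesystem configured; the paths, bucket, reducer count, and the ExportToS3 class name are all placeholders, not the stock Export tool. The default (identity) Mapper and Reducer pass records through unchanged; the shuffle exists only to funnel every region's file through N writers.

```
// Hedged sketch of "write to S3 in the reduce phase", assuming Hadoop
// 1.x-era MapReduce and HBase 0.92 classes. Paths, bucket, reducer count,
// and the class name are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ExportToS3 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "export-to-s3");
    job.setJarByClass(ExportToS3.class);

    // Read the SequenceFiles produced by the local export (Step 1).
    job.setInputFormatClass(SequenceFileInputFormat.class);
    SequenceFileInputFormat.addInputPath(job, new Path("/backup/mytable"));

    // No mapper/reducer classes set, so both default to identity.
    // Result implements Writable in 0.92-era HBase, so it survives the shuffle.
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Result.class);
    job.setNumReduceTasks(10);   // at most 10 concurrent S3 writers

    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setOutputPath(job,
        new Path("s3n://my-bucket/backup/mytable"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The reducer count is the throttle here: it caps concurrent connections to S3 rather than bytes per second, which is the closest you can get without a proper traffic shaper in front of the cluster.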
