Yes, it does handle data merging, and yes, running a major compaction afterwards would be needed to guarantee the store files are as small as possible.
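For completeness, a minimal sketch of kicking off that major compaction from the 0.90-era Java client (the table name "mytable" is a placeholder). Note that HBaseAdmin.majorCompact() is asynchronous, so it queues the compaction rather than waiting for it to finish:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CompactAfterBulkLoad {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // Queue a major compaction of every region in the table. The
            // call returns immediately; the regionservers compact in the
            // background, rewriting each region's store files into one HFile.
            admin.majorCompact("mytable");
        }
    }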
-chris

On May 26, 2011, at 7:00 PM, Weihua JIANG <[email protected]> wrote:

> Thanks. It seems quite useful.
>
> Does bulk load support data merging? I.e., there is a table with
> existing data and I want to add more data to it. The new data's row
> key range is mixed with the existing data's row key range, so the net
> effect is that the new data shall be inserted into existing regions.
>
> If bulk load supports this feature, then it is the ideal solution for me.
>
> And do I need to perform a major compaction after the bulk load to
> keep the store file count small?
>
>
> Thanks
> Weihua
>
> 2011/5/27 Chris Tarnas <[email protected]>:
>> Your second solution sounds quite similar to the bulk loader. Actually the
>> bulk load is a bit simpler and bypasses even more of the regionserver's
>> overhead:
>>
>> http://hbase.apache.org/bulk-loads.html
>>
>> Using M/R it creates HFiles in HDFS directly, then adds the HFiles to
>> the existing regionservers.
>>
>> -chris
>>
>>
>> On May 26, 2011, at 12:38 AM, Weihua JIANG wrote:
>>
>>> Hi all,
>>>
>>> As I understand it, the WAL is used to ensure data is safe even if a
>>> certain RS or the whole HBase cluster goes down. But it is still a
>>> burden on every put.
>>>
>>> I am wondering: is there any way to disable the WAL while keeping the
>>> data safe?
>>>
>>> An ideal solution to me looks like this:
>>> 1. Clients continuously put records with the WAL disabled.
>>> 2. Clients call a certain HBase method to ensure all the
>>> previously-put records are safely persisted; then they can remove
>>> the records on the client side.
>>> 3. On error, clients re-put the possibly-lost records.
>>>
>>> Or a slightly different solution is:
>>> 1. Clients continuously write records to a sequence file on HDFS.
>>> 2. Clients periodically flush the HDFS file and remove the previously
>>> put records on the client side.
>>> 3. After all records are stored on HDFS, a map-reduce job puts
>>> the records into HBase with the WAL disabled.
>>> 4. Before each map-reduce task finishes, a certain HBase method is
>>> called to flush the in-memory data onto HDFS.
>>> 5. On error, the affected map-reduce task is re-executed (equivalent to
>>> replaying the log).
>>>
>>> Is there any way to do this in HBase? If not, do you have any plan to
>>> support such a usage model in the near future?
>>>
>>>
>>> Thanks
>>> Weihua
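For reference, the bulk-load path linked above comes down to two steps: an M/R job that writes HFiles via HFileOutputFormat (configureIncrementalLoad() wires in a TotalOrderPartitioner matched to the table's current region boundaries, which is what makes new keys land in the right existing regions), then LoadIncrementalHFiles to move the finished files into the live regionservers. A minimal sketch against the 0.90-era APIs; the table name "mytable", the column family "cf", the qualifier "col", and the tab-separated "rowkey<TAB>value" input format are assumptions for illustration:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {

        // Turns "rowkey<TAB>value" lines into KeyValues for one column.
        static class LineMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\t", 2);
                byte[] row = Bytes.toBytes(fields[0]);
                KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
                        Bytes.toBytes("col"), Bytes.toBytes(fields[1]));
                ctx.write(new ImmutableBytesWritable(row), kv);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "bulkload-mytable");
            job.setJarByClass(BulkLoadDriver.class);
            job.setMapperClass(LineMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(KeyValue.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Sorts the output and partitions it to match the table's
            // existing region boundaries, so the generated HFiles slot
            // directly into the current regions (the "merging" behavior).
            HTable table = new HTable(conf, "mytable");
            HFileOutputFormat.configureIncrementalLoad(job, table);

            if (!job.waitForCompletion(true)) System.exit(1);

            // Move the finished HFiles into the running regionservers.
            new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
        }
    }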

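And on the WAL question at the bottom of the thread: Weihua's first proposed scheme maps roughly onto the existing client API, where Put.setWriteToWAL(false) skips the log for a put and HBaseAdmin.flush() asks the server to persist the memstore. A hedged sketch (table, family, qualifier, and values are placeholders); the caveat is that flush() is asynchronous in this API, so it is not by itself a durability guarantee, and a robust client would still need to track un-flushed records and re-put them after a regionserver failure, as in step 3 of the proposal:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NoWalPut {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");

            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"),
                    Bytes.toBytes("value"));
            // Skip the WAL: faster puts, but this edit is lost if the
            // regionserver dies before its memstore is flushed.
            put.setWriteToWAL(false);
            table.put(put);
            table.flushCommits(); // push client-buffered puts to the RS

            // Ask the cluster to flush the table's memstores to HFiles on
            // HDFS. Asynchronous: it queues the flush and returns.
            HBaseAdmin admin = new HBaseAdmin(conf);
            admin.flush("mytable");
        }
    }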