If you have a well-defined key space, you'll get better performance if you pre-split your table and use the TotalOrderPartitioner with your MapReduce job.
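
As a rough illustration of what "pre-split" means in practice, here is a minimal sketch that picks evenly spaced split keys and shows how a key maps to a region once those splits exist. It assumes row keys are fixed-width lowercase hex strings spread uniformly over the key space, which is an assumption for illustration only. The names split_points and region_for are hypothetical helpers, not part of HBase or Hadoop.

```python
import bisect


def split_points(num_regions, key_width=8):
    """Return num_regions - 1 evenly spaced split keys.

    Assumes (for illustration) that row keys are lowercase hex strings of
    exactly key_width characters, uniformly distributed.
    """
    if num_regions < 2:
        return []
    key_space = 16 ** key_width
    return [
        format(i * key_space // num_regions, "0{}x".format(key_width))
        for i in range(1, num_regions)
    ]


def region_for(key, splits):
    """Index of the region whose [start, end) key range contains key.

    A split key is the inclusive start of the region to its right, so a key
    equal to a split point lands in the higher-numbered region.
    """
    return bisect.bisect_right(splits, key)


if __name__ == "__main__":
    splits = split_points(4, key_width=2)
    print(splits)
    for k in ("3f", "40", "ff"):
        print(k, "-> region", region_for(k, splits))
```

The split keys would be handed to the table-creation call as the region boundaries, and the same boundaries are what a total-order partitioner needs so that each reducer writes HFiles for exactly one region. If the real keys are skewed, evenly spaced splits will not balance the regions, and the boundaries should come from sampling the data instead.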
You can see an example of pre-splitting here: http://hbase.apache.org/book.html#precreate.regions.

-Joey

On Mon, May 30, 2011 at 9:31 PM, Gan, Xiyun <[email protected]> wrote:
> I used BulkLoad to import data. The step of writing HFiles using m/r is
> fast, but the step of loading HFiles to hbase takes lots of time. It
> says HFile at ****** no longer fits inside a single region. Splitting....
> Even worse, sometimes it throws Region is not online Exception.
>
> Thanks
>
> On Fri, May 27, 2011 at 1:18 PM, Chris Tarnas <[email protected]> wrote:
>
>> Yes, it does deal with data merging and yes, doing a major compaction would
>> be needed to guarantee the store files are as small as possible.
>>
>> -chris
>>
>> On May 26, 2011, at 7:00 PM, Weihua JIANG <[email protected]> wrote:
>>
>> > Thanks. It seems quite useful.
>> >
>> > Does bulk load support data merging? I.e. there is a table with
>> > existing data and I want to add more data into it. The new data row
>> > key range is mixed with the existing data row key range. So, the final
>> > effect is the new data shall be inserted into existing regions.
>> >
>> > If bulk load supports this feature, then it is the ideal solution to me?
>> >
>> > And do I need to perform a major compact after bulk load to ensure
>> > store file number is small?
>> >
>> > Thanks
>> > Weihua
>> >
>> > 2011/5/27 Chris Tarnas <[email protected]>:
>> >> Your second solution sounds quite similar to the bulk loader. Actually
>> >> the bulk load is a bit simpler and bypasses even more of the regionserver's
>> >> overhead:
>> >>
>> >> http://hbase.apache.org/bulk-loads.html
>> >>
>> >> Using M/R it creates HFiles in HDFS directly, then adds the HFiles
>> >> to the existing regionservers.
>> >>
>> >> -chris
>> >>
>> >> On May 26, 2011, at 12:38 AM, Weihua JIANG wrote:
>> >>
>> >>> Hi all,
>> >>>
>> >>> As I know, WAL is used to ensure the data is safe even if certain RS
>> >>> or the whole HBase cluster is down. But, it is anyway a burden on each
>> >>> put.
>> >>>
>> >>> I am wondering: is there any way to disable WAL while keeping data
>> >>> safety.
>> >>>
>> >>> An ideal solution to me looks like this:
>> >>> 1. clients continually put records with WAL disabled.
>> >>> 2. clients call a certain HBase method to ensure all the
>> >>> previously-put records are safely stored persistently, then it can
>> >>> remove the records at client side.
>> >>> 3. on error, client re-put the maybe-lost records.
>> >>>
>> >>> Or a slightly different solution is:
>> >>> 1. clients continually put records on HDFS using sequential file.
>> >>> 2. clients periodically flush HDFS file and remove the previously put
>> >>> records at client side.
>> >>> 3. after all records are stored on HDFS, use a map-reduce job to put
>> >>> the records into HBase with WAL disabled.
>> >>> 4. before each map-reduce task finish, a certain HBase method is
>> >>> called to flush the memory data onto HDFS.
>> >>> 5. if on error, certain map-reduce task is re-executed (equivalent to
>> >>> replay log).
>> >>>
>> >>> Is there any way to do so in HBase? If no, do you have any plan to
>> >>> support such usage model in near future?
>> >>>
>> >>> Thanks
>> >>> Weihua
>
> --
> Best wishes
> Gan, Xiyun

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
