Hi all, I recently attempted to merge a large number of regions (trying to reduce a table down from 5,000 regions), but the merge tool presents an interesting challenge when merging multiple regions into a single region - the destination region key changes after each merge, so it's very difficult to script this in advance.
An example from the script I generated using HTable.getRegionsInfo():

  bin/hbase org.apache.hadoop.hbase.util.Merge twitter twitter,0008227cef42aa4eb5df9d23880968b7968c1ba2,1344404634774.c4dae24e01fe593bc72cd0b0a40c5a4a. twitter,001870be5b0a84f8bdca2eb60b51790ab6d277df,1340832993671.6e546e88f4d227da401d9610c49a285d.
  bin/hbase org.apache.hadoop.hbase.util.Merge twitter twitter,0008227cef42aa4eb5df9d23880968b7968c1ba2,1344404634774.c4dae24e01fe593bc72cd0b0a40c5a4a. twitter,0020957a4bed9e742536990faecf12a95e9b21d6,1340832946888.39a4c958d9f6c4866b88eaa3e252128d.

The second call will fail because the destination region gets a new key as soon as the first call completes.

One approach I can think of is a custom Merge that can determine what the new region start key is after each step of the merge. The other option is to create a new table with the splits predefined and run a map/reduce job to copy the data into the new, pre-split table, but that m/r job would take a significant amount of time, and since I have a live system, I would need to dual-write new data into both tables while the m/r job backfills - that would be a big load on the running cluster.

Has anyone else had this issue, or found any other approaches to dealing with it?

Thanks,
Rob Roland
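For what it's worth, the first approach could also be driven from outside the Merge tool: re-list the table's regions after every merge (so the freshly created destination region's name is picked up) and always merge the current first pair. A rough sketch follows; `list_regions` is a placeholder for however you enumerate a table's region names in key order (e.g. HTable.getRegionsInfo() or a .META. scan), and the `bin/hbase` path is assumed, not verified - this is not a tested implementation.

```python
import subprocess

def merge_command(table, region_a, region_b, hbase_bin="bin/hbase"):
    # Build the offline Merge tool invocation for one pair of regions.
    return [hbase_bin, "org.apache.hadoop.hbase.util.Merge",
            table, region_a, region_b]

def merge_all(table, list_regions, run=subprocess.check_call):
    # Repeatedly merge the first two regions, re-listing the regions
    # after every merge so the new destination region name is used.
    # `list_regions` is a caller-supplied callable returning the
    # table's current region names in key order (hypothetical here).
    while True:
        regions = list_regions()
        if len(regions) < 2:
            return
        run(merge_command(table, regions[0], regions[1]))
```

This avoids precomputing all the region names up front, at the cost of one region listing per merge step.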
