Hi all, I recently attempted to merge a large number of regions (trying to reduce a table down from 5,000 regions), but the merge tool presents an interesting challenge when merging multiple regions into a single region - the destination region key changes after each merge, so it's very difficult to script this in advance.
An example from the script I generated using HTable.getRegionsInfo():

  bin/hbase org.apache.hadoop.hbase.util.Merge twitter twitter,0008227cef42aa4eb5df9d23880968b7968c1ba2,1344404634774.c4dae24e01fe593bc72cd0b0a40c5a4a. twitter,001870be5b0a84f8bdca2eb60b51790ab6d277df,1340832993671.6e546e88f4d227da401d9610c49a285d.
  bin/hbase org.apache.hadoop.hbase.util.Merge twitter twitter,0008227cef42aa4eb5df9d23880968b7968c1ba2,1344404634774.c4dae24e01fe593bc72cd0b0a40c5a4a. twitter,0020957a4bed9e742536990faecf12a95e9b21d6,1340832946888.39a4c958d9f6c4866b88eaa3e252128d.

The second call will fail because the destination region gets a new key as soon as the first call completes.

One approach I can think of is a custom Merge that can determine what the new region start key is after each step of the merge. The other option is to create a new table with the splits predefined and run a map/reduce job to copy the data into the new, pre-split table, but that m/r job would take a significant amount of time, and since I have a live system, I would need to dual-write new data into both tables while the m/r job backfills - that would be a big load on the running cluster.

Has anyone else had this issue, or found any other approaches to dealing with it?

Thanks,
Rob Roland
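For what it's worth, the first approach could also be driven from outside the Merge tool: re-list the table's regions after every merge (so the freshly created destination region's name is picked up) and always merge the current first pair. A rough sketch follows; `list_regions` is a placeholder for however you enumerate a table's region names in key order (e.g. HTable.getRegionsInfo() or a .META. scan), and the `bin/hbase` path is assumed, not verified - this is not a tested implementation.

```python
import subprocess

def merge_command(table, region_a, region_b, hbase_bin="bin/hbase"):
    # Build the offline Merge tool invocation for one pair of regions.
    return [hbase_bin, "org.apache.hadoop.hbase.util.Merge",
            table, region_a, region_b]

def merge_all(table, list_regions, run=subprocess.check_call):
    # Repeatedly merge the first two regions, re-listing the regions
    # after every merge so the new destination region name is used.
    # `list_regions` is a caller-supplied callable returning the
    # table's current region names in key order (hypothetical here).
    while True:
        regions = list_regions()
        if len(regions) < 2:
            return
        run(merge_command(table, regions[0], regions[1]))
```

This avoids precomputing all the region names up front, at the cost of one region listing per merge step.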
