What you are describing is usually solved by either:

- pre-creating the regions so that you don't have to go through the
  "growing pains" of a new, virgin table. Use this sort of method:
  http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
- using the bulk loader: http://hbase.apache.org/bulk-loads.html

J-D

On Fri, Mar 18, 2011 at 5:46 AM, Ferdy Galema <[email protected]> wrote:
> On second thought, removing the obsolete regionfolders was easily done by
> hand. This way I can merge regions with the merge tool.
>
> However, I'm still bothered by the (performance) issues I ran into. Any
> advice would be helpful.
>
> On 03/18/2011 11:06 AM, Ferdy Galema wrote:
>>
>> After exporting a table of about 30M rows (each row has about 500 columns,
>> totalling 400GB of data), there were several issues when trying to import it
>> again on an empty HBase. (HBase version is 0.90.1-CDH3B4, deployed on 15
>> nodes. LZO is enabled.)
>>
>> The reason for this export/import is to both reduce the number of regions
>> and clean up regionfolders in the table that are no longer referred to. (I
>> can see this because of the dfs timestamps.) Btw, I'm aware of the Merge
>> tool, which can only solve the merging part. The max region size is set to
>> 1GB, which is not an uncommon number judging by other posts concerning a
>> big data set.
>>
>> To eliminate some of the write bottlenecks, I already disabled writing to
>> the WAL by modifying the import tool. (I assume writing to the WAL is not
>> necessary during import as long as no regionservers crash. If one does, I
>> can simply recreate an empty HBase and start over.)
>>
>> Also, I temporarily set hbase.hstore.compactionThreshold and
>> hbase.hstore.blockingStoreFiles excessively high in order to disable minor
>> compactions during the time of the import. With these changes it still takes
>> about 100 hours to import the data, as opposed to the 6 hours it took to
>> read it. The import starts with a single region on one node, which is split
>> when the size is exceeded. The resulting regions are spread out over the
>> other nodes, so that is not a problem. The first tasks result in
>> regionservers sometimes blocking updates because they are flushing
>> memstores. After a while (around 10% completion of the job) the logs mostly
>> show the "LRU Stats", and sometimes "Updating" / "Opening" statements.
>> Although I presumably disabled minor compactions and no major compaction
>> should be running yet, sometimes I also see Compacting statements. Why is
>> that so? In other words, what does "because Region has references on open"
>> mean?
>>
>> Aside from these performance issues, tasks are failing with region offline
>> errors. These are always regions that were just split. The map/reduce
>> framework tolerates these errors; still, I thought the splitting process was
>> transparent to the user.
>>
>> Please correct me if I'm wrong in any of my assumptions.
>>
>> Ferdy.
>
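
A postscript on the pre-splitting suggestion at the top of this reply: the byte[][] argument of createTable is the list of split keys, so a table with N regions needs N - 1 keys. Below is a minimal sketch of how such keys could be computed. The class and method names (SplitKeys, uniformSplits) are made up for illustration, and it assumes row keys are spread roughly evenly over their first two bytes, which the thread does not say; real split points should come from the actual key distribution. The figure of 400 regions is just the 400GB from the original post divided by the 1GB max region size, ignoring LZO.

```java
// Hypothetical helper, not part of HBase: builds evenly spaced split keys.
final class SplitKeys {

    /**
     * Returns numRegions - 1 two-byte split keys that divide the prefix
     * space 0x0000..0xFFFF into numRegions roughly equal ranges.
     * Assumes (not stated in the thread) that row keys are uniformly
     * distributed over their first two bytes.
     */
    static byte[][] uniformSplits(int numRegions) {
        if (numRegions < 2 || numRegions > 65536) {
            throw new IllegalArgumentException("numRegions must be in [2, 65536]");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Boundary of the i-th range, as an unsigned 16-bit value.
            int boundary = (int) ((long) i * 65536 / numRegions);
            // Big-endian, so the keys sort in the same order as the boundaries.
            splits[i - 1] = new byte[] {(byte) (boundary >>> 8), (byte) boundary};
        }
        return splits;
    }

    public static void main(String[] args) {
        // Rough guess: 400GB of data / 1GB max region size.
        byte[][] splits = uniformSplits(400);
        System.out.println(splits.length + " split keys");
        // With HBase on the classpath, these would be handed to the method
        // linked above: admin.createTable(tableDescriptor, splits);
    }
}
```

With the regions created up front, the import writes to all regionservers from the first task instead of starting on a single region and waiting for splits.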
