I feel like I'm not understanding your need correctly. Could you elaborate on what you think HBase should be doing in order to give you a better life?
Thx,

J-D

On Mon, Mar 21, 2011 at 5:22 PM, Ferdy Galema <[email protected]> wrote:
> These methods are certainly helpful if I ever need to do a heavy
> import. For now I got away with manually cleaning my regions/stores and
> merging the data. I thought importing/exporting was the easy way to do that,
> but I guess that's not (yet) true.
>
> On 03/21/2011 09:48 PM, Jean-Daniel Cryans wrote:
>>
>> What you are describing is usually solved by either:
>>
>> - pre-creating the regions so that you don't have to go through the
>> "growing pains" of a new, virgin table. Use this sort of method:
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
>>
>> - use the bulk loader: http://hbase.apache.org/bulk-loads.html
>>
>> J-D
>>
>> On Fri, Mar 18, 2011 at 5:46 AM, Ferdy Galema<[email protected]> wrote:
>>>
>>> On second thought, removing the obsolete regionfolders was easily done by
>>> hand. This way I can merge regions with the merge tool.
>>>
>>> However, I'm still bothered by the (performance) issues I ran into. Any
>>> advice would be helpful.
>>>
>>> On 03/18/2011 11:06 AM, Ferdy Galema wrote:
>>>>
>>>> After exporting a table of about 30M rows (each row has about 500
>>>> columns, totalling 400GB of data), there were several issues when
>>>> trying to import it again on an empty HBase. (HBase version is
>>>> 0.90.1-CDH3B4, deployed on 15 nodes. LZO is enabled.)
>>>>
>>>> The reason for this export/import is to both reduce the number of
>>>> regions and clean up regionfolders in the table that are no longer
>>>> referred to. (I can see this because of the dfs timestamps.) Btw, I'm
>>>> aware of the Merge tool, which can only solve the merging part. The max
>>>> region size is set to 1GB, which is not an uncommon number judging by
>>>> other posts concerning a big data set.
>>>>
>>>> To eliminate some of the write bottlenecks, I already disabled writing
>>>> to the WAL by modifying the import tool. (I assume writing to the WAL
>>>> is not necessary during import as long as no regionservers crash. If
>>>> one does, I can simply recreate an empty hbase and start over.)
>>>>
>>>> Also, I temporarily set hbase.hstore.compactionThreshold and
>>>> hbase.hstore.blockingStoreFiles excessively high in order to disable
>>>> minor compactions during the time of the import. With these changes it
>>>> still takes about 100 hours to import the data, as opposed to the 6
>>>> hours it took to read it. The import starts with a single region on one
>>>> node, which is split when the maximum size is exceeded. The resulting
>>>> regions are spread out over the other nodes, so that is not a problem.
>>>> The first tasks result in regionservers sometimes blocking updates
>>>> because they are flushing memstores. After a while (around 10%
>>>> completion of the job) the logs mostly show the "LRU Stats", and
>>>> sometimes "Updating" / "Opening" statements. Although I presumably
>>>> disabled minor compactions and no major compaction should be running
>>>> yet, sometimes I also see Compacting statements. Why is that so? In
>>>> other words, what does "because Region has references on open" mean?
>>>>
>>>> Aside from these performance issues, tasks are failing with region
>>>> offline errors. These are always regions that were just split. The
>>>> map/reduce framework tolerates these errors, but I thought the
>>>> splitting process was transparent to the user.
>>>>
>>>> Please correct me if I'm wrong in any of my assumptions.
>>>>
>>>> Ferdy.
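
For reference, the pre-creation approach suggested above could be sketched roughly as follows. This is only an illustration under assumptions that are not stated in the thread: that row keys are spread uniformly over their first two bytes (for example because they are hashed), and that dividing the 400GB figure by the 1GB max region size gives a reasonable region count (with LZO the on-disk size may differ). The class and method names are hypothetical; only the createTable(HTableDescriptor, byte[][]) signature comes from the link in the thread, and it appears only in a comment, so the sketch runs without HBase on the classpath.

```java
// Hypothetical helper, not part of HBase: computes split keys for a pre-split table.
class PreSplitSketch {

    // Rough region count: total data size divided by max region size, rounded up.
    public static int estimateRegions(long totalBytes, long maxRegionBytes) {
        return (int) ((totalBytes + maxRegionBytes - 1) / maxRegionBytes);
    }

    // Returns numRegions - 1 two-byte split keys that divide the key prefix
    // space 0x0000..0xFFFF into equal ranges. ASSUMPTION: row keys are
    // uniformly distributed over their first two bytes.
    public static byte[][] uniformSplits(int numRegions) {
        if (numRegions < 2 || numRegions > 65536) {
            throw new IllegalArgumentException("numRegions must be in [2, 65536]");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            int boundary = (int) ((long) i * 65536 / numRegions);
            // Big-endian encoding so byte-wise ordering matches numeric ordering.
            splits[i - 1] = new byte[] { (byte) (boundary >>> 8), (byte) boundary };
        }
        return splits;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        int regions = estimateRegions(400 * gb, 1 * gb);
        byte[][] splits = uniformSplits(regions);
        System.out.println(regions + " regions, " + splits.length + " split keys");
        // With the HBase client available, the keys would be passed to the
        // method linked in the thread (not executed in this sketch):
        //   admin.createTable(tableDescriptor, splits);
        // where admin is an HBaseAdmin and tableDescriptor an HTableDescriptor.
    }
}
```

If the real keys are not uniformly distributed, the split points would have to come from the data itself, for instance from the region boundaries of the exported table, rather than from an even division like this one.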
