Re: importing dataset, some problems and performance issues

Ferdy Galema Fri, 18 Mar 2011 05:47:06 -0700

On second thought, removing the obsolete region folders was easily done by hand. This way I can merge regions with the merge tool.

However, I'm still bothered by the (performance) issues I ran into. Any advice would be helpful.


On 03/18/2011 11:06 AM, Ferdy Galema wrote:

After exporting a table of about 30M rows (each row has about 500 columns, totalling 400GB of data), there were several issues when trying to import it again on an empty HBase. (HBase version is 0.90.1-CDH3B4, deployed on 15 nodes. LZO is enabled.)
The reason for this export/import is both to reduce the number of regions and to clean up region folders in the table that are no longer referred to. (I can see this from the dfs timestamps.) Btw, I'm aware of the Merge tool, which can only solve the merging part. The max region size is set to 1GB, which, judging by other posts, is not an uncommon value for a big data set.
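As a rough sanity check on these numbers, the sketch below estimates how many regions the import has to end up with. It assumes the 400GB figure is what the regions actually hold and that every region fills up to the 1GB maximum; neither is confirmed in the post, so the result is only a lower bound (freshly split regions are smaller than the maximum). The class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate using the figures from the post.
// Assumption (not from the post): regions fill up to the configured maximum,
// so this is a lower bound on the real region count.
class RegionCountEstimate {
    // Integer ceiling division for positive values.
    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    // Smallest number of regions that can hold totalGb at maxRegionGb each.
    static long minRegions(long totalGb, long maxRegionGb) {
        return ceilDiv(totalGb, maxRegionGb);
    }

    // Regions per node if they are spread evenly over the cluster.
    static long regionsPerNode(long regions, long nodes) {
        return ceilDiv(regions, nodes);
    }

    public static void main(String[] args) {
        long regions = minRegions(400, 1);          // 400GB data, 1GB max region size
        long perNode = regionsPerNode(regions, 15); // 15 nodes
        System.out.println("at least " + regions + " regions, about "
                + perNode + " per node");
    }
}
```

Since the import starts from a single region, every one of those regions has to come into existence through a split while the job is running.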
To eliminate some of the write bottlenecks, I already disabled writing to the WAL by modifying the import tool. (I assume writing to the WAL is not necessary during import as long as no regionservers crash. If one does, I can simply recreate an empty HBase and start over.)
Also, I temporarily set hbase.hstore.compactionThreshold and hbase.hstore.blockingStoreFiles excessively high in order to disable minor compactions for the duration of the import. With these changes it still takes about 100 hours to import the data, as opposed to the 6 hours it took to read it. The import starts with a single region on one node, which is split when the maximum size is exceeded. The resulting regions are spread out over the other nodes, so that is not a problem.
The first tasks result in regionservers sometimes blocking updates because they're flushing memstores. After a while (around 10% completion of the job) the logs mostly show the "LRU Stats", and sometimes "Updating" / "Opening" statements. Although I presumably disabled minor compactions and no major compaction should be running yet, I sometimes also see Compacting statements. Why is that so? In other words, what does "because Region has references on open" mean?
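To put the 100 hours versus 6 hours in perspective, here is a small sketch that turns the figures from this post (30M rows, 400GB) into cluster-wide rates. It is plain arithmetic on the quoted numbers, not a measurement, and the class and method names are invented for the example.

```java
// Converts the durations quoted in the post into cluster-wide rates.
// All inputs come from the post; nothing here is measured.
class ImportThroughput {
    // Average data rate in GB per hour.
    static double gbPerHour(double totalGb, double hours) {
        return totalGb / hours;
    }

    // Average row rate in rows per second.
    static double rowsPerSecond(long rows, double hours) {
        return rows / (hours * 3600.0);
    }

    // How many times slower the import is than the read.
    static double slowdown(double importHours, double readHours) {
        return importHours / readHours;
    }

    public static void main(String[] args) {
        long rows = 30_000_000L;
        double totalGb = 400.0;
        System.out.printf("import: %.1f GB/h, %.0f rows/s%n",
                gbPerHour(totalGb, 100), rowsPerSecond(rows, 100));
        System.out.printf("read:   %.1f GB/h, %.0f rows/s%n",
                gbPerHour(totalGb, 6), rowsPerSecond(rows, 6));
        System.out.printf("import is %.1fx slower than the read%n",
                slowdown(100, 6));
    }
}
```

Spread over 15 nodes, the import rate works out to only a few rows per second per node, which suggests the job spends most of its time waiting rather than writing.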
Aside from these performance issues, tasks are failing with region offline errors. These are always regions that were just split. The map/reduce framework tolerates these errors, but I thought the splitting process was transparent to the user.
Please correct me if I'm wrong in any of my assumptions.

Ferdy.
