Hi,

We are using code similar to https://github.com/jrkinley/hbase-bulk-import-example/ to benchmark our HBase cluster. We are running a CDH4 installation; HBase is version 0.92.1-cdh4.1.1. The cluster is composed of 12 slaves, 1 master, and 1 secondary master.
During the bulk-load inserts, roughly 3 hours after the start (~200 GB inserted), we notice a large drop in the insert rate. At the same time there is a spike in IO and CPU usage, and connecting to a Region Server (RS), the Monitored Tasks section shows that a compaction has started. I have set hbase.hregion.max.filesize to 107374182400 (100 GB) and disabled automatic major compactions (hbase.hregion.majorcompaction is set to 0).

Our input is 1000 files of synthetic data (CSV), where each row in a file is one row to insert into HBase; each file contains 600K rows (or 600K events). Our loader works in the following way (a simplified sketch of the code is at the end of this mail):

1. Look for a file.
2. When a file is found, prepare a job for that file.
3. Launch the job.
4. Wait for completion.
5. Compute the insert rate (number of rows / time).
6. Repeat from 1 until there are no more files.

My understanding of the bulk-load M/R job is that it produces one HFile for each region (the job setup is also sketched below).

Questions:

- How is HStoreFileSize calculated?
- What do HStoreFileSize, storeFileSize and hbase.hregion.max.filesize have in common?
- Can the number of HFiles trigger a major compaction?
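For reference, here is a simplified sketch of our driver loop (the class names and the input-path argument are placeholders; the real code follows the structure of the example repository linked above, and it polls for new files rather than listing the directory once):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadDriver {

  private static final long ROWS_PER_FILE = 600000L; // each CSV file holds 600K rows

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]); // directory where the 1000 CSV files live

    // Steps 1-6: one M/R bulk-load job per CSV file, timed so we can
    // compute the insert rate after each job.
    for (FileStatus file : fs.listStatus(inputDir)) {
      Job job = CsvBulkLoadJob.create(conf, file.getPath());
      long start = System.currentTimeMillis();
      if (!job.waitForCompletion(false)) {
        System.err.println("job failed for " + file.getPath());
        continue;
      }
      double seconds = (System.currentTimeMillis() - start) / 1000.0;
      System.out.printf("%s: %.0f rows/s%n", file.getPath(), ROWS_PER_FILE / seconds);
    }
  }
}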
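The per-file job setup looks roughly like this. configureIncrementalLoad() is what ties the output to the table's regions. The table name "benchmark", the family/qualifier "d"/"e", and the /tmp/hfiles output path are made-up placeholders for this mail:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvBulkLoadJob {

  // Turns one CSV line into one KeyValue (simplified to a single column).
  static class CsvMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = line.toString().split(",", 2);
      byte[] row = Bytes.toBytes(cols[0]);
      ctx.write(new ImmutableBytesWritable(row),
          new KeyValue(row, Bytes.toBytes("d"), Bytes.toBytes("e"),
              Bytes.toBytes(cols[1])));
    }
  }

  public static Job create(Configuration conf, Path csvFile) throws Exception {
    Job job = new Job(conf, "bulk-load " + csvFile.getName());
    job.setJarByClass(CsvBulkLoadJob.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, csvFile);
    job.setMapperClass(CsvMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles/" + csvFile.getName()));

    // Inspects the table's region boundaries and configures the reducer
    // plus a TotalOrderPartitioner with one reduce task per region, so
    // the job writes one HFile per region (per column family).
    HTable table = new HTable(conf, "benchmark");
    HFileOutputFormat.configureIncrementalLoad(job, table);
    // After job.waitForCompletion(), the driver hands the HFiles to the
    // region servers with:
    //   new LoadIncrementalHFiles(conf).doBulkLoad(hfileOutputDir, table);
    return job;
  }
}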
Thanks for the help; I hope my questions make sense.

/Nicolas