Hi,

We are using code similar to https://github.com/jrkinley/hbase-bulk-import-example/ to benchmark our HBase cluster. We are running a CDH4 installation; HBase is version 0.92.1-cdh4.1.1. The cluster is composed of 12 slaves, 1 master, and 1 secondary master.
During the bulk-load inserts, roughly 3 hours after the start (~200 GB inserted), we notice a large drop in the insert rate. At the same time there is a spike in IO and CPU usage, and connecting to a Region Server (RS), the Monitored Tasks section shows that a compaction has started. I have set hbase.hregion.max.filesize to 107374182400 (100 GB) and disabled automatic major compactions (hbase.hregion.majorcompaction is set to 0).

Our input is 1000 files of synthetic data (CSV), where each row in a file is one row to insert into HBase; each file contains 600K rows (or 600K events). Our loader works in the following way (a simplified sketch of the code is at the end of this mail):

1. Look for a file.
2. When a file is found, prepare a job for that file.
3. Launch the job.
4. Wait for completion.
5. Compute the insert rate (number of rows / time).
6. Repeat from 1 until there are no more files.

My understanding of the bulk-load M/R job is that it produces one HFile for each region (the job setup is also sketched below).

Questions:

- How is HStoreFileSize calculated?
- What do HStoreFileSize, storeFileSize and hbase.hregion.max.filesize have in common?
- Can the number of HFiles trigger a major compaction?
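For reference, here is a simplified sketch of our driver loop (the class names and the input-path argument are placeholders; the real code follows the structure of the example repository linked above, and it polls for new files rather than listing the directory once):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadDriver {

  private static final long ROWS_PER_FILE = 600000L; // each CSV file holds 600K rows

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]); // directory where the 1000 CSV files live

    // Steps 1-6: one M/R bulk-load job per CSV file, timed so we can
    // compute the insert rate after each job.
    for (FileStatus file : fs.listStatus(inputDir)) {
      Job job = CsvBulkLoadJob.create(conf, file.getPath());
      long start = System.currentTimeMillis();
      if (!job.waitForCompletion(false)) {
        System.err.println("job failed for " + file.getPath());
        continue;
      }
      double seconds = (System.currentTimeMillis() - start) / 1000.0;
      System.out.printf("%s: %.0f rows/s%n", file.getPath(), ROWS_PER_FILE / seconds);
    }
  }
}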
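The per-file job setup looks roughly like this. configureIncrementalLoad() is what ties the output to the table's regions. The table name "benchmark", the family/qualifier "d"/"e", and the /tmp/hfiles output path are made-up placeholders for this mail:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvBulkLoadJob {

  // Turns one CSV line into one KeyValue (simplified to a single column).
  static class CsvMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = line.toString().split(",", 2);
      byte[] row = Bytes.toBytes(cols[0]);
      ctx.write(new ImmutableBytesWritable(row),
          new KeyValue(row, Bytes.toBytes("d"), Bytes.toBytes("e"),
              Bytes.toBytes(cols[1])));
    }
  }

  public static Job create(Configuration conf, Path csvFile) throws Exception {
    Job job = new Job(conf, "bulk-load " + csvFile.getName());
    job.setJarByClass(CsvBulkLoadJob.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, csvFile);
    job.setMapperClass(CsvMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles/" + csvFile.getName()));

    // Inspects the table's region boundaries and configures the reducer
    // plus a TotalOrderPartitioner with one reduce task per region, so
    // the job writes one HFile per region (per column family).
    HTable table = new HTable(conf, "benchmark");
    HFileOutputFormat.configureIncrementalLoad(job, table);
    // After job.waitForCompletion(), the driver hands the HFiles to the
    // region servers with:
    //   new LoadIncrementalHFiles(conf).doBulkLoad(hfileOutputDir, table);
    return job;
  }
}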
Thanks for the help; I hope my questions make sense.

/Nicolas