HBase already makes my life better, so no worries there :)

I agree the topic of this thread is not clear anymore. I also already know how to tackle my problem. So just for the record let me explain what I was thinking/doing:

The original intent was to clean up my HBase installation (remove floating regions and storefiles). We have had some crashes in the past and therefore there were still some minor inconsistencies. I had never run the hbck tool; in fact I was not aware of it. A second intent was to decrease the number of regions.

However, I wrongly decided that the best way to do this was an export followed by an import into a clean dataset. This way I could avoid the process of digging into the data files and merging the regions manually. Of course it would have worked if I had tuned the (import) performance parameters better or simply accepted a long wait for the import to finish. So my first posting was about these performance issues. After that, I quickly turned to manually cleaning/merging regions. This worked.

So although my initial problems were solved, I was still a bit concerned. I know that importing is more expensive than exporting, but I did not expect to see a difference of that order of magnitude. I thought there might well be something terribly wrong with my configuration, or with my assumptions about the way the clients/regionservers can be tuned in order to increase bulkloading performance. For example, the assumption that increasing hbase.hstore.compactionThreshold and hbase.hstore.blockingStoreFiles to excessively high values will completely disable minor compactions. (By the way, I'm still not sure if it does, and if it's smart to do that when importing.)

Ferdy.

On 03/22/2011 02:22 AM, Jean-Daniel Cryans wrote:
I feel like I'm not understanding your need correctly, could you
spell out what you think HBase should be doing in order to give you a
better life?

Thx,

J-D

On Mon, Mar 21, 2011 at 5:22 PM, Ferdy Galema<[email protected]>  wrote:
These methods are certainly helpful if I ever need to do a heavy
import. For now I got away with manually cleaning my regions/stores and
merging the data. I thought importing/exporting was the easy way to do that,
but I guess that's not (yet) true.

On 03/21/2011 09:48 PM, Jean-Daniel Cryans wrote:
What you are describing is solved usually by either:

- pre-creating the regions so that you don't have to go through the
"growing pains" of a new, virgin table. Use this sort of method:

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor,
byte[][])

- use the bulk loader: http://hbase.apache.org/bulk-loads.html
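
A minimal sketch of what the first option might look like on the client side. It only computes the split keys; it assumes (this is not stated anywhere in the thread) that row keys start with a uniformly distributed 8-character lowercase hex prefix. The class name SplitKeys and the method evenHexSplits are hypothetical names made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HBase: computes evenly spaced split keys
// for a table whose row keys are ASSUMED to begin with an 8-character
// lowercase hex prefix spread uniformly over 00000000..ffffffff.
final class SplitKeys {
    private SplitKeys() {}

    // Returns numRegions - 1 boundaries; each boundary is the first row key
    // of the next region, so N regions need N - 1 split keys.
    static byte[][] evenHexSplits(int numRegions) {
        if (numRegions < 2) {
            throw new IllegalArgumentException("need at least 2 regions");
        }
        long keySpace = 1L << 32; // number of distinct 8-digit hex prefixes
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            long boundary = keySpace * i / numRegions;
            splits[i - 1] = String.format("%08x", boundary)
                    .getBytes(StandardCharsets.US_ASCII);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Print the boundaries for a 4-region table.
        for (byte[] key : evenHexSplits(4)) {
            System.out.println(new String(key, StandardCharsets.US_ASCII));
        }
    }
}
```

The resulting byte[][] would be passed as the second argument to the createTable(HTableDescriptor, byte[][]) method linked above, so that the import starts out writing to many regions instead of one. If the real row keys are not uniformly distributed, the boundaries would have to be derived from a sample of the exported data instead.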

J-D

On Fri, Mar 18, 2011 at 5:46 AM, Ferdy Galema<[email protected]> wrote:
On second thought, removing the obsolete regionfolders was easily done by
hand. This way I can merge regions with the merge tool.

However, I'm still bothered by the (performance) issues I ran into. Any
advice would be helpful.

On 03/18/2011 11:06 AM, Ferdy Galema wrote:
After exporting a table of about 30M rows (each row has about 500 columns,
totalling 400GB of data), there were several issues when trying to import it
again on an empty HBase. (HBase version is 0.90.1-CDH3B4, deployed on 15
nodes. LZO is enabled.)

The reason for this export/import is to both reduce the number of regions
and clean up regionfolders in the table that are no longer referred to. (I
can see this because of the dfs timestamps.) Btw, I'm aware of the Merge
tool, which can only solve the merging part. The max region size is set to
1GB, which is not an uncommon number for a big data set, judging by other
posts.

To eliminate some of the write bottlenecks, I already disabled writing to
the WAL by modifying the import tool. (I assume writing to the WAL is not
necessary during import as long as no regionservers crash. If one does, I
can simply recreate an empty hbase and start over.)

Also, I temporarily set hbase.hstore.compactionThreshold and
hbase.hstore.blockingStoreFiles excessively high in order to disable minor
compactions during the time of the import. With these changes it still takes
about 100 hours to import the data, as opposed to the 6 hours it took to
read it. The import starts with a single region on one node, which is split
when the size is exceeded. The resulting regions are spread out over the
other nodes, so that is not a problem. The first tasks result in
regionservers sometimes blocking updates because they are flushing
memstores. After a while (around 10% completion of the job) the logs mostly
show the "LRU Stats", and sometimes "Updating" / "Opening" statements.
Although I presumably disabled minor compactions and no major compaction
should be running yet, sometimes I also see Compacting statements. Why is
that so? In other words, what does "because Region has references on open"
mean?

Aside from these performance issues, tasks are failing with region offline
errors. These are always regions that were just split. The map/reduce
framework tolerates these errors; still, I thought the splitting process was
transparent to the user.

Please correct me if I'm wrong in any of my assumptions.

Ferdy.
