After exporting a table of about 30M rows (each row has about 500
columns, totalling 400GB of data), there were several issues when trying
to import it again on an empty HBase. (HBase version is 0.90.1-CDH3B4,
deployed on 15 nodes. LZO is enabled.)
The reason for this export/import is to both reduce the number of
regions and clean up region folders in the table that are no longer
referred to. (I can see this from the DFS timestamps.) Btw, I'm
aware of the Merge tool, but it can only solve the merging part. The max
region size is set to 1GB, which is not an uncommon number judging by
other posts concerning big data sets.
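For a rough idea of how many regions the table should end up with after the import, here is a back-of-the-envelope sketch. It assumes the 400GB figure is the size that counts towards the region limit (with LZO the stored size may well be smaller) and that a region splits into two roughly equal halves once it reaches the max size. The class and method names are only illustrative.

```java
// Rough estimate only: assumes the 400GB figure is the stored size and
// that a region splits into two halves once it reaches the max size.
final class RegionEstimate {
    // Fewest regions: every region is filled right up to the max size.
    static long minRegions(long dataGb, long maxRegionGb) {
        return (dataGb + maxRegionGb - 1) / maxRegionGb; // ceiling division
    }

    // Most regions: every region was just split, so each is about half full.
    static long maxRegions(long dataGb, long maxRegionGb) {
        return 2 * minRegions(dataGb, maxRegionGb);
    }

    public static void main(String[] args) {
        long dataGb = 400, maxRegionGb = 1, nodes = 15;
        long lo = minRegions(dataGb, maxRegionGb);
        long hi = maxRegions(dataGb, maxRegionGb);
        System.out.println("regions: " + lo + " to " + hi);
        System.out.println("per node: " + (double) lo / nodes
                + " to " + (double) hi / nodes);
    }
}
```

Under these assumptions the table would end up with somewhere between 400 and 800 regions, or roughly 27 to 53 per node on the 15 nodes.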
To eliminate some of the write bottlenecks, I already disabled writing
to the WAL by modifying the import tool. (I assume writing to the WAL is
not necessary during import as long as no regionservers crash. If one does,
I can simply recreate an empty HBase and start over.)
Also, I temporarily set hbase.hstore.compactionThreshold and
hbase.hstore.blockingStoreFiles excessively high in order to disable
minor compactions during the time of the import. With these changes it
still takes about 100 hours to import the data, as opposed to the 6 hours
it took to read it. The import starts with a single region on one node,
which is split when the max size is exceeded. The resulting regions are
spread out over the other nodes, so that is not a problem. The first tasks
result in regionservers sometimes blocking updates because they are
flushing memstores. After a while (around 10% completion of the job) the
logs mostly show the "LRU Stats", and sometimes "Updating" / "Opening"
statements. Although I presumably disabled minor compactions and no major
compaction should be running yet, sometimes I also see "Compacting"
statements. Why is that so? In other words, what does "because Region
has references on open" mean?
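To put the 100 hours versus 6 hours in perspective, here is a small sketch of the aggregate rates involved. It assumes the same 400GB applies in both directions and ignores compression and replication; the class and method names are only illustrative.

```java
// Back-of-the-envelope throughput comparison; not a benchmark.
final class ImportThroughput {
    // Aggregate rate in MB/s for moving dataGb gigabytes in the given hours.
    static double mbPerSecond(double dataGb, double hours) {
        return dataGb * 1024.0 / (hours * 3600.0);
    }

    public static void main(String[] args) {
        double dataGb = 400.0;
        double write = mbPerSecond(dataGb, 100.0); // import took about 100 hours
        double read = mbPerSecond(dataGb, 6.0);    // export took about 6 hours
        System.out.println("import MB/s: " + write);
        System.out.println("export MB/s: " + read);
        System.out.println("ratio: " + read / write);
    }
}
```

That works out to about 1.1 MB/s for the import against about 19 MB/s for the export over the whole cluster, a factor of roughly 17.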
Aside from these performance issues, tasks are failing with region offline
errors. These are always regions that were just split. The map/reduce
framework tolerates these errors, but I thought the splitting process was
transparent to the user.
Please correct me if I'm wrong in any of my assumptions.
Ferdy.