Re: importing dataset, some problems and performance issues

Jean-Daniel Cryans Mon, 21 Mar 2011 13:48:38 -0700

What you are describing is usually solved by either:

- pre-creating the regions so that you don't have to go through the
"growing pains" of a new, virgin table. Use this sort of method:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])


- using the bulk loader: http://hbase.apache.org/bulk-loads.html
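
As a rough illustration of the first option, here is a minimal sketch of how the split points for a pre-split table might be computed. It assumes the row keys are spread uniformly over a fixed-width binary key space, which nothing in this thread confirms; split_keys and key_bytes are hypothetical names, not part of HBase. The resulting byte strings correspond to the byte[][] argument of the createTable method linked above.

```python
def split_keys(num_regions, key_bytes=4):
    """Return num_regions - 1 evenly spaced split points.

    Assumption (not from the thread): row keys are uniformly distributed
    over a fixed-width, big-endian key space of key_bytes bytes.
    """
    if num_regions < 2:
        # A single region needs no split points.
        return []
    space = 256 ** key_bytes
    if num_regions > space:
        raise ValueError("more regions requested than distinct keys")
    # Split point i marks the start of region i (regions are 0-indexed).
    return [(i * space // num_regions).to_bytes(key_bytes, "big")
            for i in range(1, num_regions)]


# Illustrative sizing using the numbers quoted below: 400 GB of data with a
# 1 GB max region size suggests on the order of 400 regions.
keys = split_keys(400)
print(len(keys), "split points, first:", keys[0].hex(), "last:", keys[-1].hex())
```

If the real keys are not uniform (for example, they share a common prefix), evenly spaced points will leave most regions empty, and the split points would have to be sampled from the exported data instead.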

J-D

On Fri, Mar 18, 2011 at 5:46 AM, Ferdy Galema wrote:
> On second thought, removing the obsolete region folders was easily done by
> hand. This way I can merge regions with the merge tool.
>
> However, I'm still bothered by the (performance) issues I ran into. Any
> advice would be helpful.
>
> On 03/18/2011 11:06 AM, Ferdy Galema wrote:
>>
>> After exporting a table of about 30M rows (each row has about 500 columns,
>> totalling 400GB of data), there were several issues when trying to import it
>> again into an empty HBase. (HBase version is 0.90.1-CDH3B4, deployed on 15
>> nodes. LZO is enabled.)
>>
>> The reason for this export/import is both to reduce the number of regions
>> and to clean up region folders in the table that are no longer referred to.
>> (I can see this from the dfs timestamps.) Btw, I'm aware of the Merge
>> tool, which can only solve the merging part. The max region size is set to
>> 1GB, which is not an uncommon number judging by other posts about big
>> data sets.
>>
>> To eliminate some of the write bottlenecks, I already disabled writing to
>> the WAL by modifying the import tool. (I assume writing to the WAL is not
>> necessary during import as long as no regionservers crash. If one does, I
>> can simply recreate an empty HBase and start over.)
>>
>> Also, I temporarily set hbase.hstore.compactionThreshold and
>> hbase.hstore.blockingStoreFiles excessively high in order to disable minor
>> compactions during the import. With these changes it still takes about 100
>> hours to import the data, as opposed to the 6 hours it took to read it.
>>
>> The import starts with a single region on one node, which is split when the
>> max region size is exceeded. The resulting regions are spread out over the
>> other nodes, so that is not a problem. The first tasks result in
>> regionservers sometimes blocking updates because they are flushing
>> memstores. After a while (around 10% completion of the job) the logs mostly
>> show the "LRU Stats", and sometimes "Updating" / "Opening" statements.
>> Although I presumably disabled minor compactions and no major compaction
>> should be running yet, I sometimes also see "Compacting" statements. Why is
>> that so? In other words, what does "because Region has references on open"
>> mean?
>>
>> Aside from these performance issues, tasks are failing with region offline
>> errors. These are always regions that were just split. The map/reduce
>> framework tolerates these errors, but I thought the splitting process was
>> transparent to the user.
>>
>> Please correct me if I'm wrong in any of my assumptions.
>>
>> Ferdy.
>
