Re: importing dataset, some problems and performance issues

Jean-Daniel Cryans Mon, 21 Mar 2011 18:23:29 -0700

I feel like I'm not understanding your need correctly. Could you
elaborate on what you think HBase should be doing in order to give you a
better life?


Thx,

J-D

On Mon, Mar 21, 2011 at 5:22 PM, Ferdy Galema <[email protected]> wrote:
> These methods are certainly helpful whenever I need to do a heavy
> import. For now I got away with manually cleaning my regions/stores and
> merging the data. I thought importing/exporting was the easy way to do that,
> but I guess that's not (yet) true.
>
> On 03/21/2011 09:48 PM, Jean-Daniel Cryans wrote:
>>
>> What you are describing is usually solved by either:
>>
>> - pre-creating the regions so that you don't have to go through the
>> "growing pains" of a new, virgin table. Use this sort of method:
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
>>
>> - use the bulk loader: http://hbase.apache.org/bulk-loads.html
>>
>> J-D
>>
>> On Fri, Mar 18, 2011 at 5:46 AM, Ferdy Galema <[email protected]> wrote:
>>>
>>> On second thought, removing the obsolete regionfolders was easily done by
>>> hand. This way I can merge regions with the merge tool.
>>>
>>> However, I'm still bothered by the (performance) issues I ran into. Any
>>> advice would be helpful.
>>>
>>> On 03/18/2011 11:06 AM, Ferdy Galema wrote:
>>>>
>>>> After exporting a table of about 30M rows (each row has about 500
>>>> columns, totalling 400GB of data), there were several issues when
>>>> trying to import it again on an empty HBase. (HBase version is
>>>> 0.90.1-CDH3B4, deployed on 15 nodes. LZO is enabled.)
>>>>
>>>> The reason for this export/import is to both reduce the number of
>>>> regions and clean up regionfolders in the table that are no longer
>>>> referred to. (I can see this because of the dfs timestamps.) Btw, I'm
>>>> aware of the Merge tool, which can only solve the merging part. The
>>>> max region size is set to 1GB, which is not an uncommon number judging
>>>> by other posts concerning big data sets.
>>>>
>>>> To eliminate some of the write bottlenecks, I already disabled writing
>>>> to the WAL by modifying the import tool. (I assume writing to the WAL
>>>> is not necessary during import as long as no regionservers crash. If
>>>> one does, I can simply recreate an empty hbase and start over.)
>>>>
>>>> Also, I temporarily set hbase.hstore.compactionThreshold and
>>>> hbase.hstore.blockingStoreFiles excessively high in order to disable
>>>> minor compactions during the time of the import. With these changes it
>>>> still takes about 100 hours to import the data, as opposed to the 6
>>>> hours it took to read it. The import starts with a single region on
>>>> one node, which is split when the max size is exceeded. The resulting
>>>> regions are spread out over the other nodes, so that is not a problem.
>>>> The first tasks result in regionservers sometimes blocking updates
>>>> because they're flushing memstores. After a while (around 10%
>>>> completion of the job) the logs mostly show the "LRU Stats", and
>>>> sometimes "Updating" / "Opening" statements. Although I presumably
>>>> disabled minor compactions and no major compaction should be running
>>>> yet, I sometimes also see "Compacting" statements. Why is that so? In
>>>> other words, what does "because Region has references on open" mean?
>>>>
>>>> Aside from these performance issues, tasks are failing with region
>>>> offline errors. These are always regions that were just split. The
>>>> map/reduce framework tolerates these errors, but I thought the
>>>> splitting process was transparent to the user.
>>>>
>>>> Please correct me if I'm wrong in any of my assumptions.
>>>>
>>>> Ferdy.
>
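
For readers following the thread, the pre-creation approach J-D suggests needs a set of split keys to pass as the byte[][] argument of HBaseAdmin.createTable(HTableDescriptor, byte[][]). Below is a minimal sketch of how such keys might be computed. It assumes row keys are roughly uniformly distributed over their first two bytes (for example, hashed keys), which the thread does not state; for other key layouts the boundaries would have to be derived from the actual data. The names SplitKeys and evenSplits are hypothetical and not part of HBase.

```java
// Hypothetical helper (not part of HBase): computes evenly spaced split keys
// over a two-byte key prefix. ASSUMPTION: row keys are uniformly distributed
// over their first two bytes; the thread does not say how keys are laid out.
final class SplitKeys {
    private SplitKeys() {}

    // Returns numRegions - 1 boundaries, in ascending unsigned byte order.
    static byte[][] evenSplits(int numRegions) {
        if (numRegions < 2 || numRegions > 65536) {
            throw new IllegalArgumentException("numRegions must be in [2, 65536]");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Boundary i sits at i/numRegions of the 16-bit prefix space.
            int boundary = (int) ((long) i * 65536 / numRegions);
            // Encode the boundary as two big-endian bytes.
            splits[i - 1] = new byte[] { (byte) (boundary >>> 8), (byte) boundary };
        }
        return splits;
    }
}
```

With the resulting array handed to createTable, the table would start with numRegions regions instead of one, so the import's writes would not all land on a single regionserver at the start.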
