Re: importing dataset, some problems and performance issues

Jean-Daniel Cryans Tue, 22 Mar 2011 13:26:40 -0700

Usually the logs are pretty chatty about what's blocking them. Here's
one example of me going through my own logs:
http://search-hadoop.com/m/fJ0vh6ojHm1


J-D

On Tue, Mar 22, 2011 at 4:18 AM, Ferdy Galema <[email protected]> wrote:
> HBase already makes my life better, so no worries there :)
>
> I agree the topic of this thread is not clear anymore. I also already know
> how to tackle my problem. So just for the record let me explain what I was
> thinking/doing:
>
> The original intent was to clean up my HBase installation (remove floating
> regions and storefiles). We have had some crashes in the past and therefore
> there were still some minor inconsistencies. I had never run the hbck tool;
> in fact I was not aware of it. A second intent was to decrease the number of
> regions.
>
> However, I wrongly decided that the best way to do this is by doing an
> export and a consecutive import on a clean dataset. This way I could avoid
> the process of digging into the data files and merging the regions manually.
> Of course it would work if I tuned the (import) performance parameters
> better or simply accepted to wait for a long time for the import to finish.
> So my first posting was about these performance issues. After that, I
> quickly turned to manually cleaning/merging regions. This worked.
>
> So although my initial problems were solved, I was still a bit concerned. I
> know that importing is more expensive than exporting, but I did not expect
> to see that big a difference in the order of magnitude. I thought there might
> as well be something terribly wrong with my configuration, or my assumptions
> about the way the clients/regionservers can be tuned in order to increase
> bulkloading performance. For example, the assumption that increasing the
> hbase.hstore.compactionThreshold and hbase.hstore.blockingStoreFiles to
> excessive amounts will completely disable minor compactions. (By the way,
> I'm still not sure if it does and if it's smart to do that when importing).
>
> Ferdy.
>
> On 03/22/2011 02:22 AM, Jean-Daniel Cryans wrote:
>>
>> I feel like I'm not understanding your need correctly, could you
>> explain what you think HBase should be doing in order to give you a
>> better life?
>>
>> Thx,
>>
>> J-D
>>
>> On Mon, Mar 21, 2011 at 5:22 PM, Ferdy Galema<[email protected]>
>>  wrote:
>>>
>>> These methods are certainly helpful whenever I need to do a heavy
>>> import. For now I got away with manually cleaning my regions/stores and
>>> merging the data. I thought importing/exporting was the easy way to do
>>> that, but I guess that's not (yet) true.
>>>
>>> On 03/21/2011 09:48 PM, Jean-Daniel Cryans wrote:
>>>>
>>>> What you are describing is solved usually by either:
>>>>
>>>> - pre-creating the regions so that you don't have to go through the
>>>> "growing pains" of a new, virgin table. Use this sort of method:
>>>>
>>>>
>>>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
>>>>
>>>> - use the bulk loader: http://hbase.apache.org/bulk-loads.html
>>>>
>>>> J-D
>>>>
>>>> On Fri, Mar 18, 2011 at 5:46 AM, Ferdy Galema<[email protected]>
>>>>  wrote:
>>>>>
>>>>> On second thought, removing the obsolete regionfolders was easily done
>>>>> by hand. This way I can merge regions with the merge tool.
>>>>>
>>>>> However, I'm still bothered by the (performance) issues I ran into. Any
>>>>> advice would be helpful.
>>>>>
>>>>> On 03/18/2011 11:06 AM, Ferdy Galema wrote:
>>>>>>
>>>>>> After exporting a table of about 30M rows (each row has about 500
>>>>>> columns, totalling 400GB of data), there were several issues when
>>>>>> trying to import it again on an empty HBase. (HBase version is
>>>>>> 0.90.1-CDH3B4, deployed on 15 nodes. LZO is enabled.)
>>>>>>
>>>>>> The reason for this export/import is to both reduce the number of
>>>>>> regions and clean up regionfolders in the table that are no longer
>>>>>> referred to. (I can see this because of the dfs timestamps.) Btw, I'm
>>>>>> aware of the Merge tool, which can only solve the merging part. The max
>>>>>> region size is set to 1GB, which is not an uncommon number judging by
>>>>>> other posts concerning a big data set.
>>>>>>
>>>>>> To eliminate some of the write bottlenecks, I already disabled writing
>>>>>> to the WAL by modifying the import tool. (I assume writing to the WAL is
>>>>>> not necessary during import as long as no regionservers crash. If one
>>>>>> does, I can simply recreate an empty hbase and start over.)
>>>>>>
>>>>>> Also, I temporarily set hbase.hstore.compactionThreshold and
>>>>>> hbase.hstore.blockingStoreFiles excessively high in order to disable
>>>>>> minor compactions during the time of the import. With these changes it
>>>>>> still takes about 100 hours to import the data, as opposed to the 6
>>>>>> hours it took to read it. The importing starts with a single region on
>>>>>> one node, which is split when the size is exceeded. The resulting
>>>>>> regions are spread out over the other nodes, so that's not a problem.
>>>>>> The first tasks result in regionservers sometimes blocking updates
>>>>>> because they're flushing memstores. After a while (around 10%
>>>>>> completion of the job) the logs mostly show the "LRU Stats", and
>>>>>> sometimes "Updating" / "Opening" statements. Although I presumably
>>>>>> disabled minor compactions and no major compaction should be running
>>>>>> yet, sometimes I also see Compacting statements. Why is that so? In
>>>>>> other words, what does "because Region has references on open" mean?
>>>>>>
>>>>>> Aside from these performance issues, tasks are failing with region
>>>>>> offline errors. These are always regions that were just split. The
>>>>>> map/reduce framework tolerates these errors, still I thought the
>>>>>> splitting process was transparent to the user.
>>>>>>
>>>>>> Please correct me if I'm wrong in any of my assumptions.
>>>>>>
>>>>>> Ferdy.
>
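
For the region pre-creation suggested earlier in the thread, here is a rough sketch of how the split keys for the byte[][] argument of createTable might be computed. It is only an illustration under assumptions not stated in the thread: row keys are taken to be fixed-width lowercase hex strings that are uniformly distributed, and the helper names region_count and hex_split_keys (and the width parameter) are hypothetical. The 400GB and 1GB figures come from the original post and are used only as a rough size estimate.

```python
import math


def region_count(total_gb, max_region_gb):
    """Rough number of regions if each one holds at most max_region_gb."""
    return math.ceil(total_gb / max_region_gb)


def hex_split_keys(num_regions, width=8):
    """Return num_regions - 1 evenly spaced split keys as bytes.

    Assumption (hypothetical): row keys are fixed-width lowercase hex
    strings, uniformly distributed over the whole key space. Adapt this
    to the real row key layout before using the result.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    space = 16 ** width  # size of the assumed key space
    fmt = "0{}x".format(width)  # zero-padded, lowercase hex
    return [
        format(i * space // num_regions, fmt).encode("ascii")
        for i in range(1, num_regions)
    ]


# Figures from the thread: about 400GB of data, 1GB max region size.
regions = region_count(400, 1)
splits = hex_split_keys(regions)
print(regions, len(splits))
```

The resulting list of byte strings is sorted and could be handed to the table creation call as the split keys, so that the import starts out spread over many regions instead of one. If the real keys are not uniformly distributed, split points sampled from the exported data would be a better fit than evenly spaced ones.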
