Yes, then you should really look at using the write buffer.

J-D
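For reference, a minimal sketch of the write-buffer setup discussed in this thread, written against the 0.20-era client API (the table name, column family, and buffer size are placeholders, not from the thread):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of a mapper that buffers puts client-side instead of
    // round-tripping to the region server on every put() call.
    public class InsertMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

      private HTable table;

      @Override
      protected void setup(Context context) throws IOException {
        // "mytable" is a placeholder table name
        table = new HTable(new HBaseConfiguration(), "mytable");
        table.setAutoFlush(false);                  // stop flushing on every put()
        table.setWriteBufferSize(12 * 1024 * 1024); // e.g. a 12 MB write buffer
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException {
        // Placeholder extraction logic: row key from the input offset,
        // one column from the input line.
        Put put = new Put(Bytes.toBytes(key.get()));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"),
                Bytes.toBytes(value.toString()));
        table.put(put); // buffered; sent in batches as the buffer fills
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        table.flushCommits(); // push whatever is still buffered
        table.close();
      }
    }

With auto-flush off, puts accumulate on the client and go over the wire in batches rather than one RPC per row, which is usually the biggest single win for a MapReduce insert job like the one described below.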
On Thu, Jul 22, 2010 at 3:22 PM, HAN LIU <[email protected]> wrote:
> Thanks J-D.
>
> The only place where I create an HTable is in the constructor of my Mapper.
> The constructor is called only once for each map task, right?
>
> Han
>
> On Jul 22, 2010, at 4:43 PM, Jean-Daniel Cryans wrote:
>
>> Han,
>>
>> This is bad, you must be doing something slow like creating a new
>> HTable for each put call. You also need to use the write buffer
>> (disable auto-flushing, then set the write buffer size on the HTable
>> during map configuration), since you manage the HTable yourself.
>>
>> Use of the bulk load tool is widespread; you should give it a try if
>> you only have 1 family.
>>
>> J-D
>>
>> On Thu, Jul 22, 2010 at 1:06 PM, HAN LIU <[email protected]> wrote:
>>> Hi Guys,
>>>
>>> I've been doing some data insertion from HDFS to HBase and the performance
>>> seems to be really bad. It took about 3 hours to insert 15 GB of data. The
>>> MapReduce job is launched from one machine, which grabs data from HDFS and
>>> inserts it into an HTable located on 3 other machines (1 master and 2
>>> regionservers). There are 17 map tasks in total (no reduce tasks),
>>> representing 17 files, each about 1 GB in size. The mapper simply extracts
>>> the useful information from each of these files and inserts it into HBase.
>>> In the end about 22 million rows are added to the table, and with my
>>> implementation (pretty inefficient, I think), 'table.put(Put p)' is called
>>> once for each of these rows, so in the end there are 22 million
>>> 'table.put()' calls.
>>>
>>> Does it make sense that this many 'table.put()' calls take 3 hours? I have
>>> played with my code and determined that the bottleneck is these
>>> 'table.put()' calls: if I remove them, the rest of the code (doing every
>>> part of the job except committing the updates via 'table.put()') only
>>> takes 2 minutes to run. I am really inexperienced with HBase, so how do
>>> you guys usually do data insertion? What could be the tricks to enhance
>>> performance?
>>>
>>> I am thinking about using the bulk load feature to batch-insert data into
>>> HBase. Is this a popular method out there in the HBase community?
>>>
>>> Really sorry about asking for so much help with my problems while not yet
>>> helping other people with theirs. I would really like to offer help once
>>> I get more experienced with HBase.
>>>
>>> Thanks a lot in advance :)
>>>
>>> ----
>>> Han Liu
>>> SCS & HCI Institute
>>> Undergrad. Class of 2012
>>> Carnegie Mellon University
>>>
>
> Han Liu
> SCS & HCI Institute
> Undergrad. Class of 2012
> Carnegie Mellon University
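As for the bulk load route J-D recommends, a rough sketch of the first half, preparing HFiles with a MapReduce job instead of issuing live puts (class names and paths are placeholders, keys must arrive at the output in total row order, only one column family is supported by this path, and the script for completing the load varies by HBase version):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch: configure a job whose output is HFiles on HDFS rather
    // than puts against a live table.
    public class BulkLoadPrep {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new HBaseConfiguration(), "hfile-prep");
        job.setJarByClass(BulkLoadPrep.class);
        // ... set a mapper/reducer that emits
        //     ImmutableBytesWritable (row key) -> KeyValue,
        //     in sorted row-key order ...
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(KeyValue.class);
        job.setOutputFormatClass(HFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles")); // placeholder
        job.waitForCompletion(true);
        // Second half: point the cluster at the generated files, e.g. with
        // the loadtable.rb script that ships with 0.20-era HBase:
        //   bin/hbase org.jruby.Main bin/loadtable.rb mytable /tmp/hfiles
      }
    }

Because the region servers just adopt the finished HFiles, this path skips the write path (WAL, memstore, compactions) entirely, which is why it tends to be much faster than 22 million individual puts.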
