Hi,

For testing purposes I have to make some bulk loads as well.

What I do is insert the data in batches (for instance 10,000 rows at a time).

I create a List of Puts out of those records:

List<Put> pList = new ArrayList<Put>();

where each Put has writeToWAL set to false:

put.setWriteToWAL(false);
pList.add(p);

Then I disable auto-flush and set a larger write buffer:

hTable.setAutoFlush(false);
hTable.setWriteBufferSize(1024*1024*12);
hTable.put(pList);
hTable.flushCommits();   // make sure nothing is left sitting in the client-side buffer
hTable.setAutoFlush(true);

The following settings boosted my load performance about 5x - without any
further performance tuning or special hardware configuration I achieve
8,000-9,000 records per second:
p.setWriteToWAL(false);
hTable.setAutoFlush(false);
hTable.setWriteBufferSize(1024*1024*12);
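
Putting the pieces together, here is a minimal sketch of this kind of
batched load, assuming a 0.90-era HBase client API; the class name, the
table "mytable", the "cf:q" column and the generated row keys are just
placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedLoad {
  public static void main(String[] args) throws Exception {
    HTable hTable = new HTable(HBaseConfiguration.create(), "mytable");
    hTable.setAutoFlush(false);                   // buffer puts client-side
    hTable.setWriteBufferSize(1024 * 1024 * 12);  // 12 MB write buffer

    List<Put> pList = new ArrayList<Put>();
    for (int i = 0; i < 10000; i++) {             // one batch of 10,000 rows
      Put p = new Put(Bytes.toBytes("row-" + i));
      p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
      p.setWriteToWAL(false);                     // skip the WAL for speed
      pList.add(p);
    }

    hTable.put(pList);       // lands in the write buffer, flushed as it fills
    hTable.flushCommits();   // push anything still buffered
    hTable.close();
  }
}

Keep in mind that setWriteToWAL(false) trades durability for speed: if a
regionserver crashes, edits that were only in its memstore are lost, so it
is really only suitable for test loads you can re-run.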


/SJ


On Thu, Jul 22, 2010 at 6:31 PM, Jean-Daniel Cryans <[email protected]> wrote:
> Yes, then you should really look at using the write buffer.
>
> J-D
>
> On Thu, Jul 22, 2010 at 3:22 PM, HAN LIU <[email protected]> wrote:
>> Thanks J-D.
>>
>> The only place where I create an HTable is in the constructor of my Mapper.  
>> The constructor is called only once for each map task, right?
>>
>> Han
>> On Jul 22, 2010, at 4:43 PM, Jean-Daniel Cryans wrote:
>>
>>> Han,
>>>
>>> This is bad, you must be doing something slow like creating a new
>>> HTable for each put call. Also you need to use the write buffer
>>> (disable auto flushing, then set the write buffer size on HTable
>>> during the map configuration) since you manage the HTable yourself.
>>>
>>> The bulk load tool is widely used; you should give it a try if
>>> you only have 1 family.
>>>
>>> J-D
>>>
>>> On Thu, Jul 22, 2010 at 1:06 PM, HAN LIU <[email protected]> wrote:
>>>> Hi Guys,
>>>>
>>>> I've been doing some data insertion from HDFS to HBase and the performance
>>>> seems to be really bad. It took about 3 hours to insert 15 GB of data.
>>>> The mapreduce job is launched from one machine, which grabs data from HDFS
>>>> and inserts it into an HTable hosted on 3 other machines (1 master and 2
>>>> regionservers). There are 17 map jobs in total (no reduce jobs),
>>>> representing 17 files, each about 1 GB in size. The mapper simply extracts
>>>> the useful information from each of these files and inserts it into
>>>> HBase. In the end about 22 million rows are added to the table, and with
>>>> my implementation (pretty inefficient, I think) a 'table.put(Put p)' call
>>>> is made once for each of these rows, so in the end there are 22 million
>>>> 'table.put()' calls.
>>>>
>>>> Does it make sense that this many 'table.put' calls take 3 hours?
>>>> I have played with my code and determined that the bottleneck is these
>>>> 'table.put()' calls: if I remove them, the rest of the code (doing every
>>>> part of the job except committing the updates via 'table.put()') only
>>>> takes 2 minutes to run. I am really inexperienced with HBase, so how do
>>>> you guys usually do data insertion? What could be the tricks to enhance
>>>> performance?
>>>>
>>>> I am thinking about using the bulk load feature to batch insert data into 
>>>> HBase. Is this a popular method out there in the HBase community?
>>>>
>>>> Really sorry about asking for so much help with my problems while not
>>>> helping other people with theirs. I would really like to offer help once
>>>> I get more experienced with HBase.
>>>>
>>>> Thanks a lot in advance :)
>>>>
>>>>
>>>> ----
>>>> Han Liu
>>>> SCS & HCI Institute
>>>> Undergrad. Class of 2012
>>>> Carnegie Mellon University
>>>>
>>>>
>>>>
>>>>
>>>
>>
>> Han Liu
>> SCS & HCI Institute
>> Undergrad. Class of 2012
>> Carnegie Mellon University
>>
>>
>>
>>
>
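
(On J-D's point above about setting up the write buffer during the map
configuration: below is a rough sketch of a mapper that manages its own
HTable, assuming a 0.90-era HBase client and the org.apache.hadoop.mapreduce
API; the class name, the table "mytable", the "cf:q" column and the
tab-separated input format are only placeholders.)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LoadMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable hTable;

  @Override
  protected void setup(Context context) throws IOException {
    // One HTable per map task, configured once in setup().
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    hTable = new HTable(conf, "mytable");
    hTable.setAutoFlush(false);                  // buffer puts instead of one RPC per put
    hTable.setWriteBufferSize(1024 * 1024 * 12); // 12 MB client-side write buffer
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Parse the input line into a row key and a value (format is illustrative).
    String[] fields = value.toString().split("\t");
    Put p = new Put(Bytes.toBytes(fields[0]));
    p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(fields[1]));
    p.setWriteToWAL(false);                      // optional: faster, but not durable
    hTable.put(p);                               // lands in the write buffer
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    hTable.flushCommits();                       // flush whatever is still buffered
    hTable.close();
  }
}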
