Hi,

For testing purposes I have to do some bulk loads as well.

What I do is insert the data in batches (for instance 10,000 rows at a time). I create a Put list out of those records:

    List<Put> pList = new ArrayList<Put>();

where each Put has writeToWAL set to false:

    put.setWriteToWAL(false);
    pList.add(p);

Then I disable auto-flush, set a larger write buffer, write the list, and re-enable auto-flush:

    hTable.setAutoFlush(false);
    hTable.setWriteBufferSize(1024 * 1024 * 12);
    hTable.put(pList);
    hTable.setAutoFlush(true);

These settings boosted my load performance five-fold; without any further performance tuning and no special hardware configuration I get 8,000-9,000 records per second:

    p.setWriteToWAL(false);
    hTable.setAutoFlush(false);
    hTable.setWriteBufferSize(1024 * 1024 * 12);

A sketch of how these pieces can fit together inside a map task follows the quoted thread below.

/SJ

On Thu, Jul 22, 2010 at 6:31 PM, Jean-Daniel Cryans <[email protected]> wrote:
> Yes, then you should really look at using the write buffer.
>
> J-D
>
> On Thu, Jul 22, 2010 at 3:22 PM, HAN LIU <[email protected]> wrote:
>> Thanks J-D.
>>
>> The only place where I create an HTable is in the constructor of my Mapper.
>> The constructor is called only once for each map task, right?
>>
>> Han
>>
>> On Jul 22, 2010, at 4:43 PM, Jean-Daniel Cryans wrote:
>>
>>> Han,
>>>
>>> This is bad, you must be doing something slow like creating a new
>>> HTable for each put call. Also you need to use the write buffer
>>> (disable auto-flushing, then set the write buffer size on the HTable
>>> during the map setup), since you manage the HTable yourself.
>>>
>>> The bulk load tool is widely used; you should give it a try if
>>> you only have 1 family.
>>>
>>> J-D
>>>
>>> On Thu, Jul 22, 2010 at 1:06 PM, HAN LIU <[email protected]> wrote:
>>>> Hi Guys,
>>>>
>>>> I've been doing some data insertion from HDFS into HBase and the
>>>> performance seems to be really bad. It took about 3 hours to insert
>>>> 15 GB of data. The MapReduce job is launched from one machine, which
>>>> grabs data from HDFS and inserts it into an HTable hosted on 3 other
>>>> machines (1 master and 2 regionservers). There are 17 map tasks in
>>>> total (no reduce tasks), representing 17 files, each about 1 GB in
>>>> size. The mapper simply extracts the useful information from each of
>>>> these files and inserts it into HBase. In the end about 22 million
>>>> rows are added to the table, and with my implementation (pretty
>>>> inefficient, I think) 'table.put(Put p)' is called once for each of
>>>> these rows, so in the end there are 22 million 'table.put()' calls.
>>>>
>>>> Does it make sense that this many 'table.put()' calls take 3 hours?
>>>> I have played with my code and determined that the bottleneck is
>>>> these 'table.put()' calls: if I remove them, the rest of the code
>>>> (doing every part of the job except committing the updates via
>>>> 'table.put()') only takes 2 minutes to run. I am really inexperienced
>>>> with HBase, so how do you usually do data insertion? What are the
>>>> tricks to improve performance?
>>>>
>>>> I am thinking about using the bulk load feature to batch-insert data
>>>> into HBase. Is this a popular method in the HBase community?
>>>>
>>>> Really sorry about asking for so much help with my problems while not
>>>> yet helping other people with theirs. I would really like to offer
>>>> help once I get more experienced with HBase.
>>>>
>>>> Thanks a lot in advance :)
>>>>
>>>> ----
>>>> Han Liu
>>>> SCS & HCI Institute
>>>> Undergrad. Class of 2012
>>>> Carnegie Mellon University
>>
>> Han Liu
>> SCS & HCI Institute
>> Undergrad. Class of 2012
>> Carnegie Mellon University
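
Putting SJ's settings and J-D's advice together, here is a minimal sketch of the pattern applied inside a map task. It assumes the HBase client API of that era (HTable, setAutoFlush, setWriteBufferSize, Put.setWriteToWAL); the table name "mytable", column family "cf", qualifier "data", the batch size, and the tab-separated record parsing are illustrative placeholders, not details from the thread.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BulkPutMapper extends Mapper<LongWritable, Text, Text, Text> {

      private static final int BATCH_SIZE = 10000;               // hand a Put list over every 10,000 rows
      private static final byte[] FAMILY = Bytes.toBytes("cf");  // placeholder column family

      private HTable table;
      private final List<Put> puts = new ArrayList<Put>(BATCH_SIZE);

      @Override
      protected void setup(Context context) throws IOException {
        // One HTable per map task, created once in setup(), never per record.
        table = new HTable(new HBaseConfiguration(context.getConfiguration()), "mytable");
        table.setAutoFlush(false);                  // buffer puts on the client side
        table.setWriteBufferSize(1024 * 1024 * 12); // 12 MB write buffer
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Placeholder parsing: first tab-separated field is the row key, second is the value.
        String[] fields = value.toString().split("\t", 2);
        if (fields.length < 2) {
          return;                                   // skip malformed lines
        }
        Put p = new Put(Bytes.toBytes(fields[0]));
        p.add(FAMILY, Bytes.toBytes("data"), Bytes.toBytes(fields[1]));
        p.setWriteToWAL(false);                     // skip the WAL; rows are lost if a regionserver dies
        puts.add(p);
        if (puts.size() >= BATCH_SIZE) {
          table.put(puts);                          // push the whole batch into the write buffer
          puts.clear();
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        if (!puts.isEmpty()) {
          table.put(puts);                          // last partial batch
        }
        table.flushCommits();                       // flush anything still sitting in the write buffer
        table.close();
      }
    }

Note the trade-off that both posters gloss over: setWriteToWAL(false) skips the write-ahead log entirely, so a regionserver crash silently drops recently written rows; it is appropriate for test loads like this one, but not for data you cannot re-run.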
