Hi SJ,

Awesome setup! I tested with your configuration and the performance is 6 times better. :) Thanks a lot.

What role does setWriteToWAL(false) play here? I hear that if it is set to false, data can be lost if a RegionServer crashes. How much would performance suffer if I set it back to true?
Thanks,
Han

On Jul 23, 2010, at 9:57 AM, Samuru Jackson wrote:

> Hi,
>
> For testing purposes I have to do some bulk loads as well.
>
> What I do is insert the data in batches (for instance, 10,000 rows at a
> time).
>
> I create a Put list out of those records:
>
> List<Put> pList = new ArrayList<Put>();
>
> where each Put has WriteToWAL set to false:
>
> p.setWriteToWAL(false);
> pList.add(p);
>
> Then I set autoflush to false and create a larger write buffer:
>
> hTable.setAutoFlush(false);
> hTable.setWriteBufferSize(1024 * 1024 * 12);
> hTable.put(pList);
> hTable.setAutoFlush(true);
>
> The following settings have boosted my load performance 5 times.
> Without any bigger performance tuning or special HW configuration, I
> achieve 8,000-9,000 records per second:
>
> p.setWriteToWAL(false);
> hTable.setAutoFlush(false);
> hTable.setWriteBufferSize(1024 * 1024 * 12);
>
> /SJ
>
> On Thu, Jul 22, 2010 at 6:31 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> Yes, then you should really look at using the write buffer.
>>
>> J-D
>>
>> On Thu, Jul 22, 2010 at 3:22 PM, HAN LIU <[email protected]> wrote:
>>> Thanks J-D.
>>>
>>> The only place where I create an HTable is in the constructor of my Mapper.
>>> The constructor is called only once for each map task, right?
>>>
>>> Han
>>> On Jul 22, 2010, at 4:43 PM, Jean-Daniel Cryans wrote:
>>>
>>>> Han,
>>>>
>>>> This is bad; you must be doing something slow, like creating a new
>>>> HTable for each put call. You also need to use the write buffer
>>>> (disable auto-flushing, then set the write buffer size on the HTable
>>>> during map configuration) since you manage the HTable yourself.
>>>>
>>>> The bulk load tool is widely used; you should give it a try if
>>>> you only have one column family.
>>>>
>>>> J-D
>>>>
>>>> On Thu, Jul 22, 2010 at 1:06 PM, HAN LIU <[email protected]> wrote:
>>>>> Hi guys,
>>>>>
>>>>> I've been doing some data insertion from HDFS to HBase and the
>>>>> performance seems to be really bad.
>>>>> It took about 3 hours to insert 15 GB
>>>>> of data. The MapReduce job is launched from one machine, which grabs data
>>>>> from HDFS and inserts it into an HTable hosted on 3 other machines (1
>>>>> master and 2 regionservers). There are 17 map tasks in total (no reduce
>>>>> tasks), one per file, each file about 1 GB in size. The mapper simply
>>>>> extracts the useful information from each of these files and inserts it
>>>>> into HBase. In the end about 22 million rows are added to the
>>>>> table, and with my implementation (pretty inefficient, I think),
>>>>> 'table.put(Put p)' is called once for each of these rows, so in the
>>>>> end there are 22 million 'table.put()' calls.
>>>>>
>>>>> Does it make sense that this many 'table.put()' calls take 3 hours?
>>>>> I have played with my code and determined that the bottleneck is
>>>>> these 'table.put()' calls: if I remove them, the rest of the code
>>>>> (doing every part of the job except committing the updates via
>>>>> 'table.put()') only takes 2 minutes to run. I am really
>>>>> inexperienced with HBase, so how do you usually do data insertion?
>>>>> What tricks could enhance performance?
>>>>>
>>>>> I am thinking about using the bulk load feature to batch-insert data
>>>>> into HBase. Is this a popular method in the HBase community?
>>>>>
>>>>> Really sorry about asking for so much help with my problems while not
>>>>> helping other people with theirs. I would really like to offer help
>>>>> once I get more experienced with HBase.
>>>>>
>>>>> Thanks a lot in advance :)
>>>>>
>>>>> ----
>>>>> Han Liu
>>>>> SCS & HCI Institute
>>>>> Undergrad. Class of 2012
>>>>> Carnegie Mellon University

Han Liu
SCS & HCI Institute
Undergrad. Class of 2012
Carnegie Mellon University
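[Editor's note] The speedups quoted in the thread can be sanity-checked from the numbers given: 22 million puts in roughly 3 hours is about 2,000 rows/second, against the 8,000-9,000 rows/second SJ reports with the write buffer enabled and the WAL disabled. A minimal standalone sketch of that arithmetic (the 8,500 figure is just an assumed midpoint of SJ's reported range):

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        long rows = 22_000_000L;            // rows Han inserted
        long seconds = 3L * 60 * 60;        // ~3 hours of wall-clock time

        // One autoflushed Put per row, as in Han's original job
        double baseline = rows / (double) seconds;

        // Midpoint of SJ's reported 8,000-9,000 rows/sec (assumption)
        double tuned = 8500.0;

        System.out.printf("baseline: %.0f rows/sec%n", baseline);
        System.out.printf("tuned:    %.0f rows/sec (%.1fx faster)%n",
                          tuned, tuned / baseline);
    }
}
```

This works out to roughly a 4x improvement for the midpoint, which is broadly consistent with SJ's "5 times" and Han's "6 times better" (both measured on their own workloads and hardware).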
