Yes, then you should really look at using the write buffer.

J-D
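For reference, a minimal sketch of the write-buffer setup discussed in this thread, written against the 0.20-era client API (the table name, column family, and buffer size are placeholders, not from the thread):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of a mapper that buffers puts client-side instead of
    // round-tripping to the region server on every put() call.
    public class InsertMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

      private HTable table;

      @Override
      protected void setup(Context context) throws IOException {
        // "mytable" is a placeholder table name
        table = new HTable(new HBaseConfiguration(), "mytable");
        table.setAutoFlush(false);                  // stop flushing on every put()
        table.setWriteBufferSize(12 * 1024 * 1024); // e.g. a 12 MB write buffer
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException {
        // Placeholder extraction logic: row key from the input offset,
        // one column from the input line.
        Put put = new Put(Bytes.toBytes(key.get()));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"),
                Bytes.toBytes(value.toString()));
        table.put(put); // buffered; sent in batches as the buffer fills
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        table.flushCommits(); // push whatever is still buffered
        table.close();
      }
    }

With auto-flush off, puts accumulate on the client and go over the wire in batches rather than one RPC per row, which is usually the biggest single win for a MapReduce insert job like the one described below.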
On Thu, Jul 22, 2010 at 3:22 PM, HAN LIU <[email protected]> wrote:
> Thanks J-D.
>
> The only place where I create an HTable is in the constructor of my Mapper.
> The constructor is called only once for each map task, right?
>
> Han
>
> On Jul 22, 2010, at 4:43 PM, Jean-Daniel Cryans wrote:
>
>> Han,
>>
>> This is bad, you must be doing something slow like creating a new
>> HTable for each put call. You also need to use the write buffer
>> (disable auto-flushing, then set the write buffer size on the HTable
>> during map configuration), since you manage the HTable yourself.
>>
>> Use of the bulk load tool is widespread; you should give it a try if
>> you only have 1 family.
>>
>> J-D
>>
>> On Thu, Jul 22, 2010 at 1:06 PM, HAN LIU <[email protected]> wrote:
>>> Hi Guys,
>>>
>>> I've been doing some data insertion from HDFS to HBase and the performance
>>> seems to be really bad. It took about 3 hours to insert 15 GB of data. The
>>> MapReduce job is launched from one machine, which grabs data from HDFS and
>>> inserts it into an HTable located on 3 other machines (1 master and 2
>>> regionservers). There are 17 map tasks in total (no reduce tasks),
>>> representing 17 files, each about 1 GB in size. The mapper simply extracts
>>> the useful information from each of these files and inserts it into HBase.
>>> In the end about 22 million rows are added to the table, and with my
>>> implementation (pretty inefficient, I think), 'table.put(Put p)' is called
>>> once for each of these rows, so in the end there are 22 million
>>> 'table.put()' calls.
>>>
>>> Does it make sense that this many 'table.put()' calls take 3 hours? I have
>>> played with my code and determined that the bottleneck is these
>>> 'table.put()' calls: if I remove them, the rest of the code (doing every
>>> part of the job except committing the updates via 'table.put()') only
>>> takes 2 minutes to run. I am really inexperienced with HBase, so how do
>>> you guys usually do data insertion? What could be the tricks to enhance
>>> performance?
>>>
>>> I am thinking about using the bulk load feature to batch-insert data into
>>> HBase. Is this a popular method out there in the HBase community?
>>>
>>> Really sorry about asking for so much help with my problems while not yet
>>> helping other people with theirs. I would really like to offer help once
>>> I get more experienced with HBase.
>>>
>>> Thanks a lot in advance :)
>>>
>>> ----
>>> Han Liu
>>> SCS & HCI Institute
>>> Undergrad. Class of 2012
>>> Carnegie Mellon University
>>>
>
> Han Liu
> SCS & HCI Institute
> Undergrad. Class of 2012
> Carnegie Mellon University
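As for the bulk load route J-D recommends, a rough sketch of the first half, preparing HFiles with a MapReduce job instead of issuing live puts (class names and paths are placeholders, keys must arrive at the output in total row order, only one column family is supported by this path, and the script for completing the load varies by HBase version):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch: configure a job whose output is HFiles on HDFS rather
    // than puts against a live table.
    public class BulkLoadPrep {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new HBaseConfiguration(), "hfile-prep");
        job.setJarByClass(BulkLoadPrep.class);
        // ... set a mapper/reducer that emits
        //     ImmutableBytesWritable (row key) -> KeyValue,
        //     in sorted row-key order ...
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(KeyValue.class);
        job.setOutputFormatClass(HFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles")); // placeholder
        job.waitForCompletion(true);
        // Second half: point the cluster at the generated files, e.g. with
        // the loadtable.rb script that ships with 0.20-era HBase:
        //   bin/hbase org.jruby.Main bin/loadtable.rb mytable /tmp/hfiles
      }
    }

Because the region servers just adopt the finished HFiles, this path skips the write path (WAL, memstore, compactions) entirely, which is why it tends to be much faster than 22 million individual puts.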
