But if I'm using bulkLoad, I think that method bypasses the WAL, right? And I'm not sure about autoFlush: do I still need to set it to false, or does the bulk load take care of that as well?
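As far as I know, bulk load writes HFiles directly and moves them into the table, so neither the WAL nor the client-side write buffer is involved; autoFlush only matters when you ingest with Puts. Here is a rough sketch of the Put path that Lars describes below, against the 0.94 client API (the table name "benchmark_table" and the family "cf" are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutIngest {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "benchmark_table");

        // Buffer Puts client-side instead of sending one RPC per Put.
        table.setAutoFlush(false);
        table.setWriteBufferSize(8 * 1024 * 1024); // 8 MB

        for (int i = 0; i < 1000000; i++) {
          Put put = new Put(Bytes.toBytes(String.format("row%07d", i)));
          put.setWriteToWAL(false); // skip the WAL for this Put
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
          table.put(put);
        }
        table.flushCommits(); // send whatever is still buffered
        table.close();

        // The WAL was skipped, so flush the memstores to disk after the ingest.
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.flush("benchmark_table");
        admin.close();
      }
    }

With bulk load none of this applies: the MapReduce job writes HFiles and completebulkload just moves them into the region directories, so there is nothing for autoFlush or the WAL to do.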
I could try to do the loads without bulkLoad, but I don't think that's the problem; maybe it's just the time the cluster needs, although it seems like too much time.

2014-04-14 22:51 GMT+02:00 lars hofhansl <[email protected]>:

> +1 to what Vladimir said.
> For the Puts in question you can also disable the write ahead log (WAL)
> and issue a flush on the table after your ingest.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Vladimir Rodionov <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Sent: Monday, April 14, 2014 11:15 AM
> Subject: RE: How to generate a large dataset quickly.
>
> There is no need to run M/R unless your cluster is large (very large).
> A single multithreaded client can easily ingest tens of thousands of rows
> per sec. Check the YCSB benchmark tool, for example.
>
> Make sure you disable both region splitting and major compaction during
> data ingestion, and pre-split regions accordingly to improve overall
> performance.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
>
> ________________________________________
>
> From: Ted Yu [[email protected]]
> Sent: Monday, April 14, 2014 9:16 AM
> To: [email protected]
> Subject: Re: How to generate a large dataset quickly.
>
> I looked at the revision history for HFileOutputFormat.java.
> There was one patch, HBASE-8949, which went into 0.94.11, but it shouldn't
> affect throughput much.
>
> If you can use ganglia (or some similar tool) to pinpoint what caused the
> low ingest rate, that would give us more clues.
>
> BTW, is upgrading to a newer release, such as 0.98.1 (which contains
> HBASE-8755), an option for you?
>
> Cheers
>
>
> On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <[email protected]>wrote:
>
> > I'm using 0.94.6-cdh4.4.0.
> >
> > I use the bulkload:
> > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> > FileOutputFormat.setOutputPath(job, hbasePath);
> > HTable table = new HTable(jConf, HBASE_TABLE);
> > HFileOutputFormat.configureIncrementalLoad(job, table);
> >
> > It seems that it takes a really long time when it starts to execute the
> > Puts to HBase in the reduce phase.
> >
> >
> >
> > 2014-04-14 14:35 GMT+02:00 Ted Yu <[email protected]>:
> >
> > > Which HBase release did you run the MapReduce job on?
> > >
> > > Cheers
> > >
> > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <[email protected]>
> > wrote:
> > >
> > > > I want to create a large dataset for HBase with different versions
> > > > and numbers of rows. It's about 10M rows and 100 versions, to do
> > > > some benchmarks.
> > > >
> > > > What's the fastest way to create it? I'm generating the dataset with
> > > > a MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes
> > > > and the size is around 7 GB. I don't know if I could do it more
> > > > quickly. The bottleneck is when the MapReduce writes the output and
> > > > when it transfers the output to the Reducers.
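And here is a sketch of the pre-splitting that Vladimir recommends above, again against the 0.94 admin API; the table name, family, key range, and region count are invented for the example. Disabling time-based major compaction is a cluster setting (hbase.hregion.majorcompaction = 0 in hbase-site.xml), and raising the table's max file size is one way to keep regions from splitting mid-ingest:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("benchmark_table");
        desc.addFamily(new HColumnDescriptor("cf"));
        // A very large max file size makes mid-ingest splits unlikely.
        desc.setMaxFileSize(100L * 1024 * 1024 * 1024); // 100 GB

        // Pre-create 20 regions spread over the expected key range so the
        // ingest load is distributed across region servers from the start.
        admin.createTable(desc,
            Bytes.toBytes("row0000000"),
            Bytes.toBytes("row9999999"),
            20);
        admin.close();
      }
    }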
