But if I'm using bulkLoad, I think that method bypasses the WAL, right? And I'm not sure about autoFlush: do I still need to set it to false, or does the bulk load take care of that as well?
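As far as I know, bulk load writes HFiles directly and moves them into the table, so neither the WAL nor the client-side write buffer is involved; autoFlush only matters when you ingest with Puts. Here is a rough sketch of the Put path that Lars describes below, against the 0.94 client API (the table name "benchmark_table" and the family "cf" are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutIngest {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "benchmark_table");

        // Buffer Puts client-side instead of sending one RPC per Put.
        table.setAutoFlush(false);
        table.setWriteBufferSize(8 * 1024 * 1024); // 8 MB

        for (int i = 0; i < 1000000; i++) {
          Put put = new Put(Bytes.toBytes(String.format("row%07d", i)));
          put.setWriteToWAL(false); // skip the WAL for this Put
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
          table.put(put);
        }
        table.flushCommits(); // send whatever is still buffered
        table.close();

        // The WAL was skipped, so flush the memstores to disk after the ingest.
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.flush("benchmark_table");
        admin.close();
      }
    }

With bulk load none of this applies: the MapReduce job writes HFiles and completebulkload just moves them into the region directories, so there is nothing for autoFlush or the WAL to do.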
I could try to do the loads without bulkLoad, but I don't think that's the problem; maybe it's just the time the cluster needs, although it seems like too much time.

2014-04-14 22:51 GMT+02:00 lars hofhansl <[email protected]>:

> +1 to what Vladimir said.
> For the Puts in question you can also disable the write ahead log (WAL)
> and issue a flush on the table after your ingest.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Vladimir Rodionov <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Sent: Monday, April 14, 2014 11:15 AM
> Subject: RE: How to generate a large dataset quickly.
>
> There is no need to run M/R unless your cluster is large (very large).
> A single multithreaded client can easily ingest tens of thousands of rows
> per sec. Check the YCSB benchmark tool, for example.
>
> Make sure you disable both region splitting and major compaction during
> data ingestion, and pre-split regions accordingly to improve overall
> performance.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
>
> ________________________________________
>
> From: Ted Yu [[email protected]]
> Sent: Monday, April 14, 2014 9:16 AM
> To: [email protected]
> Subject: Re: How to generate a large dataset quickly.
>
> I looked at the revision history for HFileOutputFormat.java.
> There was one patch, HBASE-8949, which went into 0.94.11, but it shouldn't
> affect throughput much.
>
> If you can use ganglia (or some similar tool) to pinpoint what caused the
> low ingest rate, that would give us more clues.
>
> BTW, is upgrading to a newer release, such as 0.98.1 (which contains
> HBASE-8755), an option for you?
>
> Cheers
>
>
> On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <[email protected]>wrote:
>
> > I'm using 0.94.6-cdh4.4.0.
> >
> > I use the bulkload:
> > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> > FileOutputFormat.setOutputPath(job, hbasePath);
> > HTable table = new HTable(jConf, HBASE_TABLE);
> > HFileOutputFormat.configureIncrementalLoad(job, table);
> >
> > It seems that it takes a really long time when it starts to execute the
> > Puts to HBase in the reduce phase.
> >
> >
> >
> > 2014-04-14 14:35 GMT+02:00 Ted Yu <[email protected]>:
> >
> > > Which HBase release did you run the MapReduce job on?
> > >
> > > Cheers
> > >
> > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <[email protected]>
> > wrote:
> > >
> > > > I want to create a large dataset for HBase with different versions
> > > > and numbers of rows. It's about 10M rows and 100 versions, to do
> > > > some benchmarks.
> > > >
> > > > What's the fastest way to create it? I'm generating the dataset with
> > > > a MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes
> > > > and the size is around 7 GB. I don't know if I could do it more
> > > > quickly. The bottleneck is when the MapReduce writes the output and
> > > > when it transfers the output to the Reducers.
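And here is a sketch of the pre-splitting that Vladimir recommends above, again against the 0.94 admin API; the table name, family, key range, and region count are invented for the example. Disabling time-based major compaction is a cluster setting (hbase.hregion.majorcompaction = 0 in hbase-site.xml), and raising the table's max file size is one way to keep regions from splitting mid-ingest:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("benchmark_table");
        desc.addFamily(new HColumnDescriptor("cf"));
        // A very large max file size makes mid-ingest splits unlikely.
        desc.setMaxFileSize(100L * 1024 * 1024 * 1024); // 100 GB

        // Pre-create 20 regions spread over the expected key range so the
        // ingest load is distributed across region servers from the start.
        admin.createTable(desc,
            Bytes.toBytes("row0000000"),
            Bytes.toBytes("row9999999"),
            20);
        admin.close();
      }
    }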
