re: "So, I execute 3.2Mill of Put¹s in HBase." There will be 3.2 million Puts, but they won¹t be sent over 1 at a time if autoFlush on Htable is false. By default, htable should be using a 2mb write buffer, and then it groups the Puts by RegionServer.
On 4/14/14, 2:21 PM, "Guillermo Ortiz" <[email protected]> wrote:

>Are there any benchmarks about how long it could take to insert data into
>HBase, to have a reference?
>The output of my Mapper has 3.2 million records. So, I execute 3.2 million
>Puts in HBase.
>
>Well, data has to be copied and sent to the reducers, but with a 1Gb
>network it shouldn't take too much time. I'll check Ganglia.
>
>
>2014-04-14 18:16 GMT+02:00 Ted Yu <[email protected]>:
>
>> I looked at the revision history for HFileOutputFormat.java.
>> There was one patch, HBASE-8949, which went into 0.94.11, but it
>> shouldn't affect throughput much.
>>
>> If you can use ganglia (or some similar tool) to pinpoint what caused
>> the low ingest rate, that would give us more clues.
>>
>> BTW Is upgrading to a newer release, such as 0.98.1 (which contains
>> HBASE-8755), an option for you?
>>
>> Cheers
>>
>>
>> On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <[email protected]>
>> wrote:
>>
>> > I'm using 0.94.6-cdh4.4.0.
>> >
>> > I use the bulk load:
>> > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
>> > FileOutputFormat.setOutputPath(job, hbasePath);
>> > HTable table = new HTable(jConf, HBASE_TABLE);
>> > HFileOutputFormat.configureIncrementalLoad(job, table);
>> >
>> > It seems that it takes a really long time when it starts to execute
>> > the Puts to HBase in the reduce phase.
>> >
>> >
>> >
>> > 2014-04-14 14:35 GMT+02:00 Ted Yu <[email protected]>:
>> >
>> > > Which hbase release did you run the mapreduce job on?
>> > >
>> > > Cheers
>> > >
>> > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <[email protected]>
>> > > wrote:
>> > >
>> > > > I want to create a large dataset for HBase with different versions
>> > > > and numbers of rows. It's about 10M rows and 100 versions, to do
>> > > > some benchmarks.
>> > > >
>> > > > What's the fastest way to create it? I'm generating the dataset
>> > > > with a MapReduce job of 100,000 rows and 10 versions. It takes 17
>> > > > minutes and the size is around 7GB. I don't know if I could do it
>> > > > more quickly. The bottleneck is when the MapReduce job writes the
>> > > > output and when it transfers the output to the Reducers.
>> >
>>
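One note on the quoted setup: with HFileOutputFormat.configureIncrementalLoad, the reducers write HFiles rather than sending Puts to the cluster, and there is a second step to move those files into the table. A minimal sketch of that completion step against the 0.94 API (the table name and output path are placeholders, assumed to match the hbasePath the quoted job wrote to):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test_table"); // placeholder table name

        // After the MR job finishes, move the generated HFiles into the
        // table's regions; no Puts are sent through the write path.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path("/tmp/hbase-output"), table); // placeholder path

        table.close();
    }
}

The same step can also be run from the command line via the completebulkload tool.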
