I am looking for a way to avoid regionserver hotspotting while doing a bulk 
load. My input files to ImportTsv are extracted from a relational store and 
have monotonically increasing IDs.


Alternatively, is there a way for ImportTsv to generate its own row key (one 
that does not increase monotonically) and load the column data from the input 
files? If there are no options to bulk load using this tool and spread the 
load, then I will just write code to generate the row key and use the HBase 
API for loading. Just wanted to confirm with the experts on this DL.
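In case it clarifies what I mean by spreading the load, here is a minimal sketch of the row-key salting I would write myself (Python for illustration only; the function name, bucket count, and key layout are my own choices, not anything ImportTsv provides):

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical bucket count; tune to the number of regions


def salted_row_key(row_id: int) -> bytes:
    """Prefix a monotonically increasing ID with a stable hash-based salt.

    The salt spreads consecutive IDs across NUM_BUCKETS key ranges, so
    writes land on different regions instead of hotspotting the last one.
    Because the salt is derived from the ID itself, a reader can recompute
    the full key from the ID alone.
    """
    raw = str(row_id).encode("utf-8")
    # One-byte salt from a hash of the ID; modulo keeps it in [0, NUM_BUCKETS).
    salt = hashlib.md5(raw).digest()[0] % NUM_BUCKETS
    return bytes([salt]) + b"|" + raw


# Example: compute salted keys for a few consecutive IDs.
keys = [salted_row_key(i) for i in (1000, 1001, 1002)]
```

The table would need to be pre-split on the salt prefixes for this to spread writes; scans then have to fan out across all buckets, which is the usual trade-off with salting.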


Thanks


From: Ted Yu
Sent: Friday, January 9, 2015 2:14 PM
To: [email protected]

Salted buckets seem to be a concept from other projects, such as Phoenix.

Can you be a bit more specific about your requirement?

Cheers

On Fri, Jan 9, 2015 at 12:53 PM, Rama Ramani <[email protected]> wrote:

> Is there a way to specify Salted buckets with HBase ImportTsv while doing
> bulk load?
>
> Thanks
> Rama
>
> From: [email protected]
> To: [email protected]
> Subject: RE: HBase - bulk loading files
> Date: Fri, 19 Dec 2014 14:09:09 -0800
>
>
>
>
> HBase 0.98.0.2.1.9.0-2196-hadoop2
> Hadoop 2.4.0.2.1.9.0-2196
> Subversion [email protected]:hortonworks/hadoop-monarch.git -r cb50542bc92fb77dee52
>
> No, the clusters were not taking additional load.
>
> Thanks
> Rama
> > Date: Fri, 19 Dec 2014 13:50:30 -0800
> > Subject: Re: HBase - bulk loading files
> > From: [email protected]
> > To: [email protected]
> >
> > Can you let us know the HBase and Hadoop versions you're using?
> >
> > Were the clusters taking load from other sources when ImportTsv was
> > running?
> >
> > Cheers
> >
> > On Fri, Dec 19, 2014 at 1:43 PM, Rama Ramani <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I am bulk loading a set of files (about 400MB each) with "|" as the
> > > delimiter using ImportTsv. It takes a long time for the 'map' job to
> > > complete on both a 4 node and a 16 node cluster. I tried the option
> > > to generate the output (providing -Dimporttsv.bulk.output), which also
> > > took time, indicating that the generation of the output files needs
> > > improvement. I am seeing about 8000 rows/sec for this dataset; the
> > > 400MB ingestion takes about 5-6 mins. How can I improve this? Is there
> > > an alternate tool I can use?
> > >
> > > Thanks
> > > Rama