On Sat, Jan 10, 2015 at 9:08 AM, Rama Ramani <[email protected]> wrote:
> I am looking for a way to avoid regionserver hotspotting while doing a
> bulk load. My input files to ImportTsv are extracted from a relational
> store and have monotonically increasing IDs.
>
> Alternatively, is there a way for ImportTsv to generate its own row key
> (one that does not increase monotonically) and load the column data from
> the input files? If there is no option to bulk load with this tool and
> spread the load, I will just write code to generate the row key and use
> the HBase API for loading. Just wanted to confirm with the experts on
> this DL.
>
> Thanks

You could write out hfiles and do a bulk import of these? See
http://hbase.apache.org/book.html#d0e8022  The writing of the hfiles will
not suffer 'hotspotting'. (A command-line sketch of this two-step flow is
at the foot of this mail.)

Else, subclass the TsvImporterMapper map function, doctor the RDBMS seqid
key by adding a prefix ('salting') or by reversing or hashing it, etc.,
and then specify your customization as the mapper for ImportTsv to use.
(A rough mapper sketch is also appended below.)

St.Ack

> From: Ted Yu
> Sent: Friday, January 9, 2015 2:14 PM
> To: [email protected]
>
> Salted buckets seem to be a concept from other projects, such as
> Phoenix.
>
> Can you be a bit more specific about your requirement?
>
> Cheers
>
> On Fri, Jan 9, 2015 at 12:53 PM, Rama Ramani <[email protected]> wrote:
>
> > Is there a way to specify salted buckets with HBase ImportTsv while
> > doing a bulk load?
> >
> > Thanks
> > Rama
> >
> > From: [email protected]
> > To: [email protected]
> > Subject: RE: HBase - bulk loading files
> > Date: Fri, 19 Dec 2014 14:09:09 -0800
> >
> > HBase 0.98.0.2.1.9.0-2196-hadoop2
> > Hadoop 2.4.0.2.1.9.0-2196
> > Subversion [email protected]:hortonworks/hadoop-monarch.git -r
> > cb50542bc92fb77dee52
> >
> > No, the clusters were not taking additional load.
> >
> > Thanks
> > Rama
> >
> > > Date: Fri, 19 Dec 2014 13:50:30 -0800
> > > Subject: Re: HBase - bulk loading files
> > > From: [email protected]
> > > To: [email protected]
> > >
> > > Can you let us know the HBase and Hadoop versions you're using?
> > >
> > > Were the clusters taking load from other sources when ImportTsv
> > > was running?
> > >
> > > Cheers
> > >
> > > On Fri, Dec 19, 2014 at 1:43 PM, Rama Ramani
> > > <[email protected]> wrote:
> > >
> > > > Hello, I am bulk loading a set of files (about 400MB each) with
> > > > "|" as the delimiter using ImportTsv. It takes a long time for
> > > > the 'map' job to complete on both a 4-node and a 16-node
> > > > cluster. I tried the option to generate bulk output (providing
> > > > -Dimporttsv.bulk.output), and that also took a long time,
> > > > indicating that the generation of the output files itself needs
> > > > improvement. I am seeing about 8000 rows/sec for this dataset;
> > > > the 400MB ingestion takes about 5-6 mins. How can I improve
> > > > this? Is there an alternate tool I can use?
> > > >
> > > > Thanks
> > > > Rama
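P.S. A minimal sketch of the two-step bulk load flow, assuming a table
named 'mytable' with a single family 'cf' and pipe-delimited input under
hdfs:///data/input; the table name, column mapping, and paths are
placeholders, so adjust to taste and check the invocations against your
release:

  # Step 1: have ImportTsv write HFiles instead of doing live puts.
  hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    '-Dimporttsv.separator=|' \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
    -Dimporttsv.bulk.output=hdfs:///tmp/bulkout \
    mytable hdfs:///data/input

  # Step 2: move the finished HFiles into the table's regions.
  hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
    hdfs:///tmp/bulkout mytable

Note this sidesteps hotspotting at write time, but if the keys stay
monotonic all the data still lands in one region; that is what the
salting sketch below addresses.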

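P.P.S. And a rough, untested sketch of the custom-mapper route against
the 0.98-era API. The class name is mine, it assumes the row key is the
first '|'-separated field, and it assumes the target table is presplit
on the two-digit salt prefixes "00" through "15":

  import java.io.IOException;

  import org.apache.hadoop.hbase.mapreduce.TsvImporterMapper;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;

  public class SaltingTsvMapper extends TsvImporterMapper {
    // Match this to the number of presplit regions in the target table.
    private static final int BUCKETS = 16;

    @Override
    public void map(LongWritable offset, Text value, Context context)
        throws IOException {
      String line = value.toString();
      int sep = line.indexOf('|');
      if (sep > 0) {
        String id = line.substring(0, sep);
        // Derive a stable two-digit salt from the id; reversing or
        // hashing the whole id would spread the load just as well.
        int bucket = (id.hashCode() & Integer.MAX_VALUE) % BUCKETS;
        line = String.format("%02d-%s%s", bucket, id, line.substring(sep));
      }
      value.set(line);
      // Let the stock mapper parse the doctored line and emit the Put.
      super.map(offset, value, context);
    }
  }

Hand it to ImportTsv with -Dimporttsv.mapper.class=<fully.qualified.Name>
and ship the jar on the job classpath. Keep in mind that readers must
know the salting scheme too: a get by the original id becomes up to 16
candidate keys, and range scans over the original id ordering are gone.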