Yeah, we never used the HBase client API (Puts) for loading a batch of millions of records. Can you tell me where, by default, the o/p HFile(s) from the MR job are stored in HDFS?
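For reference, a minimal command-line sketch of the flow the thread describes (table name and HDFS paths here are hypothetical, and the exact invocation varies by HBase version): when you pass `-Dimporttsv.bulk.output=<dir>`, ImportTsv writes its HFiles to that HDFS directory instead of issuing Puts, and the bulk-load step then moves those files into the table's region store directories.

```shell
# Hypothetical table/paths; requires a running Hadoop + HBase cluster.
# Step 1: MR job writes HFiles to the given HDFS directory (no Puts, no WAL):
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,f1:c1 \
  -Dimporttsv.bulk.output=/user/anil/bulk-hfiles \
  mytable /user/anil/input.csv

# Step 2: move the generated HFiles under the table's region stores:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  /user/anil/bulk-hfiles mytable
```

Without `-Dimporttsv.bulk.output`, ImportTsv instead writes directly to the table with normal Puts, which is the slower path discussed below.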
On Tue, Oct 23, 2012 at 11:31 PM, Anoop John <[email protected]> wrote:

> I think, as per your explanation of the need for a unique id, it is okay.
> No need to worry about data loss. As long as you can make sure you generate
> a unique id, things are fine. MR will make sure it runs the job on the whole
> data, and the o/p is persisted in a file. Yes, this file is HFile(s) only.
> Then, finally, the HBase cluster is used for loading the HFiles into the
> region stores. Bulk loading huge data this way will be much, much faster
> than normal put()s.
>
> -Anoop-
>
> On Wed, Oct 24, 2012 at 11:44 AM, anil gupta <[email protected]> wrote:
>
> > Anoop: "Only thing is that some mappers crashed.. so then the MR
> > framework will run that mapper again on the same data set.. Then the
> > unique id will be different?"
> >
> > Anil: Yes, for the same dataset the UniqueId will also be different.
> > The UniqueID does not depend on the data.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Tue, Oct 23, 2012 at 11:07 PM, Anoop John <[email protected]> wrote:
> >
> > > > Is there a way that I can explicitly turn on WAL for bulk loading?
> > > No.
> > > How do you generate the unique id? Remember that the initial steps
> > > won't need the HBase cluster at all. MR generates the HFiles, and the
> > > o/p will be in files only. Mappers will also write their o/p to files.
> > > The only thing is that some mappers crashed.. so then the MR framework
> > > will run that mapper again on the same data set.. Then the unique id
> > > will be different? I think you don't need to worry about data loss
> > > from the HBase side, so the WAL is not required.
> > >
> > > -Anoop-
> > >
> > > On Wed, Oct 24, 2012 at 10:58 AM, anil gupta <[email protected]> wrote:
> > >
> > > > That's a very interesting fact. You made it clear, but my custom
> > > > bulk loader generates a unique ID for every row in the map phase.
> > > > So, not all my data is in csv or text. Is there a way that I can
> > > > explicitly turn on WAL for bulk loading?
> > > >
> > > > On Tue, Oct 23, 2012 at 10:14 PM, Anoop John <[email protected]> wrote:
> > > >
> > > > > Hi Anil,
> > > > > In case of bulk loading, it is not like data is put into HBase one
> > > > > by one. The MR job will create an o/p like an HFile: it will
> > > > > create the KVs and write them to a file in order, just as an HFile
> > > > > would look. Then the file is finally loaded into HBase. Only for
> > > > > this final step is the HBase RS used, so there is no point in a
> > > > > WAL there... Am I making it clear for you? The data is already
> > > > > present in the form of raw data in some txt or csv file :)
> > > > >
> > > > > -Anoop-
> > > > >
> > > > > On Wed, Oct 24, 2012 at 10:41 AM, Anoop John <[email protected]> wrote:
> > > > >
> > > > > > Hi Anil
> > > > > >
> > > > > > On Wed, Oct 24, 2012 at 10:39 AM, anil gupta <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Anoop,
> > > > > > >
> > > > > > > As per your last email, did you mean that the WAL is not used
> > > > > > > while using the HBase Bulk Loader? If yes, then how do we
> > > > > > > ensure "no data loss" in case of a RegionServer failure?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Anil Gupta
> > > > > > >
> > > > > > > On Tue, Oct 23, 2012 at 9:55 PM, ramkrishna vasudevan <[email protected]> wrote:
> > > > > > >
> > > > > > > > As Kevin suggested, we can make use of a bulk load that goes
> > > > > > > > through the WAL and Memstore. Or the second option will be
> > > > > > > > to use the o/p of the mappers to create HFiles directly.
> > > > > > > >
> > > > > > > > Regards
> > > > > > > > Ram
> > > > > > > >
> > > > > > > > On Wed, Oct 24, 2012 at 8:59 AM, Anoop John <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > > Using the ImportTSV tool, you are trying to bulk load your
> > > > > > > > > data. Can you see and tell how many mappers and reducers
> > > > > > > > > there were? Out of the total time, what is the time taken
> > > > > > > > > by the mapper phase and by the reducer phase? It seems
> > > > > > > > > like an MR-related issue (maybe some conf issue). In this
> > > > > > > > > bulk load case, most of the work is done by the MR job: it
> > > > > > > > > will read the raw data, convert it into Puts, and write
> > > > > > > > > HFiles. The MR o/p is HFiles itself. The next part in
> > > > > > > > > ImportTSV will just put the HFiles under the table region
> > > > > > > > > store. There won't be any WAL usage in this bulk load.
> > > > > > > > >
> > > > > > > > > -Anoop-
> > > > > > > > >
> > > > > > > > > On Tue, Oct 23, 2012 at 9:18 PM, Nick maillard <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > I'm starting with HBase and testing for our needs. I
> > > > > > > > > > have set up a Hadoop cluster of three machines and an
> > > > > > > > > > HBase cluster atop the same three machines: one master,
> > > > > > > > > > two slaves.
> > > > > > > > > >
> > > > > > > > > > I am testing the import of a 5 GB csv file with the
> > > > > > > > > > importTsv tool. I import the file into HDFS and use the
> > > > > > > > > > importTsv tool to import it into HBase.
> > > > > > > > > >
> > > > > > > > > > Right now it takes a little over an hour to complete. It
> > > > > > > > > > creates around 2 million entries in one table with a
> > > > > > > > > > single family. If I use bulk uploading, it goes down to
> > > > > > > > > > 20 minutes.
> > > > > > > > > >
> > > > > > > > > > My Hadoop has 21 map tasks, but they all seem to be
> > > > > > > > > > taking a very long time to finish; many tasks end up
> > > > > > > > > > timing out.
> > > > > > > > > >
> > > > > > > > > > I am wondering what I have missed in my configuration. I
> > > > > > > > > > have followed the different prerequisites in the
> > > > > > > > > > documentation, but I am really unsure as to what is
> > > > > > > > > > causing this slowdown. If I were to apply the wordcount
> > > > > > > > > > example to the same file, it takes only minutes to
> > > > > > > > > > complete, so I am guessing the issue lies in my HBase
> > > > > > > > > > configuration.
> > > > > > > > > >
> > > > > > > > > > Any help or pointers would be appreciated.
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Regards,
> > > > > > > Anil Gupta
> > > >
> > > > --
> > > > Thanks & Regards,
> > > > Anil Gupta
> >
> > --
> > Thanks & Regards,
> > Anil Gupta

--
Thanks & Regards,
Anil Gupta
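An aside on the unique-id point in the thread above: an id generated independently of the row contents (e.g. a random UUID per map call) will come out different when the MR framework re-runs a failed mapper over the same split, whereas an id derived from the row bytes is stable across task retries. A minimal, HBase-free illustration of that difference (the class name, helper, and sample record below are made up for the example):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.UUID;

public class RowIdDemo {
    // Hypothetical helper: a deterministic row key derived from the raw
    // record bytes (hex-encoded SHA-256), stable across mapper re-runs.
    static String dataDerivedId(String record) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(record.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String record = "2012-10-23,anil,42"; // one sample CSV row

        // A re-run mapper sees the same record, so data-derived ids match:
        System.out.println(dataDerivedId(record).equals(dataDerivedId(record)));
        // prints "true"

        // ...but two UUID.randomUUID() calls (one per task attempt) differ:
        System.out.println(UUID.randomUUID().equals(UUID.randomUUID()));
        // prints "false"
    }
}
```

So if exactly-once row keys matter, deriving the key from the data (or from a stable attempt-independent input offset) sidesteps the re-run issue Anoop raises; a purely random id relies on the job succeeding in one pass per split.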
