TableOutputFormat does batching of writes under the hood, so it's effectively doing the same thing.
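For context, the batching in question is the HBase client's write buffer: with auto-flush disabled, puts accumulate client-side and are shipped to the region servers in batches rather than one RPC per row, which is what TableOutputFormat relies on. Here is a toy, HBase-free sketch of that buffering behavior (the class name and threshold are made up for illustration; this is not the real HTable):

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for HBase's client-side write buffer (NOT the real HTable):
// puts accumulate locally and are shipped in one batch once the buffer
// fills, which is what TableOutputFormat does under the hood.
public class BufferedWriter {
    private final int bufferSize;              // hypothetical flush threshold
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;                   // batch "RPCs" sent so far

    public BufferedWriter(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    // Analogous to table.put(put) with auto-flush disabled.
    public void put(String row) {
        buffer.add(row);
        if (buffer.size() >= bufferSize) {
            flush();
        }
    }

    // Analogous to flushing commits: ship the whole buffer at once.
    public void flush() {
        if (!buffer.isEmpty()) {
            flushes++;                         // one round trip per batch
            buffer.clear();
        }
    }

    public int getFlushes() { return flushes; }

    public static void main(String[] args) {
        BufferedWriter w = new BufferedWriter(100);
        for (int i = 0; i < 1000; i++) {
            w.put("row-" + i);                 // 1000 single puts...
        }
        w.flush();                             // ...become 10 batched sends
        System.out.println("batches sent: " + w.getFlushes());
    }
}
```

So whether you call the table API yourself or go through TableOutputFormat, the rows end up batched the same way.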
> -----Original Message-----
> From: Raghava Mutharaju [mailto:[email protected]]
> Sent: Saturday, June 05, 2010 4:22 PM
> To: [email protected]
> Subject: Re: performance consideration when writing to HBase from MR job
>
> Hi Amandeep,
>
> Thank you for the reply. I was using the HBase API directly in the mapper
> (there is no reducer). I thought that instead of writing out each row
> (using context), it would be quicker to do a batch write - table.put(List).
> Going by what you said, I guess the difference won't be much.
>
> Regards,
> Raghava.
>
> On Sat, Jun 5, 2010 at 7:01 PM, Amandeep Khurana <[email protected]> wrote:
>
> > > a) all the Puts are collected in Reduce or Map (if there is no reduce)
> > > and a batch write is done
> > > b) writing out each <K,V> pair using context.write(k, v)
> > >
> > > If a) is considered instead of b), then wouldn't there be a violation
> > > of semantics w.r.t KEYOUT, VALUEOUT (because <K, V> is not being
> > > output)? Is this OK?
> >
> > 1. If you can write from the mapper, you avoid the overhead caused by
> > shuffling and sorting between the map and reduce phases.
> > 2. It would not make much difference if you use the HBase API directly
> > in the mapper/reducer to write to the table instead of writing out to
> > the context and using one of the output formats that writes to the
> > table. However, if you plan to use the bulkload utility (HBASE-48 jira),
> > you will get much better performance than using the HBase API directly.
> > Regarding the semantics - no, there would not be a problem as long as
> > you create your Puts properly.
> >
> > -Amandeep
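Option (a) from the thread - collect rows in map() and issue one batched write per task from cleanup() - can be sketched as follows. TinyTable is a made-up stand-in for the HBase table so the example runs without a cluster; with real HBase you would hold the Puts in a List and call table.put(List<Put>) in cleanup() instead:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of option (a): buffer rows in the mapper, issue one batched
// write per task. TinyTable is a hypothetical stand-in for HTable.
public class BatchingMapperSketch {

    interface TinyTable {
        void put(List<String> rows);   // stands in for HTable.put(List<Put>)
    }

    static class Mapper {
        private final List<String> pending = new ArrayList<>();
        private final TinyTable table;

        Mapper(TinyTable table) { this.table = table; }

        // map(): collect the row instead of calling context.write(k, v)
        void map(String row) {
            pending.add(row);
        }

        // cleanup(): one batch write when the task finishes
        void cleanup() {
            table.put(pending);
            pending.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        Mapper m = new Mapper(rows -> batchSizes.add(rows.size()));
        for (int i = 0; i < 500; i++) {
            m.map("row-" + i);         // 500 map() calls, no writes yet
        }
        m.cleanup();                   // a single batch of 500 rows
        System.out.println("batch writes: " + batchSizes.size()
                + ", rows in batch: " + batchSizes.get(0));
    }
}
```

As Amandeep notes, since the key/value pairs never go through context.write, the declared KEYOUT/VALUEOUT types are simply unused (NullWritable in practice) - there is no semantic violation as long as the Puts themselves are well-formed.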
