aah, Ok, thank you :)

On Sun, Jun 6, 2010 at 12:40 PM, Jonathan Gray <[email protected]> wrote:

> TableOutputFormat does batching of writes under the hood so it's basically
> doing the same thing.
>
> > -----Original Message-----
> > From: Raghava Mutharaju [mailto:[email protected]]
> > Sent: Saturday, June 05, 2010 4:22 PM
> > To: [email protected]
> > Subject: Re: performance consideration when writing to HBase from MR
> > job
> >
> > Hi Amandeep,
> >
> > Thank you for the reply. I was using the HBase API directly in the
> > mapper (there is no reducer). I thought that instead of writing out each
> > row (using the context), a batch write - table.put(List) - would be
> > quicker. Going by what you said, I guess the difference won't be much.
> >
> > Regards,
> > Raghava.
> >
> > On Sat, Jun 5, 2010 at 7:01 PM, Amandeep Khurana <[email protected]>
> > wrote:
> >
> > > >
> > > > a) all the Puts are collected in the Reduce or the Map (if there is
> > > > no reduce) and a batch write is done
> > > > b) writing out each <K,V> pair using context.write(k, v)
> > > >
> > > > If a) is done instead of b), then wouldn't there be a violation of
> > > > semantics w.r.t. KEYOUT and VALUEOUT (because <K,V> is not being
> > > > output)? Is this OK?
> > > >
> > >
> > > 1. If you can write from the mapper, you avoid the shuffle-and-sort
> > > overhead between the map and reduce phases.
> > > 2. It would not make much difference whether you use the HBase API
> > > directly in the mapper/reducer to write to the table, or write out to
> > > the context and use one of the output formats that writes to the table.
> > > However, if you plan to use the bulkload utility (the HBASE-48 jira),
> > > you will get much better performance than using the HBase API directly.
> > > Regarding the semantics - no, there would not be a problem as long as
> > > you create your Puts properly.
> > >
> > > -Amandeep
> > >
>
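The batched-write pattern discussed in this thread (collecting Puts and sending them with table.put(List) instead of one write per record) can be sketched as follows. This is a minimal illustration: `Put` and `Table` here are stand-in classes for the real `org.apache.hadoop.hbase.client` types, which are assumed rather than shown, and the batch size of 1000 is an arbitrary example value.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedWrites {

    // Stand-in for org.apache.hadoop.hbase.client.Put (one row mutation).
    static final class Put {
        final String rowKey;
        Put(String rowKey) { this.rowKey = rowKey; }
    }

    // Stand-in for an HBase table; mirrors HTable.put(List<Put>),
    // which sends a whole batch of mutations in one call.
    static final class Table {
        int batchesSent = 0;
        int rowsWritten = 0;
        void put(List<Put> puts) {
            batchesSent++;
            rowsWritten += puts.size();
        }
    }

    static final int BATCH_SIZE = 1000; // example threshold, not a recommendation

    // Buffer puts and flush whenever the batch fills; in a real mapper
    // the final non-empty buffer would be flushed in cleanup().
    static void writeAll(Table table, List<Put> rows) {
        List<Put> buffer = new ArrayList<>();
        for (Put p : rows) {
            buffer.add(p);
            if (buffer.size() >= BATCH_SIZE) {
                table.put(buffer);
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            table.put(buffer); // flush the tail
        }
    }

    public static void main(String[] args) {
        Table table = new Table();
        List<Put> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            rows.add(new Put("row-" + i));
        }
        writeAll(table, rows);
        // 2500 rows at a batch size of 1000 -> batches of 1000, 1000, 500
        System.out.println(table.batchesSent + " batches, "
                + table.rowsWritten + " rows");
    }
}
```

As the thread notes, TableOutputFormat already buffers writes much like this internally, which is why hand-rolled batching in the mapper rarely changes much.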
