Hi Amandeep,

Thank you for the reply. I was using the HBase API directly in the mapper (there is no reducer). I thought that instead of writing out each row via the context, it would be quicker to do a batch write with table.put(List). Going by what you said, I guess the difference won't be much.
Regards,
Raghava.

On Sat, Jun 5, 2010 at 7:01 PM, Amandeep Khurana <[email protected]> wrote:

> > a) all the Puts are collected in Reduce or Map (if there is no reduce) and
> > a batch write is done
> > b) writing out each <K,V> pair using context.write(k, v)
> >
> > If a) is considered instead of b) then wouldn't there be a violation of
> > semantics w.r.t KEYOUT, VALUEOUT (because <K, V> is not being output)? Is
> > this OK?
>
> 1. If you can write from the mapper, you would avoid the overhead caused due
> to shuffling and sorting between the map and reduce phases.
> 2. It would not make much difference if you are using the HBase API directly
> in the mapper/reducer to write to the table instead of writing out to the
> context and using one of the output formats that writes to the table.
> However, if you plan to use the bulkload utility (HBASE-48 jira), you will
> get much better performance than using the HBase API directly.
>
> Regarding the semantics - no, there would not be a problem as long as you
> create your Puts properly.
>
> -Amandeep
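For anyone reading the archive: the batched-write pattern discussed above can be sketched as follows. This is a minimal, self-contained illustration of buffering rows in the mapper and flushing them in one call, mirroring HBase's table.put(List<Put>). The TableStub class, the batch size of 100, and the method names here are assumptions for illustration only; in a real job you would open an HTable in the mapper's setup(), buffer real Put objects, and flush the final partial batch from cleanup().

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batched-write pattern: buffer rows in the mapper and flush
// them in one call, instead of writing each row individually. The HBase
// client is stubbed out here (hypothetical TableStub); a real job would hold
// an HTable opened in setup() and close/flush it in cleanup().
public class BatchedPutSketch {

    // Stand-in for the HBase table: counts batch calls and rows written.
    static class TableStub {
        int batchCalls = 0;
        int rowsWritten = 0;

        // Mirrors table.put(List<Put>): one call carries many rows.
        void put(List<String> puts) {
            batchCalls++;
            rowsWritten += puts.size();
        }
    }

    private final TableStub table;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    BatchedPutSketch(TableStub table, int batchSize) {
        this.table = table;
        this.batchSize = batchSize;
    }

    // Called once per input record, in place of context.write(k, v).
    void write(String put) {
        buffer.add(put);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Called from cleanup() so the final partial batch is not lost.
    void flush() {
        if (!buffer.isEmpty()) {
            table.put(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        TableStub table = new TableStub();
        BatchedPutSketch mapperSide = new BatchedPutSketch(table, 100);
        for (int i = 0; i < 250; i++) {
            mapperSide.write("row-" + i);
        }
        mapperSide.flush(); // as cleanup() would do
        System.out.println(table.batchCalls + " batch calls, "
                + table.rowsWritten + " rows");
        // → 3 batch calls, 250 rows
    }
}
```

Note that this only reduces the number of client calls; as Amandeep says, the HBase client already buffers writes, so the gain over per-row puts is usually small compared to using the bulkload path.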
