aah, Ok, thank you :) On Sun, Jun 6, 2010 at 12:40 PM, Jonathan Gray <[email protected]> wrote:
> TableOutputFormat does batching of writes under the hood, so it's
> basically doing the same thing.
>
> > -----Original Message-----
> > From: Raghava Mutharaju [mailto:[email protected]]
> > Sent: Saturday, June 05, 2010 4:22 PM
> > To: [email protected]
> > Subject: Re: performance consideration when writing to HBase from MR
> > job
> >
> > Hi Amandeep,
> >
> > Thank you for the reply. I was using the HBase API directly in the
> > mapper (there is no reducer). I thought that instead of writing out
> > each row (using the context), it would be quicker to do a batch
> > write - table.put(List). Going by what you said, I guess the
> > difference won't be much.
> >
> > Regards,
> > Raghava.
> >
> > On Sat, Jun 5, 2010 at 7:01 PM, Amandeep Khurana <[email protected]> wrote:
> >
> > > > a) all the Puts are collected in Reduce or Map (if there is no
> > > > reduce) and a batch write is done
> > > > b) writing out each <K,V> pair using context.write(k, v)
> > > >
> > > > If a) is considered instead of b), then wouldn't there be a
> > > > violation of semantics w.r.t. KEYOUT, VALUEOUT (because <K, V> is
> > > > not being output)? Is this OK?
> > >
> > > 1. If you can write from the mapper, you avoid the overhead of
> > > shuffling and sorting between the map and reduce phases.
> > > 2. It does not make much difference whether you use the HBase API
> > > directly in the mapper/reducer to write to the table, or write out
> > > to the context and use one of the output formats that writes to the
> > > table. However, if you plan to use the bulkload utility (HBASE-48
> > > jira), you will get much better performance than using the HBase
> > > API directly.
> > > Regarding the semantics - no, there would not be a problem as long
> > > as you create your Puts properly.
> > >
> > > -Amandeep
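
For reference, the map-only approach discussed above can be sketched roughly like this. This is a minimal, hedged example, not code from the thread: the table name "mytable", column family "cf", qualifier "q", and class names are placeholder assumptions, and the exact API calls match the HBase 0.90-era `org.apache.hadoop.hbase.mapreduce` package. The mapper emits `Put` objects through `context.write`, so the KEYOUT/VALUEOUT type parameters are satisfied, and `TableOutputFormat` batches the writes under the hood as Jonathan describes:

```java
// Hypothetical map-only MR job writing to HBase via TableOutputFormat.
// Table "mytable", family "cf", and qualifier "q" are placeholders.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HBaseWriteJob {

  // KEYOUT/VALUEOUT are declared as ImmutableBytesWritable/Put, so
  // emitting a Put per input record is legal w.r.t. the MR semantics
  // raised in the original question.
  static class PutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      byte[] row = Bytes.toBytes(value.toString());
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), row);
      // TableOutputFormat buffers these Puts and flushes them in
      // batches, so an explicit table.put(List<Put>) gains little.
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
    Job job = new Job(conf, "write-to-hbase");
    job.setJarByClass(HBaseWriteJob.class);
    job.setMapperClass(PutMapper.class);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setNumReduceTasks(0); // map-only: no shuffle/sort overhead
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Setting the reducer count to zero is what avoids the shuffle/sort overhead mentioned in point 1 of Amandeep's reply; for truly large loads, the bulkload path (HFile generation plus `completebulkload`) remains the faster option, as noted above. This sketch requires a running HBase cluster and the HBase jars on the job classpath.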
