TableOutputFormat does batching of writes under the hood, so it's effectively doing the same thing.
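For context, the batching in question is the HBase client's write buffer: with auto-flush disabled, puts accumulate client-side and are shipped to the region servers in batches rather than one RPC per row, which is what TableOutputFormat relies on. Here is a toy, HBase-free sketch of that buffering behavior (the class name and threshold are made up for illustration; this is not the real HTable):

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for HBase's client-side write buffer (NOT the real HTable):
// puts accumulate locally and are shipped in one batch once the buffer
// fills, which is what TableOutputFormat does under the hood.
public class BufferedWriter {
    private final int bufferSize;              // hypothetical flush threshold
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;                   // batch "RPCs" sent so far

    public BufferedWriter(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    // Analogous to table.put(put) with auto-flush disabled.
    public void put(String row) {
        buffer.add(row);
        if (buffer.size() >= bufferSize) {
            flush();
        }
    }

    // Analogous to flushing commits: ship the whole buffer at once.
    public void flush() {
        if (!buffer.isEmpty()) {
            flushes++;                         // one round trip per batch
            buffer.clear();
        }
    }

    public int getFlushes() { return flushes; }

    public static void main(String[] args) {
        BufferedWriter w = new BufferedWriter(100);
        for (int i = 0; i < 1000; i++) {
            w.put("row-" + i);                 // 1000 single puts...
        }
        w.flush();                             // ...become 10 batched sends
        System.out.println("batches sent: " + w.getFlushes());
    }
}
```

So whether you call the table API yourself or go through TableOutputFormat, the rows end up batched the same way.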
> -----Original Message-----
> From: Raghava Mutharaju [mailto:[email protected]]
> Sent: Saturday, June 05, 2010 4:22 PM
> To: [email protected]
> Subject: Re: performance consideration when writing to HBase from MR job
>
> Hi Amandeep,
>
> Thank you for the reply. I was using the HBase API directly in the mapper
> (there is no reducer). I thought that instead of writing out each row
> (using context), it would be quicker to do a batch write - table.put(List).
> Going by what you said, I guess the difference won't be much.
>
> Regards,
> Raghava.
>
> On Sat, Jun 5, 2010 at 7:01 PM, Amandeep Khurana <[email protected]> wrote:
>
> > > a) all the Puts are collected in Reduce or Map (if there is no reduce)
> > > and a batch write is done
> > > b) writing out each <K,V> pair using context.write(k, v)
> > >
> > > If a) is considered instead of b), then wouldn't there be a violation
> > > of semantics w.r.t KEYOUT, VALUEOUT (because <K, V> is not being
> > > output)? Is this OK?
> >
> > 1. If you can write from the mapper, you avoid the overhead caused by
> > shuffling and sorting between the map and reduce phases.
> > 2. It would not make much difference if you use the HBase API directly
> > in the mapper/reducer to write to the table instead of writing out to
> > the context and using one of the output formats that writes to the
> > table. However, if you plan to use the bulkload utility (HBASE-48 jira),
> > you will get much better performance than using the HBase API directly.
> > Regarding the semantics - no, there would not be a problem as long as
> > you create your Puts properly.
> >
> > -Amandeep
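Option (a) from the thread - collect rows in map() and issue one batched write per task from cleanup() - can be sketched as follows. TinyTable is a made-up stand-in for the HBase table so the example runs without a cluster; with real HBase you would hold the Puts in a List and call table.put(List<Put>) in cleanup() instead:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of option (a): buffer rows in the mapper, issue one batched
// write per task. TinyTable is a hypothetical stand-in for HTable.
public class BatchingMapperSketch {

    interface TinyTable {
        void put(List<String> rows);   // stands in for HTable.put(List<Put>)
    }

    static class Mapper {
        private final List<String> pending = new ArrayList<>();
        private final TinyTable table;

        Mapper(TinyTable table) { this.table = table; }

        // map(): collect the row instead of calling context.write(k, v)
        void map(String row) {
            pending.add(row);
        }

        // cleanup(): one batch write when the task finishes
        void cleanup() {
            table.put(pending);
            pending.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        Mapper m = new Mapper(rows -> batchSizes.add(rows.size()));
        for (int i = 0; i < 500; i++) {
            m.map("row-" + i);         // 500 map() calls, no writes yet
        }
        m.cleanup();                   // a single batch of 500 rows
        System.out.println("batch writes: " + batchSizes.size()
                + ", rows in batch: " + batchSizes.get(0));
    }
}
```

As Amandeep notes, since the key/value pairs never go through context.write, the declared KEYOUT/VALUEOUT types are simply unused (NullWritable in practice) - there is no semantic violation as long as the Puts themselves are well-formed.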
