Hi Amandeep,

Thank you for the reply. I was using the HBase API directly in the mapper (there is no reducer). I thought that instead of writing out each row via the context, it would be quicker to do a batch write with table.put(List). Going by what you said, I guess the difference won't be much.
Regards,
Raghava.

On Sat, Jun 5, 2010 at 7:01 PM, Amandeep Khurana <[email protected]> wrote:

> > a) all the Puts are collected in Reduce or Map (if there is no reduce) and
> > a batch write is done
> > b) writing out each <K,V> pair using context.write(k, v)
> >
> > If a) is considered instead of b) then wouldn't there be a violation of
> > semantics w.r.t KEYOUT, VALUEOUT (because <K, V> is not being output)? Is
> > this OK?
>
> 1. If you can write from the mapper, you would avoid the overhead caused due
> to shuffling and sorting between the map and reduce phases.
> 2. It would not make much difference if you are using the HBase API directly
> in the mapper/reducer to write to the table instead of writing out to the
> context and using one of the output formats that writes to the table.
> However, if you plan to use the bulkload utility (HBASE-48 jira), you will
> get much better performance than using the HBase API directly.
>
> Regarding the semantics - no, there would not be a problem as long as you
> create your Puts properly.
>
> -Amandeep
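For anyone reading the archive: the batched-write pattern discussed above can be sketched as follows. This is a minimal, self-contained illustration of buffering rows in the mapper and flushing them in one call, mirroring HBase's table.put(List<Put>). The TableStub class, the batch size of 100, and the method names here are assumptions for illustration only; in a real job you would open an HTable in the mapper's setup(), buffer real Put objects, and flush the final partial batch from cleanup().

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batched-write pattern: buffer rows in the mapper and flush
// them in one call, instead of writing each row individually. The HBase
// client is stubbed out here (hypothetical TableStub); a real job would hold
// an HTable opened in setup() and close/flush it in cleanup().
public class BatchedPutSketch {

    // Stand-in for the HBase table: counts batch calls and rows written.
    static class TableStub {
        int batchCalls = 0;
        int rowsWritten = 0;

        // Mirrors table.put(List<Put>): one call carries many rows.
        void put(List<String> puts) {
            batchCalls++;
            rowsWritten += puts.size();
        }
    }

    private final TableStub table;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    BatchedPutSketch(TableStub table, int batchSize) {
        this.table = table;
        this.batchSize = batchSize;
    }

    // Called once per input record, in place of context.write(k, v).
    void write(String put) {
        buffer.add(put);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Called from cleanup() so the final partial batch is not lost.
    void flush() {
        if (!buffer.isEmpty()) {
            table.put(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        TableStub table = new TableStub();
        BatchedPutSketch mapperSide = new BatchedPutSketch(table, 100);
        for (int i = 0; i < 250; i++) {
            mapperSide.write("row-" + i);
        }
        mapperSide.flush(); // as cleanup() would do
        System.out.println(table.batchCalls + " batch calls, "
                + table.rowsWritten + " rows");
        // → 3 batch calls, 250 rows
    }
}
```

Note that this only reduces the number of client calls; as Amandeep says, the HBase client already buffers writes, so the gain over per-row puts is usually small compared to using the bulkload path.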
