Thanks, Harsh. Just to be clear: if I have a large key set and I run with just one reducer, which is the default, will the OutputFormat and the RecordWriter be constructed only once?

On Thu, Nov 1, 2012 at 8:14 PM, Harsh J <[email protected]> wrote:

> Hi Dhruv,
>
> Inline.
>
> On Fri, Nov 2, 2012 at 4:15 AM, Dhruv <[email protected]> wrote:
> > I'm trying to optimize the performance of my OutputFormat's implementation.
> > I'm doing things similar to HBase's TableOutputFormat--sending the reducer's
> > output to a distributed k-v store. So, the context.write() call basically
> > winds up doing a Put() on the store.
> >
> > Although I haven't profiled, a sequence of thread dumps on the reduce tasks
> > reveal that the threads are RUNNABLE and hanging out in the put() and its
> > subsequent method calls. So, I proceeded to decouple these two by
> > implementing the producer (context.write()) consumer (RecordWriter.write())
> > pattern using ExecutorService.
>
> With HBase involved, this is only partly correct. The HTable API,
> which regular TableOutputFormat uses, provides a "AutoFlush" option
> which if disabled, begins to buffer writes to regionservers instead of
> doing a flush of Puts/Deletes at every single invoke.
>
> The TableOutputFormat by default does disable AutoFlush, to provide
> this behavior.
>
> Read more on that at
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#setAutoFlush(boolean,%20boolean)
> and/or in Lars' book, "HBase: The Definitive Guide".
>
> > My understanding is that Context.write() calls RecordWriter.write() and that
> > these two are synchronous calls. The first will block until the second
> > method completes. Each reduce phase blocks until the context.write()
> > finishes, so the next reduce on the next key also blocks, making things run
> > slow in my case. Is this correct?
>
> Given the above explanation, this is untrue if HBase's
> TableOutputFormat is involved, but true otherwise for general FS
> interacting OFs.
>
> > Does this mean that OutputFormat is
> > instantiated once by the TaskTracker for the Job's reduce logic and all keys
> > operated on by the reducers get the same instance of the OutputFormat. Or,
> > is it that for each key operated by the reducer, a new OutputFormat is
> > instantiated?
>
> The TaskTracker is a service daemon that does not execute any
> user-code. Only a single OutputFormat object is instantiated in a
> single Task. The RecordWriter wrapped in it too is only instantiated
> once per Task.
>
> > Thanks,
> > Dhruv
>
> --
> Harsh J
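
For illustration only, the pattern discussed in the thread (one writer object per task, with writes buffered and pushed to the store off the calling thread) could be sketched in plain Java roughly as follows. This is a sketch under stated assumptions, not how TableOutputFormat or HTable is implemented: `KvStore`, `InMemoryStore`, `BufferedAsyncWriter`, and `WriterDemo` are hypothetical names, none of them are Hadoop or HBase APIs, and the batch size of 4 is arbitrary.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a remote k-v store client (not a Hadoop/HBase API). */
interface KvStore {
    void putAll(List<Map.Entry<String, String>> batch);
}

/** Toy store that only counts what it receives, so the sketch runs locally. */
final class InMemoryStore implements KvStore {
    private final AtomicInteger records = new AtomicInteger();
    private final AtomicInteger batches = new AtomicInteger();

    @Override
    public void putAll(List<Map.Entry<String, String>> batch) {
        batches.incrementAndGet();
        records.addAndGet(batch.size());
    }

    int recordCount() { return records.get(); }
    int batchCount() { return batches.get(); }
}

/**
 * Hypothetical writer, meant to be created once per task. write() only appends
 * to an in-memory buffer; full batches are sent by a single background thread.
 */
final class BufferedAsyncWriter implements AutoCloseable {
    private final KvStore store;
    private final int batchSize;
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();
    private final List<Future<?>> pending = new ArrayList<>();
    private List<Map.Entry<String, String>> buffer = new ArrayList<>();

    BufferedAsyncWriter(KvStore store, int batchSize) {
        this.store = store;
        this.batchSize = batchSize;
    }

    /** Called from the reduce thread; returns without waiting for the store. */
    void write(String key, String value) {
        buffer.add(new SimpleImmutableEntry<>(key, value));
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Hands the current buffer to the background thread and starts a new one. */
    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        final List<Map.Entry<String, String>> batch = buffer;
        buffer = new ArrayList<>();
        pending.add(flusher.submit(() -> store.putAll(batch)));
    }

    /** Flushes the remainder and waits for every batch, surfacing any failure. */
    @Override
    public void close() {
        flush();
        flusher.shutdown();
        try {
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while flushing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed to write", e.getCause());
        }
    }
}

class WriterDemo {
    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        BufferedAsyncWriter writer = new BufferedAsyncWriter(store, 4);
        for (int i = 0; i < 10; i++) {
            writer.write("key" + i, "value" + i);
        }
        writer.close();
        System.out.println(store.recordCount() + " records in " + store.batchCount() + " batches");
    }
}
```

Note that the executor's queue here is unbounded, so a slow store would let batches pile up in memory; a real implementation would probably want a bounded queue or some other form of back-pressure.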
