Hi Dhruv,

Replies inline.

On Fri, Nov 2, 2012 at 4:15 AM, Dhruv <[email protected]> wrote:
> I'm trying to optimize the performance of my OutputFormat's implementation.
> I'm doing things similar to HBase's TableOutputFormat--sending the reducer's
> output to a distributed k-v store. So, the context.write() call basically
> winds up doing a Put() on the store.
>
> Although I haven't profiled, a sequence of thread dumps on the reduce tasks
> reveal that the threads are RUNNABLE and hanging out in the put() and its
> subsequent method calls. So, I proceeded to decouple these two by
> implementing the producer (context.write()) consumer (RecordWriter.write())
> pattern using ExecutorService.

With HBase involved, this is only partly correct. The HTable API, which the
regular TableOutputFormat uses, provides an "AutoFlush" option which, if
disabled, buffers writes to the regionservers instead of flushing Puts/Deletes
on every single invocation. TableOutputFormat disables AutoFlush by default to
get this behavior. Read more on that at
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#setAutoFlush(boolean,%20boolean)
and/or in Lars' book, "HBase: The Definitive Guide".

> My understanding is that Context.write() calls RecordWriter.write() and that
> these two are synchronous calls. The first will block until the second
> method completes. Each reduce phase blocks until the context.write()
> finishes, so the next reduce on the next key also blocks, making things run
> slow in my case. Is this correct?

Given the above explanation, this is untrue if HBase's TableOutputFormat is
involved, since its writes are buffered, but it is true otherwise for general
FS-interacting OutputFormats.

> Does this mean that OutputFormat is
> instantiated once by the TaskTracker for the Job's reduce logic and all keys
> operated on by the reducers get the same instance of the OutputFormat. Or,
> is it that for each key operated by the reducer, a new OutputFormat is
> instantiated?

The TaskTracker is a service daemon that does not execute any user code. Only
a single OutputFormat object is instantiated in a single Task. The
RecordWriter wrapped in it is likewise instantiated only once per Task.

> Thanks,
> Dhruv

--
Harsh J
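
P.S. In case it helps to see the buffering idea in code, here is a minimal, self-contained sketch of what a write-buffering RecordWriter does when AutoFlush is off. It is an illustration only, not HBase's actual implementation: FakeStore, BufferingWriter, putBatch and the batch size of 100 are all hypothetical names and values made up for this example, and the real HTable buffers by bytes rather than by record count.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferedPutSketch {

    /** Hypothetical stand-in for the k-v store client; it only counts round trips. */
    static class FakeStore {
        int roundTrips = 0;
        int recordsStored = 0;

        void putBatch(List<String> batch) {
            roundTrips++;                   // one network call per batch
            recordsStored += batch.size();
        }
    }

    /** Hypothetical writer: write() only buffers; the store is hit once per full batch. */
    static class BufferingWriter {
        private final FakeStore store;
        private final int batchSize;
        private final List<String> buffer = new ArrayList<>();

        BufferingWriter(FakeStore store, int batchSize) {
            this.store = store;
            this.batchSize = batchSize;
        }

        /** Returns right away unless this record fills the buffer. */
        void write(String record) {
            buffer.add(record);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        /** Sends whatever is buffered as a single batch. */
        void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            store.putBatch(new ArrayList<>(buffer));
            buffer.clear();
        }

        /** Must be called at task end so the final partial batch is not lost. */
        void close() {
            flush();
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        BufferingWriter writer = new BufferingWriter(store, 100);
        for (int i = 0; i < 250; i++) {
            writer.write("row-" + i);
        }
        writer.close();
        // prints "250 records in 3 round trips"
        System.out.println(store.recordsStored + " records in "
                + store.roundTrips + " round trips");
    }
}
```

With buffering like this, most write() calls never touch the network, so an extra producer/consumer layer on top may buy less than expected. Profiling first would tell you whether the time really goes into the flushes.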
