If you'll forgive the slight topic shift, it seems like the pattern of writing directly to HFiles rather than going through TableOutputFormat would be better in several cases. For instance, with TableOutputFormat every write goes through the WAL and the memstore, and is only later flushed and compacted into HFiles. When practical, why not skip that interim state, produce the HFiles directly, and then do a bulk load?
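To make that concrete, here is a rough sketch of the preparation such a job would need: sort the output by row key and cut it at region start keys, so that each file to be written belongs to exactly one region. It is only an illustration under stated assumptions. Plain strings stand in for HBase's byte[] row keys, String ordering stands in for HBase's byte-wise ordering, and the class and method names are made up rather than taken from the HBase API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: none of these names come from the HBase API, and
// Strings are a stand-in for byte[] row keys.
final class RegionAlignedSplit {

    // Returns, for each region start key, the sorted row keys that fall in
    // that region. regionStartKeys must contain "" (the first region's start).
    static TreeMap<String, List<String>> groupByRegion(List<String> rowKeys,
                                                       List<String> regionStartKeys) {
        if (!regionStartKeys.contains("")) {
            throw new IllegalArgumentException("first region must start at the empty key");
        }
        TreeMap<String, List<String>> keysByRegion = new TreeMap<>();
        for (String start : regionStartKeys) {
            keysByRegion.put(start, new ArrayList<>());
        }
        // An HFile requires row keys in strict order, so sort before splitting.
        List<String> sorted = new ArrayList<>(rowKeys);
        Collections.sort(sorted);
        for (String rowKey : sorted) {
            // The owning region is the one with the greatest start key <= rowKey.
            String regionStart = keysByRegion.floorKey(rowKey);
            keysByRegion.get(regionStart).add(rowKey);
        }
        return keysByRegion;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("pear", "apple", "melon", "banana");
        List<String> regionStarts = List.of("", "m");
        // prints {=[apple, banana], m=[melon, pear]}
        System.out.println(groupByRegion(rows, regionStarts));
    }
}
```

Each list in the result corresponds to one file that could be handed to a region without further splitting. In a real MR job the sort and the split would be done by the shuffle and a partitioner rather than in memory.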
Of course, not all jobs that use TableOutputFormat can easily write to HFiles: those files require row keys to be output in strict order, and bulk loads are most efficient only when each HFile falls within a single existing region. But if those requirements are met, it seems like moving away from TableOutputFormat could help IO-bound jobs significantly. Is my reasoning sound?

On 9/12/11 12:40 PM, "Leif Wickland" <[email protected]> wrote:

>Thanks, Bryan. I'd love to hear any lessons you learn. I've used that
>technique successfully at a prototype level, but haven't yet moved it to
>production.
>
>Leif
>
>On Mon, Sep 12, 2011 at 10:51 AM, Bryan Keller <[email protected]> wrote:
>
>> Ah that is a very interesting solution Leif, this seems optimal to me. I am
>> going to try this and I'll report back.
>>
>> On Sep 12, 2011, at 9:09 AM, Leif Wickland wrote:
>>
>> >
>> > Bryan,
>> >
>> > Have you considered writing your MR output to HFileFormat and then asking
>> > the regions to adopt the result? That would allow you to avoid committing
>> > any changes to HBase until you knew that the MR job ran successfully.
>> >
>> > Leif
