We've been looking at ways to produce multiple outputs from Crunch jobs,
specifically writing out some kind of Status or Error Avro object based
on failures that occur while processing individual records in various
jobs. It had been suggested that, rather than logging these errors to
traditional loggers, we consider them an output of the Crunch job. After
some internal discussion, it was suggested that we run the ideas past the
Crunch community.

 
A major goal is to end up with all the error output in a location that
makes it easy to run Hive queries or perform other MapReduce-style
analysis, so that we can quickly view all errors across the larger system
without needing to go to multiple facilities. This means standardizing on
the Avro object, but it also necessitates decoupling the storage of the
object from the "standard output" of the job.

 
Since Crunch DoFns support a single Emitter per invocation of process(),
the solution that gathered the most support is to emit an object similar
to Pair<>, where first would be the "standard out" and second would be the
"standard error". A DoFn would generally populate only one of the two
(nothing prevents it from populating both if appropriate, but that is not
intended as part of general use), and separate DoFns would filter out the
two components of the pair and write the values to the appropriate targets.
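To make the flow concrete, here is a minimal sketch of the idea in plain
Java. It deliberately uses stand-ins rather than Crunch classes: OutErr is a
hypothetical Pair-like holder, a java.util.function.Consumer plays the role
of the Emitter, and the outputs()/errors() helpers play the role of the two
downstream filtering DoFns. The parsing logic and the error message format
are assumptions made up for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a Pair-like value:
// out = "standard out", err = "standard error".
final class OutErr<O, E> {
    final O out; // null when the record failed
    final E err; // null when the record succeeded

    OutErr(O out, E err) {
        this.out = out;
        this.err = err;
    }
}

final class SplitOutputSketch {
    // Plays the role of a DoFn's process(): one emitter, one emitted object per record.
    static void process(String input, Consumer<OutErr<Integer, String>> emitter) {
        try {
            emitter.accept(new OutErr<>(Integer.parseInt(input.trim()), null));
        } catch (NumberFormatException e) {
            emitter.accept(new OutErr<>(null, "unparseable record: " + input));
        }
    }

    // Runs process() over every input and collects everything that was emitted.
    static List<OutErr<Integer, String>> run(List<String> inputs) {
        List<OutErr<Integer, String>> emitted = new ArrayList<>();
        for (String input : inputs) {
            process(input, emitted::add);
        }
        return emitted;
    }

    // Plays the role of the downstream step that keeps only the "standard out" side.
    static List<Integer> outputs(List<OutErr<Integer, String>> emitted) {
        List<Integer> result = new ArrayList<>();
        for (OutErr<Integer, String> p : emitted) {
            if (p.out != null) {
                result.add(p.out);
            }
        }
        return result;
    }

    // Plays the role of the downstream step that keeps only the "standard error" side.
    static List<String> errors(List<OutErr<Integer, String>> emitted) {
        List<String> result = new ArrayList<>();
        for (OutErr<Integer, String> p : emitted) {
            if (p.err != null) {
                result.add(p.err);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<OutErr<Integer, String>> emitted = run(Arrays.asList("1", "x", "3"));
        System.out.println(outputs(emitted)); // would go to the job's normal target
        System.out.println(errors(emitted));  // would go to the shared error location
    }
}
```

In an actual job, the two filtered collections would each be written to
their own target, which is what decouples the error storage from the job's
regular output.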

As for the emitted pairing object, the concept of a tagged union was
suggested, although there currently isn't support in Java or Avro for the
concept; it was noted that
https://issues.apache.org/jira/browse/CRUNCH-239 might be a close
candidate. Pair<> would meet the requirements, although it was suggested
that a simple object dedicated to the task could make for a cleaner
approach.
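As a rough illustration of what such a dedicated object might look like,
the sketch below hand-rolls a small tagged-union-style class in plain Java.
The name Outcome and its methods are hypothetical, not anything from Crunch
or Avro, and the sketch leaves out serialization entirely (an Avro-backed
version would need its own schema).

```java
import java.util.NoSuchElementException;

// Hypothetical dedicated result type: holds either a value or an error, never both.
final class Outcome<V, E> {
    private final V value;
    private final E error;
    private final boolean failed; // the "tag" that says which side is populated

    private Outcome(V value, E error, boolean failed) {
        this.value = value;
        this.error = error;
        this.failed = failed;
    }

    // Factory for the "standard out" case.
    static <V, E> Outcome<V, E> success(V value) {
        return new Outcome<>(value, null, false);
    }

    // Factory for the "standard error" case.
    static <V, E> Outcome<V, E> failure(E error) {
        return new Outcome<>(null, error, true);
    }

    boolean isFailure() {
        return failed;
    }

    // Returns the value, or throws if this outcome is a failure.
    V value() {
        if (failed) {
            throw new NoSuchElementException("outcome is a failure");
        }
        return value;
    }

    // Returns the error, or throws if this outcome is a success.
    E error() {
        if (!failed) {
            throw new NoSuchElementException("outcome is a success");
        }
        return error;
    }
}
```

Compared with a bare Pair<>, the private constructor and the two factory
methods make it impossible to populate both sides by accident, at the cost
of ruling out the "populate both" case mentioned earlier.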

Any general thoughts on this approach? Are there any other patterns that
might serve us better, or anything on the Crunch roadmap that might be
more appropriate?
 

Brandon Inman
Software Architect
www.cerner.com


