Also, I am using version 0.8.2+23-cdh4.4.0 of Crunch.

On Feb 9, 2014, at 9:28 PM, Koduri,Vinay <[email protected]> wrote:
Crunch,

This is closely related to what Stephen has just posted [1]. In the attached DAG_PipelineWithoutMaterialization.pdf, I am trying to avoid computing the "MakingAPTable" function twice. Even with a scale factor < 1, the planner still computes it twice (refer to PipelineWithoutMaterializationTest.java).

So I am trying to materialize the PTable<Writable,Writable> that the function produces, thereby avoiding the rerun. I am doing:

pTable.materialize();
pipeline.run();

The values of that pTable can be null, and I get an exception (attached) during materialization when a value is null. As discussed in [1], it seems Writables.tableOf() also does not support null. When the PTable<Writable,Writable> is transformed into a PCollection<Pair<Writable,Writable>>, the materialization works fine (refer to PipelineThatMaterializesAPCollectionTest.java and its DAG).

Questions:
1. Is there a better way to avoid double computation of the function without materialization?
2. Does Crunch convert a PTable to a PCollection when emitting intermediate outputs that are used by subsequent phases in a pipeline execution?
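One possible workaround for the null values, sketched here in plain Java rather than against the Crunch API: since the stack trace in this message shows the NullPointerException coming from SequenceFile$Writer.append, the values could be mapped to a non-null tagged form before materialization and mapped back afterwards. The class NullSafeValues, its methods, and the "N"/"V" tags are all hypothetical names chosen for illustration; in a real pipeline the same mapping would presumably live in a MapFn over the table's values, which is an assumption and not something verified against Crunch 0.8.2.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Crunch: tags each value so that
// null becomes a non-null marker that a writer can serialize.
final class NullSafeValues {
    private NullSafeValues() {}

    // null -> "N"; any other value v -> "V" + v.
    // The prefix keeps a literal "N" value distinct from the null marker.
    static String encode(String value) {
        return value == null ? "N" : "V" + value;
    }

    // Inverse of encode: "N" -> null; "V..." -> the original value.
    static String decode(String encoded) {
        if (encoded == null || encoded.isEmpty()) {
            throw new IllegalArgumentException("not an encoded value");
        }
        char tag = encoded.charAt(0);
        if (tag == 'N' && encoded.length() == 1) {
            return null;
        }
        if (tag == 'V') {
            return encoded.substring(1);
        }
        throw new IllegalArgumentException("unknown tag: " + tag);
    }

    // Encodes every element; the result never contains null.
    static List<String> encodeAll(List<String> values) {
        List<String> out = new ArrayList<String>(values.size());
        for (String v : values) {
            out.add(encode(v));
        }
        return out;
    }
}
```

The tagging costs one extra character per value, but it keeps null distinguishable from an empty value, which a plain "replace null with empty string" substitution would not.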
[1] http://mail-archives.apache.org/mod_mbox/crunch-user/201402.mbox/browser

Stack trace when materializing a PTable<Writable, Writable>:

org.apache.crunch.CrunchRuntimeException: java.lang.NullPointerException
    at org.apache.crunch.impl.mr.emit.MultipleOutputEmitter.emit(MultipleOutputEmitter.java:45)
    at org.apache.crunch.MapFn.process(MapFn.java:34)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:99)
    at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
    at com.cerner.pophealth.refrecord.load.CrunchSimpleTest$1.process(CrunchSimpleTest.java:60)
    at com.cerner.pophealth.refrecord.load.CrunchSimpleTest$1.process(CrunchSimpleTest.java:1)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:99)
    at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
    at org.apache.crunch.MapFn.process(MapFn.java:34)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:99)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:110)
    at org.apache.crunch.impl.mr.run.CrunchMapper.map(CrunchMapper.java:60)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:695)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1268)
    at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:74)
    at org.apache.crunch.io.CrunchOutputs.write(CrunchOutputs.java:133)
    at org.apache.crunch.impl.mr.emit.MultipleOutputEmitter.emit(MultipleOutputEmitter.java:41)
    ... 21 more

Thanks

Vinay Koduri
Software Architect, Population Health Dev
[email protected]
www.cerner.com

Attachments: DAG_PipelineThatMaterializesAPCollection.pdf, DAG_PipelineWithoutMaterialization.pdf, PipelineThatMaterializesAPCollectionTest.java, PipelineWithoutMaterializationTest.java
