Also, I am using Crunch version 0.8.2+23-cdh4.4.0.

On Feb 9, 2014, at 9:28 PM, Koduri,Vinay 
<[email protected]<mailto:[email protected]>> wrote:

Crunch,

This is closely related to what Stephen has just posted[1].

In the attached DAG_PipelineWithoutMaterialization.pdf, I am trying to avoid 
the double computation of the "MakingAPTable" function. Even with a scale 
factor < 1, the planner computes it twice (refer to 
PipelineWithoutMaterializationTest.java). So I am trying to materialize the 
PTable<Writable,Writable> that the function produces, thereby avoiding the 
rerun. I am doing:

pTable.materialize();
pipeline.run();

That pTable's values can be null, and I get an exception (attached) when a 
value is null during materialization. As discussed in [1], it seems 
Writables.tableOf() also does not support null values. When the 
PTable<Writable,Writable> is transformed into a 
PCollection<Pair<Writable,Writable>>, the materialization works fine (refer to 
PipelineThatMaterializesAPCollectionTest.java and its DAG).
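
For illustration only, the failure mode described above can be reproduced in 
plain Java without Crunch or Hadoop. The sketch below assumes the underlying 
writer dereferences every value it is given, which is what the 
NullPointerException in SequenceFile$Writer.append suggests. It also shows one 
possible workaround: replacing null with a sentinel before writing and mapping 
it back on read. The class NullSentinelSketch, its methods, and the "__NULL__" 
sentinel are hypothetical names, not part of the Crunch or Hadoop APIs.

```java
// Plain-Java stand-in for the null-value problem; NOT Crunch or Hadoop code.
// All names here are hypothetical and exist only for this illustration.
final class NullSentinelSketch {

    // Assumed sentinel: it must never occur as a real value in the data.
    static final String NULL_SENTINEL = "__NULL__";

    // Mimics a writer that dereferences the value, so a null value
    // triggers a NullPointerException.
    static String write(String key, String value) {
        return key + "=" + value.length() + ":" + value;
    }

    // Replace null with the sentinel before the value reaches the writer.
    static String encode(String value) {
        return value == null ? NULL_SENTINEL : value;
    }

    // Map the sentinel back to null when reading the stored value.
    static String decode(String stored) {
        return NULL_SENTINEL.equals(stored) ? null : stored;
    }

    public static void main(String[] args) {
        String value = null;
        try {
            write("k1", value);
        } catch (NullPointerException e) {
            System.out.println("raw null value was rejected by the writer");
        }
        // The encoded value can be written, and decoding restores the null.
        System.out.println(write("k1", encode(value)));
        System.out.println(decode(encode(value)) == null);
    }
}
```

In a Crunch pipeline the equivalent of encode() would presumably run in a MapFn 
just before the table is materialized. That placement is an assumption and has 
not been verified against the attached tests.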

Questions:
1. Is there a better way to avoid double computation of the function without 
materialization?

2. Does Crunch convert PTable to a PCollection when emitting intermediate 
outputs that are used by subsequent phases in a pipeline execution?



[1] http://mail-archives.apache.org/mod_mbox/crunch-user/201402.mbox/browser


Stack Trace when materializing a PTable<Writable, Writable>:
org.apache.crunch.CrunchRuntimeException: java.lang.NullPointerException
at org.apache.crunch.impl.mr.emit.MultipleOutputEmitter.emit(MultipleOutputEmitter.java:45)
at org.apache.crunch.MapFn.process(MapFn.java:34)
at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:99)
at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
at com.cerner.pophealth.refrecord.load.CrunchSimpleTest$1.process(CrunchSimpleTest.java:60)
at com.cerner.pophealth.refrecord.load.CrunchSimpleTest$1.process(CrunchSimpleTest.java:1)
at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:99)
at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
at org.apache.crunch.MapFn.process(MapFn.java:34)
at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:99)
at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:110)
at org.apache.crunch.impl.mr.run.CrunchMapper.map(CrunchMapper.java:60)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1268)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:74)
at org.apache.crunch.io.CrunchOutputs.write(CrunchOutputs.java:133)
at org.apache.crunch.impl.mr.emit.MultipleOutputEmitter.emit(MultipleOutputEmitter.java:41)
... 21 more


Thanks
<DAG_PipelineThatMaterializesAPCollection.pdf><DAG_PipelineWithoutMaterialization.pdf><PipelineThatMaterializesAPCollectionTest.java><PipelineWithoutMaterializationTest.java>

Vinay Koduri
Software Architect, Population Health Dev
[email protected]<mailto:[email protected]>
www.cerner.com




