Hey Josh,
Thanks for the reply! I think we will be able to get around this issue by
materializing the output of the union of S2 and S3.
However, the DAG shows that the first job is still writing the output of S1 to
disk. Out of curiosity, how does the planner decide to write the output of S1
to disk instead of writing the output of the union of S2 and S3?

-Surbhi

From: Josh Wills <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, November 12, 2013 10:26 AM
To: "[email protected]" <[email protected]>
Subject: Re: Same processing in two m/r jobs

Hey Surbhi,

The planner is trying to minimize the amount of data it writes to disk at the 
end of the first job; it doesn't usually worry so much about re-running the 
same computation in two different jobs if it means that less data will be 
written to disk overall, since most MR jobs aren't CPU bound.

While that's often a useful heuristic, there are many cases where it isn't 
true, and this sounds like one of them. My advice would be to materialize the 
output of the union of S2 and S3, at which point the planner should run the 
processing of S2 and S3 once at the end of job 1, and then pick up that 
materialized output for grouping in job 2.
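In Crunch terms, that advice looks roughly like the sketch below. The collection names and element types here are placeholders standing in for the S2/S3 processing in the attached test, but `union` and `materialize` are the actual `PCollection` methods:

```java
import org.apache.crunch.PCollection;

// Sketch only: s2 and s3 stand in for the outputs of your S2 and S3
// processing steps; wire in your real sources and DoFns.
PCollection<String> s2 = /* output of the S2 processing */ null;
PCollection<String> s3 = /* output of the S3 processing */ null;

// Union the two branches and ask the planner to write the combined
// result to storage at the end of job 1...
PCollection<String> unioned = s2.union(s3);
unioned.materialize();

// ...so that both downstream consumers (the context output to HDFS
// and the HFile generation) can read the materialized copy in job 2
// instead of re-running the S2/S3 processing in each job.
```

Note that `materialize()` is lazy; the data is actually written when the pipeline runs, at which point the planner treats the materialized collection as a checkpoint that later jobs read from.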

Best,
Josh


On Mon, Nov 11, 2013 at 8:30 PM, Mungre,Surbhi
<[email protected]> wrote:
Background:
We have a Crunch pipeline which is used to normalize and standardize some
entities represented as Avro. In our pipeline, we also capture some context
information about the errors and warnings which we encounter during our
processing. We pass pairs of context information and Avro entities through
our pipeline. At the end of the pipeline, the context information is written
to HDFS and the Avro entities are written to HFiles.

Problem:
When we were analyzing the DAG for our Crunch pipeline, we noticed that the
same processing is done in two M/R jobs: once to capture context information,
and a second time to generate HFiles. I wrote a test which replicates this
issue with a simple example. The test and the DAG created from it are attached
to this post. It is clear from the DAG that S2 and S3 are processed twice. I
am not sure why this processing is done twice and whether there is any way to
avoid this behavior.

Surbhi Mungre
Software Engineer
www.cerner.com






--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
