Closing the loop on this one-- I'm fixing the underlying issue in: https://issues.apache.org/jira/browse/CRUNCH-294
On Tue, Nov 12, 2013 at 10:04 AM, Josh Wills <[email protected]> wrote:

> I'm surprised that it's still writing out S1 -- are you changing the downstream operations (PTable.keys and GBK) to read from the union of S2 and S3? If so, I'd like to try to recreate that; it sounds like a planner bug.
>
> Whenever there is a dependency between two GBK operations, the planner analyzes the operations that link those two GBKs and tries to find a "good" place to split the pipeline into two separate MR jobs. "Good" is usually based on the planner's rough estimate of how much data will be written by each of the DoFns, which is largely determined by the value of the float scaleFactor() function of each DoFn: scaleFactor() > 1.0 means the DoFn is expected to write out more data than it reads in, and scaleFactor() < 1.0 means it is expected to write less.
>
> The only exception to that rule is if you are already writing a read/write output file at some point along the chain of operations between the two GBKs, in which case the planner will just choose that file as the split point. Note that text outputs are write-only; Crunch does not assume that it can read back a text file as it was written unless it is a PCollection<String>.
>
> J
>
> On Tue, Nov 12, 2013 at 9:31 AM, Mungre,Surbhi <[email protected]> wrote:
>
>> Hey Josh,
>> Thanks for the reply! I think we will be able to get around this issue by materializing the output of the union of S2 and S3.
>> However, the DAG shows that the first job is still writing the output of S1 to disk. Out of curiosity, how does the planner decide to write the output of S1 to disk instead of the output of the union of S2 and S3?
>>
>> -Surbhi
>>
>> From: Josh Wills <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, November 12, 2013 10:26 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Same processing in two m/r jobs
>>
>> Hey Surbhi,
>>
>> The planner is trying to minimize the amount of data it writes to disk at the end of the first job; it doesn't usually worry much about re-running the same computation in two different jobs if that means less data is written to disk overall, since most MR jobs aren't CPU-bound.
>>
>> While that's often a useful heuristic, there are many cases where it isn't true, and this sounds like one of them. My advice would be to materialize the output of the union of S2 and S3, at which point the planner should run the processing of S2 and S3 once at the end of job 1, and then pick up that materialized output for grouping in job 2.
>>
>> Best,
>> Josh
>>
>> On Mon, Nov 11, 2013 at 8:30 PM, Mungre,Surbhi <[email protected]> wrote:
>>
>>> Background:
>>> We have a Crunch pipeline that is used to normalize and standardize some entities represented as Avro. In our pipeline, we also capture context information about the errors and warnings we encounter during processing. We pass a pair of the context information and the Avro entities through our pipeline. At the end of the pipeline, the context information is written to HDFS and the Avro entities are written to HFiles.
>>>
>>> Problem:
>>> When we analyzed the DAG for our Crunch pipeline, we noticed that the same processing is done in two m/r jobs: once to capture the context information and a second time to generate the HFiles. I wrote a test which replicates this issue with a simple example. The test and a DAG created from this test are attached to the post. It is clear from the DAG that S2 and S3 are processed twice. I am not sure why this processing is done twice or whether there is any way to avoid this behavior.
>>>
>>> Surbhi Mungre
>>> Software Engineer
>>> www.cerner.com

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
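A minimal sketch of the workaround discussed above, assuming a simplified text-based pipeline: the Normalize DoFn, the input paths, and the final count() here are hypothetical stand-ins for the Avro/HFile processing in the original job, not code from that pipeline. It shows the two levers mentioned in the thread -- overriding DoFn.scaleFactor() to hint how much data a function emits (which the planner weighs when picking a split point between GBKs), and calling materialize() on the union of S2 and S3 so the planner persists that output once at the end of job 1 instead of recomputing it in job 2.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class UnionMaterializeSketch {

  // Hypothetical DoFn standing in for the expensive S2/S3 processing.
  // Overriding scaleFactor() tells the planner roughly how much data the
  // function emits relative to its input; values below 1.0 make its output
  // a cheaper place to split the pipeline into two MR jobs.
  static class Normalize extends DoFn<String, String> {
    @Override
    public void process(String input, Emitter<String> emitter) {
      emitter.emit(input.trim().toLowerCase());
    }

    @Override
    public float scaleFactor() {
      return 0.5f; // we expect to emit about half as much data as we read
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(UnionMaterializeSketch.class);

    // s2 and s3 stand in for the two collections that were being recomputed.
    PCollection<String> s2 = pipeline.readTextFile(args[0])
        .parallelDo(new Normalize(), Writables.strings());
    PCollection<String> s3 = pipeline.readTextFile(args[1])
        .parallelDo(new Normalize(), Writables.strings());

    // Materializing the union marks it for persistence, so the planner
    // should write it out once in job 1 and read it back for the grouping
    // in job 2 rather than re-running the Normalize DoFns.
    PCollection<String> unioned = s2.union(s3);
    unioned.materialize();

    // Downstream GBK-style operation that previously triggered the recompute.
    PTable<String, Long> counts = unioned.count();
    pipeline.writeTextFile(counts, args[2]);
    pipeline.done();
  }
}

With the materialize() call in place, the plan should persist the unioned collection once and reuse it; without it, the scaleFactor() hints are the main way to influence where the planner chooses to split between the two GBKs.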
