Closing the loop on this one-- I'm fixing the underlying issue in: https://issues.apache.org/jira/browse/CRUNCH-294
On Tue, Nov 12, 2013 at 10:04 AM, Josh Wills <[email protected]> wrote:

> I'm surprised that it's still writing out S1 -- are you changing the downstream operations (PTable.keys and GBK) to read from the union of S2 and S3? If so, I'd like to try to recreate that; it sounds like a planner bug.
>
> Whenever there is a dependency between two GBK operations, the planner analyzes the operations that link those two GBKs and tries to find a "good" place to split the pipeline into two separate MR jobs. "Good" is usually based on the planner's rough estimate of how much data will be written by each of the DoFns, which is largely determined by the value of the float scaleFactor() function of each DoFn: scaleFactor() > 1.0 means the DoFn is expected to write out more data than it reads in, and scaleFactor() < 1.0 means it is expected to write less.
>
> The only exception to that rule is if you are already writing a read/write output file at some point along the chain of operations between the two GBKs, in which case the planner will just choose that file as the split point. Note that text outputs are write-only; Crunch does not assume that it can read back a text file as it was written unless it is a PCollection<String>.
>
> J
>
> On Tue, Nov 12, 2013 at 9:31 AM, Mungre,Surbhi <[email protected]> wrote:
>
>> Hey Josh,
>> Thanks for the reply! I think we will be able to get around this issue by materializing the output of the union of S2 and S3.
>> However, the DAG shows that the first job is still writing the output of S1 to disk. Out of curiosity, how does the planner decide to write the output of S1 to disk instead of the output of the union of S2 and S3?
>>
>> -Surbhi
>>
>> From: Josh Wills <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, November 12, 2013 10:26 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Same processing in two m/r jobs
>>
>> Hey Surbhi,
>>
>> The planner is trying to minimize the amount of data it writes to disk at the end of the first job; it doesn't usually worry much about re-running the same computation in two different jobs if that means less data is written to disk overall, since most MR jobs aren't CPU-bound.
>>
>> While that's often a useful heuristic, there are many cases where it isn't true, and this sounds like one of them. My advice would be to materialize the output of the union of S2 and S3, at which point the planner should run the processing of S2 and S3 once at the end of job 1, and then pick up that materialized output for grouping in job 2.
>>
>> Best,
>> Josh
>>
>> On Mon, Nov 11, 2013 at 8:30 PM, Mungre,Surbhi <[email protected]> wrote:
>>
>>> Background:
>>> We have a Crunch pipeline that is used to normalize and standardize some entities represented as Avro. In our pipeline, we also capture context information about the errors and warnings we encounter during processing. We pass a pair of the context information and the Avro entities through our pipeline. At the end of the pipeline, the context information is written to HDFS and the Avro entities are written to HFiles.
>>>
>>> Problem:
>>> When we analyzed the DAG for our Crunch pipeline, we noticed that the same processing is done in two m/r jobs: once to capture the context information and a second time to generate the HFiles. I wrote a test which replicates this issue with a simple example. The test and a DAG created from this test are attached to the post. It is clear from the DAG that S2 and S3 are processed twice. I am not sure why this processing is done twice or whether there is any way to avoid this behavior.
>>>
>>> Surbhi Mungre
>>> Software Engineer
>>> www.cerner.com

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
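A minimal sketch of the workaround discussed above, assuming a simplified text-based pipeline: the Normalize DoFn, the input paths, and the final count() here are hypothetical stand-ins for the Avro/HFile processing in the original job, not code from that pipeline. It shows the two levers mentioned in the thread -- overriding DoFn.scaleFactor() to hint how much data a function emits (which the planner weighs when picking a split point between GBKs), and calling materialize() on the union of S2 and S3 so the planner persists that output once at the end of job 1 instead of recomputing it in job 2.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class UnionMaterializeSketch {

  // Hypothetical DoFn standing in for the expensive S2/S3 processing.
  // Overriding scaleFactor() tells the planner roughly how much data the
  // function emits relative to its input; values below 1.0 make its output
  // a cheaper place to split the pipeline into two MR jobs.
  static class Normalize extends DoFn<String, String> {
    @Override
    public void process(String input, Emitter<String> emitter) {
      emitter.emit(input.trim().toLowerCase());
    }

    @Override
    public float scaleFactor() {
      return 0.5f; // we expect to emit about half as much data as we read
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(UnionMaterializeSketch.class);

    // s2 and s3 stand in for the two collections that were being recomputed.
    PCollection<String> s2 = pipeline.readTextFile(args[0])
        .parallelDo(new Normalize(), Writables.strings());
    PCollection<String> s3 = pipeline.readTextFile(args[1])
        .parallelDo(new Normalize(), Writables.strings());

    // Materializing the union marks it for persistence, so the planner
    // should write it out once in job 1 and read it back for the grouping
    // in job 2 rather than re-running the Normalize DoFns.
    PCollection<String> unioned = s2.union(s3);
    unioned.materialize();

    // Downstream GBK-style operation that previously triggered the recompute.
    PTable<String, Long> counts = unioned.count();
    pipeline.writeTextFile(counts, args[2]);
    pipeline.done();
  }
}

With the materialize() call in place, the plan should persist the unioned collection once and reuse it; without it, the scaleFactor() hints are the main way to influence where the planner chooses to split between the two GBKs.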
