After looking a bit more, could it be that the output of C should have been 
cached, but isn't fully cached because there isn't enough memory, so the step 
A->B->C->... needs to be run again for some partitions? 
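
If that is the case, would letting the Spark runner spill cached data to disk 
help? Here is a minimal sketch of what I have in mind (Java SDK, usual Beam 
imports assumed); the storageLevel option and its MEMORY_ONLY default are my 
assumptions about SparkPipelineOptions, so please correct me if the name is 
different:

// Hypothetical sketch: ask the Spark runner to spill cached data to disk
// instead of keeping it memory-only. storageLevel / MEMORY_ONLY default are
// my assumptions about SparkPipelineOptions.
SparkPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
options.setStorageLevel("MEMORY_AND_DISK");
Pipeline p = Pipeline.create(options);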

Best regards,
Augusto


On 2019/05/06 15:03:12, [email protected] <[email protected]> wrote: 
> Hi,
> 
> A bit more info: I am trying to find out how the following construct is 
> translated into Spark. The pipeline is: A->B->C->(D/E/F with multipleTags)
> 
> Then D, E, F branch out and do things independently. B and C in this case are 
> very heavy steps, and looking at the jobs generated in Spark and the 
> corresponding DAGs for the stages, it seems to me that this is being 
> transformed in Spark into 3 independent pipelines:
> A->B->C->D
> A->B->C->E
> A->B->C->F
> And the operations B and C, which are extremely heavy, seem to be repeated. 
> Could this be the case? Am I missing something? I am no expert at reading 
> these generated DAGs, but it is the general feeling I get, which is why I 
> wanted to know if there is a way to see what Beam generates for Spark to run.
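> 
> For reference, here is a minimal sketch (Java SDK) of roughly the shape of 
> the pipeline; the A/B/C/D/E/F transforms and the SplitFn with its tags are 
> made up for illustration:
> 
> // Hypothetical sketch: the real transforms and SplitFn are made up.
> final TupleTag<String> dTag = new TupleTag<String>() {};
> final TupleTag<String> eTag = new TupleTag<String>() {};
> final TupleTag<String> fTag = new TupleTag<String>() {};
> 
> PCollection<String> c = pipeline
>     .apply("A", new A())
>     .apply("B", new B())   // very heavy
>     .apply("C", new C());  // very heavy
> 
> PCollectionTuple branches = c.apply("Split",
>     ParDo.of(new SplitFn(dTag, eTag, fTag))
>          .withOutputTags(dTag, TupleTagList.of(eTag).and(fTag)));
> 
> branches.get(dTag).apply("D", new D());
> branches.get(eTag).apply("E", new E());
> branches.get(fTag).apply("F", new F());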
> 
> Best regards,
> Augusto
> 
> 
> 
> 
> 
> On 2019/05/06 06:41:07, [email protected] <[email protected]> wrote: 
> > Hi,
> > 
> > I would like to know if there is a way to inspect the pipeline that Beam 
> > generates to run on the Spark runner. 
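> > 
> > For example, something along these lines, if a PipelineVisitor is the 
> > right tool for this (a rough sketch on my side, Java SDK):
> > 
> > // Hypothetical sketch: print the Beam transform hierarchy with a visitor.
> > // This would show the Beam side only, not the translated Spark DAG.
> > pipeline.traverseTopologically(new Pipeline.PipelineVisitor.Defaults() {
> >   @Override
> >   public CompositeBehavior enterCompositeTransform(TransformHierarchy.Node node) {
> >     System.out.println("composite: " + node.getFullName());
> >     return CompositeBehavior.ENTER_TRANSFORM;
> >   }
> > 
> >   @Override
> >   public void visitPrimitiveTransform(TransformHierarchy.Node node) {
> >     System.out.println("primitive: " + node.getFullName());
> >   }
> > });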
> > 
> > Best regards,
> > Augusto
> > 
> 
