After looking a bit more, could it be that C's output should have been cached, but isn't fully cached because there isn't enough memory, so the A->B->C->... steps need to run again for some partitions?
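For concreteness, this is roughly the shape I mean, as a minimal Beam (Java) sketch; the transform names A..F, the Create source and the routing logic are placeholders rather than my real code:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.TupleTag;
    import org.apache.beam.sdk.values.TupleTagList;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BranchingPipelineSketch {

      // Tags for the three branches coming out of the multi-output step.
      static final TupleTag<String> TAG_D = new TupleTag<String>() {};
      static final TupleTag<String> TAG_E = new TupleTag<String>() {};
      static final TupleTag<String> TAG_F = new TupleTag<String>() {};

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // A: source (placeholder data).
        PCollection<String> a = p.apply("A", Create.of("x", "y", "z"));

        // B and C: the two heavy transforms that should only run once.
        PCollection<String> b =
            a.apply("B", MapElements.into(TypeDescriptors.strings()).via((String s) -> s + "-b"));
        PCollection<String> c =
            b.apply("C", MapElements.into(TypeDescriptors.strings()).via((String s) -> s + "-c"));

        // Fan-out with multiple output tags: one DoFn routes elements to D, E and F.
        PCollectionTuple branches =
            c.apply("FanOut",
                ParDo.of(new DoFn<String, String>() {
                      @ProcessElement
                      public void process(ProcessContext ctx) {
                        // Placeholder routing: real code decides per element.
                        ctx.output(ctx.element());           // main output -> D
                        ctx.output(TAG_E, ctx.element());    // additional output -> E
                        ctx.output(TAG_F, ctx.element());    // additional output -> F
                      }
                    })
                    .withOutputTags(TAG_D, TupleTagList.of(TAG_E).and(TAG_F)));

        // D, E, F: independent downstream processing per tag (placeholders;
        // the real branches would each write to their own sink).
        branches.get(TAG_D).apply("D", MapElements.into(TypeDescriptors.strings()).via((String s) -> s + "-d"));
        branches.get(TAG_E).apply("E", MapElements.into(TypeDescriptors.strings()).via((String s) -> s + "-e"));
        branches.get(TAG_F).apply("F", MapElements.into(TypeDescriptors.strings()).via((String s) -> s + "-f"));

        p.run().waitUntilFinish();
      }
    }

In the Spark UI I would have expected one set of stages for A->B->C feeding all three branches, rather than three separate runs of it.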
Best regards,
Augusto

On 2019/05/06 15:03:12, [email protected] <[email protected]> wrote:
> Hi,
>
> A bit more info: I am trying to find out how the following construct is
> translated into Spark. I have the following pipeline: A->B->C->(D/E/F with
> multipleTags).
>
> D, E and F then branch out and do things independently. B and C are very
> heavy steps, and from the jobs generated in Spark and the corresponding
> stage DAGs it looks to me as if this is transformed into three independent
> pipelines:
> A->B->C->D
> A->B->C->E
> A->B->C->F
> so the extremely heavy operations B and C seem to be repeated. Could this
> be the case? Am I missing something? I am not an expert at reading these
> generated DAGs, but that is the general impression I get, which is why I
> wanted to know if there is a way to see what Beam generates for Spark to run.
>
> Best regards,
> Augusto
>
> On 2019/05/06 06:41:07, [email protected] <[email protected]> wrote:
> > Hi,
> >
> > I would like to know if there is a way to inspect whatever pipeline was
> > generated from Beam to be run on the Spark runner.
> >
> > Best regards,
> > Augusto
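PS: one thing I intend to try, assuming I am reading the Spark runner options correctly, is raising the storage level used for caching so that partitions of C's output can spill to disk instead of being dropped and recomputed; the storageLevel option name is my reading of SparkPipelineOptions and may be wrong:

    import org.apache.beam.runners.spark.SparkPipelineOptions;
    import org.apache.beam.runners.spark.SparkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class SparkOptionsSketch {
      public static void main(String[] args) {
        SparkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
        options.setRunner(SparkRunner.class);
        // Assumption: storageLevel controls how the Spark runner caches the RDDs
        // backing PCollections that are consumed by more than one branch
        // (default MEMORY_ONLY). MEMORY_AND_DISK should let partitions of C's
        // output spill to disk rather than be recomputed from A->B->C.
        options.setStorageLevel("MEMORY_AND_DISK");

        Pipeline p = Pipeline.create(options);
        // ... build the A->B->C->(D/E/F) pipeline here ...
        p.run().waitUntilFinish();
      }
    }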
