Hi Augusto,

Something that might be similar is happening in the portable version of the Spark runner [1]. I haven't yet figured out for sure whether my issue is caused by Spark redundantly scheduling work (which might actually be okay given sufficient parallelism), or by a problem with the way the Beam Spark runner sets up the Spark execution graph.

[1] https://issues.apache.org/jira/browse/BEAM-7131
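To illustrate the kind of redundant scheduling in question, here is a minimal plain-Spark sketch (no Beam involved; all names are made up). A shared lineage that feeds several actions is recomputed once per action unless it is explicitly cached:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RecomputeDemo {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("recompute-demo");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          // Shared "heavy" lineage, standing in for steps like B and C.
          JavaRDD<Integer> heavy =
              sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6))
                .map(x -> {
                  System.out.println("recomputing element " + x);
                  return x * x;
                });

          // heavy.cache(); // uncommenting this lets Spark reuse the computed result

          // Three separate actions, standing in for branches like D, E, and F.
          // Without cache(), each count() re-runs the map above, which shows
          // up in the Spark UI as three jobs repeating the same stages.
          long d = heavy.filter(x -> x % 2 == 0).count();
          long e = heavy.filter(x -> x > 10).count();
          long f = heavy.filter(x -> x < 30).count();

          System.out.println(d + " " + e + " " + f);
        }
      }
    }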
Kyle Weaver | Software Engineer | github.com/ibzib | [email protected] | +16502035555

On Mon, May 6, 2019 at 8:03 AM [email protected] <[email protected]> wrote:
> Hi,
>
> A bit more info: I am trying to find out how the following construct is
> translated into Spark. I have the following pipeline:
> A->B->C->(D/E/F with multipleTags)
>
> D, E, and F then branch out and do things independently. B and C are
> very heavy steps, and looking at the jobs generated in Spark and the
> corresponding DAGs for their stages, it seems this is being transformed
> in Spark into 3 independent pipelines:
> A->B->C->D
> A->B->C->E
> A->B->C->F
> The operations B and C, which are extremely heavy, seem to be repeated.
> Could this be the case? Am I missing something? I am no expert at
> reading these generated DAGs, but that is the general feeling I get,
> hence why I wanted to know if there is a way to see what is generated
> from Beam for Spark to run.
>
> Best regards,
> Augusto
>
> On 2019/05/06 06:41:07, [email protected] <[email protected]> wrote:
> > Hi,
> >
> > I would like to know if there is a way to inspect whatever pipeline
> > was generated from Beam to be run in the Spark Runner.
> >
> > Best regards,
> > Augusto
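For reference, a minimal self-contained Beam (Java) sketch of the construct described above: a linear A->B->C prefix feeding a single ParDo with three tagged outputs. All names and transforms here are illustrative, not Augusto's actual code:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.TupleTag;
    import org.apache.beam.sdk.values.TupleTagList;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BranchingPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Tags for the three outputs of the branching ParDo.
        final TupleTag<Integer> dTag = new TupleTag<Integer>() {};
        final TupleTag<Integer> eTag = new TupleTag<Integer>() {};
        final TupleTag<Integer> fTag = new TupleTag<Integer>() {};

        // A -> B -> C: a source followed by two (here trivial, in practice
        // heavy) stages.
        PCollection<Integer> afterC =
            p.apply("A", Create.of(1, 2, 3))
             .apply("B", MapElements.into(TypeDescriptors.integers()).via((Integer x) -> x * 2))
             .apply("C", MapElements.into(TypeDescriptors.integers()).via((Integer x) -> x + 1));

        // One ParDo with three tagged outputs; D, E, and F branch from here.
        PCollectionTuple branches =
            afterC.apply("Split",
                ParDo.of(new DoFn<Integer, Integer>() {
                      @ProcessElement
                      public void processElement(ProcessContext ctx) {
                        ctx.output(ctx.element());            // main output -> dTag
                        ctx.output(eTag, ctx.element() + 10); // -> eTag
                        ctx.output(fTag, ctx.element() + 20); // -> fTag
                      }
                    })
                    .withOutputTags(dTag, TupleTagList.of(eTag).and(fTag)));

        // Each branch would continue independently from here.
        PCollection<Integer> d = branches.get(dTag);
        PCollection<Integer> e = branches.get(eTag);
        PCollection<Integer> f = branches.get(fTag);

        p.run().waitUntilFinish();
      }
    }

One way to check the question empirically is to run such a pipeline on the Spark runner and open the DAG visualization for each job in the Spark web UI: if the stages corresponding to B and C appear in all three jobs, they are being recomputed per branch.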
