Hi Augusto,

Something that might be similar is happening in the portable version of the
Spark runner [1]. I haven't figured out for sure yet whether my issue is
caused by Spark redundantly scheduling work (which actually might be okay
given sufficient parallelism), or if there is a problem with the way the
Beam Spark runner sets up the Spark execution graph.

[1] https://issues.apache.org/jira/browse/BEAM-7131
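
In the meantime, one way to at least see the Beam-side graph before it
is handed to the runner is to walk it with a PipelineVisitor. Below is
an untested sketch against the Java SDK ("pipeline" stands for your
already-constructed Pipeline); for the Spark side of the translation,
the DAG visualization in the Spark UI and RDD#toDebugString are the
usual tools.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.runners.TransformHierarchy;

    // Print every transform in the pipeline, indented by nesting depth.
    pipeline.traverseTopologically(
        new Pipeline.PipelineVisitor.Defaults() {
          private int depth = 0;

          private String indent() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < depth; i++) {
              sb.append("  ");
            }
            return sb.toString();
          }

          @Override
          public CompositeBehavior enterCompositeTransform(
              TransformHierarchy.Node node) {
            System.out.println(indent() + node.getFullName());
            depth++;
            return CompositeBehavior.ENTER_TRANSFORM;
          }

          @Override
          public void leaveCompositeTransform(TransformHierarchy.Node node) {
            depth--;
          }

          @Override
          public void visitPrimitiveTransform(TransformHierarchy.Node node) {
            System.out.println(indent() + node.getFullName());
          }
        });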

Kyle Weaver | Software Engineer | github.com/ibzib | [email protected] |
+16502035555


On Mon, May 6, 2019 at 8:03 AM [email protected] <[email protected]>
wrote:

> Hi,
>
> A bit more info: I am trying to find out how the following construct is
> translated into Spark. The pipeline is A->B->C->(D/E/F with multiple
> output tags).
>
> D, E, and F then branch out and do their work independently. B and C in
> this case are very heavy steps, and it seems to me, looking at the jobs
> generated in Spark and the corresponding stage DAGs, that this is being
> translated in Spark into 3 independent pipelines:
> A->B->C->D
> A->B->C->E
> A->B->C->F
> So the operations B and C, which are extremely heavy, seem to be repeated.
> Could this be the case? Am I missing something? I am not an expert at
> reading these generated DAGs, but that is the general impression I get,
> hence why I wanted to know if there is a way to see what is generated
> from Beam for Spark to run.
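>
> To make this concrete, the branching step is built roughly like this
> (a simplified sketch rather than my actual code; the tag names and the
> DoFn are invented for illustration):
>
>     import org.apache.beam.sdk.transforms.DoFn;
>     import org.apache.beam.sdk.transforms.ParDo;
>     import org.apache.beam.sdk.values.PCollection;
>     import org.apache.beam.sdk.values.PCollectionTuple;
>     import org.apache.beam.sdk.values.TupleTag;
>     import org.apache.beam.sdk.values.TupleTagList;
>
>     // One tag per branch; dTag is the main output.
>     final TupleTag<String> dTag = new TupleTag<String>() {};
>     final TupleTag<String> eTag = new TupleTag<String>() {};
>     final TupleTag<String> fTag = new TupleTag<String>() {};
>
>     // "input" is the PCollection produced by A -> B -> C.
>     PCollectionTuple branches =
>         input.apply(
>             "Split",
>             ParDo.of(
>                     new DoFn<String, String>() {
>                       @ProcessElement
>                       public void process(ProcessContext c) {
>                         // Route each element to the tagged outputs.
>                         c.output(c.element());        // main output (dTag)
>                         c.output(eTag, c.element());
>                         c.output(fTag, c.element());
>                       }
>                     })
>                 .withOutputTags(dTag, TupleTagList.of(eTag).and(fTag)));
>
>     // D, E, and F each consume one branch independently.
>     PCollection<String> toD = branches.get(dTag);
>     PCollection<String> toE = branches.get(eTag);
>     PCollection<String> toF = branches.get(fTag);
>
> My worry is that each branches.get(...) ends up carrying its own full
> lineage back through A, B, and C once it is translated to Spark.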
>
> Best regards,
> Augusto
>
> On 2019/05/06 06:41:07, [email protected] <[email protected]>
> wrote:
> > Hi,
> >
> > I would like to know if there is a way to inspect the pipeline that is
> > generated from Beam to be run on the Spark Runner.
> >
> > Best regards,
> > Augusto
> >
>
