Re: Controlling parallelism of a ParDo Transform while writing to DB

Jins George Thu, 17 May 2018 10:33:21 -0700

I am a user running Beam on Flink. The Flink runner currently exposes only job-level parallelism, not operator-level parallelism. Operator-level control would be a really nice feature if it can be supported.

Flink's DataStream API does provide that option, though.

Thanks,
Jins George


On 05/16/2018 10:24 PM, Chamikara Jayalath wrote:

The exact mechanism for controlling parallelism is runner-specific. It looks like Flink allows users to specify the amount of parallelism (per job) using the following option.

https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineOptions.java#L65
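
As a rough sketch of how that option is usually supplied: Beam derives command-line flags from the getter names of a PipelineOptions interface, so the parallelism option in FlinkPipelineOptions is passed as --parallelism. The value 40 is only an example borrowed from the original question, and any other flags depend on how the job is launched.

```
--runner=FlinkRunner
--parallelism=40
```

This setting applies to the whole job, so every step runs with the same parallelism.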

I'm not sure whether Flink allows finer-grained control.
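
One workaround that is sometimes used when a runner has no per-step setting, and that nobody in this thread has confirmed for Flink, is to assign each record to one of a fixed number of shards, group by the shard key, and write from the step that follows the grouping. Because a grouping step generally cannot be fused with the steps before it, and each key's group is processed by one worker at a time, the shard count acts as an upper bound on concurrent writers. The sketch below shows only the shard assignment in plain Java. `ShardAssigner`, `shardFor`, `groupByShard`, and the shard count of 10 are hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Hypothetical helper: maps record keys onto a fixed number of shards. */
public class ShardAssigner {
    private final int numShards;

    public ShardAssigner(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
    }

    /** Returns a shard in [0, numShards); equal keys always get the same shard. */
    public int shardFor(Object recordKey) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(Objects.hashCode(recordKey), numShards);
    }

    /** Groups keys by shard, standing in for what a grouping step would do. */
    public Map<Integer, List<String>> groupByShard(List<String> recordKeys) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String key : recordKeys) {
            groups.computeIfAbsent(shardFor(key), s -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        // 10 is the assumed writer limit from the original question.
        ShardAssigner assigner = new ShardAssigner(10);
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("record-" + i);
        }
        Map<Integer, List<String>> groups = assigner.groupByShard(keys);
        System.out.println("distinct shards: " + groups.size());
    }
}
```

In a Beam pipeline the same assignment would sit in a keying step ahead of a GroupByKey, with the DB write applied to each grouped shard. Whether the bound holds in practice depends on how the runner distributes keys.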

On Wed, May 16, 2018 at 5:48 PM Harshvardhan Agrawal <[email protected]> wrote:


    How do we control parallelism of a particular step then? Is there
    a recommended approach to solve this problem?

    On Wed, May 16, 2018 at 20:45 Chamikara Jayalath
    <[email protected]> wrote:

        I don't think this can be specified through the Beam API, but
        the Flink runner might have additional configuration options
        that I'm not aware of. Also, many runners fuse steps to improve
        execution performance, so simply specifying the parallelism of
        a single step will not work.

        Thanks,
        Cham

        On Tue, May 15, 2018 at 11:21 AM Harshvardhan Agrawal
        <[email protected]> wrote:

            Hi Guys,

            I am currently in the process of developing a pipeline
            using Apache Beam with Flink as an execution engine. As a
            part of the process I read data from Kafka and perform a
            bunch of transformations that involve joins, aggregations
            as well as lookups to an external DB.

            The idea is that we want higher parallelism in Flink while
            we are performing the aggregations, but we eventually want
            to coalesce the data and have fewer processes writing to
            the DB so that the target DB can handle the load (for
            example, say I want a parallelism of 40 for aggregations
            but only 10 when writing to the target DB).

            Is there any way we could do that in Beam?

            Regards,

            Harsh
