Thanks a lot for the hint, Chad (and to all the others who responded to my two posts)

I will take a look at it

Günter


On 28.12.20 22:54, Chad Dombrova wrote:
Sam Bourne’s repo has a lot of tricks for using Flink with docker containers:
https://github.com/sambvfx/beam-flink-k8s

Feel free to make a PR if you find anything has changed.


-Chad


On Mon, Dec 28, 2020 at 9:38 AM Kyle Weaver <[email protected]> wrote:

    Using Docker workers together with local filesystem I/O is not
    recommended, because the Docker workers use their own filesystems
    instead of the host filesystem. See
    https://issues.apache.org/jira/browse/BEAM-5440
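
    For illustration, here is a minimal, untested sketch of one way
    around that mismatch: submit to the remote cluster with an EXTERNAL
    environment, so the SDK harness runs in a worker you start yourself
    (and whose filesystem access you control) instead of in per-task
    Docker containers. The worker pool address below is only a
    placeholder, not something from your setup:

        # Untested sketch, not a drop-in fix: assumes a Beam Python SDK
        # worker pool is already running and reachable at the placeholder
        # address given in --environment_config.
        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        options = PipelineOptions([
            "--runner=FlinkRunner",
            "--flink_version=1.10",
            "--flink_master=localhost:8081",
            "--environment_type=EXTERNAL",
            "--environment_config=localhost:50000",  # placeholder worker pool address
        ])

        # Tiny stand-in pipeline just to exercise the options above.
        with beam.Pipeline(options=options) as p:
            (p
             | beam.Create(["a", "b", "a"])
             | beam.combiners.Count.PerElement()
             | beam.Map(print))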

    On Sun, Dec 27, 2020 at 5:01 AM Günter Hipler <[email protected]> wrote:

        Hi,

        I just tried to start a Beam pipeline on a Flink cluster using

        - the latest published Beam version, 2.26.0
        - the Python SDK
        - a standalone Flink cluster, version 1.10.1
        - the simple pipeline in [1]

        When I start the pipeline in embedded mode it works correctly
        (even pulling a jobmanager docker image):

        python mas_zb_demo_marc_author_count.py \
          --input /swissbib_index/solrDocumentProcessing/FrequentInitialPreProcessing/data/beam/input/job8r1A069.format.xml \
          --output /swissbib_index/solrDocumentProcessing/FrequentInitialPreProcessing/data/beam/output/out.txt \
          --runner=FlinkRunner --flink_version=1.10

        <logs>
        WARNING:root:Make sure that locally built Python SDK docker image
        has Python 3.7 interpreter.
        INFO:root:Using Python SDK docker image:
        apache/beam_python3.7_sdk:2.26.0. If the image is not available at
        local, we will try to pull from hub.docker.com
        </logs>


        I'm using the following Python version:

        python --version
        Python 3.7.9

        Trying to use the remote standalone cluster, the job fails when
        fetching the SDK worker docker image
        (apache/beam_python3.8_sdk:2.26.0):

        python mas_zb_demo_marc_author_count.py \
          --input /swissbib_index/solrDocumentProcessing/FrequentInitialPreProcessing/data/beam/input/job8r1A069.format.xml \
          --output /swissbib_index/solrDocumentProcessing/FrequentInitialPreProcessing/data/beam/output/out.txt \
          --runner=FlinkRunner --flink_version=1.10 \
          --flink_master=localhost:8081

        <logs>

        java.lang.Exception: The user defined 'open()' method caused an
        exception: java.util.concurrent.TimeoutException: Timed out while
        waiting for command 'docker run -d --network=host
        --env=DOCKER_MAC_CONTAINER=null apache/beam_python3.8_sdk:2.26.0
        --id=1-1 --provision_endpoint=localhost:41483'
            at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:499)
            at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
            at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
            at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
            at java.base/java.lang.Thread.run(Thread.java:834)

        </logs>

        Then I pulled the apache/beam_python3.8_sdk:2.26.0 image locally
        (docker pull apache/beam_python3.8_sdk:2.26.0) to avoid the
        timeout. That worked: the remote job finished and was shown in
        the Flink dashboard.

        But no result was written into the given --output dir, and I
        couldn't find anything referencing this issue in the Flink logs.
        Additionally, the python process shell which submits the script
        to the cluster produces quite a huge amount of logs [2] - but I
        can't see any reason for the behaviour there either.
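
        For context, here is a heavily simplified sketch of the shape of
        the pipeline in [1] (illustrative only - the real script's
        transforms and IO may differ):

            # Illustrative sketch only, not the actual code from [1].
            import apache_beam as beam
            from apache_beam.options.pipeline_options import PipelineOptions

            with beam.Pipeline(options=PipelineOptions()) as p:
                (p
                 | beam.io.ReadFromText("/swissbib_index/solrDocumentProcessing/FrequentInitialPreProcessing/data/beam/input/job8r1A069.format.xml")
                 | beam.combiners.Count.PerElement()
                 | beam.MapTuple(lambda key, count: f"{key}: {count}")
                 | beam.io.WriteToText("/swissbib_index/solrDocumentProcessing/FrequentInitialPreProcessing/data/beam/output/out.txt"))
            # Note: the WriteToText step executes on the SDK workers, so
            # the --output path has to be visible from wherever those
            # workers actually run.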

        Thanks for any explanation of this behaviour.

        Günter


        [1]
        https://gitlab.com/swissbib/lab/services/jupyter-beam-flink/-/blob/master/mas_zb_demo_marc_author_count.py
        (greps author data from bibliographic library data)

        [2]
        https://gitlab.com/swissbib/lab/services/jupyter-beam-flink/-/blob/master/notes/logging.no.output.txt

