Looks like you ran into a bug.

You could just run your program without specifying any runner arguments; running with Python's default FnApiRunner should be enough.

Alternatively, how about trying to run the same pipeline with the FlinkRunner? Use --runner=FlinkRunner and do not specify an endpoint. It will run the Python code embedded (loopback environment) without additional containers.
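For example (a sketch; I spell out the loopback environment explicitly in case it is not the default in your Beam version):

```
python3 main.py \
  --runner=FlinkRunner \
  --environment_type=LOOPBACK \
  --requirements_file=requirements.txt
```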

Cheers,
Max

On 10.08.20 21:59, Eugene Kirpichov wrote:
Thanks Valentyn!

Good to know that this is a bug (I'll file a bug), and that Dataflow has an experimental way to use custom containers. I'll try that.

On Mon, Aug 10, 2020 at 12:51 PM Valentyn Tymofieiev <valen...@google.com <mailto:valen...@google.com>> wrote:

    Hi Eugene,

    Good to hear from you. The experience you are describing with the
    Portable Runner + Docker containers in local execution mode is most
    certainly a bug. If you have not opened an issue for it, please do
    so and feel free to cc me.

    I can also reproduce the bug, and likewise didn't see anything
    obvious immediately; this needs some debugging.

    cc: +Ankur Goenka <mailto:goe...@google.com> +Kyle Weaver
    <mailto:kcwea...@google.com> who recently worked on Portable Runner
    and may be interested.

    By the way, you should be able to use custom containers with
    Dataflow, if you set --experiments=use_runner_v2.
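    For example (a sketch: the flag names are the ones from the Beam/Dataflow
    docs of this era, and the container image path is a placeholder):

```
python3 main.py \
  --runner=DataflowRunner \
  --experiments=use_runner_v2 \
  --worker_harness_container_image=gcr.io/my-project/my-beam-worker:latest \
  --requirements_file=requirements.txt
```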

    On Mon, Aug 10, 2020 at 9:06 AM Eugene Kirpichov
    <ekirpic...@gmail.com <mailto:ekirpic...@gmail.com>> wrote:

        (cc'ing Sam with whom I'm working on this atm)

        FWIW I'm still stumped. I've looked through Python, Go and Java
        code in the Beam repo having anything to do with
        gzipping/unzipping, and none of it appears to be used in the
        artifact staging/retrieval codepaths. I also can't find any
        mention of compression/decompression in the container boot code.
        My next step will be to add a bunch of debugging, rebuild the
        containers, and see what the artifact services think they're
        serving.


        On Fri, Aug 7, 2020 at 6:47 PM Eugene Kirpichov
        <ekirpic...@gmail.com <mailto:ekirpic...@gmail.com>> wrote:

            Thanks Austin! Good stuff - though note that I am
            /not/ using custom containers, I'm just trying to get the
            basic stuff to work, a Python pipeline with a simple
            requirements.txt file. Feels like this should work
            out-of-the-box, I must be doing something wrong.

            On Fri, Aug 7, 2020 at 6:38 PM Austin Bennett
            <whatwouldausti...@gmail.com
            <mailto:whatwouldausti...@gmail.com>> wrote:

                I believe only @OrielResearch Eila Arich-Landkof
                <mailto:e...@orielresearch.org> is potentially doing
                applied work with custom containers (there must be others)!

                As a plug for her and @BeamSummit: I think enough related
                material will be covered (with Conda specifics) in -->
                https://2020.beamsummit.org/sessions/workshop-using-conda-on-beam/

                I'm sure others will have more things to say that are
                actually helpful, on-list, before that occurs (~3 weeks).



                On Fri, Aug 7, 2020 at 6:32 PM Eugene Kirpichov
                <ekirpic...@gmail.com <mailto:ekirpic...@gmail.com>> wrote:

                    Hi old Beam friends,

                    I left Google to work on climate change
                    
<https://www.linkedin.com/posts/eugenekirpichov_i-am-leaving-google-heres-a-snipped-to-activity-6683408492444962816-Mw5U>
                    and am now doing a short engagement with Pachama
                    <https://pachama.com/>. Right now I'm trying to get
                    a Beam Python pipeline to work; the pipeline will
                    use fancy requirements and native dependencies, and
                    we plan to run it on Cloud Dataflow (where custom
                    containers are not yet an option), so I'm going
                    straight for the direct PortableRunner as per
                    https://beam.apache.org/documentation/runtime/environments/.

                    Basically I can't get a minimal Beam program with a
                    minimal requirements.txt file to work - the .tar.gz
                    of the dependency mysteriously ends up being
                    ungzipped and non-installable inside the Docker
                    container running the worker. Details below.

                    === main.py ===
                    import argparse
                    import logging

                    import apache_beam as beam
                    from apache_beam.options.pipeline_options import PipelineOptions
                    from apache_beam.options.pipeline_options import SetupOptions


                    def run(argv=None):
                        parser = argparse.ArgumentParser()
                        known_args, pipeline_args = parser.parse_known_args(argv)

                        pipeline_options = PipelineOptions(pipeline_args)
                        pipeline_options.view_as(SetupOptions).save_main_session = True

                        with beam.Pipeline(options=pipeline_options) as p:
                            (p | 'Create' >> beam.Create(['Hello'])
                               | 'Write' >> beam.io.WriteToText('/tmp'))


                    if __name__ == '__main__':
                        logging.getLogger().setLevel(logging.INFO)
                        run()

                    === requirements.txt ===
                    alembic

                    When I run the program:
                    $ python3 main.py --runner=PortableRunner --job_endpoint=embed --requirements_file=requirements.txt


                    I get some normal output and then:

                    
INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:
  File "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py", line 261, in unpack_file
    untar_file(filename, location)
  File "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py", line 177, in untar_file
    tar = tarfile.open(filename, mode)
  File "/usr/local/lib/python3.7/tarfile.py", line 1591, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/usr/local/lib/python3.7/tarfile.py", line 1648, in gzopen
    raise ReadError("not a gzip file")
tarfile.ReadError: not a gzip file
2020/08/08 01:17:07 Failed to install required packages: failed to install requirements: exit status 2
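                    What pip is doing there is easy to reproduce in
                    isolation: the .tar.gz name makes it pick an explicit
                    gzip mode, so a tar that has already been decompressed
                    (but kept its name) fails in exactly this way. A
                    minimal stdlib sketch:

```python
import io
import os
import tarfile
import tempfile

# Build a plain (uncompressed) tar archive but give it a .tar.gz name,
# mimicking what ended up in /tmp/staged on the worker.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'pkg-1.0.tar.gz')
with tarfile.open(path, 'w') as tar:  # mode 'w' = uncompressed tar
    data = b'print("hello")\n'
    info = tarfile.TarInfo('pkg-1.0/setup.py')
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# pip's untar_file chooses mode 'r:gz' based on the .gz suffix, which is
# exactly the call that fails in the traceback above:
try:
    tarfile.open(path, 'r:gz')
except tarfile.ReadError as e:
    print(e)  # not a gzip file
```

                    So the error is consistent with the archive having been
                    decompressed somewhere between staging and the container.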

                    This greatly puzzled me and, after some looking, I
                    found something really surprising. Here is the
                    package in the /directory to be staged/:

                    $ file /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
                    ...: gzip compressed data, was "dist/alembic-1.4.2.tar", last modified: Thu Mar 19 21:48:31 2020, max compression, original size modulo 2^32 4730880
                    $ ls -l /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
                    -rw-r--r--  1 jkff  staff  1092045 Aug  7 16:56 ...

                    So far so good. But here is the same file inside the
                    Docker container (I ssh'd into the dead container
                    
<https://thorsten-hans.com/how-to-run-commands-in-stopped-docker-containers>):

                    # file /tmp/staged/alembic-1.4.2.tar.gz
                    /tmp/staged/alembic-1.4.2.tar.gz: POSIX tar archive (GNU)
                    # ls -l /tmp/staged/alembic-1.4.2.tar.gz
                    -rwxr-xr-x 1 root root 4730880 Aug  8 01:17 /tmp/staged/alembic-1.4.2.tar.gz
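                    A stdlib way to double-check what a staged artifact
                    really is, independent of the file utility (a sketch;
                    the describe_archive helper is mine, not Beam's):

```python
import tarfile

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of any gzip stream (RFC 1952)

def describe_archive(path):
    """Return 'gzip', 'tar', or 'unknown' for the file at path."""
    with open(path, 'rb') as f:
        magic = f.read(2)
    if magic == GZIP_MAGIC:
        return 'gzip'
    if tarfile.is_tarfile(path):  # readable as a plain, uncompressed tar
        return 'tar'
    return 'unknown'
```

                    Here it would return 'tar' for the container's copy,
                    despite the .tar.gz name.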

                    The file has clearly been gunzipped in transit, and now
                    of course pip can't install it! What's going on here?
                    Am I using the direct/portable runner combination wrong?

                    Thanks!

                    --
                    Eugene Kirpichov
                    http://www.linkedin.com/in/eugenekirpichov









