Hi Eugene,

Good to hear from you. The behavior you are describing with Portable Runner
+ Docker container in local execution mode is most certainly a bug. If you
have not opened an issue for it yet, please do so and feel free to cc me.

I can also reproduce the bug and likewise didn't see anything obvious
immediately; this needs some debugging.
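
For anyone who wants to compare the file before and after staging, here is
a minimal sketch of a helper that tells a gzip stream from a bare tar
archive by looking at the leading bytes. The name describe_archive is made
up for illustration and is not part of Beam; the check relies only on the
gzip magic bytes (1f 8b) and the "ustar" marker at offset 257 of a tar
header.

```python
import sys

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream
TAR_MAGIC_OFFSET = 257     # position of the "ustar" marker in a tar header
TAR_MAGIC = b"ustar"


def describe_archive(path):
    """Return 'gzip', 'tar', or 'unknown' based on the file's leading bytes.

    Hypothetical debugging helper (not a Beam API): it only inspects the
    header and does not validate the rest of the archive.
    """
    with open(path, "rb") as f:
        head = f.read(TAR_MAGIC_OFFSET + len(TAR_MAGIC))
    if head[:2] == GZIP_MAGIC:
        return "gzip"
    if head[TAR_MAGIC_OFFSET:TAR_MAGIC_OFFSET + len(TAR_MAGIC)] == TAR_MAGIC:
        return "tar"
    return "unknown"


if __name__ == "__main__":
    # Usage: python describe_archive.py FILE [FILE ...]
    for p in sys.argv[1:]:
        print(p, describe_archive(p))
```

If the behavior matches what Eugene reported, the copy in the local
requirements cache should come back as gzip and the copy under /tmp/staged
in the container as tar.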

cc: +Ankur Goenka <[email protected]> +Kyle Weaver <[email protected]> who
recently worked on Portable Runner and may be interested.

By the way, you should be able to use custom containers with Dataflow, if
you set --experiments=use_runner_v2.

On Mon, Aug 10, 2020 at 9:06 AM Eugene Kirpichov <[email protected]>
wrote:

> (cc'ing Sam with whom I'm working on this atm)
>
> FWIW I'm still stumped. I've looked through Python, Go and Java code in
> the Beam repo having anything to do with gzipping/unzipping, and none of it
> appears to be used in the artifact staging/retrieval codepaths. I also
> can't find any mention of compression/decompression in the container boot
> code. My next step will be to add a bunch of debugging, rebuild the
> containers, and see what the artifact services think they're serving.
>
>
> On Fri, Aug 7, 2020 at 6:47 PM Eugene Kirpichov <[email protected]>
> wrote:
>
>> Thanks Austin! Good stuff - though note that I am *not* using custom
>> containers, I'm just trying to get the basic stuff to work, a Python
>> pipeline with a simple requirements.txt file. Feels like this should work
>> out-of-the-box, I must be doing something wrong.
>>
>> On Fri, Aug 7, 2020 at 6:38 PM Austin Bennett <
>> [email protected]> wrote:
>>
>>> The only person I believe is potentially doing applied work with custom
>>> containers is @OrielResearch Eila Arich-Landkof <[email protected]>
>>> (there must be others)!
>>>
>>> As a plug for her and @BeamSummit: I think enough related material
>>> (with Conda specifics) will be covered in
>>> https://2020.beamsummit.org/sessions/workshop-using-conda-on-beam/
>>>
>>> I'm sure others will have more things to say that are actually helpful,
>>> on-list, before that occurs (~3 weeks).
>>>
>>>
>>>
>>> On Fri, Aug 7, 2020 at 6:32 PM Eugene Kirpichov <[email protected]>
>>> wrote:
>>>
>>>> Hi old Beam friends,
>>>>
>>>> I left Google to work on climate change
>>>> <https://www.linkedin.com/posts/eugenekirpichov_i-am-leaving-google-heres-a-snipped-to-activity-6683408492444962816-Mw5U>
>>>> and am now doing a short engagement with Pachama <https://pachama.com/>.
>>>> Right now I'm trying to get a Beam Python pipeline to work; the pipeline
>>>> will use fancy requirements and native dependencies, and we plan to run it
>>>> on Cloud Dataflow (so custom containers are not yet an option), so I'm
>>>> going straight for the direct PortableRunner as per
>>>> https://beam.apache.org/documentation/runtime/environments/.
>>>>
>>>> Basically I can't get a minimal Beam program with a minimal
>>>> requirements.txt file to work - the .tar.gz of the dependency mysteriously
>>>> ends up being ungzipped and non-installable inside the Docker container
>>>> running the worker. Details below.
>>>>
>>>> === main.py ===
>>>> import argparse
>>>> import logging
>>>>
>>>> import apache_beam as beam
>>>> from apache_beam.options.pipeline_options import PipelineOptions
>>>> from apache_beam.options.pipeline_options import SetupOptions
>>>>
>>>> def run(argv=None):
>>>>     parser = argparse.ArgumentParser()
>>>>     known_args, pipeline_args = parser.parse_known_args(argv)
>>>>
>>>>     pipeline_options = PipelineOptions(pipeline_args)
>>>>     pipeline_options.view_as(SetupOptions).save_main_session = True
>>>>
>>>>     with beam.Pipeline(options=pipeline_options) as p:
>>>>         (p | 'Create' >> beam.Create(['Hello'])
>>>>            | 'Write' >> beam.io.WriteToText('/tmp'))
>>>>
>>>>
>>>> if __name__ == '__main__':
>>>>     logging.getLogger().setLevel(logging.INFO)
>>>>     run()
>>>>
>>>> === requirements.txt ===
>>>> alembic
>>>>
>>>> When I run the program:
>>>> $ python3 main.py --runner=PortableRunner --job_endpoint=embed --requirements_file=requirements.txt
>>>>
>>>>
>>>> I get some normal output and then:
>>>>
>>>> INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'
>>>>  File
>>>> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
>>>> line 261, in unpack_file\n    untar_file(filename, location)\n  File
>>>> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
>>>> line 177, in untar_file\n    tar = tarfile.open(filename, mode)\n  File
>>>> "/usr/local/lib/python3.7/tarfile.py", line 1591, in open\n    return
>>>> func(name, filemode, fileobj, **kwargs)\n  File
>>>> "/usr/local/lib/python3.7/tarfile.py", line 1648, in gzopen\n    raise
>>>> ReadError("not a gzip file")\ntarfile.ReadError: not a gzip
>>>> file\n2020/08/08 01:17:07 Failed to install required packages: failed to
>>>> install requirements: exit status 2\n'
>>>>
>>>> This greatly puzzled me and, after some looking, I found something
>>>> really surprising. Here is the package in the *directory to be staged*:
>>>>
>>>> $ file
>>>> /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
>>>> ...: gzip compressed data, was "dist/alembic-1.4.2.tar", last modified:
>>>> Thu Mar 19 21:48:31 2020, max compression, original size modulo 2^32 
>>>> 4730880
>>>> $ ls -l
>>>> /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
>>>> -rw-r--r--  1 jkff  staff  1092045 Aug  7 16:56 ...
>>>>
>>>> So far so good. But here is the same file inside the Docker container
>>>> (I ssh'd into the dead container
>>>> <https://thorsten-hans.com/how-to-run-commands-in-stopped-docker-containers>
>>>> ):
>>>>
>>>> # file /tmp/staged/alembic-1.4.2.tar.gz
>>>> /tmp/staged/alembic-1.4.2.tar.gz: POSIX tar archive (GNU)
>>>> # ls -l /tmp/staged/alembic-1.4.2.tar.gz
>>>> -rwxr-xr-x 1 root root 4730880 Aug  8 01:17
>>>> /tmp/staged/alembic-1.4.2.tar.gz
>>>>
>>>> The file has clearly been unzipped and now of course pip can't install
>>>> it! What's going on here? Am I using the direct/portable runner combination
>>>> wrong?
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Eugene Kirpichov
>>>> http://www.linkedin.com/in/eugenekirpichov
>>>>
>>>
>>
>> --
>> Eugene Kirpichov
>> http://www.linkedin.com/in/eugenekirpichov
>>
>
>
> --
> Eugene Kirpichov
> http://www.linkedin.com/in/eugenekirpichov
>
