Hi Eugene, Good to hear from you. The experience you are describing on Portable Runner + Docker container in local execution mode is most certainly a bug, if you have not opened an issue on it, please do so and feel free to cc me.
I can also reproduce the bug and likewise didn't see anything obvious immediately, this needs some debugging. cc: +Ankur Goenka <[email protected]> +Kyle Weaver <[email protected]> who recently worked on Portable Runner and may be interested. By the way, you should be able to use custom containers with Dataflow, if you set --experiments=use_runner_v2. On Mon, Aug 10, 2020 at 9:06 AM Eugene Kirpichov <[email protected]> wrote: > (cc'ing Sam with whom I'm working on this atm) > > FWIW I'm still stumped. I've looked through Python, Go and Java code in > the Beam repo having anything to do with gzipping/unzipping, and none of it > appears to be used in the artifact staging/retrieval codepaths. I also > can't find any mention of compression/decompression in the container boot > code. My next step will be to add a bunch of debugging, rebuild the > containers, and see what the artifact services think they're serving. > > > On Fri, Aug 7, 2020 at 6:47 PM Eugene Kirpichov <[email protected]> > wrote: > >> Thanks Austin! Good stuff - though note that I am *not* using custom >> containers, I'm just trying to get the basic stuff to work, a Python >> pipeline with a simple requirements.txt file. Feels like this should work >> out-of-the-box, I must be doing something wrong. >> >> On Fri, Aug 7, 2020 at 6:38 PM Austin Bennett < >> [email protected]> wrote: >> >>> I only believe @OrielResearch Eila Arich-Landkof >>> <[email protected]> potentially doing applied work with custom >>> containers (there must be others)! >>> >>> For a plug for her and @BeamSummit -- I think enough related will be >>> talked about in (with Conda specifics) --> >>> https://2020.beamsummit.org/sessions/workshop-using-conda-on-beam/ >>> >>> I'm sure others will have more things to say that are actually helpful, >>> on-list, before that occurs (~3 weeks). >>> >>> >>> >>> On Fri, Aug 7, 2020 at 6:32 PM Eugene Kirpichov <[email protected]> >>> wrote: >>> >>>> Hi old Beam friends, >>>> >>>> I left Google to work on climate change >>>> <https://www.linkedin.com/posts/eugenekirpichov_i-am-leaving-google-heres-a-snipped-to-activity-6683408492444962816-Mw5U> >>>> and am now doing a short engagement with Pachama <https://pachama.com/>. >>>> Right now I'm trying to get a Beam Python pipeline to work; the pipeline >>>> will use fancy requirements and native dependencies, and we plan to run it >>>> on Cloud Dataflow (so custom containers are not yet an option), so I'm >>>> going straight for the direct PortableRunner as per >>>> https://beam.apache.org/documentation/runtime/environments/. >>>> >>>> Basically I can't get a minimal Beam program with a minimal >>>> requirements.txt file to work - the .tar.gz of the dependency mysteriously >>>> ends up being ungzipped and non-installable inside the Docker container >>>> running the worker. Details below. >>>> >>>> === main.py === >>>> import argparse >>>> import logging >>>> >>>> import apache_beam as beam >>>> from apache_beam.options.pipeline_options import PipelineOptions >>>> from apache_beam.options.pipeline_options import SetupOptions >>>> >>>> def run(argv=None): >>>> parser = argparse.ArgumentParser() >>>> known_args, pipeline_args = parser.parse_known_args(argv) >>>> >>>> pipeline_options = PipelineOptions(pipeline_args) >>>> pipeline_options.view_as(SetupOptions).save_main_session = True >>>> >>>> with beam.Pipeline(options=pipeline_options) as p: >>>> (p | 'Create' >> beam.Create(['Hello']) >>>> | 'Write' >> beam.io.WriteToText('/tmp')) >>>> >>>> >>>> if __name__ == '__main__': >>>> logging.getLogger().setLevel(logging.INFO) >>>> run() >>>> >>>> === requirements.txt === >>>> alembic >>>> >>>> When I run the program: >>>> $ python3 main.py >>>> --runner=PortableRunner --job_endpoint=embed >>>> --requirements_file=requirements.txt >>>> >>>> >>>> I get some normal output and then: >>>> >>>> INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b' >>>> File >>>> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py", >>>> line 261, in unpack_file\n untar_file(filename, location)\n File >>>> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py", >>>> line 177, in untar_file\n tar = tarfile.open(filename, mode)\n File >>>> "/usr/local/lib/python3.7/tarfile.py", line 1591, in open\n return >>>> func(name, filemode, fileobj, **kwargs)\n File >>>> "/usr/local/lib/python3.7/tarfile.py", line 1648, in gzopen\n raise >>>> ReadError("not a gzip file")\ntarfile.ReadError: not a gzip >>>> file\n2020/08/08 01:17:07 Failed to install required packages: failed to >>>> install requirements: exit status 2\n' >>>> >>>> This greatly puzzled me and, after some looking, I found something >>>> really surprising. Here is the package in the *directory to be staged*: >>>> >>>> $ file >>>> /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz >>>> ...: gzip compressed data, was "dist/alembic-1.4.2.tar", last modified: >>>> Thu Mar 19 21:48:31 2020, max compression, original size modulo 2^32 >>>> 4730880 >>>> $ ls -l >>>> /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz >>>> -rw-r--r-- 1 jkff staff 1092045 Aug 7 16:56 ... >>>> >>>> So far so good. But here is the same file inside the Docker container >>>> (I ssh'd into the dead container >>>> <https://thorsten-hans.com/how-to-run-commands-in-stopped-docker-containers> >>>> ): >>>> >>>> # file /tmp/staged/alembic-1.4.2.tar.gz >>>> /tmp/staged/alembic-1.4.2.tar.gz: POSIX tar archive (GNU) >>>> # ls -l /tmp/staged/alembic-1.4.2.tar.gz >>>> -rwxr-xr-x 1 root root 4730880 Aug 8 01:17 >>>> /tmp/staged/alembic-1.4.2.tar.gz >>>> >>>> The file has clearly been unzipped and now of course pip can't install >>>> it! What's going on here? Am I using the direct/portable runner combination >>>> wrong? >>>> >>>> Thanks! >>>> >>>> -- >>>> Eugene Kirpichov >>>> http://www.linkedin.com/in/eugenekirpichov >>>> >>> >> >> -- >> Eugene Kirpichov >> http://www.linkedin.com/in/eugenekirpichov >> > > > -- > Eugene Kirpichov > http://www.linkedin.com/in/eugenekirpichov >
