(cc'ing Sam with whom I'm working on this atm) FWIW I'm still stumped. I've looked through Python, Go and Java code in the Beam repo having anything to do with gzipping/unzipping, and none of it appears to be used in the artifact staging/retrieval codepaths. I also can't find any mention of compression/decompression in the container boot code. My next step will be to add a bunch of debugging, rebuild the containers, and see what the artifact services think they're serving.
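For concreteness, the debugging I have in mind mostly boils down to checking the magic bytes of each staged artifact on both sides of the transfer. A quick sketch (the helper name is mine, not anything in Beam):

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)

def describe_artifact(path):
    """Report whether a staged artifact still looks gzipped.

    Compares the file's first two bytes against the gzip signature and
    reports the on-disk size, so the same file can be compared before
    staging and inside the container.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    size = os.path.getsize(path)
    if magic == GZIP_MAGIC:
        return "gzip, %d bytes on disk" % size
    return "NOT gzip (magic %r), %d bytes on disk" % (magic, size)
```

Logging this for alembic-1.4.2.tar.gz at each hop in the staging/retrieval path should show exactly where the gzip wrapper disappears (note the sizes in the original report: 1092045 bytes locally vs. 4730880, the decompressed size, in the container).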
On Fri, Aug 7, 2020 at 6:47 PM Eugene Kirpichov <[email protected]> wrote:

> Thanks Austin! Good stuff - though note that I am *not* using custom
> containers, I'm just trying to get the basic stuff to work, a Python
> pipeline with a simple requirements.txt file. Feels like this should work
> out-of-the-box, I must be doing something wrong.
>
> On Fri, Aug 7, 2020 at 6:38 PM Austin Bennett <[email protected]>
> wrote:
>
>> I only believe @OrielResearch Eila Arich-Landkof <[email protected]>
>> potentially
>> doing applied work with custom containers (there must be others)!
>>
>> For a plug for her and @BeamSummit -- I think enough related will be
>> talked about in (with Conda specifics) -->
>> https://2020.beamsummit.org/sessions/workshop-using-conda-on-beam/
>>
>> I'm sure others will have more things to say that are actually helpful,
>> on-list, before that occurs (~3 weeks).
>>
>>
>> On Fri, Aug 7, 2020 at 6:32 PM Eugene Kirpichov <[email protected]>
>> wrote:
>>
>>> Hi old Beam friends,
>>>
>>> I left Google to work on climate change
>>> <https://www.linkedin.com/posts/eugenekirpichov_i-am-leaving-google-heres-a-snipped-to-activity-6683408492444962816-Mw5U>
>>> and am now doing a short engagement with Pachama <https://pachama.com/>.
>>> Right now I'm trying to get a Beam Python pipeline to work; the pipeline
>>> will use fancy requirements and native dependencies, and we plan to run it
>>> on Cloud Dataflow (so custom containers are not yet an option), so I'm
>>> going straight for the direct PortableRunner as per
>>> https://beam.apache.org/documentation/runtime/environments/.
>>>
>>> Basically I can't get a minimal Beam program with a minimal
>>> requirements.txt file to work - the .tar.gz of the dependency mysteriously
>>> ends up being ungzipped and non-installable inside the Docker container
>>> running the worker. Details below.
>>>
>>> === main.py ===
>>> import argparse
>>> import logging
>>>
>>> import apache_beam as beam
>>> from apache_beam.options.pipeline_options import PipelineOptions
>>> from apache_beam.options.pipeline_options import SetupOptions
>>>
>>> def run(argv=None):
>>>   parser = argparse.ArgumentParser()
>>>   known_args, pipeline_args = parser.parse_known_args(argv)
>>>
>>>   pipeline_options = PipelineOptions(pipeline_args)
>>>   pipeline_options.view_as(SetupOptions).save_main_session = True
>>>
>>>   with beam.Pipeline(options=pipeline_options) as p:
>>>     (p | 'Create' >> beam.Create(['Hello'])
>>>        | 'Write' >> beam.io.WriteToText('/tmp'))
>>>
>>>
>>> if __name__ == '__main__':
>>>   logging.getLogger().setLevel(logging.INFO)
>>>   run()
>>>
>>> === requirements.txt ===
>>> alembic
>>>
>>> When I run the program:
>>> $ python3 main.py
>>>     --runner=PortableRunner --job_endpoint=embed
>>>     --requirements_file=requirements.txt
>>>
>>> I get some normal output and then:
>>>
>>> INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b' File
>>> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
>>> line 261, in unpack_file\n untar_file(filename, location)\n File
>>> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
>>> line 177, in untar_file\n tar = tarfile.open(filename, mode)\n File
>>> "/usr/local/lib/python3.7/tarfile.py", line 1591, in open\n return
>>> func(name, filemode, fileobj, **kwargs)\n File
>>> "/usr/local/lib/python3.7/tarfile.py", line 1648, in gzopen\n raise
>>> ReadError("not a gzip file")\ntarfile.ReadError: not a gzip
>>> file\n2020/08/08 01:17:07 Failed to install required packages: failed to
>>> install requirements: exit status 2\n'
>>>
>>> This greatly puzzled me and, after some looking, I found something
>>> really surprising.
>>> Here is the package in the *directory to be staged*:
>>>
>>> $ file
>>> /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
>>> ...: gzip compressed data, was "dist/alembic-1.4.2.tar", last modified:
>>> Thu Mar 19 21:48:31 2020, max compression, original size modulo 2^32 4730880
>>> $ ls -l
>>> /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
>>> -rw-r--r-- 1 jkff staff 1092045 Aug 7 16:56 ...
>>>
>>> So far so good. But here is the same file inside the Docker container
>>> (I ssh'd into the dead container
>>> <https://thorsten-hans.com/how-to-run-commands-in-stopped-docker-containers>):
>>>
>>> # file /tmp/staged/alembic-1.4.2.tar.gz
>>> /tmp/staged/alembic-1.4.2.tar.gz: POSIX tar archive (GNU)
>>> # ls -l /tmp/staged/alembic-1.4.2.tar.gz
>>> -rwxr-xr-x 1 root root 4730880 Aug 8 01:17
>>> /tmp/staged/alembic-1.4.2.tar.gz
>>>
>>> The file has clearly been unzipped and now of course pip can't install
>>> it! What's going on here? Am I using the direct/portable runner combination
>>> wrong?
>>>
>>> Thanks!
>>>
>>> --
>>> Eugene Kirpichov
>>> http://www.linkedin.com/in/eugenekirpichov
>>>
>>
>
> --
> Eugene Kirpichov
> http://www.linkedin.com/in/eugenekirpichov
>

--
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
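Side note: the "not a gzip file" error in the worker log is exactly what `tarfile` raises when a gzip read mode is applied to a plain (already-decompressed) tar, which is consistent with the `file` output above. A minimal standalone reproduction of that failure mode:

```python
import io
import tarfile

# Build a plain (uncompressed) tar archive in memory -- the state the
# staged file apparently ends up in inside the container.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# pip's unpacking code opens *.tar.gz names with a gzip mode; on a plain
# tar this raises tarfile.ReadError, matching the worker traceback.
try:
    tarfile.open(fileobj=buf, mode="r:gz")
except tarfile.ReadError as e:
    print(e)  # not a gzip file
```

So the error message itself is a red herring about pip: pip is behaving correctly, and the archive really has been decompressed somewhere between staging and the container.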
