Thanks Austin! Good stuff - though note that I am *not* using custom containers, I'm just trying to get the basic stuff to work, a Python pipeline with a simple requirements.txt file. Feels like this should work out-of-the-box, I must be doing something wrong.
On Fri, Aug 7, 2020 at 6:38 PM Austin Bennett <[email protected]> wrote: > I only believe @OrielResearch Eila Arich-Landkof <[email protected]> > potentially > doing applied work with custom containers (there must be others)! > > For a plug for her and @BeamSummit -- I think enough related will be > talked about in (with Conda specifics) --> > https://2020.beamsummit.org/sessions/workshop-using-conda-on-beam/ > > I'm sure others will have more things to say that are actually helpful, > on-list, before that occurs (~3 weeks). > > > > On Fri, Aug 7, 2020 at 6:32 PM Eugene Kirpichov <[email protected]> > wrote: > >> Hi old Beam friends, >> >> I left Google to work on climate change >> <https://www.linkedin.com/posts/eugenekirpichov_i-am-leaving-google-heres-a-snipped-to-activity-6683408492444962816-Mw5U> >> and am now doing a short engagement with Pachama <https://pachama.com/>. >> Right now I'm trying to get a Beam Python pipeline to work; the pipeline >> will use fancy requirements and native dependencies, and we plan to run it >> on Cloud Dataflow (so custom containers are not yet an option), so I'm >> going straight for the direct PortableRunner as per >> https://beam.apache.org/documentation/runtime/environments/. >> >> Basically I can't get a minimal Beam program with a minimal >> requirements.txt file to work - the .tar.gz of the dependency mysteriously >> ends up being ungzipped and non-installable inside the Docker container >> running the worker. Details below. >> >> === main.py === >> import argparse >> import logging >> >> import apache_beam as beam >> from apache_beam.options.pipeline_options import PipelineOptions >> from apache_beam.options.pipeline_options import SetupOptions >> >> def run(argv=None): >> parser = argparse.ArgumentParser() >> known_args, pipeline_args = parser.parse_known_args(argv) >> >> pipeline_options = PipelineOptions(pipeline_args) >> pipeline_options.view_as(SetupOptions).save_main_session = True >> >> with beam.Pipeline(options=pipeline_options) as p: >> (p | 'Create' >> beam.Create(['Hello']) >> | 'Write' >> beam.io.WriteToText('/tmp')) >> >> >> if __name__ == '__main__': >> logging.getLogger().setLevel(logging.INFO) >> run() >> >> === requirements.txt === >> alembic >> >> When I run the program: >> $ python3 main.py >> --runner=PortableRunner --job_endpoint=embed >> --requirements_file=requirements.txt >> >> >> I get some normal output and then: >> >> INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b' >> File >> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py", >> line 261, in unpack_file\n untar_file(filename, location)\n File >> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py", >> line 177, in untar_file\n tar = tarfile.open(filename, mode)\n File >> "/usr/local/lib/python3.7/tarfile.py", line 1591, in open\n return >> func(name, filemode, fileobj, **kwargs)\n File >> "/usr/local/lib/python3.7/tarfile.py", line 1648, in gzopen\n raise >> ReadError("not a gzip file")\ntarfile.ReadError: not a gzip >> file\n2020/08/08 01:17:07 Failed to install required packages: failed to >> install requirements: exit status 2\n' >> >> This greatly puzzled me and, after some looking, I found something really >> surprising. Here is the package in the *directory to be staged*: >> >> $ file >> /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz >> ...: gzip compressed data, was "dist/alembic-1.4.2.tar", last modified: >> Thu Mar 19 21:48:31 2020, max compression, original size modulo 2^32 4730880 >> $ ls -l >> /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz >> -rw-r--r-- 1 jkff staff 1092045 Aug 7 16:56 ... >> >> So far so good. But here is the same file inside the Docker container (I >> ssh'd >> into the dead container >> <https://thorsten-hans.com/how-to-run-commands-in-stopped-docker-containers> >> ): >> >> # file /tmp/staged/alembic-1.4.2.tar.gz >> /tmp/staged/alembic-1.4.2.tar.gz: POSIX tar archive (GNU) >> # ls -l /tmp/staged/alembic-1.4.2.tar.gz >> -rwxr-xr-x 1 root root 4730880 Aug 8 01:17 >> /tmp/staged/alembic-1.4.2.tar.gz >> >> The file has clearly been unzipped and now of course pip can't install >> it! What's going on here? Am I using the direct/portable runner combination >> wrong? >> >> Thanks! >> >> -- >> Eugene Kirpichov >> http://www.linkedin.com/in/eugenekirpichov >> > -- Eugene Kirpichov http://www.linkedin.com/in/eugenekirpichov
