Thanks Valentyn! Good to know that this is a bug (I'll file a bug), and that Dataflow has an experimental way to use custom containers. I'll try that.
On Mon, Aug 10, 2020 at 12:51 PM Valentyn Tymofieiev <[email protected]> wrote:

> Hi Eugene,
>
> Good to hear from you. The experience you are describing on Portable
> Runner + Docker container in local execution mode is most certainly a bug,
> if you have not opened an issue on it, please do so and feel free to cc me.
>
> I can also reproduce the bug and likewise didn't see anything obvious
> immediately, this needs some debugging.
>
> cc: +Ankur Goenka <[email protected]> +Kyle Weaver <[email protected]> who
> recently worked on Portable Runner and may be interested.
>
> By the way, you should be able to use custom containers with Dataflow, if
> you set --experiments=use_runner_v2.
>
> On Mon, Aug 10, 2020 at 9:06 AM Eugene Kirpichov <[email protected]>
> wrote:
>
>> (cc'ing Sam with whom I'm working on this atm)
>>
>> FWIW I'm still stumped. I've looked through Python, Go and Java code in
>> the Beam repo having anything to do with gzipping/unzipping, and none of it
>> appears to be used in the artifact staging/retrieval codepaths. I also
>> can't find any mention of compression/decompression in the container boot
>> code. My next step will be to add a bunch of debugging, rebuild the
>> containers, and see what the artifact services think they're serving.
>>
>>
>> On Fri, Aug 7, 2020 at 6:47 PM Eugene Kirpichov <[email protected]>
>> wrote:
>>
>>> Thanks Austin! Good stuff - though note that I am *not* using custom
>>> containers, I'm just trying to get the basic stuff to work, a Python
>>> pipeline with a simple requirements.txt file. Feels like this should work
>>> out-of-the-box, I must be doing something wrong.
>>>
>>> On Fri, Aug 7, 2020 at 6:38 PM Austin Bennett <
>>> [email protected]> wrote:
>>>
>>>> I only believe @OrielResearch Eila Arich-Landkof
>>>> <[email protected]> potentially doing applied work with custom
>>>> containers (there must be others)!
>>>>
>>>> For a plug for her and @BeamSummit -- I think enough related will be
>>>> talked about in (with Conda specifics) -->
>>>> https://2020.beamsummit.org/sessions/workshop-using-conda-on-beam/
>>>>
>>>> I'm sure others will have more things to say that are actually helpful,
>>>> on-list, before that occurs (~3 weeks).
>>>>
>>>>
>>>>
>>>> On Fri, Aug 7, 2020 at 6:32 PM Eugene Kirpichov <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi old Beam friends,
>>>>>
>>>>> I left Google to work on climate change
>>>>> <https://www.linkedin.com/posts/eugenekirpichov_i-am-leaving-google-heres-a-snipped-to-activity-6683408492444962816-Mw5U>
>>>>> and am now doing a short engagement with Pachama
>>>>> <https://pachama.com/>. Right now I'm trying to get a Beam Python
>>>>> pipeline to work; the pipeline will use fancy requirements and native
>>>>> dependencies, and we plan to run it on Cloud Dataflow (so custom
>>>>> containers are not yet an option), so I'm going straight for the direct
>>>>> PortableRunner as per
>>>>> https://beam.apache.org/documentation/runtime/environments/.
>>>>>
>>>>> Basically I can't get a minimal Beam program with a minimal
>>>>> requirements.txt file to work - the .tar.gz of the dependency mysteriously
>>>>> ends up being ungzipped and non-installable inside the Docker container
>>>>> running the worker. Details below.
>>>>>
>>>>> === main.py ===
>>>>> import argparse
>>>>> import logging
>>>>>
>>>>> import apache_beam as beam
>>>>> from apache_beam.options.pipeline_options import PipelineOptions
>>>>> from apache_beam.options.pipeline_options import SetupOptions
>>>>>
>>>>> def run(argv=None):
>>>>>   parser = argparse.ArgumentParser()
>>>>>   known_args, pipeline_args = parser.parse_known_args(argv)
>>>>>
>>>>>   pipeline_options = PipelineOptions(pipeline_args)
>>>>>   pipeline_options.view_as(SetupOptions).save_main_session = True
>>>>>
>>>>>   with beam.Pipeline(options=pipeline_options) as p:
>>>>>     (p | 'Create' >> beam.Create(['Hello'])
>>>>>        | 'Write' >> beam.io.WriteToText('/tmp'))
>>>>>
>>>>>
>>>>> if __name__ == '__main__':
>>>>>   logging.getLogger().setLevel(logging.INFO)
>>>>>   run()
>>>>>
>>>>> === requirements.txt ===
>>>>> alembic
>>>>>
>>>>> When I run the program:
>>>>> $ python3 main.py --runner=PortableRunner --job_endpoint=embed --requirements_file=requirements.txt
>>>>>
>>>>>
>>>>> I get some normal output and then:
>>>>>
>>>>> INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'
>>>>> File
>>>>> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
>>>>> line 261, in unpack_file\n untar_file(filename, location)\n File
>>>>> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
>>>>> line 177, in untar_file\n tar = tarfile.open(filename, mode)\n File
>>>>> "/usr/local/lib/python3.7/tarfile.py", line 1591, in open\n return
>>>>> func(name, filemode, fileobj, **kwargs)\n File
>>>>> "/usr/local/lib/python3.7/tarfile.py", line 1648, in gzopen\n raise
>>>>> ReadError("not a gzip file")\ntarfile.ReadError: not a gzip
>>>>> file\n2020/08/08 01:17:07 Failed to install required packages: failed to
>>>>> install requirements: exit status 2\n'
>>>>>
>>>>> This greatly puzzled me and, after some looking, I found something
>>>>> really surprising. Here is the package in the *directory to be staged*:
>>>>>
>>>>> $ file /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
>>>>> ...: gzip compressed data, was "dist/alembic-1.4.2.tar", last
>>>>> modified: Thu Mar 19 21:48:31 2020, max compression, original size modulo
>>>>> 2^32 4730880
>>>>> $ ls -l /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
>>>>> -rw-r--r-- 1 jkff staff 1092045 Aug 7 16:56 ...
>>>>>
>>>>> So far so good. But here is the same file inside the Docker container
>>>>> (I ssh'd into the dead container
>>>>> <https://thorsten-hans.com/how-to-run-commands-in-stopped-docker-containers>):
>>>>>
>>>>> # file /tmp/staged/alembic-1.4.2.tar.gz
>>>>> /tmp/staged/alembic-1.4.2.tar.gz: POSIX tar archive (GNU)
>>>>> # ls -l /tmp/staged/alembic-1.4.2.tar.gz
>>>>> -rwxr-xr-x 1 root root 4730880 Aug 8 01:17 /tmp/staged/alembic-1.4.2.tar.gz
>>>>>
>>>>> The file has clearly been unzipped and now of course pip can't install
>>>>> it! What's going on here? Am I using the direct/portable runner
>>>>> combination wrong?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Eugene Kirpichov
>>>>> http://www.linkedin.com/in/eugenekirpichov
>>>>>
>>>>
>>>
>>> --
>>> Eugene Kirpichov
>>> http://www.linkedin.com/in/eugenekirpichov
>>>
>>
>>
>> --
>> Eugene Kirpichov
>> http://www.linkedin.com/in/eugenekirpichov
>>
>

--
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
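
For anyone reproducing this: the `file` comparison quoted above (gzip data on the host, plain POSIX tar in the container, with the container's 4730880 bytes matching the uncompressed size that `file` reported on the host) can also be done from Python, which is handy when `file` is not installed in the worker image. The following is only a minimal sketch using the standard library, not anything from the Beam codebase. `describe_archive` and `make_sample_archives` are hypothetical names introduced for illustration, and the sample files are synthetic stand-ins rather than the real alembic package.

```python
import gzip
import io
import os
import tarfile
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def describe_archive(path):
    """Classify a file as 'gzip', 'tar' (uncompressed), or 'unknown'."""
    with open(path, "rb") as f:
        if f.read(2) == GZIP_MAGIC:
            return "gzip"
    try:
        # Mode "r:" opens a tar archive only if it is NOT compressed.
        with tarfile.open(path, mode="r:"):
            return "tar"
    except tarfile.ReadError:
        return "unknown"


def make_sample_archives(directory):
    """Write a gzipped tar and a plain tar, both named *.tar.gz.

    Hypothetical helper for the demo: the second file mimics the
    misnamed archive observed inside the container.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:") as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="pkg/PKG-INFO")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    raw_tar = buf.getvalue()

    good = os.path.join(directory, "good-1.0.tar.gz")
    bad = os.path.join(directory, "bad-1.0.tar.gz")
    with open(good, "wb") as f:
        f.write(gzip.compress(raw_tar))  # what pip expects
    with open(bad, "wb") as f:
        f.write(raw_tar)  # tar bytes under a .tar.gz name
    return good, bad


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for p in make_sample_archives(d):
            print(os.path.basename(p), describe_archive(p), os.path.getsize(p))
```

Running `describe_archive` on the staged copy on the host and again on the copy under /tmp/staged in the container should show at which hop the gzip layer disappears, assuming Python is available at both ends.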
