Hi old Beam friends,
I left Google to work on climate change
<https://www.linkedin.com/posts/eugenekirpichov_i-am-leaving-google-heres-a-snipped-to-activity-6683408492444962816-Mw5U>
and am now doing a short engagement with Pachama <https://pachama.com/>.
Right now I'm trying to get a Beam Python pipeline to work; the pipeline
will use fancy requirements and native dependencies, and we plan to run it
on Cloud Dataflow (so custom containers are not yet an option), so I'm
going straight for the direct PortableRunner as per
https://beam.apache.org/documentation/runtime/environments/.
Basically I can't get a minimal Beam program with a minimal
requirements.txt file to work - the .tar.gz of the dependency mysteriously
ends up being ungzipped and non-installable inside the Docker container
running the worker. Details below.
=== main.py ===
import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=pipeline_options) as p:
        (p | 'Create' >> beam.Create(['Hello'])
           | 'Write' >> beam.io.WriteToText('/tmp'))


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
=== requirements.txt ===
alembic
When I run the program:
$ python3 main.py --runner=PortableRunner --job_endpoint=embed --requirements_file=requirements.txt
I get some normal output and then:
INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b' File
"/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
line 261, in unpack_file\n untar_file(filename, location)\n File
"/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
line 177, in untar_file\n tar = tarfile.open(filename, mode)\n File
"/usr/local/lib/python3.7/tarfile.py", line 1591, in open\n return
func(name, filemode, fileobj, **kwargs)\n File
"/usr/local/lib/python3.7/tarfile.py", line 1648, in gzopen\n raise
ReadError("not a gzip file")\ntarfile.ReadError: not a gzip
file\n2020/08/08 01:17:07 Failed to install required packages: failed to
install requirements: exit status 2\n'
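The error itself is easy to reproduce outside Beam. Below is a minimal sketch, assuming only the Python standard library: it writes an uncompressed tar under a .tar.gz name and opens it in "r:gz" mode, which is the gzopen path visible in the traceback. The file name example-1.0.tar.gz and the helper names are hypothetical, chosen for illustration.

```python
import io
import os
import tarfile
import tempfile


def make_plain_tar(path):
    """Write an *uncompressed* tar archive containing one small file."""
    payload = b"hello"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    with tarfile.open(path, mode="w") as tar:  # "w" means no compression
        tar.addfile(info, io.BytesIO(payload))


def try_open_as_gzip(path):
    """Return None if the file opens as a gzipped tar, else the error text."""
    try:
        with tarfile.open(path, mode="r:gz") as tar:
            tar.getnames()
        return None
    except tarfile.ReadError as exc:
        return str(exc)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical name: a plain tar hiding behind a .tar.gz extension.
        fake = os.path.join(tmp, "example-1.0.tar.gz")
        make_plain_tar(fake)
        print(try_open_as_gzip(fake))
```

So the message only says that the bytes on disk are not gzip data; it says nothing about how they got that way.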
This greatly puzzled me and, after some looking, I found something really
surprising. Here is the package in the *directory to be staged*:
$ file /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
...: gzip compressed data, was "dist/alembic-1.4.2.tar", last modified: Thu Mar 19 21:48:31 2020, max compression, original size modulo 2^32 4730880
$ ls -l /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
-rw-r--r-- 1 jkff staff 1092045 Aug 7 16:56 ...
So far so good. But here is the same file inside the Docker container (I ssh'd
into the dead container
<https://thorsten-hans.com/how-to-run-commands-in-stopped-docker-containers>):
# file /tmp/staged/alembic-1.4.2.tar.gz
/tmp/staged/alembic-1.4.2.tar.gz: POSIX tar archive (GNU)
# ls -l /tmp/staged/alembic-1.4.2.tar.gz
-rwxr-xr-x 1 root root 4730880 Aug 8 01:17 /tmp/staged/alembic-1.4.2.tar.gz
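For anyone who wants to repeat this check where `file` is not installed, here is a rough sketch that looks at the magic bytes directly. It assumes the standard markers: gzip data starts with the bytes 1f 8b, and a tar header carries the string "ustar" at offset 257. The function and file names are hypothetical, not part of Beam or pip.

```python
import io
import os
import tarfile
import tempfile

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip stream
TAR_MAGIC_OFFSET = 257      # position of the "ustar" marker in a tar header


def sniff_archive(path):
    """Classify a file as 'gzip', 'tar', or 'unknown' from its leading bytes."""
    with open(path, "rb") as f:
        head = f.read(TAR_MAGIC_OFFSET + 5)
    if head[:2] == GZIP_MAGIC:
        return "gzip"
    if head[TAR_MAGIC_OFFSET:TAR_MAGIC_OFFSET + 5] == b"ustar":
        return "tar"
    return "unknown"


def write_archive(path, mode):
    """Write a one-file archive; mode 'w:gz' is gzipped, 'w' is plain tar."""
    payload = b"hello"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    with tarfile.open(path, mode=mode) as tar:
        tar.addfile(info, io.BytesIO(payload))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Both files carry a .tar.gz name; only the first is really gzipped.
        for name, mode in [("good.tar.gz", "w:gz"), ("bad.tar.gz", "w")]:
            path = os.path.join(tmp, name)
            write_archive(path, mode)
            print(name, "->", sniff_archive(path))
```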
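The file has clearly been gunzipped: its size inside the container (4730880
bytes) matches the "original size" that file reported for the staged archive,
so only the gzip layer was stripped. It still carries a .tar.gz name, so of
course pip can't install it!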
What's going on here? Am I using the direct/portable runner combination
wrong?
Thanks!
--
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov