Thanks Austin! Good stuff, though note that I am *not* using custom
containers; I'm just trying to get the basics to work: a Python
pipeline with a simple requirements.txt file. This feels like it should
work out of the box, so I must be doing something wrong.
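
In case it helps anyone reproduce the symptom described in the quoted
message below, here is a minimal sketch of a check for whether a file named
*.tar.gz is still gzip-compressed. It only looks at the first two bytes and
assumes nothing about Beam itself; the helper name is_gzip and the demo file
names are hypothetical, made up for illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    # Self-contained demo: the same payload written compressed and uncompressed,
    # both under a .tar.gz name (hypothetical stand-ins for a staged package).
    payload = b"stand-in for the tar payload"
    tmpdir = tempfile.mkdtemp()
    good = os.path.join(tmpdir, "good.tar.gz")
    bad = os.path.join(tmpdir, "bad.tar.gz")
    with gzip.open(good, "wb") as f:
        f.write(payload)
    with open(bad, "wb") as f:
        f.write(payload)
    print(is_gzip(good), is_gzip(bad))
```

Running the same check on the file in the local staging directory and on the
copy inside the container should show which side of the staging step the
compression is lost on.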

On Fri, Aug 7, 2020 at 6:38 PM Austin Bennett <[email protected]>
wrote:

> I believe only @OrielResearch Eila Arich-Landkof <[email protected]> is
> potentially doing applied work with custom containers (there must be
> others)!
>
> As a plug for her and @BeamSummit: I think plenty of related material
> (with Conda specifics) will be covered in
> https://2020.beamsummit.org/sessions/workshop-using-conda-on-beam/
>
> I'm sure others will have more things to say that are actually helpful,
> on-list, before that occurs (~3 weeks).
>
>
>
> On Fri, Aug 7, 2020 at 6:32 PM Eugene Kirpichov <[email protected]>
> wrote:
>
>> Hi old Beam friends,
>>
>> I left Google to work on climate change
>> <https://www.linkedin.com/posts/eugenekirpichov_i-am-leaving-google-heres-a-snipped-to-activity-6683408492444962816-Mw5U>
>> and am now doing a short engagement with Pachama <https://pachama.com/>.
>> Right now I'm trying to get a Beam Python pipeline to work; the pipeline
>> will use fancy requirements and native dependencies, and we plan to run it
>> on Cloud Dataflow (so custom containers are not yet an option), so I'm
>> going straight for the direct PortableRunner as per
>> https://beam.apache.org/documentation/runtime/environments/.
>>
>> Basically I can't get a minimal Beam program with a minimal
>> requirements.txt file to work - the .tar.gz of the dependency mysteriously
>> ends up being ungzipped and non-installable inside the Docker container
>> running the worker. Details below.
>>
>> === main.py ===
>> import argparse
>> import logging
>>
>> import apache_beam as beam
>> from apache_beam.options.pipeline_options import PipelineOptions
>> from apache_beam.options.pipeline_options import SetupOptions
>>
>> def run(argv=None):
>>     parser = argparse.ArgumentParser()
>>     known_args, pipeline_args = parser.parse_known_args(argv)
>>
>>     pipeline_options = PipelineOptions(pipeline_args)
>>     pipeline_options.view_as(SetupOptions).save_main_session = True
>>
>>     with beam.Pipeline(options=pipeline_options) as p:
>>         (p | 'Create' >> beam.Create(['Hello'])
>>            | 'Write' >> beam.io.WriteToText('/tmp'))
>>
>>
>> if __name__ == '__main__':
>>     logging.getLogger().setLevel(logging.INFO)
>>     run()
>>
>> === requirements.txt ===
>> alembic
>>
>> When I run the program:
>> $ python3 main.py --runner=PortableRunner --job_endpoint=embed \
>>     --requirements_file=requirements.txt
>>
>>
>> I get some normal output and then:
>>
>> INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'
>>  File
>> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
>> line 261, in unpack_file\n    untar_file(filename, location)\n  File
>> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
>> line 177, in untar_file\n    tar = tarfile.open(filename, mode)\n  File
>> "/usr/local/lib/python3.7/tarfile.py", line 1591, in open\n    return
>> func(name, filemode, fileobj, **kwargs)\n  File
>> "/usr/local/lib/python3.7/tarfile.py", line 1648, in gzopen\n    raise
>> ReadError("not a gzip file")\ntarfile.ReadError: not a gzip
>> file\n2020/08/08 01:17:07 Failed to install required packages: failed to
>> install requirements: exit status 2\n'
>>
>> This greatly puzzled me and, after some looking, I found something really
>> surprising. Here is the package in the *directory to be staged*:
>>
>> $ file
>> /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
>> ...: gzip compressed data, was "dist/alembic-1.4.2.tar", last modified:
>> Thu Mar 19 21:48:31 2020, max compression, original size modulo 2^32 4730880
>> $ ls -l
>> /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
>> -rw-r--r--  1 jkff  staff  1092045 Aug  7 16:56 ...
>>
>> So far so good. But here is the same file inside the Docker container (I
>> ssh'd into the dead container
>> <https://thorsten-hans.com/how-to-run-commands-in-stopped-docker-containers>):
>>
>> # file /tmp/staged/alembic-1.4.2.tar.gz
>> /tmp/staged/alembic-1.4.2.tar.gz: POSIX tar archive (GNU)
>> # ls -l /tmp/staged/alembic-1.4.2.tar.gz
>> -rwxr-xr-x 1 root root 4730880 Aug  8 01:17 /tmp/staged/alembic-1.4.2.tar.gz
>>
>> The file has clearly been gunzipped (its size, 4730880 bytes, matches the
>> original uncompressed size that `file` reported above), and now of course
>> pip can't install it! What's going on here? Am I using the direct/portable
>> runner combination wrong?
>>
>> Thanks!
>>
>> --
>> Eugene Kirpichov
>> http://www.linkedin.com/in/eugenekirpichov
>>
>

-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
