Scott Wegner created BEAM-680:
---------------------------------

             Summary: Python Dataflow stages stale requirements-cache 
dependencies
                 Key: BEAM-680
                 URL: https://issues.apache.org/jira/browse/BEAM-680
             Project: Beam
          Issue Type: Bug
          Components: sdk-py
            Reporter: Scott Wegner
            Priority: Minor


When executing a Python pipeline with a requirements.txt file, the Dataflow 
runner stages every dependency found in its requirements cache directory: not 
only the packages specified in requirements.txt, but also any dependencies 
cached by earlier runs. This results in a bloated staging directory if 
previous pipeline runs from the same machine included different dependencies.

Repro:

# Initialize a virtualenv and pip install apache_beam
# Create an empty requirements.txt file
# Create a simple pipeline using DataflowPipelineRunner and a requirements.txt 
file, for example: 
[my_pipeline.py|https://gist.github.com/swegner/6df00df1423b48206c4ab5a7e917218a]
# {{touch /tmp/dataflow-requirements-cache/extra-file.txt}}
# Run the pipeline with a specified staging directory
# Check the staged files for the job

'extra-file.txt' will be uploaded with the job, along with any other cached 
dependencies under /tmp/dataflow-requirements-cache.
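The core of the problem can be illustrated without running a job. This is a minimal sketch (not the actual SDK code) assuming the runner simply stages every file it finds in the shared cache directory; the function name {{files_staged_from_cache}} and the simulated cache contents are hypothetical:

```python
import os
import tempfile

def files_staged_from_cache(cache_dir):
    """Mimic the buggy behavior: stage every file found in the shared
    requirements cache, regardless of what requirements.txt asked for."""
    return sorted(os.listdir(cache_dir))

# Simulate a cache polluted by a previous run on the same machine.
cache = tempfile.mkdtemp(prefix='dataflow-requirements-cache-')
open(os.path.join(cache, 'six-1.10.0.tar.gz'), 'w').close()  # current dependency
open(os.path.join(cache, 'extra-file.txt'), 'w').close()     # stale leftover

print(files_staged_from_cache(cache))  # the stale file is staged too
```

Because staging is keyed on the directory contents rather than the resolved requirements, any leftover file rides along with the job.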

We should only be staging the dependencies necessary for a pipeline, not all 
previously-cached dependencies found on the machine.
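One possible direction for a fix (a sketch only, not a proposed patch; {{build_pip_download_cmd}} and {{stage_requirements}} are hypothetical names) is to download into a fresh per-run directory via {{pip download}} instead of reusing a shared cache, so the staged set is exactly what the current requirements.txt resolves to:

```python
import subprocess
import sys
import tempfile

def build_pip_download_cmd(requirements_path, dest_dir):
    """Build a `pip download` command that fetches only the packages
    listed in requirements_path into dest_dir."""
    return [sys.executable, '-m', 'pip', 'download',
            '-r', requirements_path, '--dest', dest_dir]

def stage_requirements(requirements_path):
    """Download this run's requirements into a fresh temp directory,
    so no files from earlier runs can leak into the staged set."""
    dest_dir = tempfile.mkdtemp(prefix='dataflow-requirements-')
    subprocess.check_call(build_pip_download_cmd(requirements_path, dest_dir))
    return dest_dir
```

The trade-off is losing cross-run caching; an alternative would be to keep the shared cache but stage only the files that the current {{pip download}} invocation actually resolved.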



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
