Filed https://issues.apache.org/jira/browse/BEAM-8900 to address the
inefficiency discussed here. Thanks everyone.
On Thu, Dec 5, 2019 at 2:53 PM Valentyn Tymofieiev wrote:
Note that so far we have not been staging wheels, since the SDK does not
have knowledge of the target platform, but there is
https://issues.apache.org/jira/browse/BEAM-4032 to add this support.
On Thu, Dec 5, 2019 at 2:35 PM Chad Dombrova wrote:
On Thu, Dec 5, 2019 at 12:36 PM Valentyn Tymofieiev wrote:
Ah nice, so then the workflow would be: download [missing] deps from PyPI
into a long-lived cache directory, then copy the same deps into a
short-lived temporary directory, using the long-lived cache directory as
the source of truth (SoT), then stage files from the short-lived temporary
directory and clean it up.
Another way to copy only the deps you care about is to use `pip download`
to do the copy. I believe you can provide the cache dir to `pip download
--find-links`, and it will read from that before reading from PyPI (you may
also need to set --wheel-dir to the cache dir as well), and thus it acts as
a cache.
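Under that suggestion, the copy_needed_deps() placeholder from the sketch
above could look like this (an assumption about how the pieces fit, not
Beam's code; --no-index keeps pip from falling back to PyPI, so the local
cache is the only source):

import subprocess

def copy_needed_deps(requirements_file, cache_dir, tmp_dir):
    # Resolve the requirements against the local cache only; pip copies the
    # matching files out of cache_dir rather than re-downloading them.
    subprocess.check_call([
        'python', '-m', 'pip', 'download', '--dest', tmp_dir,
        '-r', requirements_file, '--no-binary', ':all:',
        '--no-index', '--find-links', cache_dir])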
Looked for a bit at the pip download command. The alternative seems to be
to parse the output of
python -m pip download --dest . -r requirements.txt --exists-action i
--no-binary :all:
and see which files were downloaded and/or skipped because they were
already present, and then stage only the files that this job actually needs.
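A rough sketch of that parsing approach (inherently fragile: the 'Saved'
and 'File was already downloaded' prefixes below are pip's human-readable
output as I understand it, not a stable interface):

import subprocess

def deps_for_job(requirements_file, cache_dir):
    out = subprocess.check_output(
        ['python', '-m', 'pip', 'download', '--dest', cache_dir,
         '-r', requirements_file, '--exists-action', 'i',
         '--no-binary', ':all:'],
        text=True)
    needed = []
    for line in out.splitlines():
        line = line.strip()
        if line.startswith('Saved '):
            # Freshly downloaded into the cache.
            needed.append(line[len('Saved '):])
        elif line.startswith('File was already downloaded '):
            # Skipped because it was already present in the cache.
            needed.append(line[len('File was already downloaded '):])
    return needed  # stage only these files, not everything in cache_dir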
I think reusing the same cache directory makes sense during downloading,
but why do we upload everything that is there?
On Thu, Dec 5, 2019 at 9:24 AM Udi Meiri wrote:
> Looking at the source, it seems that it should be using
> os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
Can we filter the cache directory down to only the artifacts that we want,
rather than everything that is there?
On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev wrote:
Luke, I am not sure I understand the question. The caching that happens
here is implemented in the SDK for requirements packages:
https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
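Paraphrasing the linked code (a sketch of the behavior under discussion,
not the verbatim implementation): the SDK populates the cache for the
current requirements file and then stages everything the cache contains,
which is where unrelated packages can sneak in:

import glob
import os

def stage_cached_requirements(cache_dir, stage):
    # cache_dir was just (re)populated via pip download for this job's
    # requirements file, but it is long-lived and shared, so it may also
    # contain packages left over from unrelated jobs...
    for path in glob.glob(os.path.join(cache_dir, '*')):
        stage(path)  # ...and every one of them gets staged.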
On Wed, Dec 4, 2019 at 6:19 PM
Is there a way to use a cache on disk that is separate from the set of
packages we use as requirements?
On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri wrote:
Thanks!
Another reason to periodically refresh workers.
On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev wrote:
Test jobs specify[1] a requirements.txt file that contains two entries:
pyhamcrest, mock.
We download[2] the sources of the packages specified in the requirements
file, and of the packages they depend on. While doing so, it appears that
we use a cache directory on Jenkins to store the sources of the packages [3].
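Concretely, that amounts to something like the following (the cache path is
an assumption matching the one Udi quotes earlier in the thread; the pip
invocation is the one discussed above):

# requirements.txt
pyhamcrest
mock

python -m pip download --dest /tmp/dataflow-requirements-cache -r requirements.txt --exists-action i --no-binary :all: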
I was investigating a Dataflow postcommit test failure (endpoints_pb2
missing), and saw this in the staging directory:
$ gsutil ls
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882