Re: Runner Bundling Strategies

2023-09-27 Thread Kenneth Knowles
On Wed, Sep 27, 2023 at 2:53 PM Robert Bradshaw via dev wrote:
> On Wed, Sep 27, 2023 at 10:58 AM Reuven Lax via dev wrote:
>> DoFns are allowed to be non deterministic, so they don't have to yield the "same" output.
> Yeah. I'm more thinking here that there's a set of outputs that

Re: Runner Bundling Strategies

2023-09-27 Thread Robert Bradshaw via dev
On Wed, Sep 27, 2023 at 10:58 AM Reuven Lax via dev wrote:
> DoFns are allowed to be non deterministic, so they don't have to yield the "same" output.
Yeah. I'm more thinking here that there's a set of outputs that are considered equivalently valid.
> The example I'm thinking of is where

Re: Runner Bundling Strategies

2023-09-27 Thread Jan Lukavský
Understood, thanks. This is fairly unintuitive from the "checkpoint barrier" viewpoint, because when such a runner fails, it simply restarts from the checkpoint as if it were a fresh start, i.e. calling Setup. It makes sense that a bundle-based runner might not do that. It seems to follow

Re: Runner Bundling Strategies

2023-09-27 Thread Reuven Lax via dev
Using Setup would cause data loss in this case. A runner can always retry a bundle, and I don't believe Setup is called again in this case. If the user initialized the hashmap in Setup, this would cause records to be completely lost whenever bundles retry. On Wed, Sep 27, 2023 at 11:20 AM Jan
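The data-loss scenario described in this message can be sketched with a toy simulation. This is not the Beam API: `DedupFn` and `run_bundle_with_one_retry` are hypothetical stand-ins, and the sketch assumes (as the message suggests, without confirming) that a retried bundle runs on the same instance without Setup being called again.

```python
class DedupFn:
    """Hypothetical stand-in for a DoFn doing best-effort deduplication."""

    def __init__(self, reset_in_start_bundle):
        # True: the seen-set is recreated in start_bundle.
        # False: the seen-set is created once, in setup only.
        self.reset_in_start_bundle = reset_in_start_bundle
        self.seen = None

    def setup(self):
        self.seen = set()

    def start_bundle(self):
        if self.reset_in_start_bundle:
            self.seen = set()

    def process(self, element):
        # Emit an element only the first time it is seen.
        if element in self.seen:
            return []
        self.seen.add(element)
        return [element]


def run_bundle_with_one_retry(fn, bundle):
    """Toy runner (assumption): the first attempt fails and its outputs
    are discarded; the retry reuses the instance without calling setup."""
    fn.setup()

    # Attempt 1: elements are processed, then the bundle fails.
    fn.start_bundle()
    for element in bundle:
        fn.process(element)  # outputs thrown away with the failed attempt

    # Attempt 2: the retry, whose outputs are committed.
    fn.start_bundle()
    committed = []
    for element in bundle:
        committed.extend(fn.process(element))
    return committed


bundle = ["a", "b", "a"]
with_start_bundle = run_bundle_with_one_retry(DedupFn(True), bundle)
with_setup_only = run_bundle_with_one_retry(DedupFn(False), bundle)
```

Under these assumptions, the StartBundle variant commits each distinct record once, while the Setup-only variant still remembers the failed attempt and drops every record on retry.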

Re: Runner Bundling Strategies

2023-09-27 Thread Jan Lukavský
What is the reason to rely on StartBundle and not Setup in this case? If the lifecycle of a bundle is not "closed" (i.e. start - finish), then it seems to be ill-defined and Setup should do? I'm trying to think of non-caching use cases of StartBundle-FinishBundle; are there such cases? I'd say

Re: Runner Bundling Strategies

2023-09-27 Thread Reuven Lax via dev
DoFns are allowed to be non deterministic, so they don't have to yield the "same" output. The example I'm thinking of is where users perform some "best-effort" deduplication by creating a hashmap in StartBundle and removing duplicates. This is usually done purely for performance to reduce shuffle

Beam High Priority Issue Report (41)

2023-09-27 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities. Unassigned P1 Issues: https://github.com/apache/beam/issues/28383 [Failing Test]: