On Wed, Sep 27, 2023 at 2:53 PM Robert Bradshaw via dev
wrote:
> On Wed, Sep 27, 2023 at 10:58 AM Reuven Lax via dev
> wrote:
>
>> DoFns are allowed to be non-deterministic, so they don't have to yield
>> the "same" output.
>>
>
> Yeah. I'm more thinking here that there's a set of outputs that
On Wed, Sep 27, 2023 at 10:58 AM Reuven Lax via dev
wrote:
> DoFns are allowed to be non-deterministic, so they don't have to yield the
> "same" output.
>
Yeah. I'm more thinking here that there's a set of outputs that are
considered equivalently valid.
> The example I'm thinking of is where
Understood, thanks. This is fairly unintuitive from the "checkpoint
barrier" viewpoint, because when such a runner fails, it simply restarts
from the checkpoint as if it were a fresh start, i.e. it calls Setup again.
It makes sense that a bundle-based runner might not do that.
It seems to follow
Using Setup would cause data loss in this case. A runner can always retry a
bundle, and I don't believe Setup is called again in this case. If the user
initialized the hashmap in Setup, this would cause records to be completely
lost whenever bundles retry.
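
To make the retry hazard concrete, here is a minimal plain-Python sketch. It does not import Beam; `DedupFn` only borrows the DoFn lifecycle method names, and `run_with_one_retry` is a hypothetical stand-in for a runner. It assumes, as stated above, that a retried bundle runs on the same instance and that Setup is not called again. A set stands in for the hashmap.

```python
class DedupFn:
    """Plain-Python stand-in for a DoFn (assumption: not the Beam class)."""

    def __init__(self, init_in_setup):
        self.init_in_setup = init_in_setup
        self.seen = None

    def setup(self):
        # Called once per instance in this sketch.
        if self.init_in_setup:
            self.seen = set()

    def start_bundle(self):
        # Called at the start of every bundle attempt, including retries.
        if not self.init_in_setup:
            self.seen = set()

    def process(self, element):
        # Emit an element only the first time it is seen.
        if element in self.seen:
            return []
        self.seen.add(element)
        return [element]


def run_with_one_retry(fn, bundle, fail_after):
    """Hypothetical runner: the first attempt fails after `fail_after`
    elements and its output is discarded, then the whole bundle is
    retried on the same instance without calling setup() again."""
    fn.setup()

    # First attempt: partial progress, output thrown away.
    fn.start_bundle()
    for element in bundle[:fail_after]:
        fn.process(element)

    # Retry: only this attempt's output is kept.
    fn.start_bundle()
    output = []
    for element in bundle:
        output.extend(fn.process(element))
    return output


bundle = ["a", "b", "a", "c"]
lossy = run_with_one_retry(DedupFn(init_in_setup=True), bundle, fail_after=2)
safe = run_with_one_retry(DedupFn(init_in_setup=False), bundle, fail_after=2)
print(lossy)  # ['c']
print(safe)   # ['a', 'b', 'c']
```

With the set created in `setup`, the retry still sees "a" and "b" from the failed attempt and drops them, so they never reach the output. Resetting the set in `start_bundle` avoids this.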
On Wed, Sep 27, 2023 at 11:20 AM Jan
What is the reason to rely on StartBundle and not Setup in this case? If
the life cycle of a bundle is not "closed" (i.e. start - finish), then it
seems to be ill-defined and Setup should do?
I'm trying to think of non-caching use-cases of
StartBundle-FinishBundle, are there such cases? I'd say
DoFns are allowed to be non-deterministic, so they don't have to yield the
"same" output.
The example I'm thinking of is where users perform some "best-effort"
deduplication by creating a hashmap in StartBundle and removing duplicates.
This is usually done purely for performance to reduce shuffle
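
As a rough illustration of the "best-effort" pattern described here, the following plain-Python sketch resets a per-bundle set in `start_bundle` and drops repeats within that bundle. It does not use the Beam API; `BestEffortDedup` and `run_bundles` are hypothetical names, and a set stands in for the hashmap.

```python
class BestEffortDedup:
    """Plain-Python stand-in for a DoFn (assumption: not the Beam class)."""

    def start_bundle(self):
        # Fresh state for every bundle, so nothing leaks across bundles.
        self.seen = set()

    def process(self, element):
        # Drop elements already seen in the current bundle.
        if element in self.seen:
            return []
        self.seen.add(element)
        return [element]


def run_bundles(fn, bundles):
    """Hypothetical driver: processes each bundle in turn."""
    output = []
    for bundle in bundles:
        fn.start_bundle()
        for element in bundle:
            output.extend(fn.process(element))
    return output


result = run_bundles(BestEffortDedup(), [["a", "a", "b"], ["b", "c", "c"]])
print(result)  # ['a', 'b', 'b', 'c']
```

Duplicates inside one bundle are removed, but "b" appears in both bundles and survives twice, which is why this is only best-effort: how elements are split into bundles changes the output, and every such output is still acceptable.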
This is your daily summary of Beam's current high priority issues that may need
attention.
See https://beam.apache.org/contribute/issue-priorities for the meaning and
expectations around issue priorities.
Unassigned P1 Issues:
https://github.com/apache/beam/issues/28383 [Failing Test]: