Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-06 Thread Kenneth Knowles
Additionally we need https://github.com/apache/beam/pull/28665/files in
order to run GHA tests.

On Fri, Oct 6, 2023 at 3:19 PM Kenneth Knowles  wrote:

> That PR was prior to many cherry-picks so it is not the signal we need. I
> have updated it to the tip of the release-2.51.0 branch.
>
> There were some post-commit tests involving JPMS that I believe need
> https://github.com/apache/beam/pull/28726 to pass.
>
> Kenn
>
> On Fri, Oct 6, 2023 at 2:53 PM Valentyn Tymofieiev via dev <
> dev@beam.apache.org> wrote:
>
>> > PR to run tests against release branch [12].
>>
>>  https://github.com/apache/beam/pull/28663 is closed and test signal is
>> no longer available. did all the tests pass?
>>
>> On Fri, Oct 6, 2023 at 5:32 AM Alexey Romanenko 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> —
>>> Alexey
>>>
>>> > On 5 Oct 2023, at 18:38, Jean-Baptiste Onofré  wrote:
>>> >
>>> > +1 (binding)
>>> >
>>> > Thanks !
>>> > Regards
>>> > JB
>>> >
>>> > On Tue, Oct 3, 2023 at 7:58 PM Kenneth Knowles 
>>> wrote:
>>> >>
>>> >> Hi everyone,
>>> >>
>>> >> Please review and vote on the release candidate #1 for the version
>>> 2.51.0, as follows:
>>> >>
>>> >> [ ] +1, Approve the release
>>> >> [ ] -1, Do not approve the release (please provide specific comments)
>>> >>
>>> >> Reviewers are encouraged to test their own use cases with the release
>>> candidate, and vote +1 if no issues are found. Only PMC member votes will
>>> count towards the final vote, but votes from all community members is
>>> encouraged and helpful for finding regressions; you can either test your
>>> own use cases or use cases from the validation sheet [10].
>>> >>
>>> >> The complete staging area is available for your review, which
>>> includes:
>>> >>
>>> >> GitHub Release notes [1],
>>> >> the official Apache source release to be deployed to dist.apache.org
>>> [2], which is signed with the key with fingerprint  [3],
>>> >> all artifacts to be deployed to the Maven Central Repository [4],
>>> >> source code tag "v1.2.3-RC3" [5],
>>> >> website pull request listing the release [6], the blog post [6], and
>>> publishing the API reference manual [7].
>>> >> Java artifacts were built with Gradle GRADLE_VERSION and
>>> OpenJDK/Oracle JDK JDK_VERSION.
>>> >> Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2] and PyPI[8].
>>> >> Go artifacts and documentation are available at pkg.go.dev [9]
>>> >> Validation sheet with a tab for 1.2.3 release to help with validation
>>> [10].
>>> >> Docker images published to Docker Hub [11].
>>> >> PR to run tests against release branch [12].
>>> >>
>>> >> The vote will be open for at least 72 hours. It is adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>> >>
>>> >> For guidelines on how to try the release in your projects, check out
>>> our blog post at https://beam.apache.org/blog/validate-beam-release/.
>>> >>
>>> >> Thanks,
>>> >> Kenn
>>> >>
>>> >> [1] https://github.com/apache/beam/milestone/15
>>> >> [2] https://dist.apache.org/repos/dist/dev/beam/2.51.0
>>> >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> >> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1356/
>>> >> [5] https://github.com/apache/beam/tree/v2.51.0-RC1
>>> >> [6] https://github.com/apache/beam/pull/28800
>>> >> [7] https://github.com/apache/beam-site/pull/649
>>> >> [8] https://pypi.org/project/apache-beam/2.51.0rc1/
>>> >> [9]
>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.51.0-RC1/go/pkg/beam
>>> >> [10]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=437054928
>>> >> [11] https://hub.docker.com/search?q=apache%2Fbeam=image
>>> >> [12] https://github.com/apache/beam/pull/28663
>>>
>>>


Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-06 Thread Kenneth Knowles
That PR was prior to many cherry-picks so it is not the signal we need. I
have updated it to the tip of the release-2.51.0 branch.

There were some post-commit tests involving JPMS that I believe need
https://github.com/apache/beam/pull/28726 to pass.

Kenn

On Fri, Oct 6, 2023 at 2:53 PM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> > PR to run tests against release branch [12].
>
>  https://github.com/apache/beam/pull/28663 is closed and test signal is
> no longer available. did all the tests pass?
>
> On Fri, Oct 6, 2023 at 5:32 AM Alexey Romanenko 
> wrote:
>
>> +1 (binding)
>>
>> —
>> Alexey
>>
>> > On 5 Oct 2023, at 18:38, Jean-Baptiste Onofré  wrote:
>> >
>> > +1 (binding)
>> >
>> > Thanks !
>> > Regards
>> > JB
>> >
>> > On Tue, Oct 3, 2023 at 7:58 PM Kenneth Knowles  wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> Please review and vote on the release candidate #1 for the version
>> 2.51.0, as follows:
>> >>
>> >> [ ] +1, Approve the release
>> >> [ ] -1, Do not approve the release (please provide specific comments)
>> >>
>> >> Reviewers are encouraged to test their own use cases with the release
>> candidate, and vote +1 if no issues are found. Only PMC member votes will
>> count towards the final vote, but votes from all community members is
>> encouraged and helpful for finding regressions; you can either test your
>> own use cases or use cases from the validation sheet [10].
>> >>
>> >> The complete staging area is available for your review, which includes:
>> >>
>> >> GitHub Release notes [1],
>> >> the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint  [3],
>> >> all artifacts to be deployed to the Maven Central Repository [4],
>> >> source code tag "v1.2.3-RC3" [5],
>> >> website pull request listing the release [6], the blog post [6], and
>> publishing the API reference manual [7].
>> >> Java artifacts were built with Gradle GRADLE_VERSION and
>> OpenJDK/Oracle JDK JDK_VERSION.
>> >> Python artifacts are deployed along with the source release to the
>> dist.apache.org [2] and PyPI[8].
>> >> Go artifacts and documentation are available at pkg.go.dev [9]
>> >> Validation sheet with a tab for 1.2.3 release to help with validation
>> [10].
>> >> Docker images published to Docker Hub [11].
>> >> PR to run tests against release branch [12].
>> >>
>> >> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>> >>
>> >> For guidelines on how to try the release in your projects, check out
>> our blog post at https://beam.apache.org/blog/validate-beam-release/.
>> >>
>> >> Thanks,
>> >> Kenn
>> >>
>> >> [1] https://github.com/apache/beam/milestone/15
>> >> [2] https://dist.apache.org/repos/dist/dev/beam/2.51.0
>> >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> >> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1356/
>> >> [5] https://github.com/apache/beam/tree/v2.51.0-RC1
>> >> [6] https://github.com/apache/beam/pull/28800
>> >> [7] https://github.com/apache/beam-site/pull/649
>> >> [8] https://pypi.org/project/apache-beam/2.51.0rc1/
>> >> [9]
>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.51.0-RC1/go/pkg/beam
>> >> [10]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=437054928
>> >> [11] https://hub.docker.com/search?q=apache%2Fbeam=image
>> >> [12] https://github.com/apache/beam/pull/28663
>>
>>


Re: Reshuffle PTransform Design Doc

2023-10-06 Thread Kenneth Knowles
On Fri, Oct 6, 2023 at 3:07 PM Jan Lukavský  wrote:

>
> On 10/6/23 15:11, Kenneth Knowles wrote:
>
>
>
> On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský  wrote:
>
>> Hi,
>>
>> there is also one other thing to mention with relation to
>> Reshuffle/RequiresStableinput and that is that our current implementation
>> of RequiresStableInput can break without Reshuffle in some corner cases on
>> most portable runners, at least with Java GreedyPipelineFuser, see [1]. The
>> only way to workaround this currently is inserting Reshuffle (or any other
>> fusion-break transform) directly before the stable DoFn (Reshuffle is
>> handy, because it does not change the data). I think we should either
>> somehow fix the issue [1] or include fusion break as a mandatory
>> requirement for the new Redistribute transform as well (at least with some
>> variant) or possibly add a new "hint" for non-optional fusion breaking.
>>
> This is actually the bug we have wanted to fix for years - redistribute
> has nothing to do with checkpointing or stable input and Reshuffle
> incorrectly merges the two concepts.
>
> I agree that we couldn't make any immediate change that will break a
> runner. I believe runners that depend on Reshuffle to provide stable input
> will also provide stable input after GroupByKey. Since the SDK expansion of
> Reshuffle will still contains a GBK, those runners functionality will be
> unchanged.
>
> I don't yet have a firm opinion between the these approaches:
>
> 1. Adjust the Java SDK implementation of Reshuffle (and maybe other SDKs
> if needed). With some flag so that users can use the old wrong behavior for
> update compatibility.
> 2. Add a Redistribute transform to the SDKs that has the right behavior
> and leave Reshuffle as it is.
> 1+2. Add the Redistribute transform but also make Reshuffle call it, so
> Reshuffle also gets the new behavior, with the same flag so that users can
> use the old wrong behavior for update compatibility.
>
> All of these will leave "Reshuffle for RequestStableInput" alone for now.
> The options that include (2) will move us a little closer to migrating to a
> "better" future state.
>
> I might have not expressed the right way. I understand that Reshuffle
> having "stable input" functionality is non-portable side-effect. It would
> be nice to get rid of it and my impression from this thread was that we
> would try to deprecate Reshuffle and introduce Redistribute which will not
> have such semantics. All of this is fine, problem is that we currently (is
> some corner cases) rely on Reshuffle *even though* Pipeline uses
> @RequiresStableInput. That is due to the fact that Reshuffle also ensures
> fusion breaking.  Fusing non-deterministic DoFn with stable DoFn breaks the
> stable input property, because runners can ensure stability only at the
> input of executable stage. Therefore we would either need to:
>
>  a) define Redistribute as being an unconditional fusion break boundary, or
>
>  b) define some other transform or hint to be able to enforce fusion
> breaking
>
> Otherwise I'd be in favor of 2 and deprecation of Reshuffle.
>

Just to be very clear - my goal right now is to just give Reshuffle a
consistent semantics. Even for the old "stable input + redistribute" use of
Reshuffle, the semantics are inconsistent/undefined and the Java SDK
expansion is wrong. Changing things having to do with stable input are not
part of what I am trying to change right now. But it is fine to do things
that prepare for that.

Kenn


>  Jan
>
>
> Any votes? Any other options?
>
> Kenn
>
>  Jan
>>
>> [1] https://github.com/apache/beam/issues/24655
>> On 10/5/23 21:01, Robert Burke wrote:
>>
>> Reshuffle/redistribute being a transform has the benefit of allowing
>> existing runners that aren't updated to be aware of the new urns to rely on
>> an SDK side implementation, which may be more expensive than what the
>> runner is able to do with that awareness.
>>
>> Aka: it gives purpose to the fallback implementations.
>>
>> On Thu, Oct 5, 2023, 9:03 AM Kenneth Knowles  wrote:
>>
>>> Another perspective, ignoring runners custom implementations and
>>> non-Java SDKs could be that the semantics are perfectly well defined: it is
>>> a composite and its semantics are defined by its implementation in terms of
>>> primitives. It is just that this expansion is not what we want so we should
>>> not use it (and also we shouldn't use "whatever the implementation does" as
>>> a spec for anything we care about).
>>>
>>> On Thu, Oct 5, 2023 at 11:56 AM Kenneth Knowles  wrote:
>>>
 I totally agree. I am motivated right now by the fact that it is
 already used all over the place but with no consistent semantics. Maybe it
 is simpler to focus on just making the minimal change, which would
 basically be to update the expansion of the Reshuffle in the Java SDK.

 Kenn

 On Thu, Oct 5, 2023 at 11:39 AM John Casey 
 wrote:

> Given that this is a hint, I'm not sure 

Re: Reshuffle PTransform Design Doc

2023-10-06 Thread Jan Lukavský


On 10/6/23 15:11, Kenneth Knowles wrote:



On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský  wrote:

Hi,

there is also one other thing to mention with relation to
Reshuffle/RequiresStableinput and that is that our current
implementation of RequiresStableInput can break without Reshuffle
in some corner cases on most portable runners, at least with Java
GreedyPipelineFuser, see [1]. The only way to workaround this
currently is inserting Reshuffle (or any other fusion-break
transform) directly before the stable DoFn (Reshuffle is handy,
because it does not change the data). I think we should either
somehow fix the issue [1] or include fusion break as a mandatory
requirement for the new Redistribute transform as well (at least
with some variant) or possibly add a new "hint" for non-optional
fusion breaking.

This is actually the bug we have wanted to fix for years - 
redistribute has nothing to do with checkpointing or stable input and 
Reshuffle incorrectly merges the two concepts.


I agree that we couldn't make any immediate change that will break a 
runner. I believe runners that depend on Reshuffle to provide stable 
input will also provide stable input after GroupByKey. Since the SDK 
expansion of Reshuffle will still contains a GBK, those runners 
functionality will be unchanged.


I don't yet have a firm opinion between the these approaches:

1. Adjust the Java SDK implementation of Reshuffle (and maybe other 
SDKs if needed). With some flag so that users can use the old wrong 
behavior for update compatibility.
2. Add a Redistribute transform to the SDKs that has the right 
behavior and leave Reshuffle as it is.
1+2. Add the Redistribute transform but also make Reshuffle call it, 
so Reshuffle also gets the new behavior, with the same flag so that 
users can use the old wrong behavior for update compatibility.


All of these will leave "Reshuffle for RequestStableInput" alone for 
now. The options that include (2) will move us a little closer to 
migrating to a "better" future state.


I might have not expressed the right way. I understand that Reshuffle 
having "stable input" functionality is non-portable side-effect. It 
would be nice to get rid of it and my impression from this thread was 
that we would try to deprecate Reshuffle and introduce Redistribute 
which will not have such semantics. All of this is fine, problem is that 
we currently (is some corner cases) rely on Reshuffle *even though* 
Pipeline uses @RequiresStableInput. That is due to the fact that 
Reshuffle also ensures fusion breaking.  Fusing non-deterministic DoFn 
with stable DoFn breaks the stable input property, because runners can 
ensure stability only at the input of executable stage. Therefore we 
would either need to:


 a) define Redistribute as being an unconditional fusion break boundary, or

 b) define some other transform or hint to be able to enforce fusion 
breaking


Otherwise I'd be in favor of 2 and deprecation of Reshuffle.

 Jan



Any votes? Any other options?

Kenn

 Jan

[1] https://github.com/apache/beam/issues/24655

On 10/5/23 21:01, Robert Burke wrote:

Reshuffle/redistribute being a transform has the benefit of
allowing existing runners that aren't updated to be aware of the
new urns to rely on an SDK side implementation, which may be more
expensive than what the runner is able to do with that awareness.

Aka: it gives purpose to the fallback implementations.

On Thu, Oct 5, 2023, 9:03 AM Kenneth Knowles  wrote:

Another perspective, ignoring runners custom implementations
and non-Java SDKs could be that the semantics are perfectly
well defined: it is a composite and its semantics are defined
by its implementation in terms of primitives. It is just that
this expansion is not what we want so we should not use it
(and also we shouldn't use "whatever the implementation does"
as a spec for anything we care about).

On Thu, Oct 5, 2023 at 11:56 AM Kenneth Knowles
 wrote:

I totally agree. I am motivated right now by the fact
that it is already used all over the place but with no
consistent semantics. Maybe it is simpler to focus on
just making the minimal change, which would basically be
to update the expansion of the Reshuffle in the Java SDK.

Kenn

On Thu, Oct 5, 2023 at 11:39 AM John Casey
 wrote:

Given that this is a hint, I'm not sure redistribute
should be a PTransform as opposed to some other way
to hint to a runner.

I'm not sure of what the syntax of that would be, but
a semantic no-op transform that the runner may or may
not do anything with is odd.

On Thu, Oct 5, 2023 at 11:30 AM Kenneth Knowles
 wrote:

  

Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-06 Thread Valentyn Tymofieiev via dev
> PR to run tests against release branch [12].

 https://github.com/apache/beam/pull/28663 is closed and test signal is no
longer available. did all the tests pass?

On Fri, Oct 6, 2023 at 5:32 AM Alexey Romanenko 
wrote:

> +1 (binding)
>
> —
> Alexey
>
> > On 5 Oct 2023, at 18:38, Jean-Baptiste Onofré  wrote:
> >
> > +1 (binding)
> >
> > Thanks !
> > Regards
> > JB
> >
> > On Tue, Oct 3, 2023 at 7:58 PM Kenneth Knowles  wrote:
> >>
> >> Hi everyone,
> >>
> >> Please review and vote on the release candidate #1 for the version
> 2.51.0, as follows:
> >>
> >> [ ] +1, Approve the release
> >> [ ] -1, Do not approve the release (please provide specific comments)
> >>
> >> Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1 if no issues are found. Only PMC member votes will
> count towards the final vote, but votes from all community members is
> encouraged and helpful for finding regressions; you can either test your
> own use cases or use cases from the validation sheet [10].
> >>
> >> The complete staging area is available for your review, which includes:
> >>
> >> GitHub Release notes [1],
> >> the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint  [3],
> >> all artifacts to be deployed to the Maven Central Repository [4],
> >> source code tag "v1.2.3-RC3" [5],
> >> website pull request listing the release [6], the blog post [6], and
> publishing the API reference manual [7].
> >> Java artifacts were built with Gradle GRADLE_VERSION and OpenJDK/Oracle
> JDK JDK_VERSION.
> >> Python artifacts are deployed along with the source release to the
> dist.apache.org [2] and PyPI[8].
> >> Go artifacts and documentation are available at pkg.go.dev [9]
> >> Validation sheet with a tab for 1.2.3 release to help with validation
> [10].
> >> Docker images published to Docker Hub [11].
> >> PR to run tests against release branch [12].
> >>
> >> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
> >>
> >> For guidelines on how to try the release in your projects, check out
> our blog post at https://beam.apache.org/blog/validate-beam-release/.
> >>
> >> Thanks,
> >> Kenn
> >>
> >> [1] https://github.com/apache/beam/milestone/15
> >> [2] https://dist.apache.org/repos/dist/dev/beam/2.51.0
> >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >> [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1356/
> >> [5] https://github.com/apache/beam/tree/v2.51.0-RC1
> >> [6] https://github.com/apache/beam/pull/28800
> >> [7] https://github.com/apache/beam-site/pull/649
> >> [8] https://pypi.org/project/apache-beam/2.51.0rc1/
> >> [9]
> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.51.0-RC1/go/pkg/beam
> >> [10]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=437054928
> >> [11] https://hub.docker.com/search?q=apache%2Fbeam=image
> >> [12] https://github.com/apache/beam/pull/28663
>
>


Re: Reshuffle PTransform Design Doc

2023-10-06 Thread Kenneth Knowles
On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský  wrote:

> Hi,
>
> there is also one other thing to mention with relation to
> Reshuffle/RequiresStableinput and that is that our current implementation
> of RequiresStableInput can break without Reshuffle in some corner cases on
> most portable runners, at least with Java GreedyPipelineFuser, see [1]. The
> only way to workaround this currently is inserting Reshuffle (or any other
> fusion-break transform) directly before the stable DoFn (Reshuffle is
> handy, because it does not change the data). I think we should either
> somehow fix the issue [1] or include fusion break as a mandatory
> requirement for the new Redistribute transform as well (at least with some
> variant) or possibly add a new "hint" for non-optional fusion breaking.
>
> This is actually the bug we have wanted to fix for years - redistribute
has nothing to do with checkpointing or stable input and Reshuffle
incorrectly merges the two concepts.

I agree that we couldn't make any immediate change that will break a
runner. I believe runners that depend on Reshuffle to provide stable input
will also provide stable input after GroupByKey. Since the SDK expansion of
Reshuffle will still contains a GBK, those runners functionality will be
unchanged.

I don't yet have a firm opinion between the these approaches:

1. Adjust the Java SDK implementation of Reshuffle (and maybe other SDKs if
needed). With some flag so that users can use the old wrong behavior for
update compatibility.
2. Add a Redistribute transform to the SDKs that has the right behavior and
leave Reshuffle as it is.
1+2. Add the Redistribute transform but also make Reshuffle call it, so
Reshuffle also gets the new behavior, with the same flag so that users can
use the old wrong behavior for update compatibility.

All of these will leave "Reshuffle for RequestStableInput" alone for now.
The options that include (2) will move us a little closer to migrating to a
"better" future state.

Any votes? Any other options?

Kenn

 Jan
>
> [1] https://github.com/apache/beam/issues/24655
> On 10/5/23 21:01, Robert Burke wrote:
>
> Reshuffle/redistribute being a transform has the benefit of allowing
> existing runners that aren't updated to be aware of the new urns to rely on
> an SDK side implementation, which may be more expensive than what the
> runner is able to do with that awareness.
>
> Aka: it gives purpose to the fallback implementations.
>
> On Thu, Oct 5, 2023, 9:03 AM Kenneth Knowles  wrote:
>
>> Another perspective, ignoring runners custom implementations and non-Java
>> SDKs could be that the semantics are perfectly well defined: it is a
>> composite and its semantics are defined by its implementation in terms of
>> primitives. It is just that this expansion is not what we want so we should
>> not use it (and also we shouldn't use "whatever the implementation does" as
>> a spec for anything we care about).
>>
>> On Thu, Oct 5, 2023 at 11:56 AM Kenneth Knowles  wrote:
>>
>>> I totally agree. I am motivated right now by the fact that it is already
>>> used all over the place but with no consistent semantics. Maybe it is
>>> simpler to focus on just making the minimal change, which would basically
>>> be to update the expansion of the Reshuffle in the Java SDK.
>>>
>>> Kenn
>>>
>>> On Thu, Oct 5, 2023 at 11:39 AM John Casey 
>>> wrote:
>>>
 Given that this is a hint, I'm not sure redistribute should be a
 PTransform as opposed to some other way to hint to a runner.

 I'm not sure of what the syntax of that would be, but a semantic no-op
 transform that the runner may or may not do anything with is odd.

 On Thu, Oct 5, 2023 at 11:30 AM Kenneth Knowles 
 wrote:

> So a high level suggestion from Robert that I want to highlight as a
> top-post:
>
> Instead of focusing on just fixing the SDKs and runners Reshuffle,
> this could be an opportunity to introduce Redistribute which was proposed
> in the long-ago thread. The semantics are identical but it is more clear
> that it *only* is a hint about redistributing data and there is no
> expectation of a checkpoint.
>
> This new name may also be an opportunity to maintain update
> compatibility (though this may actually be leaving unsafe code in user's
> hands) and/or separate @RequiresStableInput/checkpointing uses of 
> Reshuffle
> from redistribution-only uses of Reshuffle.
>
> Any other thoughts on this one high level bit?
>
> Kenn
>
> On Thu, Oct 5, 2023 at 11:15 AM Kenneth Knowles 
> wrote:
>
>>
>> On Wed, Oct 4, 2023 at 7:45 PM Robert Burke 
>> wrote:
>>
>>> LGTM.
>>>
>>> It looks the Go SDK already adheres to these semantics as well for
>>> the reference impl(well, reshuffle/redistribute_randomly, _by_key isn't
>>> implemented in the Go SDK, and only uses the existing unqualified 
>>> reshuffle
>>> URN [0].
>>>
>>> 

Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-06 Thread Alexey Romanenko
+1 (binding)

—
Alexey

> On 5 Oct 2023, at 18:38, Jean-Baptiste Onofré  wrote:
> 
> +1 (binding)
> 
> Thanks !
> Regards
> JB
> 
> On Tue, Oct 3, 2023 at 7:58 PM Kenneth Knowles  wrote:
>> 
>> Hi everyone,
>> 
>> Please review and vote on the release candidate #1 for the version 2.51.0, 
>> as follows:
>> 
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>> 
>> Reviewers are encouraged to test their own use cases with the release 
>> candidate, and vote +1 if no issues are found. Only PMC member votes will 
>> count towards the final vote, but votes from all community members is 
>> encouraged and helpful for finding regressions; you can either test your own 
>> use cases or use cases from the validation sheet [10].
>> 
>> The complete staging area is available for your review, which includes:
>> 
>> GitHub Release notes [1],
>> the official Apache source release to be deployed to dist.apache.org [2], 
>> which is signed with the key with fingerprint  [3],
>> all artifacts to be deployed to the Maven Central Repository [4],
>> source code tag "v1.2.3-RC3" [5],
>> website pull request listing the release [6], the blog post [6], and 
>> publishing the API reference manual [7].
>> Java artifacts were built with Gradle GRADLE_VERSION and OpenJDK/Oracle JDK 
>> JDK_VERSION.
>> Python artifacts are deployed along with the source release to the 
>> dist.apache.org [2] and PyPI[8].
>> Go artifacts and documentation are available at pkg.go.dev [9]
>> Validation sheet with a tab for 1.2.3 release to help with validation [10].
>> Docker images published to Docker Hub [11].
>> PR to run tests against release branch [12].
>> 
>> The vote will be open for at least 72 hours. It is adopted by majority 
>> approval, with at least 3 PMC affirmative votes.
>> 
>> For guidelines on how to try the release in your projects, check out our 
>> blog post at https://beam.apache.org/blog/validate-beam-release/.
>> 
>> Thanks,
>> Kenn
>> 
>> [1] https://github.com/apache/beam/milestone/15
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.51.0
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1356/
>> [5] https://github.com/apache/beam/tree/v2.51.0-RC1
>> [6] https://github.com/apache/beam/pull/28800
>> [7] https://github.com/apache/beam-site/pull/649
>> [8] https://pypi.org/project/apache-beam/2.51.0rc1/
>> [9] https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.51.0-RC1/go/pkg/beam
>> [10] 
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=437054928
>> [11] https://hub.docker.com/search?q=apache%2Fbeam=image
>> [12] https://github.com/apache/beam/pull/28663



Beam High Priority Issue Report (42)

2023-10-06 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/28811 [Failing Test]: Many BigQuery 
direct read Python PostCommit failing possibly due to fastavro regression
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader 
provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28703 [Failing Test]: Building a wheel 
for integration tests sometimes times out
https://github.com/apache/beam/issues/28383 [Failing Test]: 
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing 
"beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: 
apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be 
leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not 
working when using CreateDisposition.CREATE_IF_NEEDED 
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. 
PeriodicImpulse) running in Flink and polling using tracker.defer_remainder 
have checkpoint size growing indefinitely 
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use 
applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with 
inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: 
bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when 
using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26981 [Bug]: Getting an error related to 
SchemaCoder after upgrading to 2.48
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/25975 [Bug]: Reducing parallelism in 
FlinkRunner leads to a data loss
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey

Re: Reshuffle PTransform Design Doc

2023-10-06 Thread Jan Lukavský

Hi,

there is also one other thing to mention with relation to 
Reshuffle/RequiresStableinput and that is that our current 
implementation of RequiresStableInput can break without Reshuffle in 
some corner cases on most portable runners, at least with Java 
GreedyPipelineFuser, see [1]. The only way to workaround this currently 
is inserting Reshuffle (or any other fusion-break transform) directly 
before the stable DoFn (Reshuffle is handy, because it does not change 
the data). I think we should either somehow fix the issue [1] or include 
fusion break as a mandatory requirement for the new Redistribute 
transform as well (at least with some variant) or possibly add a new 
"hint" for non-optional fusion breaking.


 Jan

[1] https://github.com/apache/beam/issues/24655

On 10/5/23 21:01, Robert Burke wrote:
Reshuffle/redistribute being a transform has the benefit of allowing 
existing runners that aren't updated to be aware of the new urns to 
rely on an SDK side implementation, which may be more expensive than 
what the runner is able to do with that awareness.


Aka: it gives purpose to the fallback implementations.

On Thu, Oct 5, 2023, 9:03 AM Kenneth Knowles  wrote:

Another perspective, ignoring runners custom implementations and
non-Java SDKs could be that the semantics are perfectly well
defined: it is a composite and its semantics are defined by its
implementation in terms of primitives. It is just that this
expansion is not what we want so we should not use it (and also we
shouldn't use "whatever the implementation does" as a spec for
anything we care about).

On Thu, Oct 5, 2023 at 11:56 AM Kenneth Knowles 
wrote:

I totally agree. I am motivated right now by the fact that it
is already used all over the place but with no consistent
semantics. Maybe it is simpler to focus on just making the
minimal change, which would basically be to update the
expansion of the Reshuffle in the Java SDK.

Kenn

On Thu, Oct 5, 2023 at 11:39 AM John Casey
 wrote:

Given that this is a hint, I'm not sure redistribute
should be a PTransform as opposed to some other way to
hint to a runner.

I'm not sure of what the syntax of that would be, but a
semantic no-op transform that the runner may or may not do
anything with is odd.

On Thu, Oct 5, 2023 at 11:30 AM Kenneth Knowles
 wrote:

So a high level suggestion from Robert that I want to
highlight as a top-post:

Instead of focusing on just fixing the SDKs and
runners Reshuffle, this could be an opportunity to
introduce Redistribute which was proposed in the
long-ago thread. The semantics are identical but it is
more clear that it /only/ is a hint about
redistributing data and there is no expectation of a
checkpoint.

This new name may also be an opportunity to maintain
update compatibility (though this may actually be
leaving unsafe code in user's hands) and/or
separate @RequiresStableInput/checkpointing uses of
Reshuffle from redistribution-only uses of Reshuffle.

Any other thoughts on this one high level bit?

Kenn

On Thu, Oct 5, 2023 at 11:15 AM Kenneth Knowles
 wrote:


On Wed, Oct 4, 2023 at 7:45 PM Robert Burke
 wrote:

LGTM.

It looks the Go SDK already adheres to these
semantics as well for the reference impl(well,
reshuffle/redistribute_randomly, _by_key isn't
implemented in the Go SDK, and only uses the
existing unqualified reshuffle URN [0].

The original strategy, and then for every
element, the original Window, TS, and Pane are
all serialized, shuffled, and then
deserialized downstream.


https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L65


https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L145

Prism at the moment vaccuously implements
reshuffle by omitting the node, and rewriting
the inputs and outputs [1], as it's a local
runner with single transform per bundle
execution, but I was intending to make it a
fusion break regardless.  Ultimately prism's
"test" variant will default to executing the