Fwd: Community over Code EU 2024 Travel Assistance Applications now open!

2024-01-26 Thread Valentyn Tymofieiev via dev
FYI. -- Forwarded message - The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code EU 2024 are now open! TAC will be supporting Community over Code EU, Bratislava, Slovakia, June 3rd - 5th, 2024. TAC exists to

Fwd: [Important] GSoC 2024 Project Ideas

2024-01-26 Thread Ahmet Altay via dev
Hi all! GSoC is a good opportunity to get new community members and a good opportunity for you to mentor junior folks. If you have any ideas please record them (https://s.apache.org/gsoc2024ideas) *by 3rd Feb*. As for some ideas, these are good opportunities for (i) some low hanging but low

Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
Ah! I understand now. Both GroupByKey _and_ CombineValues are joined together and both run locally on the worker first. I forgot that GroupByKey is also 'lifted'. So we groupbykey locally and drop the extraneous None's locally so we don't need to unnecessarily communicate them back to the full

Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Robert Bradshaw via dev
On Fri, Jan 26, 2024 at 8:43 AM Joey Tran wrote: > > Hmm, I think I might still be missing something. CombinePerKey is made up of > "GBK() | CombineValues". Pulling it out into the Distinct, Distinct looks > like: > > def Distinct(pcoll): # pylint: disable=invalid-name > """Produces a

Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
Hmm, I think I might still be missing something. CombinePerKey is made up of "GBK() | CombineValues". Pulling it out into the Distinct, Distinct looks like: def Distinct(pcoll): # pylint: disable=invalid-name """Produces a PCollection containing distinct elements of a PCollection.""" return

Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Robert Bradshaw via dev
This is because it allows us to do some of the deduplication before shuffle via combiner lifting. E.g. say we have [A, A, A, B, B] on one worker and [B, B, B, B, C, C] on another. Rather than passing all that data through the GroupByKey (which involves (relatively) expensive materialization and
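
The effect described in this reply can be illustrated with a toy model in plain Python. This is only a sketch, not Beam code and not how any runner implements lifting: the helper names (`shuffle`, `distinct_without_lifting`, `distinct_with_lifting`) are hypothetical, and the number of (key, value) pairs crossing the shuffle stands in for its cost. It uses the two worker inputs from the message.

```python
from collections import defaultdict


def shuffle(worker_outputs):
    """Group (key, value) pairs from all workers by key.

    Returns the grouped values and the number of pairs that had to be
    sent, which is used here as a rough proxy for shuffle cost.
    """
    groups = defaultdict(list)
    sent = 0
    for pairs in worker_outputs:
        for key, value in pairs:
            groups[key].append(value)
            sent += 1
    return dict(groups), sent


def distinct_without_lifting(workers):
    # Every element is keyed and sent through the shuffle as-is.
    keyed = [[(x, None) for x in worker] for worker in workers]
    groups, sent = shuffle(keyed)
    return sorted(groups), sent


def distinct_with_lifting(workers):
    # Hypothetical local pre-combine step: each worker collapses its own
    # duplicates to one (key, None) pair before anything is shuffled.
    precombined = []
    for worker in workers:
        local = {}
        for x in worker:
            local[x] = None
        precombined.append(list(local.items()))
    groups, sent = shuffle(precombined)
    return sorted(groups), sent


# The two worker inputs from the message above.
workers = [["A", "A", "A", "B", "B"], ["B", "B", "B", "B", "C", "C"]]

plain = distinct_without_lifting(workers)
lifted = distinct_with_lifting(workers)
print("without lifting:", plain)
print("with lifting:   ", lifted)
```

Both variants yield the same distinct elements; in this toy model the lifted version sends 4 pairs through the shuffle instead of 11.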

[python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
Hey all, I was poking around and looking at `Distinct` and was confused about why it was implemented the way it was. Reproduced here: @ptransform_fn @typehints.with_input_types(T) @typehints.with_output_types(T) def Distinct(pcoll): # pylint: disable=invalid-name """Produces a PCollection

Re: Beam Jenkins nodes offline

2024-01-26 Thread Danny McCormick via dev
Hey Gavin, thanks for looking into this. I think there's a separate thread going for this, but we've now fully migrated all of our CI to GitHub Actions, so we can probably just safely turn down Beam's Jenkins instance. Valentyn opened a Jira ticket yesterday for this purpose -

Beam High Priority Issue Report (47)

2024-01-26 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities. Unassigned P1 Issues: https://github.com/apache/beam/issues/29971 [Bug]: