FYI.
---------- Forwarded message ---------
The Travel Assistance Committee (TAC) are pleased to announce that
travel assistance applications for Community over Code EU 2024 are now
open!
TAC will be supporting Community over Code EU, Bratislava, Slovakia,
June 3rd - 5th, 2024.
TAC exists to
Hi all!
GSoC is a good opportunity to get new community members and a
good opportunity for you to mentor junior folks. If you have any ideas,
please record them (https://s.apache.org/gsoc2024ideas) *by 3rd Feb*.
As for some ideas, these are good opportunities for (i) some low-hanging
but low
Ah! I understand now. Both GroupByKey _and_ CombineValues are joined
together and both run locally on the worker first. I forgot that GroupByKey
is also 'lifted'. So we group by key locally and drop the extraneous Nones
there, so we don't need to communicate them back to the full
On Fri, Jan 26, 2024 at 8:43 AM Joey Tran wrote:
>
> Hmm, I think I might still be missing something. CombinePerKey is made up of
> "GBK() | CombineValues". Pulling it out into the Distinct, Distinct looks
> like:
>
> def Distinct(pcoll): # pylint: disable=invalid-name
> """Produces a
Hmm, I think I might still be missing something. CombinePerKey is made up
of "GBK() | CombineValues". Pulling it out into the Distinct, Distinct
looks like:
def Distinct(pcoll):  # pylint: disable=invalid-name
  """Produces a PCollection containing distinct elements of a PCollection."""
  return
This is because it allows us to do some of the deduplication before
shuffle via combiner lifting. E.g. say we have [A, A, A, B, B] on one
worker and [B, B, B, B, C, C] on another. Rather than passing all that
data through the GroupByKey (which involves (relatively) expensive
materialization and
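
For anyone following along, here is a rough sketch of the combiner-lifting idea in plain Python, using the two example workers above. This is only a toy model, not how Beam or any runner actually implements it; the helper names `local_precombine` and `shuffle` are made up for illustration.

```python
from collections import defaultdict

def local_precombine(elements):
    # Hypothetical stand-in for the lifted, pre-shuffle step on one worker:
    # duplicates collapse to a single (element, None) pair per distinct element.
    return {(element, None) for element in elements}

def shuffle(per_worker_pairs):
    # Hypothetical stand-in for the GroupByKey shuffle: gather all values
    # for each key across every worker.
    grouped = defaultdict(list)
    for pairs in per_worker_pairs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

workers = [
    ["A", "A", "A", "B", "B"],
    ["B", "B", "B", "B", "C", "C"],
]

# Without lifting, every element would cross the shuffle.
naive_shuffled = sum(len(w) for w in workers)

# With lifting, each worker deduplicates first and ships far fewer pairs.
lifted = [local_precombine(w) for w in workers]
lifted_shuffled = sum(len(pairs) for pairs in lifted)

# After the shuffle, the keys are the distinct elements.
distinct = sorted(shuffle(lifted).keys())
```

In this toy run the shuffle sees one pair per distinct element per worker instead of every raw element, which is the saving described above.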
Hey all,
I was poking around and looking at `Distinct` and was confused about why it
was implemented the way it was.
Reproduced here:
@ptransform_fn
@typehints.with_input_types(T)
@typehints.with_output_types(T)
def Distinct(pcoll): # pylint: disable=invalid-name
"""Produces a PCollection
Hey Gavin, thanks for looking into this. I think there's a separate thread
going for this, but we've now fully migrated all of our CI to GitHub
Actions, so we can probably just safely turn down Beam's Jenkins instance.
Valentyn opened a Jira ticket yesterday for this purpose -
This is your daily summary of Beam's current high priority issues that may need
attention.
See https://beam.apache.org/contribute/issue-priorities for the meaning and
expectations around issue priorities.
Unassigned P1 Issues:
https://github.com/apache/beam/issues/29971 [Bug]: