Re: [Discuss] Idea to increase RC voting participation

2023-10-19 Thread Robert Bradshaw via dev
On Thu, Oct 19, 2023 at 12:18 PM Kenneth Knowles wrote: > +1 to more helpful guide on "how to usefully participate in RC validation" > but also big +1 to Robert, Jack, Johanna. > > TL;DR the RC validation is an opportunity for downstream testing. > > Robert alluded to the origin of the

Re: [YAML] Aggregations

2023-10-19 Thread Robert Bradshaw via dev
On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax wrote: > > Is the schema Group transform (in Java) something along these lines? Yes, for sure it is. It (and Python's and Typescript's equivalent) are linked in the original post. The open question is how to best express this in YAML. > On Wed, Oct

Re: [YAML] Aggregations

2023-10-19 Thread Reuven Lax via dev
Is the schema Group transform (in Java) something along these lines? On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev wrote: > Beam Yaml has good support for IOs and mappings, but one key missing > feature for even writing a WordCount is the ability to do Aggregations > [1]. While the

Re: [YAML] Aggregations

2023-10-19 Thread Reuven Lax via dev
Or are you specifically referring to the declarative YAML pipelines? On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax wrote: > Is the schema Group transform (in Java) something along these lines? > > On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev < > dev@beam.apache.org> wrote: > >> Beam

Re: [PYTHON] partitioner utilities?

2023-10-19 Thread Joey Tran
Yeah, I already implemented these partitioners for my use case (I just pasted the classnames/docstrings for them) and I used both combiners.Top and combiners.Sample. In fact, before writing these partitioners I had misunderstood those combiners and thought they would partition my pcollections.

Re: [PYTHON] partitioner utilities?

2023-10-19 Thread Anand Inguva via dev
FYI, there is a Top transform[1] that will fetch the greatest n elements in the Python SDK. It is not a partitioner, but it may be useful for your reference. [1] https://github.com/apache/beam/blob/68e9c997a9085b0cb045238ae406d534011e7c21/sdks/python/apache_beam/transforms/combiners.py#L191 On Thu,

Re: [PYTHON] partitioner utilities?

2023-10-19 Thread Joey Tran
Yes, both need to be small enough to fit into state. Yeah a percentage sampler would also be great, we have a bunch of use cases for that ourselves. Not sure if it'd be too clever, but I was imagining three public sampling partitioners: FixedSample, PercentageSample, and Sample. Sample could

Re: [NOTICE] Deprecation Avro classes in "core" and use "extensions/avro" instead for Java SDK

2023-10-19 Thread Kenneth Knowles
W On Wed, Oct 18, 2023 at 4:19 PM Byron Ellis via dev wrote: > Awesome! > > On Wed, Oct 18, 2023 at 1:14 PM Alexey Romanenko > wrote: > >> Heads up! >> >> Finally, all Avro-related code and Avro dependency, that was deprecated >> before (see a message above), has been removed from Beam

Re: [Discuss] Idea to increase RC voting participation

2023-10-19 Thread Kenneth Knowles
+1 to more helpful guide on "how to usefully participate in RC validation" but also big +1 to Robert, Jack, Johanna. TL;DR the RC validation is an opportunity for downstream testing. Robert alluded to the origin of the spreadsheet: I created it long ago to validate that the human language on our

Re: [YAML] Aggregations

2023-10-19 Thread Robert Bradshaw via dev
On Thu, Oct 19, 2023 at 11:12 AM Kenneth Knowles wrote: > > Using SQL expressions in strings is maybe OK given we are all > relational all the time. Either way you have to define what the > universe of `fn` is. Here's a compact possibility: > > type: Combine > config: > group_by: [field1,

Re: [PYTHON] partitioner utilities?

2023-10-19 Thread Danny McCormick via dev
I'm interested in adding something like this; I could see these being generally useful for a number of cases (one that immediately comes to mind is partitioning datasets into train/test/validation sets and writing each to a different place). I'm assuming Top (or FixedSample) needs to be small enough
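
To make the train/test/validation use case above concrete, here is a minimal sketch of a percentage-based partition function in plain Python. It is not code from the thread: the name `percentage_partition`, the hashing scheme, and the 80/10/10 weights are assumptions chosen for illustration. In a pipeline, a function of this shape could plausibly be adapted to `beam.Partition`, which expects a callable that returns a partition index.

```python
import hashlib


def percentage_partition(element, weights):
    """Return a bucket index for `element`, given fractional `weights`.

    Hypothetical helper: the element's repr is hashed so that the same
    element always lands in the same bucket, independent of ordering.
    """
    digest = hashlib.sha256(repr(element).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a point in [0, 1].
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if point < cumulative:
            return index
    # Guard against floating-point rounding in the cumulative sum.
    return len(weights) - 1


# Assumed split: roughly 80% train, 10% test, 10% validation.
weights = [0.8, 0.1, 0.1]
splits = [[] for _ in weights]
for item in range(1000):
    splits[percentage_partition(item, weights)].append(item)
```

Hashing rather than drawing a random number keeps the assignment deterministic, so a retried bundle would place an element in the same split. The resulting proportions are only approximate for small inputs.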

Re: [YAML] Aggregations

2023-10-19 Thread Robert Bradshaw via dev
On Thu, Oct 19, 2023 at 11:42 AM Jan Lukavský wrote: > > On 10/19/23 19:41, Robert Bradshaw via dev wrote: > > On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský wrote: > >> On 10/19/23 18:28, Robert Bradshaw via dev wrote: > >>> On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis wrote: > Rill is

Re: [DISCUSS] Drop Euphoria extension

2023-10-19 Thread Kenneth Knowles
Makes sense to me. Let's deprecate for the 2.52.0 release unless there is some objection. You can also look at the maven central downloads (I believe all PMC and maybe all committers can view this) compared to other Beam jars. Kenn On Mon, Oct 16, 2023 at 9:28 AM Jan Lukavský wrote: > Sure,

Re: [YAML] Aggregations

2023-10-19 Thread Jan Lukavský
On 10/19/23 19:41, Robert Bradshaw via dev wrote: On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský wrote: On 10/19/23 18:28, Robert Bradshaw via dev wrote: On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis wrote: Rill is definitely SQL-oriented but I think that's going to be the most common.

Re: Reshuffle PTransform Design Doc

2023-10-19 Thread Kenneth Knowles
Well I accidentally conflated "stateful" and "persisting", but anyhow yea we aren't targeting to have one Beam primitive for each thing that is probably a runner primitive. On Thu, Oct 19, 2023 at 2:25 PM Kenneth Knowles wrote: > > On Fri, Oct 13, 2023 at 12:51 PM Jan Lukavský wrote: > > > >

Re: Reshuffle PTransform Design Doc

2023-10-19 Thread Kenneth Knowles
On Fri, Oct 13, 2023 at 12:51 PM Jan Lukavský wrote: > > Hi, > > I think there's been already said nearly everything in this thread, but ... > it is time for Friday discussions. :) > > Today I recalled of a discussion we've had long time ago, when we were > designing Euphoria (btw, deprecating

Re: [YAML] Aggregations

2023-10-19 Thread Kenneth Knowles
Using SQL expressions in strings is maybe OK given we are all relational all the time. Either way you have to define what the universe of `fn` is. Here's a compact possibility: type: Combine config: group_by: [field1, field2] aggregates: max_cost: "MAX(cost)" total_cost: "SUM(cost)"
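
As a rough illustration of what the compact `Combine` config above would mean, the sketch below evaluates it over a list of dict rows in plain Python. This is not Beam YAML's implementation: `apply_combine`, the `AGG_FNS` table, and the sample rows are hypothetical, and only simple `FN(field)` expressions are handled, which is one possible answer to the "universe of `fn`" question raised in the message.

```python
import re
from collections import defaultdict

# Hypothetical universe of aggregate functions, keyed by SQL-style name.
AGG_FNS = {"MAX": max, "MIN": min, "SUM": sum, "COUNT": len}


def apply_combine(rows, config):
    """Group dict rows by config['group_by'] and evaluate each aggregate."""
    parsed = {}
    for out_name, expr in config["aggregates"].items():
        # Accept only expressions of the form FN(field).
        match = re.fullmatch(r"\s*(\w+)\((\w+)\)\s*", expr)
        if match is None or match.group(1).upper() not in AGG_FNS:
            raise ValueError(f"unsupported aggregate: {expr!r}")
        parsed[out_name] = (AGG_FNS[match.group(1).upper()], match.group(2))

    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in config["group_by"])
        groups[key].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(config["group_by"], key))
        for out_name, (fn, field) in parsed.items():
            out[out_name] = fn([r[field] for r in members])
        result.append(out)
    return result


# The config from the message, written as a Python dict.
config = {
    "group_by": ["field1", "field2"],
    "aggregates": {"max_cost": "MAX(cost)", "total_cost": "SUM(cost)"},
}
rows = [
    {"field1": "a", "field2": "x", "cost": 3},
    {"field1": "a", "field2": "x", "cost": 5},
    {"field1": "b", "field2": "y", "cost": 2},
]
combined = apply_combine(rows, config)
```

Restricting expressions to a fixed table of names keeps the config declarative, at the cost of not supporting arbitrary SQL inside the strings.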

Re: [YAML] Aggregations

2023-10-19 Thread Robert Bradshaw via dev
On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský wrote: > > On 10/19/23 18:28, Robert Bradshaw via dev wrote: > > On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis wrote: > >> Rill is definitely SQL-oriented but I think that's going to be the most > >> common. Dataframes are explicitly modeled on the

Re: [YAML] Aggregations

2023-10-19 Thread Jan Lukavský
On 10/19/23 18:28, Robert Bradshaw via dev wrote: On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis wrote: Rill is definitely SQL-oriented but I think that's going to be the most common. Dataframes are explicitly modeled on the relational approach so that's going to look a lot like SQL, I think

Re: [YAML] Aggregations

2023-10-19 Thread Byron Ellis via dev
On Thu, Oct 19, 2023 at 9:28 AM Robert Bradshaw wrote: > On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis wrote: > > > > Rill is definitely SQL-oriented but I think that's going to be the most > common. Dataframes are explicitly modeled on the relational approach so > that's going to look a lot like

Re: [YAML] Aggregations

2023-10-19 Thread Robert Bradshaw via dev
On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis wrote: > > Rill is definitely SQL-oriented but I think that's going to be the most > common. Dataframes are explicitly modeled on the relational approach so > that's going to look a lot like SQL, I think pretty much any approach that fits here is

Re: [YAML] Aggregations

2023-10-19 Thread Byron Ellis via dev
Rill is definitely SQL-oriented but I think that's going to be the most common. Dataframes are explicitly modeled on the relational approach so that's going to look a lot like SQL, which leaves us with S-style formulas (which I like but are pretty niche) and I guess pivot tables coming from the

[PYTHON] partitioner utilities?

2023-10-19 Thread Joey Tran
Hey all, While writing a few pipelines, I was surprised by how few partitioners there were in the Python SDK. I wrote a couple that are pretty generic and possibly generally useful. Just wanted to do a quick poll to see if they seem useful enough to be in the SDK's library of transforms. If so, I

Beam High Priority Issue Report (43)

2023-10-19 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities. Unassigned P1 Issues: https://github.com/apache/beam/issues/29022 [Failing Test]: