Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-14 Thread Brian Hulette
To close the loop on this: Rui just added a check that rejects distinct aggregations for now[1]. I wrote up BEAM-7306[2] to track this feature going forward. [1] https://github.com/apache/beam/pull/8498 [2] https://issues.apache.org/jira/browse/BEAM-7306 *From: *Mingmin Xu *Date: *Mon, May 6,

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-06 Thread Mingmin Xu
Good point to reject DISTINCT operations currently, as it's not handled now. There could be more similar cases need to revise and document well. Regarding to how to DISTINCT support, I was confused by stateful CombineFn at first. To make it simple, we can extend step by step, like reject

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Rui Wang
A compromise solution would be using SELECT DISTINCT or GROUP BY to duplicate before apply aggregations. It's two shuffles and works on non floating point columns. The good thing is no code change is needed, but downsides are users need to write more complicated query and floating point data is

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Rui Wang
Fair point. It lacks of proper benchmarks for BeamSQL to test performance and scalability of implementations. -Rui On Fri, May 3, 2019 at 12:56 PM Reuven Lax wrote: > Back to the original point: I'm very skeptical of adding something that > does not scale at all. In our experience, users get

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Reuven Lax
Back to the original point: I'm very skeptical of adding something that does not scale at all. In our experience, users get far more upset with an advertised feature that doesn't work for them (e.g. their workers OOM) than with a missing feature. Reuven On Fri, May 3, 2019 at 12:41 PM Kenneth

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Kenneth Knowles
All good points. My version of the two shuffle approach does not work at all. On Fri, May 3, 2019 at 11:38 AM Brian Hulette wrote: > Rui's point about FLOAT/DOUBLE columns is interesting as well. We couldn't > support distinct aggregations on floating point columns with the > two-shuffle

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Rui Wang
To clarify what I said "So two shuffle approach will lead to two different implementation for tables with and without FLOAT/DOUBLE column.": Basically I wanted to say that two shuffles approach will be an implementation for some cases, and it will co-exist with CombineFn approach. In the feature,

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Rui Wang
> > > As to the distinct aggregations: At the least, these queries should be > rejected, not evaluated incorrectly. > Yes. The least is not to support it, and throws clear message to say no. (current implementation ignores DISTINCT and executes all aggregations as ALL). > The term "stateful

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-02 Thread Kenneth Knowles
Meta: All of Beam SQL is still "experimental" isn't it? There's very little chance that the structure of Beam SQL pipelines will be stable enough for e.g. pipeline update. So that is not worth worrying about at this stage. And this doesn't seem to affect APIs / compile time compatibility. As to

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-02 Thread Ahmet Altay
On Thu, May 2, 2019 at 2:18 PM Rui Wang wrote: > Brian's first proposal is challenging also partially because in BeamSQL > there is no good practice to deal with complex SQL plans. Ideally we need > enough rules and SQL plan node in Beam to construct easy-to-transform plans > for different

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-02 Thread Rui Wang
Brian's first proposal is challenging also partially because in BeamSQL there is no good practice to deal with complex SQL plans. Ideally we need enough rules and SQL plan node in Beam to construct easy-to-transform plans for different cases. I had a similar situation before when I needed to

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-02 Thread Brian Hulette
Ahmet - I think it would only require observing each key's partition of the input independently, and the size of the state would only be proportional to the number of distinct elements, not the entire input. Note the pipeline would be a GBK with a key based on the GROUP BY, followed by a

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-02 Thread Lukasz Cwik
Can you also go into more detail why you think 1) is more challenging to implement? On Thu, May 2, 2019 at 11:58 AM Ahmet Altay wrote: > From my limited understanding, would not the stateful combinefn option > require observing the whole input before being able combine and the risk of > blowing

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-02 Thread Ahmet Altay
>From my limited understanding, would not the stateful combinefn option require observing the whole input before being able combine and the risk of blowing memory is actually very high except for trivial inputs? On Thu, May 2, 2019 at 11:50 AM Brian Hulette wrote: > Hi everyone, > Currently