To close the loop on this: Rui just added a check that rejects distinct
aggregations for now[1]. I wrote up BEAM-7306[2] to track this feature
going forward.
[1] https://github.com/apache/beam/pull/8498
[2] https://issues.apache.org/jira/browse/BEAM-7306
*From: *Mingmin Xu
*Date: *Mon, May 6,
Good point about rejecting DISTINCT operations for now, since they are not
handled yet. There may be more similar cases that we need to revise and
document well.
Regarding how to support DISTINCT, I was confused by the stateful CombineFn
at first. To keep it simple, we can extend step by step, like reject
A compromise solution would be to use SELECT DISTINCT or GROUP BY to
deduplicate before applying aggregations. It takes two shuffles and works on
non-floating-point columns. The good thing is that no code change is needed;
the downsides are that users need to write more complicated queries and
floating point data is
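To make the two-shuffle idea above concrete, here is a minimal sketch in plain Python (not Beam code). It assumes rows are (key, value) pairs and mimics rewriting SUM(DISTINCT v) ... GROUP BY k as a SUM over a SELECT DISTINCT k, v subquery. The function name and the sample data are made up for illustration.

```python
from collections import defaultdict


def distinct_sum_two_stage(rows):
    """Sum the distinct values per key; rows is an iterable of (key, value)."""
    # Stage 1 (stands in for the first shuffle): collapse duplicate
    # (key, value) pairs, like SELECT DISTINCT k, v.
    deduped = set(rows)

    # Stage 2 (stands in for the second shuffle): group by key and
    # aggregate the surviving values, like SUM(v) ... GROUP BY k.
    totals = defaultdict(int)
    for key, value in deduped:
        totals[key] += value
    return dict(totals)


# Hypothetical sample input: key "a" sees the value 1 twice.
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 5), ("b", 5)]
result = distinct_sum_two_stage(rows)
```

With this input, key "a" sums 1 and 2 once each, and key "b" keeps a single 5.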
Fair point. BeamSQL lacks proper benchmarks to test the performance
and scalability of implementations.
-Rui
On Fri, May 3, 2019 at 12:56 PM Reuven Lax wrote:
Back to the original point: I'm very skeptical of adding something that
does not scale at all. In our experience, users get far more upset with an
advertised feature that doesn't work for them (e.g. their workers OOM) than
with a missing feature.
Reuven
On Fri, May 3, 2019 at 12:41 PM Kenneth
All good points. My version of the two shuffle approach does not work at
all.
On Fri, May 3, 2019 at 11:38 AM Brian Hulette wrote:
> Rui's point about FLOAT/DOUBLE columns is interesting as well. We couldn't
> support distinct aggregations on floating point columns with the
> two-shuffle
To clarify what I said "So two shuffle approach will lead to two different
implementation for tables with and without FLOAT/DOUBLE column.":
Basically I wanted to say that the two-shuffle approach would be an
implementation for some cases, and it would coexist with the CombineFn
approach. In the future,
>
>
> As to the distinct aggregations: At the least, these queries should be
> rejected, not evaluated incorrectly.
>
Yes. The least we can do is not support it and throw a clear error message
saying so. (The current implementation ignores DISTINCT and executes all
aggregations as ALL.)
> The term "stateful
Meta: All of Beam SQL is still "experimental" isn't it? There's very little
chance that the structure of Beam SQL pipelines will be stable enough for
e.g. pipeline update. So that is not worth worrying about at this stage.
And this doesn't seem to affect APIs / compile time compatibility.
As to
On Thu, May 2, 2019 at 2:18 PM Rui Wang wrote:
Brian's first proposal is also challenging partly because BeamSQL has no
good practice for dealing with complex SQL plans. Ideally we need
enough rules and SQL plan nodes in Beam to construct easy-to-transform plans
for different cases. I had a similar situation before when I needed to
Ahmet -
I think it would only require observing each key's partition of the input
independently, and the size of the state would only be proportional to the
number of distinct elements, not the entire input. Note the pipeline would
be a GBK with a key based on the GROUP BY, followed by a
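As a rough illustration of why the state would grow with the number of distinct elements rather than with the whole input, here is a plain-Python sketch. The method names mirror the shape of a Beam CombineFn (create_accumulator, add_input, merge_accumulators, extract_output), but the class does not import or subclass anything from Beam. The class and helper names are hypothetical, and this is not the implementation discussed in the thread.

```python
class DistinctCountSketch:
    """COUNT(DISTINCT v) with a set as the accumulator (illustrative only)."""

    def create_accumulator(self):
        # The accumulator holds each distinct value once, so its size is
        # bounded by the number of distinct values seen for one key.
        return set()

    def add_input(self, acc, value):
        acc.add(value)  # adding a duplicate does not grow the set
        return acc

    def merge_accumulators(self, accs):
        merged = set()
        for acc in accs:
            merged |= acc
        return merged

    def extract_output(self, acc):
        return len(acc)


def count_distinct_per_key(rows):
    """Keep one accumulator per key, as if grouped by the GROUP BY key."""
    fn = DistinctCountSketch()
    accs = {}
    for key, value in rows:
        acc = accs.get(key)
        if acc is None:
            acc = fn.create_accumulator()
        accs[key] = fn.add_input(acc, value)
    return {key: fn.extract_output(acc) for key, acc in accs.items()}


# Hypothetical sample input: key "a" has 3 rows but only 2 distinct values.
counts = count_distinct_per_key([("a", 1), ("a", 1), ("a", 2), ("b", 5)])
```

Each key's set still has to fit in memory, so a key with very many distinct values remains a risk.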
Can you also go into more detail on why you think 1) is more challenging to
implement?
On Thu, May 2, 2019 at 11:58 AM Ahmet Altay wrote:
From my limited understanding, wouldn't the stateful CombineFn option
require observing the whole input before being able to combine, making the
risk of blowing up memory very high except for trivial inputs?
On Thu, May 2, 2019 at 11:50 AM Brian Hulette wrote:
> Hi everyone,
> Currently