Re: Exposing multiple values of a trait from the operator
Negative, currently. It is indeed complicated, but still acceptable, IMO. If the community has consensus on this matter, we can create a JIRA to add the feature to the trait framework: handle equivalent-keys trait satisfaction in trait.satisfy() and column remapping in trait.apply(). Hopefully that can ease some pain, if you happen to use Calcite's default distribution/collation implementation.

Regards,
Haisheng Yuan

On 2021/05/25 17:45:32, Vladimir Ozerov wrote:
> [snip - quoted in full later in this thread]
Re: AggregateUnionTransposeRule fails when some inputs have unique grouping key
Done:
https://issues.apache.org/jira/browse/CALCITE-4616?focusedCommentId=17351269&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17351269

On Fri, May 21, 2021 at 21:32, Julian Hyde wrote:
> Can you add your proposed fix to the bug, and we can discuss it there.
>
> > On May 21, 2021, at 2:33 AM, Vladimir Ozerov wrote:
> >
> > Hi,
> >
> > I created two issues for two distinct bugs in AggregateUnionTransposeRule:
> >
> > 1. https://issues.apache.org/jira/browse/CALCITE-4616 - the problem with
> >    a partial Aggregate push-down
> > 2. https://issues.apache.org/jira/browse/CALCITE-4615 - the problem with
> >    an incorrect type of aggregate call (not discussed in this thread before)
> >
> > Regarding issue 1, my proposal is to remove the partial push-down
> > optimization completely. We may do the following:
> >
> > 1. Iterate over all inputs and check whether they have a unique key on
> >    the Aggregate's group key.
> > 2. If all inputs are unique, return, as there is no point in pushing the
> >    Aggregate past the Union.
> > 3. If at least one input is not unique, push the Aggregate to *all*
> >    inputs to maintain the proper row type, assuming that
> >    AggregateRemoveRule will eventually remove the unnecessary
> >    aggregations.
> >
> > Regards,
> > Vladimir.
> >
> > On Fri, May 21, 2021 at 9:43, JiaTao Tao wrote:
> >
> >> Sounds reasonable.
> >>
> >> Regards!
> >>
> >> Aron Tao
> >>
> >> On Wed, May 19, 2021 at 8:57 PM, Vladimir Ozerov wrote:
> >>
> >>> Hi,
> >>>
> >>> The AggregateUnionTransposeRule attempts to push an Aggregate below a
> >>> Union.
> >>>
> >>> Before:
> >>> Aggregate[group=$0, agg=SUM($1)]
> >>>   Union[all]
> >>>     Input1
> >>>     Input2
> >>>
> >>> After:
> >>> Aggregate[group=$0, agg=SUM($1)]
> >>>   Union[all]
> >>>     Aggregate[group=$0, agg=SUM($1)]
> >>>       Input1
> >>>     Aggregate[group=$0, agg=SUM($1)]
> >>>       Input2
> >>>
> >>> When pushing the Aggregate, the rule checks whether the input is
> >>> definitively unique on the grouping key. If yes, the Aggregate is not
> >>> installed on top of that input, assuming that the result would be the
> >>> same as without the Aggregate. This generates a type-mismatch exception
> >>> when the aggregation is pushed to only some of the inputs:
> >>> Aggregate[group=$0, agg=SUM($1)]
> >>>   Union[all]
> >>>     Aggregate[group=$0, agg=SUM($1)]
> >>>       Input1
> >>>     Input2
> >>>
> >>> It seems that the uniqueness check should not be in this rule at all,
> >>> and the Aggregate should be pushed unconditionally. Motivation: we
> >>> already have AggregateRemoveRule that removes unnecessary aggregates,
> >>> so there is no need to duplicate the same non-trivial logic twice.
> >>>
> >>> Does the proposal make sense to you?
> >>>
> >>> Regards,
> >>> Vladimir.
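The three-step proposal above can be sketched as a small decision function. This is a toy model, not Calcite's actual rule code: the class and method names are hypothetical, and each Union input is reduced to a single boolean saying whether it is already unique on the grouping key.

```java
import java.util.Collections;
import java.util.List;

/** Toy model of the proposed AggregateUnionTransposeRule fix. */
final class AggregateUnionPushdown {
    /**
     * inputIsUnique.get(i) tells whether Union input i is already unique on
     * the grouping key. Returns null when the rule should bail out (all
     * inputs unique, so nothing is gained by transposing); otherwise returns
     * a per-input decision that pushes the Aggregate onto EVERY input, so
     * all Union branches keep the same row type. Redundant aggregations are
     * expected to be cleaned up later by AggregateRemoveRule.
     */
    static List<Boolean> pushTargets(List<Boolean> inputIsUnique) {
        if (!inputIsUnique.contains(false)) {
            return null;  // every input is unique: do not transpose at all
        }
        return Collections.nCopies(inputIsUnique.size(), true);
    }
}
```

The key point the sketch encodes is the all-or-nothing choice: pushing to only the non-unique inputs is exactly the partial push-down that triggers the type mismatch.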
Re: Exposing multiple values of a trait from the operator
Hi Haisheng,

Thank you for the advice. This is exactly how I designed distribution at the moment (approach 2 from my original email) - as a List instead of just int[]. My main concern was the increased complexity of trait propagation/derivation, as I have to manage these nested lists by hand. Nevertheless, it works well. So I hoped that there were better built-in approaches that I might use. If the answer is negative, I'll continue using the original approach, where multiple alternatives are managed manually.

Regards,
Vladimir.

On Tue, May 25, 2021 at 20:30, Haisheng Yuan wrote:
> [snip - quoted in full later in this thread]
Re: Exposing multiple values of a trait from the operator
Hi Vladimir,

Glad to see you raised the question.

Here is the advice: do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally flawed and has many bugs. It can't work properly, whether top-down or bottom-up.

Instead, we need to add an equivalent-keys bitmap as a property of physical traits like RelCollation and RelDistribution.

For example:

class RelDistributionImpl {
  // list of distribution keys
  private ImmutableIntList keys;

  // list of equivalent bitsets, one for each distribution key
  private ImmutableList equivBitSets;
}

In trait satisfaction and column remapping, we also need to take equivalent keys into consideration. Some of this work needs to be done in the Calcite core framework.

The Greenplum Orca optimizer has a similar strategy:
https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44

Regards,
Haisheng Yuan

On 2021/05/25 15:37:32, Vladimir Ozerov wrote:
> [snip - the original message appears in full later in this thread]
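The RelDistributionImpl idea above can be illustrated with a self-contained model. This is plain Java with no Calcite dependency; the class and method names are illustrative (the generic parameters of the original fields were lost in the archived message, so the sketch simply assumes one bitset of equivalent columns per distribution key).

```java
import java.util.BitSet;
import java.util.List;

/** Toy model of a hash distribution whose keys carry equivalence sets. */
final class HashDistribution {
    final List<Integer> keys;      // distribution keys, e.g. [a1]
    final List<BitSet> equivKeys;  // equivKeys.get(i) = columns equivalent to keys.get(i)

    HashDistribution(List<Integer> keys, List<BitSet> equivKeys) {
        this.keys = keys;
        this.equivKeys = equivKeys;
    }

    /**
     * This distribution satisfies a required one if, position by position,
     * every required key is either the same column or a member of the
     * corresponding equivalence set.
     */
    boolean satisfies(HashDistribution required) {
        if (required.keys.size() != keys.size()) {
            return false;
        }
        for (int i = 0; i < keys.size(); i++) {
            int reqKey = required.keys.get(i);
            if (keys.get(i) != reqKey && !equivKeys.get(i).get(reqKey)) {
                return false;
            }
        }
        return true;
    }
}
```

Usage matches the join example from the thread: a join on a.a1 = b.b1 produces keys [a1] with equivalence set {a1, b1}, so a parent requiring distribution by [b1] is satisfied without an extra exchange.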
Re: Exposing multiple values of a trait from the operator
Hi Vladimir,

Thank you for the link. It is very relevant to my problem. I see that in these discussions there were several ideas and claims, such as: (1) we can get rid of "simplify" altogether; (2) composite traits are rare in practice; (3) composite traits are not designed well in the first place [1]. I do not have the full picture in my head, so I'll try to share some thoughts to advance the discussion.

Regarding (1), I think that the removal of "simplify" may help with my particular (and pretty simple) test but might lead to unpredictable results for more complicated queries. Suppose that we generate two equivalent nodes with different traits: [a] and [a][b]. Depending on the nature of the trait def, these two nodes might or might not belong to the same subset. For example, [a] and [a][b] are different subsets for RelCollation. At the same time, [a] and [a][b] could belong to the same subset for some distributions. That is, if the input is hash-distributed by either [a] or [b], it might imply that a == b for every tuple (otherwise, hashes would not match), and therefore every RelNode in the RelSet that is sharded by [a] is also sharded by [b] and vice versa. The idea is similar to transitive predicates. So ideally, we should let the RelTraitDef define how to compare composite traits with other traits. Otherwise, we may lose some optimization opportunities.

Regarding (2), perhaps multi-collation nodes really are rare in practice. But nodes with multiple hash distributions are widespread in distributed engines, because in distributed systems the collocated hash equijoin is the most common way of joining two inputs, and such a join always produces an additional distribution.

Regarding (3), it would be very interesting to hear suggestions and ideas on the proper design of composite traits. The composite-trait mechanics mentioned in the RelSubset Javadoc are not a good design choice for distribution traits. That is, if we have a node that is distributed by [a][b], we cannot just put it into the two subsets [a] and [b], because operator parents may require both [a] and [b]; otherwise, unnecessary exchanges could appear. That is, [a][b] should be propagated together. For example, the removal of SHARDED[a1] from #1 would add an exchange between #2 and #1, and the removal of SHARDED[b1] from #1 would add an exchange between #3 and #2. Neither is optimal.

3: Aggregate[group=b1]
2:   Join[a.a1=c.c1]    // SHARDED[a1], SHARDED[b1], SHARDED[c1]
1:     Join[a.a1=b.b1]  // SHARDED[a1], SHARDED[b1]

@Haisheng Yuan, following your comment [1], would you mind providing your ideas around the proper design of composite traits? Are composite traits implemented in Orca?

Regards,
Vladimir.

[1] https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=17081984&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17081984

On Tue, May 25, 2021 at 19:32, Vladimir Sitnikov wrote:
> [snip - quoted in full later in this thread]
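The "similar to transitive predicates" observation above amounts to maintaining equivalence classes of columns: each equi-join condition merges two classes, and membership in a class is transitive. A minimal union-find sketch (plain Java; the class name and string-based column identifiers are illustrative, not Calcite API):

```java
import java.util.HashMap;
import java.util.Map;

/** Union-find over columns: each equi-join condition merges two classes. */
final class ColumnEquivalence {
    private final Map<String, String> parent = new HashMap<>();

    private String find(String col) {
        parent.putIfAbsent(col, col);
        String p = parent.get(col);
        if (!p.equals(col)) {
            p = find(p);
            parent.put(col, p);  // path compression
        }
        return p;
    }

    /** Record an equi-join condition col1 = col2. */
    void addEquiJoin(String col1, String col2) {
        parent.put(find(col1), find(col2));
    }

    /** True if the two columns are (transitively) equivalent. */
    boolean equivalent(String col1, String col2) {
        return find(col1).equals(find(col2));
    }
}
```

In the plan above, a1 = b1 (join #1) and a1 = c1 (join #2) put a1, b1, and c1 into one class, so a node sharded by any of them can be treated as sharded by all three when matching a parent's requirement.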
Re: Exposing multiple values of a trait from the operator
> VolcanoPlanner flattens the composite trait into the default trait value in RelSet.add -> RelTraitSet.simplify

Vladimir, have you tried removing that RelTraitSet.simplify?

I remember I have run into that multiple times already, and I suggested removing that "simplify". For example:

https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750377

Vladimir
Exposing multiple values of a trait from the operator
Hi,

Consider a distributed SQL engine that uses a distribution property to model exchanges, and the following physical tree. To do the distributed join, we co-locate tuples using the equijoin key. Now the Join operator has two equivalent distributions - [a1] and [b1]. It is critical to expose both distributions so that the top Aggregate can take advantage of the co-location.

Aggregate[group=b1]
  DistributedJoin[a.a1=b.b1]  // SHARDED[a1], SHARDED[b1]
    Input[a]                  // SHARDED[a1]
    Input[b]                  // SHARDED[b1]

A similar example for the Project:

Aggregate[group=$1]
  Project[$0=a, $1=a]  // SHARDED[$0], SHARDED[$1]
    Input              // SHARDED[a]

The question is how to model this situation properly.

First, it seems that RelMultipleTrait and RelCompositeTrait were designed to handle this situation. However, I couldn't make them work with the top-down optimizer. The reason is that when we register a RelNode with a composite trait in the MEMO, VolcanoPlanner flattens the composite trait into the default trait value in RelSet.add -> RelTraitSet.simplify. That is, the trait [SHARDED[a], SHARDED[b]] will be converted to [ANY], so that the original traits cannot be derived in the PhysicalNode.derive methods.

Second, we may try to model multiple sharding keys in a single trait. But this complicates the implementation of PhysicalNode.passThrough/derive significantly:

SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]

Third, we may expose multiple traits using metadata. RelMdDistribution would not work, because it exposes only a single distribution. But a custom handler may potentially fix that. However, it still would not be integrated with the top-down optimizer, which makes the idea questionable.

To summarize, it seems that there is currently no easy way to handle composite traits with a top-down optimizer. I wonder whether someone on the dev list has already solved similar issues in Apache Calcite or other optimizers. If so, what was the approach or best practice? Intuitively, it seems that the RelMultipleTrait/RelCompositeTrait approach might be the way to go. But why do we replace the original composite trait set with the default value in the RelTraitSet.simplify routine?

Regards,
Vladimir.
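The second approach above (a single trait of the form SHARDED[[a1, a2], [b1, b2]]) can be sketched as a trait holding a list of alternative key lists, where derivation through a Project remaps every alternative and drops those whose columns are projected away. Plain-Java illustration with hypothetical names; this is the nested-list bookkeeping the thread describes as the source of complexity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** A sharding trait with several equivalent alternative key lists. */
final class MultiSharding {
    final List<List<Integer>> alternatives;  // e.g. [[a1, a2], [b1, b2]]

    MultiSharding(List<List<Integer>> alternatives) {
        this.alternatives = alternatives;
    }

    /**
     * Derive the trait above a Project. The mapping sends input column
     * index -> output column index; alternatives that reference a column
     * absent from the mapping (i.e. projected away) are dropped.
     */
    MultiSharding remap(Map<Integer, Integer> mapping) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> alt : alternatives) {
            List<Integer> remapped = new ArrayList<>();
            for (int key : alt) {
                Integer target = mapping.get(key);
                if (target == null) {
                    remapped = null;  // a key was projected away: drop alternative
                    break;
                }
                remapped.add(target);
            }
            if (remapped != null) {
                out.add(remapped);
            }
        }
        return new MultiSharding(out);
    }
}
```

Every operator's passThrough/derive has to apply this kind of per-alternative loop, which is the "manage these nested lists by hand" burden mentioned in the replies.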
[jira] [Created] (CALCITE-4621) SemiJoinRule throws AssertionError on ANTI join
Ruben Q L created CALCITE-4621:
--------------------------------------

             Summary: SemiJoinRule throws AssertionError on ANTI join
                 Key: CALCITE-4621
                 URL: https://issues.apache.org/jira/browse/CALCITE-4621
             Project: Calcite
          Issue Type: Bug
          Components: core
    Affects Versions: 1.26.0
            Reporter: Ruben Q L
            Assignee: Ruben Q L
             Fix For: 1.27.0

When SemiJoinRule (both {{CoreRules.JOIN_TO_SEMI_JOIN}} and {{CoreRules.PROJECT_TO_SEMI_JOIN}}) matches an ANTI join, it fails with the following error:
{noformat}
java.lang.AssertionError: ANTI
	at org.apache.calcite.rel.rules.SemiJoinRule.perform(SemiJoinRule.java:122)
	...
{noformat}

The problem is that the rule config only forbids RIGHT and FULL joins, and lets ANTI go through. Later, when the rule is actually executed, joins with type ANTI cannot be handled, hence the {{AssertionError}}. The rule config must be adapted to forbid ANTI joins as well.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
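The described fix boils down to tightening the rule's match predicate. A minimal sketch of that guard, using a local enum mirroring Calcite's JoinRelType rather than the actual Calcite config API:

```java
/** Join types, mirrored locally from Calcite's JoinRelType for the sketch. */
enum JoinType { INNER, LEFT, RIGHT, FULL, SEMI, ANTI }

final class SemiJoinRuleGuard {
    /**
     * Before the fix, the config rejected only RIGHT and FULL, letting ANTI
     * reach SemiJoinRule.perform and hit the assertion. The fixed predicate
     * also rejects ANTI, so the rule never fires on a join it cannot handle.
     */
    static boolean matches(JoinType type) {
        return type != JoinType.RIGHT
            && type != JoinType.FULL
            && type != JoinType.ANTI;
    }
}
```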