Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Haisheng Yuan
Negative, currently.

It is indeed complicated, but still acceptable IMO. If the community has
consensus on this matter, we can create a JIRA to add the feature to traits and
handle equivalent-keys trait satisfaction in trait.satisfy() and column
remapping in trait.apply(). Hopefully that can ease some pain if you happen to
use Calcite's default distribution/collation implementation.
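To illustrate the kind of column remapping trait.apply() would need, here is a minimal self-contained sketch. It is not Calcite's API; the class and method names (DistRemap, apply) and the plain java.util types are hypothetical stand-ins for illustration only.

```java
import java.util.*;

// Hypothetical sketch (not Calcite code): remapping distribution keys
// through a Project's input-to-output column mapping.
class DistRemap {
    // Remaps each distribution key (an input column index) to its output
    // position. Returns null if a key is projected away, in which case the
    // distribution is lost.
    static List<Integer> apply(List<Integer> keys, Map<Integer, Integer> inputToOutput) {
        List<Integer> result = new ArrayList<>();
        for (int key : keys) {
            Integer mapped = inputToOutput.get(key);
            if (mapped == null) {
                return null; // distribution key not present in the output
            }
            result.add(mapped);
        }
        return result;
    }

    public static void main(String[] args) {
        // The Project exposes input column 0; input column 1 is dropped.
        System.out.println(apply(List.of(0), Map.of(0, 0))); // [0]
        System.out.println(apply(List.of(1), Map.of(0, 0))); // null
    }
}
```

The same walk would also have to remap each key's equivalence set, dropping equivalents that are not projected.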

Regards,
Haisheng Yuan

On 2021/05/25 17:45:32, Vladimir Ozerov  wrote: 
> Hi Haisheng,
> 
> Thank you for the advice. This is exactly how I designed distribution at
> the moment (the approach 2 from my original email) - as a List
> instead of just int[]. My main concern was the increased complexity of the
> trait propagation/derivation, as I have to manage these nested lists by
> hand. Nevertheless, it works well. So I hoped that there were better
> built-in approaches that I could use. If the answer is negative, I'll
> continue using the original approach, where multiple alternatives are
> managed manually.
> 
> Regards,
> Vladimir.
> 
> Tue, 25 May 2021 at 20:30, Haisheng Yuan :
> 
> > Hi Vladimir,
> >
> > Glad to see you raised the question.
> >
> > Here is the advice:
> > Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally
> > flawed and has many bugs. It can't work properly, whether top-down or
> > bottom-up.
> >
> > Instead, we need to add an equivalent-keys bitmap as a property of physical
> > traits like RelCollation and RelDistribution.
> >
> > For example:
> > class RelDistributionImpl {
> >   // list of distribution keys
> >   private ImmutableIntList keys;
> >
> >   // list of equivalent bitsets, one for each distribution key
> >   private ImmutableList<ImmutableBitSet> equivBitSets;
> > }
> >
> > In trait satisfaction and column remapping, we also need to take equivalent
> > keys into consideration. Some of the work needs to be done in the Calcite
> > core framework.
> >
> > The Greenplum Orca optimizer has a similar strategy:
> >
> > https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44
> >
> > Regards,
> > Haisheng Yuan
> >
> > On 2021/05/25 15:37:32, Vladimir Ozerov  wrote:
> > > Hi,
> > >
> > > Consider the distributed SQL engine that uses a distribution property to
> > > model exchanges. Consider the following physical tree. To do the
> > > distributed join, we co-locate tuples using the equijoin key. Now the Join
> > > operator has two equivalent distributions - [a1] and [b1]. It is critical
> > > to expose both distributions so that the top Aggregate can take advantage
> > > of the co-location.
> > >
> > > Aggregate[group=b1]
> > >   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> > > Input[a]   // SHARDED[a1]
> > > Input[b]   // SHARDED[b1]
> > >
> > > A similar example for the Project:
> > > Aggregate[group=$1]
> > >   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> > > Input // SHARDED[a]
> > >
> > > The question is how to model this situation properly?
> > >
> > > First, it seems that RelMultipleTrait and RelCompositeTrait were designed
> > > to handle this situation. However, I couldn't make them work with the
> > > top-down optimizer. The reason is that when we register a RelNode with a
> > > composite trait in MEMO, VolcanoPlanner flattens the composite trait into
> > > the default trait value in RelSet.add -> RelTraitSet.simplify. That is, the
> > > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
> > > original traits could not be derived in the PhysicalNode.derive methods.
> > >
> > > Second, we may try to model multiple sharding keys in a single trait. But
> > > this complicates the implementation of PhysicalNode.passThrough/derive
> > > significantly.
> > > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
> > >
> > > Third, we may expose multiple traits using metadata. RelMdDistribution
> > > would not work, because it exposes only a single distribution. But a custom
> > > handler may potentially fix that. However, it would still not be integrated
> > > with the top-down optimizer, which makes the idea questionable.
> > >
> > > To summarize, it seems that currently there is no easy way to handle
> > > composite traits with a top-down optimizer. I wonder whether someone from
> > > the devlist already solved similar issues in Apache Calcite or other
> > > optimizers. If so, what was the approach or best practices? Intuitively, it
> > > seems that the RelMultipleTrait/RelCompositeTrait approach might be the way
> > > to go. But why do we replace the original composite trait set with the
> > > default value in the RelTraitSet.simplify routine?
> > >
> > > Regards,
> > > Vladimir.
> > >
> >
> 


Re: AggregateUnionTransposeRule fails when some inputs have unique grouping key

2021-05-25 Thread Vladimir Ozerov
Done:
https://issues.apache.org/jira/browse/CALCITE-4616?focusedCommentId=17351269&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17351269

Fri, 21 May 2021 at 21:32, Julian Hyde :

> Can you add your proposed fix to the bug, and we can discuss it there.
>
> > On May 21, 2021, at 2:33 AM, Vladimir Ozerov  wrote:
> >
> > Hi,
> >
> > I created two issues for two distinct bugs in AggregateUnionTransposeRule:
> >
> >   1. https://issues.apache.org/jira/browse/CALCITE-4616 - the problem with
> >      a partial Aggregate push-down
> >   2. https://issues.apache.org/jira/browse/CALCITE-4615 - the problem with
> >      an incorrect type of aggregate call (not discussed in this thread before)
> >
> > Regarding issue 1, my proposal is to remove the partial pushdown
> > optimization completely. We may do the following:
> >
> >   1. Iterate over all inputs, and check if they have a unique key on the
> >   Aggregate's group key.
> >   2. If all inputs are unique, return, as there is no point in pushing the
> >      Aggregate past the Union.
> >   3. If at least one input is not unique, push the Aggregate to *all* inputs
> >      to maintain the proper row type, assuming that the AggregateRemoveRule
> >      will eventually remove unnecessary aggregations.
> >
> > Regards,
> > Vladimir.
> >
> > Fri, 21 May 2021 at 09:43, JiaTao Tao :
> >
> >> Sounds like it makes sense.
> >>
> >> Regards!
> >>
> >> Aron Tao
> >>
> >>
> >> Vladimir Ozerov wrote on Wed, 19 May 2021 at 8:57 PM:
> >>
> >>> Hi,
> >>>
> >>> The AggregateUnionTransposeRule attempts to push the Aggregate below the
> >>> Union.
> >>>
> >>> Before:
> >>> Aggregate[group=$0, agg=SUM($1)]
> >>>   Union[all]
> >>>     Input1
> >>>     Input2
> >>>
> >>> After:
> >>> Aggregate[group=$0, agg=SUM($1)]
> >>>   Union[all]
> >>>     Aggregate[group=$0, agg=SUM($1)]
> >>>       Input1
> >>>     Aggregate[group=$0, agg=SUM($1)]
> >>>       Input2
> >>>
> >>> When pushing the Aggregate, it checks whether the input is definitively
> >>> unique on the grouping key. If yes, the Aggregate is not installed on top
> >>> of the input, assuming that the result would be the same as without the
> >>> Aggregate. This generates a type mismatch exception when aggregation is
> >>> pushed only to some of the inputs:
> >>> Aggregate[group=$0, agg=SUM($1)]
> >>>   Union[all]
> >>>     Aggregate[group=$0, agg=SUM($1)]
> >>>       Input1
> >>>     Input2
> >>>
> >>> It seems that the uniqueness check should not be in that rule at all, and
> >>> the aggregate should be pushed unconditionally. Motivation: we already have
> >>> AggregateRemoveRule that removes unnecessary aggregates. No need to
> >>> duplicate the same non-trivial logic twice.
> >>>
> >>> Does the proposal make sense to you?
> >>>
> >>> Regards,
> >>> Vladimir.
> >>>
> >>
>
>


Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Ozerov
Hi Haisheng,

Thank you for the advice. This is exactly how I designed distribution at
the moment (the approach 2 from my original email) - as a List
instead of just int[]. My main concern was the increased complexity of the
trait propagation/derivation, as I have to manage these nested lists by
hand. Nevertheless, it works well. So I hoped that there were better
built-in approaches that I could use. If the answer is negative, I'll
continue using the original approach, where multiple alternatives are
managed manually.

Regards,
Vladimir.

Tue, 25 May 2021 at 20:30, Haisheng Yuan :

> Hi Vladimir,
>
> Glad to see you raised the question.
>
> Here is the advice:
> Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally
> flawed and has many bugs. It can't work properly, whether top-down or
> bottom-up.
>
> Instead, we need to add an equivalent-keys bitmap as a property of physical
> traits like RelCollation and RelDistribution.
>
> For example:
> class RelDistributionImpl {
>   // list of distribution keys
>   private ImmutableIntList keys;
>
>   // list of equivalent bitsets, one for each distribution key
>   private ImmutableList<ImmutableBitSet> equivBitSets;
> }
>
> In trait satisfaction and column remapping, we also need to take equivalent
> keys into consideration. Some of the work needs to be done in the Calcite
> core framework.
>
> The Greenplum Orca optimizer has a similar strategy:
>
> https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44
>
> Regards,
> Haisheng Yuan
>
> On 2021/05/25 15:37:32, Vladimir Ozerov  wrote:
> > Hi,
> >
> > Consider the distributed SQL engine that uses a distribution property to
> > model exchanges. Consider the following physical tree. To do the
> > distributed join, we co-locate tuples using the equijoin key. Now the Join
> > operator has two equivalent distributions - [a1] and [b1]. It is critical
> > to expose both distributions so that the top Aggregate can take advantage
> > of the co-location.
> >
> > Aggregate[group=b1]
> >   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> > Input[a]   // SHARDED[a1]
> > Input[b]   // SHARDED[b1]
> >
> > A similar example for the Project:
> > Aggregate[group=$1]
> >   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> > Input // SHARDED[a]
> >
> > The question is how to model this situation properly?
> >
> > First, it seems that RelMultipleTrait and RelCompositeTrait were designed
> > to handle this situation. However, I couldn't make them work with the
> > top-down optimizer. The reason is that when we register a RelNode with a
> > composite trait in MEMO, VolcanoPlanner flattens the composite trait into
> > the default trait value in RelSet.add -> RelTraitSet.simplify. That is, the
> > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
> > original traits could not be derived in the PhysicalNode.derive methods.
> >
> > Second, we may try to model multiple sharding keys in a single trait. But
> > this complicates the implementation of PhysicalNode.passThrough/derive
> > significantly.
> > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
> >
> > Third, we may expose multiple traits using metadata. RelMdDistribution
> > would not work, because it exposes only a single distribution. But a custom
> > handler may potentially fix that. However, it would still not be integrated
> > with the top-down optimizer, which makes the idea questionable.
> >
> > To summarize, it seems that currently there is no easy way to handle
> > composite traits with a top-down optimizer. I wonder whether someone from
> > the devlist already solved similar issues in Apache Calcite or other
> > optimizers. If so, what was the approach or best practices? Intuitively, it
> > seems that the RelMultipleTrait/RelCompositeTrait approach might be the way
> > to go. But why do we replace the original composite trait set with the
> > default value in the RelTraitSet.simplify routine?
> >
> > Regards,
> > Vladimir.
> >
>


Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Haisheng Yuan
Hi Vladimir,

Glad to see you raised the question.

Here is the advice:
Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally flawed
and has many bugs. It can't work properly, whether top-down or bottom-up.

Instead, we need to add an equivalent-keys bitmap as a property of physical
traits like RelCollation and RelDistribution.

For example:
class RelDistributionImpl {
  // list of distribution keys
  private ImmutableIntList keys;

  // list of equivalent bitsets, one for each distribution key
  private ImmutableList<ImmutableBitSet> equivBitSets;
}

In trait satisfaction and column remapping, we also need to take equivalent
keys into consideration. Some of the work needs to be done in the Calcite core
framework.
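As a sketch of how equivalent keys could enter trait satisfaction, consider the following self-contained toy. It uses plain Java collections instead of Calcite's ImmutableIntList/ImmutableBitSet, and the class HashDist and its satisfies method are hypothetical, not Calcite code:

```java
import java.util.*;

// Hypothetical sketch: a hash distribution whose keys carry
// equivalent-column sets, in the spirit of the proposal above.
class HashDist {
    final List<Integer> keys;           // distribution keys (column indices)
    final List<Set<Integer>> equivSets; // equivalent columns for each key

    HashDist(List<Integer> keys, List<Set<Integer>> equivSets) {
        this.keys = keys;
        this.equivSets = equivSets;
    }

    // This distribution satisfies a required key list if, position by
    // position, the required key is either the same column or a known
    // equivalent of it.
    boolean satisfies(List<Integer> requiredKeys) {
        if (requiredKeys.size() != keys.size()) {
            return false;
        }
        for (int i = 0; i < keys.size(); i++) {
            int req = requiredKeys.get(i);
            if (req != keys.get(i) && !equivSets.get(i).contains(req)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Join on column 0 = column 2: distributed by 0, with 2 equivalent.
        HashDist d = new HashDist(List.of(0), List.of(Set.of(0, 2)));
        System.out.println(d.satisfies(List.of(0))); // true: exact key
        System.out.println(d.satisfies(List.of(2))); // true: equivalent key
        System.out.println(d.satisfies(List.of(1))); // false
    }
}
```

With this shape, a single trait value answers requests for either [a1] or [b1] without registering multiple trait alternatives in the MEMO.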

The Greenplum Orca optimizer has a similar strategy:
https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44

Regards,
Haisheng Yuan

On 2021/05/25 15:37:32, Vladimir Ozerov  wrote: 
> Hi,
> 
> Consider the distributed SQL engine that uses a distribution property to
> model exchanges. Consider the following physical tree. To do the
> distributed join, we co-locate tuples using the equijoin key. Now the Join
> operator has two equivalent distributions - [a1] and [b1]. It is critical
> to expose both distributions so that the top Aggregate can take advantage
> of the co-location.
> 
> Aggregate[group=b1]
>   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> Input[a]   // SHARDED[a1]
> Input[b]   // SHARDED[b1]
> 
> A similar example for the Project:
> Aggregate[group=$1]
>   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> Input // SHARDED[a]
> 
> The question is how to model this situation properly?
> 
> First, it seems that RelMultipleTrait and RelCompositeTrait were designed
> to handle this situation. However, I couldn't make them work with the
> top-down optimizer. The reason is that when we register a RelNode with a
> composite trait in MEMO, VolcanoPlanner flattens the composite trait into
> the default trait value in RelSet.add -> RelTraitSet.simplify. That is, the
> trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
> original traits could not be derived in the PhysicalNode.derive methods.
> 
> Second, we may try to model multiple sharding keys in a single trait. But
> this complicates the implementation of PhysicalNode.passThrough/derive
> significantly.
> SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
> 
> Third, we may expose multiple traits using metadata. RelMdDistribution
> would not work, because it exposes only a single distribution. But a custom
> handler may potentially fix that. However, it would still not be integrated
> with the top-down optimizer, which makes the idea questionable.
> 
> To summarize, it seems that currently there is no easy way to handle
> composite traits with a top-down optimizer. I wonder whether someone from
> the devlist already solved similar issues in Apache Calcite or other
> optimizers. If so, what was the approach or best practices? Intuitively, it
> seems that RelMultipleTrait/RelCompositeTrait approach might be the way to
> go. But why do we replace the original composite trait set with the default
> value in the RelTraitSet.simplify routine?
> 
> Regards,
> Vladimir.
> 


Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Ozerov
Hi Vladimir,

Thank you for the link. It is very relevant to my problem. I see that in
these discussions, there were several ideas and claims, such as that (1) we
can get rid of "simplify" altogether, (2) composite traits are rare in
practice, (3) composite traits are not designed well in the first place
[1]. I do not have the full picture in my head, so I'll try to share some
thoughts to advance the discussion.

Regarding (1), I think that the removal of "simplify" may help with my
particular (and pretty simple) test but might lead to some unpredictable
results for more complicated queries. Suppose that we generate two
equivalent nodes with different traits: [a] and [a][b]. Depending on the
nature of the trait def, these two nodes might or might not belong to the same
subset. For example, [a] and [a][b] are different subsets for
RelCollation. At the same time, [a] and [a][b] could belong to the same
subset for some distributions. That is, if the input is hash-distributed by
either [a] or [b], it might imply that a==b for every tuple (otherwise,
hashes will not match), and therefore every RelNode in the RelSet that is
sharded by [a] is also sharded by [b], and vice versa. The idea is similar to
transitive predicates. So ideally, we should let the RelTraitDef define how
to compare composite traits with other traits. Otherwise, we may lose some
optimization opportunities.
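The transitive-predicate analogy above can be sketched with a small union-find over column indices. The class and method names here are hypothetical illustrations, not Calcite or Orca code:

```java
import java.util.*;

// Hypothetical sketch: canonicalizing distribution keys through
// join-derived column equivalences, like transitive predicates.
class ColumnEquiv {
    final int[] parent;

    ColumnEquiv(int columns) {
        parent = new int[columns];
        for (int i = 0; i < columns; i++) parent[i] = i;
    }

    // Find the canonical representative of a column, with path compression.
    int find(int c) {
        return parent[c] == c ? c : (parent[c] = find(parent[c]));
    }

    // Record an equijoin condition col1 = col2.
    void union(int c1, int c2) {
        parent[find(c1)] = find(c2);
    }

    // Two single-key hash distributions land in the same subset when their
    // keys fall into the same equivalence class.
    boolean sameSubset(int key1, int key2) {
        return find(key1) == find(key2);
    }

    public static void main(String[] args) {
        ColumnEquiv eq = new ColumnEquiv(4);
        eq.union(0, 2);                          // join condition: col0 = col2
        System.out.println(eq.sameSubset(0, 2)); // true: SHARDED[0] == SHARDED[2]
        System.out.println(eq.sameSubset(0, 1)); // false
    }
}
```

A RelTraitDef that owned such an equivalence structure could decide subset membership per trait kind: collations would still compare lexicographically, while hash distributions would compare by canonical class.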

Regarding (2), perhaps the multi-collation nodes are really rare in
practice. But nodes with multiple hash distributions are widespread for
distributed engines. Because in distributed systems, the collocated hash
equijoin is the most common way of joining two inputs, and such a join always
produces an additional distribution.

Regarding (3), it would be very interesting to hear suggestions and ideas
on the proper design of composite traits. The composite-trait mechanics
mentioned in RelSubset's Javadoc are not a good design choice for
distribution traits. That is, if we have a node that is distributed by
[a][b], we cannot just put it into two subsets [a] and [b], because
operator parents may require both [a] and [b], otherwise unnecessary
exchanges could appear. That is, [a][b] should be propagated together. For
example, the removal of SHARDED[a1] from #1 would add an exchange between
#2 and #1, and the removal of SHARDED[b1] from #1 would add an exchange
between #3 and #2. Neither is optimal.
3: Aggregate[group=b1]
2:   Join[a.a1=c.c1]   // SHARDED[a1], SHARDED[b1], SHARDED[c1]
1: Join[a.a1=b.b1] // SHARDED[a1], SHARDED[b1]

@Haisheng Yuan, following your comment [1], would
you mind providing your ideas around the proper design of composite traits?
Are composite traits implemented in Orca?

Regards,
Vladimir.

[1]
https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=17081984&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17081984

Tue, 25 May 2021 at 19:32, Vladimir Sitnikov :

> >VolcanoPlanner flattens the composite trait into
> the default trait value in RelSet.add -> RelTraitSet.simplify
>
> Vladimir, have you tried removing that RelTraitSet.simplify?
>
> I remember I have run into that multiple times already, and I suggested
> removing that "simplify".
> For example,
>
> https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750377
>
>
> Vladimir
>


Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Sitnikov
>VolcanoPlanner flattens the composite trait into
>the default trait value in RelSet.add -> RelTraitSet.simplify

Vladimir, have you tried removing that RelTraitSet.simplify?

I remember I have run into that multiple times already, and I suggested
removing that "simplify".
For example,
https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750377


Vladimir


Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Ozerov
Hi,

Consider the distributed SQL engine that uses a distribution property to
model exchanges. Consider the following physical tree. To do the
distributed join, we co-locate tuples using the equijoin key. Now the Join
operator has two equivalent distributions - [a1] and [b1]. It is critical
to expose both distributions so that the top Aggregate can take advantage
of the co-location.

Aggregate[group=b1]
  DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
Input[a]   // SHARDED[a1]
Input[b]   // SHARDED[b1]

A similar example for the Project:
Aggregate[group=$1]
  Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
Input // SHARDED[a]

The question is how to model this situation properly?

First, it seems that RelMultipleTrait and RelCompositeTrait were designed
to handle this situation. However, I couldn't make them work with the
top-down optimizer. The reason is that when we register a RelNode with a
composite trait in MEMO, VolcanoPlanner flattens the composite trait into
the default trait value in RelSet.add -> RelTraitSet.simplify. That is, the
trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
original traits could not be derived in the PhysicalNode.derive methods.

Second, we may try to model multiple sharding keys in a single trait. But
this complicates the implementation of PhysicalNode.passThrough/derive
significantly.
SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
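A minimal sketch of what this single-trait encoding looks like, assuming plain Java lists and a hypothetical MultiKeyDist class (the real complexity the text refers to shows up in passThrough/derive, not in the satisfaction check itself):

```java
import java.util.*;

// Hypothetical sketch of "approach 2": one trait value carrying several
// alternative key lists, e.g. SHARDED[[a1, a2], [b1, b2]].
class MultiKeyDist {
    final List<List<Integer>> alternatives; // each entry is one full key list

    MultiKeyDist(List<List<Integer>> alternatives) {
        this.alternatives = alternatives;
    }

    // Satisfied if any alternative matches the required keys exactly.
    boolean satisfies(List<Integer> requiredKeys) {
        return alternatives.stream().anyMatch(requiredKeys::equals);
    }

    public static void main(String[] args) {
        // SHARDED[[0, 1], [2, 3]]: columns 0,1 are [a1, a2]; 2,3 are [b1, b2].
        MultiKeyDist d = new MultiKeyDist(List.of(List.of(0, 1), List.of(2, 3)));
        System.out.println(d.satisfies(List.of(2, 3))); // true: matches [b1, b2]
        System.out.println(d.satisfies(List.of(0, 3))); // false: mixes alternatives
    }
}
```

Every passThrough/derive implementation must then remap each nested list by hand, which is the bookkeeping burden described above.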

Third, we may expose multiple traits using metadata. RelMdDistribution
would not work, because it exposes only a single distribution. But a custom
handler may potentially fix that. However, it would still not be integrated
with the top-down optimizer, which makes the idea questionable.

To summarize, it seems that currently there is no easy way to handle
composite traits with a top-down optimizer. I wonder whether someone from
the devlist already solved similar issues in Apache Calcite or other
optimizers. If so, what was the approach or best practices? Intuitively, it
seems that RelMultipleTrait/RelCompositeTrait approach might be the way to
go. But why do we replace the original composite trait set with the default
value in the RelTraitSet.simplify routine?

Regards,
Vladimir.


[jira] [Created] (CALCITE-4621) SemiJoinRule throws AssertionError on ANTI join

2021-05-25 Thread Ruben Q L (Jira)
Ruben Q L created CALCITE-4621:
--

 Summary: SemiJoinRule throws AssertionError on ANTI join
 Key: CALCITE-4621
 URL: https://issues.apache.org/jira/browse/CALCITE-4621
 Project: Calcite
  Issue Type: Bug
  Components: core
Affects Versions: 1.26.0
Reporter: Ruben Q L
Assignee: Ruben Q L
 Fix For: 1.27.0


When SemiJoinRule (both {{CoreRules.JOIN_TO_SEMI_JOIN}} and 
{{CoreRules.PROJECT_TO_SEMI_JOIN}}) matches an ANTI join, it fails with the 
following error:
{noformat}
java.lang.AssertionError: ANTI
  at org.apache.calcite.rel.rules.SemiJoinRule.perform(SemiJoinRule.java:122)
  ...
{noformat}

The problem is that the rule config only forbids RIGHT and FULL joins, and lets
ANTI go through. Later, when the rule is actually executed, joins of type ANTI
cannot be handled, hence the {{AssertionError}}.
The rule config must be adapted to also forbid ANTI joins.
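A sketch of the intended guard. JoinRelType here is a local stand-in mirroring Calcite's enum, and the predicate shape is illustrative rather than the actual rule-config code:

```java
import java.util.EnumSet;

// Hypothetical sketch of the config-level guard for SemiJoinRule.
class SemiJoinGuard {
    // Local stand-in for org.apache.calcite.rel.core.JoinRelType.
    enum JoinRelType { INNER, LEFT, RIGHT, FULL, SEMI, ANTI }

    // Join types the rule cannot transform; ANTI is the newly added entry.
    static final EnumSet<JoinRelType> UNSUPPORTED =
        EnumSet.of(JoinRelType.RIGHT, JoinRelType.FULL, JoinRelType.ANTI);

    // The rule should only match join types it can actually handle, so the
    // AssertionError is never reached at execution time.
    static boolean matches(JoinRelType type) {
        return !UNSUPPORTED.contains(type);
    }

    public static void main(String[] args) {
        System.out.println(matches(JoinRelType.INNER)); // true
        System.out.println(matches(JoinRelType.ANTI));  // false: now rejected up front
    }
}
```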



--
This message was sent by Atlassian Jira
(v8.3.4#803005)