Re: Subquery de-correlation

2016-09-28 Thread Vineet Garg
Never mind, I figured it out by looking at the Calcite tests :)






Re: Subquery de-correlation

2016-09-22 Thread Vineet Garg
Hi Julian,

Thank you for your response. I have a few follow-up questions:

> Yes. Remember it should return only the correlating variables it sets, not 
> those it inherits.
What do you mean by “inherits”? Could you kindly provide an example to elaborate?

> No it shouldn’t necessarily. The id must be unique within the whole query.
If the id is unique, how is a correlated variable in the inner query bound to 
the outer query? I.e., how would Calcite figure out which variable in the outer 
query a particular correlated variable refers to?

Vineet




Re: Subquery de-correlation

2016-09-22 Thread Julian Hyde
Vineet,

Thanks for your message. See my responses inline.

> On Sep 21, 2016, at 5:11 PM, Vineet Garg wrote:
> 
> Hello Julian/Calcite community,
> 
> I am working on adding subquery support in Hive using Calcite. From what I 
> have read and understood so far, Calcite requires Hive to create a 
> RexSubqueryNode corresponding to a subquery and then call SubQueryRemoveRule 
> to get rid of the RexSubqueryNode and change it to a join. This seems to be 
> working for uncorrelated queries, where SubQueryRemoveRule creates 
> Aggregate + Join to get rid of the RexSubqueryNode. But I am running into 
> the following issues with correlated queries (note that I am using the 
> FILTER rule):
> 
>   *   Looking at the SubQueryRemoveRule code, it should create a Correlate 
> node if it finds any correlation in the given filter. To find out whether 
> the given filter has correlation, getVariablesSet is called on the filter, 
> which supposedly should return the set of correlated variables, but it 
> always returns an empty set because Filter does not implement this method. 
> Shouldn’t Filter implement this method to return the appropriate correlated 
> variables?

Yes. Remember it should return only the correlating variables it sets, not 
those it inherits.
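
One way to read "sets" versus "inherits" is sketched below. This is a toy model, not Calcite code: `ToyNode`, `ids` and `inherited` are hypothetical names, and the three nested filters (dept, emp, bonus) are an assumed example. A node "sets" an id when it supplies the row that subqueries beneath it refer to; it "inherits" an id when the id is referenced in its subtree but set by some ancestor.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy plan node for illustration only; this is NOT Calcite's RelNode. */
final class ToyNode {
  final String name;
  final Set<String> variablesSet;  // ids this node binds for the subqueries below it
  final Set<String> variablesUsed; // ids this node's own expressions reference
  final List<ToyNode> inputs;

  ToyNode(String name, Set<String> set, Set<String> used, ToyNode... inputs) {
    this.name = name;
    this.variablesSet = set;
    this.variablesUsed = used;
    this.inputs = Arrays.asList(inputs);
  }

  /** Convenience helper to build an ordered set of correlation ids. */
  static Set<String> ids(String... names) {
    return new LinkedHashSet<>(Arrays.asList(names));
  }

  /** Ids referenced in this subtree that no node inside the subtree sets. */
  Set<String> inherited() {
    Set<String> free = new LinkedHashSet<>(variablesUsed);
    for (ToyNode input : inputs) {
      free.addAll(input.inherited());
    }
    free.removeAll(variablesSet);
    return free;
  }

  public static void main(String[] args) {
    // The dept filter sets $cor0; the emp filter uses $cor0 and sets $cor1;
    // the bonus filter uses both ids and sets nothing.
    ToyNode bonus = new ToyNode("FilterBonus", ids(), ids("$cor0", "$cor1"));
    ToyNode emp = new ToyNode("FilterEmp", ids("$cor1"), ids("$cor0"), bonus);
    ToyNode dept = new ToyNode("FilterDept", ids("$cor0"), ids(), emp);
    for (ToyNode n : Arrays.asList(dept, emp, bonus)) {
      System.out.println(n.name + " sets " + n.variablesSet
          + ", inherits " + n.inherited());
    }
  }
}
```

In this model the middle filter reports only `$cor1` as set, even though `$cor0` also appears in its subtree, which matches the advice that a node should not report the variables it merely inherits.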

>   *   Comments in SubQueryRemoveRule mention that “The correlate can be 
> removed using RelDecorrelator”. But I don’t see SubQueryRemoveRule using 
> RelDecorrelator to de-correlate the given query. Should SubQueryRemoveRule 
> call it? If not, is doing de-correlation immediately after 
> SubQueryRemoveRule appropriate?

I would tend to invoke RelDecorrelator on the whole tree. But I see no reason 
in principle why it can’t be called on a section of the tree, as long as that 
section is self-contained (i.e. no unbound correlating variables).
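
The "self-contained" condition can be pictured as a scope check over the section. The sketch below is a toy model under stated assumptions, not Calcite's implementation: `PlanSection`, `binds`, `references` and `isSelfContained` are hypothetical names. It walks the section top-down, carrying the ids bound so far, and fails as soon as an expression mentions an id that nothing in the section binds.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy plan node for illustration only; this is NOT a Calcite class. */
final class PlanSection {
  final Set<String> binds;      // correlation ids this node sets
  final Set<String> references; // correlation ids its expressions mention
  final List<PlanSection> inputs;

  PlanSection(Set<String> binds, Set<String> references, PlanSection... inputs) {
    this.binds = binds;
    this.references = references;
    this.inputs = Arrays.asList(inputs);
  }

  /** Convenience helper to build a set of correlation ids. */
  static Set<String> ids(String... names) {
    return new HashSet<>(Arrays.asList(names));
  }

  /** True if every referenced id is bound by a node inside the section. */
  static boolean isSelfContained(PlanSection root) {
    return check(root, new HashSet<String>());
  }

  private static boolean check(PlanSection node, Set<String> inScope) {
    // Ids bound by this node become visible to it and to everything below it.
    Set<String> scope = new HashSet<>(inScope);
    scope.addAll(node.binds);
    if (!scope.containsAll(node.references)) {
      return false; // an unbound correlating variable
    }
    for (PlanSection input : node.inputs) {
      if (!check(input, scope)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    PlanSection inner = new PlanSection(ids(), ids("$cor0"));
    PlanSection outer = new PlanSection(ids("$cor0"), ids(), inner);
    System.out.println("whole tree: " + isSelfContained(outer));
    System.out.println("inner only: " + isSelfContained(inner));
  }
}
```

Here the whole tree passes the check, while the inner section on its own does not, because `$cor0` is bound outside it.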

> Here is what I have done so far for correlated queries. Could you please 
> comment on whether this is right?
> 
>   *   While creating the RexSubqueryNode and the RelNode for the subquery, I 
> am creating a RexCorrelVariable. RexCorrelVariable needs a correlation id, 
> and CorrelationId requires an integer id. Should this id be the same as the 
> index of the correlated column in the outer table?

No, it shouldn’t necessarily. The id must be unique within the whole query.
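
A minimal sketch of why a unique id is enough to bind an inner reference to the outer query, assuming the id names the outer row as a whole and the column is chosen separately by a field access on that variable (in Calcite this role appears to be played by `RexFieldAccess`). `CorrelScope`, `bind` and `resolve` are hypothetical names for illustration, not Calcite API, and the column lists are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy registry of correlation ids; names are illustrative, not Calcite's. */
final class CorrelScope {
  private int nextId = 0;
  private final Map<String, List<String>> fieldsById = new HashMap<>();

  /** Allocates a query-wide unique id standing for one row of an outer relation. */
  String bind(List<String> outerFields) {
    String id = "$cor" + nextId++;
    fieldsById.put(id, new ArrayList<>(outerFields));
    return id;
  }

  /** Resolves "id.field": the id picks the outer row, the index picks the column. */
  String resolve(String id, int fieldIndex) {
    List<String> fields = fieldsById.get(id);
    if (fields == null) {
      throw new IllegalArgumentException("unbound correlation variable " + id);
    }
    return fields.get(fieldIndex);
  }

  public static void main(String[] args) {
    CorrelScope scope = new CorrelScope();
    String dept = scope.bind(Arrays.asList("DEPTNO", "DNAME"));
    String emp = scope.bind(Arrays.asList("EMPNO", "ENAME", "DEPTNO"));
    // Both references use column index 0, yet they stay unambiguous
    // because each goes through a different id.
    System.out.println(dept + ".0 -> " + scope.resolve(dept, 0));
    System.out.println(emp + ".0 -> " + scope.resolve(emp, 0));
  }
}
```

The point of the design is that the id is independent of any column index: two outer relations can both expose a column at index 0, and the id alone says which relation is meant.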

>   *   Hive has a HiveFilter, which extends Filter. I implemented the 
> getVariablesSet method to look at the condition and return all correlated 
> variables in the condition’s RelNode. Does this sound correct?

Yes, sounds right.

>   *   I am calling RelDecorrelator’s decorrelateQuery immediately after 
> calling SubQueryRemoveRule. After implementing getVariablesSet in 
> HiveFilter, SubQueryRemoveRule seems to create the appropriate 
> LogicalCorrelate for correlated queries, but decorrelateQuery is throwing 
> an exception.

I can’t help too much if you are getting errors in Hive-land. This stuff is so 
complicated that I strongly suggest unit tests. Don’t do anything “new” in 
Hive; make sure that it all works on Calcite logical nodes. Write tests in 
RelOptRulesTest.

Julian



Subquery de-correlation

2016-09-21 Thread Vineet Garg
Hello Julian/Calcite community,

I am working on adding subquery support in Hive using Calcite. From what I 
have read and understood so far, Calcite requires Hive to create a 
RexSubqueryNode corresponding to a subquery and then call SubQueryRemoveRule 
to get rid of the RexSubqueryNode and change it to a join. This seems to be 
working for uncorrelated queries, where SubQueryRemoveRule creates Aggregate + 
Join to get rid of the RexSubqueryNode. But I am running into the following 
issues with correlated queries (note that I am using the FILTER rule):

  *   Looking at the SubQueryRemoveRule code, it should create a Correlate node 
if it finds any correlation in the given filter. To find out whether the given 
filter has correlation, getVariablesSet is called on the filter, which 
supposedly should return the set of correlated variables, but it always 
returns an empty set because Filter does not implement this method. Shouldn’t 
Filter implement this method to return the appropriate correlated variables?
  *   Comments in SubQueryRemoveRule mention that “The correlate can be 
removed using RelDecorrelator”. But I don’t see SubQueryRemoveRule using 
RelDecorrelator to de-correlate the given query. Should SubQueryRemoveRule call 
it? If not, is doing de-correlation immediately after SubQueryRemoveRule 
appropriate?

Here is what I have done so far for correlated queries. Could you please 
comment on whether this is right?

  *   While creating the RexSubqueryNode and the RelNode for the subquery, I am 
creating a RexCorrelVariable. RexCorrelVariable needs a correlation id, and 
CorrelationId requires an integer id. Should this id be the same as the index 
of the correlated column in the outer table?
  *   Hive has a HiveFilter, which extends Filter. I implemented the 
getVariablesSet method to look at the condition and return all correlated 
variables in the condition’s RelNode. Does this sound correct?
  *   I am calling RelDecorrelator’s decorrelateQuery immediately after calling 
SubQueryRemoveRule. After implementing getVariablesSet in HiveFilter, 
SubQueryRemoveRule seems to create the appropriate LogicalCorrelate for 
correlated queries, but decorrelateQuery is throwing an exception.

Thanks,
Vineet G