[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-07-09 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154893#comment-17154893
 ] 

Julian Hyde commented on CALCITE-3786:
--

I was mistaken. It actually affected multiple {{RelNode}} methods: 
{{isDistinct()}}, {{isKey(ImmutableBitSet)}}, {{getQuery()}}, {{getRows()}}, 
{{getVariablesStopped()}}, {{computeSelfCost()}}, {{isValid(boolean)}}, 
{{getCollationList()}}, {{getChildExps()}}.

I'll change all of these methods to 'deprecated - to be removed before 1.25'.



> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
> Attachments: image-2020-06-23-12-47-25-599.png, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png, 
> screenshot-6.png
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-07-08 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154135#comment-17154135
 ] 

Danny Chen commented on CALCITE-3786:
-

I'm afraid this is a mistake, i meant to change them all to be deprecated in 
1.25, the {{RelNode.isDistinct()}} is missed.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
> Attachments: image-2020-06-23-12-47-25-599.png, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png, 
> screenshot-6.png
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-07-08 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154096#comment-17154096
 ] 

Julian Hyde commented on CALCITE-3786:
--

[~danny0405], In this change, you marked {{AbstractRelNode.isDistinct()}} 
'deprecated to be removed before 1.25', whereas {{RelNode.isDistinct()}} was 
and still is marked 'deprecated to be removed before 2.0'. We can't remove one 
without the other. Which release should it be?

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
> Attachments: image-2020-06-23-12-47-25-599.png, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png, 
> screenshot-6.png
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-23 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143173#comment-17143173
 ] 

Haisheng Yuan commented on CALCITE-3786:


OK, since you desire it so much, let me show you some graphs. The benchmark 
test hardly tell anything useful and match the real use case, let's run tpch 
query to see how it goes. Let's use {{TpchTest.testQuery05}}, which is slow 
enough to sample enough data for these methods.

1. Master before your change 
([978bb7|https://github.com/apache/calcite/commit/978bb7ea44969351468d1b5e240e8f57af7e5770])
 !screenshot-1.png|width=895,height=299! 
 !screenshot-2.png|width=850,height=364!

2. Your patch with part 2 
([b18470|https://github.com/apache/calcite/commit/b18470797eaec8c4e64382c0b742ae83f758275f])
 !screenshot-3.png|width=869,height=456! 
 !screenshot-4.png|width=851,height=407!

3. My patch committed to master that is 
[reverted|https://github.com/apache/calcite/commit/a551d4bfa3b97c42e74fe4e30ca66ba493719c96]
 by you 
([b00c1f|https://github.com/apache/calcite/commit/b00c1fd6e26b5e3cc1a810c9f862b184649cd1a6])
 !screenshot-5.png|width=861,height=299!

4. My patch that make digestEquals and digestHash private per your request 
([173624|https://github.com/apache/calcite/commit/1736242612634359164be483334884cb9113f804])
 !image-2020-06-23-12-47-25-599.png|width=869,height=307!
 Note that recomputeDigest in my patches costs nothing, it just clears the hash 
to 0.

It turned out my initial patch that got +1 from 2 committers, yet you said 
which is a mess, far from +1, then reverted, yields the best performance and 
use the least memory. Even my last patch with private digestEquals after your 
review, still perform better than yours and master with string digest.

Now do you think it is worth allowing digestEquals and digestHash to be 
protected instead of private, so that if downstream projects think the default 
digestEquals and digestHash are performance bottleneck (it doesn't seem to be 
for this TPCH query), they can override it like 
[this|https://github.com/apache/calcite/commit/b00c1fd6e26b5e3cc1a810c9f862b184649cd1a6#diff-16c976c118188d143e34105a8c79f235],
 or they can choose to override the default behavior in case the downstream 
data type doesn't have fields in CALCITE-3793 that the default methods can't 
handle appropriately?

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
> Attachments: image-2020-06-23-12-47-25-599.png, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png, 
> screenshot-6.png
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-22 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142508#comment-17142508
 ] 

Danny Chen commented on CALCITE-3786:
-

> Wait your fix to be released with 1.24

I have fixed it in CALCITE-3793 which is in release 1.23.

> I just move your code from one place to another place

I don't care where the code was moved, i just want to know if the new patch is 
better(collect the items each time #equals instead of cache the object 
references).

> RexCall toString() still stores string digest

The RexCall string may still be problematic, but the new patch still does not 
solve that, so let's put it aside and not struggle on it.

> Do you really want to do benchmark test competition? 

I did, and this is *a must*. Especially we have spent so much time on this 
issue and you modify based on my code, i need to see if the modification 
deserves that.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-22 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142199#comment-17142199
 ] 

Haisheng Yuan commented on CALCITE-3786:


{quote}My example code just illustrates that row type for downstream may not 
have fields, but they must also obey the rule that the type should equal(in the 
version before 1.22, we compare with just row type, but later we switch to 
field list which may cause problem, the code fix that).
{quote}
So the only thing that downstream can do is wait. Wait your fix to be released 
with 1.24, because they have no choice to override the default behavior in 
their own operator, do they?
 Anyway, the digestEquals and digestHash has been changed to private completely 
per your request, should not be a problem that concerns you.
{quote}you have to cache the hashcode for RexNodes, it sacrifice performance 
for memory.
{quote}
I can change it to only cache hashcode to RexCall if you think it is wasting 
memory. Do you know that RexNode can be shared between operators, especially 
when logical operator is transformed to physical operator. Do you really want 
to compute hashCode for each RexNode again even they are shared?
{quote}i don't think #getDigest should be a public API, the digest is a planner 
internal abstraction
{quote}
I 100% agree with you. I don't think it should be a public API either. But it 
is already a public API, no matter you like it or not, since the first day.

Dude, don't be pissed off. You have reviewed my code multiple times, helped 
improve my PR and remove bunch of ugly codes and unnecessary subclasses. It is 
also a part of your work, isn't it? Look, I didn't revert your code at all, I 
just move your code from one place to another place, e.g. from Digest to 
AbstractRelNode, don't you agree? It is just a tiny refactoring on top of your 
contribution. Some of the change is just like I collect the money that you have 
left on the table. For example:
 - The RexCall toString() still stores string digest. When the toString is 
called, we still can't avoid OOM because of the digest there.
 - The Janino metadata code generator still calls RexNode#toString, it will use 
the large RexCall string as the key, until the metadata is cleared.

Do you really want to do benchmark test competition? Especially for memory 
consumption? It is not fair to you. You have done tons of work, made the money, 
but forgot to take the money. I am just helping you take the money that is left 
on the table by you.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-22 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142172#comment-17142172
 ] 

Chunwei Lei commented on CALCITE-3786:
--

> Your patch collects the digest items every time we do #equals, and you have 
>to cache the hashcode for RexNodes, it sacrifice performance for memory, so i 
>would be hard to say it is better, we need a benchmark to prove that. Also 
>note that the #hashCode and #equals have different items, that means 2 equals 
>Digest can have different hashCode which is error-prone.

You have a point.

> BTW, i don't think #getDigest should be a public API, the digest is a planner 
>internal abstraction, the only interface we should impose is #explainTerms, so 
>i don't think there is compatibility problem for master.

I don't think so. We should be more serious about the public API just like 
others. Because we don't know how much the downstream projects depend on it.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-22 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142132#comment-17142132
 ] 

Chunwei Lei commented on CALCITE-3786:
--

IMHO, the latest change proposed by [~hyuan] is considerable and reasonable. Is 
it possible to provide a benchmark to show the improvement?

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-22 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142117#comment-17142117
 ] 

Danny Chen commented on CALCITE-3786:
-

> each operator has to getFieldList from rowtype and compare, but some 
> operators don't want to do so, but they have no control

I would say that the row type equivalence is *a must* for current 
transformation, you can not recognize that 2 nodes with different fields number 
or types are equivalent. The planner also have a strict check inside actually. 
Thus make the row type comparison transient to downstream project nodes reduce 
the redundant work. My example code just illustrates that row type for 
downstream may not have fields, but they must also obey the rule that the type 
should equal(in the version before 1.22, we compare with just row type, but 
later we switch to field list which may cause problem, the code fix that).

> Do you think it is better to leave some space for operators to define their 
> own logic?

I did, that is why we have different SqlExplainLevel for customization of 
different cases, for 90% cases the digest items and toString items are almost 
the same, the Enum keeps the customization and also saves the redundant 
interfaces. As for other things, i.e. row type. hints, traits, i don't think we 
should/need to open customization for them, they are common and required.

Your patch collects the digest items every time we do #equals, and you have to 
cache the hashcode for RexNodes, it sacrifice performance for memory, so i 
would be hard to say it is better, we need a benchmark to prove that. 

BTW, i don't think #getDigest should be a public API, the digest is a planner 
internal abstraction, the only interface we should impose is #explainTerms, so 
i don't think there is compatibility problem for master.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-22 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142038#comment-17142038
 ] 

Haisheng Yuan commented on CALCITE-3786:


[~danny0405] Thank you for replying and reviewing. You are correct that using 
DIGEST_ATTRIBUTE can avoid plan diffs. However, do you think it is better that 
each operator can control how they define their equivalence. Like I said in 
previous example, each operator has to getFieldList from rowtype and compare, 
but some operators don't want to do so, but they have no control... You already 
have an example in [your 
code|https://github.com/apache/calcite/commit/69f25863f5f4197c17927a39a82cbf1cffd12b80#diff-cb90b503c85180c9be40a4984f8a1a54R128].
 We can't know all the different requirements before-hands, right? Do you think 
it is better to leave some space for operators to define their own logic?

Anyway, to ease your concern, now digestEquals and digestHash are private, and 
there is only 1 RelDigest subclass. There are so many inner class in Calcite, 
we can't remove all the inner classes, right? And we'd better keep backward 
compatibility of {{getDigest}} if we can keep it easily, do you agree?

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-22 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141764#comment-17141764
 ] 

Danny Chen commented on CALCITE-3786:
-

> You haven't clarify how can I customize it in current master without 
> introducing plan changes

I did, but you seems to have some confusion, maybe you should check how the 
customization works before my patch, you would find the answer, or you may take 
[1] as an example.

[1] 
https://github.com/apache/calcite/blob/797b48754d074d53e853a67bf249c155f51b71dc/core/src/main/java/org/apache/calcite/rel/core/Project.java#L262

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-21 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141718#comment-17141718
 ] 

Haisheng Yuan commented on CALCITE-3786:


{quote}
Make the Digest an inner class
{quote}
Is anything wrong with inner classes? It is everywhere in Calcite. If it is 
evil, should we request Oracle to remove Java inner classes instead?

{quote}
Add #getRelDigest #digestEquals #digestHash (plus the original #getDigest 
#explainTerms)
{quote}
getDigest (which returns a String) is a public API, do you really want to break 
it even you can choose not to break it? If you think backward compatibility is 
not important, I am more than happy to remove getRelDigest, just using 
getDigest instead.
#digestEquals #digestHash can be used to customize the rel equivalence 
comparison, decouple it from explainTerms. Do you think no downstream need it? 
You haven't clarify how can I customize it in current master without 
introducing plan changes

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-21 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141682#comment-17141682
 ] 

Danny Chen commented on CALCITE-3786:
-

I don't think that reduce the redundant object references can significantly 
reduce memory consumption. In order to remove the object references, you have 
to:

- Make the Digest an inner class
- Add #getRelDigest #digestEquals #digestHash (plus the original #getDigest 
#explainTerms)

I don't think these interfaces are necessary and have clear semantics, in order 
to reduce memory usage which is not a bottleneck.

Expose a single interface #explainTerms with different SqlExplainLevel is well 
to use and compatible with the old behaviors.

Maybe we really have disagreement on the design, i have to say i will not 
accept it.

If possible, move to the RexNode memo, which can indeed reduce memory 
consumption significantly, and i would give a definite +1 for that.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-21 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141648#comment-17141648
 ] 

Haisheng Yuan commented on CALCITE-3786:


{quote}
I don't think having an interface there with 3 sub-classes makes any sense.
{quote}
Thank you for your feedback. I know that you will question on the 3 
sub-classes, so here is another version with only 1 sub-class. Does it make 
more sense to you?
https://github.com/apache/calcite/pull/2039/commits/897163219d981565a524121d47ec7845c971c473

Under what circumstances will you accept the change? Even when this patch can 
significantly reduce memory consumption comparing with current master branch, 
you won't accept it? 

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-21 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141639#comment-17141639
 ] 

Danny Chen commented on CALCITE-3786:
-

I don't think having an interface there with 3 sub-classes makes any sense.

I think i have represented my points as clear, so, i would not repeat it again 
and again.

And stop talking with emotions, i will choose to ignore that.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-21 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141516#comment-17141516
 ] 

Haisheng Yuan commented on CALCITE-3786:


{quote}
Just implement the #explainTerms correctly with the items you want to involve 
in.
{quote}
Really? Could you please kindly show me the code snippet? Won't it introduce 
tons of "meaningless test plan changes"?

{quote}
Use different SqlExplainLevel to distinguish just like before.
{quote}
How? By setting explain level to NO_ATTRIBUTES? Won't all the plan nodes be 
affected?

{quote}
 especially when we are talking about "semantic equivalent" not strict #equals.
{quote}
There is no "semantic equivalent" or "strict #equals", which are terms coined 
by you. If one rel instance can be used to represent the other instance, then 
they are equal. The digestEquals was introduced (which I have changed to 
private, because you said it is a mess), because we can't override #equals and 
#hashCode, which are final. If we change them directly, then it will be a large 
breaking change, when downstream projects use RelNode as HashMap key, in which 
case they have to change the HashMap to IdentityHashMap.

{quote}
 Just like Julian said, Cons is a basic data structure, it can be used in many 
codes, but digest is a planner internal thing
{quote}
Well, are you aware of that, in non-federation and non-MV project that uses 
Calcite, it may generate thousands or tens of thousands RelNodes during query 
planning, but how many ConsList will be generated? Zero. The use case is not 
uncommon, it is even dominant. Do you think we should care more about a basic 
data structure that may not be used at all, than a "planner internal thing" 
that is ubiquitous in Calcite, and being used no matter what case?

{quote}
it is not necessary from my side if we can make the code more concise.
{quote}
Do you think a 256 lines Digest.java is conciser than a 52 lines 
RelDigest.java, which is essentially a 4 line interface:
{code:java}
public interface RelDigest {
  void clear();
  RelNode getRel();
}
{code}
just like you think 9 lines of code are simpler that one line of code?
Could you please kindly point out which part of 
[PR#2039|https://github.com/apache/calcite/pull/2039] is not concise so that I 
can make it conciser?
 

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-20 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141268#comment-17141268
 ] 

Danny Chen commented on CALCITE-3786:
-

> Could you please enlighten me on how to properly customize it?

Just implement the #explainTerms correctly with the items you want to involve 
in.

> Or My operator has some additional field, that participate in the operator 
> identity, aka digest comparison, but I don't want to output it in the explain 
> terms, is there anyway I can achieve this?

Use different SqlExplainLevel to distinguish just like before.

> Object#hashCode and #equals final so that we have a bug-free world.

I think things are much different when there are 3 methods there, especially 
when we are talking about "semantic equivalent" not strict #equals. And i'm 
confused why we reference the Flink code, to prove what ?

> 4 bytes to the Cons data structure to serve his purpose

Just like Julian said, Cons is a basic data structure, it can be used in many 
codes, but digest is a planner internal thing, it is better if we can avoid 
theses duplicate object references, but it is not necessary from my side if we 
can make the code more concise.



> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-20 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141240#comment-17141240
 ] 

Haisheng Yuan commented on CALCITE-3786:


{quote}
We should be cautious for #equals #hashcode #explainTerms to have similar 
variables, it is error-prone for downstream projects to implement for their 
custom nodes
{quote}
I astonishingly found that you are correct. #equals #hashCode #toString have 
similar variables likewise, e.g. 
[StreamShardMetadata.java|https://github.com/apache/flink/blob/8674b69964eae50cad024f2c5caf92a71bf21a09/flink-connectors/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/model/StreamShardMetadata.java#L112]
 in Flink, it is exactly what you said. Should I open a pull request to Flink 
to always use "toString().hashCode" and "toString().equals" or use Digest 
instead? What is worse, 90% of Java classes have exactly same variables for 
these methods. The worst thing is, they are default Java methods for every Java 
object. Damn it! We should propose to JCP members to make Object#hashCode and 
#equals final so that we have a bug-free world.

{quote}
The additional objects references should not be a bottleneck, removing it does 
not gain much.
{quote}
After deep meditation, I finally got your reasoning. Maybe because I still live 
in an era when the RAM is 128 MB. I can't help reflecting myself, am I thinking 
wrong in CALCITE-4049? Initially I agree with Xiening and Julian's criticize 
about our buddy adding 4 bytes to the Cons data structure to serve his purpose. 
I pondered whether we are too nitpicking to him. 1 ConsList only add 40KB, 
which is also acceptable. Isn't it?

Anyway, I have addressed your kind comments, removed the 
digestEquals/digestHash from operators as you suggested, hope that can ease 
your concern. I have opened a PR for review: 
https://github.com/apache/calcite/pull/2039, which is open for discussion, 
comment and -1. Thanks.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-19 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140958#comment-17140958
 ] 

Haisheng Yuan commented on CALCITE-3786:


OK, let me put it in another way. What if my Join and Project operator want to 
customize the digest equal logic?
e.g.
In the default implementation digest equal comparison, we always ignore column 
names, only compare the field types. It is OK for my Project operator. But my 
Join and Filter operator doesn't want to do this, these 2 operators want the 
rowtype to be fully equal, even for column names. Could you please enlighten me 
on how to properly customize it?
Or My operator has some additional field, that participate in the operator 
identity, aka digest comparison, but I don't want to output it in the explain 
terms, is there anyway I can achieve this? 

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-18 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140226#comment-17140226
 ] 

Danny Chen commented on CALCITE-3786:
-

Removing digestEquals and digestHash does not really solve the concerns because 
the logic is still there, without a explicit interface, it's even harder for 
downstream projects to customize the hashCode/Digest format.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-18 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140202#comment-17140202
 ] 

Haisheng Yuan commented on CALCITE-3786:


[~danny0405] Thank you for your feedback. Definitely will respect your opinion. 
What if I remove digestEquals and digestHash from all the default base 
operators, Logical and Enumerable operators? They are completely optional, not 
required to implement.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-18 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140152#comment-17140152
 ] 

Danny Chen commented on CALCITE-3786:
-

[~vladimirsitnikov] I didn't intend to do modify the RexNode normalization at 
first, but when i fix to the RexNode#equals instead of the pure string, i found 
that *i have to* because it seems not right that 2 RexCalls do not equals but 
have same digest. I can think of 2 more benefits to switch from pure string 
normalization to nodes:
- We can add more normalizations logic more easily(put complex rexnodes 
behind), i.e. i plan to add a new component RexNormalizaiton
- We should make the configuration with more flexibility, user/downstream can 
config through FrameworkConfig

[~hyuan] Current patch already solved 2, 3 of your questions, for 1, i would 
fix if [~laurent] and i make agreements, but remove the Digest and add 2 more 
interfaces to RelNodes(#hashCode and #equals) goes too far away for these 
reasons:
- We should be cautious for #equals #hashcode #explainTerms to have similar 
variables, it is error-prone for downstream projects to implement for their 
custom nodes
- And it is hard to clarify the semantics between strict #equals1 and 
equivalent #equals1, what if i want a normal #equals1 there
- The additional objects references should not be a bottleneck, removing it 
does not gain much.
Based on that, i would definitely give a -1 for the commit f970df.

As for 4, i'm looking forward to the RexNode memo(maybe), but it has no 
relationship with this issue.


> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-18 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140010#comment-17140010
 ] 

Haisheng Yuan commented on CALCITE-3786:


I think my objection on this proposal might be mis-interpreted, maybe because I 
was not clear on myself. I am objecting it, not because it won't reduce memory 
consumption, or it will make things worse than current master. Quite the 
opposite, I believe with the proposal, memory footprint will be reduced and 
improved, which I already stated in above comments. As long as we stop using 
string digest, things will become much better, without any doubt. However, is 
this the best approach? I don't think so. So no matter what the performance 
result is, my stance on the proposal doesn't change. I am still against it. 
Because this is not the most appropriate way we would adopt.

 [~danny0405] has worked a lot on this issue, and we all appreciate the amount 
of work he has done. He is very smart and diligent, I couldn't do it as good as 
he can on this matter. Even [~laurent] and I casted -1 on it, and people 
requested Danny to revert the change, but since he refused to revert it, let's 
just keep it in master. I believe that Danny believes his change is beneficial 
for all Calcite users, all with good intention. [~danny0405], Laurent made some 
code review comments, to try to help improve your code, even some are minor 
optimization, can't reflect on the benchmark, but definitely won't make your 
code run slower. 

What if we all take a step back, suppose this JIRA is about to improve the 
memory consumption, there are several steps here. We have finished the first 
step, thanks [~danny0405] for taking the initiative to complete the most 
important 1st step. Now let's think the next step. Is there anything that we 
can further improve based the first step? Some viewpoints (or questions) that I 
share with [~laurent] are:
1. Can the field of Digest be optimized to use less memory?
2. Can operators customize their own identity?
3. Can operators customize the string representation by omitting some fields?
4. Can RexNode be further improved?

Here is the the next step draft change I have in my mind based the above 
questions and Laurent's comments. Can someone take a look? [~danny0405] Feel 
free to take from it if you think there are some improvements:
https://github.com/apache/calcite/commit/f970df50eb69389bbc81bd94e0c8f9da033e5e01

The main change is RelDigest.java, AbstractRelNode.java and RelSubset.java.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-18 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140008#comment-17140008
 ] 

Julian Hyde commented on CALCITE-3786:
--

[~danny0405] and [~vlsi] and others. For the record, I don't think that it is 
essential that we add *performance tests* to the code base. A benchmark - 
perhaps only run once, to help us make a design decision empirically - would be 
sufficient.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-18 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140006#comment-17140006
 ] 

Julian Hyde commented on CALCITE-3786:
--

[~vlsi] I am asking that each person - the contributor and the reviewers - is 
respectful of the other's time. If you had been able to say "I think I can fix 
those plan changes, but it will take considerable effort on my part. If I do 
it, will you accept it?" then it would have saved your time. Danny does not 
have to agree with you, but he must not waste your time.

The same goes for performance benchmarks. If Danny is never going to be able to 
fulfill your criteria, let's all realize that as soon as possible.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-18 Thread Vladimir Sitnikov (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139390#comment-17139390
 ] 

Vladimir Sitnikov commented on CALCITE-3786:


[~julianhyde], I agree that we should not nitpick the fine details within 
benchmarks. I agree we should not raise the bar again and again.

However, all the comments are not something advanced. They are very basic 
requests. I just ask to make the benchmark to measure the thing the patch 
touches. I don't even ask to implement multiple different benchmarks.

Performance in this PR is very important (even in the case being able to 
customize digest is important as well), so I would expect that there should be 
reasonable tests before the PR is merged.

--

The current benchmark measures the wrong thing: it uses the same object 
instances to lookup in the cache. In typical rule execution, one creates new 
(!) rels with a relbuilder, then they register the result in the planner.

---

I could expect the following benchmarks:
1) Overall execution of the test suite (e.g. capture timings from slow_tests in 
CI). Note: the current PR seems to perform early normalization of RexNodes, 
which alone might shift the response time significantly.
2) Planner#registerRel execution time. For instance: create a planner, register 
same rels there. Then the benchmark would create new (!) rels and call 
planner.register (or whatever it is named). That benchmark code would be the 
same in both Calcite versions (original and patched). The measurement values 
should be "response time", and "allocation rate".
3) Digest computation time. This might be something like "create rel and 
compute its digest". The measurement values should be "response time", and 
"allocation rate". This might be interesting for fine-tuning digset 
implementation (e.g. to eliminate excessive array allocations), however, digest 
is cached, so "digest computation" benchmark alone provides very little 
information on the impact of the fix on the overall performance.
4) Memory footprint. It is probably to be measured with Java Object Layout. For 
instance, we create a relnode instance, and weight the object graph.

The benchmark in https://github.com/apache/calcite/pull/2034 looks like #3, 
however, there are major flaws.



{quote} I think we now need to stop asking him to jump over higher and higher 
hurdles{quote}
It is sad to receive "-1 because it results in lots of irrelevant plan changes" 
when I suggest "early RexCall digest normalization". That is exactly "asking to 
jump over higher and higher hurdles". I had to invent and implement a more 
complicated approach.
It is really sad to see that the very same contributor commits the patch that 
performs that early normalization (with all the "irrelevant plan changes") with 
not a single blink.

Note: I do not complain that the PR removes "my" code. I complain that there's 
no acknowledgment that my original proposal (normalize RexCall in the 
constructor) was viable from the very beginning.

---

I have no time to work on Calcite for the following 14 days, and I have no time 
to write benchmarks or even analyze the results.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g

[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-18 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139268#comment-17139268
 ] 

Danny Chen commented on CALCITE-3786:
-

[~vladimirsitnikov] I used the GC profiler and here is the test data:


{code:xml}
Benchmark   (digestType)  (disjunctions)  
(joins)  Mode  Cnt   ScoreError   Units
DigestBenchmark.getRelOBJECT   1
1  avgt5   0.082 ±  0.010   us/op
DigestBenchmark.getRelOBJECT   1   
10  avgt5   0.380 ±  0.025   us/op
DigestBenchmark.getRelOBJECT   1   
20  avgt5   0.732 ±  0.077   us/op
DigestBenchmark.getRelOBJECT  10
1  avgt5   0.081 ±  0.010   us/op
DigestBenchmark.getRelOBJECT  10   
10  avgt5   0.364 ±  0.022   us/op
DigestBenchmark.getRelOBJECT  10   
20  avgt5   0.697 ±  0.046   us/op
DigestBenchmark.getRelOBJECT 100
1  avgt5   0.081 ±  0.008   us/op
DigestBenchmark.getRelOBJECT 100   
10  avgt5   0.359 ±  0.025   us/op
DigestBenchmark.getRelOBJECT 100   
20  avgt5   0.726 ±  0.090   us/op

DigestBenchmark.getRelSTRING   1
1  avgt5   1.269 ±   0.035   us/op
DigestBenchmark.getRelSTRING   1   
10  avgt5  10.609 ±   0.146   us/op
DigestBenchmark.getRelSTRING   1   
20  avgt5  28.708 ±   0.810   us/op
DigestBenchmark.getRelSTRING  10
1  avgt5   1.365 ±   0.073   us/op
DigestBenchmark.getRelSTRING  10   
10  avgt5  10.640 ±   0.107   us/op
DigestBenchmark.getRelSTRING  10   
20  avgt5  28.171 ±   0.612   us/op
DigestBenchmark.getRelSTRING 100
1  avgt5   1.354 ±   0.083   us/op
DigestBenchmark.getRelSTRING 100   
10  avgt5  11.583 ±   5.685   us/op
DigestBenchmark.getRelSTRING 100   
20  avgt5  27.828 ±   0.343   us/op


Benchmark   (digestType)  (disjunctions)  
(joins)  Mode  Cnt  Score Error   Units
DigestBenchmark.getRel:·gc.alloc.rate.normOBJECT   1
1  avgt5  ≈ 10⁻⁴ B/op
DigestBenchmark.getRel:·gc.alloc.rate.normOBJECT   1   
10  avgt5   0.005 ±  0.001B/op
DigestBenchmark.getRel:·gc.alloc.rate.normOBJECT   1   
20  avgt5   0.020 ±  0.002B/op
DigestBenchmark.getRel:·gc.alloc.rate.normOBJECT  10
1  avgt5   0.001 ±  0.001B/op
DigestBenchmark.getRel:·gc.alloc.rate.normOBJECT  10   
10  avgt5   0.006 ±  0.001B/op
DigestBenchmark.getRel:·gc.alloc.rate.normOBJECT  10   
20  avgt5   0.022 ±  0.001B/op
DigestBenchmark.getRel:·gc.alloc.rate.normOBJECT 100
1  avgt5   0.003 ±  0.001B/op
DigestBenchmark.getRel:·gc.alloc.rate.normOBJECT 100   
10  avgt5   0.018 ±  0.001B/op
DigestBenchmark.getRel:·gc.alloc.rate.normOBJECT 100   
20  avgt5   0.047 ±  0.006B/op

DigestBenchmark.getRel:·gc.alloc.rate.normSTRING   1
1  avgt5   1840.004 ±   0.001B/op
DigestBenchmark.getRel:·gc.alloc.rate.normSTRING   1   
10  avgt5   8568.145 ±   0.002B/op
DigestBenchmark.getRel:·gc.alloc.rate.normSTRING   1   
20  avgt5  16008.839 ±   0.022B/op
DigestBenchmark.getRel:·gc.alloc.rate.normSTRING  10
1  avgt5   1960.009 ±   0.001B/op
DigestBenchmark.getRel:·gc.alloc.rate.normSTRING  10   
10  avgt5   8568.180 ±   0.003B/op
DigestBenchmark.getRel:·gc.alloc.rate.normSTRING  10   
20  avgt5  16008.913 ±   0.018B/op
DigestBenchmark.getRel:·gc.alloc.rate.normSTRING 100
1  avgt5   1960.054 ±   0.003B/op
DigestBenchmark.getRel:·gc.alloc.rate.normSTRING 100   
10  avgt5   8568.577 ±   0.284B/op
DigestBenchmark.getRel:·gc.alloc.rate.normSTRING 100   
20  avgt5  16009.819 ±   0.024B/op
{code}


> Add Digest interface to enable efficient hashCode(equals) for RexNode and 

[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-17 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139099#comment-17139099
 ] 

Chunwei Lei commented on CALCITE-3786:
--

[~julianhyde], I don't agree with you. The owner should provide a convincing 
benchmark, not the reviewers.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-17 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138896#comment-17138896
 ] 

Julian Hyde commented on CALCITE-3786:
--

[~vlsi], Danny did as we asked, and created a benchmark. I think we now need to 
stop asking him to jump over higher and higher hurdles. If his benchmark is 
unsatisfactory, I think you should modify it and run it so that it can answer 
the questions that you have on your mind.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-17 Thread Vladimir Sitnikov (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138446#comment-17138446
 ] 

Vladimir Sitnikov commented on CALCITE-3786:


{quote} because it is the most straight-forward metric to illustrate the memory 
usage{quote}
It is not.

"getRuntime().totalMemory" has very little in common with the benchmark itself, 
and it depends on the GC heuristics.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-17 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138427#comment-17138427
 ] 

Danny Chen commented on CALCITE-3786:
-

Thanks, [~vladimirsitnikov] and [~zabetak] ~ I have addressed the comments ~ [1]

> Can you please clarify what is the number of objects in DigestToRelMap?

They are the rel and all it's input's digest to node mapping, in order to 
simulate the planner node #register(I know there is also RelSubSet node there, 
but the affect should be the same).

I have changed the level from  Level.Invocation to Level.Iteration.

Here is the latest data:

The diff of performance:
{code:java}

Benchmark   (digestType)  (disjunctions)  (joins)  
Mode  Cnt  Score   Error  Units
DigestBenchmark.getRelOBJECT   11  
avgt5  0.123 ± 0.004  us/op
DigestBenchmark.getRelOBJECT   1   10  
avgt5  0.447 ± 0.023  us/op
DigestBenchmark.getRelOBJECT   1   20  
avgt5  0.868 ± 0.085  us/op
DigestBenchmark.getRelOBJECT  101  
avgt5  0.126 ± 0.014  us/op
DigestBenchmark.getRelOBJECT  10   10  
avgt5  0.459 ± 0.029  us/op
DigestBenchmark.getRelOBJECT  10   20  
avgt5  0.920 ± 0.147  us/op
DigestBenchmark.getRelOBJECT 1001  
avgt5  0.119 ± 0.008  us/op
DigestBenchmark.getRelOBJECT 100   10  
avgt5  0.452 ± 0.030  us/op
DigestBenchmark.getRelOBJECT 100   20  
avgt5  0.857 ± 0.109  us/op

DigestBenchmark.getRelSTRING   11  
avgt5  1.320 ± 0.049  us/op
DigestBenchmark.getRelSTRING   1   10  
avgt5 10.588 ± 0.088  us/op
DigestBenchmark.getRelSTRING   1   20  
avgt5 27.863 ± 0.320  us/op
DigestBenchmark.getRelSTRING  101  
avgt5  1.352 ± 0.028  us/op
DigestBenchmark.getRelSTRING  10   10  
avgt5 10.612 ± 0.286  us/op
DigestBenchmark.getRelSTRING  10   20  
avgt5 27.865 ± 1.627  us/op
DigestBenchmark.getRelSTRING 1001  
avgt5  1.467 ± 0.683  us/op
DigestBenchmark.getRelSTRING 100   10  
avgt5 10.738 ± 0.075  us/op
DigestBenchmark.getRelSTRING 100   20  
avgt5 28.211 ± 0.449  us/op
{code}

The diff of memory usage:
{code:java}
Benchmark   (digestType)  (disjunctions)  (joins)  
Mode  Cnt  Score   Error  Units
DigestBenchmark.getRel:Max memory heapOBJECT   11  
avgt5  228065280.000  bytes
DigestBenchmark.getRel:Max memory heapOBJECT   1   10  
avgt5  211812352.000  bytes
DigestBenchmark.getRel:Max memory heapOBJECT   1   20  
avgt5  215482368.000  bytes
DigestBenchmark.getRel:Max memory heapOBJECT  101  
avgt5  239599616.000  bytes
DigestBenchmark.getRel:Max memory heapOBJECT  10   10  
avgt5  218628096.000  bytes
DigestBenchmark.getRel:Max memory heapOBJECT  10   20  
avgt5  257949696.000  bytes
DigestBenchmark.getRel:Max memory heapOBJECT 1001  
avgt5  258998272.000  bytes
DigestBenchmark.getRel:Max memory heapOBJECT 100   10  
avgt5  211812352.000  bytes
DigestBenchmark.getRel:Max memory heapOBJECT 100   20  
avgt5  213385216.000  bytes


DigestBenchmark.getRel:Max memory heapSTRING   11  
avgt5  300417024.000  bytes
DigestBenchmark.getRel:Max memory heapSTRING   1   10  
avgt5  262144000.000  bytes
DigestBenchmark.getRel:Max memory heapSTRING   1   20  
avgt5  242745344.000  bytes
DigestBenchmark.getRel:Max memory heapSTRING  101  
avgt5  317194240.000  bytes
DigestBenchmark.getRel:Max memory heapSTRING  10   10  
avgt5  273154048.000  bytes
DigestBenchmark.getRel:Max memory heapSTRING  10   20  
avgt5  258473984.000  bytes
DigestBenchma

[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-17 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138340#comment-17138340
 ] 

Stamatis Zampetakis commented on CALCITE-3786:
--

Thanks for the benchmark [~danny0405]. One thing that caught my eye is the use 
of {{Level.INVOCATION}} for setting up the state. As indicated in the 
[javadoc|http://javadox.com/org.openjdk.jmh/jmh-core/1.7/org/openjdk/jmh/annotations/Level.html#Invocation],
 it is better to avoid using this option.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-17 Thread Vladimir Sitnikov (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138302#comment-17138302
 ] 

Vladimir Sitnikov commented on CALCITE-3786:


{quote}I used the max used heap mem as the metric, is there better way to 
compare the memory there ?{quote}

1) gc.alloc.rate.norm  -- that is the metric for "allocation rate per benchmark 
execution"
2) The memory consumption (e.g. the number of bytes a rel consumes in heap) is 
to be measured with Eclipse Memory Analyzer or with 
https://openjdk.java.net/projects/code-tools/jol/

{quote} is an impressive improvement(20x) for performance{quote}

Can you please clarify what is the number of objects in {{DigestToRelMap}}?

What is the purpose of StringDigest? I guess it should not impact much, but it 
looks excessive, and it was not there in the old implementation.

The benchmark might suffer from dead code elimination, please follow 
http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_08_DeadCode.java

Please add a test case when the rel is built dynamically.
With calcite, we often build a relation, and then it is queried in the cache. 
It looks like the current benchmark measures use the same instances for the 
initial data population and for the measurements.


Please prefer shorter benchmark names and benchmark parameter names.
The current result table is very wide, and it is hard to follow.

For instance: isStringDigest could be refactored into

{code:java}
enum DigestType { STRING, OBJECT }
@Param DigestType digest;
{code}


> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-17 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138289#comment-17138289
 ] 

Danny Chen commented on CALCITE-3786:
-

Hi, [~vladimirsitnikov] ~

I write a benchmark there [1] to compare the performance and memory usage diff 
between the pure string digest and the new Digest structure.


{code:xml}
Benchmark (isStringDigest)  
(joins)  (whereClauseDisjunctions)  Mode  Cnt  Score   Error  Units
DigestBenchmark.getRelFromDigestToRelMap false  
  1  1  avgt5  0.113 ± 0.009  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap false  
  1  1  avgt5  376963072.000  bytes
DigestBenchmark.getRelFromDigestToRelMap false  
  1 10  avgt5  0.146 ± 0.029  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap false  
  1 10  avgt5  346554368.000  bytes
DigestBenchmark.getRelFromDigestToRelMap false  
  1100  avgt5  0.138 ± 0.014  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap false  
  1100  avgt5  348127232.000  bytes
DigestBenchmark.getRelFromDigestToRelMap false  
 10  1  avgt5  0.452 ± 0.041  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap false  
 10  1  avgt5  397934592.000  bytes
DigestBenchmark.getRelFromDigestToRelMap false  
 10 10  avgt5  0.450 ± 0.050  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap false  
 10 10  avgt5  383254528.000  bytes
DigestBenchmark.getRelFromDigestToRelMap false  
 10100  avgt5  0.452 ± 0.085  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap false  
 10100  avgt5  353894400.000  bytes
DigestBenchmark.getRelFromDigestToRelMap false  
 20  1  avgt5  0.819 ± 0.239  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap false  
 20  1  avgt5  327155712.000  bytes
DigestBenchmark.getRelFromDigestToRelMap false  
 20 10  avgt5  0.814 ± 0.123  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap false  
 20 10  avgt5  427819008.000  bytes
DigestBenchmark.getRelFromDigestToRelMap false  
 20100  avgt5  0.844 ± 0.218  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap false  
 20100  avgt5  366477312.000  bytes
{code}


{code:xml}
Benchmark (isStringDigest)  
(joins)  (whereClauseDisjunctions)  Mode  Cnt  Score   Error  Units
DigestBenchmark.getRelFromDigestToRelMap  true  
  1  1  avgt5  1.797 ± 0.218  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap  true  
  1  1  avgt5  412090368.000  bytes
DigestBenchmark.getRelFromDigestToRelMap  true  
  1 10  avgt5  1.824 ± 0.147  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap  true  
  1 10  avgt5  405274624.000  bytes
DigestBenchmark.getRelFromDigestToRelMap  true  
  1100  avgt5  2.109 ± 0.453  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap  true  
  1100  avgt5  402653184.000  bytes
DigestBenchmark.getRelFromDigestToRelMap  true  
 10  1  avgt5 12.118 ± 0.113  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap  true  
 10  1  avgt5  346030080.000  bytes
DigestBenchmark.getRelFromDigestToRelMap  true  
 10 10  avgt5 12.231 ± 0.807  us/op
DigestBenchmark.getRelFromDigestToRelMap:Max memory heap  true  
 10  

[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-16 Thread Laurent Goujon (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137805#comment-17137805
 ] 

Laurent Goujon commented on CALCITE-3786:
-

{quote}
Laurent Goujon It seems that you have some promotion ideas for the Digest code, 
feel free to do that, i tried to answer each message you left in the PR, hope 
it helps. But i would not revert the PR for no other strong reason, sorry.
{quote}
The thing is that you left no time to answer back (sorry, I'm not working over 
the weekend), or did you send a courtesy check on either the pull request and 
the jira issue to see people got their questions answers/concerns alleviated. 
We usually try to reach a consensus on issues and code changes, but it doesn't 
seem to be the case here. And it's a bit disappointed you are not ready to 
revert temporarily your change unless there are strong reason whereas (ignoring 
my comments/concerns on github) the lack of concrete evaluation regarding 
memory usage and performance impact seems quite a strong reason to do so...

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-16 Thread Vladimir Sitnikov (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136557#comment-17136557
 ] 

Vladimir Sitnikov commented on CALCITE-3786:


{quote}I would try to give a mem benchmark then ~{quote}

Would you please compare the performance as well?
It would be sad if the new approach is significantly slower.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-16 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136547#comment-17136547
 ] 

Stamatis Zampetakis commented on CALCITE-3786:
--

Hey [~danny0405] thanks for working on this. For the moment I am not going to 
comment on the approach will do that later if there are still disagreements.

In the meantime, multiple people have requested to wait a bit and not commit 
the PR yet so I find it fair to revert the commit for the time being. I know 
that you have spent quite some time on it but reviewers are also investing 
their own. We are not in a rush to get a release out so I don't see a big 
problem keeping this off-master a bit more.    

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-16 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136473#comment-17136473
 ] 

Danny Chen commented on CALCITE-3786:
-

Thanks [~julianhyde] and [~laurent] for the feedback ~

Compared to the old code this patch does these things:
- Use #equals to compare 2 RexCalls so that we can avoid the OOM like 
CALCITE-3784
- Use cached hashCode for quick #equals of the digest, it also replace the pure 
string comparison with object equals
- Include the hints in the digest
- Makes the RexCall normalization more general, more cases are normalized

So in general it definitely do not make the memory consumption bigger.

[~laurent] It seems that you have some promotion ideas for the Digest code, 
feel free to do that, i tried to answer each message you left in the PR, hope 
it helps. But i would not revert the PR for no other strong reason, sorry.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-15 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136290#comment-17136290
 ] 

Julian Hyde commented on CALCITE-3786:
--

I agree with [~laurent]. It doesn't look as if we've reached consensus yet. In 
particular, [~hyuan] gave a "-1", which I interpret not as a veto but as a 
"you've not convinced me yet".

I have been quiet on this issue because [~danny0405], [~vlsi] and [~hyuan] seem 
to be representing the sides of the argument very well. But I would like to see 
the discussion arrive at consensus. If there is disagreement, let's find an 
empirical way to settle the agreement -- e.g. a benchmark.

[~danny0405], I think you should revert the commit, re-open this case, and 
continue the discussion.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-15 Thread Laurent Goujon (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136282#comment-17136282
 ] 

Laurent Goujon commented on CALCITE-3786:
-

I'm not sure I understand why the change has been merged already: there has 
been questions/concerns on both JIRA and Github about the change. Is it 
possible to revert it and address those?

Personally, considering the impact, I would like to see a more concrete 
benchmark on the memory impact. I have for example some reservation about the 
use of a list of pair of string/objects in digest for things which are most 
likely fields of the {{RelNode}}.

And there's also the digest String field, initialized lazily (but not always?), 
but which is not sensible to memory pressure for example: could we for example 
let this string be gc'ed if there's not enough memory at the time even it means 
recomputing it?

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.24.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-11 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133914#comment-17133914
 ] 

Danny Chen commented on CALCITE-3786:
-

[~vladimirsitnikov] Thanks for the background sharing, if we do the 
normalization during planning already, it would be better that we made the 
thing more explicit, not an implicit contract there. That is to say, i prefer 
to keep sync the plan structure same with the real RexTree, a normalization on 
creation seems better. We can have a control flag three just like the Rex 
simplification, but the default value should be true.

Personally i still question about the gains of Rex normalization because it 
makes the RexNode code complex, and there is an implicit contract there, we 
have already made decision that Rex normalization is the way to go, so let's 
make the impl better if we can.

[~hyuan] I can thought of another benefit to use Digest, downstream projects 
need only implement the #explainTerms instead of additional #hashCode and 
#equals, which are both error-prone. As for the additional mem consumption for 
object references, assume 100 bytes for a Digest there, and 1 rel nodes, 
the total memory should be 1Mb, which i think is acceptable, so i would choose 
a more concise interface, a Digest behind the #explainTerms.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-11 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133799#comment-17133799
 ] 

Haisheng Yuan commented on CALCITE-3786:


{quote}

Even we are with these additional object references, then how much join nodes 
during the planning ? They do not expect to be the cause of the OOM. What 
causes the OOM is the digest strings, especially for complex RexCalls, before 
this patch, their string format digest were always patched up in the digest of 
the RelNode, which would be big memory comsuption.

{quote}

I am not saying this proposal causes the OOM, instead it reduces memory usage 
comparing with the string digest, as long as we stop using string digest for 
RexNode, OOM issue like 3784 will ease a lot. However, the so-called Digest 
still waste a lot of memory, unnecessarily. If you think Join node is not that 
many, how about Project? Try to compute the Digest size of Project, it may 
consume much more memory. 



{quote}

Why do you think we should use #hashCode and #equals to decide that two 
relational expression are semantically equivalent ? Or i asked it in different 
way, if two relational expression are equivalent, should they be equals to each 
other ? (For example, 2 projects differs only with field aliases.)

{quote}

Why not? Although this is not mandatory, it is so intuitive and natural. Two 
Project with different aliases can be equal with each other, as long as they 
produce the same hashcode, we treat them as equal object, in MEMO we can use 1 
instance to represent the other one.



{quote}

BTW, even if RexCall are new object, use the equals to compare would solve the 
problem.

{quote}

Exactly, that is the problem. It is deep comparison, which may cause 
performance issue for large complex queries. Note that different rules can 
generate lots of intermediate, equal RelNode and RexNode but different 
instances, especially when Project/Filter/Calc merge happens on physical 
operator in VolcanoPlanner.

BTW, does anyone know that [VoltDB|https://www.voltdb.com/company/why-voltdb/], 
an in-memory OLTP RDBMS, is 
[experimenting|https://github.com/VoltDB/voltdb/tree/master/src/frontend/org/voltdb/plannerv2]
 on integrating Calcite into its system?

If Calcite wants to be versatile for OLTP, OLAP, Batch, Stream, every bit of 
memory counts, and every millisecond counts.

I am casting "-1" on this proposal, the justification has been provided in this 
and previous comments. I am just expressing objection on this idea, I don't 
have rights to prevent anyone from doing anything, though.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-11 Thread Vladimir Sitnikov (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133205#comment-17133205
 ] 

Vladimir Sitnikov commented on CALCITE-3786:


>so that we can finally compare 2 RexCall using member equals, and it works 
>well.

1) There are code parts that still use #toString() for the comparison (e.g. 
RexSimplify)

2) I had exactly that suggestion (normalize RexNodes on creation), however, you 
have cast -1 multiple times because that would alter string representation of 
the plans with little gain.

[https://lists.apache.org/thread.html/d89103671efdd813fac768fbc2336bd125e925f0790e9137a2a16375%40%3Cdev.calcite.apache.org%3E]

My suggestion was

Vladimir: It turned out "b" (sort operands in computeDigest) is easier to 
implement.

Vladimir: I've filed a PR: [https://github.com/apache/calcite/pull/1703
]

and your response was "I’m strongly -1 for this way, because it beaks the plan 
test where almost all of the change are meaningless."

Note: PR1703 was updated since that discussion, and the actual implementation 
normalizes nodes at the planning time only, while the user-visible plans are 
displayed without normalization.

 

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-11 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133190#comment-17133190
 ] 

Danny Chen commented on CALCITE-3786:
-

[~vladimirsitnikov] I tried to normalize the RexCall in its constructor, so 
that we can finally compare 2 RexCall using member equals, and it works well.

But i also got 54 plan changes which are all operands order change [1], it this 
as expected ? 

[1] 
https://github.com/apache/calcite/pull/2016/commits/99f55001923eb1618268cfb9d0d645a86eb2ceff

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-10 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132944#comment-17132944
 ] 

Danny Chen commented on CALCITE-3786:
-

bq. Let's use Join operator as an example

Even we are with these additional object references, then how much join nodes 
during the planning ? They do not expect to be the cause of the OOM. What 
causes the OOM is the digest strings, especially for complex RexCalls, before 
this patch, their string format digest were always patched up in the digest of 
the RelNode, which would be big memory comsuption.

bq. Do we really need a tool to unify the logic?

Why do you think we should use #hashCode and #equals to decide that two 
relational expression are semantically equivalent ? Or i asked it in different 
way, if two relational expression are equivalent, should they be equals to each 
other ? (For example, 2 projects differs only with field aliases.)

bq. Did you try to debug it? It is true that when you copy the RexCall it just 
pass the reference, but after column pruning or project transpose, the RexCall 
might be a complete new object,

No, if you have an example, can you share so we can see how much memory the 
planning would use. BTW, even if RexCall are new object, use the equals to 
compare would solve the problem.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-10 Thread Jin Xing (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132926#comment-17132926
 ] 

Jin Xing commented on CALCITE-3786:
---

In my environment, some users use program to generate big queries (about 10k 
lines) and suffers memory issue a lot.

The reuse of instance is the pain point and I agree with [~hyuan] that a 
thorough fix is to have the ability of deduplication inside planner. Really 
hope we can move forward on this direction. The approach proposed in the patch 
ease the memory issue and provide a lighter way to do shallow comparison by 
cached hashcode. It's very good and I can feel the benefit but I have the same 
concern – – does the cached hashcode has to be bound with Digest ?

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-10 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132916#comment-17132916
 ] 

Haisheng Yuan commented on CALCITE-3786:


{quote}
That's not true, for RelNode we only bookeep its class name and id, for rex 
node only an additional object reference.
{quote}
Let's use Join operator as an example. It has 11 members (without the original 
string digest): left, right, rowType, cluster, id, traitset, condition, 
variablesSet, hints, joinType, joinInfo. Suppose all by reference they will use 
44 bytes shallow heap size. 

The digest of Join has: hashCode, rel, digest, items. The reference use 16 
bytes.  And items is a list with 10 elements (the default size is 10), uses 40 
bytes. Inside the items array list, it has 4 pairs \{"left", leftinput\}, 
\{"right", rightinput\}, \{"condition", condition\}, \{"joinType", jointype\}, 
which use 32 bytes. If the relnode is registered inside VolcanoPlanner, the 
left and right inputs are subsets, which will be converted to string with 
set_ID+traitset, which can take more than 10 bytes for a single input. The 
Digest itself can use more than 100 bytes.


{quote}
The digest behaves like a tool to unify the logic. We can also let each rel 
node to handle its "digest" but the code would be a mess.
{quote}
Do we really need a tool to unify the logic? It sounds like forced marriage. 
The reality is that with digest it is a mess. We can't even use an object's 
hashcode and equals to determine it is equals with another object or not, but 
rather need to override explainTerms(), isn't it counter-intuitive?



{quote}
They are different instances, but all the operators expect to be a singleton, 
and each time we copy the RexCalls in the rule, we just pass around the object 
reference.
{quote}
Did you try to debug it? It is true that when you copy the RexCall it just pass 
the reference, but after column pruning or project transpose, the RexCall might 
be a complete new object, even there are equivalent rexcall already exist. 
 

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-10 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132839#comment-17132839
 ] 

Danny Chen commented on CALCITE-3786:
-

> However, the digest itself is just duplicate each field of the RelNode and 
> RexNode again, which still double memory usage of RelNode and RexNode.

That's not true, for RelNode we only bookeep its class name and id, for rex 
node only an additional object reference.

> But what is the proposal's advantage over just removing the digest from 
> RelNode and RexNode completely and caching the hashcode only?

The digest behaves like a tool to unify the logic. We can also let each rel 
node to handle its "digest" but the code would be a mess.

> After query parser, the two RexCalls a > 10 are different instances
They are different instances, but all the operators expect to be a singleton, 
and each time we copy the RexCalls in the rule, we just pass around the object 
reference. I believe the rules that split the RexCalls would be few, with the 
fact that we just copy new references.


> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-10 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132573#comment-17132573
 ] 

Haisheng Yuan commented on CALCITE-3786:


The digest of RelNode in the patch does reuse RexNode parts, but the RexNode 
doesn't, which can still result in OOM like CALCITE-3784. The reuse of RexNode 
is the true pain point. 

If the RexNode also changes its Digest as in the patch, it can solve the OOM, 
because the digest (unlike digest string) size won't grow exponentially. 
However, the digest itself is just duplicate each field of the RelNode and 
RexNode again, which still double memory usage of RelNode and RexNode. 

I guess everyone knows that the digest is mainly used to check whether there is 
equivalent node in the planner quickly, without deep comparison. If we agree to 
discard the shallow comparison, I am totally fine with that. But what is the 
proposal's advantage over just removing the digest from RelNode and RexNode 
completely and caching the hashcode only?

In addition, for queries with complex and duplicate expressions, e.g. (a > 10 
and b < 1) or (a > 10 and b > 5) ... 
After query parser, the two RexCalls {{a > 10}} are different instances, in 
real case that can be arbitrary complex expressions, and repeat many times.  
And after rule transformations, like column pruning, all the RexNodes will 
become distinct instances and many of them duplicate again. This causes a lot 
of memory waste for large queries. With the so-called Digest in the proposal, 
it can consume less memory than the original string digest, still doubles the 
actual memory of RelNode and RexNode, but at the same time loses the ability of 
shallow comparison.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-10 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130580#comment-17130580
 ] 

Danny Chen commented on CALCITE-3786:
-

[~hyuan] ~
1. The new implementation already reuse the expression parts
2. the Digest#toString is only used for debugging

Of course you can give a -1, but i'm sorry i would ignore that if there was no 
specific design, i'm not sure if the new design solves the OOM but at least 
this patch should not be a blocker for that, we can switch to the new design at 
any time if we are ready for that.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-10 Thread Vladimir Sitnikov (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130310#comment-17130310
 ] 

Vladimir Sitnikov commented on CALCITE-3786:


[~hyuan],
1) The digest should be able to reuse expression parts, so it should not result 
in OOM like in 3784
2) toString implementation might track a set of already printed expressions, so 
the overall result would be shorter, thus it would avoid OOM as well. Well, we 
can add de-duplication in the current toString as well, however, it might have 
unexpected performance issues.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-09 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130096#comment-17130096
 ] 

Haisheng Yuan commented on CALCITE-3786:


I am sorry for disappointing you, I should have explicitly -1 on the proposal 
before you started working on this.

It has nothing to do RexNode memo stuff, I am just against the idea in this 
proposal, which sounds like someone feels headache after banging his head into 
the wall, the doctor wrapped the patient's head with thick bandage, and gave 
the advice: bang your head slightly next time. The proposal can ease the issue 
to some extent by reusing digest, but doesn't solve the underlying problem 
fundamentally that RexNode can't be deduplicated inside planner.

You think the digest in current code base works well, then how do you think 
CALCITE-3784? Maybe no one would run a single 20k lines analytical query or 60 
table joins on Flink, because it is real time streaming platform. 

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-09 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129928#comment-17129928
 ] 

Danny Chen commented on CALCITE-3786:
-

Thanks [~hyuan] for the share, the digest was always there, and it works well 
for current code base, this patch is only a promotion to the original pure 
string format digest, because we need to consider many properties including in 
the "digest" (e.g. the row type sans field names, the hints), a new interface 
makes the thing easier.

As for the rex node memo, i expect a design doc there, before we moving forward 
to that, a digest interface is a good choice.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3786) Add Digest interface to enable efficient hashCode(equals) for RexNode and RelNode

2020-06-09 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129415#comment-17129415
 ] 

Haisheng Yuan commented on CALCITE-3786:


I noticed Danny opened a pull request, thanks for taking this, Danny!

But frankly speaking, I am still against this idea, digest should not be used 
at all. The reason remains unchanged.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> -
>
> Key: CALCITE-3786
> URL: https://issues.apache.org/jira/browse/CALCITE-3786
> Project: Calcite
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 1.21.0
>Reporter: Vladimir Sitnikov
>Assignee: Danny Chen
>Priority: Major
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)