[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2022-07-18 Thread Alessandro Solimando (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567907#comment-17567907
 ] 

Alessandro Solimando commented on CALCITE-4223:
---

I have reviewed the changes and LGTM

> Introducing column statistics to RelOptTable
> 
>
> Key: CALCITE-4223
> URL: https://issues.apache.org/jira/browse/CALCITE-4223
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Chunwei Lei
>Assignee: Chunwei Lei
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Many systems depend on column statistics to compute more accurate stats, such 
> as NDV, average column size, and so on. It would be nice if Calcite can 
> provide such an interface.
> Column statistics might include NDV, average/max column length, number of 
> nulls, number of trues, number of falses and so on. 
> What do you think?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2022-07-17 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567773#comment-17567773
 ] 

Chunwei Lei commented on CALCITE-4223:
--

I would like to merge the PR in the next 48 hours if no objections appear.



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2022-07-05 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562492#comment-17562492
 ] 

Chunwei Lei commented on CALCITE-4223:
--

Could you please review the PR, [~julianhyde]?



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2022-07-03 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562018#comment-17562018
 ] 

Chunwei Lei commented on CALCITE-4223:
--

Thank you for your reply, [~asolimando]. I would like to continue this work. 
Hope it can help.



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2022-06-29 Thread Alessandro Solimando (Jira)

Alessandro Solimando commented on CALCITE-4223:
---

I have recently been working on adding histogram statistics for HIVE-26221, and 
your PR seems to be going in the right direction, IMO. In my experience with 
Hive as a downstream project, I feel that Julian is right: in Hive we went for 
subclassing RelOptTable as RelOptHiveTable, and we are adding column statistics 
support directly there. I feel that the approach based on the unwrap method 
looks cleaner and more composable.


[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2022-06-28 Thread Chunwei Lei (Jira)

Chunwei Lei commented on CALCITE-4223:
--

It has been a long time since the last discussion. Recently I have had time to 
move it forward. After reviewing all of the discussion above, I opened a new PR 
as Julian suggested: https://github.com/apache/calcite/pull/2845. I am not sure 
whether it is the best way to introduce the column stats, so I would like to 
hear what others think about it, especially those involved in the downstream 
projects.


[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-10-12 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212173#comment-17212173
 ] 

Chunwei Lei commented on CALCITE-4223:
--

I will check it. Thank you for your reply. (Just back from vacation :D)



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-10-03 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206840#comment-17206840
 ] 

Julian Hyde commented on CALCITE-4223:
--

I did some more work here: 
https://github.com/julianhyde/calcite/tree/4223-metadata



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-10-01 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17205961#comment-17205961
 ] 

Julian Hyde commented on CALCITE-4223:
--

I started writing a PR, but there is already a perfect example. In CALCITE-1861 
I wanted a table to be able to declare its own predicates. So I added 4 lines 
to {{getAllPredicates}}:

{code}
  public RelOptPredicateList getAllPredicates(TableScan scan,
      RelMetadataQuery mq) {
    final BuiltInMetadata.AllPredicates.Handler handler =
        scan.getTable().unwrap(BuiltInMetadata.AllPredicates.Handler.class);
    if (handler != null) {
      return handler.getAllPredicates(scan, mq);
    }
    return RelOptPredicateList.EMPTY;
  }
{code}

and I made a mock table that called {{addWrap}} to add its own implementation 
of {{AllPredicates.Handler}}:

{code}
restaurantTable.addWrap(
    new BuiltInMetadata.AllPredicates.Handler() {
      public RelOptPredicateList getAllPredicates(RelNode r,
          RelMetadataQuery mq) {
        ...
      }
    });
{code}

Put a break-point in {{RelMdAllPredicates.getAllPredicates(TableScan scan, 
RelMetadataQuery mq)}} and run {{RelOptRulesTest.testSpatialDWithinToHilbert}} 
and see how your break-point gets hit.

We need to check for handlers in all {{RelMdXxx.getXxx(TableScan, 
RelMetadataQuery)}} methods. If I did it again I'd implement {{interface 
AllPredicates}} rather than {{interface AllPredicates.Handler}}, but the 
principle is the same.



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-28 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203237#comment-17203237
 ] 

Chunwei Lei commented on CALCITE-4223:
--

{quote}I don't know why Flink and Drill have not integrated their statistics 
into Calcite. Maybe they didn't know how. They could have asked. Or we could 
have written better documentation.
{quote}
I think it is not good practice if users have to ask the community how to do 
it and how to do it correctly. I have to admit that my proposal is not as 
extensible as expected. What I want to do is introduce column statistics and 
take advantage of them explicitly. BTW, it would be great if you could open a 
PR to show how to introduce column statistics in the way you think it should 
be done. Thank you for your time.



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-28 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203104#comment-17203104
 ] 

Julian Hyde commented on CALCITE-4223:
--

See above, my suggestion to expose interfaces such as {{BuiltInMetadata.Size}} 
when someone calls {{unwrap(BuiltInMetadata.Size.class)}} on your 
{{RelOptTable}} or {{Table}}. Then change methods such as 
{{averageColumnSizes(TableScan rel, RelMetadataQuery mq)}} to look for that 
interface and call it.
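
The lookup-then-fallback pattern described here can be sketched without any Calcite dependency. This is only a minimal sketch under stated assumptions: {{ColumnSizes}}, {{ScanTable}} and {{SizeEstimates}} are hypothetical stand-ins (roughly for {{BuiltInMetadata.Size}}, a {{RelOptTable}} and {{RelMdSize}}), and the per-type default sizes are made-up numbers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a single-purpose metadata interface.
interface ColumnSizes {
  List<Double> averageColumnSizes();
}

// Hypothetical stand-in for a table; it may or may not carry statistics.
final class ScanTable {
  final List<String> columnTypes;
  private final ColumnSizes sizes; // null when the engine has no statistics

  ScanTable(List<String> columnTypes, ColumnSizes sizes) {
    this.columnTypes = columnTypes;
    this.sizes = sizes;
  }

  // Same contract as an unwrap(Class) method: null means "not available".
  <C> C unwrap(Class<C> clazz) {
    return clazz == ColumnSizes.class ? clazz.cast(sizes) : null;
  }
}

final class SizeEstimates {
  // Made-up type-based defaults, used only when no statistics exist.
  static double defaultSize(String typeName) {
    switch (typeName) {
      case "INTEGER": return 4d;
      case "BIGINT": return 8d;
      default: return 12d;
    }
  }

  static List<Double> averageColumnSizes(ScanTable table) {
    // First ask the table whether it can provide real statistics.
    final ColumnSizes sizes = table.unwrap(ColumnSizes.class);
    if (sizes != null) {
      final List<Double> fromStats = sizes.averageColumnSizes();
      if (fromStats != null) {
        return fromStats;
      }
    }
    // Otherwise fall back to the generic, type-based guess.
    final List<Double> result = new ArrayList<>();
    for (String type : table.columnTypes) {
      result.add(defaultSize(type));
    }
    return result;
  }
}
```

The caller never needs to know which engine produced the table; a table without statistics simply returns null from the lookup and gets the generic estimate.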



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-27 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202982#comment-17202982
 ] 

Danny Chen commented on CALCITE-4223:
-

> Maybe they didn't know how

We write our own metadata and check the {{ColumnStatistic}} in it. For common 
metadata extending from Calcite, yes, we have no good way to integrate.



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-27 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202938#comment-17202938
 ] 

Julian Hyde commented on CALCITE-4223:
--

I don't see how Flink's {{ColumnStats}} and Drill's {{ColumnStatistics}} play 
into this. I would expect each engine, Flink and Drill in this case, to have 
their own data structure(s) to store statistics. But then we need to make those 
statistics available to rules in Calcite that are not aware of the engine and 
its particular statistics data structure.

I don't know why Flink and Drill have not integrated their statistics into 
Calcite. Maybe they didn't know how. They could have asked. Or we could have 
written better documentation.

Using a Java interface is a poor choice for extensibility. Let's suppose that 
we add your {{interface ColumnStatistics}} to Calcite. Let's suppose that Drill 
creates {{interface DrillColumnStatistics extends ColumnStatistics}} with one 
extra method, and Flink creates {{interface FlinkColumnStatistics extends 
ColumnStatistics}} with two extra methods. Now there's no interface with all of 
the extra methods.

Calcite's approach is to make each statistic an interface with one method (or 
occasionally two, if closely related). So an engine can implement the ones it 
has, and ignore the others. It is a better extensibility story than what you 
propose.
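
To make the one-interface-per-statistic idea concrete, here is a minimal sketch, not Calcite code. All names in it ({{DistinctCount}}, {{NullCount}}, {{RangeDistinctCount}}, {{WrappingTable}} and its typed {{addWrap}}) are hypothetical, and the statistic values are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Each statistic is its own small interface (all hypothetical).
interface DistinctCount {
  double distinctCount(int column);
}

interface NullCount {
  double nullCount(int column);
}

// An "extra" statistic that only some engines can answer, e.g. from a histogram.
interface RangeDistinctCount {
  double distinctCountBetween(int column, long low, long high);
}

// Hypothetical table that hands out whatever interfaces were registered on it.
final class WrappingTable {
  private final Map<Class<?>, Object> wraps = new HashMap<>();

  <C> WrappingTable addWrap(Class<C> clazz, C impl) {
    wraps.put(clazz, impl);
    return this;
  }

  // Returns null when the statistic was never registered.
  <C> C unwrap(Class<C> clazz) {
    return clazz.cast(wraps.get(clazz));
  }
}

final class OneInterfacePerStatisticDemo {
  public static void main(String[] args) {
    // Engine A knows distinct and null counts, nothing else.
    WrappingTable engineA = new WrappingTable()
        .addWrap(DistinctCount.class, column -> 1000d)
        .addWrap(NullCount.class, column -> 25d);
    // Engine B only has a histogram-style range statistic.
    WrappingTable engineB = new WrappingTable()
        .addWrap(RangeDistinctCount.class, (column, low, high) -> 90d);

    System.out.println(engineA.unwrap(DistinctCount.class) != null);
    System.out.println(engineA.unwrap(RangeDistinctCount.class) != null);
    System.out.println(engineB.unwrap(RangeDistinctCount.class) != null);
  }
}
```

Neither engine has to extend a shared {{ColumnStatistics}} interface, and adding a new statistic later means adding one new interface rather than widening an existing one.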



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-26 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202725#comment-17202725
 ] 

Chunwei Lei commented on CALCITE-4223:
--

Let's think about it in another way. A good interface is usually extensible and 
easily and widely used by others, right? But what I find is that nobody uses 
the approach you propose to introduce column statistics, including Flink[1] and 
Drill[2]. I think it deserves a second thought. I am also glad to hear what 
others think about it.

 

[1]https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/plan/stats/ColumnStats.java

[2]https://github.com/apache/drill/blob/master/metastore/metastore-api/src/main/java/org/apache/drill/metastore/statistics/ColumnStatistics.java



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-22 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200258#comment-17200258
 ] 

Julian Hyde commented on CALCITE-4223:
--

And you need to implement the {{unwrap(Class)}} method in your class that 
implements {{interface Table}}. How you do that is your business.



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-22 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200256#comment-17200256
 ] 

Julian Hyde commented on CALCITE-4223:
--

[~Chunwei Lei], We don't need to change {{interface RelOptTable}} at all. We 
don't need a new {{interface ColumnStatistics}}.  But we should change all of 
the metadata methods that deal with table scans to see whether the table has 
the statistics so that we can return a better result.

For example:
{noformat}
diff --git a/core/src/main/java/org/apache/calcite/rel/metadata/RelMdSize.java b/core/src/main/java/org/apache/calcite/rel/metadata/RelMdSize.java
index 458df6b34..d50e32a51 100644
--- a/core/src/main/java/org/apache/calcite/rel/metadata/RelMdSize.java
+++ b/core/src/main/java/org/apache/calcite/rel/metadata/RelMdSize.java
@@ -172,6 +172,11 @@ public Double averageRowSize(RelNode rel, RelMetadataQuery mq) {
 
   public List<Double> averageColumnSizes(TableScan rel, RelMetadataQuery mq) {
     final List<RelDataTypeField> fields = rel.getRowType().getFieldList();
+    final BuiltInMetadata.Size size =
+        rel.getTable().unwrap(BuiltInMetadata.Size.class);
+    if (size != null && size.averageColumnSizes() != null) {
+      return size.averageColumnSizes();
+    }
     final ImmutableList.Builder<Double> list = ImmutableList.builder();
     for (RelDataTypeField field : fields) {
       list.add(averageTypeValueSize(field.getType()));
{noformat}




[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-21 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199349#comment-17199349
 ] 

Liya Fan commented on CALCITE-4223:
---

IMHO, I like the idea of a quick path for some important table statistics. 

Admittedly, RelMetadata is a powerful tool; however, it can be overkill for 
some scenarios:

1) It can be expensive, due to the cost of code-gen and compilation.

2) It may be inaccurate because some logic is intended for general purpose. 



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-20 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199143#comment-17199143
 ] 

Chunwei Lei commented on CALCITE-4223:
--

Do you mean that we don't need to make any changes, since we already have such 
a framework for users who want to introduce column statistics?



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-18 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198612#comment-17198612
 ] 

Julian Hyde commented on CALCITE-4223:
--

OK, I missed that it extends {{Wrapper}}. Still, there's no reason to create 
{{ColStatistics}} as a new interface. {{RelOptTable}} has all the extension 
points you need.



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-18 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198161#comment-17198161
 ] 

Chunwei Lei commented on CALCITE-4223:
--

{quote}That doesn't work, because your {{ColStatistics}} doesn't have an 
{{unwrap}} method.
{quote}
{{ColStatistics}} has an unwrap method since it extends {{Wrapper}}.



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-17 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198115#comment-17198115
 ] 

Julian Hyde commented on CALCITE-4223:
--

bq. Users can implement ColStatistics and add new methods. Then use unwrap() to 
get the customized ColStatistics.

That doesn't work, because your {{ColStatistics}} doesn't have an {{unwrap}} 
method.

bq. I think ... is much more straightforward and readable.

I don't think readability is the most important metric. The problem is to plug 
together providers, which is hard. A robust solution is bound to be somewhat 
complex.

bq. Besides, does it mean that RelOptTable has to implement interfaces like 
BuiltInMetadata.Size/BuiltInMetadata.DistinctRowCount in your proposal?

No. When I call unwrap on an object, it doesn't have to return itself. 
Typically it will return a lambda.
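
A minimal sketch of that point, with hypothetical names throughout ({{RowSize}} and {{LambdaTable}} are not Calcite types): the table class does not implement the statistic interface, yet {{unwrap}} can still hand one out as a lambda that closes over the table's state.

```java
// Hypothetical single-method statistic interface.
interface RowSize {
  Double averageRowSize();
}

// Hypothetical table. Note that it does NOT implement RowSize.
final class LambdaTable {
  private final double avgRowBytes;

  LambdaTable(double avgRowBytes) {
    this.avgRowBytes = avgRowBytes;
  }

  // Shaped like an unwrap(Class) method: returns null when unsupported.
  <C> C unwrap(Class<C> clazz) {
    if (clazz == RowSize.class) {
      // The returned object is a lambda, not the table itself.
      RowSize provider = () -> avgRowBytes;
      return clazz.cast(provider);
    }
    if (clazz.isInstance(this)) {
      return clazz.cast(this);
    }
    return null;
  }
}

final class UnwrapLambdaDemo {
  public static void main(String[] args) {
    LambdaTable table = new LambdaTable(42d);
    RowSize rowSize = table.unwrap(RowSize.class);
    System.out.println(rowSize.averageRowSize());
    System.out.println(table instanceof Object && !(rowSize instanceof LambdaTable));
  }
}
```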



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-17 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198105#comment-17198105
 ] 

Chunwei Lei commented on CALCITE-4223:
--

Thank you for your review, Julian.
{quote}it does not easily allow people to add new kinds of metadata, and it 
does not accommodate differences in data structures that may have more 
information (e.g. a system that has a histogram that returns not just number of 
distinct values, but the number of distinct values between 100 and 1000)
{quote}
Users can implement ColStatistics and add new methods, then use unwrap() to get 
the customized ColStatistics. Compared with 
{{table.unwrap(BuiltInMetadata.Size.class)}}, I think 
{{sum(table.getColumnStatistics(col).getAvgColLen())}} is much more 
straightforward and readable.

Besides, does it mean that {{RelOptTable}} has to implement interfaces like 
{{BuiltInMetadata.Size}}/{{BuiltInMetadata.DistinctRowCount}} in your proposal?
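
For comparison, the accessor style mentioned in this comment might look roughly like the sketch below. Only {{getColumnStatistics}} and {{getAvgColLen}} are named in the discussion; the rest ({{getNdv}}, {{getNumNulls}}, {{SimpleColStatistics}}, {{StatsAwareTable}}) is hypothetical and is not taken from the PR.

```java
import java.util.List;

// Hypothetical shape of a per-column statistics object.
interface ColStatistics {
  double getAvgColLen();
  long getNdv();
  long getNumNulls();
}

// Simple immutable implementation, for illustration only.
final class SimpleColStatistics implements ColStatistics {
  private final double avgColLen;
  private final long ndv;
  private final long numNulls;

  SimpleColStatistics(double avgColLen, long ndv, long numNulls) {
    this.avgColLen = avgColLen;
    this.ndv = ndv;
    this.numNulls = numNulls;
  }

  @Override public double getAvgColLen() { return avgColLen; }
  @Override public long getNdv() { return ndv; }
  @Override public long getNumNulls() { return numNulls; }
}

// Hypothetical table exposing statistics through a direct accessor.
final class StatsAwareTable {
  private final List<ColStatistics> columns;

  StatsAwareTable(List<ColStatistics> columns) {
    this.columns = columns;
  }

  ColStatistics getColumnStatistics(int col) {
    return columns.get(col);
  }

  // Average row size as the sum of the average column lengths.
  double averageRowSize() {
    double sum = 0d;
    for (int col = 0; col < columns.size(); col++) {
      sum += getColumnStatistics(col).getAvgColLen();
    }
    return sum;
  }
}
```

The call site is short, but every statistic a consumer can ask for has to be a method on the one shared interface, which is the trade-off debated in this thread.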



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-17 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197526#comment-17197526
 ] 

Julian Hyde commented on CALCITE-4223:
--

I see in the PR you have created {{interface ColStatistics}} and added a method 
to {{RelOptTable}} to get it.

I said above, and still think, that this is not the right approach. It does not 
easily allow people to add new kinds of metadata, and it does not accommodate 
differences in data structures that may have more information (e.g. a system 
that has a histogram that returns not just number of distinct values, but the 
number of distinct values between 100 and 1000).

I introduced {{interface Statistics}} to make the simple case easy. It is not a 
template that we should try to extend.

Suppose you could query any metadata interface on a {{RelOptTable}} using 
{{unwrap}}. Then you can easily implement metadata. For example, in 
{{RelMdSize}}:

{code}
  public Double averageRowSize(TableScan scan, RelMetadataQuery mq) {
    final RelOptTable table = scan.getTable();
    final BuiltInMetadata.Size size =
        table.unwrap(BuiltInMetadata.Size.class);
    if (size != null) {
      return size.averageRowSize();
    }
    return null;
  }
{code}

I think that is much more elegant and straightforward.

Of course the implementor of the particular type of table will have to 
implement the necessary interfaces, but I don't think that will be hard.
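
To illustrate the implementor's side of this, here is a minimal, self-contained sketch. It does not use Calcite's classes: {{Unwrappable}}, {{SizeInfo}} and {{StatsTable}} are hypothetical stand-ins for {{Wrapper}}, {{BuiltInMetadata.Size}} and a custom table, and the row size is assumed to be the sum of the average column lengths.

```java
import java.util.List;

/** Hypothetical stand-in for Calcite's Wrapper: look up an optional interface by class. */
interface Unwrappable {
  <C> C unwrap(Class<C> aClass);
}

/** Hypothetical stand-in for a size-metadata interface such as BuiltInMetadata.Size. */
interface SizeInfo {
  Double averageRowSize();
}

/** A table that keeps per-column average lengths and exposes them as SizeInfo. */
final class StatsTable implements Unwrappable, SizeInfo {
  private final List<Double> avgColumnLengths;

  StatsTable(List<Double> avgColumnLengths) {
    this.avgColumnLengths = avgColumnLengths;
  }

  /** Assumption of this sketch: row size is the sum of the average column lengths. */
  @Override public Double averageRowSize() {
    double sum = 0d;
    for (double length : avgColumnLengths) {
      sum += length;
    }
    return sum;
  }

  @Override public <C> C unwrap(Class<C> aClass) {
    // Return this object if it implements the requested interface, else null.
    return aClass.isInstance(this) ? aClass.cast(this) : null;
  }
}
```

A caller that only knows about {{Unwrappable}} asks for {{SizeInfo.class}} and gets null back when the table does not support it, which mirrors the null check in the {{RelMdSize}} snippet above.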



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-10 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193391#comment-17193391
 ] 

Chunwei Lei commented on CALCITE-4223:
--

Thank you for your helpful insight, [~julianhyde]. I will start to prepare the 
PR.



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-09 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192721#comment-17192721
 ] 

Julian Hyde commented on CALCITE-4223:
--

bq. 1) Do you agree to introduce column statistics?

Yes. I believe that {{RelMdPopulationSize}} and {{RelMdDistinctRowCount}} give 
you unfiltered and filtered NDV.

But feel free to propose other statistics.

bq. 2) If so, where should we put them? (RelOptTable? Statistics? Or other 
places)

You will need to do two things: first, store the statistics; second, make 
them accessible (e.g. to planner rules, other statistics and the cost model). 
To store them, very likely you will put a data structure such as a sketch in 
your {{Table}} and make it accessible via the {{RelOptTable}} that wraps it. 
Both of these can implement {{Wrapper}} to give access to the data structures 
holding the statistics.

Then you should make them accessible via new or existing statistics methods 
along the lines of {{RelMetadataQuery.getPopulationSize(RelNode, 
ImmutableBitSet)}}. You will obviously want to implement for {{TableScan}} but 
should try to implement for other {{RelNode}} sub-types as well.
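
The storage half of this could look roughly like the sketch below. It is self-contained and does not use Calcite's classes: {{Lookup}} stands in for {{Wrapper}}, {{StoredTable}} for the {{Table}}, {{PlannerTable}} for the {{RelOptTable}} that wraps it, and {{NdvStats}} is a hypothetical holder where a real system might keep a sketch or histogram.

```java
import java.util.Map;

/** Hypothetical stand-in for Calcite's Wrapper interface. */
interface Lookup {
  <C> C unwrap(Class<C> aClass);
}

/** Hypothetical statistics holder; a real system might keep a sketch here. */
final class NdvStats {
  private final Map<Integer, Double> ndvByColumn;

  NdvStats(Map<Integer, Double> ndvByColumn) {
    this.ndvByColumn = ndvByColumn;
  }

  /** Returns the stored NDV for a column ordinal, or null if unknown. */
  Double ndv(int column) {
    return ndvByColumn.get(column);
  }
}

/** Plays the role of the schema-level table that stores the statistics. */
final class StoredTable implements Lookup {
  private final NdvStats stats;

  StoredTable(NdvStats stats) {
    this.stats = stats;
  }

  @Override public <C> C unwrap(Class<C> aClass) {
    if (aClass.isInstance(stats)) {
      return aClass.cast(stats);
    }
    return aClass.isInstance(this) ? aClass.cast(this) : null;
  }
}

/** Plays the role of the planner-level table that wraps the stored table. */
final class PlannerTable implements Lookup {
  private final StoredTable table;

  PlannerTable(StoredTable table) {
    this.table = table;
  }

  @Override public <C> C unwrap(Class<C> aClass) {
    if (aClass.isInstance(this)) {
      return aClass.cast(this);
    }
    // Delegate anything this wrapper does not implement to the wrapped table.
    return table.unwrap(aClass);
  }
}

/** Plays the role of a metadata method that reads the statistics for a scan. */
final class NdvLookup {
  static Double columnNdv(Lookup table, int column) {
    NdvStats stats = table.unwrap(NdvStats.class);
    return stats == null ? null : stats.ndv(column);
  }
}
```

The metadata method never needs to know which layer holds the data; it only asks the outermost object for the interface it wants.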



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-08 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192601#comment-17192601
 ] 

Chunwei Lei commented on CALCITE-4223:
--

There is always an alternative due to the flexible interface. But I think it 
would be great if Calcite had column statistics, since they help generate a 
better plan.

Even if column statistics are provided, we would still get statistics like 
NDV through {{RelMetadataQuery}}. What we need to do is add or modify some 
methods in the RelMdXxx classes. Taking NDV as an example:

 
{code:java}
// RelMdDistinctRowCount.java

public Double getDistinctRowCount(TableScan rel,
    RelMetadataQuery mq, ImmutableBitSet groupKey,
    RexNode predicate) {
  ..
  List allDistinctValue =
      getAllDistinctValue(rel.getTable().getColumnStatistics(), groupKey);
  return getJointDistinctValue(allDistinctValue);
}
{code}
 

Regarding the lack of a fixed definition of statistics, we can provide some 
basic and frequently used column statistics, including 
NDV/AverageColumnSize/nullCount. 
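
As a rough illustration of such basic statistics and of a joint-NDV helper like the one sketched above, consider the following. Everything here is hypothetical: {{BasicColumnStats}} and {{JointNdv}} are illustrative names, not Calcite classes, and the estimate assumes the key columns are independent (product of per-column NDVs, capped at the row count).

```java
import java.util.List;

/** Hypothetical holder for basic per-column statistics. */
final class BasicColumnStats {
  final double ndv;
  final double avgColumnSize;
  final long nullCount;

  BasicColumnStats(double ndv, double avgColumnSize, long nullCount) {
    this.ndv = ndv;
    this.avgColumnSize = avgColumnSize;
    this.nullCount = nullCount;
  }
}

final class JointNdv {
  /**
   * Estimates the number of distinct combinations of the key columns.
   * Assumption of this sketch: columns are independent, so the estimate is
   * the product of the per-column NDVs, never more than the row count.
   */
  static double estimate(List<BasicColumnStats> keyColumns, double rowCount) {
    double product = 1d;
    for (BasicColumnStats column : keyColumns) {
      product *= column.ndv;
      if (product >= rowCount) {
        return rowCount;
      }
    }
    return product;
  }
}
```

Correlated columns would make the product an overestimate, which is one reason an engine may want to supply its own, richer statistics.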

 

Let me summarize to help others follow. There are two questions:

1) Do you agree to introduce column statistics?

2) If so, where should we put them? (RelOptTable? Statistics? Or other places)



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-08 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192361#comment-17192361
 ] 

Julian Hyde commented on CALCITE-4223:
--

[~Chunwei Lei] wrote:

bq. RelOptTable does not provide a method to get Statistic. 

Now that I've told you about {{Wrapper.unwrap}}, can you reconsider this 
request to modify {{RelOptTable}}?



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-08 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192359#comment-17192359
 ] 

Julian Hyde commented on CALCITE-4223:
--

I don't think we should extend {{interface RelOptTable}} or {{interface 
Statistic}}. There are a few reasons.

First, {{RelOptTable}} already extends {{interface Wrapper}}. If you want your 
implementation of {{RelOptTable}} to provide some other interface, you can get 
it via {{Wrapper.unwrap()}}.

Second, there is no fixed definition of statistics. Different engines are going 
to have different variations. One engine might have a histogram that tells you 
the number of distinct values of productId when state = 'CA'; another might 
not. That's why we created the {{RelMetadataQuery}} framework.

Third and fourth, we will need caching, and a way to plug in multiple 
providers. Again, that's what the {{RelMetadataQuery}} framework is for.
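
The two ideas in the last point, caching and pluggable providers, can be sketched in isolation as follows. This is not how {{RelMetadataQuery}} is implemented; {{NdvProvider}} and {{CachingNdvQuery}} are hypothetical names, and the sketch only shows providers being asked in order with the first non-null answer cached.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical provider of NDV estimates; returns null when it has no answer. */
interface NdvProvider {
  Double ndv(String table, int column);
}

/** Asks providers in order and caches the first non-null answer per column. */
final class CachingNdvQuery {
  private final List<NdvProvider> providers;
  private final Map<String, Double> cache = new HashMap<>();

  /** Exposed only so the sketch can show that cached lookups skip the providers. */
  int providerCalls = 0;

  CachingNdvQuery(List<NdvProvider> providers) {
    this.providers = providers;
  }

  Double ndv(String table, int column) {
    String key = table + "#" + column;
    if (cache.containsKey(key)) {
      return cache.get(key);
    }
    Double result = null;
    for (NdvProvider provider : providers) {
      providerCalls++;
      result = provider.ndv(table, column);
      if (result != null) {
        break;
      }
    }
    cache.put(key, result);
    return result;
  }
}
```

An engine-specific provider can be placed ahead of a generic fallback, so richer statistics win where they exist and a default applies elsewhere.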

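Why not write a convenience method?

{code}
  double getNdv(RelMetadataQuery mq, RelOptTable table, int column) {
    // Pseudocode: TableScan is abstract, so a real caller would build the
    // scan from a cluster, e.g. with LogicalTableScan.create.
    return mq.getPopulationSize(new TableScan(table),
        ImmutableBitSet.of(column));
  }
{code}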




[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-07 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191916#comment-17191916
 ] 

Chunwei Lei commented on CALCITE-4223:
--

Thanks for sharing, [~danny0405]. IMO, one inconvenience is that Flink has to 
implement its own {{RelOptTable}} to get {{Statistic}}, because 
{{RelOptTable}} does not provide a method to get {{Statistic}}. 



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-06 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191432#comment-17191432
 ] 

Danny Chen commented on CALCITE-4223:
-

I want to share how Apache Flink handles this. We subclass {{Statistic}} with 
a structure named {{FlinkStatistic}}. {{FlinkStatistic}} keeps an optional 
member named {{TableStats}}, and {{TableStats}} has a {{ColumnStats}} for 
each column.



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-06 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191409#comment-17191409
 ] 

Chunwei Lei commented on CALCITE-4223:
--

Let me make myself clear. What I have in mind is a method called 
{{getColumnStatistics()}} in {{RelOptTable}}. Thus, we can easily get the NDV 
(or other statistics) of a {{TableScan}} via 
{{rel.getTable().getColumnStatistics().get(columnName).getNdv()}}. Its 
implementation can be delegated to {{Statistic#getColumnStatistics()}}. Do 
you have any other ideas?

 



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-04 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190821#comment-17190821
 ] 

Julian Hyde commented on CALCITE-4223:
--

[~Chunwei Lei], But do the column statistics have to be *in RelOptTable*?



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-03 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190504#comment-17190504
 ] 

Chunwei Lei commented on CALCITE-4223:
--

[~julianhyde], column statistics might include NDV, average/max column length, 
number of nulls, number of trues, number of falses, and TopK. Some systems 
like Hive [1] provide a command to collect these stats. Provided we have such 
column stats, we can:

1) get a more accurate NDV for a table scan than an estimate.

2) estimate the size of the inputs of a Join more accurately if the columns' 
types include varchar, which helps decide whether to use HashJoin or 
MergeJoin (because we have the average column length).
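
The arithmetic behind point 2 can be sketched as follows. The class, method names and the memory-budget rule are hypothetical and only illustrate the idea: input size is estimated as row count times the sum of the average column lengths, and a hash join is preferred when the smaller input fits in memory.

```java
final class JoinSizeSketch {
  /** Estimated bytes of an input: row count times the sum of average column lengths. */
  static double estimatedBytes(double rowCount, double[] avgColumnLengths) {
    double rowSize = 0d;
    for (double length : avgColumnLengths) {
      rowSize += length;
    }
    return rowCount * rowSize;
  }

  /**
   * Hypothetical rule for this sketch: use a hash join when the smaller
   * input fits in the memory budget, otherwise fall back to a merge join.
   */
  static String chooseJoin(double leftBytes, double rightBytes,
      double memoryBudgetBytes) {
    return Math.min(leftBytes, rightBytes) <= memoryBudgetBytes
        ? "HashJoin"
        : "MergeJoin";
  }
}
```

For example, 1,000 rows with an int column (4 bytes) and a varchar column averaging 16 bytes come to 1,000 * 20 = 20,000 bytes. Without the average length, the varchar column would have to be sized by its declared maximum or a guess.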

 

[1] https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-03 Thread Chunwei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190493#comment-17190493
 ] 

Chunwei Lei commented on CALCITE-4223:
--

[~amaliujia], column stats are used in Hive [1], Flink [2], and Spark [3].

[1] https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/plan/ColStatistics.java

[2] https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/plan/stats/ColumnStats.java

[3] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-03 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190362#comment-17190362
 ] 

Julian Hyde commented on CALCITE-4223:
--

Do they really need to be part of {{interface RelOptTable}}? I added 
{{interface Statistic}} and the {{Table.getStatistic()}} method to make it easy 
for people to write user-defined tables. But the intention was only ever to 
deal with simple statistics.

For more complex statistics, use {{RelMetadataQuery}}, and add your own 
metadata provider.

We already have {{RelMdPopulationSize}}, which gives cardinality for single 
columns and groups of columns. What other stats do you have in mind?



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-03 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190324#comment-17190324
 ] 

Rui Wang commented on CALCITE-4223:
---

[~Chunwei Lei]

Out of curiosity: could you share some relevant links/information about 
NDV/column stats? Are such stats used for column-oriented storage?



[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable

2020-09-03 Thread Jiatao Tao (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189917#comment-17189917
 ] 

Jiatao Tao commented on CALCITE-4223:
-

+1. Stats and the cost model are the key to CBO.
