[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567907#comment-17567907 ]

Alessandro Solimando commented on CALCITE-4223:
-----------------------------------------------

I have reviewed the changes and LGTM.

> Introducing column statistics to RelOptTable
> --------------------------------------------
>
>                 Key: CALCITE-4223
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4223
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Chunwei Lei
>            Assignee: Chunwei Lei
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Many systems depend on column statistics to compute more accurate stats, such
> as NDV, average column size, and so on. It would be nice if Calcite can
> provide such an interface.
> Column statistics might include NDV, average/max column length, number of
> nulls, number of trues, number of falses and so on.
> What do you think?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567773#comment-17567773 ]

Chunwei Lei commented on CALCITE-4223:
--------------------------------------

I would like to merge the PR in the next 48 hours if there are no objections.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562492#comment-17562492 ]

Chunwei Lei commented on CALCITE-4223:
--------------------------------------

Could you please review the PR, [~julianhyde]?
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562018#comment-17562018 ]

Chunwei Lei commented on CALCITE-4223:
--------------------------------------

Thank you for your reply, [~asolimando]. I would like to continue this work. Hope it can help.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
Alessandro Solimando commented on CALCITE-4223:
-----------------------------------------------

I have recently been working on adding histogram statistics for HIVE-26221, and your PR seems to be going in the right direction, IMO.

In my experience with Hive as a downstream project, I feel that Julian is right: in Hive we went for subclassing RelOptTable as RelOptHiveTable, and we are adding column statistics support directly there. I feel that the approach based on the unwrap method looks cleaner and more composable.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
Chunwei Lei commented on CALCITE-4223:
--------------------------------------

It has been a long time since the last discussion. I recently found time to move this forward. After reviewing all of the discussion above, I opened a new PR as Julian suggested: https://github.com/apache/calcite/pull/2845.

I am not sure whether it is the best way to introduce column stats, so I would like to hear what others think about it, especially those involved in downstream projects.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212173#comment-17212173 ]

Chunwei Lei commented on CALCITE-4223:
--------------------------------------

I will check it. Thank you for your reply. (I am just back from vacation. :D)
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206840#comment-17206840 ]

Julian Hyde commented on CALCITE-4223:
--------------------------------------

I did some more work here: https://github.com/julianhyde/calcite/tree/4223-metadata
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17205961#comment-17205961 ]

Julian Hyde commented on CALCITE-4223:
--------------------------------------

I started writing a PR, but there is already a perfect example. In CALCITE-1861 I wanted a table to be able to declare its own predicates. So I added 4 lines to {{getAllPredicates}}:

{code}
public RelOptPredicateList getAllPredicates(TableScan scan,
    RelMetadataQuery mq) {
  final BuiltInMetadata.AllPredicates.Handler handler =
      scan.getTable().unwrap(BuiltInMetadata.AllPredicates.Handler.class);
  if (handler != null) {
    return handler.getAllPredicates(scan, mq);
  }
  return RelOptPredicateList.EMPTY;
}
{code}

and I made a mock table that called {{addWrap}} to add its own implementation of {{AllPredicates.Handler}}:

{code}
restaurantTable.addWrap(
    new BuiltInMetadata.AllPredicates.Handler() {
      public RelOptPredicateList getAllPredicates(RelNode r,
          RelMetadataQuery mq) {
        ...
      }
    });
{code}

Put a break-point in {{RelMdAllPredicates.getAllPredicates(TableScan scan, RelMetadataQuery mq)}} and run {{RelOptRulesTest.testSpatialDWithinToHilbert}} and see how your break-point gets hit.

We need to check for handlers in all {{RelMdXxx.getXxx(TableScan, RelMetadataQuery)}} methods.

If I did it again I'd implement {{interface AllPredicates}} rather than {{interface AllPredicates.Handler}}, but the principle is the same.
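The handler lookup described above can be sketched without any Calcite dependency. Everything in this block is a stand-in written for illustration: `MockTable`, `PredicateHandler`, and `Predicates` are hypothetical names, and predicates are plain strings rather than `RelOptPredicateList`. It only shows the shape of the pattern: the table collects "wraps", and the metadata method asks the table for a handler before falling back to a default.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical handler interface, standing in for AllPredicates.Handler. */
interface PredicateHandler {
  List<String> predicates();
}

/** Stand-in mock table that collects "wraps" and serves them from unwrap. */
final class MockTable {
  private final List<Object> wraps = new ArrayList<>();

  /** Registers an object that unwrap may later hand out. */
  void addWrap(Object wrap) {
    wraps.add(wrap);
  }

  /** Returns the table itself or the first registered wrap of the given type. */
  <C> C unwrap(Class<C> aClass) {
    if (aClass.isInstance(this)) {
      return aClass.cast(this);
    }
    for (Object wrap : wraps) {
      if (aClass.isInstance(wrap)) {
        return aClass.cast(wrap);
      }
    }
    return null;
  }
}

/** Metadata method: prefer the table's handler, else return an empty list. */
final class Predicates {
  private Predicates() {}

  static List<String> allPredicates(MockTable table) {
    PredicateHandler handler = table.unwrap(PredicateHandler.class);
    if (handler != null) {
      return handler.predicates();
    }
    return new ArrayList<>();
  }
}
```

A table with no registered handler yields the empty default; once a handler is added with `addWrap`, the same call returns whatever the handler supplies.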
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203237#comment-17203237 ]

Chunwei Lei commented on CALCITE-4223:
--------------------------------------

{quote}
I don't know why Flink and Drill have not integrated their statistics into Calcite. Maybe they didn't know how. They could have asked. Or we could have written better documentation.
{quote}

I think it is not good practice if users have to ask the community how to do it, and how to do it correctly.

I have to admit that my proposal is not as extensible as expected. What I want to do is introduce column statistics and take advantage of them explicitly.

BTW, it would be great if you could open a PR to show how to introduce column statistics in the way you think it should be done. Thank you for your time.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203104#comment-17203104 ]

Julian Hyde commented on CALCITE-4223:
--------------------------------------

See above, my suggestion to expose interfaces such as {{BuiltInMetadata.Size}} when someone calls {{unwrap(BuiltInMetadata.Size.class)}} on your {{RelOptTable}} or {{Table}}. Then change methods such as {{averageColumnSizes(TableScan rel, RelMetadataQuery mq)}} to look for that interface and call it.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202982#comment-17202982 ]

Danny Chen commented on CALCITE-4223:
-------------------------------------

> Maybe they didn't know how

We write our own metadata and check the {{ColumnStatistic}} in it. For the common metadata that we extend from Calcite, yes, we have no good way to integrate.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202938#comment-17202938 ]

Julian Hyde commented on CALCITE-4223:
--------------------------------------

I don't see how Flink's {{ColumnStats}} and Drill's {{ColumnStatistics}} play into this. I would expect each engine, Flink and Drill in this case, to have their own data structure(s) to store statistics. But then we need to make those statistics available to rules in Calcite that are not aware of the engine and its particular statistics data structure.

I don't know why Flink and Drill have not integrated their statistics into Calcite. Maybe they didn't know how. They could have asked. Or we could have written better documentation.

Using a Java interface is a poor choice for extensibility. Let's suppose that we add your {{interface ColumnStatistics}} to Calcite. Let's suppose that Drill creates {{interface DrillColumnStatistics extends ColumnStatistics}} with one extra method, and Flink creates {{interface FlinkColumnStatistics extends ColumnStatistics}} with two extra methods. Now there's no interface with all of the extra methods.

Calcite's approach is to make each statistic an interface with one method (or occasionally two, if closely related). So an engine can implement the ones it has, and ignore the others. It is a better extensibility story than what you propose.
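The "one interface per statistic" argument can be illustrated with a small, self-contained sketch. None of the names below exist in Calcite, Flink, or Drill: `NullCount`, `MaxLength`, `Histogram`, `StatProvider`, and the two "engines" are hypothetical, and the returned numbers are placeholders. The point is only that each engine registers the statistics it has, and a caller probes for one statistic at a time.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical one-method statistic interfaces (not Calcite classes). */
interface NullCount { double nullCount(int column); }
interface MaxLength { double maxLength(int column); }
interface Histogram { double distinctBetween(int column, double lo, double hi); }

/** Minimal provider: maps each statistic interface to an implementation. */
final class StatProvider {
  private final Map<Class<?>, Object> stats = new HashMap<>();

  /** Registers an implementation for one statistic interface. */
  <C> StatProvider with(Class<C> type, C impl) {
    stats.put(type, impl);
    return this;
  }

  /** Returns the statistic, or null if this engine does not have it. */
  <C> C unwrap(Class<C> type) {
    return type.cast(stats.get(type));
  }
}

final class Engines {
  private Engines() {}

  /** "Engine A" knows null counts and maximum lengths. */
  static StatProvider engineA() {
    return new StatProvider()
        .with(NullCount.class, column -> 3d)
        .with(MaxLength.class, column -> 64d);
  }

  /** "Engine B" knows null counts and also has a histogram. */
  static StatProvider engineB() {
    return new StatProvider()
        .with(NullCount.class, column -> 0d)
        .with(Histogram.class, (column, lo, hi) -> 42d);
  }
}
```

With this shape, adding a new kind of statistic means adding one small interface. Neither engine has to change, and no shared super-interface has to grow.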
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202725#comment-17202725 ]

Chunwei Lei commented on CALCITE-4223:
--------------------------------------

Let's think about it in another way. A good interface is usually extensible and is easily and widely used by others, right? But what I find is that nobody uses the way you propose to introduce column statistics, including Flink [1] and Drill [2]. I think it deserves a second thought.

I am also glad to hear what others think about it.

[1] https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/plan/stats/ColumnStats.java
[2] https://github.com/apache/drill/blob/master/metastore/metastore-api/src/main/java/org/apache/drill/metastore/statistics/ColumnStatistics.java
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200258#comment-17200258 ]

Julian Hyde commented on CALCITE-4223:
--------------------------------------

And you need to implement the {{unwrap(Class)}} method in your class that implements {{interface Table}}. How you do that is your business.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200256#comment-17200256 ]

Julian Hyde commented on CALCITE-4223:
--------------------------------------

[~Chunwei Lei], We don't need to change {{interface RelOptTable}} at all. We don't need a new {{interface ColumnStatistics}}. But we should change all of the metadata methods that deal with table scans to see whether the table has the statistics so that we can return a better result. For example:

{noformat}
diff --git a/core/src/main/java/org/apache/calcite/rel/metadata/RelMdSize.java b/core/src/main/java/org/apache/calcite/rel/metadata/RelMdSize.java
index 458df6b34..d50e32a51 100644
--- a/core/src/main/java/org/apache/calcite/rel/metadata/RelMdSize.java
+++ b/core/src/main/java/org/apache/calcite/rel/metadata/RelMdSize.java
@@ -172,6 +172,11 @@ public Double averageRowSize(RelNode rel, RelMetadataQuery mq) {
   public List<Double> averageColumnSizes(TableScan rel, RelMetadataQuery mq) {
     final List<RelDataTypeField> fields = rel.getRowType().getFieldList();
+    final BuiltInMetadata.Size size =
+        rel.getTable().unwrap(BuiltInMetadata.Size.class);
+    if (size != null && size.averageColumnSizes() != null) {
+      return size.averageColumnSizes();
+    }
     final ImmutableList.Builder<Double> list = ImmutableList.builder();
     for (RelDataTypeField field : fields) {
       list.add(averageTypeValueSize(field.getType()));
{noformat}
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199349#comment-17199349 ]

Liya Fan commented on CALCITE-4223:
-----------------------------------

IMHO, I like the idea of a quick path for some important table statistics.

Admittedly, RelMetadata is a powerful tool. However, it can be overkill for some scenarios:
1) It can be expensive, due to the cost of code-gen and compilation.
2) It may be inaccurate, because some of its logic is intended for general purposes.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199143#comment-17199143 ]

Chunwei Lei commented on CALCITE-4223:
--------------------------------------

Do you mean that we don't need to make any changes, since we already have such a framework for users who want to introduce column statistics?
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198612#comment-17198612 ]

Julian Hyde commented on CALCITE-4223:
--------------------------------------

OK, I missed that it extends {{Wrapper}}. Still, there's no reason to create {{ColStatistics}} as a new interface. {{RelOptTable}} has all the extension points you need.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198161#comment-17198161 ]

Chunwei Lei commented on CALCITE-4223:
--------------------------------------

{quote}
That doesn't work, because your {{ColStatistics}} doesn't have an {{unwrap}} method.
{quote}

{{ColStatistics}} has an {{unwrap}} method since it extends {{Wrapper}}.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198115#comment-17198115 ]

Julian Hyde commented on CALCITE-4223:
--------------------------------------

bq. Users can implement ColStatistics and add new methods. Then use unwrap() to get the customized ColStatistics.

That doesn't work, because your {{ColStatistics}} doesn't have an {{unwrap}} method.

bq. I think ... is much more straightforward and readable.

I don't think readability is the most important metric. The problem is to plug together providers, which is hard. A robust solution is bound to be somewhat complex.

bq. Besides, does it mean that RelOptTable has to implement interfaces like BuiltInMetadata.Size/BuiltInMetadata.DistinctRowCount in your proposal?

No. When I call {{unwrap}} on an object, it doesn't have to return itself. Typically it will return a lambda.
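The last point, that `unwrap` need not return the object itself, can be shown with a minimal sketch. The `Wrapper` interface below is a simplified stand-in for Calcite's interface of the same name, and `AverageRowSize` and `SalesTable` are hypothetical names invented for this example. The table class does not implement the statistic interface; it hands back a lambda when asked for it.

```java
/** Simplified stand-in for Calcite's Wrapper interface. */
interface Wrapper {
  <C> C unwrap(Class<C> aClass);
}

/** Hypothetical single-method statistic interface. */
interface AverageRowSize {
  double averageRowSize();
}

/** A table that does not itself implement AverageRowSize. */
final class SalesTable implements Wrapper {
  private final long totalBytes;
  private final long rowCount;

  SalesTable(long totalBytes, long rowCount) {
    this.totalBytes = totalBytes;
    this.rowCount = rowCount;
  }

  @Override public <C> C unwrap(Class<C> aClass) {
    if (aClass == AverageRowSize.class) {
      // Hand back a lambda that computes the statistic on demand.
      AverageRowSize stat = () -> (double) totalBytes / rowCount;
      return aClass.cast(stat);
    }
    return null; // this table does not support the requested type
  }
}
```

A caller writes `table.unwrap(AverageRowSize.class)` and checks the result for null, without caring which object actually implements the statistic.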
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198105#comment-17198105 ]

Chunwei Lei commented on CALCITE-4223:
--------------------------------------

Thank you for your review, Julian.

{quote}
it does not easily allow people to add new kinds of metadata, and it does not accommodate differences in data structures that may have more information (e.g. a system that has a histogram that returns not just number of distinct values, but the number of distinct values between 100 and 1000)
{quote}

Users can implement {{ColStatistics}} and add new methods, then use {{unwrap()}} to get the customized {{ColStatistics}}.

Compared with {{table.unwrap(BuiltInMetadata.Size.class)}}, I think {{sum(table.getColumnStatistics(col).getAvgColLen())}} is much more straightforward and readable.

Besides, does it mean that {{RelOptTable}} has to implement interfaces like {{BuiltInMetadata.Size}} and {{BuiltInMetadata.DistinctRowCount}} in your proposal?
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197526#comment-17197526 ]

Julian Hyde commented on CALCITE-4223:
--------------------------------------

I see in the PR you have created {{interface ColStatistics}} and added a method to {{RelOptTable}} to get it. I said above, and still think, that this is not the right approach. It does not easily allow people to add new kinds of metadata, and it does not accommodate differences in data structures that may have more information (e.g. a system that has a histogram that returns not just number of distinct values, but the number of distinct values between 100 and 1000).

I introduced {{interface Statistics}} to make the simple case easy. It is not a template that we should try to extend.

Suppose you could query any metadata interface on a {{RelOptTable}} using {{unwrap}}. Then you can easily implement metadata. For example, in {{RelMdSize}}:

{code}
public Double averageRowSize(TableScan scan, RelMetadataQuery mq) {
  final RelOptTable table = scan.getTable();
  final BuiltInMetadata.Size size = table.unwrap(BuiltInMetadata.Size.class);
  if (size != null) {
    return size.averageRowSize();
  }
  return null;
}
{code}

I think that is much more elegant and straightforward. Of course the implementor of the particular type of table will have to implement the necessary interfaces, but I don't think that will be hard.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193391#comment-17193391 ]

Chunwei Lei commented on CALCITE-4223:
--------------------------------------

Thank you for your helpful insight, [~julianhyde]. I will start to prepare the PR.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192721#comment-17192721 ]

Julian Hyde commented on CALCITE-4223:

bq. 1) Do you agree to introduce column statistics?

Yes. I believe that {{RelMdPopulationSize}} and {{RelMdDistinctRowCount}} give you unfiltered and filtered NDV. But feel free to propose other statistics.

bq. 2) If so, where should we put them? (RelOptTable? Statistics? Or other places)

You will need to do two things: first, store the statistics, and second, make them accessible (e.g. to planner rules, other statistics and the cost model).

To store them, very likely you will put a data structure such as a sketch in your {{Table}} and make it accessible via the {{RelOptTable}} that wraps it. Both of these can implement {{Wrapper}} to give access to the data structures holding the statistics.

Then you should make them accessible via new or existing statistics methods along the lines of {{RelMetadataQuery.getPopulationSize(RelNode, ImmutableBitSet)}}. You will obviously want to implement for {{TableScan}} but should try to implement for other {{RelNode}} sub-types as well.
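The two steps described above (store the statistics in the table, then reach them through the planner-side wrapper) might look roughly like the following. This is a sketch under stated assumptions: every type here is a hypothetical stand-in rather than a Calcite class, and {{DistinctCounter}} uses an exact set where a real system would more likely keep an approximate sketch.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-ins for illustration; these are not Calcite classes. */
final class StatsStorageSketch {
  interface Wrapper {
    /** Returns an instance of the requested class, or null if unsupported. */
    <C> C unwrap(Class<C> aClass);
  }

  /** Stand-in statistics structure: an exact distinct-value counter. */
  static final class DistinctCounter {
    private final Set<Object> seen = new HashSet<>();

    void add(Object value) {
      seen.add(value);
    }

    long ndv() {
      return seen.size();
    }
  }

  /** Stand-in for the user's table, which owns the statistics. */
  static final class MyTable implements Wrapper {
    final DistinctCounter counter = new DistinctCounter();

    @Override public <C> C unwrap(Class<C> aClass) {
      if (aClass.isInstance(counter)) {
        return aClass.cast(counter);
      }
      return aClass.isInstance(this) ? aClass.cast(this) : null;
    }
  }

  /** Stand-in for the planner-side table that wraps the user's table. */
  static final class PlannerTable implements Wrapper {
    private final MyTable table;

    PlannerTable(MyTable table) {
      this.table = table;
    }

    @Override public <C> C unwrap(Class<C> aClass) {
      // Answer for ourselves first, then forward to the wrapped table.
      return aClass.isInstance(this) ? aClass.cast(this) : table.unwrap(aClass);
    }
  }

  /** Accessor in the spirit of a population-size query; null means unknown. */
  static Long populationSize(Wrapper table) {
    final DistinctCounter counter = table.unwrap(DistinctCounter.class);
    if (counter == null) {
      return null;
    }
    return counter.ndv();
  }
}
```

Because the planner-side table forwards {{unwrap}} calls it cannot answer itself, the statistics structure stays an implementation detail of the table that owns it.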
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192601#comment-17192601 ]

Chunwei Lei commented on CALCITE-4223:

There is always an alternative because the interface is flexible. But I think it would be great if Calcite had column statistics, since they help generate a better plan.

Even if column statistics are provided, we would still get statistics such as NDV through {{RelMetadataQuery}}. What we need to do is add or modify some methods in the {{RelMdXxx}} classes. Taking NDV as an example:

{code:java}
// RelMdDistinctRowCount.java
public Double getDistinctRowCount(TableScan rel, RelMetadataQuery mq,
    ImmutableBitSet groupKey, RexNode predicate) {
  ..
  // Look up the per-column NDVs for the columns in the group key.
  List allDistinctValue = getAllDistinctValue(rel.getTable().getColumnStatistics(), groupKey);
  // Combine them into a single NDV for the whole key.
  return getJointDistinctValue(allDistinctValue);
}
{code}

Regarding there being no fixed definition of statistics, we can provide some basic and frequently used column statistics, including NDV, average column size and null count.

Let me summarize to help others follow the discussion. There are two questions:

1) Do you agree to introduce column statistics?
2) If so, where should we put them? (RelOptTable? Statistics? Or other places)
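The comment above does not say how per-column NDVs would be combined into a joint NDV for a multi-column group key. One common heuristic, shown here purely as an assumption and not as what the PR does, is to multiply the per-column values (treating the columns as independent) and cap the result at the row count. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; one possible way to combine per-column NDVs. */
final class JointNdvSketch {
  /**
   * Estimates the number of distinct combinations of the given columns,
   * assuming independence, and never returning more than rowCount.
   */
  static double jointNdv(List<Double> columnNdvs, double rowCount) {
    double product = 1d;
    for (double ndv : columnNdvs) {
      product *= ndv;
      if (product >= rowCount) {
        // There cannot be more distinct combinations than rows.
        return rowCount;
      }
    }
    return Math.min(product, rowCount);
  }

  public static void main(String[] args) {
    System.out.println(jointNdv(Arrays.asList(10d, 5d), 1000d));
    System.out.println(jointNdv(Arrays.asList(100d, 100d), 1000d));
  }
}
```

The independence assumption tends to overestimate when columns are correlated (for example city and state), which is one reason an engine may prefer to plug in its own provider.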
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192361#comment-17192361 ]

Julian Hyde commented on CALCITE-4223:

[~Chunwei Lei] wrote:

bq. RelOptTable does not provide a method to get Statistic.

Now that I've told you about {{Wrapper.unwrap}}, can you reconsider this request to modify {{RelOptTable}}?
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192359#comment-17192359 ]

Julian Hyde commented on CALCITE-4223:

I don't think we should extend {{interface RelOptTable}} or {{interface Statistic}}. There are a few reasons.

First, {{RelOptTable}} already extends {{interface Wrapper}}. If you want your implementation of {{RelOptTable}} to provide some other interface, you can get it via {{Wrapper.unwrap()}}.

Second, there is no fixed definition of statistics. Different engines are going to have different variations. One engine might have a histogram that tells you the number of distinct values of productId when state = 'CA'; another might not. That's why we created the {{RelMetadataQuery}} framework.

Third and fourth, we will need caching, and a way to plug in multiple providers. Again, that's what the {{RelMetadataQuery}} framework is for.

Why not write a convenience method?

{code}
// Sketch only: TableScan is abstract, so real code would need a concrete scan of the table.
// getPopulationSize may return null when no provider can answer, hence the Double return type.
Double getNdv(RelMetadataQuery mq, RelOptTable table, int column) {
  return mq.getPopulationSize(new TableScan(table), ImmutableBitSet.of(column));
}
{code}
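The "caching" and "multiple providers" points above can be illustrated in isolation. The following is a toy model, not Calcite's {{RelMetadataQuery}} implementation: {{NdvProvider}} and {{NdvQuery}} are hypothetical names, providers are tried in order until one returns a non-null answer, and each answer (including "unknown") is cached per table and column.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a provider chain with caching; not Calcite's implementation. */
final class ProviderChainSketch {
  /** A source of NDV estimates; returns null when it cannot answer. */
  interface NdvProvider {
    Double ndv(String table, int column);
  }

  /** Asks providers in order and remembers the first non-null answer. */
  static final class NdvQuery {
    private final List<NdvProvider> providers;
    private final Map<String, Double> cache = new HashMap<>();
    /** Counts provider invocations, to make the effect of caching visible. */
    int providerCalls = 0;

    NdvQuery(List<NdvProvider> providers) {
      this.providers = new ArrayList<>(providers);
    }

    Double ndv(String table, int column) {
      final String key = table + "#" + column;
      if (cache.containsKey(key)) {
        return cache.get(key);
      }
      Double result = null;
      for (NdvProvider provider : providers) {
        providerCalls++;
        result = provider.ndv(table, column);
        if (result != null) {
          break;
        }
      }
      // Cache "unknown" (null) too, so we do not ask again.
      cache.put(key, result);
      return result;
    }
  }
}
```

An engine-specific provider placed first in the list can then override a generic fallback without either one knowing about the other.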
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191916#comment-17191916 ]

Chunwei Lei commented on CALCITE-4223:

Thanks for sharing, [~danny0405]. IMO, one inconvenience is that Flink has to implement its own {{RelOptTable}} to get the {{Statistic}}, because {{RelOptTable}} does not provide a method to get it.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191432#comment-17191432 ]

Danny Chen commented on CALCITE-4223:

I want to share how Apache Flink handles this. We subclass {{Statistic}} with a structure named {{FlinkStatistic}}, which keeps an optional member named {{TableStats}}; {{TableStats}} in turn holds a {{ColumnStats}} for each column.
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191409#comment-17191409 ]

Chunwei Lei commented on CALCITE-4223:

Let me make myself clear. What I have in mind is a method called {{getColumnStatistics()}} in {{RelOptTable}}. With it, we can easily get the NDV (or other statistics) of a {{TableScan}} via {{rel.getTable().getColumnStatistics().get(columnName).getNdv()}}. Its implementation can be delegated to {{Statistic#getColumnStatistics()}}.

Do you have any other ideas?
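To make the shape of this proposal concrete, here is a minimal sketch of a table that exposes per-column statistics by column name. All names ({{ColumnStats}}, {{SimpleTable}}, the chosen fields) are hypothetical and chosen for illustration; this is neither the PR's code nor an existing Calcite API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the proposed access path; not Calcite code. */
final class ColumnStatisticsSketch {
  /** A few of the statistics named in the issue description. */
  static final class ColumnStats {
    private final long ndv;
    private final long nullCount;
    private final double avgLength;

    ColumnStats(long ndv, long nullCount, double avgLength) {
      this.ndv = ndv;
      this.nullCount = nullCount;
      this.avgLength = avgLength;
    }

    long getNdv() {
      return ndv;
    }

    long getNullCount() {
      return nullCount;
    }

    double getAvgLength() {
      return avgLength;
    }
  }

  /** Stand-in table that maps column names to their statistics. */
  static final class SimpleTable {
    private final Map<String, ColumnStats> stats = new HashMap<>();

    SimpleTable with(String column, ColumnStats columnStats) {
      stats.put(column, columnStats);
      return this;
    }

    /** Read-only view; columns without statistics are simply absent. */
    Map<String, ColumnStats> getColumnStatistics() {
      return Collections.unmodifiableMap(stats);
    }
  }

  /** Returns the NDV of a column, or null when no statistics were collected. */
  static Long ndvOf(SimpleTable table, String column) {
    final ColumnStats columnStats = table.getColumnStatistics().get(column);
    if (columnStats == null) {
      return null;
    }
    return columnStats.getNdv();
  }
}
```

Note that the chained call in the comment would throw a NullPointerException for a column without statistics if the map lookup returned null, so a caller needs a check like the one in {{ndvOf}}.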
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190821#comment-17190821 ]

Julian Hyde commented on CALCITE-4223:

[~Chunwei Lei], But do the column statistics have to be *in RelOptTable*?
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190504#comment-17190504 ]

Chunwei Lei commented on CALCITE-4223:

[~julianhyde], column statistics might include NDV, average/max column length, number of nulls, number of trues, number of falses, and TopK. Some systems, like Hive[1], provide a command to collect these stats.

Provided we have such column stats, we can:

1) get a more accurate NDV for a table scan than an estimate would give.
2) estimate the size of the inputs of a Join more accurately when the column types include varchar (because we have the average column length), which helps decide whether to use HashJoin or MergeJoin.

[1] https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics
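Point 2 above can be illustrated with a small calculation. This is a sketch under stated assumptions: the input size is taken to be row count times the sum of average column lengths, and a hash join is chosen only when the smaller input fits a memory budget. The class, the method names and the budget rule are hypothetical, not a rule taken from Calcite or from the PR.

```java
/** Hypothetical illustration of using average column length in a join choice. */
final class JoinSizeSketch {
  /** Estimated input size in bytes: rowCount * sum of average column lengths. */
  static double estimatedBytes(double rowCount, double[] avgColumnLengths) {
    double rowBytes = 0d;
    for (double length : avgColumnLengths) {
      rowBytes += length;
    }
    return rowCount * rowBytes;
  }

  /**
   * Picks a hash join when the smaller input (the build side) fits in the
   * given memory budget, and a merge join otherwise.
   */
  static String chooseJoin(double leftBytes, double rightBytes,
      double memoryBudgetBytes) {
    final double buildSideBytes = Math.min(leftBytes, rightBytes);
    return buildSideBytes <= memoryBudgetBytes ? "HashJoin" : "MergeJoin";
  }

  public static void main(String[] args) {
    // 1,000 rows with an int column (4 bytes) and a varchar averaging 20 bytes.
    final double small = estimatedBytes(1000d, new double[] {4d, 20d});
    final double large = estimatedBytes(1e8, new double[] {4d, 200d});
    System.out.println(chooseJoin(small, large, 1e6));
  }
}
```

Without the average length, a varchar column would have to be sized from its declared maximum or a fixed guess, which can push the estimate far enough off to flip the choice.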
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190493#comment-17190493 ]

Chunwei Lei commented on CALCITE-4223:

[~amaliujia], column stats are used in Hive[1], Flink[2], and Spark[3].

[1] [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/plan/ColStatistics.java]
[2] [https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/plan/stats/ColumnStats.java]
[3] [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala]
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190362#comment-17190362 ]

Julian Hyde commented on CALCITE-4223:

Do they really need to be part of {{interface RelOptTable}}?

I added {{interface Statistic}} and the {{Table.getStatistic()}} method to make it easy for people to write user-defined tables. But the intention was only ever to deal with simple statistics. For more complex statistics, use {{RelMetadataQuery}}, and add your own metadata provider.

We already have {{RelMdPopulationSize}}, which gives cardinality for single columns and groups of columns. What other stats do you have in mind?
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190324#comment-17190324 ]

Rui Wang commented on CALCITE-4223:

[~Chunwei Lei] Out of curiosity: could you share some relevant links or information about NDV and column stats? Are such stats used for column-oriented storage?
[jira] [Commented] (CALCITE-4223) Introducing column statistics to RelOptTable
[ https://issues.apache.org/jira/browse/CALCITE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189917#comment-17189917 ]

Jiatao Tao commented on CALCITE-4223:

+1. Stats and the cost model are the key to CBO.