[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022206#comment-17022206 ] ASF GitHub Bot commented on DRILL-7491: --- asfgit commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019469#comment-17019469 ] ASF GitHub Bot commented on DRILL-7491: --- ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r368547053 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: Hello @paul-rogers , could you please check the comment on line 182 in this file? It was added after your first review, and you may have accidentally missed it. Also test for given case was added in ```TestParquetComplex.java```. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019041#comment-17019041 ] ASF GitHub Bot commented on DRILL-7491: --- paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r368323139 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even suggest copying the details into the code somewhere to give future readers additional context. And, do we have a unit test that demonstrates this case? This seems like the kind of obscure issue that someone (such as me) could easily break at some point because the behavior is not obvious.. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019040#comment-17019040 ] ASF GitHub Bot commented on DRILL-7491: --- paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r368321457 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even suggest copying the details into the code somewhere to give future readers additional context. And, do we have a unit test that demonstrates this case? This seems like the kind of obscure issue that someone (such as me) could easily break at some point because the behavior is not obvious.. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017911#comment-17017911 ] ASF GitHub Bot commented on DRILL-7491: --- arina-ielchiieva commented on issue #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#issuecomment-575580335 @paul-rogers Igor had addressed code review comments and left one comment explaining the background of the issue (https://github.com/apache/drill/pull/1955#discussion_r366812967). Could you take a look one more time? Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015839#comment-17015839 ] ASF GitHub Bot commented on DRILL-7491: --- ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366812967 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: Hello @paul-rogers , As says Javadoc for this method, here we're looking for a count of non-null row values of the specific column. There is a special physical rule ```ConvertCountToDirectScanPrule``` purpose which is to reduce IO when we're looking for ```count(col)``` values, since the value may be present in the metadata read while planning. The problem which I'm trying to fix ( [DRILL-7491](https://issues.apache.org/jira/browse/DRILL-7491)) appeared in the method because the method for existing complex column returned **0** as if the column doesn't exist. So physical plan was converted to something like: ``` ScreenPrel ProjectPrel(cr=[$0]) DirectScanPrel(groupscan=[selectionRoot = classpath:/parquet/tmphive/hive_alltypes.parquet, numFiles = 1, usedMetadataSummaryFile = false, usedMetastore = false, DynamicPojoRecordReader{records = [[0]]}]) ``` and as a result incorrect count was returned. In debug mode, I noticed that parquet didn't provide metadata for top-level complex type column, but some metadata exists for nested fields, so I decided to check the existence of metadata to determine whether such complex column exists in parquet file and that scan is necessary to calculate correct row count value. Regarding your comments about the improvement of a metadata structure for complex fields, I think here there is no some special magic in Drill. I suppose that we're just reading & converting metadata from parquet into our model classes. If you have some good ideas on how we can improve our meta, could you please start a discussion at our mailing list to hear ideas of the community? Since the topic is out of scope for this pull request. Kind regards, Igor This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015587#comment-17015587 ] ASF GitHub Bot commented on DRILL-7491: --- paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366674972 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + // When statistics for nested field exists, this is complex column which is present in table. + // But its nested fields statistics can't be used to extract tableRowCount for this column. + // So NO_COLUMN_STATS returned here to avoid problems described in DRILL-7491. + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } -columnStats = columnStats != null ? columnStats : nonInterestingColStats; -nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); -colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; +Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); +if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; +} else { + return tableRowCount - nulls; Review comment: The above is true, but non-interesting. The number of nulls is useful when estimating selectivity of things like `IS NULL` or `IS NOT NULL`. If we are interested in scans, then the null count is not helpful. Given the math here, is the meaning of this function, "return the number of rows with non-null values for this column"? If so, that is somewhat a backward request. More typical is to return the null count directly, or the percentage of nulls. A reason that this value is not as helpful as it could be is if we are using stats to estimate something like `WHERE x = 10 AND y IS NOT NULL`. First, we estimate table row count, which might have been adjusted via partitioning. Then we estimate the selectivity reduction due to `x = 10`. Then we estimate the further reduction due to `y IS NOT NULL`. We cannot do this math if all we have is a number *n* that tell us the number of non-null rows in the table; we'd have to work out the selectivity number ourselves from the table row count. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015583#comment-17015583 ] ASF GitHub Bot commented on DRILL-7491: --- paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r39212 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { Review comment: Nit: `hasNestedStatsForColumn`, or `isCompoundColumnWithStats`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015588#comment-17015588 ] ASF GitHub Bot commented on DRILL-7491: --- paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366676515 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = (columnStats == null) +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } -columnStats = columnStats != null ? columnStats : nonInterestingColStats; -nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); -colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; +Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); +if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; +} else { + return tableRowCount - nulls; +} + } -return Statistic.NO_COLUMN_STATS == tableRowCount -|| Statistic.NO_COLUMN_STATS == colNulls -? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: I wonder about the premise of this function. The Java doc suggests we have the path `a.b`. We want to know if `a` is an uninteresting column. Are we suggesting that we have metadata that says that `a.b` is uninteresting, but we have no metadata for `a` itself? Why would this be? Do we store columns in a non-structured way? That is, do we have a list of columns like 'a', 'b', 'c.d', 'e.f.g', 'c.h', and so on? Rather than `(a, b, c(d, g), e((g)))`? Further we seem to be assuming that we will gather stats for, say, `a.c`, but not `a.b`. The only place I can see that such complexity would make sense is to estimate cardinality for Flatten or Lateral Join. In such a case, might it make more sense to treat this recursively as a nested table? That is 'a` is a table and `a.b` and `a.c` are just `b` and `c` columns in the nested table. Given this, we *must* have a metadata entry for `a` (the map) in order to have metadata for any of its nested columns. So, again, I wonder if we've defined our metadata semantics as clearly as we might. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > >
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015586#comment-17015586 ] ASF GitHub Bot commented on DRILL-7491: --- paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366671949 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: Having a hard time understanding this. If a column is uninteresing, we get the row count from the non-interesting columns metadata. Seems round-about. Should we get the row count from the table itself? That is, the indirection through non-interesting columns to get table metadata seems awkward. Also, if we do have column stats, we get the row count from the table metadata. This raises the question: by column value count, do we mean NDV (number of distinct values)? Otherwise, the column value count for top-level columns is defined as the same as the row count whether the column is interesting or not. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015585#comment-17015585 ] ASF GitHub Bot commented on DRILL-7491: --- paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366673114 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) { } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); } else { - return 0; // returns 0 if the column doesn't exist in the table. + return Statistic.NO_COLUMN_STATS; Review comment: I'm more confused. If this is a structured (complex) column, then it can have nested columns. The nested columns don't add information about this column. (Knowing the number of values in an array of maps does not tell us the cardinality of the map.) Again, if the Map is at the top level, then the value count is row count. If this stat is NDV, then we don't know the NDV if we don't have metadata. I'd even argue that NDV makes no sense for a complex column; it only makes sense for the members of the column. Now, back to Arina's point. The info here tells us something about scans. If I ask only for column `x`, and the table does not contain column `x`, then I don't even need to scan at all, I can just return *n* copies of NULL. (Most query engines would fail the query because the column is undefined. Drill will run the query and return nulls.) However, in practice, the only way to know the correct value of *n* is to do the scan (stats can be out of date.) Still, I don't get why we need *column* value counts. If we do a scan, we want the table row count, we don't care about the column value count. So, I wonder if there is some additional problem here where our use of stats needs some adjusting. If we want to estimate the row count after filtering (that is, the row count seen by, say, a join or sort), then we need the NDV which we can estimate only if we have stats, otherwise we should fall back on heuristic selectivity values. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015584#comment-17015584 ] ASF GitHub Bot commented on DRILL-7491: --- paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366677192 ## File path: logical/src/main/java/org/apache/drill/common/expression/SchemaPath.java ## @@ -334,36 +335,13 @@ public int hashCode() { @Override public boolean equals(Object obj) { -if (this == obj) { - return true; -} -if (obj == null) { - return false; -} -if (!(obj instanceof SchemaPath)) { - return false; -} - -SchemaPath other = (SchemaPath) obj; -if (rootSegment == null) { - return (other.rootSegment == null); -} -return rootSegment.equals(other.rootSegment); +return this == obj || obj instanceof SchemaPath +&& Objects.equals(rootSegment, ((SchemaPath) obj).rootSegment); } public boolean contains(Object obj) { Review comment: `contains()` is not a Java-defined method. Must the signature be `Object`? Or, can it be `SchemaPath` since that is the only kind of thing a `SchemaPath` could ever contain? Would save some casting. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Labels: ready-to-commit > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015124#comment-17015124 ] ASF GitHub Bot commented on DRILL-7491: --- arina-ielchiieva commented on issue #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#issuecomment-574200097 +1, LGTM. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015093#comment-17015093 ] ASF GitHub Bot commented on DRILL-7491: --- KazydubB commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366339423 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = (columnStats == null) +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } -columnStats = columnStats != null ? columnStats : nonInterestingColStats; -nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); -colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; +Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); +if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; +} else { + return tableRowCount - nulls; +} + } -return Statistic.NO_COLUMN_STATS == tableRowCount -|| Statistic.NO_COLUMN_STATS == colNulls -? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: It depends from which perspective to think, but I agree the one proposed by me is confusing. Name still looks clumsy, but with the comments it should be fine. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015074#comment-17015074 ] ASF GitHub Bot commented on DRILL-7491: --- arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366328108 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + // When exists statistic for nested field, this is complex column which is present in table. Review comment: ```suggestion // When statistics for nested field exists, this is complex column which is present in table. ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015073#comment-17015073 ] ASF GitHub Bot commented on DRILL-7491: --- ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366327715 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = (columnStats == null) +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } -columnStats = columnStats != null ? columnStats : nonInterestingColStats; -nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); -colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; +Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); +if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; +} else { + return tableRowCount - nulls; +} + } -return Statistic.NO_COLUMN_STATS == tableRowCount -|| Statistic.NO_COLUMN_STATS == colNulls -? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: sorry @KazydubB , but the proposed new name also makes confusion, since column passed as the first argument isn't a leaf. I've added a comprehensive comment in place of usage, could you confirm that now intentions are clear? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015061#comment-17015061 ] ASF GitHub Bot commented on DRILL-7491: --- KazydubB commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366316876 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = (columnStats == null) +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } -columnStats = columnStats != null ? columnStats : nonInterestingColStats; -nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); -colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; +Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); +if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; +} else { + return tableRowCount - nulls; +} + } -return Statistic.NO_COLUMN_STATS == tableRowCount -|| Statistic.NO_COLUMN_STATS == colNulls -? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: The name is a little bit awkward, indeed. It can be changed to something like `existStatsForLeafColumn` (in the case, 'nested fields' should be changed to 'leaf columns' in the javadoc). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015055#comment-17015055 ] ASF GitHub Bot commented on DRILL-7491: --- ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366313821 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = (columnStats == null) +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } -columnStats = columnStats != null ? columnStats : nonInterestingColStats; -nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); -colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; +Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); +if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; +} else { + return tableRowCount - nulls; +} + } -return Statistic.NO_COLUMN_STATS == tableRowCount -|| Statistic.NO_COLUMN_STATS == colNulls -? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: No, the nested stats exists but not applicable for the parent column. As javadoc says, the method is used only to determine whether the column is actually absent. The current name expresses exactly what the method does. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015051#comment-17015051 ] ASF GitHub Bot commented on DRILL-7491: --- arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366307446 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = (columnStats == null) Review comment: ```suggestion ColumnStatistics nonInterestingColStats = columnStats == null ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015052#comment-17015052 ] ASF GitHub Bot commented on DRILL-7491: --- arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366310138 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = (columnStats == null) +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } -columnStats = columnStats != null ? columnStats : nonInterestingColStats; -nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); -colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; +Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); +if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; +} else { + return tableRowCount - nulls; +} + } -return Statistic.NO_COLUMN_STATS == tableRowCount -|| Statistic.NO_COLUMN_STATS == colNulls -? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: I am not quite sure I follow this method logic because when you call this method you return that there is no statistics is present but method states that it checks that nested column statistics exists. I guess we might need to rename method to depict exactly what it does, for example checking if column exists. By the way, if stats for nested column exists, can we use it for calculation? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015039#comment-17015039 ] ASF GitHub Bot commented on DRILL-7491: --- ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366303787 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) { } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); } else { - return 0; // returns 0 if the column doesn't exist in the table. + return Statistic.NO_COLUMN_STATS; Review comment: Hello @arina-ielchiieva , thank you very much for describing the issue. I've updated this PR, please take another look. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.18.0 > > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014988#comment-17014988 ] ASF GitHub Bot commented on DRILL-7491: --- arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366250930 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) { } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); } else { - return 0; // returns 0 if the column doesn't exist in the table. + return Statistic.NO_COLUMN_STATS; Review comment: @ihuzenko I am not sure that change is correct. As you can see from java-doc and comment that you have removed, 0 is returned deliberately to avoid full scan if column does not exist. Here is an example of unit test that shows that with your change, you are breaking existing functionality, you can add it to `org.apache.drill.exec.planner.logical.TestConvertCountToDirectScan` class: ``` @Test public void textConvertAbsentColumn() throws Exception { String sql = "select count(abc) as cnt from cp.`tpch/nation.parquet`"; queryBuilder() .sql(sql) .planMatcher() .include("DynamicPojoRecordReader") .match(); testBuilder() .sqlQuery(sql) .unOrdered() .baselineColumns("cnt") .baselineValues(0L) .go(); } ``` After your changes, this test fill fail. I think you need to find a way to determine if column is absent or complex. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet
[ https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014648#comment-17014648 ] ASF GitHub Bot commented on DRILL-7491: --- ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955 # [DRILL-7491](https://issues.apache.org/jira/browse/DRILL-7491): Incorrect count() returned for complex types in parquet ## Description Convert ConvertCountToDirectScanPrule was applied to existing complex type columns with absent rowCount metadata. Returning ```Statistic.NO_COLUMN_STATS``` fixes the issue. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Incorrect count() returned for complex types in parquet > --- > > Key: DRILL-7491 > URL: https://issues.apache.org/jira/browse/DRILL-7491 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive, Storage - Parquet >Affects Versions: 1.16.0, 1.17.0 >Reporter: Igor Guzenko >Assignee: Igor Guzenko >Priority: Major > Attachments: hive_alltypes.parquet > > > To reproduce use the attached file for {{hive_alltypes.parquet}} (this is > parquet file generated by Hive) and try count on columns *c13 - c15.* For > example, > {code:sql} > SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet` > {code} > *Expected result:* {color:green}3 {color} > *Actual result:* {color:red}0{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)