[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r368323139 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even suggest copying the details into the code somewhere to give future readers additional context. And, do we have a unit test that demonstrates this case? This seems like the kind of obscure issue that someone (such as me) could easily break at some point because the behavior is not obvious.. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r368321457 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even suggest copying the details into the code somewhere to give future readers additional context. And, do we have a unit test that demonstrates this case? This seems like the kind of obscure issue that someone (such as me) could easily break at some point because the behavior is not obvious.. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r39212 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { Review comment: Nit: `hasNestedStatsForColumn`, or `isCompoundColumnWithStats`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366673114 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) { } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); } else { - return 0; // returns 0 if the column doesn't exist in the table. + return Statistic.NO_COLUMN_STATS; Review comment: I'm more confused. If this is a structured (complex) column, then it can have nested columns. The nested columns don't add information about this column. (Knowing the number of values in an array of maps does not tell us the cardinality of the map.) Again, if the Map is at the top level, then the value count is row count. If this stat is NDV, then we don't know the NDV if we don't have metadata. I'd even argue that NDV makes no sense for a complex column; it only makes sense for the members of the column. Now, back to Arina's point. The info here tells us something about scans. If I ask only for column `x`, and the table does not contain column `x`, then I don't even need to scan at all, I can just return *n* copies of NULL. (Most query engines would fail the query because the column is undefined. Drill will run the query and return nulls.) However, in practice, the only way to know the correct value of *n* is to do the scan (stats can be out of date.) Still, I don't get why we need *column* value counts. If we do a scan, we want the table row count, we don't care about the column value count. So, I wonder if there is some additional problem here where our use of stats needs some adjusting. If we want to estimate the row count after filtering (that is, the row count seen by, say, a join or sort), then we need the NDV which we can estimate only if we have stats, otherwise we should fall back on heuristic selectivity values. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366674972 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + // When statistics for nested field exists, this is complex column which is present in table. + // But its nested fields statistics can't be used to extract tableRowCount for this column. + // So NO_COLUMN_STATS returned here to avoid problems described in DRILL-7491. + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } -columnStats = columnStats != null ? columnStats : nonInterestingColStats; -nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); -colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; +Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); +if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; +} else { + return tableRowCount - nulls; Review comment: The above is true, but non-interesting. The number of nulls is useful when estimating selectivity of things like `IS NULL` or `IS NOT NULL`. If we are interested in scans, then the null count is not helpful. Given the math here, is the meaning of this function, "return the number of rows with non-null values for this column"? If so, that is somewhat a backward request. More typical is to return the null count directly, or the percentage of nulls. A reason that this value is not as helpful as it could be is if we are using stats to estimate something like `WHERE x = 10 AND y IS NOT NULL`. First, we estimate table row count, which might have been adjusted via partitioning. Then we estimate the selectivity reduction due to `x = 10`. Then we estimate the further reduction due to `y IS NOT NULL`. We cannot do this math if all we have is a number *n* that tell us the number of non-null rows in the table; we'd have to work out the selectivity number ourselves from the table row count. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366677192 ## File path: logical/src/main/java/org/apache/drill/common/expression/SchemaPath.java ## @@ -334,36 +335,13 @@ public int hashCode() { @Override public boolean equals(Object obj) { -if (this == obj) { - return true; -} -if (obj == null) { - return false; -} -if (!(obj instanceof SchemaPath)) { - return false; -} - -SchemaPath other = (SchemaPath) obj; -if (rootSegment == null) { - return (other.rootSegment == null); -} -return rootSegment.equals(other.rootSegment); +return this == obj || obj instanceof SchemaPath +&& Objects.equals(rootSegment, ((SchemaPath) obj).rootSegment); } public boolean contains(Object obj) { Review comment: `contains()` is not a Java-defined method. Must the signature be `Object`? Or, can it be `SchemaPath` since that is the only kind of thing a `SchemaPath` could ever contain? Would save some casting. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366676515 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = (columnStats == null) +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; +} else if (existsNestedStatsForColumn(column, getTableMetadata()) +|| existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } -columnStats = columnStats != null ? columnStats : nonInterestingColStats; -nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); -colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; +Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); +if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; +} else { + return tableRowCount - nulls; +} + } -return Statistic.NO_COLUMN_STATS == tableRowCount -|| Statistic.NO_COLUMN_STATS == colNulls -? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: I wonder about the premise of this function. The Java doc suggests we have the path `a.b`. We want to know if `a` is an uninteresting column. Are we suggesting that we have metadata that says that `a.b` is uninteresting, but we have no metadata for `a` itself? Why would this be? Do we store columns in a non-structured way? That is, do we have a list of columns like 'a', 'b', 'c.d', 'e.f.g', 'c.h', and so on? Rather than `(a, b, c(d, g), e((g)))`? Further we seem to be assuming that we will gather stats for, say, `a.c`, but not `a.b`. The only place I can see that such complexity would make sense is to estimate cardinality for Flatten or Lateral Join. In such a case, might it make more sense to treat this recursively as a nested table? That is 'a` is a table and `a.b` and `a.c` are just `b` and `c` columns in the nested table. Given this, we *must* have a metadata entry for `a` (the map) in order to have metadata for any of its nested columns. So, again, I wonder if we've defined our metadata semantics as clearly as we might. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366671949 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { -long tableRowCount, colNulls; -Long nulls; ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column); -ColumnStatistics nonInterestingColStats = null; -if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); -} +ColumnStatistics nonInterestingColStats = columnStats == null +? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; +long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: Having a hard time understanding this. If a column is uninteresing, we get the row count from the non-interesting columns metadata. Seems round-about. Should we get the row count from the table itself? That is, the indirection through non-interesting columns to get table metadata seems awkward. Also, if we do have column stats, we get the row count from the table metadata. This raises the question: by column value count, do we mean NDV (number of distinct values)? Otherwise, the column value count for top-level columns is defined as the same as the row count whether the column is interesting or not. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services