[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet

2020-01-19 Thread GitBox
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect 
count() returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r368323139
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 
 Review comment:
   HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even 
suggest copying the details into the code somewhere to give future readers 
additional context. And, do we have a unit test that demonstrates this case? 
This seems like the kind of obscure issue that someone (such as me) could 
easily break at some point because the behavior is not obvious.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet

2020-01-19 Thread GitBox
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect 
count() returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r368321457
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 
 Review comment:
   HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even 
suggest copying the details into the code somewhere to give future readers 
additional context. And, do we have a unit test that demonstrates this case? 
This seems like the kind of obscure issue that someone (such as me) could 
easily break at some point because the behavior is not obvious.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet

2020-01-14 Thread GitBox
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect 
count() returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r39212
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
 
 Review comment:
   Nit: `hasNestedStatsForColumn`, or `isCompoundColumnWithStats`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet

2020-01-14 Thread GitBox
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect 
count() returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366673114
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 } else {
-  return 0; // returns 0 if the column doesn't exist in the table.
+  return Statistic.NO_COLUMN_STATS;
 
 Review comment:
   I'm more confused. If this is a structured (complex) column, then it can 
have nested columns. The nested columns don't add information about this 
column. (Knowing the number of values in an array of maps does not tell us the 
cardinality of the map.) Again, if the Map is at the top level, then the value 
count is row count. If this stat is NDV, then we don't know the NDV if we don't 
have metadata. I'd even argue that NDV makes no sense for a complex column; it 
only makes sense for the members of the column.
   
   Now, back to Arina's point. The info here tells us something about scans. If 
I ask only for column `x`, and the table does not contain column `x`, then I 
don't even need to scan at all, I can just return *n* copies of NULL. (Most 
query engines would fail the query because the column is undefined. Drill will 
run the query and return nulls.) However, in practice, the only way to know the 
correct value of *n* is to do the scan (stats can be out of date.)
   
   Still, I don't get why we need *column* value counts. If we do a scan, we 
want the table row count, we don't care about the column value count.
   
   So, I wonder if there is some additional problem here where our use of stats 
needs some adjusting.
   
   If we want to estimate the row count after filtering (that is, the row count 
seen by, say, a join or sort), then we need the NDV which we can estimate only 
if we have stats, otherwise we should fall back on heuristic selectivity values.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet

2020-01-14 Thread GitBox
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect 
count() returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366674972
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  // When statistics for nested field exists, this is complex column which 
is present in table.
+  // But its nested fields statistics can't be used to extract 
tableRowCount for this column.
+  // So NO_COLUMN_STATS returned here to avoid problems described in 
DRILL-7491.
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
 
 Review comment:
   The above is true, but non-interesting. The number of nulls is useful when 
estimating selectivity of things like `IS NULL` or `IS NOT NULL`. If we are 
interested in scans, then the null count is not helpful.
   
   Given the math here, is the meaning of this function, "return the number of 
rows with non-null values for this column"? If so, that is somewhat a backward 
request. More typical is to return the null count directly, or the percentage 
of nulls.
   
   A reason that this value is not as helpful as it could be is if we are using 
stats to estimate something like `WHERE x = 10 AND y IS NOT NULL`. First, we 
estimate table row count, which might have been adjusted via partitioning. Then 
we estimate the selectivity reduction due to `x = 10`. Then we estimate the 
further reduction due to `y IS NOT NULL`. We cannot do this math if all we have 
is a number *n* that tell us the number of non-null rows in the table; we'd 
have to work out the selectivity number ourselves from the table row count.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet

2020-01-14 Thread GitBox
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect 
count() returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366677192
 
 

 ##
 File path: 
logical/src/main/java/org/apache/drill/common/expression/SchemaPath.java
 ##
 @@ -334,36 +335,13 @@ public int hashCode() {
 
   @Override
   public boolean equals(Object obj) {
-if (this == obj) {
-  return true;
-}
-if (obj == null) {
-  return false;
-}
-if (!(obj instanceof SchemaPath)) {
-  return false;
-}
-
-SchemaPath other = (SchemaPath) obj;
-if (rootSegment == null) {
-  return (other.rootSegment == null);
-}
-return rootSegment.equals(other.rootSegment);
+return this == obj || obj instanceof SchemaPath
+&& Objects.equals(rootSegment, ((SchemaPath) obj).rootSegment);
   }
 
   public boolean contains(Object obj) {
 
 Review comment:
   `contains()` is not a Java-defined method. Must the signature be `Object`? 
Or, can it be `SchemaPath` since that is the only kind of thing a `SchemaPath` 
could ever contain? Would save some casting.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet

2020-01-14 Thread GitBox
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect 
count() returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366676515
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   I wonder about the premise of this function. The Java doc suggests we have 
the path `a.b`. We want to know if `a` is an uninteresting column. Are we 
suggesting that we have metadata that says that `a.b` is uninteresting, but we 
have no metadata for `a` itself?
   
   Why would this be? Do we store columns in a non-structured way? That is, do 
we have a list of columns like 'a', 'b', 'c.d', 'e.f.g', 'c.h', and so on? 
Rather than `(a, b, c(d, g), e((g)))`?
   
   Further we seem to be assuming that we will gather stats for, say, `a.c`, 
but not `a.b`.
   
   The only place I can see that such complexity would make sense is to 
estimate cardinality for Flatten or Lateral Join. In such a case, might it make 
more sense to treat this recursively as a nested table? That is 'a` is a table 
and `a.b` and `a.c` are just `b` and `c` columns in the nested table.
   
   Given this, we *must* have a metadata entry for `a` (the map) in order to 
have metadata for any of its nested columns.
   
   So, again, I wonder if we've defined our metadata semantics as clearly as we 
might.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet

2020-01-14 Thread GitBox
paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect 
count() returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366671949
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 
 Review comment:
   Having a hard time understanding this. If a column is uninteresing, we get 
the row count from the non-interesting columns metadata. Seems round-about. 
Should we get the row count from the table itself? That is, the indirection 
through non-interesting columns to get table metadata seems awkward.
   
   Also, if we do have column stats, we get the row count from the table 
metadata. This raises the question: by column value count, do we mean NDV 
(number of distinct values)? Otherwise, the column value count for top-level 
columns is defined as the same as the row count whether the column is 
interesting or not.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services