[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022206#comment-17022206
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

asfgit commented on pull request #1955: DRILL-7491: Incorrect count() returned 
for complex types in parquet
URL: https://github.com/apache/drill/pull/1955
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019469#comment-17019469
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r368547053
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 
 Review comment:
   Hello @paul-rogers , could you please check the comment on line 182 in this 
file? It was added after your first review, and you may have accidentally 
missed it. Also test for given case was added in ```TestParquetComplex.java```.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019041#comment-17019041
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r368323139
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 
 Review comment:
   HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even 
suggest copying the details into the code somewhere to give future readers 
additional context. And, do we have a unit test that demonstrates this case? 
This seems like the kind of obscure issue that someone (such as me) could 
easily break at some point because the behavior is not obvious.. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019040#comment-17019040
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r368321457
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 
 Review comment:
   HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even 
suggest copying the details into the code somewhere to give future readers 
additional context. And, do we have a unit test that demonstrates this case? 
This seems like the kind of obscure issue that someone (such as me) could 
easily break at some point because the behavior is not obvious.. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017911#comment-17017911
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on issue #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#issuecomment-575580335
 
 
   @paul-rogers Igor had addressed code review comments and left one comment 
explaining the background of the issue 
(https://github.com/apache/drill/pull/1955#discussion_r366812967). Could you 
take a look one more time? Thanks.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015839#comment-17015839
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366812967
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 
 Review comment:
   Hello @paul-rogers , 
   
   As says Javadoc for this method, here we're looking for a count of non-null 
row values of the specific column. There is a special physical rule 
```ConvertCountToDirectScanPrule``` purpose which is to reduce IO  when we're 
looking for ```count(col)``` values, since the value may be present in the 
metadata read while planning. The problem which I'm trying to fix ( 
[DRILL-7491](https://issues.apache.org/jira/browse/DRILL-7491)) appeared in the 
method because the method for existing complex column returned **0** as if the 
column doesn't exist. So physical plan was converted to something like: 
   ```
   ScreenPrel
 ProjectPrel(cr=[$0])
   DirectScanPrel(groupscan=[selectionRoot = 
classpath:/parquet/tmphive/hive_alltypes.parquet, numFiles = 1, 
usedMetadataSummaryFile = false, usedMetastore = false, 
DynamicPojoRecordReader{records = [[0]]}])
   ```
   and as a result incorrect count was returned. 
   
   In debug mode, I noticed that parquet didn't provide metadata for top-level 
complex type column, but some metadata exists for nested fields, so I decided 
to check the existence of metadata to determine whether such complex column 
exists in parquet file and that scan is necessary to calculate correct row 
count value.
   
   Regarding your comments about the improvement of a metadata structure for 
complex fields,  I think here there is no some special magic in Drill. I 
suppose that we're just reading & converting metadata from parquet into our 
model classes. If you have some good ideas on how we can improve our meta, 
could you please start a discussion at our mailing list to hear ideas of the 
community? Since the topic is out of scope for this pull request.  
   
   Kind regards, 
   Igor
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015587#comment-17015587
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366674972
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  // When statistics for nested field exists, this is complex column which 
is present in table.
+  // But its nested fields statistics can't be used to extract 
tableRowCount for this column.
+  // So NO_COLUMN_STATS returned here to avoid problems described in 
DRILL-7491.
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
 
 Review comment:
   The above is true, but non-interesting. The number of nulls is useful when 
estimating selectivity of things like `IS NULL` or `IS NOT NULL`. If we are 
interested in scans, then the null count is not helpful.
   
   Given the math here, is the meaning of this function, "return the number of 
rows with non-null values for this column"? If so, that is somewhat a backward 
request. More typical is to return the null count directly, or the percentage 
of nulls.
   
   A reason that this value is not as helpful as it could be is if we are using 
stats to estimate something like `WHERE x = 10 AND y IS NOT NULL`. First, we 
estimate table row count, which might have been adjusted via partitioning. Then 
we estimate the selectivity reduction due to `x = 10`. Then we estimate the 
further reduction due to `y IS NOT NULL`. We cannot do this math if all we have 
is a number *n* that tell us the number of non-null rows in the table; we'd 
have to work out the selectivity number ourselves from the table row count.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015583#comment-17015583
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r39212
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
 
 Review comment:
   Nit: `hasNestedStatsForColumn`, or `isCompoundColumnWithStats`.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015588#comment-17015588
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366676515
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   I wonder about the premise of this function. The Java doc suggests we have 
the path `a.b`. We want to know if `a` is an uninteresting column. Are we 
suggesting that we have metadata that says that `a.b` is uninteresting, but we 
have no metadata for `a` itself?
   
   Why would this be? Do we store columns in a non-structured way? That is, do 
we have a list of columns like 'a', 'b', 'c.d', 'e.f.g', 'c.h', and so on? 
Rather than `(a, b, c(d, g), e((g)))`?
   
   Further we seem to be assuming that we will gather stats for, say, `a.c`, 
but not `a.b`.
   
   The only place I can see that such complexity would make sense is to 
estimate cardinality for Flatten or Lateral Join. In such a case, might it make 
more sense to treat this recursively as a nested table? That is 'a` is a table 
and `a.b` and `a.c` are just `b` and `c` columns in the nested table.
   
   Given this, we *must* have a metadata entry for `a` (the map) in order to 
have metadata for any of its nested columns.
   
   So, again, I wonder if we've defined our metadata semantics as clearly as we 
might.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> 

[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015586#comment-17015586
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366671949
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 
 Review comment:
   Having a hard time understanding this. If a column is uninteresing, we get 
the row count from the non-interesting columns metadata. Seems round-about. 
Should we get the row count from the table itself? That is, the indirection 
through non-interesting columns to get table metadata seems awkward.
   
   Also, if we do have column stats, we get the row count from the table 
metadata. This raises the question: by column value count, do we mean NDV 
(number of distinct values)? Otherwise, the column value count for top-level 
columns is defined as the same as the row count whether the column is 
interesting or not.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015585#comment-17015585
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366673114
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 } else {
-  return 0; // returns 0 if the column doesn't exist in the table.
+  return Statistic.NO_COLUMN_STATS;
 
 Review comment:
   I'm more confused. If this is a structured (complex) column, then it can 
have nested columns. The nested columns don't add information about this 
column. (Knowing the number of values in an array of maps does not tell us the 
cardinality of the map.) Again, if the Map is at the top level, then the value 
count is row count. If this stat is NDV, then we don't know the NDV if we don't 
have metadata. I'd even argue that NDV makes no sense for a complex column; it 
only makes sense for the members of the column.
   
   Now, back to Arina's point. The info here tells us something about scans. If 
I ask only for column `x`, and the table does not contain column `x`, then I 
don't even need to scan at all, I can just return *n* copies of NULL. (Most 
query engines would fail the query because the column is undefined. Drill will 
run the query and return nulls.) However, in practice, the only way to know the 
correct value of *n* is to do the scan (stats can be out of date.)
   
   Still, I don't get why we need *column* value counts. If we do a scan, we 
want the table row count, we don't care about the column value count.
   
   So, I wonder if there is some additional problem here where our use of stats 
needs some adjusting.
   
   If we want to estimate the row count after filtering (that is, the row count 
seen by, say, a join or sort), then we need the NDV which we can estimate only 
if we have stats, otherwise we should fall back on heuristic selectivity values.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015584#comment-17015584
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366677192
 
 

 ##
 File path: 
logical/src/main/java/org/apache/drill/common/expression/SchemaPath.java
 ##
 @@ -334,36 +335,13 @@ public int hashCode() {
 
   @Override
   public boolean equals(Object obj) {
-if (this == obj) {
-  return true;
-}
-if (obj == null) {
-  return false;
-}
-if (!(obj instanceof SchemaPath)) {
-  return false;
-}
-
-SchemaPath other = (SchemaPath) obj;
-if (rootSegment == null) {
-  return (other.rootSegment == null);
-}
-return rootSegment.equals(other.rootSegment);
+return this == obj || obj instanceof SchemaPath
+&& Objects.equals(rootSegment, ((SchemaPath) obj).rootSegment);
   }
 
   public boolean contains(Object obj) {
 
 Review comment:
   `contains()` is not a Java-defined method. Must the signature be `Object`? 
Or, can it be `SchemaPath` since that is the only kind of thing a `SchemaPath` 
could ever contain? Would save some casting.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015124#comment-17015124
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on issue #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#issuecomment-574200097
 
 
   +1, LGTM.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015093#comment-17015093
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

KazydubB commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366339423
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   It depends from which perspective to think, but I agree the one proposed by 
me is confusing. Name still looks clumsy, but with the comments it should be 
fine. Thanks!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015074#comment-17015074
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366328108
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  // When exists statistic for nested field, this is complex column which 
is present in table.
 
 Review comment:
   ```suggestion
 // When statistics for nested field exists, this is complex column 
which is present in table.
   ```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015073#comment-17015073
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366327715
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   sorry @KazydubB , but the proposed new name also makes confusion, since 
column passed as the first argument isn't a leaf. I've added a comprehensive 
comment in place of usage, could you confirm that now intentions are clear?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015061#comment-17015061
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

KazydubB commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366316876
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   The name is a little bit awkward, indeed. It can be changed to something 
like `existStatsForLeafColumn` (in the case, 'nested fields' should be changed 
to 'leaf columns' in the javadoc).
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015055#comment-17015055
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366313821
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   No, the nested stats exists but not applicable for the parent column.  As 
javadoc says, the method is used only to determine whether the column is 
actually absent. The current name expresses exactly what the method does.  
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015051#comment-17015051
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366307446
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
 
 Review comment:
   ```suggestion
   ColumnStatistics nonInterestingColStats = columnStats == null
   ```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015052#comment-17015052
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366310138
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   I am not quite sure I follow this method logic because when you call this 
method you return that there is no statistics is present but method states that 
it checks that nested column statistics exists. I guess we might need to rename 
method to depict exactly what it does, for example checking if column exists.
   By the way, if stats for nested column exists, can we use it for calculation?
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015039#comment-17015039
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366303787
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 } else {
-  return 0; // returns 0 if the column doesn't exist in the table.
+  return Statistic.NO_COLUMN_STATS;
 
 Review comment:
   Hello @arina-ielchiieva , thank you very much for describing the issue. I've 
updated this PR, please take another look. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014988#comment-17014988
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366250930
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 } else {
-  return 0; // returns 0 if the column doesn't exist in the table.
+  return Statistic.NO_COLUMN_STATS;
 
 Review comment:
   @ihuzenko I am not sure that change is correct. As you can see from java-doc 
and comment that you have removed, 0 is returned deliberately to avoid full 
scan if column does not exist.
   Here is an example of unit test that shows that with your change, you are 
breaking existing functionality, you can add it to 
`org.apache.drill.exec.planner.logical.TestConvertCountToDirectScan` class:
   ```
 @Test
 public void textConvertAbsentColumn() throws Exception {
   String sql = "select count(abc) as cnt from cp.`tpch/nation.parquet`";
   
   queryBuilder()
 .sql(sql)
 .planMatcher()
 .include("DynamicPojoRecordReader")
 .match();
   
   testBuilder()
 .sqlQuery(sql)
 .unOrdered()
 .baselineColumns("cnt")
 .baselineValues(0L)
 .go();
 }
   ```
   After your changes, this test fill fail. I think you need to find a way to 
determine if column is absent or complex.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014648#comment-17014648
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955
 
 
   # [DRILL-7491](https://issues.apache.org/jira/browse/DRILL-7491): Incorrect 
count() returned for complex types in parquet
   
   ## Description
   
   Convert ConvertCountToDirectScanPrule was applied to existing complex type 
columns with absent rowCount metadata. Returning 
```Statistic.NO_COLUMN_STATS``` fixes the issue.  
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)