[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015587#comment-17015587
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366674972
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
    */
   @Override
   public long getColumnValueCount(SchemaPath column) {
-    long tableRowCount, colNulls;
-    Long nulls;
     ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column);
-    ColumnStatistics nonInterestingColStats = null;
-    if (columnStats == null) {
-      nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column);
-    }
+    ColumnStatistics nonInterestingColStats = columnStats == null
+        ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;
 
+    long tableRowCount;
     if (columnStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
     } else if (nonInterestingColStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+      columnStats = nonInterestingColStats;
+    } else if (existsNestedStatsForColumn(column, getTableMetadata())
+        || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) {
+      // When statistics exist for a nested field, this is a complex column that is present in the table,
+      // but its nested fields' statistics can't be used to extract tableRowCount for this column.
+      // So NO_COLUMN_STATS is returned here to avoid the problems described in DRILL-7491.
+      return Statistic.NO_COLUMN_STATS;
     } else {
       return 0; // returns 0 if the column doesn't exist in the table.
     }
 
-    columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-    nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-    colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+    Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+    if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) {
+      return Statistic.NO_COLUMN_STATS;
+    } else {
+      return tableRowCount - nulls;
 Review comment:
   The above is true, but non-interesting. The number of nulls is useful when 
estimating selectivity of things like `IS NULL` or `IS NOT NULL`. If we are 
interested in scans, then the null count is not helpful.
   
   Given the math here, is the meaning of this function "return the number of 
rows with non-null values for this column"? If so, that is somewhat of a backward 
request. More typical is to return the null count directly, or the percentage 
of nulls.
   
   A reason that this value is not as helpful as it could be is if we are using 
stats to estimate something like `WHERE x = 10 AND y IS NOT NULL`. First, we 
estimate table row count, which might have been adjusted via partitioning. Then 
we estimate the selectivity reduction due to `x = 10`. Then we estimate the 
further reduction due to `y IS NOT NULL`. We cannot do this math if all we have 
is a number *n* that tells us the number of non-null rows in the table; we'd 
have to work out the selectivity number ourselves from the table row count.
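   
   To make the arithmetic concrete, here is a rough sketch of the estimation flow 
described above; the stat names and the 0.15 fallback selectivity are assumptions 
for illustration, not Drill's planner API.
   
{code:java}
public class SelectivitySketch {

  /** Estimate output rows for: WHERE x = 10 AND y IS NOT NULL. */
  static double estimateRows(double tableRowCount, double ndvX, double nullCountY) {
    // Selectivity of x = 10: roughly 1/NDV(x) when stats exist, else a heuristic guess.
    double selX = ndvX > 0 ? 1.0 / ndvX : 0.15;
    // Selectivity of y IS NOT NULL: the fraction of non-null rows, 1 - nullCount/rowCount.
    double selYNotNull = tableRowCount > 0 ? 1.0 - (nullCountY / tableRowCount) : 1.0;
    // Independence assumption: multiply the per-predicate selectivities.
    return tableRowCount * selX * selYNotNull;
  }

  public static void main(String[] args) {
    // 1M rows, 100 distinct values of x, 200k nulls in y -> roughly 8,000 rows.
    System.out.println(estimateRows(1_000_000, 100, 200_000));
  }
}
{code}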
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a Parquet file 
> generated by Hive) and try count() on columns *c13 - c15*. For example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015583#comment-17015583
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r39212
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
    */
   @Override
   public long getColumnValueCount(SchemaPath column) {
-    long tableRowCount, colNulls;
-    Long nulls;
     ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column);
-    ColumnStatistics nonInterestingColStats = null;
-    if (columnStats == null) {
-      nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column);
-    }
+    ColumnStatistics nonInterestingColStats = columnStats == null
+        ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;
 
+    long tableRowCount;
     if (columnStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
     } else if (nonInterestingColStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+      columnStats = nonInterestingColStats;
+    } else if (existsNestedStatsForColumn(column, getTableMetadata())
+        || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) {
 
 Review comment:
   Nit: `hasNestedStatsForColumn`, or `isCompoundColumnWithStats`.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a Parquet file 
> generated by Hive) and try count() on columns *c13 - c15*. For example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015588#comment-17015588
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366676515
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
    */
   @Override
   public long getColumnValueCount(SchemaPath column) {
-    long tableRowCount, colNulls;
-    Long nulls;
     ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column);
-    ColumnStatistics nonInterestingColStats = null;
-    if (columnStats == null) {
-      nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column);
-    }
+    ColumnStatistics nonInterestingColStats = (columnStats == null)
+        ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;
 
+    long tableRowCount;
     if (columnStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
     } else if (nonInterestingColStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+      columnStats = nonInterestingColStats;
+    } else if (existsNestedStatsForColumn(column, getTableMetadata())
+        || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) {
+      return Statistic.NO_COLUMN_STATS;
     } else {
       return 0; // returns 0 if the column doesn't exist in the table.
     }
 
-    columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-    nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-    colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+    Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+    if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) {
+      return Statistic.NO_COLUMN_STATS;
+    } else {
+      return tableRowCount - nulls;
+    }
+  }
 
-    return Statistic.NO_COLUMN_STATS == tableRowCount
-        || Statistic.NO_COLUMN_STATS == colNulls
-        ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For example, a column path is `a`,
+   * but stats are present for `a`.`b`. So before deciding that the column is absent, this case needs
+   * to be checked.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exist for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) {
 
 Review comment:
   I wonder about the premise of this function. The Java doc suggests we have 
the path `a.b`. We want to know if `a` is an uninteresting column. Are we 
suggesting that we have metadata that says that `a.b` is uninteresting, but we 
have no metadata for `a` itself?
   
   Why would this be? Do we store columns in a non-structured way? That is, do 
we have a list of columns like 'a', 'b', 'c.d', 'e.f.g', 'c.h', and so on, 
rather than `(a, b, c(d, h), e(f(g)))`?
   
   Further we seem to be assuming that we will gather stats for, say, `a.c`, 
but not `a.b`.
   
   The only place I can see that such complexity would make sense is to 
estimate cardinality for Flatten or Lateral Join. In such a case, might it make 
more sense to treat this recursively as a nested table? That is, `a` is a table 
and `a.b` and `a.c` are just `b` and `c` columns in the nested table.
   
   Given this, we *must* have a metadata entry for `a` (the map) in order to 
have metadata for any of its nested columns.
   
   So, again, I wonder if we've defined our metadata semantics as clearly as we 
might.
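   
   For what it's worth, a prefix test of the kind the helper's name implies might 
look roughly like the sketch below. It works on plain dotted path strings precisely 
to avoid assuming any metastore API, so treat it as an illustration of the premise 
being questioned, not as the PR's code.
   
{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NestedStatsSketch {

  // `a` has "nested stats" if some stored path such as "a.b" starts with "a."
  // even though no statistics entry exists for "a" itself.
  static boolean hasNestedStatsForColumn(String column, Set<String> pathsWithStats) {
    return !pathsWithStats.contains(column)
        && pathsWithStats.stream().anyMatch(p -> p.startsWith(column + "."));
  }

  public static void main(String[] args) {
    Set<String> stats = new HashSet<>(Arrays.asList("a.b", "c"));
    System.out.println(hasNestedStatsForColumn("a", stats)); // true: only the nested field has stats
    System.out.println(hasNestedStatsForColumn("c", stats)); // false: `c` itself has stats
  }
}
{code}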
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> 

[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015586#comment-17015586
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366671949
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
    */
   @Override
   public long getColumnValueCount(SchemaPath column) {
-    long tableRowCount, colNulls;
-    Long nulls;
     ColumnStatistics columnStats = getTableMetadata().getColumnStatistics(column);
-    ColumnStatistics nonInterestingColStats = null;
-    if (columnStats == null) {
-      nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column);
-    }
+    ColumnStatistics nonInterestingColStats = columnStats == null
+        ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;
 
+    long tableRowCount;
     if (columnStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
     } else if (nonInterestingColStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 
 Review comment:
   Having a hard time understanding this. If a column is uninteresting, we get 
the row count from the non-interesting columns metadata. This seems roundabout; 
shouldn't we get the row count from the table itself? That is, the indirection 
through non-interesting columns to get table metadata seems awkward.
   
   Also, if we do have column stats, we get the row count from the table 
metadata. This raises the question: by column value count, do we mean NDV 
(number of distinct values)? Otherwise, the column value count for top-level 
columns is defined as the same as the row count whether the column is 
interesting or not.
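   
   For clarity, the two quantities at play differ as follows; the numbers are made 
up and the variable names simply mirror the diff, not a real Drill API.
   
{code:java}
public class ValueCountVsNdv {
  public static void main(String[] args) {
    long tableRowCount = 1_000L;  // rows in the table
    long nullCount = 100L;        // nulls in the column
    long ndv = 42L;               // number of distinct values: a different statistic entirely

    long columnValueCount = tableRowCount - nullCount;  // what getColumnValueCount() computes: 900
    System.out.println("value count = " + columnValueCount + ", NDV = " + ndv);
    // For a required top-level column, columnValueCount == tableRowCount, which is why the
    // question is whether NDV was the intended meaning.
  }
}
{code}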
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a Parquet file 
> generated by Hive) and try count() on columns *c13 - c15*. For example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015585#comment-17015585
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366673114
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
     } else if (nonInterestingColStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
     } else {
-      return 0; // returns 0 if the column doesn't exist in the table.
+      return Statistic.NO_COLUMN_STATS;
 
 Review comment:
   I'm more confused. If this is a structured (complex) column, then it can 
have nested columns. The nested columns don't add information about this 
column. (Knowing the number of values in an array of maps does not tell us the 
cardinality of the map.) Again, if the Map is at the top level, then the value 
count is row count. If this stat is NDV, then we don't know the NDV if we don't 
have metadata. I'd even argue that NDV makes no sense for a complex column; it 
only makes sense for the members of the column.
   
   Now, back to Arina's point. The info here tells us something about scans. If 
I ask only for column `x`, and the table does not contain column `x`, then I 
don't even need to scan at all, I can just return *n* copies of NULL. (Most 
query engines would fail the query because the column is undefined. Drill will 
run the query and return nulls.) However, in practice, the only way to know the 
correct value of *n* is to do the scan (stats can be out of date).
   
   Still, I don't get why we need *column* value counts. If we do a scan, we 
want the table row count; we don't care about the column value count.
   
   So, I wonder if there is some additional problem here where our use of stats 
needs some adjusting.
   
   If we want to estimate the row count after filtering (that is, the row count 
seen by, say, a join or sort), then we need the NDV which we can estimate only 
if we have stats, otherwise we should fall back on heuristic selectivity values.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a Parquet file 
> generated by Hive) and try count() on columns *c13 - c15*. For example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015584#comment-17015584
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366677192
 
 

 ##
 File path: logical/src/main/java/org/apache/drill/common/expression/SchemaPath.java
 ##
 @@ -334,36 +335,13 @@ public int hashCode() {
 
   @Override
   public boolean equals(Object obj) {
-    if (this == obj) {
-      return true;
-    }
-    if (obj == null) {
-      return false;
-    }
-    if (!(obj instanceof SchemaPath)) {
-      return false;
-    }
-
-    SchemaPath other = (SchemaPath) obj;
-    if (rootSegment == null) {
-      return (other.rootSegment == null);
-    }
-    return rootSegment.equals(other.rootSegment);
+    return this == obj || obj instanceof SchemaPath
+        && Objects.equals(rootSegment, ((SchemaPath) obj).rootSegment);
   }
 
   public boolean contains(Object obj) {
 
 Review comment:
   `contains()` is not a Java-defined method. Must the signature be `Object`? 
Or, can it be `SchemaPath` since that is the only kind of thing a `SchemaPath` 
could ever contain? Would save some casting.
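   
   As a sketch of the narrower signature, using a stand-in class rather than 
`SchemaPath` itself and assuming "contains" means "is a prefix of the other path":
   
{code:java}
import java.util.Arrays;
import java.util.List;

class SimplePath {  // stand-in for SchemaPath, illustration only

  private final List<String> segments;

  SimplePath(String... segments) {
    this.segments = Arrays.asList(segments);
  }

  /** Typed signature: call sites pass a path directly, with no instanceof test or cast. */
  boolean contains(SimplePath other) {
    // Interpreted here as: this path is a prefix of (or equal to) the other path,
    // e.g. `a` contains `a`.`b`.
    return other.segments.size() >= segments.size()
        && other.segments.subList(0, segments.size()).equals(segments);
  }
}
{code}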
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a Parquet file 
> generated by Hive) and try count() on columns *c13 - c15*. For example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7506) Simplify code gen error handling

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015571#comment-17015571
 ] 

ASF GitHub Bot commented on DRILL-7506:
---

paul-rogers commented on issue #1948: DRILL-7506: Simplify code gen error 
handling
URL: https://github.com/apache/drill/pull/1948#issuecomment-574464513
 
 
   @KazydubB, thank you for the review. Addressed comments. Squashed commits. 
Rebased on latest master and resolved conflicts.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Simplify code gen error handling
> 
>
> Key: DRILL-7506
> URL: https://issues.apache.org/jira/browse/DRILL-7506
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.17.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.18.0
>
>
> Code generation can generate a variety of errors. Most operators bubble these 
> exceptions up several layers in the code before catching them. This patch 
> moves error handling closer to the code gen itself to allow a) simpler code, 
> and b) clearer error messages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7506) Simplify code gen error handling

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015570#comment-17015570
 ] 

ASF GitHub Bot commented on DRILL-7506:
---

paul-rogers commented on pull request #1948: DRILL-7506: Simplify code gen 
error handling
URL: https://github.com/apache/drill/pull/1948#discussion_r366610589
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalOperatorUtil.java
 ##
 @@ -46,10 +46,13 @@ private PhysicalOperatorUtil() {}
   }
 
   /**
-   * Helper method to create a list of MinorFragmentEndpoint instances from a given endpoint assignment list.
+   * Helper method to create a list of MinorFragmentEndpoint instances from a
+   * given endpoint assignment list.
    *
-   * @param endpoints Assigned endpoint list. Index of each endpoint in list indicates the MinorFragmentId of the
-   *  fragment that is assigned to the endpoint.
+   * @param endpoints
 
 Review comment:
   Sorry, the formatting is how Eclipse does it. Maybe we can add this to our 
list of standard format rules and I can create a rule in Eclipse to format 
comments to our standard.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Simplify code gen error handling
> 
>
> Key: DRILL-7506
> URL: https://issues.apache.org/jira/browse/DRILL-7506
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.17.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.18.0
>
>
> Code generation can generate a variety of errors. Most operators bubble these 
> exceptions up several layers in the code before catching them. This patch 
> moves error handling closer to the code gen itself to allow a) simpler code, 
> and b) clearer error messages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7509) Incorrect TupleSchema is created for DICT column when querying Parquet files

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015446#comment-17015446
 ] 

ASF GitHub Bot commented on DRILL-7509:
---

paul-rogers commented on pull request #1954: DRILL-7509: Incorrect TupleSchema 
is created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366597636
 
 

 ##
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/AbstractColumnMetadata.java
 ##
 @@ -306,7 +306,7 @@ public String columnString() {
     builder.append(typeString());
 
     // Drill does not have nullability notion for complex types
-    if (!isNullable() && !isArray() && !isMap()) {
+    if (!isNullable() && !isArray() && !isMap() && !isDict()) {
 
 Review comment:
   Bad "code smell". The base class is enumerating all the ways that something 
can be not null. Not even sure this makes sense. Repeated items (arrays) in 
Drill are non-nullable. But, LISTs, oddly, are nullable. Is a List an Array? A 
Map is not nullable, but a union with a map element is nullable. A Dict is not 
nullable, but UNION or LIST of Dict is.
   
   So, why is it that we have to check both that the type is not nullable and 
that it is not an array (or map or DICT)? It should be that, if we say that a 
DICT, say, is not nullable, then `isNullable()` should always be false for that 
type, or we have a logic error.
   
   Note: a while back I tried adding unit tests to this class (still need to 
issue that PR) and found that the semantics are getting pretty muddy: very hard 
to figure out what we're trying to do. Not in scope for this PR, but we really 
need to clean up our type semantics.
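   
   One possible shape for that alternative, purely as an illustration (the class 
names are made up and this is not Drill's actual `ColumnMetadata` hierarchy):
   
{code:java}
abstract class ColumnMetadataSketch {

  abstract boolean isNullable();

  /** True for MAP, DICT, LIST and repeated types: kinds for which Drill has no nullability notion. */
  abstract boolean isComplex();

  String columnString() {
    StringBuilder sb = new StringBuilder("<name> <type>");
    // One stable check here, instead of a growing !isArray() && !isMap() && !isDict() list.
    if (!isNullable() && !isComplex()) {
      sb.append(" NOT NULL");
    }
    return sb.toString();
  }
}

// Each subclass answers the nullability question once, in one place.
class DictColumnMetadataSketch extends ColumnMetadataSketch {
  @Override boolean isNullable() { return false; }
  @Override boolean isComplex() { return true; }
}
{code}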
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect TupleSchema is created for DICT column when querying Parquet files
> 
>
> Key: DRILL-7509
> URL: https://issues.apache.org/jira/browse/DRILL-7509
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
> Fix For: 1.18.0
>
>
> When a {{DICT}} column is queried from a Parquet file, its {{TupleSchema}} 
> contains a nested element, e.g. `map`, which itself contains the `key` and `value` 
> fields, rather than containing the `key` and `value` fields in the {{DICT}}'s 
> {{TupleSchema}} itself. The nested element, `map`, comes from the inner 
> structure of Parquet's {{MAP}} (which corresponds to Drill's {{DICT}}) 
> representation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7509) Incorrect TupleSchema is created for DICT column when querying Parquet files

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015451#comment-17015451
 ] 

ASF GitHub Bot commented on DRILL-7509:
---

paul-rogers commented on pull request #1954: DRILL-7509: Incorrect TupleSchema 
is created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366606167
 
 

 ##
 File path: metastore/metastore-api/src/main/java/org/apache/drill/metastore/util/SchemaPathUtils.java
 ##
 @@ -81,15 +106,50 @@ public static void addColumnMetadata(TupleMetadata schema, SchemaPath schemaPath
       names.add(colPath.getPath());
       colMetadata = schema.metadata(colPath.getPath());
       TypeProtos.MajorType pathType = types.get(SchemaPath.getCompoundPath(names.toArray(new String[0])));
+
+      boolean isDict = pathType != null && pathType.getMinorType() == TypeProtos.MinorType.DICT;
+      boolean isList = pathType != null && pathType.getMinorType() == TypeProtos.MinorType.LIST;
+      String name = colPath.getPath();
 
 Review comment:
   More bad code smell. See notes above.
   
   BTW: If you have to do a short-term fix, then this kind of error-prone, 
redundant code is fine. But, if we have time to fix the underlying problem, 
let's do it. Else, we can fix the problem in another PR.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect TupleSchema is created for DICT column when querying Parquet files
> 
>
> Key: DRILL-7509
> URL: https://issues.apache.org/jira/browse/DRILL-7509
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
> Fix For: 1.18.0
>
>
> When a {{DICT}} column is queried from a Parquet file, its {{TupleSchema}} 
> contains a nested element, e.g. `map`, which itself contains the `key` and `value` 
> fields, rather than containing the `key` and `value` fields in the {{DICT}}'s 
> {{TupleSchema}} itself. The nested element, `map`, comes from the inner 
> structure of Parquet's {{MAP}} (which corresponds to Drill's {{DICT}}) 
> representation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7509) Incorrect TupleSchema is created for DICT column when querying Parquet files

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015448#comment-17015448
 ] 

ASF GitHub Bot commented on DRILL-7509:
---

paul-rogers commented on pull request #1954: DRILL-7509: Incorrect TupleSchema 
is created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366598557
 
 

 ##
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/MetadataUtils.java
 ##
 @@ -187,6 +187,10 @@ public static ColumnMetadata newMapArray(String name, TupleMetadata schema) {
     return new MapColumnMetadata(name, DataMode.REPEATED, (TupleSchema) schema);
   }
 
+  public static DictColumnMetadata newDictArray(String name, TupleMetadata schema) {
+    return new DictColumnMetadata(name, DataMode.REPEATED, (TupleSchema) schema);
 
 Review comment:
   Picking up on a comment in your PR description, I wonder if we have the 
wrong semantics here. It is true that DICT is *implemented* as a map. But, at 
the metadata (descriptive) level, it is not a map, and is not constructed from 
a map; it is instead a `(key, value)` pair.
   
   I'm not sure we're doing ourselves a favor by exposing the implementation 
detail (a kind of map) in our metadata description.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect TupleSchema is created for DICT column when querying Parquet files
> 
>
> Key: DRILL-7509
> URL: https://issues.apache.org/jira/browse/DRILL-7509
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
> Fix For: 1.18.0
>
>
> When a {{DICT}} column is queried from a Parquet file, its {{TupleSchema}} 
> contains a nested element, e.g. `map`, which itself contains the `key` and `value` 
> fields, rather than containing the `key` and `value` fields in the {{DICT}}'s 
> {{TupleSchema}} itself. The nested element, `map`, comes from the inner 
> structure of Parquet's {{MAP}} (which corresponds to Drill's {{DICT}}) 
> representation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7509) Incorrect TupleSchema is created for DICT column when querying Parquet files

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015447#comment-17015447
 ] 

ASF GitHub Bot commented on DRILL-7509:
---

paul-rogers commented on pull request #1954: DRILL-7509: Incorrect TupleSchema 
is created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366601149
 
 

 ##
 File path: metastore/metastore-api/src/main/java/org/apache/drill/metastore/util/SchemaPathUtils.java
 ##
 @@ -50,7 +51,7 @@ public static ColumnMetadata getColumnMetadata(SchemaPath schemaPath, TupleMetad
     while (!colPath.isLastPath() && colMetadata != null) {
       if (colMetadata.isDict()) {
         // get dict's value field metadata
-        colMetadata = colMetadata.tupleSchema().metadata(0).tupleSchema().metadata(1);
+        colMetadata = colMetadata.tupleSchema().metadata(1);
 
 Review comment:
   This also has a bad "code smell". We are asking each tool that uses metadata 
to know how to reference members within a map or DICT. Before DICT, the only 
type with internal structure was a MAP, which was represented as a tuple, so 
the original code made sense. But, with DICT, we now no longer tie the idea of 
"has named members" with the idea of "has a nested tuple schema".
   
   I wonder, should we add to the `ColumnMetadata` class a `member(String/int)` 
method? For `MAP`, it would turn around and call into the tuple schema. For a 
`DICT`, it would have a static mapping of `key`/`value` and `0`/`1`.
   
   Thinking more generally, a `UNION` could implement `member(String)` as 
an access to the subtype given a type.
   
   Not necessary in this PR, but perhaps we can file a JIRA for several 
DICT-aware cleanups.
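   
   A rough sketch of what such a `member(...)` accessor could look like; the names 
and structure are invented for illustration and this is not the existing 
`ColumnMetadata` API.
   
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

interface MemberAwareSketch {
  /** Metadata of a named child, or null if this type has no such member. */
  Object member(String name);
}

// MAP: delegate to the tuple schema, keyed by field name.
class MapColumnSketch implements MemberAwareSketch {
  private final Map<String, Object> tuple = new LinkedHashMap<>();
  @Override public Object member(String name) { return tuple.get(name); }
}

// DICT: a fixed key/value mapping instead of digging through a nested tuple schema.
class DictColumnSketch implements MemberAwareSketch {
  private final Object key;
  private final Object value;

  DictColumnSketch(Object key, Object value) {
    this.key = key;
    this.value = value;
  }

  @Override public Object member(String name) {
    switch (name) {
      case "key":   return key;
      case "value": return value;
      default:      return null;
    }
  }
}
{code}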
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect TupleSchema is created for DICT column when querying Parquet files
> 
>
> Key: DRILL-7509
> URL: https://issues.apache.org/jira/browse/DRILL-7509
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
> Fix For: 1.18.0
>
>
> When a {{DICT}} column is queried from a Parquet file, its {{TupleSchema}} 
> contains a nested element, e.g. `map`, which itself contains the `key` and `value` 
> fields, rather than containing the `key` and `value` fields in the {{DICT}}'s 
> {{TupleSchema}} itself. The nested element, `map`, comes from the inner 
> structure of Parquet's {{MAP}} (which corresponds to Drill's {{DICT}}) 
> representation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7509) Incorrect TupleSchema is created for DICT column when querying Parquet files

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015450#comment-17015450
 ] 

ASF GitHub Bot commented on DRILL-7509:
---

paul-rogers commented on pull request #1954: DRILL-7509: Incorrect TupleSchema 
is created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366602180
 
 

 ##
 File path: metastore/metastore-api/src/main/java/org/apache/drill/metastore/util/SchemaPathUtils.java
 ##
 @@ -63,6 +64,30 @@ public static ColumnMetadata getColumnMetadata(SchemaPath schemaPath, TupleMetad
     return colMetadata;
   }
 
+  /**
+   * Checks if the field identified by the schema path is a child of either {@code DICT} or {@code REPEATED MAP}.
+   * For such fields, nested in {@code DICT} or {@code REPEATED MAP},
+   * filters can't be removed based on Parquet statistics.
+   * @param schemaPath schema path used in filter
+   * @param schema schema containing all the fields in the file
+   * @return {@literal true} if the field is nested inside {@code DICT} (is {@code `key`} or {@code `value`})
+   * or inside a {@code REPEATED MAP} field, {@literal false} otherwise.
+   */
+  public static boolean isFieldNestedInDictOrRepeatedMap(SchemaPath schemaPath, TupleMetadata schema) {
 
 Review comment:
   This is a kind of cluttered version of what was just suggested. Rather than 
a bunch of if-statements that answer a specific question, generalize to look 
up a member. You'd get the same result as this method if `member(String name)` 
returned `null` if no such member existed. (And, you'd do it generically 
without having to later extend this to 
`isFieldNestedInDictOrRepeatedMapOrSomeNewType()`.)
   
   Also, why only `Repeated` Map? (Non-repeated) Maps also have fields and 
allow `a.b` notation.
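   
   As a usage sketch of that generalization (again with invented names), the check 
collapses into a generic path walk: a missing member at any level means the field 
can't be resolved, and no per-type predicate is needed.
   
{code:java}
import java.util.List;

public class PathResolveSketch {

  interface MemberAware {  // the same invented accessor as in the earlier sketch
    Object member(String name);
  }

  /** True if every segment of the path resolves, whatever the container types are. */
  static boolean resolves(List<String> path, MemberAware root) {
    MemberAware current = root;
    for (int i = 0; i < path.size(); i++) {
      Object child = current.member(path.get(i));  // null means "no such member" for any type
      if (child == null) {
        return false;                              // replaces isFieldNestedIn<SomeNewType>() checks
      }
      if (i < path.size() - 1) {
        if (!(child instanceof MemberAware)) {
          return false;                            // hit a leaf while segments remain
        }
        current = (MemberAware) child;
      }
    }
    return true;
  }
}
{code}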
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect TupleSchema is created for DICT column when querying Parquet files
> 
>
> Key: DRILL-7509
> URL: https://issues.apache.org/jira/browse/DRILL-7509
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
> Fix For: 1.18.0
>
>
> When a {{DICT}} column is queried from a Parquet file, its {{TupleSchema}} 
> contains a nested element, e.g. `map`, which itself contains the `key` and `value` 
> fields, rather than containing the `key` and `value` fields in the {{DICT}}'s 
> {{TupleSchema}} itself. The nested element, `map`, comes from the inner 
> structure of Parquet's {{MAP}} (which corresponds to Drill's {{DICT}}) 
> representation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7509) Incorrect TupleSchema is created for DICT column when querying Parquet files

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015449#comment-17015449
 ] 

ASF GitHub Bot commented on DRILL-7509:
---

paul-rogers commented on pull request #1954: DRILL-7509: Incorrect TupleSchema 
is created for DICT column when querying Parquet files
URL: https://github.com/apache/drill/pull/1954#discussion_r366605680
 
 

 ##
 File path: metastore/metastore-api/src/main/java/org/apache/drill/metastore/util/SchemaPathUtils.java
 ##
 @@ -63,6 +64,30 @@ public static ColumnMetadata getColumnMetadata(SchemaPath schemaPath, TupleMetad
     return colMetadata;
   }
 
+  /**
+   * Checks if the field identified by the schema path is a child of either {@code DICT} or {@code REPEATED MAP}.
+   * For such fields, nested in {@code DICT} or {@code REPEATED MAP},
+   * filters can't be removed based on Parquet statistics.
+   * @param schemaPath schema path used in filter
+   * @param schema schema containing all the fields in the file
+   * @return {@literal true} if the field is nested inside {@code DICT} (is {@code `key`} or {@code `value`})
+   * or inside a {@code REPEATED MAP} field, {@literal false} otherwise.
+   */
+  public static boolean isFieldNestedInDictOrRepeatedMap(SchemaPath schemaPath, TupleMetadata schema) {
 
 Review comment:
   Another general observation is that we've accidentally created multiple 
versions of the same code. The `RequestedTupleImpl` class does something very 
similar: it converts a `SchemaPath` into a consolidated projection set using 
logic much like we have here. For example, `SELECT a, a.b` is consolidated into 
a single column, `a`, that must be a MAP that must contain at least a `b` 
member. That code was very tricky to get right, as, I'm sure, this is. That 
code has to be modified to handle `SELECT a, a.key`. Is `a` now a `MAP` or a 
`DICT`? We don't know; all we know is that `a` is consistent with either 
interpretation.
   
   Do we really need multiple copies?
   
   Some things to consolidate:
   
   * `ColumnMetadata` - which should be our go-to solution as it is the most 
general.
   * `MaterializedField` - which has all kinds of holes and problems for 
complex types.
   * `SerializedField` - Like `MaterializedField` but with buffer lengths?
   * `RequestedColumn` - Part of that code mentioned above that parses a 
project list into a consolidated set of columns.
   * `SchemaPath` - A description of a project list.
   * The logic here
   * ... - Probably others.
   
   We probably want three tiers of representation:
   
   * Project list - `SchemaPath`
   * "Semanticized" project list - `RequestedTuple`
   * Column metadata - `ColumnMetadata`
   
   Then, we create one set of code that handles transforms and validations. 
(Convert project list to semanticized list, then validate that against a 
schema.) As it turns out, that is what the EVF does for scan projection 
planning; maybe we can pull that out and generalize it for use elsewhere?
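   
   To make the three tiers concrete, a toy version of the transform/validation path 
might look like the following; it is purely illustrative, and the real 
`RequestedTupleImpl` logic is of course far more involved.
   
{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ProjectionPipelineSketch {

  /** SELECT a, a.b -> {a=[b]}: one requested column `a` that must contain a member `b`. */
  static Map<String, Set<String>> semanticize(List<String> projectList) {
    Map<String, Set<String>> consolidated = new LinkedHashMap<>();
    for (String col : projectList) {
      String[] parts = col.split("\\.", 2);
      Set<String> members = consolidated.computeIfAbsent(parts[0], k -> new LinkedHashSet<>());
      if (parts.length > 1) {
        members.add(parts[1]);
      }
    }
    return consolidated;
  }

  /** One shared validation of the consolidated request against a schema's top-level columns. */
  static boolean isValidAgainst(Map<String, Set<String>> request, Set<String> schemaColumns) {
    return schemaColumns.containsAll(request.keySet());
  }

  public static void main(String[] args) {
    Map<String, Set<String>> request = semanticize(Arrays.asList("a", "a.b", "c"));
    System.out.println(request);                                                         // {a=[b], c=[]}
    System.out.println(isValidAgainst(request, new HashSet<>(Arrays.asList("a", "c"))));  // true
  }
}
{code}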
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect TupleSchema is created for DICT column when querying Parquet files
> 
>
> Key: DRILL-7509
> URL: https://issues.apache.org/jira/browse/DRILL-7509
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
> Fix For: 1.18.0
>
>
> When a {{DICT}} column is queried from a Parquet file, its {{TupleSchema}} 
> contains a nested element, e.g. `map`, which itself contains the `key` and `value` 
> fields, rather than containing the `key` and `value` fields in the {{DICT}}'s 
> {{TupleSchema}} itself. The nested element, `map`, comes from the inner 
> structure of Parquet's {{MAP}} (which corresponds to Drill's {{DICT}}) 
> representation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7467) Drill does not close DB connection when JDBC storage plugin is disabled

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7467:

Affects Version/s: 1.17.0

> Drill does not close DB connection when JDBC storage plugin is disabled
> ---
>
> Key: DRILL-7467
> URL: https://issues.apache.org/jira/browse/DRILL-7467
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - JDBC
>Affects Versions: 1.17.0
>Reporter: Priyanka Bhoir
>Assignee: Arina Ielchiieva
>Priority: Major
> Fix For: 1.18.0
>
>
> JdbcStoragePlugin does not implement the 'close' method, leaving all 
> connections open even after the plugin is disabled. This can be monitored 
> through the 'lsof' command. Restarting a JDBC plugin adds to the existing 
> connections, and a Drill restart is required to release all TCP connections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7467) Drill does not close DB connection when JDBC storage plugin is disabled

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7467:

Fix Version/s: 1.18.0

> Drill does not close DB connection when JDBC storage plugin is disabled
> ---
>
> Key: DRILL-7467
> URL: https://issues.apache.org/jira/browse/DRILL-7467
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - JDBC
>Reporter: Priyanka Bhoir
>Assignee: Arina Ielchiieva
>Priority: Major
> Fix For: 1.18.0
>
>
> JdbcStoragePlugin does not implement the 'close' method, leaving all 
> connections open even after the plugin is disabled. This can be monitored 
> through the 'lsof' command. Restarting a JDBC plugin adds to the existing 
> connections, and a Drill restart is required to release all TCP connections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-7467) Drill does not close DB connection when JDBC storage plugin is disabled

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva reassigned DRILL-7467:
---

Assignee: Arina Ielchiieva

> Drill does not close DB connection when JDBC storage plugin is disabled
> ---
>
> Key: DRILL-7467
> URL: https://issues.apache.org/jira/browse/DRILL-7467
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - JDBC
>Reporter: Priyanka Bhoir
>Assignee: Arina Ielchiieva
>Priority: Major
>
> JdbcStoragePlugin does not implement the 'close' method, leaving all 
> connections open even after the plugin is disabled. This can be monitored 
> through the 'lsof' command. Restarting a JDBC plugin adds to the existing 
> connections, and a Drill restart is required to release all TCP connections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-4770) ParquetRecordReader throws NPE querying a single int64 column file

2020-01-14 Thread Arina Ielchiieva (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015133#comment-17015133
 ] 

Arina Ielchiieva commented on DRILL-4770:
-

Since Drill 1.16 / 1.17, the file can be queried using the complex reader:
{noformat}
apache drill (dfs.tmp)> select * from dfs.`/Users/arina/Downloads/int64_10_bs10k_ps1k_uncompressed.parquet`;
+----------------------+
| int64_field_required |
+----------------------+
| 0                    |
| 1                    |
| 2                    |
| 3                    |
| 4                    |
| 5                    |
| 6                    |
| 7                    |
| 8                    |
| 9                    |
+----------------------+
{noformat}

> ParquetRecordReader throws NPE querying a single int64 column file
> --
>
> Key: DRILL-4770
> URL: https://issues.apache.org/jira/browse/DRILL-4770
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Chun Chang
>Assignee: Padma Penumarthy
>Priority: Major
> Attachments: int64_10_bs10k_ps1k_uncompressed.parquet
>
>
> I have a parquet file with a single int64 column. 
> {noformat}
> [root@perfnode166 parquet-mr]# java -jar 
> parquet-tools/target/parquet-tools-1.8.2-SNAPSHOT.jar dump 
> /mapr/drill50.perf.lab/drill/testdata/parquet_storage/int64_10_bs10k_ps1k_uncompressed.parquet
> row group 0
> 
> int64_field_required:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:55/55/1.00 VC:10 
> [more]...
> int64_field_required TV=10 RL=0 DL=0
> 
> 
> page 0:  DLE:RLE RLE:RLE VLE:DELTA_BINARY_PACKED ST:[min: 0, max:  
> [more]... VC:10
> INT64 int64_field_required
> 
> *** row group 1 of 1, values 1 to 10 ***
> value 1:  R:0 D:0 V:0
> value 2:  R:0 D:0 V:1
> value 3:  R:0 D:0 V:2
> value 4:  R:0 D:0 V:3
> value 5:  R:0 D:0 V:4
> value 6:  R:0 D:0 V:5
> value 7:  R:0 D:0 V:6
> value 8:  R:0 D:0 V:7
> value 9:  R:0 D:0 V:8
> value 10: R:0 D:0 V:9
> {noformat}
> Drill version:
> {noformat}
> 0: jdbc:drill:schema=dfs.drillTestDir> select * from sys.version;
> +-+---+-++-++
> | version | commit_id |   
>   commit_message  
> |commit_time | build_email |  
>build_time |
> +-+---+-++-++
> | 1.8.0-SNAPSHOT  | 05c42eae79ce3e309028b3824f9449b98e329f29  | DRILL-4707: 
> Fix memory leak or incorrect query result in case two column names are 
> case-insensitive identical.  | 29.06.2016 @ 08:15:13 PDT  | 
> inram...@gmail.com  | 07.07.2016 @ 10:50:40 PDT  |
> +-+---+-++-++
> 1 row selected (0.44 seconds)
> {noformat}
> drill throws NPE:
> {noformat}
> 2016-07-08 11:08:55,156 [288013c7-f122-f6be-936e-c18ebe9b92ef:foreman] INFO  
> o.a.drill.exec.work.foreman.Foreman - Query text for query id 
> 288013c7-f122-f6be-936e-c18ebe9b92ef: select * from 
> dfs.`drill/testdata/parquet_storage/int64_10_bs10k_ps1k_uncompressed.parquet`
> 2016-07-08 11:08:55,292 [288013c7-f122-f6be-936e-c18ebe9b92ef:foreman] INFO  
> o.a.d.exec.store.parquet.Metadata - Took 0 ms to get file statuses
> 2016-07-08 11:08:55,295 [288013c7-f122-f6be-936e-c18ebe9b92ef:foreman] INFO  
> o.a.d.exec.store.parquet.Metadata - Fetch parquet metadata: Executed 1 out of 
> 1 using 1 threads. Time: 2ms total, 2.423069ms avg, 2ms max.
> 2016-07-08 11:08:55,295 [288013c7-f122-f6be-936e-c18ebe9b92ef:foreman] INFO  
> o.a.d.exec.store.parquet.Metadata - Fetch parquet metadata: Executed 1 out of 
> 1 using 1 threads. Earliest start: 1.347000 μs, Latest start: 1.347000 μs, 
> Average start: 1.347000 μs .
> 2016-07-08 11:08:55,295 [288013c7-f122-f6be-936e-c18ebe9b92ef:foreman] INFO  
> o.a.d.exec.store.parquet.Metadata - Took 2 ms to read file metadata
> 

[jira] [Closed] (DRILL-2873) CTAS reports error when timestamp values in CSV file are quoted

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva closed DRILL-2873.
---

> CTAS reports error when timestamp values in CSV file are quoted
> ---
>
> Key: DRILL-2873
> URL: https://issues.apache.org/jira/browse/DRILL-2873
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Priority: Major
> Fix For: 1.17.0
>
>
> When timestamp values are enclosed in quotes (") inside a CSV data file, a CTAS 
> statement reports an error.
> Failing CTAS
> {code}
> 0: jdbc:drill:> create table prqFrmCSV02 as select cast(columns[0] as int) 
> col_int, cast(columns[1] as bigint) col_bgint, cast(columns[2] as char(10)) 
> col_char, cast(columns[3] as varchar(18)) col_vchar, cast(columns[4] as 
> timestamp) col_tmstmp, cast(columns[5] as date) col_date, cast(columns[6] as 
> boolean) col_boln, cast(columns[7] as double) col_dbl from `csvToPrq.csv`;
> Query failed: SYSTEM ERROR: Invalid format: ""2015-04-23 23:47:00.124""
> [a601a66a-b305-4a92-9836-f39edcdc8fe8 on centos-02.qa.lab:31010]
> Error: exception while executing query: Failure while executing query. 
> (state=,code=0)
> {code}
> Stack trace from drillbit.log
> {code}
> 2015-04-24 18:41:09,721 [2ac571ba-778f-f3d5-c60f-af2e536905a3:frag:0:0] ERROR 
> o.a.drill.exec.ops.FragmentContext - Fragment Context received failure -- 
> Fragment: 0:0
> org.apache.drill.common.exceptions.DrillUserException: SYSTEM ERROR: Invalid 
> format: ""2015-04-23 23:47:00.124""
> [a601a66a-b305-4a92-9836-f39edcdc8fe8 on centos-02.qa.lab:31010]
> at 
> org.apache.drill.common.exceptions.DrillUserException$Builder.build(DrillUserException.java:115)
>  ~[drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.common.exceptions.ErrorHelper.wrap(ErrorHelper.java:39) 
> ~[drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.ops.FragmentContext.fail(FragmentContext.java:151) 
> ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext(WriterRecordBatch.java:131)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:99)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:89)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:135)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:74) 
> ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:76)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:64) 
> ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:164)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_75]
>  

[jira] [Resolved] (DRILL-2873) CTAS reports error when timestamp values in CSV file are quoted

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva resolved DRILL-2873.
-
Resolution: Fixed

> CTAS reports error when timestamp values in CSV file are quoted
> ---
>
> Key: DRILL-2873
> URL: https://issues.apache.org/jira/browse/DRILL-2873
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Priority: Major
> Fix For: 1.17.0
>
>
> When timestamp values are enclosed in quotes (") inside a CSV data file, a CTAS 
> statement reports an error.
> Failing CTAS
> {code}
> 0: jdbc:drill:> create table prqFrmCSV02 as select cast(columns[0] as int) 
> col_int, cast(columns[1] as bigint) col_bgint, cast(columns[2] as char(10)) 
> col_char, cast(columns[3] as varchar(18)) col_vchar, cast(columns[4] as 
> timestamp) col_tmstmp, cast(columns[5] as date) col_date, cast(columns[6] as 
> boolean) col_boln, cast(columns[7] as double) col_dbl from `csvToPrq.csv`;
> Query failed: SYSTEM ERROR: Invalid format: ""2015-04-23 23:47:00.124""
> [a601a66a-b305-4a92-9836-f39edcdc8fe8 on centos-02.qa.lab:31010]
> Error: exception while executing query: Failure while executing query. 
> (state=,code=0)
> {code}
> Stack trace from drillbit.log
> {code}
> 2015-04-24 18:41:09,721 [2ac571ba-778f-f3d5-c60f-af2e536905a3:frag:0:0] ERROR 
> o.a.drill.exec.ops.FragmentContext - Fragment Context received failure -- 
> Fragment: 0:0
> org.apache.drill.common.exceptions.DrillUserException: SYSTEM ERROR: Invalid 
> format: ""2015-04-23 23:47:00.124""
> [a601a66a-b305-4a92-9836-f39edcdc8fe8 on centos-02.qa.lab:31010]
> at 
> org.apache.drill.common.exceptions.DrillUserException$Builder.build(DrillUserException.java:115)
>  ~[drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.common.exceptions.ErrorHelper.wrap(ErrorHelper.java:39) 
> ~[drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.ops.FragmentContext.fail(FragmentContext.java:151) 
> ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext(WriterRecordBatch.java:131)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:99)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:89)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:135)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:74) 
> ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:76)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:64) 
> ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:164)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> 

[jira] [Commented] (DRILL-2873) CTAS reports error when timestamp values in CSV file are quoted

2020-01-14 Thread Arina Ielchiieva (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015130#comment-17015130
 ] 

Arina Ielchiieva commented on DRILL-2873:
-

{noformat}
apache drill (dfs.tmp)> create table t as select cast(columns[0] as int) 
col_int, cast(columns[1] as bigint) col_bgint, cast(columns[2] as char(10)) 
col_char, cast(columns[3] as varchar(18)) col_vchar, cast(columns[4] as 
timestamp) col_tmstmp, cast(columns[5] as date) col_date, cast(columns[6] as 
boolean) col_boln, cast(columns[7] as double) col_dbl from dfs.tmp.`data.csv`;
+--+---+
| Fragment | Number of records written |
+--+---+
| 0_0  | 8 |
+--+---+
1 row selected (0.757 seconds)
apache drill (dfs.tmp)> select * from t;
+------------+-------------------+----------+------------------+-------------------------+------------+----------+-----------------+
| col_int    | col_bgint         | col_char | col_vchar        | col_tmstmp              | col_date   | col_boln | col_dbl         |
+------------+-------------------+----------+------------------+-------------------------+------------+----------+-----------------+
| -225058309 | 44225657827296040 | QAhHUTNv | njEl0iAivVwLEbAg | 2015-04-23 23:47:00.124 | 2001-05-27 | true     | 9.05562343765E8 |
| 112439     | 2211473083106365  | oBwtZmzK | WOLzjgzk5HgXO8l1 | 2015-04-23 23:47:00.124 | 1958-03-12 | true     | 9.91522739327E7 |
| 1784506402 | 29351094148275324 | QlyTFtkw | SmP428hFKw5A085  | 2015-04-23 23:47:00.126 | 1990-07-21 | false    | 7.5267099524E8  |
| 2012511944 | 53048515756157424 | BybVyYWH | cnDqVgMANMtJGbuv | 2015-04-23 23:47:00.127 | 1953-02-10 | false    | 7.84335138726E8 |
| 1546241934 | 77079445473902384 | ZDxIxxLs | JHp2B9q0rJKyZ12K | 2015-04-23 23:47:00.127 | 1976-01-09 | false    | 8.70192831047E8 |
| 1553741594 | 69736976577857208 | smtgHcaV | aJoVXdN9363arx1Y | 2015-04-23 23:47:00.127 | 1953-10-18 | true     | 1.109816755E9   |
| 395345402  | 18812010265162496 | pNetevKF | IrXNHyRm4EY02Mv7 | 2015-04-23 23:47:00.127 | 1989-09-17 | false    | 1.66747196389E8 |
| -478640426 | 71296213927402152 | jXsJbzuR | DJGy8xMZGEpHOEJh | 2015-04-23 23:47:00.128 | 2013-02-21 | false    | 5.41750692166E8 |
+------------+-------------------+----------+------------------+-------------------------+------------+----------+-----------------+
{noformat}
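
For context on the original failure: the doubled quotes in the error message show that the 
enclosing quotes reached the timestamp cast unchanged. Below is a minimal, self-contained Java 
sketch of the kind of quote stripping that makes such a field parseable; it is an illustration of 
the idea only, not Drill's actual text-reader or cast code.

{code:java}
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class QuotedTimestampExample {

  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

  // Strip one pair of enclosing double quotes, if present, then parse the timestamp.
  static LocalDateTime parseCsvTimestamp(String rawField) {
    String value = rawField.trim();
    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
      value = value.substring(1, value.length() - 1);
    }
    return LocalDateTime.parse(value, FORMAT);
  }

  public static void main(String[] args) {
    // Quoted CSV field, as it appears in the failing file.
    System.out.println(parseCsvTimestamp("\"2015-04-23 23:47:00.124\""));
    // An unquoted field parses the same way.
    System.out.println(parseCsvTimestamp("2015-04-23 23:47:00.124"));
  }
}
{code}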

> CTAS reports error when timestamp values in CSV file are quoted
> ---
>
> Key: DRILL-2873
> URL: https://issues.apache.org/jira/browse/DRILL-2873
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Priority: Major
> Fix For: 1.17.0
>
>
> When timestamp values are enclosed in quotes (") inside a CSV data file, the 
> CTAS statement reports an error.
> Failing CTAS
> {code}
> 0: jdbc:drill:> create table prqFrmCSV02 as select cast(columns[0] as int) 
> col_int, cast(columns[1] as bigint) col_bgint, cast(columns[2] as char(10)) 
> col_char, cast(columns[3] as varchar(18)) col_vchar, cast(columns[4] as 
> timestamp) col_tmstmp, cast(columns[5] as date) col_date, cast(columns[6] as 
> boolean) col_boln, cast(columns[7] as double) col_dbl from `csvToPrq.csv`;
> Query failed: SYSTEM ERROR: Invalid format: ""2015-04-23 23:47:00.124""
> [a601a66a-b305-4a92-9836-f39edcdc8fe8 on centos-02.qa.lab:31010]
> Error: exception while executing query: Failure while executing query. 
> (state=,code=0)
> {code}
> Stack trace from drillbit.log
> {code}
> 2015-04-24 18:41:09,721 [2ac571ba-778f-f3d5-c60f-af2e536905a3:frag:0:0] ERROR 
> o.a.drill.exec.ops.FragmentContext - Fragment Context received failure -- 
> Fragment: 0:0
> org.apache.drill.common.exceptions.DrillUserException: SYSTEM ERROR: Invalid 
> format: ""2015-04-23 23:47:00.124""
> [a601a66a-b305-4a92-9836-f39edcdc8fe8 on centos-02.qa.lab:31010]
> at 
> org.apache.drill.common.exceptions.DrillUserException$Builder.build(DrillUserException.java:115)
>  ~[drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.common.exceptions.ErrorHelper.wrap(ErrorHelper.java:39) 
> ~[drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.ops.FragmentContext.fail(FragmentContext.java:151) 
> ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext(WriterRecordBatch.java:131)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> 

[jira] [Updated] (DRILL-2873) CTAS reports error when timestamp values in CSV file are quoted

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-2873:

Fix Version/s: (was: Future)
   1.17.0

> CTAS reports error when timestamp values in CSV file are quoted
> ---
>
> Key: DRILL-2873
> URL: https://issues.apache.org/jira/browse/DRILL-2873
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Priority: Major
> Fix For: 1.17.0
>
>
> When timestamp values are enclosed in quotes (") inside a CSV data file, the 
> CTAS statement reports an error.
> Failing CTAS
> {code}
> 0: jdbc:drill:> create table prqFrmCSV02 as select cast(columns[0] as int) 
> col_int, cast(columns[1] as bigint) col_bgint, cast(columns[2] as char(10)) 
> col_char, cast(columns[3] as varchar(18)) col_vchar, cast(columns[4] as 
> timestamp) col_tmstmp, cast(columns[5] as date) col_date, cast(columns[6] as 
> boolean) col_boln, cast(columns[7] as double) col_dbl from `csvToPrq.csv`;
> Query failed: SYSTEM ERROR: Invalid format: ""2015-04-23 23:47:00.124""
> [a601a66a-b305-4a92-9836-f39edcdc8fe8 on centos-02.qa.lab:31010]
> Error: exception while executing query: Failure while executing query. 
> (state=,code=0)
> {code}
> Stack trace from drillbit.log
> {code}
> 2015-04-24 18:41:09,721 [2ac571ba-778f-f3d5-c60f-af2e536905a3:frag:0:0] ERROR 
> o.a.drill.exec.ops.FragmentContext - Fragment Context received failure -- 
> Fragment: 0:0
> org.apache.drill.common.exceptions.DrillUserException: SYSTEM ERROR: Invalid 
> format: ""2015-04-23 23:47:00.124""
> [a601a66a-b305-4a92-9836-f39edcdc8fe8 on centos-02.qa.lab:31010]
> at 
> org.apache.drill.common.exceptions.DrillUserException$Builder.build(DrillUserException.java:115)
>  ~[drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.common.exceptions.ErrorHelper.wrap(ErrorHelper.java:39) 
> ~[drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.ops.FragmentContext.fail(FragmentContext.java:151) 
> ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext(WriterRecordBatch.java:131)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:99)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:89)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:135)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:74) 
> ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:76)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:64) 
> ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:164)
>  ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
> at 
> 

[jira] [Commented] (DRILL-7454) Convert the Avro format plugin to use EVF

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015125#comment-17015125
 ] 

ASF GitHub Bot commented on DRILL-7454:
---

arina-ielchiieva commented on issue #1951: DRILL-7454: Convert Avro to EVF
URL: https://github.com/apache/drill/pull/1951#issuecomment-574200463
 
 
   @cgivre created Jira for the documentation update 
(https://issues.apache.org/jira/browse/DRILL-7528).
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Convert the Avro format plugin to use EVF
> -
>
> Key: DRILL-7454
> URL: https://issues.apache.org/jira/browse/DRILL-7454
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.17.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
>
> Convert the Avro format plugin to use EVF.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7454) Convert the Avro format plugin to use EVF

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015126#comment-17015126
 ] 

ASF GitHub Bot commented on DRILL-7454:
---

arina-ielchiieva commented on issue #1951: DRILL-7454: Convert Avro to EVF
URL: https://github.com/apache/drill/pull/1951#issuecomment-574200463
 
 
   @cgivre created Jira for the documentation update 
(https://issues.apache.org/jira/browse/DRILL-7528).
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Convert the Avro format plugin to use EVF
> -
>
> Key: DRILL-7454
> URL: https://issues.apache.org/jira/browse/DRILL-7454
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.17.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
>
> Convert the Avro format plugin to use EVF.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015124#comment-17015124
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on issue #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#issuecomment-574200097
 
 
   +1, LGTM.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a parquet 
> file generated by Hive) and try count() on columns *c13 - c15*. For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}
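
For readers following the thread: the change under review makes the group scan report 
Statistic.NO_COLUMN_STATS instead of 0 for complex columns like c13, so count() can no longer be 
short-circuited from metadata and is presumably computed from the data itself. Below is a 
deliberately simplified sketch of that idea in plain Java; countFromStats, rowCount and nullCount 
are illustrative names for this sketch, not Drill APIs.

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

public class CountFromStatsExample {

  // Metadata-only count: valid only when both the row count and the column's null count are known.
  static OptionalLong countFromStats(Long rowCount, Long nullCount) {
    if (rowCount == null || nullCount == null) {
      return OptionalLong.empty();  // the NO_COLUMN_STATS case: caller must scan the data
    }
    return OptionalLong.of(rowCount - nullCount);
  }

  public static void main(String[] args) {
    // Simple column: statistics answer count(col) without reading the file.
    System.out.println(countFromStats(3L, 0L));  // OptionalLong[3]

    // Complex column (the c13 case): its null count is unknown, so count the values instead.
    List<Object> c13Values = Arrays.asList(new Object(), new Object(), new Object());
    long counted = countFromStats(3L, null)
        .orElseGet(() -> c13Values.stream().filter(v -> v != null).count());
    System.out.println(counted);  // 3, the expected result
  }
}
{code}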



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7528) Update Avro format plugin documentation

2020-01-14 Thread Arina Ielchiieva (Jira)
Arina Ielchiieva created DRILL-7528:
---

 Summary: Update Avro format plugin documentation
 Key: DRILL-7528
 URL: https://issues.apache.org/jira/browse/DRILL-7528
 Project: Apache Drill
  Issue Type: Task
Reporter: Arina Ielchiieva


Currently the documentation states that the Avro plugin is experimental.
As of Drill 1.17 / 1.18 its code is quite stable (since Drill 1.18 it uses 
EVF).
The documentation should be updated accordingly.

https://drill.apache.org/docs/querying-avro-files/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7491:

Labels: ready-to-commit  (was: )

> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a parquet 
> file generated by Hive) and try count() on columns *c13 - c15*. For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-3818) Error when DISTINCT and GROUP BY is used in avro or json

2020-01-14 Thread Arina Ielchiieva (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015119#comment-17015119
 ] 

Arina Ielchiieva commented on DRILL-3818:
-

{noformat}
apache drill> select DISTINCT(t.a.b.c), MAX(t.a.e) from dfs.tmp.`data.json` t 
GROUP BY t.a.b.c LIMIT 1;
+++
| EXPR$0 | EXPR$1 |
+++
| d  | 2  |
+++
1 row selected (0.924 seconds)
apache drill> select DISTINCT(t.a.b.c), MAX(t.a.e) from dfs.tmp.`data.json` t 
GROUP BY t.a.b.c;
+++
| EXPR$0 | EXPR$1 |
+++
| d  | 2  |
+++
{noformat}

> Error when DISTINCT and GROUP BY is used in avro or json
> 
>
> Key: DRILL-3818
> URL: https://issues.apache.org/jira/browse/DRILL-3818
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - RPC, SQL Parser
>Affects Versions: 1.1.0, 1.2.0
> Environment: Linux Mint 17.1
> java version "1.7.0_80"
> Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
>Reporter: Philip Deegan
>Priority: Major
> Fix For: 1.17.0
>
>
> Data
> {noformat}
> { "a": { "b": { "c": "d" }, "e": 2 }}
> {noformat}
> Query
> {noformat}
> select DISTINCT(t.a.b.c), MAX(t.a.e)  FROM dfs.`json.json` t GROUP BY t.a.b.c 
> LIMIT 1;
> {noformat}
> Occurs on 1.1.0 and incubator-drill master
> {noformat}
> +---+
> | commit_id |
> +---+
> | 9f54aac33df3e783c0192ab56c7e1313dbc823fa  |
> +---+
> [Error Id: bb826851-d8cb-46f5-96c0-1ed01d3d8c45 on philix:31010]
>   at 
> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:247)
>   at 
> org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:290)
>   at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:1359)
>   at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:74)
>   at 
> net.hydromatic.avatica.AvaticaConnection.executeQueryInternal(AvaticaConnection.java:404)
>   at 
> net.hydromatic.avatica.AvaticaStatement.executeQueryInternal(AvaticaStatement.java:351)
>   at 
> net.hydromatic.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:338)
>   at 
> net.hydromatic.avatica.AvaticaStatement.execute(AvaticaStatement.java:69)
>   at 
> org.apache.drill.jdbc.impl.DrillStatementImpl.execute(DrillStatementImpl.java:86)
>   at sqlline.Commands.execute(Commands.java:841)
>   at sqlline.Commands.sql(Commands.java:751)
>   at sqlline.SqlLine.dispatch(SqlLine.java:737)
>   at sqlline.SqlLine.begin(SqlLine.java:612)
>   at sqlline.SqlLine.start(SqlLine.java:366)
>   at sqlline.SqlLine.main(SqlLine.java:259)
> Caused by: org.apache.drill.common.exceptions.UserRemoteException: VALIDATION 
> ERROR: java.lang.NullPointerException
> [Error Id: bb826851-d8cb-46f5-96c0-1ed01d3d8c45 on philix:31010]
>   at 
> org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:118)
>   at 
> org.apache.drill.exec.rpc.user.UserClient.handleReponse(UserClient.java:110)
>   at 
> org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:47)
>   at 
> org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:32)
>   at org.apache.drill.exec.rpc.RpcBus.handle(RpcBus.java:61)
>   at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:233)
>   at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:205)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> 

[jira] [Commented] (DRILL-7454) Convert the Avro format plugin to use EVF

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015118#comment-17015118
 ] 

ASF GitHub Bot commented on DRILL-7454:
---

cgivre commented on issue #1951: DRILL-7454: Convert Avro to EVF
URL: https://github.com/apache/drill/pull/1951#issuecomment-574198128
 
 
   Thanks @arina-ielchiieva for doing this. Can we also update the docs [1]? 
They state that querying Avro is experimental and that there are known issues.
   
   [1]: https://drill.apache.org/docs/querying-avro-files/
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Convert the Avro format plugin to use EVF
> -
>
> Key: DRILL-7454
> URL: https://issues.apache.org/jira/browse/DRILL-7454
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.17.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
>
> Convert the Avro format plugin to use EVF.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-3818) Error when DISTINCT and GROUP BY is used in avro or json

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-3818:

Fix Version/s: (was: Future)
   1.17.0

> Error when DISTINCT and GROUP BY is used in avro or json
> 
>
> Key: DRILL-3818
> URL: https://issues.apache.org/jira/browse/DRILL-3818
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - RPC, SQL Parser
>Affects Versions: 1.1.0, 1.2.0
> Environment: Linux Mint 17.1
> java version "1.7.0_80"
> Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
>Reporter: Philip Deegan
>Priority: Major
> Fix For: 1.17.0
>
>
> Data
> {noformat}
> { "a": { "b": { "c": "d" }, "e": 2 }}
> {noformat}
> Query
> {noformat}
> select DISTINCT(t.a.b.c), MAX(t.a.e)  FROM dfs.`json.json` t GROUP BY t.a.b.c 
> LIMIT 1;
> {noformat}
> Occurs on 1.1.0 and incubator-drill master
> {noformat}
> +---+
> | commit_id |
> +---+
> | 9f54aac33df3e783c0192ab56c7e1313dbc823fa  |
> +---+
> [Error Id: bb826851-d8cb-46f5-96c0-1ed01d3d8c45 on philix:31010]
>   at 
> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:247)
>   at 
> org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:290)
>   at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:1359)
>   at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:74)
>   at 
> net.hydromatic.avatica.AvaticaConnection.executeQueryInternal(AvaticaConnection.java:404)
>   at 
> net.hydromatic.avatica.AvaticaStatement.executeQueryInternal(AvaticaStatement.java:351)
>   at 
> net.hydromatic.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:338)
>   at 
> net.hydromatic.avatica.AvaticaStatement.execute(AvaticaStatement.java:69)
>   at 
> org.apache.drill.jdbc.impl.DrillStatementImpl.execute(DrillStatementImpl.java:86)
>   at sqlline.Commands.execute(Commands.java:841)
>   at sqlline.Commands.sql(Commands.java:751)
>   at sqlline.SqlLine.dispatch(SqlLine.java:737)
>   at sqlline.SqlLine.begin(SqlLine.java:612)
>   at sqlline.SqlLine.start(SqlLine.java:366)
>   at sqlline.SqlLine.main(SqlLine.java:259)
> Caused by: org.apache.drill.common.exceptions.UserRemoteException: VALIDATION 
> ERROR: java.lang.NullPointerException
> [Error Id: bb826851-d8cb-46f5-96c0-1ed01d3d8c45 on philix:31010]
>   at 
> org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:118)
>   at 
> org.apache.drill.exec.rpc.user.UserClient.handleReponse(UserClient.java:110)
>   at 
> org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:47)
>   at 
> org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:32)
>   at org.apache.drill.exec.rpc.RpcBus.handle(RpcBus.java:61)
>   at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:233)
>   at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:205)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
>   at 
> 

[jira] [Closed] (DRILL-3818) Error when DISTINCT and GROUP BY is used in avro or json

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva closed DRILL-3818.
---
Resolution: Fixed

> Error when DISTINCT and GROUP BY is used in avro or json
> 
>
> Key: DRILL-3818
> URL: https://issues.apache.org/jira/browse/DRILL-3818
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - RPC, SQL Parser
>Affects Versions: 1.1.0, 1.2.0
> Environment: Linux Mint 17.1
> java version "1.7.0_80"
> Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
>Reporter: Philip Deegan
>Priority: Major
> Fix For: 1.17.0
>
>
> Data
> {noformat}
> { "a": { "b": { "c": "d" }, "e": 2 }}
> {noformat}
> Query
> {noformat}
> select DISTINCT(t.a.b.c), MAX(t.a.e)  FROM dfs.`json.json` t GROUP BY t.a.b.c 
> LIMIT 1;
> {noformat}
> Occurs on 1.1.0 and incubator-drill master
> {noformat}
> +---+
> | commit_id |
> +---+
> | 9f54aac33df3e783c0192ab56c7e1313dbc823fa  |
> +---+
> [Error Id: bb826851-d8cb-46f5-96c0-1ed01d3d8c45 on philix:31010]
>   at 
> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:247)
>   at 
> org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:290)
>   at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:1359)
>   at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:74)
>   at 
> net.hydromatic.avatica.AvaticaConnection.executeQueryInternal(AvaticaConnection.java:404)
>   at 
> net.hydromatic.avatica.AvaticaStatement.executeQueryInternal(AvaticaStatement.java:351)
>   at 
> net.hydromatic.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:338)
>   at 
> net.hydromatic.avatica.AvaticaStatement.execute(AvaticaStatement.java:69)
>   at 
> org.apache.drill.jdbc.impl.DrillStatementImpl.execute(DrillStatementImpl.java:86)
>   at sqlline.Commands.execute(Commands.java:841)
>   at sqlline.Commands.sql(Commands.java:751)
>   at sqlline.SqlLine.dispatch(SqlLine.java:737)
>   at sqlline.SqlLine.begin(SqlLine.java:612)
>   at sqlline.SqlLine.start(SqlLine.java:366)
>   at sqlline.SqlLine.main(SqlLine.java:259)
> Caused by: org.apache.drill.common.exceptions.UserRemoteException: VALIDATION 
> ERROR: java.lang.NullPointerException
> [Error Id: bb826851-d8cb-46f5-96c0-1ed01d3d8c45 on philix:31010]
>   at 
> org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:118)
>   at 
> org.apache.drill.exec.rpc.user.UserClient.handleReponse(UserClient.java:110)
>   at 
> org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:47)
>   at 
> org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:32)
>   at org.apache.drill.exec.rpc.RpcBus.handle(RpcBus.java:61)
>   at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:233)
>   at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:205)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
>   at 
> 

[jira] [Updated] (DRILL-5024) CTAS with LIMIT 0 query in SELECT stmt does not create parquet file

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-5024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-5024:

Fix Version/s: 1.17.0

> CTAS with LIMIT 0 query in SELECT stmt does not create parquet file
> ---
>
> Key: DRILL-5024
> URL: https://issues.apache.org/jira/browse/DRILL-5024
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.8.0
>Reporter: Khurram Faraaz
>Priority: Major
> Fix For: 1.17.0
>
>
> Note that CTAS was successful
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> create table regtbl_w0rows as select * from 
> typeall_l LIMIT 0;
> +---++
> | Fragment  | Number of records written  |
> +---++
> | 0_0   | 0  |
> +---++
> 1 row selected (0.51 seconds)
> {noformat}
> But a SELECT on the CTAS-created table fails.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select * from regtbl_w0rows;
> Error: VALIDATION ERROR: From line 1, column 15 to line 1, column 27: Table 
> 'regtbl_w0rows' not found
> SQL Query null
> [Error Id: 0569cf98-3800-43ee-b635-aa101b016d46 on centos-01.qa.lab:31010] 
> (state=,code=0)
> {noformat}
> DROP on the CTAS-created table also fails
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> drop table regtbl_w0rows;
> Error: VALIDATION ERROR: Table [regtbl_w0rows] not found
> [Error Id: fb0b1ea8-f76d-42e2-b69c-4beae2798bdf on centos-01.qa.lab:31010] 
> (state=,code=0)
> 0: jdbc:drill:schema=dfs.tmp>
> {noformat}
> Verified that CTAS did not create a physical file in the dfs.tmp schema
> {noformat}
> [test@cent01 bin]# hadoop fs -ls /tmp/regtbl_w0rows
> ls: `/tmp/regtbl_w0rows': No such file or directory
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-5024) CTAS with LIMIT 0 query in SELECT stmt does not create parquet file

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-5024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva resolved DRILL-5024.
-
Resolution: Fixed

> CTAS with LIMIT 0 query in SELECT stmt does not create parquet file
> ---
>
> Key: DRILL-5024
> URL: https://issues.apache.org/jira/browse/DRILL-5024
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.8.0
>Reporter: Khurram Faraaz
>Priority: Major
> Fix For: 1.17.0
>
>
> Note that CTAS was successful
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> create table regtbl_w0rows as select * from 
> typeall_l LIMIT 0;
> +---++
> | Fragment  | Number of records written  |
> +---++
> | 0_0   | 0  |
> +---++
> 1 row selected (0.51 seconds)
> {noformat}
> But a SELECT on the CTAS-created table fails.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select * from regtbl_w0rows;
> Error: VALIDATION ERROR: From line 1, column 15 to line 1, column 27: Table 
> 'regtbl_w0rows' not found
> SQL Query null
> [Error Id: 0569cf98-3800-43ee-b635-aa101b016d46 on centos-01.qa.lab:31010] 
> (state=,code=0)
> {noformat}
> DROP on the CTAS-created table also fails
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> drop table regtbl_w0rows;
> Error: VALIDATION ERROR: Table [regtbl_w0rows] not found
> [Error Id: fb0b1ea8-f76d-42e2-b69c-4beae2798bdf on centos-01.qa.lab:31010] 
> (state=,code=0)
> 0: jdbc:drill:schema=dfs.tmp>
> {noformat}
> Verified that CTAS did not create a physical file in the dfs.tmp schema
> {noformat}
> [test@cent01 bin]# hadoop fs -ls /tmp/regtbl_w0rows
> ls: `/tmp/regtbl_w0rows': No such file or directory
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015093#comment-17015093
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

KazydubB commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366339423
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   It depends on which perspective you take, but I agree the one I proposed is 
confusing. The name still looks clumsy, but with the comments it should be 
fine. Thanks!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a parquet 
> file generated by Hive) and try count() on columns *c13 - c15*. For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015074#comment-17015074
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366328108
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = columnStats == null
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  // When exists statistic for nested field, this is complex column which 
is present in table.
 
 Review comment:
   ```suggestion
 // When statistics for nested field exists, this is complex column 
which is present in table.
   ```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a parquet 
> file generated by Hive) and try count() on columns *c13 - c15*. For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015073#comment-17015073
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366327715
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   Sorry @KazydubB, but the proposed new name is also confusing, since the 
column passed as the first argument isn't a leaf. I've added a comprehensive 
comment at the place of usage; could you confirm that the intention is now clear?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a parquet 
> file generated by Hive) and try count() on columns *c13 - c15*. For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-3916) Assembly for JDBC storage plugin missing

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva resolved DRILL-3916.
-
Resolution: Fixed

> Assembly for JDBC storage plugin missing
> 
>
> Key: DRILL-3916
> URL: https://issues.apache.org/jira/browse/DRILL-3916
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - JDBC, Storage - Other
>Affects Versions: 1.2.0
>Reporter: Andrew
>Assignee: Andrew
>Priority: Major
>
> The JDBC storage plugin is missing from the assembly instructions, which 
> means that the plugin fails to be loaded by the drillbit on startup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (DRILL-3916) Assembly for JDBC storage plugin missing

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva closed DRILL-3916.
---

> Assembly for JDBC storage plugin missing
> 
>
> Key: DRILL-3916
> URL: https://issues.apache.org/jira/browse/DRILL-3916
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - JDBC, Storage - Other
>Affects Versions: 1.2.0
>Reporter: Andrew
>Assignee: Andrew
>Priority: Major
>
> The JDBC storage plugin is missing from the assembly instructions, which 
> means that the plugin fails to be loaded by the drillbit on startup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015061#comment-17015061
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

KazydubB commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366316876
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   The name is a little bit awkward, indeed. It can be changed to something 
like `existStatsForLeafColumn` (in that case, 'nested fields' should be changed 
to 'leaf columns' in the Javadoc).
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a parquet 
> file generated by Hive) and try count() on columns *c13 - c15*. For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015055#comment-17015055
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366313821
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   No, nested stats exist but are not applicable to the parent column. As the 
Javadoc says, the method is used only to determine whether the column is 
actually absent. The current name expresses exactly what the method does.
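
To make the distinction concrete, here is a toy restatement of the check being discussed, with 
dotted strings standing in for SchemaPath and a plain map standing in for the column statistics; 
this existsNestedStatsForColumn is an illustrative re-implementation under those assumptions, not 
the code in the pull request.

{code:java}
import java.util.Map;

public class NestedStatsCheckExample {

  // True when statistics exist for some nested field of `column` (e.g. stats for "a.b" when asked
  // about "a"): the column is then a complex column that exists in the table, even though it has
  // no statistics of its own.
  static boolean existsNestedStatsForColumn(String column, Map<String, Long> statsByPath) {
    String prefix = column + ".";
    return statsByPath.keySet().stream().anyMatch(path -> path.startsWith(prefix));
  }

  public static void main(String[] args) {
    Map<String, Long> stats = Map.of("a.b", 1L, "c", 0L);
    System.out.println(existsNestedStatsForColumn("a", stats)); // true  -> complex column, present
    System.out.println(existsNestedStatsForColumn("c", stats)); // false -> leaf column with its own stats
    System.out.println(existsNestedStatsForColumn("x", stats)); // false -> column absent from the table
  }
}
{code}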
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce, use the attached {{hive_alltypes.parquet}} file (a parquet 
> file generated by Hive) and try count() on columns *c13 - c15*. For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015051#comment-17015051
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366307446
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
 
 Review comment:
   ```suggestion
   ColumnStatistics nonInterestingColStats = columnStats == null
   ```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015052#comment-17015052
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366310138
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
*/
   @Override
   public long getColumnValueCount(SchemaPath column) {
-long tableRowCount, colNulls;
-Long nulls;
 ColumnStatistics columnStats = 
getTableMetadata().getColumnStatistics(column);
-ColumnStatistics nonInterestingColStats = null;
-if (columnStats == null) {
-  nonInterestingColStats = 
getNonInterestingColumnsMetadata().getColumnStatistics(column);
-}
+ColumnStatistics nonInterestingColStats = (columnStats == null)
+? getNonInterestingColumnsMetadata().getColumnStatistics(column) : 
null;
 
+long tableRowCount;
 if (columnStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata());
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
+  columnStats = nonInterestingColStats;
+} else if (existsNestedStatsForColumn(column, getTableMetadata())
+|| existsNestedStatsForColumn(column, 
getNonInterestingColumnsMetadata())) {
+  return Statistic.NO_COLUMN_STATS;
 } else {
   return 0; // returns 0 if the column doesn't exist in the table.
 }
 
-columnStats = columnStats != null ? columnStats : nonInterestingColStats;
-nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
-colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
+Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
+if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || 
Statistic.NO_COLUMN_STATS == tableRowCount) {
+  return Statistic.NO_COLUMN_STATS;
+} else {
+  return tableRowCount - nulls;
+}
+  }
 
-return Statistic.NO_COLUMN_STATS == tableRowCount
-|| Statistic.NO_COLUMN_STATS == colNulls
-? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
+  /**
+   * For complex columns, stats may be present only for nested fields. For 
example, a column path is `a`,
+   * but stats present for `a`.`b`. So before making a decision that column is 
absent, the case needs
+   * to be tested.
+   *
+   * @param column   column path
+   * @param metadata metadata with column statistics
+   * @return whether stats exists for nested fields
+   */
+  private boolean existsNestedStatsForColumn(SchemaPath column, Metadata 
metadata) {
 
 Review comment:
   I am not quite sure I follow this method's logic: when you call this method 
you return that no statistics are present, but the method states that it checks 
whether nested column statistics exist. I guess we might need to rename the 
method to reflect exactly what it does, for example checking whether the column 
exists.
   By the way, if stats for a nested column exist, can we use them for the 
calculation?
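   One reason the nested stats are not directly reusable, shown with a tiny 
hypothetical example (plain Java data, not Drill code): the null count of 
`a`.`b` also covers rows where `a` itself is non-null but `b` is null, so 
`rowCount - nulls(a.b)` is not the same as `count(a)`.
   ```java
import java.util.Arrays;

public class NestedNullCounts {
  public static void main(String[] args) {
    // Three rows of a struct column `a` with a single nested field `b`:
    //   row 1: a = {b: 1},  row 2: a = {b: null},  row 3: a = null
    Integer[][] rows = { {1}, {null}, null };

    long rowCount = rows.length;                                         // 3
    long countA = Arrays.stream(rows).filter(a -> a != null).count();    // 2
    long nullsAB = Arrays.stream(rows)
        .filter(a -> a == null || a[0] == null).count();                 // 2

    System.out.println(countA);             // 2 -> the correct count(a)
    System.out.println(rowCount - nullsAB); // 1 -> wrong if used as count(a)
  }
}
   ```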
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015039#comment-17015039
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366303787
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 } else {
-  return 0; // returns 0 if the column doesn't exist in the table.
+  return Statistic.NO_COLUMN_STATS;
 
 Review comment:
   Hello @arina-ielchiieva, thank you very much for describing the issue. I've 
updated this PR; please take another look. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7509) Incorrect TupleSchema is created for DICT column when querying Parquet files

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7509:

Reviewer: Igor Guzenko

> Incorrect TupleSchema is created for DICT column when querying Parquet files
> 
>
> Key: DRILL-7509
> URL: https://issues.apache.org/jira/browse/DRILL-7509
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
> Fix For: 1.18.0
>
>
> When a {{DICT}} column is queried from a Parquet file, its {{TupleSchema}} 
> contains a nested element, e.g. `map`, which itself contains the `key` and 
> `value` fields, rather than containing the `key` and `value` fields directly 
> in the {{DICT}}'s {{TupleSchema}}. The nested element, `map`, comes from the 
> inner structure of Parquet's {{MAP}} representation (which corresponds to 
> Drill's {{DICT}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7491:

Fix Version/s: 1.18.0

> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7491:

Reviewer: Arina Ielchiieva

> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7491) Incorrect count() returned for complex types in parquet

2020-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014988#comment-17014988
 ] 

ASF GitHub Bot commented on DRILL-7491:
---

arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() 
returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366250930
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##
 @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
 } else if (nonInterestingColStats != null) {
   tableRowCount = 
TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
 } else {
-  return 0; // returns 0 if the column doesn't exist in the table.
+  return Statistic.NO_COLUMN_STATS;
 
 Review comment:
   @ihuzenko I am not sure that change is correct. As you can see from the 
javadoc and the comment that you have removed, 0 is returned deliberately to 
avoid a full scan if the column does not exist.
   Here is an example of a unit test showing that your change breaks existing 
functionality; you can add it to the 
`org.apache.drill.exec.planner.logical.TestConvertCountToDirectScan` class:
   ```java
  @Test
  public void textConvertAbsentColumn() throws Exception {
    String sql = "select count(abc) as cnt from cp.`tpch/nation.parquet`";

    queryBuilder()
        .sql(sql)
        .planMatcher()
        .include("DynamicPojoRecordReader")
        .match();

    testBuilder()
        .sqlQuery(sql)
        .unOrdered()
        .baselineColumns("cnt")
        .baselineValues(0L)
        .go();
  }
   ```
   After your changes, this test will fail. I think you need to find a way to 
determine whether the column is absent or complex.
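   A hypothetical companion test for the complex-column case from DRILL-7491 
(not part of this PR), assuming the attached `hive_alltypes.parquet` has been 
copied into the `dfs.tmp` workspace; the plan-level assertion that the query is 
*not* rewritten to a `DynamicPojoRecordReader` direct scan is omitted here, 
since only the positive `include` matcher is shown above:
   ```java
  @Test
  public void testCountOnComplexColumn() throws Exception {
    // Expected to return 3 (the row count), not 0, once complex columns
    // report NO_COLUMN_STATS instead of being treated as absent.
    String sql = "select count(c13) as cnt from dfs.tmp.`hive_alltypes.parquet`";

    testBuilder()
        .sqlQuery(sql)
        .unOrdered()
        .baselineColumns("cnt")
        .baselineValues(3L)
        .go();
  }
   ```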
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect count() returned for complex types in parquet
> ---
>
> Key: DRILL-7491
> URL: https://issues.apache.org/jira/browse/DRILL-7491
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill, Functions - Hive, Storage - Parquet
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Igor Guzenko
>Assignee: Igor Guzenko
>Priority: Major
> Attachments: hive_alltypes.parquet
>
>
> To reproduce use the attached file for {{hive_alltypes.parquet}} (this is 
> parquet file generated by Hive) and try count on columns *c13 - c15.*  For 
> example, 
> {code:sql}
> SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
> {code}
> *Expected result:* {color:green}3 {color}
> *Actual result:* {color:red}0{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7449) memory leak parse_url function

2020-01-14 Thread benj (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014960#comment-17014960
 ] 

benj commented on DRILL-7449:
-

Hi [~IhorHuzenko]

The problem doesn't appear on every run. Sometimes (with exactly the same 
data) it works 5 times before crashing.

This is with the official 1.17 on a small 3-node cluster (each node ~ 48 cores 
/ 128 GB RAM, DRILL_HEAP=15G, DRILL_MAX_DIRECT_MEMORY=80G), 
with a file of 688 MB / 1 118 320 JSON records.

On the cluster, when comparing the profiles of correct and crashed executions I 
can see that:
 - the crash appears at the "02-xx-02 - EXTERNAL_SORT" level
 - on "02-xx-03 - UNORDERED_RECEIVER":
 - in the correct execution, 99% of the Max Records are concentrated on 1 of 
the 8 minor fragments, and the cumulative total is correct
 - in the crashed execution, the Max Records are roughly evenly distributed 
over the 8 minor fragments and the cumulative total is incorrect (lower); it is 
already incorrect in 03-xx-02 - PROJECT and 03-xx-00 - JSON_SUB_SCAN

On my local machine (also 1.17, 8 cores / 32 GB), in embedded mode, when 
comparing the profiles of correct and crashed executions I can see that:
 - the crash appears at the "02-xx-02 - EXTERNAL_SORT" level
 - the difference is in 03-xx-00 - JSON_SUB_SCAN: the crashed execution does 
not have the right number for Max Records
 - for 02-xx-03 - UNORDERED_RECEIVER, in both correct and crashed executions 
the Max Records are roughly evenly distributed over the 6 minor fragments

Example of log data from a crashed execution on the cluster:
{noformat}
  2020-01-14 08:22:33,681 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:foreman] INFO  
o.a.drill.exec.work.foreman.Foreman - Query text for query with id 
21e285b6-4d53-58fd-8a4d-dedc0cbfb86a issued by anonymous: CREATE TABLE 
dfs.test.`output_pqt` AS (
SELECT R.parsed.host AS D FROM (SELECT parse_url(T.Url) AS parsed FROM 
dfs.test.`demo2.big.json` AS T) AS R ORDER BY D
)
2020-01-14 08:22:33,724 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:foreman] INFO  
o.a.d.e.p.s.h.CreateTableHandler - Creating persistent table [output_pqt].
2020-01-14 08:22:33,779 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:3] INFO  
o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:3: 
State change requested AWAITING_ALLOCATION --> RUNNING
2020-01-14 08:22:33,779 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:7] INFO  
o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:7: 
State change requested AWAITING_ALLOCATION --> RUNNING
2020-01-14 08:22:33,779 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:5] INFO  
o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:5: 
State change requested AWAITING_ALLOCATION --> RUNNING
2020-01-14 08:22:33,780 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:7] INFO  
o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:7: 
State to report: RUNNING
2020-01-14 08:22:33,780 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:3] INFO  
o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:3: 
State to report: RUNNING
2020-01-14 08:22:33,780 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:5] INFO  
o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:5: 
State to report: RUNNING
2020-01-14 08:22:33,782 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:1:2] INFO  
o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: 
State change requested AWAITING_ALLOCATION --> RUNNING
2020-01-14 08:22:33,782 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:1:2] INFO  
o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: 
State to report: RUNNING
2020-01-14 08:22:33,787 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:0:0] INFO  
o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: 
State change requested AWAITING_ALLOCATION --> RUNNING
2020-01-14 08:22:33,787 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:0:0] INFO  
o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: 
State to report: RUNNING
2020-01-14 08:22:41,672 [BitServer-2] INFO  o.a.d.e.w.fragment.FragmentExecutor 
- 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: State change requested RUNNING --> 
CANCELLATION_REQUESTED
2020-01-14 08:22:41,673 [BitServer-2] INFO  o.a.d.e.w.f.FragmentStatusReporter 
- 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: State to report: 
CANCELLATION_REQUESTED
2020-01-14 08:22:41,674 [BitServer-2] INFO  o.a.d.e.w.fragment.FragmentExecutor 
- 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: State change requested RUNNING --> 
CANCELLATION_REQUESTED
2020-01-14 08:22:41,674 [BitServer-2] INFO  o.a.d.e.w.f.FragmentStatusReporter 
- 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: State to report: 
CANCELLATION_REQUESTED
2020-01-14 08:22:41,675 [BitServer-2] INFO  o.a.d.e.w.fragment.FragmentExecutor 
- 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:3: State change requested RUNNING --> 
CANCELLATION_REQUESTED
2020-01-14 08:22:41,675