[jira] [Commented] (DRILL-6494) Drill Plugins Handler

2018-07-02 Thread Bridget Bevens (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530864#comment-16530864
 ] 

Bridget Bevens commented on DRILL-6494:
---

Hi [~vitalii],

I've created a new doc (rough draft) and added a section about the 
storage-plugins.conf file. Can you please review?
The doc is located 
[here|https://docs.google.com/document/d/1IUqf7YMXoHUdP1xOV2aOHesRXWjdX-S3hTftczW1Bqc/edit?usp=sharing].

Thank you!
~Bridget

> Drill Plugins Handler
> -
>
> Key: DRILL-6494
> URL: https://issues.apache.org/jira/browse/DRILL-6494
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Tools, Build & Test
>Affects Versions: 1.13.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.14.0
>
> Attachments: storage-plugins.conf
>
>
> A new service for updating Drill's plugin configs could be implemented.
> Please find details in the design overview document:
> https://docs.google.com/document/d/14JKb2TA8dGnOIE5YT2RImkJ7R0IAYSGjJg8xItL5yMI/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530856#comment-16530856
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

vrozov commented on a change in pull request #1360: DRILL-6578: Handle query 
cancellation in Parquet reader
URL: https://github.com/apache/drill/pull/1360#discussion_r199688079
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenColumnBulkInput.java
 ##
 @@ -479,6 +483,17 @@ private void deinitOverflowData() {
 fieldOverflowStateContainer = null;
   }
 
+  private void checkCancellation() {
+// - The methods hasNext() & next() are usually invoked within a loop
+// - We need to ensure that bugs do not cause such loops to execute forever
+// - To do that, will check the interrupted status (e.g., cancellation) to 
break the infinite loop
+if (Thread.currentThread().isInterrupted()) {
 
 Review comment:
   It is not necessary to keep the flag after the exception is thrown. It would 
be better to throw InterruptedException outside of the Iterator.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530839#comment-16530839
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

vrozov commented on a change in pull request #1360: DRILL-6578: Handle query 
cancellation in Parquet reader
URL: https://github.com/apache/drill/pull/1360#discussion_r199686154
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenColumnBulkInput.java
 ##
 @@ -479,6 +483,17 @@ private void deinitOverflowData() {
 fieldOverflowStateContainer = null;
   }
 
+  private void checkCancellation() {
 
 Review comment:
   Outside of Parquet; it is not specific to the Parquet reader.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530840#comment-16530840
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

vrozov commented on a change in pull request #1360: DRILL-6578: Handle query 
cancellation in Parquet reader
URL: https://github.com/apache/drill/pull/1360#discussion_r199686189
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenColumnBulkInput.java
 ##
 @@ -109,6 +109,8 @@
   /** {@inheritDoc} */
   @Override
   public boolean hasNext() {
+checkCancellation(); // Checks whether query cancellation has been called
 
 Review comment:
   yes


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6579) Add sanity checks to Parquet Reader

2018-07-02 Thread salim achouche (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salim achouche updated DRILL-6579:
--
Reviewer: Boaz Ben-Zvi

> Add sanity checks to Parquet Reader 
> 
>
> Key: DRILL-6579
> URL: https://issues.apache.org/jira/browse/DRILL-6579
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> Add sanity checks to the Parquet reader to avoid infinite loops.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6579) Add sanity checks to Parquet Reader

2018-07-02 Thread salim achouche (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salim achouche updated DRILL-6579:
--
Labels: pull-request-available  (was: )

> Add sanity checks to Parquet Reader 
> 
>
> Key: DRILL-6579
> URL: https://issues.apache.org/jira/browse/DRILL-6579
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> Add sanity checks to the Parquet reader to avoid infinite loops.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6579) Add sanity checks to Parquet Reader

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530835#comment-16530835
 ] 

ASF GitHub Bot commented on DRILL-6579:
---

sachouche commented on issue #1361: DRILL-6579: Added sanity checks to the 
Parquet reader to avoid infini…
URL: https://github.com/apache/drill/pull/1361#issuecomment-402014849
 
 
   @Ben-Zvi 
   
   Can you please review this PR?
   
   Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add sanity checks to Parquet Reader 
> 
>
> Key: DRILL-6579
> URL: https://issues.apache.org/jira/browse/DRILL-6579
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>
> Add sanity checks to the Parquet reader to avoid infinite loops.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6579) Add sanity checks to Parquet Reader

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530833#comment-16530833
 ] 

ASF GitHub Bot commented on DRILL-6579:
---

sachouche opened a new pull request #1361: DRILL-6579: Added sanity checks to 
the Parquet reader to avoid infini…
URL: https://github.com/apache/drill/pull/1361
 
 
   …te loops
   
   Added sanity checks to avoid infinite loops.
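   As an illustration of what a loop sanity check of this kind typically looks 
like (hypothetical names; this is a sketch, not the actual patch):
   
{code:java}
// Illustrative sketch only (hypothetical names, not the DRILL-6579 patch).
// A guard like this bounds a reader loop so a bug cannot spin forever.
public final class LoopGuard {
  private final int maxIterations;
  private int count;

  public LoopGuard(int maxIterations) {
    this.maxIterations = maxIterations;
  }

  /** Call once per loop iteration; fails fast instead of hanging. */
  public void check() {
    if (++count > maxIterations) {
      throw new IllegalStateException(
          "Sanity check failed: exceeded " + maxIterations + " iterations");
    }
  }
}
{code}
   
   A reader loop would call check() on each iteration, so a decoding bug 
surfaces as an error rather than an infinite loop.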


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add sanity checks to Parquet Reader 
> 
>
> Key: DRILL-6579
> URL: https://issues.apache.org/jira/browse/DRILL-6579
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>
> Add sanity checks to the Parquet reader to avoid infinite loops.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6579) Add sanity checks to Parquet Reader

2018-07-02 Thread salim achouche (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salim achouche updated DRILL-6579:
--
Summary: Add sanity checks to Parquet Reader   (was: Sanity checks to avoid 
infinite loops)

> Add sanity checks to Parquet Reader 
> 
>
> Key: DRILL-6579
> URL: https://issues.apache.org/jira/browse/DRILL-6579
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>
> Add sanity checks to the Parquet reader to avoid infinite loops.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6579) Sanity checks to avoid infinite loops

2018-07-02 Thread salim achouche (JIRA)
salim achouche created DRILL-6579:
-

 Summary: Sanity checks to avoid infinite loops
 Key: DRILL-6579
 URL: https://issues.apache.org/jira/browse/DRILL-6579
 Project: Apache Drill
  Issue Type: Improvement
Reporter: salim achouche
Assignee: salim achouche


Add sanity checks to the Parquet reader to avoid infinite loops.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530796#comment-16530796
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

sachouche commented on a change in pull request #1360: DRILL-6578: Handle query 
cancellation in Parquet reader
URL: https://github.com/apache/drill/pull/1360#discussion_r199678649
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenColumnBulkInput.java
 ##
 @@ -479,6 +483,17 @@ private void deinitOverflowData() {
 fieldOverflowStateContainer = null;
   }
 
+  private void checkCancellation() {
+// - The methods hasNext() & next() are usually invoked within a loop
+// - We need to ensure that bugs do not cause such loops to execute forever
+// - To do that, will check the interrupted status (e.g., cancellation) to 
break the infinite loop
+if (Thread.currentThread().isInterrupted()) {
 
 Review comment:
   @vrozov,
   - Thread.interrupted() will clear the thread's interrupted status; I want to 
propagate the interruption status until the higher-level framework code 
clears it to state "ok; received the cancellation request and am now ready to 
handle other requests"
   - Are you suggesting that it is not necessary to continue setting this 
flag since we are throwing an exception?
   
   Please clarify!
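   
   For reference, a minimal standalone demonstration of the semantic difference 
under discussion (not Drill code): Thread.interrupted() reports and *clears* the 
calling thread's interrupt status, while isInterrupted() merely observes it.
   
{code:java}
// Standalone demo of Thread.interrupted() vs. isInterrupted() semantics.
public class InterruptDemo {
  public static void main(String[] args) {
    Thread.currentThread().interrupt();  // set the interrupt flag

    // isInterrupted() only observes the flag; it stays set afterwards.
    System.out.println(Thread.currentThread().isInterrupted()); // true
    System.out.println(Thread.currentThread().isInterrupted()); // still true

    // Thread.interrupted() reports AND clears the flag.
    System.out.println(Thread.interrupted()); // true
    System.out.println(Thread.interrupted()); // false - flag was cleared
  }
}
{code}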


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530771#comment-16530771
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

sachouche commented on a change in pull request #1360: DRILL-6578: Handle query 
cancellation in Parquet reader
URL: https://github.com/apache/drill/pull/1360#discussion_r199674082
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenColumnBulkInput.java
 ##
 @@ -479,6 +483,17 @@ private void deinitOverflowData() {
 fieldOverflowStateContainer = null;
   }
 
+  private void checkCancellation() {
 
 Review comment:
   within Parquet or under exec/util?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530769#comment-16530769
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

sachouche commented on a change in pull request #1360: DRILL-6578: Handle query 
cancellation in Parquet reader
URL: https://github.com/apache/drill/pull/1360#discussion_r199673902
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenColumnBulkInput.java
 ##
 @@ -109,6 +109,8 @@
   /** {@inheritDoc} */
   @Override
   public boolean hasNext() {
+checkCancellation(); // Checks whether query cancellation has been called
 
 Review comment:
   do you mean rename the method to checkInterrupted?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530764#comment-16530764
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

vrozov commented on a change in pull request #1360: DRILL-6578: Handle query 
cancellation in Parquet reader
URL: https://github.com/apache/drill/pull/1360#discussion_r199673424
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenColumnBulkInput.java
 ##
 @@ -479,6 +483,17 @@ private void deinitOverflowData() {
 fieldOverflowStateContainer = null;
   }
 
+  private void checkCancellation() {
+// - The methods hasNext() & next() are usually invoked within a loop
+// - We need to ensure that bugs do not cause such loops to execute forever
+// - To do that, will check the interrupted status (e.g., cancellation) to 
break the infinite loop
+if (Thread.currentThread().isInterrupted()) {
 
 Review comment:
   use `Thread.interrupted()`


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530765#comment-16530765
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

vrozov commented on a change in pull request #1360: DRILL-6578: Handle query 
cancellation in Parquet reader
URL: https://github.com/apache/drill/pull/1360#discussion_r199673384
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenColumnBulkInput.java
 ##
 @@ -479,6 +483,17 @@ private void deinitOverflowData() {
 fieldOverflowStateContainer = null;
   }
 
+  private void checkCancellation() {
 
 Review comment:
   move to a common utility class. Make it static.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530766#comment-16530766
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

vrozov commented on a change in pull request #1360: DRILL-6578: Handle query 
cancellation in Parquet reader
URL: https://github.com/apache/drill/pull/1360#discussion_r199673288
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenColumnBulkInput.java
 ##
 @@ -109,6 +109,8 @@
   /** {@inheritDoc} */
   @Override
   public boolean hasNext() {
+checkCancellation(); // Checks whether query cancellation has been called
 
 Review comment:
   consider `checkInterrupted`


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6577) Change Hash-Join default to not fallback (into pre-1.14 unlimited memory)

2018-07-02 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6577:

Labels: ready-to-commit  (was: )

> Change Hash-Join default to not fallback (into pre-1.14 unlimited memory)
> -
>
> Key: DRILL-6577
> URL: https://issues.apache.org/jira/browse/DRILL-6577
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.13.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Change the default for `drill.exec.hashjoin.fallback.enabled` to *false* 
> (same as for the similar Hash-Agg option). This would force users to 
> calculate and assign sufficient memory for the query, or explicitly choose to 
> fallback.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-6569) Jenkins Regression: TPCDS query 19 fails with INTERNAL_ERROR ERROR: Can not read value at 2 in block 0 in file maprfs:///drill/testdata/tpcds_sf100/parquet/store_sales/1

2018-07-02 Thread salim achouche (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salim achouche reassigned DRILL-6569:
-

Assignee: Robert Hou  (was: salim achouche)

 

 

> Jenkins Regression: TPCDS query 19 fails with INTERNAL_ERROR ERROR: Can not 
> read value at 2 in block 0 in file 
> maprfs:///drill/testdata/tpcds_sf100/parquet/store_sales/1_13_1.parquet
> --
>
> Key: DRILL-6569
> URL: https://issues.apache.org/jira/browse/DRILL-6569
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Robert Hou
>Assignee: Robert Hou
>Priority: Critical
> Fix For: 1.14.0
>
>
> This is TPCDS Query 19.
> I am able to scan the parquet file using:
>select * from 
> dfs.`/drill/testdata/tpcds_sf100/parquet/store_sales/1_13_1.parquet`
> and I get 3,349,279 rows selected.
> Query: 
> /root/drillAutomation/framework-master/framework/resources/Advanced/tpcds/tpcds_sf100/hive/parquet/query19.sql
> SELECT i_brand_id  brand_id,
> i_brand brand,
> i_manufact_id,
> i_manufact,
> Sum(ss_ext_sales_price) ext_price
> FROM   date_dim,
> store_sales,
> item,
> customer,
> customer_address,
> store
> WHERE  d_date_sk = ss_sold_date_sk
> AND ss_item_sk = i_item_sk
> AND i_manager_id = 38
> AND d_moy = 12
> AND d_year = 1998
> AND ss_customer_sk = c_customer_sk
> AND c_current_addr_sk = ca_address_sk
> AND Substr(ca_zip, 1, 5) <> Substr(s_zip, 1, 5)
> AND ss_store_sk = s_store_sk
> GROUP  BY i_brand,
> i_brand_id,
> i_manufact_id,
> i_manufact
> ORDER  BY ext_price DESC,
> i_brand,
> i_brand_id,
> i_manufact_id,
> i_manufact
> LIMIT 100;
> Here is the stack trace:
> 2018-06-29 07:00:32 INFO  DrillTestLogger:348 - 
> Exception:
> java.sql.SQLException: INTERNAL_ERROR ERROR: Can not read value at 2 in block 
> 0 in file 
> maprfs:///drill/testdata/tpcds_sf100/parquet/store_sales/1_13_1.parquet
> Fragment 4:26
> [Error Id: 6401a71e-7a5d-4a10-a17c-16873fc3239b on atsqa6c88.qa.lab:31010]
>   (hive.org.apache.parquet.io.ParquetDecodingException) Can not read value at 
> 2 in block 0 in file 
> maprfs:///drill/testdata/tpcds_sf100/parquet/store_sales/1_13_1.parquet
> 
> hive.org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue():243
> hive.org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue():227
> 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next():199
> 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next():57
> 
> org.apache.drill.exec.store.hive.readers.HiveAbstractReader.hasNextValue():417
> org.apache.drill.exec.store.hive.readers.HiveParquetReader.next():54
> org.apache.drill.exec.physical.impl.ScanBatch.next():172
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.sniffNonEmptyBatch():276
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.prefetchFirstBatchFromBothSides():238
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.buildSchema():218
> org.apache.drill.exec.record.AbstractRecordBatch.next():152
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.sniffNonEmptyBatch():276
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.prefetchFirstBatchFromBothSides():238
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.buildSchema():218
> org.apache.drill.exec.record.AbstractRecordBatch.next():152
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.sniffNonEmptyBatch():276
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.prefetchFirstBatchFromBothSides():238
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.buildSchema():218
> org.apache.drill.exec.record.AbstractRecordBatch.next():152
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.sniffNonEmptyBatch():276
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.prefetchFirstBatchFromBothSides():238
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.buildSchema():218
> org.apache.drill.exec.record.AbstractRecordBatch.next():152
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
> 
> 

[jira] [Commented] (DRILL-6569) Jenkins Regression: TPCDS query 19 fails with INTERNAL_ERROR ERROR: Can not read value at 2 in block 0 in file maprfs:///drill/testdata/tpcds_sf100/parquet/store_sales/

2018-07-02 Thread salim achouche (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530733#comment-16530733
 ] 

salim achouche commented on DRILL-6569:
---

[~rhou],

This is a Hive Parquet reader issue (not the native Drill Parquet reader).

 

> Jenkins Regression: TPCDS query 19 fails with INTERNAL_ERROR ERROR: Can not 
> read value at 2 in block 0 in file 
> maprfs:///drill/testdata/tpcds_sf100/parquet/store_sales/1_13_1.parquet
> --
>
> Key: DRILL-6569
> URL: https://issues.apache.org/jira/browse/DRILL-6569
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Robert Hou
>Assignee: salim achouche
>Priority: Critical
> Fix For: 1.14.0
>
>
> This is TPCDS Query 19.
> I am able to scan the parquet file using:
>select * from 
> dfs.`/drill/testdata/tpcds_sf100/parquet/store_sales/1_13_1.parquet`
> and I get 3,349,279 rows selected.
> Query: 
> /root/drillAutomation/framework-master/framework/resources/Advanced/tpcds/tpcds_sf100/hive/parquet/query19.sql
> SELECT i_brand_id  brand_id,
> i_brand brand,
> i_manufact_id,
> i_manufact,
> Sum(ss_ext_sales_price) ext_price
> FROM   date_dim,
> store_sales,
> item,
> customer,
> customer_address,
> store
> WHERE  d_date_sk = ss_sold_date_sk
> AND ss_item_sk = i_item_sk
> AND i_manager_id = 38
> AND d_moy = 12
> AND d_year = 1998
> AND ss_customer_sk = c_customer_sk
> AND c_current_addr_sk = ca_address_sk
> AND Substr(ca_zip, 1, 5) <> Substr(s_zip, 1, 5)
> AND ss_store_sk = s_store_sk
> GROUP  BY i_brand,
> i_brand_id,
> i_manufact_id,
> i_manufact
> ORDER  BY ext_price DESC,
> i_brand,
> i_brand_id,
> i_manufact_id,
> i_manufact
> LIMIT 100;
> Here is the stack trace:
> 2018-06-29 07:00:32 INFO  DrillTestLogger:348 - 
> Exception:
> java.sql.SQLException: INTERNAL_ERROR ERROR: Can not read value at 2 in block 
> 0 in file 
> maprfs:///drill/testdata/tpcds_sf100/parquet/store_sales/1_13_1.parquet
> Fragment 4:26
> [Error Id: 6401a71e-7a5d-4a10-a17c-16873fc3239b on atsqa6c88.qa.lab:31010]
>   (hive.org.apache.parquet.io.ParquetDecodingException) Can not read value at 
> 2 in block 0 in file 
> maprfs:///drill/testdata/tpcds_sf100/parquet/store_sales/1_13_1.parquet
> 
> hive.org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue():243
> hive.org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue():227
> 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next():199
> 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next():57
> 
> org.apache.drill.exec.store.hive.readers.HiveAbstractReader.hasNextValue():417
> org.apache.drill.exec.store.hive.readers.HiveParquetReader.next():54
> org.apache.drill.exec.physical.impl.ScanBatch.next():172
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.sniffNonEmptyBatch():276
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.prefetchFirstBatchFromBothSides():238
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.buildSchema():218
> org.apache.drill.exec.record.AbstractRecordBatch.next():152
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.sniffNonEmptyBatch():276
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.prefetchFirstBatchFromBothSides():238
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.buildSchema():218
> org.apache.drill.exec.record.AbstractRecordBatch.next():152
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.sniffNonEmptyBatch():276
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.prefetchFirstBatchFromBothSides():238
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.buildSchema():218
> org.apache.drill.exec.record.AbstractRecordBatch.next():152
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.sniffNonEmptyBatch():276
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.prefetchFirstBatchFromBothSides():238
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.buildSchema():218
> org.apache.drill.exec.record.AbstractRecordBatch.next():152
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> 

[jira] [Updated] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread salim achouche (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salim achouche updated DRILL-6578:
--
Labels: pull-request-available  (was: )

> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread salim achouche (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salim achouche updated DRILL-6578:
--
Reviewer: Vlad Rozov

> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530716#comment-16530716
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

sachouche commented on issue #1360: DRILL-6578: Handle query cancellation in 
Parquet reader
URL: https://github.com/apache/drill/pull/1360#issuecomment-401993011
 
 
   @vrozov, can you please review this fix?
   
   Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530714#comment-16530714
 ] 

ASF GitHub Bot commented on DRILL-6578:
---

sachouche opened a new pull request #1360: DRILL-6578: Handle query 
cancellation in Parquet reader
URL: https://github.com/apache/drill/pull/1360
 
 
   Goal -
   - The optimized Parquet reader uses an iterator style to load column data 
   - We need to ensure the code can properly handle query cancellation even in 
the presence of bugs within the hasNext() .. next() calls
   
   Fix Details -
   - Added a check within the hasNext() and next() to detect a thread interrupt
   - If this is the case, these methods will throw a runtime exception and keep 
the thread's interrupted state set
   - This will ensure that any blocking call will still get an InterruptedException
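   
   A minimal self-contained sketch of the approach described above (illustrative 
names; the actual change is in VarLenColumnBulkInput.java, and this is not the 
PR's exact code):
   
{code:java}
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of the described fix: fail fast on interruption while leaving the
// thread's interrupted status set for higher-level (blocking) code.
final class CancellableIterator implements Iterator<Integer> {

  private void checkInterrupted() {
    // hasNext()/next() are usually driven by a loop; a bug in that loop must
    // not be able to spin forever once the query has been cancelled.
    if (Thread.currentThread().isInterrupted()) {
      // isInterrupted() does NOT clear the flag, so the interrupted state
      // stays set and any later blocking call still gets InterruptedException.
      throw new RuntimeException("Query was interrupted (cancellation requested)");
    }
  }

  @Override
  public boolean hasNext() {
    checkInterrupted();  // fail fast on cancellation
    return false;        // placeholder: real column-batch logic goes here
  }

  @Override
  public Integer next() {
    checkInterrupted();  // same guard on next()
    throw new NoSuchElementException();
  }
}
{code}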


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ensure the Flat Parquet Reader can handle query cancellation
> 
>
> Key: DRILL-6578
> URL: https://issues.apache.org/jira/browse/DRILL-6578
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>
> * The optimized Parquet reader uses an iterator style to load column data 
>  * We need to ensure the code can properly handle query cancellation even in 
> the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6517) IllegalStateException: Record count not set for this vector container

2018-07-02 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530710#comment-16530710
 ] 

Boaz Ben-Zvi commented on DRILL-6517:
-

A relatively recent change to the Hash-Join made the HJ prefetch the first 
non-empty batches of both sides *early*, i.e., during schema build. This 
translated to next() calls into the operators below (i.e., upstream). In the 
above case, the selection vector remover (driven by such a next() call) was 
checking its incoming (first) batch, but the batch's producer (from the profile, 
likely the PARQUET_ROW_GROUP_SCAN) never applied setRecordCount(0) to that 
batch, hence the above exception. (This may be a special case, like an empty 
batch, etc.)
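
A simplified sketch of the invariant involved, per the stack trace in the issue 
below (Guava's Preconditions.checkState inside VectorContainer.getRecordCount()); 
illustrative only, not Drill's actual code:

{code:java}
import com.google.common.base.Preconditions;

// Simplified model of the invariant from the stack trace.
class VectorContainerSketch {
  private int recordCount = -1;  // -1 means "not set"

  int getRecordCount() {
    // Mirrors the Preconditions.checkState seen in the stack trace.
    Preconditions.checkState(recordCount != -1,
        "Record count not set for this vector container");
    return recordCount;
  }

  void setRecordCount(int count) {
    this.recordCount = count;
  }
}
{code}

The fix implied by the analysis: the batch producer must call setRecordCount(0) 
even when it hands over an empty batch, so a downstream next() (e.g., from the 
selection vector remover) does not trip this check.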





> IllegalStateException: Record count not set for this vector container
> -
>
> Key: DRILL-6517
> URL: https://issues.apache.org/jira/browse/DRILL-6517
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.14.0
>Reporter: Khurram Faraaz
>Assignee: salim achouche
>Priority: Critical
> Attachments: 24d7b377-7589-7928-f34f-57d02061acef.sys.drill
>
>
> A TPC-DS query was canceled after 2 hrs and 47 mins, and we see an 
> IllegalStateException: Record count not set for this vector container, in 
> drillbit.log
> Steps to reproduce the problem, query profile 
> (24d7b377-7589-7928-f34f-57d02061acef) is attached here.
> {noformat}
> In drill-env.sh set max direct memory to 12G on all 4 nodes in cluster
> export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"12G"}
> and set these options from sqlline,
> alter system set `planner.memory.max_query_memory_per_node` = 10737418240;
> alter system set `drill.exec.hashagg.fallback.enabled` = true;
> To run the query (replace IP-ADDRESS with your foreman node's IP address)
> cd /opt/mapr/drill/drill-1.14.0/bin
> ./sqlline -u 
> "jdbc:drill:schema=dfs.tpcds_sf1_parquet_views;drillbit=<IP-ADDRESS>" -f 
> /root/query72.sql
> {noformat}
> Stack trace from drillbit.log
> {noformat}
> 2018-06-18 20:08:51,912 [24d7b377-7589-7928-f34f-57d02061acef:frag:4:49] 
> ERROR o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: 
> IllegalStateException: Record count not set for this vector container
> Fragment 4:49
> [Error Id: 73177a1c-f7aa-4c9e-99e1-d6e1280e3f27 on qa102-45.qa.lab:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IllegalStateException: Record count not set for this vector container
> Fragment 4:49
> [Error Id: 73177a1c-f7aa-4c9e-99e1-d6e1280e3f27 on qa102-45.qa.lab:31010]
>  at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633)
>  ~[drill-common-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:361)
>  [drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:216)
>  [drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:327)
>  [drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_161]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_161]
>  at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
> Caused by: java.lang.IllegalStateException: Record count not set for this 
> vector container
>  at com.google.common.base.Preconditions.checkState(Preconditions.java:173) 
> ~[guava-18.0.jar:na]
>  at 
> org.apache.drill.exec.record.VectorContainer.getRecordCount(VectorContainer.java:394)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.getRecordCount(RemovingRecordBatch.java:49)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> org.apache.drill.exec.record.RecordBatchSizer.<init>(RecordBatchSizer.java:690)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> org.apache.drill.exec.record.RecordBatchSizer.<init>(RecordBatchSizer.java:662)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> org.apache.drill.exec.record.JoinBatchMemoryManager.update(JoinBatchMemoryManager.java:73)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> org.apache.drill.exec.record.JoinBatchMemoryManager.update(JoinBatchMemoryManager.java:79)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
>  at 
> 

[jira] [Created] (DRILL-6578) Ensure the Flat Parquet Reader can handle query cancellation

2018-07-02 Thread salim achouche (JIRA)
salim achouche created DRILL-6578:
-

 Summary: Ensure the Flat Parquet Reader can handle query 
cancellation
 Key: DRILL-6578
 URL: https://issues.apache.org/jira/browse/DRILL-6578
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Parquet
Reporter: salim achouche
Assignee: salim achouche


* The optimized Parquet reader uses an iterator style to load column data 
 * We need to ensure the code can properly handle query cancellation even in 
the presence of bugs within the hasNext() .. next() calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6577) Change Hash-Join default to not fallback (into pre-1.14 unlimited memory)

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530662#comment-16530662
 ] 

ASF GitHub Bot commented on DRILL-6577:
---

Ben-Zvi opened a new pull request #1359: DRILL-6577: Change Hash-Join fallback 
default to false
URL: https://github.com/apache/drill/pull/1359
 
 
   Option's default setting changed to *false*. 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Change Hash-Join default to not fallback (into pre-1.14 unlimited memory)
> -
>
> Key: DRILL-6577
> URL: https://issues.apache.org/jira/browse/DRILL-6577
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.13.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: 1.14.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Change the default for `drill.exec.hashjoin.fallback.enabled` to *false* 
> (same as for the similar Hash-Agg option). This would force users to 
> calculate and assign sufficient memory for the query, or explicitly choose to 
> fallback.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6543) Option for memory mgmt: Reserve allowance for non-buffered

2018-07-02 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6543:

Description: 
Introduce a new option to enforce/remind users to reserve some allowance when 
budgeting their memory:

The problem: When the "planner.memory.max_query_memory_per_node" (MQMPN) option 
is set equal (or "nearly equal") to the allocated *Direct Memory*, an OOM is 
still possible. The reason is that the memory used by the "non-buffered" 
operators is not taken into account.

For example, MQMPN == Direct-Memory == 100 MB. Run a query with 5 buffered 
operators (e.g., 5 instances of a Hash-Join), so each gets "promised" 20 MB. 
When other non-buffered operators (e.g., a Scanner, or a Sender) also grab some 
of the Direct Memory, then less than 100 MB is left available. And if all those 
5 Hash-Joins are pushing their limits, then one HJ may have only allocated 12MB 
so far, but on the next 1MB allocation it will hit an OOM (from the JVM, as all 
the 100MB Direct memory is already used).

A solution -- a new option to _*reserve*_ some of the Direct Memory for those 
non-buffered operators (e.g., a default of 25%). This *allowance* may prevent 
many cases like the example above. The new option would return an error (when 
a query initiates) if the MQMPN is set too high. Note that this option +cannot+ 
address concurrent queries.

This should also apply to the alternative to the MQMPN, the 
{{"planner.memory.percent_per_query"}} option (PPQ). The PPQ does not 
_*reserve*_ such memory (e.g., it can be set to 100%); only its documentation 
clearly explains this issue (that doc suggests reserving a 50% allowance, as it 
was written when the Hash-Join was non-buffered, i.e., before spill was 
implemented).

The memory given to the buffered operators is the higher of the values 
calculated from the MQMPN and the PPQ. The new reserve option would verify that 
this figure still leaves room for the allowance.
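
A minimal sketch of how such a startup-time reserve check might look, using the 
description's own numbers; all names are hypothetical, and this is not Drill's 
implementation:

{code:java}
// Hypothetical sketch of the proposed reserve check; not Drill code.
public final class MemoryReserveCheck {
  /**
   * Validates, at query start, that the memory promised to buffered operators
   * leaves the configured allowance for non-buffered operators (scanners,
   * senders, etc.).
   *
   * @param directMemory    total Direct Memory on the node, in bytes
   * @param mqmpnBudget     planner.memory.max_query_memory_per_node, in bytes
   * @param ppqBudget       budget derived from planner.memory.percent_per_query
   * @param reserveFraction e.g. 0.25 for the suggested 25% default
   */
  public static void validate(long directMemory, long mqmpnBudget,
                              long ppqBudget, double reserveFraction) {
    // The buffered operators get the higher of the two computed budgets.
    long bufferedBudget = Math.max(mqmpnBudget, ppqBudget);
    long allowed = (long) (directMemory * (1.0 - reserveFraction));
    if (bufferedBudget > allowed) {
      throw new IllegalStateException(String.format(
          "Buffered-operator budget %d exceeds %d (Direct Memory %d minus a "
              + "%.0f%% reserve for non-buffered operators)",
          bufferedBudget, allowed, directMemory, reserveFraction * 100));
    }
  }
}
{code}

With the example above (MQMPN == Direct Memory == 100 MB and a 25% reserve), 
the check would reject the configuration at query start, since the 100 MB 
buffered budget exceeds the 75 MB left after the reserve.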

 

  was:
Changes to options related to memory budgeting:

(1) Change the default for "drill.exec.hashjoin.fallback.enabled" to *false* 
(same as for the similar Hash-Agg option). This would force users to calculate 
and assign sufficient memory for the query, or explicitly choose to fallback.

(2) When the "planner.memory.max_query_memory_per_node" (MQMPN) option is set 
equal (or "nearly equal") to the allocated *Direct Memory*, an OOM is still 
possible. The reason is that the memory used by the "non-buffered" operators is 
not taken into account.

For example, MQMPN == Direct-Memory == 100 MB. Run a query with 5 buffered 
operators (e.g., 5 instances of a Hash-Join), so each gets "promised" 20 MB. 
When other non-buffered operators (e.g., a Scanner, or a Sender) also grab some 
of the Direct Memory, then less than 100 MB is left available. And if all those 
5 Hash-Joins are pushing their limits, then one HJ may have only allocated 12MB 
so far, but on the next 1MB allocation it will hit an OOM (from the JVM, as all 
the 100MB Direct memory is already used).

A solution -- a new option to _*reserve*_ some of the Direct Memory for those 
non-buffered operators (e.g., a default of 25%). This *allowance* may prevent 
many cases like the example above. The new option would return an error (when 
a query initiates) if the MQMPN is set too high. Note that this option +cannot+ 
address concurrent queries.

This should also apply to the alternative to the MQMPN, the 
{{"planner.memory.percent_per_query"}} option (PPQ). The PPQ does not 
_*reserve*_ such memory (e.g., it can be set to 100%); only its documentation 
clearly explains this issue (that doc suggests reserving a 50% allowance, as it 
was written when the Hash-Join was non-buffered, i.e., before spill was 
implemented).

The memory given to the buffered operators is the higher of the values 
calculated from the MQMPN and the PPQ. The new reserve option would verify that 
this figure still leaves room for the allowance.

 

Summary: Option for memory mgmt: Reserve allowance for non-buffered  
(was: Options for memory mgmt: Reserve allowance for non-buffered, and 
Hash-Join default to not fallback   )

> Option for memory mgmt: Reserve allowance for non-buffered
> --
>
> Key: DRILL-6543
> URL: https://issues.apache.org/jira/browse/DRILL-6543
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.13.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.15.0
>
>
> Introduce a new option to enforce/remind users to reserve some allowance when 
> budgeting their memory:
> The problem: When the "planner.memory.max_query_memory_per_node" (MQMPN) 
> option is set equal (or "nearly equal") to the allocated *Direct Memory*, an 
> OOM is still 

[jira] [Created] (DRILL-6577) Change Hash-Join default to not fallback (into pre-1.14 unlimited memory)

2018-07-02 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-6577:
---

 Summary: Change Hash-Join default to not fallback (into pre-1.14 
unlimited memory)
 Key: DRILL-6577
 URL: https://issues.apache.org/jira/browse/DRILL-6577
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Relational Operators
Affects Versions: 1.13.0
Reporter: Boaz Ben-Zvi
Assignee: Boaz Ben-Zvi
 Fix For: 1.14.0


Change the default for `drill.exec.hashjoin.fallback.enabled` to *false* (same 
as for the similar Hash-Agg option). This would force users to calculate and 
assign sufficient memory for the query, or explicitly choose to fallback.
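
If this default flips to *false*, users who still want the pre-1.14 
unlimited-memory behavior would need to opt in explicitly, e.g. (assuming the 
same ALTER SYSTEM pattern used for the related Hash-Agg option elsewhere in 
this digest):

{noformat}
alter system set `drill.exec.hashjoin.fallback.enabled` = true;
{noformat}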




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6494) Drill Plugins Handler

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530625#comment-16530625
 ] 

ASF GitHub Bot commented on DRILL-6494:
---

vdiravka commented on issue #1345: DRILL-6494: Drill Plugins Handler
URL: https://github.com/apache/drill/pull/1345#issuecomment-401977745
 
 
   @sohami @arina-ielchiieva 
   The commit with the new BOOT option for controlling the file after its use 
has been added. It works as expected.
   The branch is rebased onto the current master.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill Plugins Handler
> -
>
> Key: DRILL-6494
> URL: https://issues.apache.org/jira/browse/DRILL-6494
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Tools, Build & Test
>Affects Versions: 1.13.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.14.0
>
> Attachments: storage-plugins.conf
>
>
> A new service for updating Drill's plugin configs could be implemented.
> Please find details in the design overview document:
> https://docs.google.com/document/d/14JKb2TA8dGnOIE5YT2RImkJ7R0IAYSGjJg8xItL5yMI/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6494) Drill Plugins Handler

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530622#comment-16530622
 ] 

ASF GitHub Bot commented on DRILL-6494:
---

vdiravka commented on a change in pull request #1345: DRILL-6494: Drill Plugins 
Handler
URL: https://github.com/apache/drill/pull/1345#discussion_r199358015
 
 

 ##
 File path: 
contrib/storage-kafka/src/main/resources/bootstrap-storage-plugins.json
 ##
 @@ -2,8 +2,8 @@
   "storage":{
 kafka : {
   type:"kafka",
-  enabled: false,
-  kafkaConsumerProps: {"bootstrap.servers":"localhost:9092", "group.id" : 
"drill-consumer"}
+  kafkaConsumerProps: {"bootstrap.servers":"localhost:9092", "group.id" : 
"drill-consumer"},
+  enabled: false
 
 Review comment:
   It looks like the Hive plugin is the only plugin with this order of 
properties. For all other plugins, the enabled status appears at the end of 
the config after deserialization.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill Plugins Handler
> -
>
> Key: DRILL-6494
> URL: https://issues.apache.org/jira/browse/DRILL-6494
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Tools, Build & Test
>Affects Versions: 1.13.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.14.0
>
> Attachments: storage-plugins.conf
>
>
> A new service for updating Drill's plugin configs could be implemented.
> Please find details in the design overview document:
> https://docs.google.com/document/d/14JKb2TA8dGnOIE5YT2RImkJ7R0IAYSGjJg8xItL5yMI/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6543) Options for memory mgmt: Reserve allowance for non-buffered, and Hash-Join default to not fallback

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6543:
-
Reviewer: Timothy Farkas

> Options for memory mgmt: Reserve allowance for non-buffered, and Hash-Join 
> default to not fallback   
> -
>
> Key: DRILL-6543
> URL: https://issues.apache.org/jira/browse/DRILL-6543
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.13.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.15.0
>
>
> Changes to options related to memory budgeting:
> (1) Change the default for "drill.exec.hashjoin.fallback.enabled" to *false* 
> (same as for the similar Hash-Agg option). This would force users to 
> calculate and assign sufficient memory for the query, or explicitly choose to 
> fallback.
> (2) When the "planner.memory.max_query_memory_per_node" (MQMPN) option is set 
> equal (or "nearly equal") to the allocated *Direct Memory*, an OOM is still 
> possible. The reason is that the memory used by the "non-buffered" operators 
> is not taken into account.
> For example, MQMPN == Direct-Memory == 100 MB. Run a query with 5 buffered 
> operators (e.g., 5 instances of a Hash-Join), so each gets "promised" 20 MB. 
> When other non-buffered operators (e.g., a Scanner, or a Sender) also grab 
> some of the Direct Memory, then less than 100 MB is left available. And if 
> all those 5 Hash-Joins are pushing their limits, then one HJ may have only 
> allocated 12MB so far, but on the next 1MB allocation it will hit an OOM 
> (from the JVM, as all the 100MB Direct memory is already used).
> A solution -- a new option to _*reserve*_ some of the Direct Memory for those 
> non-buffered operators (e.g., a default of 25%). This *allowance* may prevent 
> many cases like the example above. The new option would return an error 
> (when a query initiates) if the MQMPN is set too high. Note that this option 
> +cannot+ address concurrent queries.
> This should also apply to the alternative to the MQMPN, the 
> {{"planner.memory.percent_per_query"}} option (PPQ). The PPQ does not 
> _*reserve*_ such memory (e.g., it can be set to 100%); only its documentation 
> clearly explains this issue (that doc suggests reserving a 50% allowance, as 
> it was written when the Hash-Join was non-buffered; i.e., before spill was 
> implemented).
> The memory given to the buffered operators is the higher of the values 
> calculated from the MQMPN and the PPQ. The new reserve option would verify 
> that this figure leaves room for the allowance.
>  
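> A minimal sketch of the proposed startup check under the above assumptions 
> (the default percentage and the failure behavior are illustrative, not an 
> actual Drill API):
> {code:java}
> public class ReserveAllowanceCheck {
>   public static void main(String[] args) {
>     long directMemory = 100L * 1024 * 1024;  // allocated Direct Memory: 100 MB
>     long mqmpn        = 100L * 1024 * 1024;  // max_query_memory_per_node
>     double reservePct = 0.25;                // assumed default reserve: 25%
>
>     long allowance = (long) (directMemory * reservePct);  // kept for non-buffered operators
>     long available = directMemory - allowance;            // left for buffered operators
>
>     if (mqmpn > available) {
>       // Reject when the query initiates instead of risking a JVM OOM later
>       throw new IllegalStateException("MQMPN (" + mqmpn + " bytes) exceeds Direct Memory"
>           + " minus the non-buffered reserve (" + available + " bytes)");
>     }
>     // Otherwise, e.g. 5 buffered operators would each be promised mqmpn / 5
>     System.out.println("Per-operator budget: " + (mqmpn / 5) + " bytes");
>   }
> }
> {code}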



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6453) TPC-DS query 72 has regressed

2018-07-02 Thread Khurram Faraaz (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530592#comment-16530592
 ] 

Khurram Faraaz commented on DRILL-6453:
---

[~ben-zvi] [~priteshm] the Exception message is the same.

However, we still need to find out why the query takes so long (over 2 hours) 
to execute and then fails.

I will re-run the test on the latest Apache master to verify whether this is 
fixed.

> TPC-DS query 72 has regressed
> -
>
> Key: DRILL-6453
> URL: https://issues.apache.org/jira/browse/DRILL-6453
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.14.0
>Reporter: Khurram Faraaz
>Assignee: Boaz Ben-Zvi
>Priority: Blocker
> Fix For: 1.14.0
>
> Attachments: 24f75b18-014a-fb58-21d2-baeab5c3352c.sys.drill
>
>
> TPC-DS query 72 seems to have regressed; the query profile for the case where 
> it was canceled after 2 hours on Drill 1.14.0 is attached here.
> {noformat}
> On, Drill 1.14.0-SNAPSHOT 
> commit : 931b43e (TPC-DS query 72 executed successfully on this commit, took 
> around 55 seconds to execute)
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> TPC-DS query 72 executed successfully & took 47 seconds to complete execution.
> {noformat}
> {noformat}
> TPC-DS data in the below run has date values stored as DATE datatype and not 
> VARCHAR type
> On, Drill 1.14.0-SNAPSHOT
> commit : 82e1a12
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> and
> alter system set `exec.hashjoin.num_partitions` = 1;
> TPC-DS query 72 executed for 2 hrs and 11 mins and did not complete; I had to 
> cancel it by stopping the Foreman drillbit.
> As a result, several minor fragments are reported to be in 
> CANCELLATION_REQUESTED state in the UI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6553) Fix TopN for unnest operator

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6553:
-
Reviewer: Hanumath Rao Maduri  (was: Aman Sinha)

> Fix TopN for unnest operator
> 
>
> Key: DRILL-6553
> URL: https://issues.apache.org/jira/browse/DRILL-6553
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.14.0
>
>
> The plan for the query with unnest is chosen non-optimally:
> {code:sql}
> select customer.c_custkey, customer.c_name, t.o.o_orderkey,t.o.o_totalprice
> from dfs.`lateraljoin/multipleFiles` customer,
> unnest(customer.c_orders) t(o)
> order by customer.c_custkey, t.o.o_orderkey, t.o.o_totalprice
> limit 50
> {code}
> Plan:
> {noformat}
> 00-00Screen
> 00-01  ProjectAllowDup(c_custkey=[$0], c_name=[$1], EXPR$2=[$2], 
> EXPR$3=[$3])
> 00-02SelectionVectorRemover
> 00-03  Limit(fetch=[50])
> 00-04SelectionVectorRemover
> 00-05  Sort(sort0=[$0], sort1=[$2], sort2=[$3], dir0=[ASC], 
> dir1=[ASC], dir2=[ASC])
> 00-06Project(c_custkey=[$2], c_name=[$3], EXPR$2=[ITEM($4, 
> 'o_orderkey')], EXPR$3=[ITEM($4, 'o_totalprice')])
> 00-07  LateralJoin(correlation=[$cor0], joinType=[inner], 
> requiredColumns=[{1}])
> 00-09Project(T0¦¦**=[$0], c_orders=[$1], c_custkey=[$2], 
> c_name=[$3])
> 00-11  Scan(groupscan=[EasyGroupScan 
> [selectionRoot=file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles,
>  numFiles=2, columns=[`**`], 
> files=[file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles/cust_order_10_2.json,
>  
> file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles/cust_order_10_1.json]]])
> 00-08Project(c_orders0=[$0])
> 00-10  Unnest [srcOp=00-07] 
> {noformat}
> A similar query, but with flatten:
> {code:sql}
> select f.c_custkey, f.c_name, f.o.o_orderkey, f.o.o_totalprice from (select 
> c_custkey, c_name, flatten(c_orders) as o from 
> dfs.`lateraljoin/multipleFiles` customer) f order by f.c_custkey, 
> f.o.o_orderkey, f.o.o_totalprice limit 50
> {code}
> has plan:
> {noformat}
> 00-00Screen
> 00-01  Project(c_custkey=[$0], c_name=[$1], EXPR$2=[$2], EXPR$3=[$3])
> 00-02SelectionVectorRemover
> 00-03  Limit(fetch=[50])
> 00-04SelectionVectorRemover
> 00-05  TopN(limit=[50])
> 00-06Project(c_custkey=[$0], c_name=[$1], EXPR$2=[ITEM($2, 
> 'o_orderkey')], EXPR$3=[ITEM($2, 'o_totalprice')])
> 00-07  Flatten(flattenField=[$2])
> 00-08Project(c_custkey=[$0], c_name=[$1], o=[$2])
> 00-09  Scan(groupscan=[EasyGroupScan 
> [selectionRoot=file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles,
>  numFiles=2, columns=[`c_custkey`, `c_name`, `c_orders`], 
> files=[file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles/cust_order_10_2.json,
>  
> file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles/cust_order_10_1.json]]])
> {noformat}
> The main difference is that in the unnest case, the project wasn't pushed 
> into the scan and the Limit with Sort wasn't converted to TopN. 
> The first problem is tracked by DRILL-6545; this Jira aims to fix the 
> problem with TopN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6553) Fix TopN for unnest operator

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6553:
-
Reviewer: Aman Sinha  (was: Sorabh Hamirwasia)

> Fix TopN for unnest operator
> 
>
> Key: DRILL-6553
> URL: https://issues.apache.org/jira/browse/DRILL-6553
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.14.0
>
>
> The plan for the query with unnest is chosen non-optimally:
> {code:sql}
> select customer.c_custkey, customer.c_name, t.o.o_orderkey,t.o.o_totalprice
> from dfs.`lateraljoin/multipleFiles` customer,
> unnest(customer.c_orders) t(o)
> order by customer.c_custkey, t.o.o_orderkey, t.o.o_totalprice
> limit 50
> {code}
> Plan:
> {noformat}
> 00-00Screen
> 00-01  ProjectAllowDup(c_custkey=[$0], c_name=[$1], EXPR$2=[$2], 
> EXPR$3=[$3])
> 00-02SelectionVectorRemover
> 00-03  Limit(fetch=[50])
> 00-04SelectionVectorRemover
> 00-05  Sort(sort0=[$0], sort1=[$2], sort2=[$3], dir0=[ASC], 
> dir1=[ASC], dir2=[ASC])
> 00-06Project(c_custkey=[$2], c_name=[$3], EXPR$2=[ITEM($4, 
> 'o_orderkey')], EXPR$3=[ITEM($4, 'o_totalprice')])
> 00-07  LateralJoin(correlation=[$cor0], joinType=[inner], 
> requiredColumns=[{1}])
> 00-09Project(T0¦¦**=[$0], c_orders=[$1], c_custkey=[$2], 
> c_name=[$3])
> 00-11  Scan(groupscan=[EasyGroupScan 
> [selectionRoot=file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles,
>  numFiles=2, columns=[`**`], 
> files=[file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles/cust_order_10_2.json,
>  
> file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles/cust_order_10_1.json]]])
> 00-08Project(c_orders0=[$0])
> 00-10  Unnest [srcOp=00-07] 
> {noformat}
> A similar query, but with flatten:
> {code:sql}
> select f.c_custkey, f.c_name, f.o.o_orderkey, f.o.o_totalprice from (select 
> c_custkey, c_name, flatten(c_orders) as o from 
> dfs.`lateraljoin/multipleFiles` customer) f order by f.c_custkey, 
> f.o.o_orderkey, f.o.o_totalprice limit 50
> {code}
> has plan:
> {noformat}
> 00-00Screen
> 00-01  Project(c_custkey=[$0], c_name=[$1], EXPR$2=[$2], EXPR$3=[$3])
> 00-02SelectionVectorRemover
> 00-03  Limit(fetch=[50])
> 00-04SelectionVectorRemover
> 00-05  TopN(limit=[50])
> 00-06Project(c_custkey=[$0], c_name=[$1], EXPR$2=[ITEM($2, 
> 'o_orderkey')], EXPR$3=[ITEM($2, 'o_totalprice')])
> 00-07  Flatten(flattenField=[$2])
> 00-08Project(c_custkey=[$0], c_name=[$1], o=[$2])
> 00-09  Scan(groupscan=[EasyGroupScan 
> [selectionRoot=file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles,
>  numFiles=2, columns=[`c_custkey`, `c_name`, `c_orders`], 
> files=[file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles/cust_order_10_2.json,
>  
> file:/home/mapr/drill/exec/java-exec/target/org.apache.drill.exec.physical.impl.lateraljoin.TestE2EUnnestAndLateral/root/lateraljoin/multipleFiles/cust_order_10_1.json]]])
> {noformat}
> The main difference is that in the unnest case, the project wasn't pushed 
> into the scan and the Limit with Sort wasn't converted to TopN. 
> The first problem is tracked by DRILL-6545; this Jira aims to fix the 
> problem with TopN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6529) Project Batch Sizing causes two LargeFileCompilation tests to timeout

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6529:
-
Labels: ready-to-commit  (was: )

> Project Batch Sizing causes two LargeFileCompilation tests to timeout
> -
>
> Key: DRILL-6529
> URL: https://issues.apache.org/jira/browse/DRILL-6529
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Reporter: Karthikeyan Manivannan
>Assignee: Karthikeyan Manivannan
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> Timeout failures are seen in TestLargeFileCompilation testExternal_Sort and 
> testTop_N_Sort. These tests are stress tests for compilation where the 
> queries cover projections over 5000 columns and sort over 500 columns. These 
> tests pass if they are run stand-alone. Something triggers the timeouts when 
> the tests are run in parallel as part of a unit test run.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6453) TPC-DS query 72 has regressed

2018-07-02 Thread Pritesh Maker (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530575#comment-16530575
 ] 

Pritesh Maker commented on DRILL-6453:
--

[~khfaraaz] is this exception the same as DRILL-6517?

> TPC-DS query 72 has regressed
> -
>
> Key: DRILL-6453
> URL: https://issues.apache.org/jira/browse/DRILL-6453
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.14.0
>Reporter: Khurram Faraaz
>Assignee: Boaz Ben-Zvi
>Priority: Blocker
> Fix For: 1.14.0
>
> Attachments: 24f75b18-014a-fb58-21d2-baeab5c3352c.sys.drill
>
>
> TPC-DS query 72 seems to have regressed; the query profile for the case where 
> it was canceled after 2 hours on Drill 1.14.0 is attached here.
> {noformat}
> On, Drill 1.14.0-SNAPSHOT 
> commit : 931b43e (TPC-DS query 72 executed successfully on this commit, took 
> around 55 seconds to execute)
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> TPC-DS query 72 executed successfully & took 47 seconds to complete execution.
> {noformat}
> {noformat}
> TPC-DS data in the below run has date values stored as DATE datatype and not 
> VARCHAR type
> On, Drill 1.14.0-SNAPSHOT
> commit : 82e1a12
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> and
> alter system set `exec.hashjoin.num_partitions` = 1;
> TPC-DS query 72 executed for 2 hrs and 11 mins and did not complete; I had to 
> cancel it by stopping the Foreman drillbit.
> As a result, several minor fragments are reported to be in 
> CANCELLATION_REQUESTED state in the UI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6410) Memory leak in Parquet Reader during cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530571#comment-16530571
 ] 

ASF GitHub Bot commented on DRILL-6410:
---

vrozov edited a comment on issue #1333: DRILL-6410: Memory leak in Parquet 
Reader during cancellation
URL: https://github.com/apache/drill/pull/1333#issuecomment-401862300
 
 
   @ilooner I guess by "last discussion" you refer to the discussion between 
you, me and @sachouche, where "majority" does not mean the community majority. 
In Apache, any contributor can provide a solution that (s)he considers to be 
the best solution possible, and then it can either be accepted by the 
community/contributor or blocked with -1 (which requires technical 
justification). If another contributor provides an alternative solution, the 
community may decide to go with the alternate solution as long as it addresses 
the technical concerns of the initial contribution. For this particular case, 
my requirements are a) a unified approach (@parthchandra has the same 
requirement) and b) the ability to cancel tasks asynchronously. If that can be 
done with the approach outlined in PR #1257 and a contributor changes it to 
address all the issues, let's move forward with the alternate approach.
   
   A note regarding the complexity of the implementation. This implementation 
uses public Java concurrency classes as well. It does not rely on unsupported 
or unsafe-to-use Java classes and/or APIs. Basically, `LockSupport` is the same 
kind of first-class concurrency construct as the `Thread` or `CountDownLatch` 
classes. The primary use case for these constructs is to create a combination 
of an `ExecutorService` and a `CountDownLatch` that is not provided by Java 
itself.
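   As a rough illustration (class and method names here are assumed for the 
sketch, not the code in either PR), such a combination can look like this:

   ```java
   import java.util.concurrent.CountDownLatch;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;

   // Sketch: submit a group of tasks and hand back a latch that opens once all
   // of them have finished, even if some were interrupted (cancelled) midway.
   public class AwaitableExecutor {
     private final ExecutorService pool = Executors.newFixedThreadPool(4);

     public CountDownLatch submitAll(Runnable... tasks) {
       CountDownLatch done = new CountDownLatch(tasks.length);
       for (Runnable task : tasks) {
         pool.submit(() -> {
           try {
             task.run();
           } finally {
             done.countDown();  // counts down even when the task was interrupted
           }
         });
       }
       return done;  // callers block on await() until the whole group completes
     }
   }
   ```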
   
   To summarize, I am perfectly fine going with an alternate solution or with 
another committer reviewing the PR; it would be against the Apache way to force 
a committer to review or commit a change that (s)he is not comfortable with.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Memory leak in Parquet Reader during cancellation
> -
>
> Key: DRILL-6410
> URL: https://issues.apache.org/jira/browse/DRILL-6410
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: Vlad Rozov
>Priority: Major
> Fix For: 1.14.0
>
>
> Occasionally, a memory leak is observed within the flat Parquet reader when 
> query cancellation is invoked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6561) Lateral excluding the columns from output container provided by projection push into rules

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6561:
-
Labels: ready-to-commit  (was: )

> Lateral excluding the columns from output container provided by projection 
> push into rules
> --
>
> Key: DRILL-6561
> URL: https://issues.apache.org/jira/browse/DRILL-6561
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> With DRILL-6545, LateralPop will have information about the list of columns 
> to be excluded from the Lateral output container. Mostly this is used to 
> avoid producing the original repeated column in the Lateral output if it's 
> not required by the projection list. This is needed because, in its absence, 
> Lateral has to copy the repeated column N times, where N is the number of 
> rows in the right incoming batch, for each left incoming batch row. This copy 
> is very costly from both a memory and a latency perspective. Hence avoiding 
> it is a must for the Lateral-Unnest case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6453) TPC-DS query 72 has regressed

2018-07-02 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530543#comment-16530543
 ] 

Boaz Ben-Zvi commented on DRILL-6453:
-

Is there a stack trace with this Exception?


> TPC-DS query 72 has regressed
> -
>
> Key: DRILL-6453
> URL: https://issues.apache.org/jira/browse/DRILL-6453
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.14.0
>Reporter: Khurram Faraaz
>Assignee: Boaz Ben-Zvi
>Priority: Blocker
> Fix For: 1.14.0
>
> Attachments: 24f75b18-014a-fb58-21d2-baeab5c3352c.sys.drill
>
>
> TPC-DS query 72 seems to have regressed; the query profile for the case where 
> it was canceled after 2 hours on Drill 1.14.0 is attached here.
> {noformat}
> On, Drill 1.14.0-SNAPSHOT 
> commit : 931b43e (TPC-DS query 72 executed successfully on this commit, took 
> around 55 seconds to execute)
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> TPC-DS query 72 executed successfully & took 47 seconds to complete execution.
> {noformat}
> {noformat}
> TPC-DS data in the below run has date values stored as DATE datatype and not 
> VARCHAR type
> On, Drill 1.14.0-SNAPSHOT
> commit : 82e1a12
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> and
> alter system set `exec.hashjoin.num_partitions` = 1;
> TPC-DS query 72 executed for 2 hrs and 11 mins and did not complete; I had to 
> cancel it by stopping the Foreman drillbit.
> As a result, several minor fragments are reported to be in 
> CANCELLATION_REQUESTED state in the UI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6561) Lateral excluding the columns from output container provided by projection push into rules

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530541#comment-16530541
 ] 

ASF GitHub Bot commented on DRILL-6561:
---

parthchandra commented on issue #1356: DRILL-6561: Lateral excluding the 
columns from output container provided by projection push into rules
URL: https://github.com/apache/drill/pull/1356#issuecomment-401961420
 
 
   +1.
   I also took care of the rebase and merge.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Lateral excluding the columns from output container provided by projection 
> push into rules
> --
>
> Key: DRILL-6561
> URL: https://issues.apache.org/jira/browse/DRILL-6561
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
> Fix For: 1.14.0
>
>
> With DRILL-6545, LateralPop will have information about the list of columns 
> to be excluded from the Lateral output container. Mostly this is used to 
> avoid producing the original repeated column in the Lateral output if it's 
> not required by the projection list. This is needed because, in its absence, 
> Lateral has to copy the repeated column N times, where N is the number of 
> rows in the right incoming batch, for each left incoming batch row. This copy 
> is very costly from both a memory and a latency perspective. Hence avoiding 
> it is a must for the Lateral-Unnest case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6530) JVM crash with a query involving multiple json files with one file having a schema change of one column from string to list

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530535#comment-16530535
 ] 

ASF GitHub Bot commented on DRILL-6530:
---

parthchandra closed pull request #1343: DRILL-6530: JVM crash with a query 
involving multiple json files with…
URL: https://github.com/apache/drill/pull/1343
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/exec/vector/src/main/codegen/templates/ListWriters.java 
b/exec/vector/src/main/codegen/templates/ListWriters.java
index cab8772a741..4300857b9eb 100644
--- a/exec/vector/src/main/codegen/templates/ListWriters.java
+++ b/exec/vector/src/main/codegen/templates/ListWriters.java
@@ -107,11 +107,13 @@ public void setValueCount(int count){
   public MapWriter map() {
 switch (mode) {
 case INIT:
-  int vectorCount = container.size();
+  final ValueVector oldVector = container.getChild(name);
   final RepeatedMapVector vector = container.addOrGet(name, 
RepeatedMapVector.TYPE, RepeatedMapVector.class);
   innerVector = vector;
   writer = new RepeatedMapWriter(vector, this);
-  if(vectorCount != container.size()) {
+  // oldVector will be null if it's first batch being created and it might 
not be same as newly added vector
+  // if new batch has schema change
+  if (oldVector == null || oldVector != vector) {
 writer.allocate();
   }
   writer.setPosition(${index});
@@ -131,11 +133,13 @@ public MapWriter map() {
   public ListWriter list() {
 switch (mode) {
 case INIT:
-  final int vectorCount = container.size();
+  final ValueVector oldVector = container.getChild(name);
   final RepeatedListVector vector = container.addOrGet(name, 
RepeatedListVector.TYPE, RepeatedListVector.class);
   innerVector = vector;
   writer = new RepeatedListWriter(null, vector, this);
-  if (vectorCount != container.size()) {
+  // oldVector will be null if it's first batch being created and it might 
not be same as newly added vector
+  // if new batch has schema change
+  if (oldVector == null || oldVector != vector) {
 writer.allocate();
   }
   writer.setPosition(${index});
@@ -176,11 +180,13 @@ public ListWriter list() {
   
 switch (mode) {
 case INIT:
-  final int vectorCount = container.size();
+  final ValueVector oldVector = container.getChild(name);
   final Repeated${capName}Vector vector = container.addOrGet(name, 
${upperName}_TYPE, Repeated${capName}Vector.class);
   innerVector = vector;
   writer = new Repeated${capName}WriterImpl(vector, this);
-  if(vectorCount != container.size()) {
+  // oldVector will be null if it's first batch being created and it might 
not be same as newly added vector
+  // if new batch has schema change
+  if (oldVector == null || oldVector != vector) {
 writer.allocate();
   }
   writer.setPosition(${index});


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> JVM crash with a query involving multiple json files with one file having a 
> schema change of one column from string to list
> ---
>
> Key: DRILL-6530
> URL: https://issues.apache.org/jira/browse/DRILL-6530
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Data Types
>Affects Versions: 1.14.0
>Reporter: Kedar Sankar Behera
>Assignee: Sorabh Hamirwasia
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
> Attachments: 0_0_92.json, 0_0_93.json, drillbit.log, drillbit.out, 
> hs_err_pid32076.log
>
>
> JVM crash with a Lateral Unnest query involving multiple json files with one 
> file having a schema change of one column from string to list.
> Query:
> {code}
> SELECT customer.c_custkey,customer.c_acctbal,orders.o_orderkey, 
> orders.o_totalprice,orders.o_orderdate,orders.o_shippriority,customer.c_address,orders.o_orderpriority,customer.c_comment
> FROM customer, LATERAL 
> (SELECT O.ord.o_orderkey as o_orderkey, O.ord.o_totalprice as 
> o_totalprice,O.ord.o_orderdate as o_orderdate ,O.ord.o_shippriority as 
> o_shippriority,O.ord.o_orderpriority 
> as o_orderpriority FROM UNNEST(customer.c_orders) O(ord))orders;
> {code}
> The 

[jira] [Commented] (DRILL-6535) ClassCastException in Lateral Unnest queries when dealing with schema changed json data

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530536#comment-16530536
 ] 

ASF GitHub Bot commented on DRILL-6535:
---

parthchandra closed pull request #1339: DRILL-6535: ClassCastException in 
Lateral Unnest queries when dealing…
URL: https://github.com/apache/drill/pull/1339
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/LateralJoinBatch.java
 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/LateralJoinBatch.java
index 578cbc8742d..84dc5c344fc 100644
--- 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/LateralJoinBatch.java
+++ 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/LateralJoinBatch.java
@@ -433,6 +433,14 @@ private IterOutcome processRightBatch() {
   rightUpstream = next(RIGHT_INDEX, right);
   switch (rightUpstream) {
 case OK_NEW_SCHEMA:
+
+  // If there is some records in the output batch that means left 
batch didn't came with OK_NEW_SCHEMA,
+  // otherwise it would have been marked for processInFuture and 
output will be returned. This means for
+  // current non processed left or new left non-empty batch there is 
unexpected right batch schema change
+  if (outputIndex > 0) {
+throw new IllegalStateException("SchemaChange on right batch is 
not expected in between the rows of " +
+  "current left batch or a new non-empty left batch with no schema 
change");
+  }
   // We should not get OK_NEW_SCHEMA multiple times for the same left 
incoming batch. So there won't be a
   // case where we get OK_NEW_SCHEMA --> OK (with batch) ---> 
OK_NEW_SCHEMA --> OK/EMIT fall through
   //
@@ -548,6 +556,7 @@ private IterOutcome produceOutputBatch() {
 // Get both left batch and the right batch and make sure indexes 
are properly set
 leftUpstream = processLeftBatch();
 
+// output batch is not empty and we have new left batch with 
OK_NEW_SCHEMA or terminal outcome
 if (processLeftBatchInFuture) {
   logger.debug("Received left batch with outcome {} such that we 
have to return the current outgoing " +
 "batch and process the new batch in subsequent next call", 
leftUpstream);
@@ -564,7 +573,7 @@ private IterOutcome produceOutputBatch() {
 
 // If we have received the left batch with EMIT outcome and is 
empty then we should return previous output
 // batch with EMIT outcome
-if (leftUpstream == EMIT && left.getRecordCount() == 0) {
+if ((leftUpstream == EMIT || leftUpstream == OK_NEW_SCHEMA) && 
left.getRecordCount() == 0) {
   isLeftProcessed = true;
   break;
 }
@@ -579,10 +588,16 @@ private IterOutcome produceOutputBatch() {
 // left in outgoing batch so let's get next right batch.
 // 2) OR previous left & right batch was fully processed and it came 
with OK outcome. There is space in outgoing
 // batch. Now we have got new left batch with OK outcome. Let's get 
next right batch
-//
-// It will not hit OK_NEW_SCHEMA since left side have not seen that 
outcome
+// 3) OR previous left & right batch was fully processed and left came 
with OK outcome. Outgoing batch is
+// empty since all right batches were empty for all left rows. Now we 
got another non-empty left batch with
+// OK_NEW_SCHEMA.
 rightUpstream = processRightBatch();
-Preconditions.checkState(rightUpstream != OK_NEW_SCHEMA, "Unexpected 
schema change in right branch");
+if (rightUpstream == OK_NEW_SCHEMA) {
+  leftUpstream = (leftUpstream != EMIT) ? OK : leftUpstream;
+  rightUpstream = OK;
+  finalizeOutputContainer();
+  return OK_NEW_SCHEMA;
+}
 
 if (isTerminalOutcome(rightUpstream)) {
   finalizeOutputContainer();
@@ -591,6 +606,17 @@ private IterOutcome produceOutputBatch() {
 
 // Update the batch memory manager to use new right incoming batch
 updateMemoryManager(RIGHT_INDEX);
+
+// If OK_NEW_SCHEMA is seen only on non empty left batch but not on 
right batch, then we should setup schema in
+// output container based on new left schema and old right schema. If 
schema change failed then return STOP
+// downstream
+if (leftUpstream == OK_NEW_SCHEMA && isLeftProcessed) {
+  if (!handleSchemaChange()) {
+return STOP;
+  }
+  // Since schema has 

[jira] [Commented] (DRILL-6561) Lateral excluding the columns from output container provided by projection push into rules

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530537#comment-16530537
 ] 

ASF GitHub Bot commented on DRILL-6561:
---

parthchandra closed pull request #1356: DRILL-6561: Lateral excluding the 
columns from output container provided by projection push into rules
URL: https://github.com/apache/drill/pull/1356
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/LateralJoinPOP.java
 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/LateralJoinPOP.java
index a12fed1267e..55ede962826 100644
--- 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/LateralJoinPOP.java
+++ 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/LateralJoinPOP.java
@@ -23,6 +23,7 @@
 import com.fasterxml.jackson.annotation.JsonTypeName;
 import com.google.common.base.Preconditions;
 import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.drill.common.expression.SchemaPath;
 import org.apache.drill.exec.physical.base.AbstractJoinPop;
 import org.apache.drill.exec.physical.base.PhysicalOperator;
 import org.apache.drill.exec.physical.base.PhysicalVisitor;
@@ -34,6 +35,9 @@
 public class LateralJoinPOP extends AbstractJoinPop {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(LateralJoinPOP.class);
 
+  @JsonProperty("excludedColumns")
+  private List excludedColumns;
+
   @JsonProperty("unnestForLateralJoin")
   private UnnestPOP unnestForLateralJoin;
 
@@ -41,19 +45,21 @@
   public LateralJoinPOP(
   @JsonProperty("left") PhysicalOperator left,
   @JsonProperty("right") PhysicalOperator right,
-  @JsonProperty("joinType") JoinRelType joinType) {
+  @JsonProperty("joinType") JoinRelType joinType,
+  @JsonProperty("excludedColumns") List excludedColumns) {
 super(left, right, joinType, null, null);
 Preconditions.checkArgument(joinType != JoinRelType.FULL,
   "Full outer join is currently not supported with Lateral Join");
 Preconditions.checkArgument(joinType != JoinRelType.RIGHT,
   "Right join is currently not supported with Lateral Join");
+this.excludedColumns = excludedColumns;
   }
 
   @Override
   public PhysicalOperator getNewWithChildren(List children) {
 Preconditions.checkArgument(children.size() == 2,
   "Lateral join should have two physical operators");
-LateralJoinPOP newPOP =  new LateralJoinPOP(children.get(0), 
children.get(1), joinType);
+LateralJoinPOP newPOP =  new LateralJoinPOP(children.get(0), 
children.get(1), joinType, this.excludedColumns);
 newPOP.unnestForLateralJoin = this.unnestForLateralJoin;
 return newPOP;
   }
@@ -63,6 +69,11 @@ public UnnestPOP getUnnestForLateralJoin() {
 return this.unnestForLateralJoin;
   }
 
+  @JsonProperty("excludedColumns")
+  public List getExcludedColumns() {
+return this.excludedColumns;
+  }
+
   public void setUnnestForLateralJoin(UnnestPOP unnest) {
 this.unnestForLateralJoin = unnest;
   }
diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
index 428a47ebf33..63ac6ef90b8 100644
--- 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
+++ 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
@@ -19,6 +19,7 @@
 
 import java.io.IOException;
 import java.util.ArrayList;
+import java.util.HashSet;
 import java.util.List;
 import java.util.Set;
 
@@ -67,9 +68,6 @@
 import org.apache.drill.exec.vector.complex.AbstractContainerVector;
 import org.apache.calcite.rel.core.JoinRelType;
 
-import static org.apache.drill.exec.record.JoinBatchMemoryManager.LEFT_INDEX;
-import static org.apache.drill.exec.record.JoinBatchMemoryManager.RIGHT_INDEX;
-
 /**
  *   This class implements the runtime execution for the Hash-Join operator
  *   supporting INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER joins
@@ -887,7 +885,7 @@ public HashJoinBatch(HashJoinPOP popConfig, FragmentContext 
context,
 
 // get the output batch size from config.
 int configuredBatchSize = (int) 
context.getOptions().getOption(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR);
-batchMemoryManager = new JoinBatchMemoryManager(configuredBatchSize, left, 
right);
+batchMemoryManager = new JoinBatchMemoryManager(configuredBatchSize, left, 
right, new HashSet<>());
 logger.debug("BATCH_STATS, configured output batch size: {}", 
configuredBatchSize);
   }
 
diff --git 

[jira] [Commented] (DRILL-6346) Create an Official Drill Docker Container

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530529#comment-16530529
 ] 

ASF GitHub Bot commented on DRILL-6346:
---

priteshm commented on issue #1348: DRILL-6346: Create an Official Drill Docker 
Container
URL: https://github.com/apache/drill/pull/1348#issuecomment-401960422
 
 
   Since @arina-ielchiieva is the batch committer, I marked the JIRA as 
ready-to-commit, so she can review it as well.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Create an Official Drill Docker Container
> -
>
> Key: DRILL-6346
> URL: https://issues.apache.org/jira/browse/DRILL-6346
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Abhishek Girish
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6346) Create an Official Drill Docker Container

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6346:
-
Labels: doc-impacting ready-to-commit  (was: doc-impacting)

> Create an Official Drill Docker Container
> -
>
> Key: DRILL-6346
> URL: https://issues.apache.org/jira/browse/DRILL-6346
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Abhishek Girish
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6575) Add store.hive.conf.properties option to allow set Hive properties at session level

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6575:
-
Fix Version/s: (was: 1.14.0)
   1.15.0

> Add store.hive.conf.properties option to allow set Hive properties at session 
> level
> ---
>
> Key: DRILL-6575
> URL: https://issues.apache.org/jira/browse/DRILL-6575
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.15.0
>
>
> *Use case*
> Hive external table DDL:
> {noformat}
> create external table my(key int, val string)
> row format delimited
> fields terminated by ','
> stored as textfile
> location '/data/my_tbl';
> {noformat}
> Path {{/data/my_tbl}} contains a subdirectory with a file in it, 
> {{/data/my_tbl/sub_dir/data.txt}}, with the following data:
> {noformat}
> 1, value_1
> 2, value_2
> {noformat}
> When querying such table from Hive, user gets the following exception:
> {noformat}
> Failed with exception java.io.IOException:java.io.IOException: Not a file: 
> file:///data/my_tbl/sub_dir
> {noformat}
> To be able to query this table, the user needs to set two properties to true: 
> {{hive.mapred.supports.subdirectories}} and {{mapred.input.dir.recursive}}.
>  They can be set at system level in hive-site.xml or at session level in the 
> Hive console:
> {noformat}
> set hive.mapred.supports.subdirectories=true;
> set mapred.input.dir.recursive=true;
> {noformat}
> Currently, to be able to query such a table from Drill, the user can specify 
> these properties in the Hive plugin config only:
> {noformat}
> {
>   "type": "hive",
>   "configProps": {
> "hive.metastore.uris": "thrift://localhost:9083",
> "hive.metastore.sasl.enabled": "false",
> "hbase.zookeeper.quorum": "localhost",
> "hbase.zookeeper.property.clientPort": "5181",
> "hive.mapred.supports.subdirectories": "true",
> "mapred.input.dir.recursive": "true"
>   },
>   "enabled": true
> }
> {noformat}
> *Jira scope*
>  This Jira aims to add a new session option to Drill, 
> {{store.hive.conf.properties}}, which will allow the user to specify Hive 
> properties at session level. 
>  The user should write the properties in a single string, delimited by the 
> new line symbol. Property values should NOT be set in double quotes or any 
> other quotes, otherwise they will be parsed incorrectly. Key and value should 
> be separated by {{=}}. Each `alter session set` will override properties 
> previously set at session level. If Drill cannot parse the properties string 
> during a query, a warning will be logged. Properties will be parsed by 
> loading them into {{java.util.Properties}}. The default value is an empty 
> string ("").
> Example:
> {noformat}
> alter session set `store.hive.conf.properties` = 
> 'hive.mapred.supports.subdirectories=true\nmapred.input.dir.recursive=true';
> {noformat}
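> A minimal sketch of loading such a value into {{java.util.Properties}} 
> (illustrative only; the class and method names are assumptions, not Drill's 
> actual code):
> {code:java}
> import java.io.IOException;
> import java.io.StringReader;
> import java.util.Properties;
>
> public class HiveConfOptionParser {
>   // Parses a newline-delimited "key=value" string into Properties
>   public static Properties parse(String optionValue) {
>     Properties props = new Properties();
>     try {
>       props.load(new StringReader(optionValue));
>     } catch (IOException e) {
>       // Drill would log a warning here rather than fail the query
>       System.err.println("Could not parse Hive properties: " + e.getMessage());
>     }
>     return props;
>   }
>
>   public static void main(String[] args) {
>     Properties p = parse(
>         "hive.mapred.supports.subdirectories=true\nmapred.input.dir.recursive=true");
>     System.out.println(p);
>   }
> }
> {code}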



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6572) Add memory calculation of JPPD BloomFilter

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6572:
-
Fix Version/s: (was: 1.14.0)

> Add memory calculation of JPPD BloomFilter
> ---
>
> Key: DRILL-6572
> URL: https://issues.apache.org/jira/browse/DRILL-6572
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This is an enhancement of DRILL-6385 to include the memory of BloomFilter in 
> the HashJoin's memory calculation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6179) Added pcapng-format support

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6179:
-
Fix Version/s: (was: 1.14.0)

> Added pcapng-format support
> ---
>
> Key: DRILL-6179
> URL: https://issues.apache.org/jira/browse/DRILL-6179
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.13.0
>Reporter: Vlad
>Assignee: Vlad
>Priority: Major
>  Labels: doc-impacting
>
> The _PCAP Next Generation Dump File Format_ (or pcapng for short) [1] is an 
> attempt to overcome the limitations of the currently widely used (but 
> limited) libpcap format.
> At a first level, it is desirable to query and filter by source and 
> destination IP and port, and source/destination MAC addresses, or by protocol. Beyond 
> that, however, it would be very useful to be able to group packets by TCP 
> session and eventually to look at packet contents.
> Initial work is available at  
> https://github.com/mapr-demos/drill/tree/pcapng_dev
> [1] https://pcapng.github.io/pcapng/
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6563) TPCDS query 10 has regressed

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6563:
-
Fix Version/s: (was: 1.14.0)
   1.15.0

> TPCDS query 10 has regressed 
> -
>
> Key: DRILL-6563
> URL: https://issues.apache.org/jira/browse/DRILL-6563
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.14.0
>Reporter: Khurram Faraaz
>Assignee: Pritesh Maker
>Priority: Major
> Fix For: 1.15.0
>
> Attachments: 24ca3c6c-90e1-a4bf-6c6f-3f981fa2d043.sys.drill, 
> query10.fast_plan_old_commit, tpcds_query_10_plan_slow_140d09e.pdf, 
> tpcds_query_plan_10_140d09e.txt
>
>
> TPC-DS query 10 has regressed in performance, from taking 3.5 seconds to 
> execute on Apache Drill 1.14.0 commit b92f599, to 7 min 51.851 sec to 
> complete execution on Apache Drill 1.14.0 commit 140d09e. The query was 
> executed over SF1 parquet views on a 4-node cluster.
> Query plans from the old and the newer commit are attached here, with the 
> query profile from the newer commit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6576) Unnest reports incoming record counts incorrectly

2018-07-02 Thread Parth Chandra (JIRA)
Parth Chandra created DRILL-6576:


 Summary: Unnest reports incoming record counts incorrectly
 Key: DRILL-6576
 URL: https://issues.apache.org/jira/browse/DRILL-6576
 Project: Apache Drill
  Issue Type: Bug
Reporter: Parth Chandra
Assignee: Parth Chandra






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6516) Support for EMIT outcome in streaming agg

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530486#comment-16530486
 ] 

ASF GitHub Bot commented on DRILL-6516:
---

parthchandra opened a new pull request #1358:  DRILL-6516: EMIT support in 
streaming agg
URL: https://github.com/apache/drill/pull/1358
 
 
   Support for EMIT in the streaming aggregator. 
   Also includes a fix from @sohami in the external sort memory management 
(since streaming agg requires sort to hold on to memory until streaming agg is 
done).
   
   @Ben-Zvi, @sohami please review


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support for EMIT outcome in streaming agg
> -
>
> Key: DRILL-6516
> URL: https://issues.apache.org/jira/browse/DRILL-6516
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Parth Chandra
>Assignee: Parth Chandra
>Priority: Major
> Fix For: 1.14.0
>
>
> Update the streaming aggregator to recognize the EMIT outcome



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6535) ClassCastException in Lateral Unnest queries when dealing with schema changed json data

2018-07-02 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530464#comment-16530464
 ] 

Boaz Ben-Zvi commented on DRILL-6535:
-

https://github.com/apache/drill/pull/1339


> ClassCastException in Lateral Unnest queries when dealing with schema changed 
> json data
> ---
>
> Key: DRILL-6535
> URL: https://issues.apache.org/jira/browse/DRILL-6535
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> Found by [~kedarbcs16]
> ClassCastException in Lateral Unnest queries when dealing with schema changed 
> json data
> {code:java}
> Query : SELECT  customer.c_custkey,customer.c_acctbal,orders.o_orderkey, 
> orders.o_totalprice,orders.o_orderdate FROM customer, LATERAL 
> (SELECT O.ord.o_orderkey as o_orderkey, O.ord.o_totalprice as 
> o_totalprice,O.ord.o_orderdate as o_orderdate  FROM UNNEST(customer.c_orders) 
> O(ord) WHERE year(O.ord.o_orderdate) <> 1998)orders;
> {code}
> The data is sf001 complex data in multi-json format, partitioned based on the 
> year of the o_orderdate column.
>  The last json file (for year 1998) has 2 schema changes, for c_acctbal and 
> o_shippriority.
>  The logs are:
> {code:java}
> [Error Id: 6df4ceae-c989-4592-aeec-6d30b626f0ab on drill182:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: org.apache.drill.exec.vector.NullableFloat8Vector cannot 
> be cast to org.apache.drill.exec.vector.NullableBigIntVector
> Fragment 0:0
> [Error Id: 6df4ceae-c989-4592-aeec-6d30b626f0ab on drill182:31010]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633)
>  ~[drill-common-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:359)
>  [drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:214)
>  [drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:325)
>  [drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_161]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_161]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
> Caused by: java.lang.ClassCastException: 
> org.apache.drill.exec.vector.NullableFloat8Vector cannot be cast to 
> org.apache.drill.exec.vector.NullableBigIntVector
> at 
> org.apache.drill.exec.vector.NullableBigIntVector.copyEntry(NullableBigIntVector.java:396)
>  ~[vector-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.join.LateralJoinBatch.copyDataToOutputVectors(LateralJoinBatch.java:802)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.join.LateralJoinBatch.emitLeft(LateralJoinBatch.java:813)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.join.LateralJoinBatch.crossJoinAndOutputRecords(LateralJoinBatch.java:761)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.join.LateralJoinBatch.produceOutputBatch(LateralJoinBatch.java:479)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.join.LateralJoinBatch.innerNext(LateralJoinBatch.java:157)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:172)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:63)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> 

[jira] [Commented] (DRILL-6530) JVM crash with a query involving multiple json files with one file having a schema change of one column from string to list

2018-07-02 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530459#comment-16530459
 ] 

Boaz Ben-Zvi commented on DRILL-6530:
-

https://github.com/apache/drill/pull/1343


> JVM crash with a query involving multiple json files with one file having a 
> schema change of one column from string to list
> ---
>
> Key: DRILL-6530
> URL: https://issues.apache.org/jira/browse/DRILL-6530
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Data Types
>Affects Versions: 1.14.0
>Reporter: Kedar Sankar Behera
>Assignee: Sorabh Hamirwasia
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
> Attachments: 0_0_92.json, 0_0_93.json, drillbit.log, drillbit.out, 
> hs_err_pid32076.log
>
>
> JVM crash with a Lateral Unnest query involving multiple json files with one 
> file having a schema change of one column from string to list.
> Query:
> {code}
> SELECT customer.c_custkey,customer.c_acctbal,orders.o_orderkey, 
> orders.o_totalprice,orders.o_orderdate,orders.o_shippriority,customer.c_address,orders.o_orderpriority,customer.c_comment
> FROM customer, LATERAL 
> (SELECT O.ord.o_orderkey as o_orderkey, O.ord.o_totalprice as 
> o_totalprice,O.ord.o_orderdate as o_orderdate ,O.ord.o_shippriority as 
> o_shippriority,O.ord.o_orderpriority 
> as o_orderpriority FROM UNNEST(customer.c_orders) O(ord))orders;
> {code}
> The error received was 
> {code}
> o.a.d.e.p.impl.join.LateralJoinBatch - Output batch still has some space 
> left, getting new batches from left and right
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_custkey
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_phone
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_acctbal
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_orders
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_mktsegment
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_address
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_nationkey
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_name
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_comment
> 2018-06-21 15:25:16,316 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.e.v.c.AbstractContainerVector - Field [o_comment] mutated from 
> [NullableVarCharVector] to [RepeatedVarCharVector]
> 2018-06-21 15:25:16,318 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.drill.exec.vector.UInt4Vector - Reallocating vector [[`$offsets$` 
> (UINT4:REQUIRED)]]. # of bytes: [16384] -> [32768]
> {code}
> On further investigation with [~shamirwasia], it was found that the crash 
> only happens when [o_comment] mutates from [NullableVarCharVector] to 
> [RepeatedVarCharVector], not the other way around.
> Please find the logs, stack trace, and the data file attached.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6560) Allow options for controlling the batch size per operator

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530452#comment-16530452
 ] 

ASF GitHub Bot commented on DRILL-6560:
---

priteshm commented on issue #1355: DRILL-6560: Enhanced the batch statistics 
logging enablement
URL: https://github.com/apache/drill/pull/1355#issuecomment-401940066
 
 
   @bitblender did you get a chance to review this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow options for controlling the batch size per operator
> -
>
> Key: DRILL-6560
> URL: https://issues.apache.org/jira/browse/DRILL-6560
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Flow
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.14.0
>
>
> This Jira is for internal Drill DEV use; the following capabilities are 
> needed for automating testing of the batch sizing functionality:
>  * Control the enablement of batch sizing statistics at session (per query) 
> and server level (all queries)
>  * Control the granularity of batch sizing statistics (summary or verbose)
>  * Control the set of operators that should log batch statistics



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6410) Memory leak in Parquet Reader during cancellation

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6410:
-
Reviewer: Parth Chandra

> Memory leak in Parquet Reader during cancellation
> -
>
> Key: DRILL-6410
> URL: https://issues.apache.org/jira/browse/DRILL-6410
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: Vlad Rozov
>Priority: Major
> Fix For: 1.14.0
>
>
> Occasionally, a memory leak is observed within the flat Parquet reader when 
> query cancellation is invoked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6496) VectorUtil.showVectorAccessibleContent does not log vector content

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530412#comment-16530412
 ] 

ASF GitHub Bot commented on DRILL-6496:
---

ilooner commented on issue #1336: DRILL-6496: Added missing logging statement 
in VectorUtil.showVectorAccessibleContent(VectorAccessible va, int[] 
columnWidths)
URL: https://github.com/apache/drill/pull/1336#issuecomment-401925379
 
 
   @arina-ielchiieva We can already do that by skipping checkstyle during local 
development with `-Dcheckstyle.skip`


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> VectorUtil.showVectorAccessibleContent does not log vector content
> --
>
> Key: DRILL-6496
> URL: https://issues.apache.org/jira/browse/DRILL-6496
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Arina Ielchiieva
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.14.0
>
>
> {{VectorUtil.showVectorAccessibleContent(VectorAccessible va, int[] 
> columnWidths)}} does not log vector content. Introduced after DRILL-6438.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6410) Memory leak in Parquet Reader during cancellation

2018-07-02 Thread Timothy Farkas (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Farkas updated DRILL-6410:
--
Reviewer:   (was: Timothy Farkas)

> Memory leak in Parquet Reader during cancellation
> -
>
> Key: DRILL-6410
> URL: https://issues.apache.org/jira/browse/DRILL-6410
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: Vlad Rozov
>Priority: Major
> Fix For: 1.14.0
>
>
> Occasionally, a memory leak is observed within the flat Parquet reader when 
> query cancellation is invoked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6570) IndexOutOfBoundsException when using Flat Parquet Reader

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6570:
-
Labels: ready-to-commit  (was: pull-request-available)

> IndexOutOfBoundsException when using Flat Parquet  Reader
> -
>
> Key: DRILL-6570
> URL: https://issues.apache.org/jira/browse/DRILL-6570
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> * The Parquet Reader creates a reusable bulk entry based on the column 
> precision
>  * It uses the column precision for optimizing the intermediary heap buffers
>  * It first detected the column as fixed length but then reverted this 
> assumption when the column changed precision
>  * This step was fine, except that the bulk entry memory requirement changed while 
> the code did not update the bulk entry's intermediary buffers (a hypothetical 
> sketch of this follows below)
>  
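> A hypothetical sketch of the bug and fix (names are illustrative, not the actual 
> reader code): when the detected precision changes, the reusable bulk entry's 
> intermediary buffer must be re-allocated to the new memory requirement rather than 
> reused:
> {code:java}
> public class BulkEntrySketch {
>   private byte[] buffer = new byte[0];
>   private int precision = -1;
> 
>   void setPrecision(int newPrecision, int batchSize) {
>     if (newPrecision != precision) {
>       precision = newPrecision;
>       // The fix: resize the intermediary buffer whenever the memory
>       // requirement changes; the bug kept the old, smaller buffer.
>       buffer = new byte[newPrecision * batchSize];
>     }
>   }
> 
>   public static void main(String[] args) {
>     BulkEntrySketch entry = new BulkEntrySketch();
>     entry.setPrecision(8, 4096);  // column first detected as fixed length
>     entry.setPrecision(16, 4096); // precision changed: buffer must grow too
>     System.out.println(entry.buffer.length); // 65536
>   }
> }
> {code}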



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6530) JVM crash with a query involving multiple json files with one file having a schema change of one column from string to list

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6530:
-
Labels: ready-to-commit  (was: )

> JVM crash with a query involving multiple json files with one file having a 
> schema change of one column from string to list
> ---
>
> Key: DRILL-6530
> URL: https://issues.apache.org/jira/browse/DRILL-6530
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Data Types
>Affects Versions: 1.14.0
>Reporter: Kedar Sankar Behera
>Assignee: Sorabh Hamirwasia
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
> Attachments: 0_0_92.json, 0_0_93.json, drillbit.log, drillbit.out, 
> hs_err_pid32076.log
>
>
> JVM crash with a Lateral Unnest query involving multiple json files with one 
> file having a schema change of one column from string to list.
> Query :- 
> {code}
> SELECT customer.c_custkey,customer.c_acctbal,orders.o_orderkey, 
> orders.o_totalprice,orders.o_orderdate,orders.o_shippriority,customer.c_address,orders.o_orderpriority,customer.c_comment
> FROM customer, LATERAL 
> (SELECT O.ord.o_orderkey as o_orderkey, O.ord.o_totalprice as 
> o_totalprice,O.ord.o_orderdate as o_orderdate ,O.ord.o_shippriority as 
> o_shippriority,O.ord.o_orderpriority 
> as o_orderpriority FROM UNNEST(customer.c_orders) O(ord))orders;
> {code}
> The error got was 
> {code}
> o.a.d.e.p.impl.join.LateralJoinBatch - Output batch still has some space 
> left, getting new batches from left and right
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_custkey
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_phone
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_acctbal
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_orders
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_mktsegment
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_address
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_nationkey
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_name
> 2018-06-21 15:25:16,303 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.exec.physical.impl.ScanBatch - set record count 0 for vv c_comment
> 2018-06-21 15:25:16,316 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.d.e.v.c.AbstractContainerVector - Field [o_comment] mutated from 
> [NullableVarCharVector] to [RepeatedVarCharVector]
> 2018-06-21 15:25:16,318 [24d3da36-bdb8-cb5b-594c-82135bfb84aa:frag:0:0] DEBUG 
> o.a.drill.exec.vector.UInt4Vector - Reallocating vector [[`$offsets$` 
> (UINT4:REQUIRED)]]. # of bytes: [16384] -> [32768]
> {code}
> On further investigation with [~shamirwasia], it was found that the crash only 
> happens when [o_comment] mutates from [NullableVarCharVector] to 
> [RepeatedVarCharVector], not the other way around.
> Please find the logs, the stack trace, and the data file attached.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6535) ClassCastException in Lateral Unnest queries when dealing with schema changed json data

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6535:
-
Labels: ready-to-commit  (was: )

> ClassCastException in Lateral Unnest queries when dealing with schema changed 
> json data
> ---
>
> Key: DRILL-6535
> URL: https://issues.apache.org/jira/browse/DRILL-6535
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> Found by [~kedarbcs16]
> ClassCastException in Lateral Unnest queries when dealing with schema changed 
> json data
> {code:java}
> Query : SELECT  customer.c_custkey,customer.c_acctbal,orders.o_orderkey, 
> orders.o_totalprice,orders.o_orderdate FROM customer, LATERAL 
> (SELECT O.ord.o_orderkey as o_orderkey, O.ord.o_totalprice as 
> o_totalprice,O.ord.o_orderdate as o_orderdate  FROM UNNEST(customer.c_orders) 
> O(ord) WHERE year(O.ord.o_orderdate) <> 1998)orders;
> {code}
> The data is sf001 complex data in multi-json format, partitioned based on the 
> year of the o_orderdate column.
>  The last json file (for year 1998) has 2 schema changes, for c_acctbal and 
> o_shippriority.
>  The logs are:
> {code:java}
> [Error Id: 6df4ceae-c989-4592-aeec-6d30b626f0ab on drill182:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: org.apache.drill.exec.vector.NullableFloat8Vector cannot 
> be cast to org.apache.drill.exec.vector.NullableBigIntVector
> Fragment 0:0
> [Error Id: 6df4ceae-c989-4592-aeec-6d30b626f0ab on drill182:31010]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633)
>  ~[drill-common-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:359)
>  [drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:214)
>  [drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:325)
>  [drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_161]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_161]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
> Caused by: java.lang.ClassCastException: 
> org.apache.drill.exec.vector.NullableFloat8Vector cannot be cast to 
> org.apache.drill.exec.vector.NullableBigIntVector
> at 
> org.apache.drill.exec.vector.NullableBigIntVector.copyEntry(NullableBigIntVector.java:396)
>  ~[vector-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.join.LateralJoinBatch.copyDataToOutputVectors(LateralJoinBatch.java:802)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.join.LateralJoinBatch.emitLeft(LateralJoinBatch.java:813)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.join.LateralJoinBatch.crossJoinAndOutputRecords(LateralJoinBatch.java:761)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.join.LateralJoinBatch.produceOutputBatch(LateralJoinBatch.java:479)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.join.LateralJoinBatch.innerNext(LateralJoinBatch.java:157)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:172)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:63)
>  ~[drill-java-exec-1.14.0-SNAPSHOT.jar:1.14.0-SNAPSHOT]
> at 
> 

[jira] [Updated] (DRILL-6575) Add store.hive.conf.properties option to allow set Hive properties at session level

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6575:

Description: 
*Use case*

Hive external table ddl:
{noformat}
create external table my(key int, val string)
row format delimited
fields terminated by ','
stored as textfile
location '/data/my_tbl';
{noformat}
Path {{/data/my_tbl}} contains a sub-directory with a file in it: 
{{/data/my_tbl/sub_dir/data.txt}}, which contains the following data:
{noformat}
1, value_1
2, value_2
{noformat}
When querying such a table from Hive, the user gets the following exception:
{noformat}
Failed with exception java.io.IOException:java.io.IOException: Not a file: 
file:///data/my_tbl/sub_dir
{noformat}
To be able to query this table, the user needs to set two properties to true: 
{{hive.mapred.supports.subdirectories}} and {{mapred.input.dir.recursive}}.
 They can be set at the system level in hive-site.xml or at the session level in the Hive console:
{noformat}
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
{noformat}
Currently, to be able to query such a table from Drill, the user can specify these 
properties in the Hive plugin configuration only:
{noformat}
{
  "type": "hive",
  "configProps": {
"hive.metastore.uris": "thrift://localhost:9083",
"hive.metastore.sasl.enabled": "false",
"hbase.zookeeper.quorum": "localhost",
"hbase.zookeeper.property.clientPort": "5181",
"hive.mapred.supports.subdirectories": "true",
"mapred.input.dir.recursive": "true"
  },
  "enabled": true
}
{noformat}
*Jira scope*
 This Jira aims to add a new session option to Drill, 
{{store.hive.conf.properties}}, which will allow the user to specify Hive properties 
at the session level. 
 The user should write the properties in a string, delimited by the newline symbol. 
Property values should NOT be set in double quotes or any other quotes, 
otherwise they will be parsed incorrectly. Key and value should be separated 
by {{=}}. Each `alter session set` will override previously set properties at 
the session level. If during a query Drill cannot parse the property string, a warning 
will be logged. Properties will be parsed by loading them into 
{{java.util.Properties}}. The default value is an empty string ("").

Example:
{noformat}
alter session set `store.hive.conf.properties` = 
'hive.mapred.supports.subdirectories=true\nmapred.input.dir.recursive=true'";
{noformat}

  was:
*Use case*

Hive external table ddl:
{noformat}
create external table my(key int, val string)
row format delimited
fields terminated by ','
stored as textfile
location '/data/my_tbl';
{noformat}

Path {{/data/my_tbl}} contains a sub-directory with a file in it: 
{{/data/my_tbl/sub_dir/data.txt}}, which contains the following data: 
{noformat}
1, value_1
2, value_2
{noformat}

When querying such a table from Hive, the user gets the following exception:
{noformat}
Failed with exception java.io.IOException:java.io.IOException: Not a file: 
file:///data/my_tbl/sub_dir
{noformat}

To be able to query this table, the user needs to set two properties to true: 
{{hive.mapred.supports.subdirectories}} and {{mapred.input.dir.recursive}}.
They can be set at the system level in hive-site.xml or at the session level in the Hive console:
{noformat}
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
{noformat}

Currently, to be able to query such a table from Drill, the user can specify these 
properties in the Hive plugin configuration only:
{noformat}
{
  "type": "hive",
  "configProps": {
"hive.metastore.uris": "thrift://localhost:9083",
"hive.metastore.sasl.enabled": "false",
"hbase.zookeeper.quorum": "localhost",
"hbase.zookeeper.property.clientPort": "5181",
"hive.mapred.supports.subdirectories": "true",
"mapred.input.dir.recursive": "true"
  },
  "enabled": true
}
{noformat}

*Jira scope*
This Jira aims to add a new session option to Drill, 
{{store.hive.conf.properties}}, which will allow the user to specify Hive properties 
at the session level. 
The user should write the properties in a string, delimited by the newline symbol. 
Property values should NOT be set in double quotes or any other quotes, otherwise they 
will be parsed incorrectly. Key and value should be separated by {{=}}. Each 
`alter session set` will override previously set properties at the session level. 
If during a query Drill cannot parse the property string, a warning will be logged. 
Properties will be parsed by loading them into {{java.util.Properties}}.

Example:
{noformat}
alter session set `store.hive.conf.properties` = 
'hive.mapred.supports.subdirectories=true\nmapred.input.dir.recursive=true'";
{noformat}


> Add store.hive.conf.properties option to allow set Hive properties at session 
> level
> ---
>
> Key: DRILL-6575
> URL: https://issues.apache.org/jira/browse/DRILL-6575
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>

[jira] [Updated] (DRILL-6557) Use size in bytes during Hive statistics calculation if present

2018-07-02 Thread Pritesh Maker (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6557:
-
Labels: ready-to-commit  (was: )

> Use size in bytes during Hive statistics calculation if present
> ---
>
> Key: DRILL-6557
> URL: https://issues.apache.org/jira/browse/DRILL-6557
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> Drill considers Hive statistics valid if they contain both the number of rows and the 
> size in bytes. If at least one of them is absent, statistics are calculated based on 
> the input splits' size in bytes. This means that we fetch all input splits even though 
> we might not need some of them after planning optimizations (ex: partition pruning). 
> However, if the number of rows is missing but the size in bytes is present, there is 
> no need to fetch all input splits, since their size in bytes will be the same as in 
> the statistics; this would improve planning time, since fetching input splits is a 
> rather costly operation.
> This Jira aims to:
>  1. check the presence of size in bytes in the stats before fetching input splits, and 
> use it if present (see the sketch after this list);
>  2. add a log trace suggesting to use the ANALYZE command before running queries if 
> statistics are unavailable and Drill had to fetch all input splits;
>  3. minor refactoring / cleanup in the HiveMetadataProvider class.
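> A minimal sketch of that sizing decision (names are hypothetical, not the actual 
> HiveMetadataProvider API):
> {code:java}
> import java.util.Arrays;
> import java.util.Collections;
> import java.util.List;
> 
> public class HiveStatsSketch {
>   static long sizeInBytes(Long statsSizeInBytes, List<Long> inputSplitSizes) {
>     if (statsSizeInBytes != null && statsSizeInBytes > 0) {
>       // Size is already known from Hive statistics: skip the costly split fetch.
>       return statsSizeInBytes;
>     }
>     // Fallback: fetch all input splits and sum their sizes.
>     return inputSplitSizes.stream().mapToLong(Long::longValue).sum();
>   }
> 
>   public static void main(String[] args) {
>     System.out.println(sizeInBytes(2048L, Collections.emptyList()));  // 2048, no fetch needed
>     System.out.println(sizeInBytes(null, Arrays.asList(1024L, 1024L))); // 2048 via splits
>   }
> }
> {code}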



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6575) Add store.hive.conf.properties option to allow set Hive properties at session level

2018-07-02 Thread Arina Ielchiieva (JIRA)
Arina Ielchiieva created DRILL-6575:
---

 Summary: Add store.hive.conf.properties option to allow set Hive 
properties at session level
 Key: DRILL-6575
 URL: https://issues.apache.org/jira/browse/DRILL-6575
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.13.0
Reporter: Arina Ielchiieva
Assignee: Arina Ielchiieva
 Fix For: 1.14.0


*Use case*

Hive external table ddl:
{noformat}
create external table my(key int, val string)
row format delimited
fields terminated by ','
stored as textfile
location '/data/my_tbl';
{noformat}

Path {{/data/my_tbl}} contains a sub-directory with a file in it: 
{{/data/my_tbl/sub_dir/data.txt}}, which contains the following data: 
{noformat}
1, value_1
2, value_2
{noformat}

When querying such a table from Hive, the user gets the following exception:
{noformat}
Failed with exception java.io.IOException:java.io.IOException: Not a file: 
file:///data/my_tbl/sub_dir
{noformat}

To be able to query this table, the user needs to set two properties to true: 
{{hive.mapred.supports.subdirectories}} and {{mapred.input.dir.recursive}}.
They can be set at the system level in hive-site.xml or at the session level in the Hive console:
{noformat}
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
{noformat}

Currently, to be able to query such a table from Drill, the user can specify these 
properties in the Hive plugin configuration only:
{noformat}
{
  "type": "hive",
  "configProps": {
"hive.metastore.uris": "thrift://localhost:9083",
"hive.metastore.sasl.enabled": "false",
"hbase.zookeeper.quorum": "localhost",
"hbase.zookeeper.property.clientPort": "5181",
"hive.mapred.supports.subdirectories": "true",
"mapred.input.dir.recursive": "true"
  },
  "enabled": true
}
{noformat}

*Jira scope*
This Jira aims to add a new session option to Drill, 
{{store.hive.conf.properties}}, which will allow the user to specify Hive properties 
at the session level. 
The user should write the properties in a string, delimited by the newline symbol. 
Property values should NOT be set in double quotes or any other quotes, otherwise they 
will be parsed incorrectly. Key and value should be separated by {{=}}. Each 
`alter session set` will override previously set properties at the session level. 
If during a query Drill cannot parse the property string, a warning will be logged. 
Properties will be parsed by loading them into {{java.util.Properties}}; a minimal 
parsing sketch follows the example below.

Example:
{noformat}
alter session set `store.hive.conf.properties` = 
'hive.mapred.supports.subdirectories=true\nmapred.input.dir.recursive=true'";
{noformat}
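
A minimal parsing sketch (hypothetical helper class, not Drill's actual 
implementation), showing how such a newline-delimited string can be loaded into 
{{java.util.Properties}}:
{code:java}
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class HiveConfPropertiesSketch {
  // Parses a newline-delimited "key=value" string via java.util.Properties.
  static Properties parse(String conf) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(conf)); // each line is treated as key=value
    } catch (IOException e) {
      // Drill would only log a warning here; rethrowing keeps the sketch short.
      throw new IllegalStateException("Cannot parse Hive properties string", e);
    }
    return props;
  }

  public static void main(String[] args) {
    Properties p = parse("hive.mapred.supports.subdirectories=true\nmapred.input.dir.recursive=true");
    System.out.println(p.getProperty("mapred.input.dir.recursive")); // prints: true
  }
}
{code}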



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6410) Memory leak in Parquet Reader during cancellation

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530156#comment-16530156
 ] 

ASF GitHub Bot commented on DRILL-6410:
---

vrozov commented on issue #1333: DRILL-6410: Memory leak in Parquet Reader 
during cancellation
URL: https://github.com/apache/drill/pull/1333#issuecomment-401862300
 
 
   @ilooner I guess by "last discussion" you refer to the discussion between 
you, me, and @sachouche, where "majority" does not mean the community majority. 
In Apache, any contributor can provide a solution that (s)he considers to 
be the best possible, and it can then either be accepted by the 
community/contributors or blocked with -1 (which requires technical justification). If 
another contributor provides an alternative solution, the community may decide to 
go with the alternate solution as long as it addresses the technical concerns of 
the initial contribution. For this particular case, my requirements are a) a 
unified approach (@parthchandra has the same requirement) and b) the ability to 
cancel tasks asynchronously. If that can be done with the approach outlined in 
PR #1257 and a contributor changes it to address all the issues, let's move 
forward with the alternate approach.
   
   A note regarding the complexity of the implementation. This implementation 
uses public Java concurrency classes as well. It does not rely on unsupported 
or unsafe-to-use Java classes and/or APIs. Basically, `LockSupport` is the same 
first-class concurrency construct as the `Thread` or `CountDownLatch` classes. The 
primary use case for these constructs here is to create a combination of an 
`ExecutorService` and a `CountDownLatch` that is not provided by Java itself; 
a minimal sketch of that combination follows below.
   
   To summarize, I am perfectly fine with going with an alternate solution or with 
another committer reviewing the PR; it would be against the Apache way to force a 
committer to review or commit a change that (s)he is not comfortable with.
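
   A minimal sketch of that `ExecutorService` + `CountDownLatch` combination, 
assuming a simple interruptible task (illustrative only, not the code of this PR):
   
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CancellableTasksSketch {
  public static void main(String[] args) throws InterruptedException {
    int taskCount = 2;
    ExecutorService pool = Executors.newFixedThreadPool(taskCount);
    CountDownLatch started = new CountDownLatch(taskCount);
    CountDownLatch done = new CountDownLatch(taskCount);

    Runnable task = () -> {
      started.countDown();
      try {
        // Simulated work that periodically observes the interrupt flag.
        while (!Thread.currentThread().isInterrupted()) {
          Thread.yield();
        }
      } finally {
        done.countDown(); // always signal completion, even after cancellation
      }
    };

    Future<?> f1 = pool.submit(task);
    Future<?> f2 = pool.submit(task);

    started.await(); // make sure both tasks are running before cancelling
    f1.cancel(true); // asynchronous cancellation via interruption
    f2.cancel(true);

    done.await();    // the caller can still wait for every task to acknowledge
    pool.shutdown();
  }
}
{code}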


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Memory leak in Parquet Reader during cancellation
> -
>
> Key: DRILL-6410
> URL: https://issues.apache.org/jira/browse/DRILL-6410
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: Vlad Rozov
>Priority: Major
> Fix For: 1.14.0
>
>
> Occasionally, a memory leak is observed within the flat Parquet reader when 
> query cancellation is invoked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6574) Add option to push LIMIT(0) on top of SCAN (late limit 0 optimization)

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6574:

Description: 
Currently we have early limit 0 optimization 
(planner.enable_limit0_optimization) which determines query data types before 
actual scan. Since we not always able to determine data type during planning, 
we need to add one more option to enable late limit 0 optimization 
(planner.enable_limit0_on_scan, exit query right after scan. LIMIT(0) on SCAN 
for UNION and complex functions will be disabled i.e. UNION and complex 
functions need data to produce result schema. Also this would not work for the 
following list of functions: //todo add list of functions

Query plan examples:
// todo add two plans before and after the changes

Also both early and late limit 0 optimizations will be turned on by default.






  was:
Currently we have an early limit 0 optimization 
(planner.enable_limit0_optimization) which determines query data types before the 
actual scan. Since we are not always able to determine data types during planning, 
we need to add one more option to enable a late limit 0 optimization 
(planner.enable_limit0_on_scan) that exits the query right after the scan. LIMIT(0) 
on SCAN will be disabled for UNION and complex functions, i.e. UNION and complex 
functions need data to produce a result schema. Also this would not work for the 
following list of functions: //todo add list of functions

Query plan examples:
// todo add two plans before and after the changes

Also both early and late limit 0 optimization will be turn on by default.







> Add option to push LIMIT(0) on top of SCAN (late limit 0 optimization)
> --
>
> Key: DRILL-6574
> URL: https://issues.apache.org/jira/browse/DRILL-6574
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting
>
> Currently we have an early limit 0 optimization 
> (planner.enable_limit0_optimization) which determines query data types before the 
> actual scan. Since we are not always able to determine data types during planning, 
> we need to add one more option to enable a late limit 0 optimization 
> (planner.enable_limit0_on_scan) that exits the query right after the scan. LIMIT(0) 
> on SCAN will be disabled for UNION and complex functions, i.e. UNION and complex 
> functions need data to produce a result schema. Also this would not work for 
> the following list of functions: //todo add list of functions
> Query plan examples:
> // todo add two plans before and after the changes
> Also both early and late limit 0 optimizations will be turned on by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6546) Allow unnest function with nested columns and complex expressions

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530117#comment-16530117
 ] 

ASF GitHub Bot commented on DRILL-6546:
---

vvysotskyi commented on issue #1346: DRILL-6546: Allow unnest function with 
nested columns and complex expressions
URL: https://github.com/apache/drill/pull/1346#issuecomment-401854121
 
 
   After the rebase on current master, unit tests are failed. Fixing now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow unnest function with nested columns and complex expressions
> -
>
> Key: DRILL-6546
> URL: https://issues.apache.org/jira/browse/DRILL-6546
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.14.0
>
>
> Currently, queries with unnest and nested columns or complex expressions 
> inside fail:
> {code:sql}
> select u.item from cp.`lateraljoin/nested-customer.parquet` c,
> unnest(c.orders.items) as u(item)
> {code}
> fails with error:
> {noformat}
> VALIDATION ERROR: From line 2, column 10 to line 2, column 21: Column 
> 'orders.items' not found in table 'c'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6546) Allow unnest function with nested columns and complex expressions

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530118#comment-16530118
 ] 

ASF GitHub Bot commented on DRILL-6546:
---

vvysotskyi edited a comment on issue #1346: DRILL-6546: Allow unnest function 
with nested columns and complex expressions
URL: https://github.com/apache/drill/pull/1346#issuecomment-401854121
 
 
   After the rebase on current master, unit tests failed. Fixing now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow unnest function with nested columns and complex expressions
> -
>
> Key: DRILL-6546
> URL: https://issues.apache.org/jira/browse/DRILL-6546
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.14.0
>
>
> Currently, queries with unnest and nested columns or complex expressions 
> inside fail:
> {code:sql}
> select u.item from cp.`lateraljoin/nested-customer.parquet` c,
> unnest(c.orders.items) as u(item)
> {code}
> fails with error:
> {noformat}
> VALIDATION ERROR: From line 2, column 10 to line 2, column 21: Column 
> 'orders.items' not found in table 'c'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6519) Add String Distance and Phonetic Functions

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530054#comment-16530054
 ] 

ASF GitHub Bot commented on DRILL-6519:
---

arina-ielchiieva commented on issue #1331: DRILL-6519: Add String Distance and 
Phonetic Functions
URL: https://github.com/apache/drill/pull/1331#issuecomment-401837372
 
 
   +1, LGTM.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add String Distance and Phonetic Functions
> --
>
> Key: DRILL-6519
> URL: https://issues.apache.org/jira/browse/DRILL-6519
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.14.0
>
>
> From a recent project, this collection of functions makes it possible to do 
> fuzzy string matching as well as phonetic matching on strings. 
>  
> The following functions are all phonetic functions and map text to a number 
> or string based on how the word sounds.  For instance "Jayme" and "Jaime" 
> have the same soundex values and hence these functions can be used to match 
> similar sounding words.
>  * caverphone1(  )
>  * caverphone2(  )
>  * cologne_phonetic(  )
>  * dm_soundex(  )
>  * double_metaphone()
>  * match_rating_encoder(  )
>  * metaphone()
>  * nysiis(  )
>  * refined_soundex()
>  * soundex()
> Additionally, there is the
> {code:java}
> sounds_like(,){code}
> function which can be used to find strings that sound similar.   For instance:
>  
> {code:java}
> SELECT * 
> FROM 
> WHERE sounds_like( last_name, 'Gretsky' )
> {code}
> h2. String Distance Functions
> In addition to the phonetic functions, there is a series of distance 
> functions that measure the difference between two strings. The functions 
> include (a hedged usage sketch follows the list):
>  * cosine_distance(,)
>  * fuzzy_score(,)
>  * hamming_distance (,)
>  * jaccard_distance (,)
>  * jaro_distance (,)
>  * levenshtein_distance (,)
>  * longest_common_substring_distance(,)
>  
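> A hedged Java sketch of the underlying behavior, using the Apache Commons 
> Codec/Text classes of the same names (assuming Drill wraps these or equivalent 
> implementations):
> {code:java}
> import org.apache.commons.codec.language.Soundex;
> import org.apache.commons.text.similarity.LevenshteinDistance;
> 
> public class FuzzyMatchSketch {
>   public static void main(String[] args) {
>     // "Jayme" and "Jaime" sound alike, so their soundex codes are equal.
>     Soundex soundex = new Soundex();
>     System.out.println(soundex.soundex("Jayme").equals(soundex.soundex("Jaime"))); // true
> 
>     // Edit distance between two similar surnames.
>     System.out.println(LevenshteinDistance.getDefaultInstance().apply("Gretsky", "Gretzky")); // 1
>   }
> }
> {code}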



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6519) Add String Distance and Phonetic Functions

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530028#comment-16530028
 ] 

ASF GitHub Bot commented on DRILL-6519:
---

cgivre commented on issue #1331: DRILL-6519: Add String Distance and Phonetic 
Functions
URL: https://github.com/apache/drill/pull/1331#issuecomment-401831328
 
 
   @arina-ielchiieva Should be ready to go...


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add String Distance and Phonetic Functions
> --
>
> Key: DRILL-6519
> URL: https://issues.apache.org/jira/browse/DRILL-6519
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.14.0
>
>
> From a recent project, this collection of functions makes it possible to do 
> fuzzy string matching as well as phonetic matching on strings. 
>  
> The following functions are all phonetic functions and map text to a number 
> or string based on how the word sounds.  For instance "Jayme" and "Jaime" 
> have the same soundex values and hence these functions can be used to match 
> similar sounding words.
>  * caverphone1(  )
>  * caverphone2(  )
>  * cologne_phonetic(  )
>  * dm_soundex(  )
>  * double_metaphone()
>  * match_rating_encoder(  )
>  * metaphone()
>  * nysiis(  )
>  * refined_soundex()
>  * soundex()
> Additionally, there is the
> {code:java}
> sounds_like(,){code}
> function which can be used to find strings that sound similar.   For instance:
>  
> {code:java}
> SELECT * 
> FROM 
> WHERE sounds_like( last_name, 'Gretsky' )
> {code}
> h2. String Distance Functions
> In addition to the phonetic functions, there is a series of distance 
> functions that measure the difference between two strings. The functions 
> include:
>  * cosine_distance(,)
>  * fuzzy_score(,)
>  * hamming_distance (,)
>  * jaccard_distance (,)
>  * jaro_distance (,)
>  * levenshtein_distance (,)
>  * longest_common_substring_distance(,)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5796) Filter pruning for multi rowgroup parquet file

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529846#comment-16529846
 ] 

ASF GitHub Bot commented on DRILL-5796:
---

jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r199489637
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +60,89 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return ParquetPredicatesHelper.isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, if the 
result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because they can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  static RowsMatch checkNull(Statistics exprStat) {
+return exprStat.getNumNulls() > 0 ? RowsMatch.SOME : RowsMatch.ALL;
   }
 
   /**
* IS NULL predicate.
*/
   private static > LogicalExpression 
createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
 
 Review comment:
   I can't swear to it. Surrounded it with instanceof. 
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Filter pruning for multi rowgroup parquet file
> --
>
> Key: DRILL-5796
> URL: https://issues.apache.org/jira/browse/DRILL-5796
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Jean-Blas IMBERT
>Priority: Major
> Fix For: 1.14.0
>
>
> Today, filter pruning uses the file name as the partitioning key. This means 
> you can remove a partition only if the whole file belongs to the same partition. 
> With Parquet, you can prune the filter at the rowgroup level if the rowgroups 
> partition your dataset, making the unit of work the rowgroup, not the file 
> (a standalone sketch of the tri-state pruning idea follows).
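> A standalone sketch of the tri-state idea (the enum mirrors this PR's RowsMatch; 
> the pruning helper is hypothetical):
> {code:java}
> // NONE: no rows match (the rowgroup can be dropped);
> // SOME: the filter must still be applied to the rowgroup's rows;
> // ALL: every row matches (the filter itself becomes redundant).
> enum RowsMatch { NONE, SOME, ALL }
> 
> public class RowGroupPruningSketch {
>   // If every rowgroup reports ALL, the filter can be pruned from the plan.
>   static boolean canPruneFilter(RowsMatch[] rowGroups) {
>     for (RowsMatch m : rowGroups) {
>       if (m != RowsMatch.ALL) {
>         return false;
>       }
>     }
>     return true;
>   }
> 
>   public static void main(String[] args) {
>     System.out.println(canPruneFilter(new RowsMatch[]{RowsMatch.ALL, RowsMatch.ALL}));  // true
>     System.out.println(canPruneFilter(new RowsMatch[]{RowsMatch.ALL, RowsMatch.SOME})); // false
>   }
> }
> {code}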



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-3214) Config option to cast empty string to null does not cast empty string to null

2018-07-02 Thread Vitalii Diravka (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529843#comment-16529843
 ] 

Vitalii Diravka commented on DRILL-3214:


The scope of the `_drill.exec.functions.cast_empty_string_to_null_` option is ALL, 
which means that it can be configured at the system, session, or query level.
But maybe the cause of the issue is that the _OptionDefinition_ for this option is 
placed in the _SystemOptionManager_ class. It looks like this leads to the 
option being usable only at the SYSTEM level.
Possibly this applies to all other options that are placed in 
_SystemOptionManager_ as well; a minimal sketch of the suspected mismatch follows.
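
{code:java}
// Illustrative sketch of the suspected behavior; the enum and check are
// hypothetical, not Drill's actual OptionDefinition/OptionManager API.
enum OptionScope { SYSTEM, SESSION, QUERY, ALL }

public class OptionScopeSketch {
  // An option declared with scope ALL should be settable at any level.
  static boolean canSetAt(OptionScope declared, OptionScope requested) {
    return declared == OptionScope.ALL || declared == requested;
  }

  public static void main(String[] args) {
    // Expected: a scope-ALL option is settable at the session level.
    System.out.println(canSetAt(OptionScope.ALL, OptionScope.SESSION));    // true
    // Observed symptom: the option behaves as if it were SYSTEM-only.
    System.out.println(canSetAt(OptionScope.SYSTEM, OptionScope.SESSION)); // false
  }
}
{code}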

> Config option to cast empty string to null does not cast empty string to null
> -
>
> Key: DRILL-3214
> URL: https://issues.apache.org/jira/browse/DRILL-3214
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.0.0
> Environment: faec150598840c40827e6493992d81209aa936da
>Reporter: Khurram Faraaz
>Assignee: Sean Hsuan-Yi Chu
>Priority: Major
> Fix For: 1.1.0
>
>
> Config option drill.exec.functions.cast_empty_string_to_null does not seem to 
> be working as designed.
> Disable casting of empty strings to null. 
> {code}
> 0: jdbc:drill:schema=dfs.tmp> alter session set 
> `drill.exec.functions.cast_empty_string_to_null` = false;
> +---+--+
> |  ok   | summary  |
> +---+--+
> | true  | drill.exec.functions.cast_empty_string_to_null updated.  |
> +---+--+
> 1 row selected (0.078 seconds)
> {code}
> In this query we see empty strings are retained in query output in columns[1].
> {code}
> 0: jdbc:drill:schema=dfs.tmp> SELECT columns[0], columns[1], columns[2] from 
> `threeColsDouble.csv`;
> +--+-+-+
> |  EXPR$0  | EXPR$1  | EXPR$2  |
> +--+-+-+
> | 156  | 234 | 1   |
> | 2653543  | 434 | 0   |
> | 367345   | 567567  | 23  |
> | 34554| 1234| 45  |
> | 4345 | 567678  | 19876   |
> | 34556| 0   | 1109|
> | 5456 | -1  | 1098|
> | 6567 | | 34534   |
> | 7678 | 1   | 6   |
> | 8798 | 456 | 243 |
> | 265354   | 234 | 123 |
> | 367345   | | 234 |
> | 34554| 1   | 2   |
> | 4345 | 0   | 10  |
> | 34556| -1  | 19  |
> | 5456 | 23423   | 345 |
> | 6567 | 0   | 2348|
> | 7678 | 1   | 2   |
> | 8798 | | 45  |
> | 099  | 19  | 17  |
> +--+-+-+
> 20 rows selected (0.13 seconds)
> {code}
> Casting empty strings to integer leads to NumberFormatException
> {code}
> 0: jdbc:drill:schema=dfs.tmp> SELECT columns[0], cast(columns[1] as int), 
> columns[2] from `threeColsDouble.csv`;
> Error: SYSTEM ERROR: java.lang.NumberFormatException: 
> Fragment 0:0
> [Error Id: b08f4247-263a-460d-b37b-91a70375f7ba on centos-03.qa.lab:31010] 
> (state=,code=0)
> {code}
> Enable casting empty string to null.
> {code}
> 0: jdbc:drill:schema=dfs.tmp> alter session set 
> `drill.exec.functions.cast_empty_string_to_null` = true;
> +---+--+
> |  ok   | summary  |
> +---+--+
> | true  | drill.exec.functions.cast_empty_string_to_null updated.  |
> +---+--+
> 1 row selected (0.077 seconds)
> {code}
> Run query
> {code}
> 0: jdbc:drill:schema=dfs.tmp> SELECT columns[0], cast(columns[1] as int), 
> columns[2] from `threeColsDouble.csv`;
> Error: SYSTEM ERROR: java.lang.NumberFormatException: 
> Fragment 0:0
> [Error Id: de633399-15f9-4a79-a21f-262bd5551207 on centos-03.qa.lab:31010] 
> (state=,code=0)
> {code}
> Note from the output of the query below that the empty strings are not cast to 
> null, although drill.exec.functions.cast_empty_string_to_null was set to true.
> {code}
> 0: jdbc:drill:schema=dfs.tmp> SELECT columns[0], columns[1], columns[2] from 
> `threeColsDouble.csv`;
> +--+-+-+
> |  EXPR$0  | EXPR$1  | EXPR$2  |
> +--+-+-+
> | 156  | 234 | 1   |
> | 2653543  | 434 | 0   |
> | 367345   | 567567  | 23  |
> | 34554| 1234| 45  |
> | 4345 | 567678  | 19876   |
> | 34556| 0   | 1109|
> | 5456 | -1  | 1098|
> | 6567 | | 34534   |
> | 7678 | 1   | 6   |
> | 8798 | 456 | 

[jira] [Commented] (DRILL-5796) Filter pruning for multi rowgroup parquet file

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529833#comment-16529833
 ] 

ASF GitHub Bot commented on DRILL-5796:
---

jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r199486245
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +60,89 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return ParquetPredicatesHelper.isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, if the 
result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because they can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  static RowsMatch checkNull(Statistics exprStat) {
+return exprStat.getNumNulls() > 0 ? RowsMatch.SOME : RowsMatch.ALL;
   }
 
   /**
* IS NULL predicate.
*/
   private static > LogicalExpression 
createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+if (typedFieldExpr.getPath().isArray()) {
+  return RowsMatch.SOME;
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static > LogicalExpression 
createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr,
-//if max value is not true or if there are all nulls  -> canDrop
-(exprStat, evaluator) -> !((BooleanStatistics)exprStat).getMax() || 
isAllNulls(exprStat, evaluator.getRowCount())
-);
+  private static > LogicalExpression 
createIsTruePredicate(LogicalExpression expr) {
+return new ParquetIsPredicate(expr,
+  (exprStat, evaluator) -> {
+if (isAllNulls(exprStat, evaluator.getRowCount()) || 
(exprStat.genericGetMin().equals(Boolean.FALSE) && 
exprStat.genericGetMax().equals(Boolean.FALSE))) {
+  return RowsMatch.NONE;
+}
+return exprStat.genericGetMin().equals(Boolean.TRUE) && 
exprStat.genericGetMax().equals(Boolean.TRUE) ? checkNull(exprStat) : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS FALSE predicate.
*/
-  private static LogicalExpression createIsFalsePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr,
-//if min value is not false or if there are all nulls  -> canDrop
-(exprStat, evaluator) -> ((BooleanStatistics)exprStat).getMin() || 
isAllNulls(exprStat, evaluator.getRowCount())
+  private static > LogicalExpression 
createIsFalsePredicate(LogicalExpression expr) {
+return new ParquetIsPredicate(expr,
+  (exprStat, evaluator) -> 

[jira] [Commented] (DRILL-5796) Filter pruning for multi rowgroup parquet file

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529822#comment-16529822
 ] 

ASF GitHub Bot commented on DRILL-5796:
---

jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r199483001
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetComparisonPredicate.java
 ##
 @@ -83,23 +84,26 @@ private ParquetComparisonPredicate(
* where Column1 and Column2 are from same parquet table.
*/
   @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics leftStat = left.accept(evaluator, null);
-if (isNullOrEmpty(leftStat)) {
-  return false;
+if (isNullOrEmpty(leftStat) || !leftStat.hasNonNullValue()) {
+  return RowsMatch.SOME;
 }
-
 Statistics rightStat = right.accept(evaluator, null);
-if (isNullOrEmpty(rightStat)) {
-  return false;
+if (isNullOrEmpty(rightStat) || !rightStat.hasNonNullValue()) {
+  return RowsMatch.SOME;
 }
-
-// if either side is ALL null, = is evaluated to UNKNOWN -> canDrop
 if (isAllNulls(leftStat, evaluator.getRowCount()) || isAllNulls(rightStat, 
evaluator.getRowCount())) {
-  return true;
+return RowsMatch.NONE;
 }
+return predicate.apply(leftStat, rightStat);
+  }
 
-return (leftStat.hasNonNullValue() && rightStat.hasNonNullValue()) && 
predicate.test(leftStat, rightStat);
+  /**
+   * If one rowgroup contains some null values, change the RowsMatch.ALL into 
RowsMatch.SOME (null values should be discarded by filter)
+   */
+  static RowsMatch checkNull(Statistics leftStat, Statistics rightStat) {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Filter pruning for multi rowgroup parquet file
> --
>
> Key: DRILL-5796
> URL: https://issues.apache.org/jira/browse/DRILL-5796
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Jean-Blas IMBERT
>Priority: Major
> Fix For: 1.14.0
>
>
> Today, filter pruning uses the file name as the partitioning key. This means 
> you can remove a partition only if the whole file belongs to the same partition. 
> With Parquet, you can prune the filter at the rowgroup level if the rowgroups 
> partition your dataset, making the unit of work the rowgroup, not the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5796) Filter pruning for multi rowgroup parquet file

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529819#comment-16529819
 ] 

ASF GitHub Bot commented on DRILL-5796:
---

jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r199482330
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetPushDownFilter.java
 ##
 @@ -165,12 +167,32 @@ protected void doOnMatch(RelOptRuleCall call, FilterPrel 
filter, ProjectPrel pro
   return;
 }
 
-
 RelNode newScan = ScanPrel.create(scan, scan.getTraitSet(), newGroupScan, 
scan.getRowType());;
 
 if (project != null) {
   newScan = project.copy(project.getTraitSet(), ImmutableList.of(newScan));
 }
+
+RowsMatch matchAll = RowsMatch.ALL;
+if (newGroupScan instanceof AbstractParquetGroupScan) {
+  List rowGroupInfos = ((AbstractParquetGroupScan) 
newGroupScan).rowGroupInfos;
+  for (RowGroupInfo rowGroup : rowGroupInfos) {
+if (rowGroup.getRowsMatch() !=  RowsMatch.ALL) {
+  matchAll = RowsMatch.SOME;
+  break;
+}
+  }
+} else {
+  matchAll = RowsMatch.SOME;
+}
+// If all groups are in RowsMatch.ALL, no need to apply the filter again 
to their rows => prune the filter
+if (matchAll == ParquetFilterPredicate.RowsMatch.ALL) {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Filter pruning for multi rowgroup parquet file
> --
>
> Key: DRILL-5796
> URL: https://issues.apache.org/jira/browse/DRILL-5796
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Jean-Blas IMBERT
>Priority: Major
> Fix For: 1.14.0
>
>
> Today, filter pruning uses the file name as the partitioning key. This means 
> you can remove a partition only if the whole file belongs to the same partition. 
> With Parquet, you can prune the filter at the rowgroup level if the rowgroups 
> partition your dataset, making the unit of work the rowgroup, not the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5796) Filter pruning for multi rowgroup parquet file

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529812#comment-16529812
 ] 

ASF GitHub Bot commented on DRILL-5796:
---

jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r199480607
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +60,89 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return ParquetPredicatesHelper.isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, if the 
result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because they can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  static RowsMatch checkNull(Statistics exprStat) {
+return exprStat.getNumNulls() > 0 ? RowsMatch.SOME : RowsMatch.ALL;
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Filter pruning for multi rowgroup parquet file
> --
>
> Key: DRILL-5796
> URL: https://issues.apache.org/jira/browse/DRILL-5796
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Jean-Blas IMBERT
>Priority: Major
> Fix For: 1.14.0
>
>
> Today, filter pruning uses the file name as the partitioning key. This means 
> you can remove a partition only if the whole file belongs to the same partition. 
> With Parquet, you can prune the filter at the rowgroup level if the rowgroups 
> partition your dataset, making the unit of work the rowgroup, not the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5796) Filter pruning for multi rowgroup parquet file

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529807#comment-16529807
 ] 

ASF GitHub Bot commented on DRILL-5796:
---

jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r199479547
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -19,21 +19,18 @@
 
 import org.apache.drill.common.expression.LogicalExpression;
 import org.apache.drill.common.expression.LogicalExpressionBase;
-import org.apache.drill.common.expression.SchemaPath;
 import org.apache.drill.common.expression.TypedFieldExpr;
 import org.apache.drill.common.expression.visitors.ExprVisitor;
 import org.apache.drill.exec.expr.fn.FunctionGenerationHelper;
-import org.apache.parquet.column.statistics.BooleanStatistics;
 import org.apache.parquet.column.statistics.Statistics;
 
 import java.util.ArrayList;
 import java.util.Iterator;
 import java.util.List;
-import java.util.function.BiPredicate;
+import java.util.function.BiFunction;
 
 import static 
org.apache.drill.exec.expr.stat.ParquetPredicatesHelper.hasNoNulls;
 import static 
org.apache.drill.exec.expr.stat.ParquetPredicatesHelper.isAllNulls;
-import static 
org.apache.drill.exec.expr.stat.ParquetPredicatesHelper.isNullOrEmpty;
 
 Review comment:
   Back. Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Filter pruning for multi rowgroup parquet file
> --
>
> Key: DRILL-5796
> URL: https://issues.apache.org/jira/browse/DRILL-5796
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Jean-Blas IMBERT
>Priority: Major
> Fix For: 1.14.0
>
>
> Today, filter pruning uses the file name as the partitioning key. This means 
> you can remove a partition only if the whole file belongs to the same partition. 
> With Parquet, you can prune the filter at the rowgroup level, making the 
> rowgroup, not the file, the unit of work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5796) Filter pruning for multi rowgroup parquet file

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529804#comment-16529804
 ] 

ASF GitHub Bot commented on DRILL-5796:
---

jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r199479087
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetComparisonPredicate.java
 ##
 @@ -83,23 +84,26 @@ private ParquetComparisonPredicate(
* where Column1 and Column2 are from same parquet table.
*/
   @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics leftStat = left.accept(evaluator, null);
-if (isNullOrEmpty(leftStat)) {
-  return false;
+if (isNullOrEmpty(leftStat) || !leftStat.hasNonNullValue()) {
+  return RowsMatch.SOME;
 }
-
 Statistics rightStat = right.accept(evaluator, null);
-if (isNullOrEmpty(rightStat)) {
-  return false;
+if (isNullOrEmpty(rightStat) || !rightStat.hasNonNullValue()) {
+  return RowsMatch.SOME;
 }
-
-// if either side is ALL null, = is evaluated to UNKNOWN -> canDrop
 if (isAllNulls(leftStat, evaluator.getRowCount()) || isAllNulls(rightStat, 
evaluator.getRowCount())) {
-  return true;
+return RowsMatch.NONE;
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Filter pruning for multi rowgroup parquet file
> --
>
> Key: DRILL-5796
> URL: https://issues.apache.org/jira/browse/DRILL-5796
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Jean-Blas IMBERT
>Priority: Major
> Fix For: 1.14.0
>
>
> Today, filter pruning uses the file name as the partitioning key. This means 
> you can remove a partition only if the whole file belongs to the same partition. 
> With Parquet, you can prune the filter at the rowgroup level, making the 
> rowgroup, not the file, the unit of work.
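For context, a comparison predicate like the one in this hunk reasons purely over the rowgroup's min/max statistics and null count. A hedged, standalone sketch of that reasoning for a hypothetical "col < value" over long statistics (not Drill's Statistics API):

{code:java}
// Standalone sketch: deciding ALL/NONE/SOME for "col < value" from rowgroup stats.
enum RowsMatch { ALL, SOME, NONE }

final class LessThanOverStats {
  static RowsMatch matches(long min, long max, long numNulls, long value) {
    if (max < value) {
      // Every non-null row satisfies col < value; null rows never match,
      // so their presence downgrades ALL to SOME (the checkNull idea).
      return numNulls > 0 ? RowsMatch.SOME : RowsMatch.ALL;
    }
    if (min >= value) {
      return RowsMatch.NONE; // no row, null or not, can satisfy the predicate
    }
    return RowsMatch.SOME;   // the [min, max] range straddles the value
  }

  public static void main(String[] args) {
    System.out.println(matches(1, 5, 0, 10));  // ALL
    System.out.println(matches(1, 5, 2, 10));  // SOME (nulls present)
    System.out.println(matches(10, 20, 0, 5)); // NONE
  }
}
{code}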



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5796) Filter pruning for multi rowgroup parquet file

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529803#comment-16529803
 ] 

ASF GitHub Bot commented on DRILL-5796:
---

jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r199478733
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetComparisonPredicate.java
 ##
 @@ -83,23 +84,26 @@ private ParquetComparisonPredicate(
* where Column1 and Column2 are from same parquet table.
*/
   @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics leftStat = left.accept(evaluator, null);
-if (isNullOrEmpty(leftStat)) {
-  return false;
+if (isNullOrEmpty(leftStat) || !leftStat.hasNonNullValue()) {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Filter pruning for multi rowgroup parquet file
> --
>
> Key: DRILL-5796
> URL: https://issues.apache.org/jira/browse/DRILL-5796
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Jean-Blas IMBERT
>Priority: Major
> Fix For: 1.14.0
>
>
> Today, filter pruning uses the file name as the partitioning key. This means 
> you can remove a partition only if the whole file belongs to the same partition. 
> With Parquet, you can prune the filter at the rowgroup level, making the 
> rowgroup, not the file, the unit of work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6574) Add option to push LIMIT(0) on top of SCAN (late limit 0 optimization)

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6574:

Reviewer: Volodymyr Vysotskyi

> Add option to push LIMIT(0) on top of SCAN (late limit 0 optimization)
> --
>
> Key: DRILL-6574
> URL: https://issues.apache.org/jira/browse/DRILL-6574
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting
>
> Currently we have an early limit 0 optimization 
> (planner.enable_limit0_optimization) which determines query data types before 
> the actual scan. Since we are not always able to determine data types during 
> planning, we need to add one more option to enable a late limit 0 optimization 
> (planner.enable_limit0_on_scan) that exits the query right after the scan. 
> LIMIT(0) on SCAN will be disabled for UNION and complex functions, i.e. UNION 
> and complex functions need data to produce the result schema. Also this would 
> not work for the following list of functions: //todo add list of functions
> Query plan examples:
> // todo add two plans before and after the changes
> Also both early and late limit 0 optimizations will be turned on by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6574) Add option to push LIMIT(0) on top of SCAN (late limit 0 optimization)

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6574:

Description: 
Currently we have an early limit 0 optimization 
(planner.enable_limit0_optimization) which determines query data types before 
the actual scan. Since we are not always able to determine data types during 
planning, we need to add one more option to enable a late limit 0 optimization 
(planner.enable_limit0_on_scan) that exits the query right after the scan. 
LIMIT(0) on SCAN will be disabled for UNION and complex functions, i.e. UNION 
and complex functions need data to produce the result schema. Also this would 
not work for the following list of functions: //todo add list of functions

Query plan examples:
// todo add two plans before and after the changes

Also both early and late limit 0 optimizations will be turned on by default.






  was:
Currently we have an early limit 0 optimization 
(planner.enable_limit0_optimization) which determines query data types before 
the actual scan. Since we are not always able to determine data types during 
planning, we need to add one more option to enable a late limit 0 optimization 
(planner.enable_limit0_on_scan) that exits the query right after the scan. 
LIMIT(0) on SCAN will be disabled for UNION and complex functions, i.e. UNION 
and complex functions need data to produce the result schema. Also this would 
not work for the following list of functions: //todo add list of functions

Query plan examples:
// todo add two plans before and after the changes

Also both early and late limit 0 optimizations will turn on by default.







> Add option to push LIMIT(0) on top of SCAN (late limit 0 optimization)
> --
>
> Key: DRILL-6574
> URL: https://issues.apache.org/jira/browse/DRILL-6574
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting
>
> Currently we have an early limit 0 optimization 
> (planner.enable_limit0_optimization) which determines query data types before 
> the actual scan. Since we are not always able to determine data types during 
> planning, we need to add one more option to enable a late limit 0 optimization 
> (planner.enable_limit0_on_scan) that exits the query right after the scan. 
> LIMIT(0) on SCAN will be disabled for UNION and complex functions, i.e. UNION 
> and complex functions need data to produce the result schema. Also this would 
> not work for the following list of functions: //todo add list of functions
> Query plan examples:
> // todo add two plans before and after the changes
> Also both early and late limit 0 optimizations will be turned on by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6574) Add option to push LIMIT(0) on top of SCAN (late limit 0 optimization)

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6574:

Description: 
Currently we have an early limit 0 optimization 
(planner.enable_limit0_optimization) which determines query data types before 
the actual scan. Since we are not always able to determine data types during 
planning, we need to add one more option to enable a late limit 0 optimization 
(planner.enable_limit0_on_scan) that exits the query right after the scan. 
LIMIT(0) on SCAN will be disabled for UNION and complex functions, i.e. UNION 
and complex functions need data to produce the result schema. Also this would 
not work for the following list of functions: //todo add list of functions

Query plan examples:
// todo add two plans before and after the changes

Also both early and late limit 0 optimizations will turn on by default.






  was:
Currently prepare statements use LIMIT 0 to get the result schema. Adding 
LIMIT(0) on top of SCAN causes an early termination of the query.

Create an option "planner.enable_limit0_on_scan", enabled by default. Change 
"planner.enable_limit0_optimization" option to be enabled by default.

LIMIT(0) on SCAN is disabled for UNION and complex functions, i.e. UNION and 
complex functions need data to produce the result schema. 

If a function is unsupported, the plan won't be affected.


> Add option to push LIMIT(0) on top of SCAN (late limit 0 optimization)
> --
>
> Key: DRILL-6574
> URL: https://issues.apache.org/jira/browse/DRILL-6574
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting
>
> Currently we have an early limit 0 optimization 
> (planner.enable_limit0_optimization) which determines query data types before 
> the actual scan. Since we are not always able to determine data types during 
> planning, we need to add one more option to enable a late limit 0 optimization 
> (planner.enable_limit0_on_scan) that exits the query right after the scan. 
> LIMIT(0) on SCAN will be disabled for UNION and complex functions, i.e. UNION 
> and complex functions need data to produce the result schema. Also this would 
> not work for the following list of functions: //todo add list of functions
> Query plan examples:
> // todo add two plans before and after the changes
> Also both early and late limit 0 optimizations will turn on by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6574) Add option to push LIMIT(0) on top of SCAN (late limit 0 optimization)

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6574:

Summary: Add option to push LIMIT(0) on top of SCAN (late limit 0 
optimization)  (was: Add option to push LIMIT(0) on top of SCAN for a prepare 
statement)

> Add option to push LIMIT(0) on top of SCAN (late limit 0 optimization)
> --
>
> Key: DRILL-6574
> URL: https://issues.apache.org/jira/browse/DRILL-6574
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting
>
> Currently prepare statements use LIMIT 0 to get the result schema. Adding 
> LIMIT(0) on top of SCAN causes an early termination of the query.
> Create an option "planner.enable_limit0_on_scan", enabled by default. Change 
> "planner.enable_limit0_optimization" option to be enabled by default.
> LIMIT(0) on SCAN is disabled for UNION and complex functions, i.e. UNION and 
> complex functions need data to produce the result schema. 
> If a function is unsupported, the plan won't be affected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6574) Add option to push LIMIT(0) on top of SCAN for a prepare statement

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6574:

Summary: Add option to push LIMIT(0) on top of SCAN for a prepare statement 
 (was: Add LIMIT(0) on top of SCAN for a prepare statement)

> Add option to push LIMIT(0) on top of SCAN for a prepare statement
> --
>
> Key: DRILL-6574
> URL: https://issues.apache.org/jira/browse/DRILL-6574
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting
>
> Currently prepare statements use LIMIT 0 to get the result schema. Adding 
> LIMIT(0) on top of SCAN causes an early termination of the query.
> Create an option "planner.enable_limit0_on_scan", enabled by default. Change 
> "planner.enable_limit0_optimization" option to be enabled by default.
> LIMIT(0) on SCAN is disabled for UNION and complex functions, i.e. UNION and 
> complex functions need data to produce the result schema. 
> If a function is unsupported, the plan won't be affected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6574) Add LIMIT(0) on top of SCAN for a prepare statement

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6574:

Labels: doc-impacting  (was: )

> Add LIMIT(0) on top of SCAN for a prepare statement
> ---
>
> Key: DRILL-6574
> URL: https://issues.apache.org/jira/browse/DRILL-6574
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting
>
> Currently prepare statements use LIMIT 0 to get the result schema. Adding 
> LIMIT(0) on top of SCAN causes an early termination of the query.
> Create an option "planner.enable_limit0_on_scan", enabled by default. Change 
> "planner.enable_limit0_optimization" option to be enabled by default.
> LIMIT(0) on SCAN is disabled for UNION and complex functions, i.e. UNION and 
> complex functions need data to produce the result schema. 
> If a function is unsupported, the plan won't be affected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6574) Add LIMIT(0) on top of SCAN for a prepare statement

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6574:

Affects Version/s: 1.13.0

> Add LIMIT(0) on top of SCAN for a prepare statement
> ---
>
> Key: DRILL-6574
> URL: https://issues.apache.org/jira/browse/DRILL-6574
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: doc-impacting
>
> Currently prepare statements use LIMIT 0 to get the result schema. Adding 
> LIMIT(0) on top of SCAN causes an early termination of the query.
> Create an option "planner.enable_limit0_on_scan", enabled by default. Change 
> "planner.enable_limit0_optimization" option to be enabled by default.
> LIMIT(0) on SCAN is disabled for UNION and complex functions, i.e. UNION and 
> complex functions need data to produce the result schema. 
> If a function is unsupported, the plan won't be affected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6574) Add LIMIT(0) on top of SCAN for a prepare statement

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6574:

Issue Type: Improvement  (was: Task)

> Add LIMIT(0) on top of SCAN for a prepare statement
> ---
>
> Key: DRILL-6574
> URL: https://issues.apache.org/jira/browse/DRILL-6574
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Minor
>
> Currently prepare statements use LIMIT 0 to get the result schema. Adding 
> LIMIT(0) on top of SCAN causes an early termination of the query.
> Create an option "planner.enable_limit0_on_scan", enabled by default. Change 
> "planner.enable_limit0_optimization" option to be enabled by default.
> LIMIT(0) on SCAN is disabled for UNION and complex functions, i.e. UNION and 
> complex functions need data to produce the result schema. 
> If a function is unsupported, the plan won't be affected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6574) Add LIMIT(0) on top of SCAN for a prepare statement

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6574:

Priority: Major  (was: Minor)

> Add LIMIT(0) on top of SCAN for a prepare statement
> ---
>
> Key: DRILL-6574
> URL: https://issues.apache.org/jira/browse/DRILL-6574
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>
> Currently prepare statements use LIMIT 0 to get the result schema. Adding 
> LIMIT(0) on top of SCAN causes an early termination of the query.
> Create an option "planner.enable_limit0_on_scan", enabled by default. Change 
> "planner.enable_limit0_optimization" option to be enabled by default.
> LIMIT(0) on SCAN is disabled for UNION and complex functions, i.e. UNION and 
> complex functions need data to produce the result schema. 
> If a function is unsupported, the plan won't be affected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6574) Add LIMIT(0) on top of SCAN for a prepare statement

2018-07-02 Thread Bohdan Kazydub (JIRA)
Bohdan Kazydub created DRILL-6574:
-

 Summary: Add LIMIT(0) on top of SCAN for a prepare statement
 Key: DRILL-6574
 URL: https://issues.apache.org/jira/browse/DRILL-6574
 Project: Apache Drill
  Issue Type: Task
Reporter: Bohdan Kazydub
Assignee: Bohdan Kazydub


Currently prepare statements use LIMIT 0 to get the result schema. Adding 
LIMIT(0) on top of SCAN causes an early termination of the query.

Create an option "planner.enable_limit0_on_scan", enabled by default. Change 
"planner.enable_limit0_optimization" option to be enabled by default.

LIMIT(0) on SCAN is disabled for UNION and complex functions, i.e. UNION and 
complex functions need data to produce the result schema. 

If a function is unsupported, the plan won't be affected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6310) limit batch size for hash aggregate

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529489#comment-16529489
 ] 

ASF GitHub Bot commented on DRILL-6310:
---

asfgit closed pull request #1324: DRILL-6310: limit batch size for hash 
aggregate
URL: https://github.com/apache/drill/pull/1324
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java
 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java
index 57e9bd7d0c..d37631be45 100644
--- 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java
+++ 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java
@@ -19,15 +19,19 @@
 
 import java.io.IOException;
 import java.util.List;
+import java.util.Map;
 
 import com.google.common.collect.Lists;
 import org.apache.drill.common.exceptions.UserException;
 import org.apache.drill.common.expression.ErrorCollector;
 import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCall;
 import org.apache.drill.common.expression.FunctionHolderExpression;
 import org.apache.drill.common.expression.IfExpression;
 import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
 import org.apache.drill.common.logical.data.NamedExpression;
+import org.apache.drill.common.map.CaseInsensitiveMap;
 import org.apache.drill.exec.ExecConstants;
 import org.apache.drill.exec.compile.sig.GeneratorMapping;
 import org.apache.drill.exec.compile.sig.MappingSet;
@@ -49,11 +53,14 @@
 import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
 import org.apache.drill.exec.record.MaterializedField;
 import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.RecordBatchMemoryManager;
+import org.apache.drill.exec.record.RecordBatchSizer;
 import org.apache.drill.exec.record.TypedFieldId;
 import org.apache.drill.exec.record.VectorWrapper;
 import org.apache.drill.exec.record.selection.SelectionVector2;
 import org.apache.drill.exec.record.selection.SelectionVector4;
 import org.apache.drill.exec.vector.AllocationHelper;
+import org.apache.drill.exec.vector.FixedWidthVector;
 import org.apache.drill.exec.vector.ValueVector;
 
 import com.sun.codemodel.JExpr;
@@ -71,6 +78,12 @@
   private BatchSchema incomingSchema;
   private boolean wasKilled;
 
+  private int numGroupByExprs, numAggrExprs;
+
+  // This map saves the mapping between outgoing column and incoming column.
+  private Map<String, String> columnMapping;
+  private final HashAggMemoryManager hashAggMemoryManager;
+
   private final GeneratorMapping UPDATE_AGGR_INSIDE =
   GeneratorMapping.create("setupInterior" /* setup method */, 
"updateAggrValuesInternal" /* eval method */,
   "resetValues" /* reset */, "cleanup" /* cleanup */);
@@ -84,6 +97,67 @@
   "htRowIdx" /* workspace index */, "incoming" /* read container */, 
"outgoing" /* write container */,
   "aggrValuesContainer" /* workspace container */, UPDATE_AGGR_INSIDE, 
UPDATE_AGGR_OUTSIDE, UPDATE_AGGR_INSIDE);
 
+  public int getOutputRowCount() {
+return hashAggMemoryManager.getOutputRowCount();
+  }
+
+  public RecordBatchMemoryManager getRecordBatchMemoryManager() {
+return hashAggMemoryManager;
+  }
+
+  private class HashAggMemoryManager extends RecordBatchMemoryManager {
+private int valuesRowWidth = 0;
+
+HashAggMemoryManager(int outputBatchSize) {
+  super(outputBatchSize);
+}
+
+@Override
+public void update() {
+  // Get sizing information for the batch.
+  setRecordBatchSizer(new RecordBatchSizer(incoming));
+
+  int fieldId = 0;
+  int newOutgoingRowWidth = 0;
+  for (VectorWrapper<?> w : container) {
+if (w.getValueVector() instanceof FixedWidthVector) {
+  newOutgoingRowWidth += ((FixedWidthVector) 
w.getValueVector()).getValueWidth();
+  if (fieldId >= numGroupByExprs) {
+valuesRowWidth += ((FixedWidthVector) 
w.getValueVector()).getValueWidth();
+  }
+} else {
+  int columnWidth;
+  if (columnMapping.get(w.getValueVector().getField().getName()) == 
null) {
+ columnWidth = TypeHelper.getSize(w.getField().getType());
+  } else {
+RecordBatchSizer.ColumnSize columnSize = 
getRecordBatchSizer().getColumn(columnMapping.get(w.getValueVector().getField().getName()));
+if (columnSize == null) {
+  columnWidth = TypeHelper.getSize(w.getField().getType());
+} else {
+   

[jira] [Commented] (DRILL-6570) IndexOutOfBoundsException when using Flat Parquet Reader

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529472#comment-16529472
 ] 

ASF GitHub Bot commented on DRILL-6570:
---

kkhatua commented on issue #1354: DRILL-6570: Fixed IndexOutofBoundException in 
Parquet Reader
URL: https://github.com/apache/drill/pull/1354#issuecomment-401693445
 
 
   The IDE (or Maven, itself) might report this as unused. It might be worth 
mentioning in a comment that this is a placeholder for the future. 
   That said, I'd still advise @Ben-Zvi  or someone else familiar with the 
original PR to review this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> IndexOutOfBoundsException when using Flat Parquet  Reader
> -
>
> Key: DRILL-6570
> URL: https://issues.apache.org/jira/browse/DRILL-6570
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.14.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> * The Parquet Reader creates a reusable bulk entry based on the column 
> precision
>  * It uses the column precision to optimize the intermediary heap buffers
>  * It first detected that the column was fixed length, but then reverted this 
> assumption when the column changed precision
>  * This step was fine, except that the bulk entry's memory requirement changed 
> and the code didn't update the bulk entry's intermediary buffers
>  
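The bullets above describe a buffer that is sized once and then reused after the precision assumption changes. A minimal sketch of that failure pattern and the missing resize step, with hypothetical names (not the actual Parquet reader code):

{code:java}
import java.nio.ByteBuffer;

// Hypothetical sketch: a reusable entry whose intermediary buffer was sized for
// the old column precision must be re-allocated when the memory requirement
// grows, otherwise later writes overflow the buffer.
final class ReusableBulkEntry {
  private ByteBuffer buffer = ByteBuffer.allocate(0);

  private void ensureCapacity(int requiredBytes) {
    if (buffer.capacity() < requiredBytes) {
      buffer = ByteBuffer.allocate(requiredBytes); // the missing update step
    }
    buffer.clear();
  }

  void load(byte[] values) {
    ensureCapacity(values.length); // without this, a BufferOverflowException
    buffer.put(values);
  }

  public static void main(String[] args) {
    ReusableBulkEntry entry = new ReusableBulkEntry();
    entry.load(new byte[] {1, 2, 3});          // initial precision
    entry.load(new byte[] {1, 2, 3, 4, 5, 6}); // precision grew: buffer resized
  }
}
{code}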



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6519) Add String Distance and Phonetic Functions

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6519:

Labels: doc-impacting ready-to-commit  (was: doc-impacting)

> Add String Distance and Phonetic Functions
> --
>
> Key: DRILL-6519
> URL: https://issues.apache.org/jira/browse/DRILL-6519
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.14.0
>
>
> From a recent project, this collection of functions makes it possible to do 
> fuzzy string matching as well as phonetic matching on strings. 
>  
> The following functions are all phonetic functions and map text to a number 
> or string based on how the word sounds.  For instance "Jayme" and "Jaime" 
> have the same soundex values and hence these functions can be used to match 
> similar sounding words.
>  * caverphone1( <string> )
>  * caverphone2( <string> )
>  * cologne_phonetic( <string> )
>  * dm_soundex( <string> )
>  * double_metaphone(<string>)
>  * match_rating_encoder( <string> )
>  * metaphone(<string>)
>  * nysiis( <string> )
>  * refined_soundex(<string>)
>  * soundex(<string>)
> Additionally, there is the
> {code:java}
> sounds_like(<string1>,<string2>){code}
> function which can be used to find strings that sound similar.   For instance:
>  
> {code:java}
> SELECT * 
> FROM <table> 
> WHERE sounds_like( last_name, 'Gretsky' )
> {code}
> h2. String Distance Functions
> In addition to the phonetic functions, there are a series of distance 
> functions which measure the difference between two strings.  The functions 
> include:
>  * cosine_distance(<string1>,<string2>)
>  * fuzzy_score(<string1>,<string2>)
>  * hamming_distance (<string1>,<string2>)
>  * jaccard_distance (<string1>,<string2>)
>  * jaro_distance (<string1>,<string2>)
>  * levenshtein_distance (<string1>,<string2>)
>  * longest_common_substring_distance(<string1>,<string2>)
>  
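For readers curious about what such a distance function computes, a standalone sketch of the textbook Levenshtein (edit distance) algorithm; this is illustrative dynamic programming, not Drill's implementation:

{code:java}
// Textbook Levenshtein distance via two-row dynamic programming.
public class Levenshtein {
  static int distance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // distance from the empty prefix of a
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                    prev[j] + 1),     // deletion
                           prev[j - 1] + cost);       // substitution
      }
      int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
  }

  public static void main(String[] args) {
    System.out.println(distance("Jayme", "Jaime"));    // 1
    System.out.println(distance("kitten", "sitting")); // 3
  }
}
{code}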



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6519) Add String Distance and Phonetic Functions

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529468#comment-16529468
 ] 

ASF GitHub Bot commented on DRILL-6519:
---

arina-ielchiieva commented on issue #1331: DRILL-6519: Add String Distance and 
Phonetic Functions
URL: https://github.com/apache/drill/pull/1331#issuecomment-401692405
 
 
   @cgivre please update commit message, it should correspond to Jira number 
and name.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add String Distance and Phonetic Functions
> --
>
> Key: DRILL-6519
> URL: https://issues.apache.org/jira/browse/DRILL-6519
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.14.0
>
>
> From a recent project, this collection of functions makes it possible to do 
> fuzzy string matching as well as phonetic matching on strings. 
>  
> The following functions are all phonetic functions and map text to a number 
> or string based on how the word sounds.  For instance "Jayme" and "Jaime" 
> have the same soundex values and hence these functions can be used to match 
> similar sounding words.
>  * caverphone1( <string> )
>  * caverphone2( <string> )
>  * cologne_phonetic( <string> )
>  * dm_soundex( <string> )
>  * double_metaphone(<string>)
>  * match_rating_encoder( <string> )
>  * metaphone(<string>)
>  * nysiis( <string> )
>  * refined_soundex(<string>)
>  * soundex(<string>)
> Additionally, there is the
> {code:java}
> sounds_like(<string1>,<string2>){code}
> function which can be used to find strings that sound similar.   For instance:
>  
> {code:java}
> SELECT * 
> FROM <table> 
> WHERE sounds_like( last_name, 'Gretsky' )
> {code}
> h2. String Distance Functions
> In addition to the phonetic functions, there are a series of distance 
> functions which measure the difference between two strings.  The functions 
> include:
>  * cosine_distance(<string1>,<string2>)
>  * fuzzy_score(<string1>,<string2>)
>  * hamming_distance (<string1>,<string2>)
>  * jaccard_distance (<string1>,<string2>)
>  * jaro_distance (<string1>,<string2>)
>  * levenshtein_distance (<string1>,<string2>)
>  * longest_common_substring_distance(<string1>,<string2>)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6557) Use size in bytes during Hive statistics calculation if present

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529466#comment-16529466
 ] 

ASF GitHub Bot commented on DRILL-6557:
---

arina-ielchiieva commented on issue #1357: DRILL-6557: Use size in bytes during 
Hive statistics calculation if present
URL: https://github.com/apache/drill/pull/1357#issuecomment-401691697
 
 
   @vvysotskyi please review.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use size in bytes during Hive statistics calculation if present
> ---
>
> Key: DRILL-6557
> URL: https://issues.apache.org/jira/browse/DRILL-6557
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
> Fix For: 1.14.0
>
>
> Drill considers Hive statistics valid if they contain the number of rows and the 
> size in bytes. If at least one of them is absent, statistics are calculated based 
> on the input splits' size in bytes. This means that we fetch all input splits 
> though we might not need some of them after planning optimizations (e.g. partition 
> pruning). However, if the number of rows is missing and the size in bytes is 
> present, there is no need to fetch all input splits since their size in bytes will 
> be the same as in the statistics; this improves planning time since fetching input 
> splits is a rather costly operation.
> This Jira aims to:
>  1. check the presence of size in bytes in stats before fetching input splits and 
> use it if present;
>  2. add a log trace suggesting to use the ANALYZE command before running queries 
> if statistics are unavailable and Drill had to fetch all input splits;
>  3. minor refactoring / cleanup in the HiveMetadataProvider class.
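The gist of item 1 is a cheap-path/expensive-path decision. A rough sketch of it with hypothetical names (not the HiveMetadataProvider API):

{code:java}
import java.util.function.LongSupplier;

// Hypothetical sketch: prefer the metastore's size-in-bytes statistic over the
// costly enumeration of all input splits.
final class HiveStatsEstimate {
  static long sizeInBytes(Long statsSizeInBytes, LongSupplier fetchAllSplitsSize) {
    if (statsSizeInBytes != null && statsSizeInBytes > 0) {
      return statsSizeInBytes; // cheap: reuse the Hive statistics as-is
    }
    // costly fallback: fetch every input split and sum the sizes
    System.out.println("Hint: run ANALYZE so statistics are available next time");
    return fetchAllSplitsSize.getAsLong();
  }

  public static void main(String[] args) {
    System.out.println(sizeInBytes(1_048_576L, () -> 0L));   // taken from stats
    System.out.println(sizeInBytes(null, () -> 2_097_152L)); // taken from splits
  }
}
{code}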



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6557) Use size in bytes during Hive statistics calculation if present

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529465#comment-16529465
 ] 

ASF GitHub Bot commented on DRILL-6557:
---

arina-ielchiieva opened a new pull request #1357: DRILL-6557: Use size in bytes 
during Hive statistics calculation if present
URL: https://github.com/apache/drill/pull/1357
 
 
   1. Check the presence of size in bytes in stats before fetching input splits and 
use it if present.
   2. Add a log trace suggesting to use the ANALYZE command before running queries 
if statistics are unavailable and Drill had to fetch all input splits.
   3. Minor refactoring / cleanup in the HiveMetadataProvider class.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use size in bytes during Hive statistics calculation if present
> ---
>
> Key: DRILL-6557
> URL: https://issues.apache.org/jira/browse/DRILL-6557
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
> Fix For: 1.14.0
>
>
> Drill considers Hive statistics valid if they contain the number of rows and the 
> size in bytes. If at least one of them is absent, statistics are calculated based 
> on the input splits' size in bytes. This means that we fetch all input splits 
> though we might not need some of them after planning optimizations (e.g. partition 
> pruning). However, if the number of rows is missing and the size in bytes is 
> present, there is no need to fetch all input splits since their size in bytes will 
> be the same as in the statistics; this improves planning time since fetching input 
> splits is a rather costly operation.
> This Jira aims to:
>  1. check the presence of size in bytes in stats before fetching input splits and 
> use it if present;
>  2. add a log trace suggesting to use the ANALYZE command before running queries 
> if statistics are unavailable and Drill had to fetch all input splits;
>  3. minor refactoring / cleanup in the HiveMetadataProvider class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6557) Use size in bytes during Hive statistics calculation if present

2018-07-02 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6557:

Description: 
Drill considers Hive statistics valid if they contain the number of rows and the 
size in bytes. If at least one of them is absent, statistics are calculated based 
on the input splits' size in bytes. This means that we fetch all input splits 
though we might not need some of them after planning optimizations (e.g. partition 
pruning). However, if the number of rows is missing and the size in bytes is 
present, there is no need to fetch all input splits since their size in bytes will 
be the same as in the statistics; this improves planning time since fetching input 
splits is a rather costly operation.

This Jira aims to:
 1. check the presence of size in bytes in stats before fetching input splits and 
use it if present;
 2. add a log trace suggesting to use the ANALYZE command before running queries if 
statistics are unavailable and Drill had to fetch all input splits;
 3. minor refactoring / cleanup in the HiveMetadataProvider class.

  was:
Drill considers Hive statistics valid if they contain the number of rows and the 
size in bytes. If at least one of them is absent, statistics are calculated based 
on the input splits' size in bytes. This means that we fetch all input splits 
though we might not need some of them after planning optimizations (e.g. partition 
pruning). However, if the number of rows is missing and the size in bytes is 
present, there is no need to fetch all input splits since their size in bytes will 
be the same as in the statistics; this improves planning time since fetching input 
splits is a rather costly operation.

This Jira aims to:
 1. check the presence of size in bytes in stats before fetching input splits and 
use it if present;
 2. add a log debug suggesting to use the ANALYZE command before running queries if 
statistics are unavailable and Drill had to fetch all input splits;
 3. minor refactoring / cleanup in the HiveMetadataProvider class.


> Use size in bytes during Hive statistics calculation if present
> ---
>
> Key: DRILL-6557
> URL: https://issues.apache.org/jira/browse/DRILL-6557
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
> Fix For: 1.14.0
>
>
> Drill considers Hive statistics valid if they contain the number of rows and the 
> size in bytes. If at least one of them is absent, statistics are calculated based 
> on the input splits' size in bytes. This means that we fetch all input splits 
> though we might not need some of them after planning optimizations (e.g. partition 
> pruning). However, if the number of rows is missing and the size in bytes is 
> present, there is no need to fetch all input splits since their size in bytes will 
> be the same as in the statistics; this improves planning time since fetching input 
> splits is a rather costly operation.
> This Jira aims to:
>  1. check the presence of size in bytes in stats before fetching input splits and 
> use it if present;
>  2. add a log trace suggesting to use the ANALYZE command before running queries 
> if statistics are unavailable and Drill had to fetch all input splits;
>  3. minor refactoring / cleanup in the HiveMetadataProvider class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6573) Enhance JPPD with NDV

2018-07-02 Thread weijie.tong (JIRA)
weijie.tong created DRILL-6573:
--

 Summary: Enhance JPPD with NDV
 Key: DRILL-6573
 URL: https://issues.apache.org/jira/browse/DRILL-6573
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.14.0
Reporter: weijie.tong
Assignee: weijie.tong


Using NDV from the metadata system to judge whether the BloomFilter should be 
enabled at a possible HashJoin node.
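A hedged sketch of the kind of heuristic this could use; the threshold, names, and inputs below are invented purely for illustration:

{code:java}
// Hypothetical heuristic: a bloom filter pays off when the build side's join
// key has few distinct values (NDV) relative to the probe side's row count,
// i.e. when the filter is likely to be selective.
public class BloomFilterNdvHeuristic {
  static boolean shouldEnable(long buildNdv, long probeRowCount) {
    if (buildNdv <= 0 || probeRowCount <= 0) {
      return false; // no usable statistics: stay conservative
    }
    return (double) buildNdv / probeRowCount < 0.1; // illustrative threshold
  }

  public static void main(String[] args) {
    System.out.println(shouldEnable(1_000, 10_000_000));     // true
    System.out.println(shouldEnable(9_000_000, 10_000_000)); // false
  }
}
{code}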



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6572) Add memory calculation of JPPD BloomFilter

2018-07-02 Thread weijie.tong (JIRA)
weijie.tong created DRILL-6572:
--

 Summary: Add memory calculation of JPPD BloomFilter
 Key: DRILL-6572
 URL: https://issues.apache.org/jira/browse/DRILL-6572
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Relational Operators
Reporter: weijie.tong
Assignee: weijie.tong
 Fix For: 1.14.0


This is an enhancement of DRILL-6385 to include the memory of BloomFilter in 
the HashJoin's memory calculation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529442#comment-16529442
 ] 

ASF GitHub Bot commented on DRILL-6385:
---

weijietong commented on issue #1334: DRILL-6385: Support JPPD feature
URL: https://github.com/apache/drill/pull/1334#issuecomment-401686350
 
 
   @amansinha100  The scan node's memory copy logic has been removed. Thanks for the 
knowledge of `SelectionVectorPrelVisitor.addSelectionRemoverWhereNecessary`. I 
appreciate the effort you put in. Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Affects Versions: 1.14.0
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature is to support JPPD (Join Predicate Push Down). It will 
> benefit HashJoin and Broadcast HashJoin performance by reducing the number 
> of rows sent across the network and the memory consumed. This feature is 
> already supported by Impala, which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter from the HashJoin node to 
> Parquet’s scan node. The proposed basic procedure is described as follows:
>  # The HashJoin build side accumulates the equal join condition rows to 
> construct a bloom filter. Then it sends the bloom filter to the foreman 
> node.
>  # The foreman node passively accepts the bloom filters from all the fragments 
> that have the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe side 
> scan nodes, which may already have sent out partial data to the hash join 
> nodes (currently the hash join node will prefetch one batch from both sides).
>       4.  The scan node accepts the global bloom filter from the foreman node. 
> It will then filter the remaining rows against the bloom filter.
>  
> To implement the above execution flow, the main new notions are described below:
>       1. RuntimeFilter
> It’s a filter container which may contain a BloomFilter or a MinMaxFilter.
>       2. RuntimeFilterReporter
> It wraps the logic to send the hash join’s bloom filter to the foreman. The 
> serialized bloom filter will be sent out through the data tunnel. This object 
> will be instantiated by the FragmentExecutor and passed to the 
> FragmentContext, so the HashJoin operator can obtain it through the 
> FragmentContext.
>      3. RuntimeFilterRequestHandler
> It is responsible for accepting a SendRuntimeFilterRequest RPC and stripping the 
> actual BloomFilter from the network. It then passes this filter to the 
> WorkerBee’s new interface registerRuntimeFilter.
> Another RPC type is BroadcastRuntimeFilterRequest. It registers the 
> accepted global bloom filter with the WorkerBee via the registerRuntimeFilter 
> method and then propagates it to the FragmentContext, through which the probe side 
> scan node can fetch the aggregated bloom filter.
>       4. RuntimeFilterManager
> The foreman will instantiate a RuntimeFilterManager. It will indirectly get 
> every RuntimeFilter via the WorkerBee. Once all the BloomFilters have been 
> accepted and aggregated, it will broadcast the aggregated bloom filter to 
> all the probe side scan nodes through the data tunnel via a 
> BroadcastRuntimeFilterRequest RPC.
>      5. RuntimeFilterEnableOption 
>  A global option will be added to decide whether to enable this new feature.
>  
> Suggestions and advice are welcome. The related PR will be presented as 
> soon as possible.
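To make steps 1 and 2 concrete, a simplified single-hash bloom filter in plain Java (not Drill's BloomFilter/RuntimeFilter classes); aggregation on the foreman then reduces to a bitwise OR of same-sized filters:

{code:java}
// Simplified bloom filter: one hash function for brevity (real filters use k).
final class SimpleBloomFilter {
  private final long[] bits;

  SimpleBloomFilter(int numLongs) {
    this.bits = new long[numLongs];
  }

  private int bitIndex(long hash64) {
    return (int) ((hash64 & Long.MAX_VALUE) % (bits.length * 64L));
  }

  void insert(long hash64) { // build side: hash of the equal-join key
    int bit = bitIndex(hash64);
    bits[bit >>> 6] |= 1L << (bit & 63);
  }

  boolean mightContain(long hash64) { // probe side: may yield false positives
    int bit = bitIndex(hash64);
    return (bits[bit >>> 6] & (1L << (bit & 63))) != 0;
  }

  void merge(SimpleBloomFilter other) { // foreman: OR partial filters together
    for (int i = 0; i < bits.length; i++) {
      bits[i] |= other.bits[i];
    }
  }

  public static void main(String[] args) {
    SimpleBloomFilter filter = new SimpleBloomFilter(16);
    filter.insert(42L);
    System.out.println(filter.mightContain(42L)); // true
    System.out.println(filter.mightContain(7L));  // false here (no collision)
  }
}
{code}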



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529434#comment-16529434
 ] 

ASF GitHub Bot commented on DRILL-6385:
---

weijietong commented on a change in pull request #1334: DRILL-6385: Support 
JPPD feature
URL: https://github.com/apache/drill/pull/1334#discussion_r199395144
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
 ##
 @@ -696,6 +780,18 @@ public void executeBuildPhase() throws 
SchemaChangeException {
 if ( cycleNum > 0 ) {
   read_right_HV_vector = (IntVector) 
buildBatch.getContainer().getLast();
 }
+//create runtime filter
+if (cycleNum == 0 && enableRuntimeFilter) {
+  //create runtime filter and send out async
+  int condFieldIndex = 0;
+  for (BloomFilter bloomFilter : bloomFilters) {
+for (int ind = 0; ind < currentRecordCount; ind++) {
+  long hashCode = hash64.hash64Code(ind, 0, condFieldIndex);
+  bloomFilter.insert(hashCode);
 
 Review comment:
   I will enhance this in another JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Affects Versions: 1.14.0
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature is to support JPPD (Join Predicate Push Down). It will 
> benefit HashJoin and Broadcast HashJoin performance by reducing the number 
> of rows sent across the network and the memory consumed. This feature is 
> already supported by Impala, which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter from the HashJoin node to 
> Parquet’s scan node. The proposed basic procedure is described as follows:
>  # The HashJoin build side accumulates the equal join condition rows to 
> construct a bloom filter. Then it sends the bloom filter to the foreman 
> node.
>  # The foreman node passively accepts the bloom filters from all the fragments 
> that have the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe side 
> scan nodes, which may already have sent out partial data to the hash join 
> nodes (currently the hash join node will prefetch one batch from both sides).
>       4.  The scan node accepts the global bloom filter from the foreman node. 
> It will then filter the remaining rows against the bloom filter.
>  
> To implement the above execution flow, the main new notions are described below:
>       1. RuntimeFilter
> It’s a filter container which may contain a BloomFilter or a MinMaxFilter.
>       2. RuntimeFilterReporter
> It wraps the logic to send the hash join’s bloom filter to the foreman. The 
> serialized bloom filter will be sent out through the data tunnel. This object 
> will be instantiated by the FragmentExecutor and passed to the 
> FragmentContext, so the HashJoin operator can obtain it through the 
> FragmentContext.
>      3. RuntimeFilterRequestHandler
> It is responsible for accepting a SendRuntimeFilterRequest RPC and stripping the 
> actual BloomFilter from the network. It then passes this filter to the 
> WorkerBee’s new interface registerRuntimeFilter.
> Another RPC type is BroadcastRuntimeFilterRequest. It registers the 
> accepted global bloom filter with the WorkerBee via the registerRuntimeFilter 
> method and then propagates it to the FragmentContext, through which the probe side 
> scan node can fetch the aggregated bloom filter.
>       4. RuntimeFilterManager
> The foreman will instantiate a RuntimeFilterManager. It will indirectly get 
> every RuntimeFilter via the WorkerBee. Once all the BloomFilters have been 
> accepted and aggregated, it will broadcast the aggregated bloom filter to 
> all the probe side scan nodes through the data tunnel via a 
> BroadcastRuntimeFilterRequest RPC.
>      5. RuntimeFilterEnableOption 
>  A global option will be added to decide whether to enable this new feature.
>  
> Suggestions and advice are welcome. The related PR will be presented as 
> soon as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6537) Limit the batch size for buffering operators based on how much memory they get

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529424#comment-16529424
 ] 

ASF GitHub Bot commented on DRILL-6537:
---

asfgit closed pull request #1342: DRILL-6537:Limit the batch size for buffering 
operators based on how …
URL: https://github.com/apache/drill/pull/1342
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java 
b/exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
index bc16272ffb..49f149b37a 100644
--- a/exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
+++ b/exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
@@ -85,6 +85,10 @@ private ExecConstants() {
   // need to produce very large batches that take up lot of memory.
   public static final LongValidator OUTPUT_BATCH_SIZE_VALIDATOR = new 
RangeLongValidator(OUTPUT_BATCH_SIZE, 128, 512 * 1024 * 1024);
 
+  // Based on available memory, adjust output batch size for buffered 
operators by this factor.
+  public static final String OUTPUT_BATCH_SIZE_AVAIL_MEM_FACTOR = 
"drill.exec.memory.operator.output_batch_size_avail_mem_factor";
+  public static final DoubleValidator 
OUTPUT_BATCH_SIZE_AVAIL_MEM_FACTOR_VALIDATOR = new 
RangeDoubleValidator(OUTPUT_BATCH_SIZE_AVAIL_MEM_FACTOR, 0.01, 1.0);
+
   // External Sort Boot configuration
 
   public static final String EXTERNAL_SORT_TARGET_SPILL_BATCH_SIZE = 
"drill.exec.sort.external.spill.batch.size";
diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
index 428a47ebf3..047c597051 100644
--- 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
+++ 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
@@ -886,9 +886,13 @@ public HashJoinBatch(HashJoinPOP popConfig, 
FragmentContext context,
 partitions = new HashPartition[0];
 
 // get the output batch size from config.
-int configuredBatchSize = (int) 
context.getOptions().getOption(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR);
-batchMemoryManager = new JoinBatchMemoryManager(configuredBatchSize, left, 
right);
-logger.debug("BATCH_STATS, configured output batch size: {}", 
configuredBatchSize);
+final int configuredBatchSize = (int) 
context.getOptions().getOption(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR);
+final double avail_mem_factor = (double) 
context.getOptions().getOption(ExecConstants.OUTPUT_BATCH_SIZE_AVAIL_MEM_FACTOR_VALIDATOR);
+int outputBatchSize = Math.min(configuredBatchSize, 
Integer.highestOneBit((int)(allocator.getLimit() * avail_mem_factor)));
+logger.debug("BATCH_STATS, configured output batch size: {}, allocated 
memory {}, avail mem factor {}, output batch size: {}",
+  configuredBatchSize, allocator.getLimit(), avail_mem_factor, 
outputBatchSize);
+
+batchMemoryManager = new JoinBatchMemoryManager(outputBatchSize, left, 
right);
   }
 
   /**
diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java
 
b/exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java
index e6368f5aa5..a9c4742816 100644
--- 
a/exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java
+++ 
b/exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java
@@ -233,6 +233,7 @@
   new OptionDefinition(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR, new 
OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, false)),
   new OptionDefinition(ExecConstants.STATS_LOGGING_BATCH_SIZE_VALIDATOR, 
new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM_AND_SESSION, true, 
true)),
   new 
OptionDefinition(ExecConstants.STATS_LOGGING_BATCH_FG_SIZE_VALIDATOR,new 
OptionMetaData(OptionValue.AccessibleScopes.SYSTEM_AND_SESSION, true, true)),
+  new 
OptionDefinition(ExecConstants.OUTPUT_BATCH_SIZE_AVAIL_MEM_FACTOR_VALIDATOR, 
new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, false)),
   new OptionDefinition(ExecConstants.FRAG_RUNNER_RPC_TIMEOUT_VALIDATOR, 
new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, true)),
 };
 
@@ -294,7 +295,7 @@ public SystemOptionManager(final DrillConfig bootConfig) {
* Initializes this option manager.
*
* @return this option manager
-   * @throws IOException
+   * @throws Exception
*/
   public SystemOptionManager init() throws Exception {
 options = provider.getOrCreateStore(config);
diff --git 
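For reference, the sizing change in the HashJoinBatch hunk above caps the output batch at the largest power of two that fits within the configured fraction of the allocator limit. A standalone illustration with sample numbers (not Drill defaults):

{code:java}
// outputBatchSize = min(configured, highestOneBit(allocatorLimit * factor)).
public class BatchSizeExample {
  public static void main(String[] args) {
    int configuredBatchSize = 16 * 1024 * 1024; // sample configured size (16 MiB)
    long allocatorLimit = 100L * 1024 * 1024;   // sample allocator limit (100 MiB)
    double availMemFactor = 0.1;                // sample avail_mem_factor

    int outputBatchSize = Math.min(configuredBatchSize,
        Integer.highestOneBit((int) (allocatorLimit * availMemFactor)));

    // 100 MiB * 0.1 = 10,485,760 bytes -> highestOneBit = 8,388,608 (8 MiB)
    System.out.println(outputBatchSize); // 8388608
  }
}
{code}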

  1   2   >