[jira] [Updated] (DRILL-6145) Implement Hive MapR-DB JSON handler.

2018-03-19 Thread Pritesh Maker (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6145:
-
Labels: doc-impacting ready-to-commit  (was: doc-impacting)

> Implement Hive MapR-DB JSON handler. 
> -
>
> Key: DRILL-6145
> URL: https://issues.apache.org/jira/browse/DRILL-6145
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Hive, Storage - MapRDB
>Affects Versions: 1.12.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.14.0
>
>
> Similar to "hive-hbase-storage-handler" to support querying MapR-DB Hive's 
> external tables it is necessary to add "hive-maprdb-json-handler".
> Use case:
>  # Create a MapR-DB JSON table:
> {code}
> > mapr dbshell
> maprdb root:> create /tmp/table/json    (make sure /tmp/table exists)
> {code}
> -- insert data
> {code}
> insert /tmp/table/json --value '{"_id":"movie002", "title":"Developers on the Edge", "studio":"Command Line Studios"}'
> insert /tmp/table/json --id movie003 --value '{"title":"The Golden Master", "studio":"All-Nighter"}'
> {code}
>  #  Create a Hive external table:
> {code}
> hive> CREATE EXTERNAL TABLE mapr_db_json_hive_tbl ( 
> > movie_id string, title string, studio string) 
> > STORED BY 'org.apache.hadoop.hive.maprdb.json.MapRDBJsonStorageHandler' 
> > TBLPROPERTIES("maprdb.table.name" = "/tmp/table/json", "maprdb.column.id" = "movie_id");
> {code}
>  
>  # Use the hive schema to query this table via Drill:
> {code}
> 0: jdbc:drill:> select * from hive.mapr_db_json_hive_tbl;
> {code}
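> To run the same query non-interactively, something like this should work (a 
> sketch: the ZooKeeper address is a placeholder; sqlline's -u and --run flags 
> are used as in the commands elsewhere in this digest):
> {code}
> echo "select * from hive.mapr_db_json_hive_tbl;" > query.sql
> $DRILL_HOME/bin/sqlline -u jdbc:drill:zk=zkhost:2181 --run=query.sql
> {code}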





[jira] [Comment Edited] (DRILL-6270) Add debug startup option flag for drill in embedded and server mode

2018-03-19 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405700#comment-16405700
 ] 

Paul Rogers edited comment on DRILL-6270 at 3/20/18 2:14 AM:
-

Three suggestions.

1. To prevent future conflicts, pick a flag name such as {{-d 
-agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=n}} or, to be 
consistent with the existing {{--config}} and {{--site}} options, maybe 
{{--debug -agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=n}} 
(two dashes).

2. Since all we are doing is passing a flag to the JVM, maybe generalize your 
flag: {{--java -}}.

3. Even this is not entirely necessary. You can just do the following:

{code}
export 
DRILL_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=n
drillbit.sh start
{code}

To make this even easier, put it in a script so you don't have to either a) 
keep setting and unsetting the options, or b) remember the long flag to pass 
into the command line.
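
For example, a tiny wrapper might look like this (a sketch only: the script 
name and default port are assumptions, and {{DRILL_JAVA_OPTS}} is set exactly 
as above):

{code}
#!/usr/bin/env bash
# drill-debug.sh (hypothetical name): start a Drillbit with the JDWP agent enabled.
# An optional first argument overrides the debug port (defaults to 8000).
export DRILL_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,address=${1:-8000},suspend=n"
exec drillbit.sh start
{code}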


was (Author: paul-rogers):
Three suggestions.

1. To prevent future conflicts, pick a flag name such as {{-d 
-agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=n}} or, to be 
consistent with the existing {{--config}} and {{--site}} options, maybe 
{{--debug -agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=n}} 
(two dashes).

2. Since all we are doing is passing a flag to the JVM, maybe generalize your 
flag: {{--java -}}.

3. Even this is not entirely necessary. You can just do the following:

{code}
export 
DRILL_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=n
drillbit.sh start
{code}

To make this even easier, put it in a script so you don't have to either a) 
keep setting and unsetting the options, or b) remember the long flag to pass 
into the command line.

> Add debug startup option flag for drill in embedded and server mode
> ---
>
> Key: DRILL-6270
> URL: https://issues.apache.org/jira/browse/DRILL-6270
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Volodymyr Tkach
>Assignee: Anton Gozhiy
>Priority: Minor
>
> Add the possibility to run the sqlline.sh and drillbit.sh scripts with -- 
> with the standard Java remote debug options and the ability to override the port.
> Example: drillbit.sh start - 50001





[jira] [Commented] (DRILL-6270) Add debug startup option flag for drill in embedded and server mode

2018-03-19 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405700#comment-16405700
 ] 

Paul Rogers commented on DRILL-6270:


Three suggestions.

1. To prevent future conflicts, pick a flag name such as {{-d 
-agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=n}} or, to be 
consistent with the existing {{--config}} and {{--site}} options, maybe 
{{--debug -agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=n}} 
(two dashes).

2. Since all we are doing is passing a flag to the JVM, maybe generalize your 
flag: {{--java -}}.

3. Even this is not entirely necessary. You can just do the following:

{code}
export 
DRILL_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=n
drillbit.sh start
{code}

To make this even easier, put it in a script so you don't have to either a) 
keep setting and unsetting the options, or b) remember the long flag to pass 
into the command line.

> Add debug startup option flag for drill in embedded and server mode
> ---
>
> Key: DRILL-6270
> URL: https://issues.apache.org/jira/browse/DRILL-6270
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Volodymyr Tkach
>Assignee: Anton Gozhiy
>Priority: Minor
>
> Add the possibility to run the sqlline.sh and drillbit.sh scripts with -- 
> with the standard Java remote debug options and the ability to override the port.
> Example: drillbit.sh start - 50001





[jira] [Commented] (DRILL-6262) IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405691#comment-16405691
 ] 

ASF GitHub Bot commented on DRILL-6262:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1175
  
Thanks. See how confusing it is? I wrote the darn thing originally and even 
I can't keep the names straight... :-)


> IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector
> -
>
> Key: DRILL-6262
> URL: https://issues.apache.org/jira/browse/DRILL-6262
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> ColumnSize inside RecordBatchSizer throws an IndexOutOfBoundException while 
> computing the totalDataSize for a VariableWidthVector when the underlying 
> vector is empty, without any allocated memory.
> This happens because totalDataSize is computed using the offsetVector value 
> at index n, where n is the total number of records in the vector. When the 
> vector is empty, n=0 and the offsetVector drillbuf is empty as well, so the 
> exception is thrown while retrieving the value at index 0 from the 
> offsetVector. 





[jira] [Commented] (DRILL-6262) IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405661#comment-16405661
 ] 

ASF GitHub Bot commented on DRILL-6262:
---

Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/1175#discussion_r175629518
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java ---
@@ -321,10 +321,8 @@ public ColumnSize(ValueVector v, String prefix) {
 
       // Calculate pure data size.
       if (isVariableWidth) {
-        UInt4Vector offsetVector = ((RepeatedValueVector) v).getOffsetVector();
-        int innerValueCount = offsetVector.getAccessor().get(valueCount);
         VariableWidthVector dataVector = ((VariableWidthVector) ((RepeatedValueVector) v).getDataVector());
-        totalDataSize = dataVector.getOffsetVector().getAccessor().get(innerValueCount);
+        totalDataSize = dataVector.getCurrentSizeInBytes();
--- End diff --

@paul-rogers - I don't think `totalDataSize` includes both the offset vector 
size and the data bytes. It was meant to include only the **pure data size** 
for all entries in that column, which is what the comment suggests as well.

Instead, `totalNetSize` includes the size of both the data and the offset 
vector, and is used for computing the rowWidth.


> IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector
> -
>
> Key: DRILL-6262
> URL: https://issues.apache.org/jira/browse/DRILL-6262
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> ColumnSize inside RecordBatchSizer throws an IndexOutOfBoundException while 
> computing the totalDataSize for a VariableWidthVector when the underlying 
> vector is empty, without any allocated memory.
> This happens because totalDataSize is computed using the offsetVector value 
> at index n, where n is the total number of records in the vector. When the 
> vector is empty, n=0 and the offsetVector drillbuf is empty as well, so the 
> exception is thrown while retrieving the value at index 0 from the 
> offsetVector. 





[jira] [Commented] (DRILL-6262) IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405603#comment-16405603
 ] 

ASF GitHub Bot commented on DRILL-6262:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1175#discussion_r175619764
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java ---
@@ -321,10 +321,8 @@ public ColumnSize(ValueVector v, String prefix) {
 
       // Calculate pure data size.
      if (isVariableWidth) {
-        UInt4Vector offsetVector = ((RepeatedValueVector) v).getOffsetVector();
-        int innerValueCount = offsetVector.getAccessor().get(valueCount);
         VariableWidthVector dataVector = ((VariableWidthVector) ((RepeatedValueVector) v).getDataVector());
-        totalDataSize = dataVector.getOffsetVector().getAccessor().get(innerValueCount);
+        totalDataSize = dataVector.getCurrentSizeInBytes();
--- End diff --

Good improvement. The original code exposes far too much of the 
implementation.

After all these changes, does the "dataSize" include both the offset vector 
and bytes? It should, else calls will be wrong. There are supposed to be three 
sizes:

* Payload size: actual data bytes.
* Data size: data + offsets + bits
* Overall size: full length of all vectors.

Payload size is what the user sees. Data size is how we calculate row width 
(since the rows must contain the overhead bytes). Vector length, here, only 
helps compute density, but is generated elsewhere. The point is, keep all three 
in mind, but keep the code separate. Otherwise, it is *very* easy to get 
confused and have the calculations blow up...


> IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector
> -
>
> Key: DRILL-6262
> URL: https://issues.apache.org/jira/browse/DRILL-6262
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> ColumnSize inside RecordBatchSizer throws an IndexOutOfBoundException while 
> computing the totalDataSize for a VariableWidthVector when the underlying 
> vector is empty, without any allocated memory.
> This happens because totalDataSize is computed using the offsetVector value 
> at index n, where n is the total number of records in the vector. When the 
> vector is empty, n=0 and the offsetVector drillbuf is empty as well, so the 
> exception is thrown while retrieving the value at index 0 from the 
> offsetVector. 





[jira] [Commented] (DRILL-4587) Document Drillbit launch options

2018-03-19 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405591#comment-16405591
 ] 

Paul Rogers commented on DRILL-4587:


For item 1: there is no overlap between {{drill-env.sh}} (bootstrap) and 
{{drill-override.conf}} (startup) properties. Properties are in 
{{drill-override.conf}} unless they must be set before Java starts.

Better to say we have three systems:

1. {{drill-env.sh}} and environment.
2. {{drill-override.conf}}, and -Dprop=value
3. System/session properties set in SQL.

There is some confusion in the original description around systems 2 and 3. In 
general, a property is settable ONLY in the startup property system OR the 
system/session properties set in SQL.

Now, it is true that the defaults for system/session properties exist in our 
internal startup {{drill-module.conf}} system. But, this is ONLY for our own 
use. Users should NEVER set properties this way. Instead, use {{ALTER SYSTEM}} 
so that the property is set in ZK and propagated to all Drillbits.

As a result, the three systems are completely independent.

To revise the original item 3 ({{$DRILL_HOME/conf/drill-env.sh}}): the user 
can (should!) also use a site directory:

{code:java}
drillbit.sh --site $DRILL_SITE
{code}

Where the config files are in the separate site directory. [Drill-on-YARN 
explains the site 
directory|https://github.com/apache/drill/blob/master/drill-yarn/USAGE.md].
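
For reference, a minimal site directory might look like this (a sketch; the 
two files correspond to the first two systems above):

{code}
$DRILL_SITE/
  drill-env.sh          # environment settings (the bootstrap system)
  drill-override.conf   # startup/runtime properties
{code}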

To document the variables, see {{drill-env.sh}}, which has explanations for 
all supported variables (or at least those as of 1.8; if others were added 
later, comments should be added).

The following is technically correct, but represents unsupported (legacy) usage:

bq. Drill startup properties can be set in a number of locations. Those listed 
later take precedence over those listed earlier.
bq. 1. drill-override.conf as identified by DRILL_CONF_DIR or its default.

Actually, {{drill-override.conf}} as determined by {{$DRILL_CONF_DIR}}, the 
{{--site site/dir}} option, or the default location of {{$DRILL_HOME/conf}}. 
But, note that the properties here are *distinct* from those described below, 
so they do not overlap. See note above. Grouping them together actually adds to 
confusion.

bq. 2. Set in the environment using DRILL_JAVA_OPTS or DRILL_DRILLBIT_JAVA_OPTS.

Not supported. Use these options ONLY for additional JVM options. 
{{drill-config.sh}} does some fancy processing of options that cannot be done 
if the options are hidden in one of the above.

bq. 3. Set in drill-env.sh using the above two variables.

Preferred solution.
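
For example, a drill-env.sh entry might look like this (a sketch; the JDWP 
string is only illustrative, and per its name the variable should affect the 
Drillbit only):

{code}
# In drill-env.sh: extra JVM options for the Drillbit.
export DRILL_DRILLBIT_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=n"
{code}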

bq. 4. Set on the drillbit.sh command line as explained above. (Drill 1.7 and 
later.)

See the confusion? The command line only works for the values in item 1 above, 
not for those in 3.

5. As environment variables.

This is the ultimate override for items from 4. DoY uses this to pass values 
from the DoY config file, through YARN, into the YARN Node Manager and on into 
Drill.

Suggestion: reword the description (or submit a revised note) that separates 
the three systems, then documents the alternatives for each.

> Document Drillbit launch options
> 
>
> Key: DRILL-4587
> URL: https://issues.apache.org/jira/browse/DRILL-4587
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Paul Rogers
>Assignee: Bridget Bevens
>Priority: Major
> Fix For: 1.14.0
>
>
> Drill provides the drillbit.sh script to launch Drill. When Drill is run in 
> production environments, or when managed by a tool such as Mesos or YARN, 
> customers have many options to customize the launch options. We should 
> document this information as below.
> The user can configure Drill launch in one of four ways, depending on their 
> needs.
> 1. Using the properties in drill-override.conf. Sets only startup and runtime 
> properties. All drillbits should use a copy of the file so that properties 
> set here apply to all drillbits and to client applications.
> 2. By setting environment variables prior to launching Drill. See the list 
> below. Use this to customize properties per drill-bit, such as for setting 
> port numbers. This option is useful when launching Drill from a tool such as 
> Mesos or YARN.
> 3. By setting environment variables in $DRILL_HOME/conf/drill-env.sh. See the 
> list below. This script is intended to be unique to each node and is another 
> way to customize properties for this one node.
> 4. In Drill 1.7 and later, the administrator can set Drill configuration 
> options directly on the launch command as shown below. This option is also 
> useful when launching Drill from a tool such as YARN or Mesos. Options are of 
> the form:
> $ drillbit.sh start -Dvariable=value
> For example, to control the HTTP port:
> $ drillbit.sh start -Ddrill.exec.http.port=8099 

[jira] [Updated] (DRILL-3855) Enable FilterSetOpTransposeRule, DrillProjectSetOpTransposeRule

2018-03-19 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-3855:

Fix Version/s: (was: Future)
   1.14.0

> Enable FilterSetOpTransposeRule, DrillProjectSetOpTransposeRule
> ---
>
> Key: DRILL-3855
> URL: https://issues.apache.org/jira/browse/DRILL-3855
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Reporter: Sean Hsuan-Yi Chu
>Assignee: Jinfeng Ni
>Priority: Major
> Fix For: 1.14.0
>
>
> Because of the infinite planning issue (Calcite Volcano Planner: CALCITE-900) 
> reported in DRILL-3257, FilterSetOpTransposeRule and 
> DrillProjectSetOpTransposeRule were disabled. Once it is resolved in Calcite, 
> these two rules should be re-enabled to improve performance. 
> In addition, the plan validation in the unit tests will be updated in 
> response to the newly enabled rules. 





[jira] [Assigned] (DRILL-3855) Enable FilterSetOpTransposeRule, DrillProjectSetOpTransposeRule

2018-03-19 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua reassigned DRILL-3855:
---

Assignee: Aman Sinha  (was: Jinfeng Ni)

> Enable FilterSetOpTransposeRule, DrillProjectSetOpTransposeRule
> ---
>
> Key: DRILL-3855
> URL: https://issues.apache.org/jira/browse/DRILL-3855
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Reporter: Sean Hsuan-Yi Chu
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.14.0
>
>
> Because of the infinite planning issue (Calcite Volcano Planner: CALCITE-900) 
> reported in DRILL-3257, FilterSetOpTransposeRule and 
> DrillProjectSetOpTransposeRule were disabled. Once it is resolved in Calcite, 
> these two rules should be re-enabled to improve performance. 
> In addition, the plan validation in the unit tests will be updated in 
> response to the newly enabled rules. 





[jira] [Commented] (DRILL-6275) drillbit direct_current memory usage is not populated/updated

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405560#comment-16405560
 ] 

ASF GitHub Bot commented on DRILL-6275:
---

Github user kkhatua commented on the issue:

https://github.com/apache/drill/pull/1176
  
LGTM. Verified on a 4-node setup with running queries.
+1


> drillbit direct_current memory usage is not populated/updated
> -
>
> Key: DRILL-6275
> URL: https://issues.apache.org/jira/browse/DRILL-6275
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.13.0
>Reporter: Chun Chang
>Assignee: Timothy Farkas
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> We used to keep track of Drill memory usage in sys.memory, which was useful 
> in detecting memory leaks. This feature seems broken: the direct_current 
> memory usage is not populated or updated.
> {noformat}
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.memory;
> | hostname      | user_port  | heap_current  | heap_max    | direct_current  | jvm_direct_current  | direct_max   |
> | 10.10.30.168  | 31010      | 1162636800    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.169  | 31010      | 1301175040    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.166  | 31010      | 989448872     | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.167  | 31010      | 1767205312    | 2147483648  | 0               | 22096               | 10737418240  |
> 4 rows selected (1.564 seconds)
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.version;
> | version          | commit_id                                 | commit_message                                        | commit_time                | build_email         | build_time                 |
> | 1.13.0-SNAPSHOT  | 534212456cc25a49272838cba91c223f63df7fd2  | Cleanup when closing, and cleanup spill after a kill  | 07.03.2018 @ 16:18:27 PST  | inram...@gmail.com  | 08.03.2018 @ 10:09:28 PST  |
> {noformat}





[jira] [Updated] (DRILL-4587) Document Drillbit launch options

2018-03-19 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-4587:

Fix Version/s: 1.14.0

> Document Drillbit launch options
> 
>
> Key: DRILL-4587
> URL: https://issues.apache.org/jira/browse/DRILL-4587
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Paul Rogers
>Assignee: Bridget Bevens
>Priority: Major
> Fix For: 1.14.0
>
>
> Drill provides the drillbit.sh script to launch Drill. When Drill is run in 
> production environments, or when managed by a tool such as Mesos or YARN, 
> customers have many options to customize the launch options. We should 
> document this information as below.
> The user can configure Drill launch in one of four ways, depending on their 
> needs.
> 1. Using the properties in drill-override.conf. Sets only startup and runtime 
> properties. All drillbits should use a copy of the file so that properties 
> set here apply to all drillbits and to client applications.
> 2. By setting environment variables prior to launching Drill. See the list 
> below. Use this to customize properties per drill-bit, such as for setting 
> port numbers. This option is useful when launching Drill from a tool such as 
> Mesos or YARN.
> 3. By setting environment variables in $DRILL_HOME/conf/drill-env.sh. See the 
> list below. This script is intended to be unique to each node and is another 
> way to customize properties for this one node.
> 4. In Drill 1.7 and later, the administrator can set Drill configuration 
> options directly on the launch command as shown below. This option is also 
> useful when launching Drill from a tool such as YARN or Mesos. Options are of 
> the form:
> $ drillbit.sh start -Dvariable=value
> For example, to control the HTTP port:
> $ drillbit.sh start -Ddrill.exec.http.port=8099 
> Properties are of three types.
> 1. Launch-only properties: those that can be set only through environment 
> variables (such as JAVA_HOME.)
> 2. Drill startup properties which can be set in the locations detailed below.
> 3. Drill runtime properties which are set in drill-override.conf also via SQL.
> Drill startup properties can be set in a number of locations. Those listed 
> later take precedence over those listed earlier.
> 1. drill-override.conf as identified by DRILL_CONF_DIR or its default.
> 2. Set in the environment using DRILL_JAVA_OPTS or DRILL_DRILLBIT_JAVA_OPTS.
> 3. Set in drill-env.sh using the above two variables.
> 4. Set on the drillbit.sh command line as explained above. (Drill 1.7 and 
> later.)
> You can see the actual set of properties used (from items 2-3 above) by using 
> the "debug" command (Drill 1.7 or later):
> $ drillbit.sh debug





[jira] [Assigned] (DRILL-6016) Error reading INT96 created by Apache Spark

2018-03-19 Thread Pritesh Maker (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker reassigned DRILL-6016:


Assignee: Rahul Raj

> Error reading INT96 created by Apache Spark
> ---
>
> Key: DRILL-6016
> URL: https://issues.apache.org/jira/browse/DRILL-6016
> Project: Apache Drill
>  Issue Type: Bug
> Environment: Drill 1.11
>Reporter: Rahul Raj
>Assignee: Rahul Raj
>Priority: Major
> Fix For: 1.14.0
>
>
> Hi,
> I am getting the error SYSTEM ERROR: ClassCastException: 
> org.apache.drill.exec.vector.TimeStampVector cannot be cast to 
> org.apache.drill.exec.vector.VariableWidthVector while trying to read a Spark 
> INT96 datetime field on Drill 1.11, in spite of setting the property 
> store.parquet.reader.int96_as_timestamp to true.
> I believe this was fixed in Drill 1.10 
> (https://issues.apache.org/jira/browse/DRILL-4373). What could be wrong?
> I have attached the dataset at 
> https://github.com/rajrahul/files/blob/master/result.tar.gz





[jira] [Updated] (DRILL-5937) prepare.statement.create_timeout_ms default is 10 seconds but code comment says default should be 10 mins

2018-03-19 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-5937:

Fix Version/s: 1.14.0

>  prepare.statement.create_timeout_ms default is 10 seconds but code comment 
> says default should be 10 mins
> --
>
> Key: DRILL-5937
> URL: https://issues.apache.org/jira/browse/DRILL-5937
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.8.0
>Reporter: Pushpendra Jaiswal
>Priority: Major
> Fix For: 1.14.0
>
>
> prepare.statement.create_timeout_ms default is 10 seconds but the code 
> comment says the default should be 10 mins.
> The value is by default set to 10000 ms: 
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java#L526
>  
> /**
>  * Timeout for create prepare statement request. If the request exceeds 
>  * this timeout, then the request is timed out.
>  * Default value is 10 mins.
>  */
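> A quick way to confirm the effective default is to query sys.options (a 
> sketch; the ZooKeeper address is a placeholder):
> {code}
> echo "select * from sys.options where name = 'prepare.statement.create_timeout_ms';" > check.sql
> $DRILL_HOME/bin/sqlline -u jdbc:drill:zk=zkhost:2181 --run=check.sql
> {code}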





[jira] [Updated] (DRILL-6016) Error reading INT96 created by Apache Spark

2018-03-19 Thread Pritesh Maker (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6016:
-
Reviewer: Parth Chandra

> Error reading INT96 created by Apache Spark
> ---
>
> Key: DRILL-6016
> URL: https://issues.apache.org/jira/browse/DRILL-6016
> Project: Apache Drill
>  Issue Type: Bug
> Environment: Drill 1.11
>Reporter: Rahul Raj
>Priority: Major
> Fix For: 1.14.0
>
>
> Hi,
> I am getting the error SYSTEM ERROR: ClassCastException: 
> org.apache.drill.exec.vector.TimeStampVector cannot be cast to 
> org.apache.drill.exec.vector.VariableWidthVector while trying to read a Spark 
> INT96 datetime field on Drill 1.11, in spite of setting the property 
> store.parquet.reader.int96_as_timestamp to true.
> I believe this was fixed in Drill 1.10 
> (https://issues.apache.org/jira/browse/DRILL-4373). What could be wrong?
> I have attached the dataset at 
> https://github.com/rajrahul/files/blob/master/result.tar.gz





[jira] [Updated] (DRILL-6275) drillbit direct_current memory usage is not populated/updated

2018-03-19 Thread Pritesh Maker (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6275:
-
Labels: ready-to-commit  (was: )

> drillbit direct_current memory usage is not populated/updated
> -
>
> Key: DRILL-6275
> URL: https://issues.apache.org/jira/browse/DRILL-6275
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.13.0
>Reporter: Chun Chang
>Assignee: Timothy Farkas
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> We used to keep track of Drill memory usage in sys.memory, which was useful 
> in detecting memory leaks. This feature seems broken: the direct_current 
> memory usage is not populated or updated.
> {noformat}
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.memory;
> | hostname      | user_port  | heap_current  | heap_max    | direct_current  | jvm_direct_current  | direct_max   |
> | 10.10.30.168  | 31010      | 1162636800    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.169  | 31010      | 1301175040    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.166  | 31010      | 989448872     | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.167  | 31010      | 1767205312    | 2147483648  | 0               | 22096               | 10737418240  |
> 4 rows selected (1.564 seconds)
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.version;
> | version          | commit_id                                 | commit_message                                        | commit_time                | build_email         | build_time                 |
> | 1.13.0-SNAPSHOT  | 534212456cc25a49272838cba91c223f63df7fd2  | Cleanup when closing, and cleanup spill after a kill  | 07.03.2018 @ 16:18:27 PST  | inram...@gmail.com  | 08.03.2018 @ 10:09:28 PST  |
> {noformat}





[jira] [Updated] (DRILL-5944) Single corrupt compressed json file (in s3) causes query failure

2018-03-19 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-5944:

Description: 
I am running a CTAS query on an s3 bucket with 500 compressed json files, with 
output as parquet.

Query ran from command line:
 /opt/drill/apache-drill-1.11.0/bin/sqlline --verbose=true --showWarnings=true --showNestedErrs=true --force=true --run=therm.sql -u jdbc:drill:zk=k8s-drill:2181

therm.sql:
{code:sql}
 use `s3`.`drill-output`;  -- s3 points to the kairos bucket
 alter session set `store.format`='parquet';
 ALTER SESSION SET `store.json.all_text_mode` = true;
 create table temps_bucket0 as select t.id, t.`value` as temp, 
to_timestamp(cast(substr(t.`timestamp`,1,10) as int)) as ts, t.device_id from 
`s3`.`bucket=0/` as t where cast(t.`timestamp` as int) > 147528 and 
cast(t.`timestamp` as int) < 1491004799;
{code}

Drill ran for 17 min 50.246 sec and managed to write approx. 100M records, then 
failed with the message below. I tried to download and uncompress the file 
manually and it is corrupt. Ideally, Drill should log but skip the corrupt 
file.
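
For reference, the corrupt member can be confirmed locally after downloading it 
from s3 (a sketch; the bucket name follows the note in therm.sql and the object 
key comes from the error below):

{code}
aws s3 cp s3://kairos/bucket=0/190273.json.gz .
gzip -t 190273.json.gz && echo OK || echo CORRUPT
{code}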

{code:java}
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('d' (code 
100)): was expecting comma to separate OBJECT entries

File /bucket=0/190273.json.gz
 Record 3654965
 Column 37
 Fragment 1:0

[Error Id: 9458cb2c-d0a4-4b66-9b65-4e8015e2ca97 on 10.75.186.7 :31010] 
(state=,code=0)
 java.sql.SQLException: DATA_READ ERROR: Error parsing JSON - Unexpected 
character ('d' (code 100)): was expecting comma to separate OBJECT entries

File /bucket=0/190273.json.gz
 Record 3654965
 Column 37
 Fragment 1:0

[Error Id: 9458cb2c-d0a4-4b66-9b65-4e8015e2ca97 on 10.75.186.7 :31010]
 at 
org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:489)
 at org.apache.drill.jdbc.impl.DrillCursor.next(DrillCursor.java:593)
 at org.apache.calcite.avatica.AvaticaResultSet.next(AvaticaResultSet.java:215)
 at 
org.apache.drill.jdbc.impl.DrillResultSetImpl.next(DrillResultSetImpl.java:140)
 at sqlline.IncrementalRows.hasNext(IncrementalRows.java:62)
 at 
sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
 at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
 at sqlline.SqlLine.print(SqlLine.java:1593)
 at sqlline.Commands.execute(Commands.java:852)
 at sqlline.Commands.sql(Commands.java:751)
 at sqlline.SqlLine.dispatch(SqlLine.java:746)
 at sqlline.SqlLine.runCommands(SqlLine.java:1651)
 at sqlline.Commands.run(Commands.java:1304)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36)
 at sqlline.SqlLine.dispatch(SqlLine.java:742)
 at sqlline.SqlLine.initArgs(SqlLine.java:553)
 at sqlline.SqlLine.begin(SqlLine.java:596)
 at sqlline.SqlLine.start(SqlLine.java:375)
 at sqlline.SqlLine.main(SqlLine.java:268)
 Caused by: org.apache.drill.common.exceptions.UserRemoteException: DATA_READ 
ERROR: Error parsing JSON - Unexpected character ('d' (code 100)): was 
expecting comma to separate OBJECT entries

File /bucket=0/190273.json.gz
 Record 3654965
 Column 37
 Fragment 1:0

[Error Id: 9458cb2c-d0a4-4b66-9b65-4e8015e2ca97 on 10.75.186.7 :31010]
 at 
org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:123)
 at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:368)
 at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:90)
 at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:274)
 at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:244)
 at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
 at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
 at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
 at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
 at 

[jira] [Closed] (DRILL-6021) Show shutdown button when authentication is not enabled

2018-03-19 Thread Krystal (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krystal closed DRILL-6021.
--

Verified that bug is fixed.

> Show shutdown button when authentication is not enabled
> ---
>
> Key: DRILL-6021
> URL: https://issues.apache.org/jira/browse/DRILL-6021
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: Arina Ielchiieva
>Assignee: Venkata Jyothsna Donapati
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.13.0
>
>
> After DRILL-6017, {{shouldShowAdminInfo}} is used to decide whether the 
> shutdown button should be displayed on the index page. But this option is set 
> to true only when authentication is enabled and the user is an admin. When 
> authentication is not enabled, the user is an admin by default. So with this 
> fix, without authentication the shutdown button is absent but should be present.





[jira] [Closed] (DRILL-6044) Shutdown button does not work from WebUI

2018-03-19 Thread Krystal (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krystal closed DRILL-6044.
--

Verified that bug is fixed.

> Shutdown button does not work from WebUI
> 
>
> Key: DRILL-6044
> URL: https://issues.apache.org/jira/browse/DRILL-6044
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - HTTP
>Affects Versions: 1.13.0
>Reporter: Krystal
>Assignee: Venkata Jyothsna Donapati
>Priority: Critical
>  Labels: ready-to-commit
> Fix For: 1.13.0
>
> Attachments: Screen Shot 2017-12-19 at 10.51.16 AM.png
>
>
> git.commit.id.abbrev=eb0c403
> Nothing happens when clicking on the SHUTDOWN button from the WebUI.  The 
> browser's debugger showed that the request failed due to access control 
> checks (see attached screen shot).





[jira] [Commented] (DRILL-6243) Alert box to confirm shutdown of drillbit after clicking shutdown button

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405513#comment-16405513
 ] 

ASF GitHub Bot commented on DRILL-6243:
---

Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/1169#discussion_r175598697
  
--- Diff: exec/java-exec/src/main/resources/rest/index.ftl ---
@@ -272,17 +272,19 @@
       }
     <#if model.shouldShowAdminInfo() || !model.isAuthEnabled()>
       function shutdown(button) {
-      var requestPath = "/gracefulShutdown";
-      var url = getRequestUrl(requestPath);
-      var result = $.ajax({
-        type: 'POST',
-        url: url,
-        contentType : 'text/plain',
-        complete: function(data) {
-          alert(data.responseJSON["response"]);
-          button.prop('disabled',true).css('opacity',0.5);
-        }
-      });
+      if (confirm("Click ok to shutdown")) {
--- End diff --

The message should be more like `"Are you sure you want to shutdown the 
Drillbit running on " + location.host + " node?"`


> Alert box to confirm shutdown of drillbit after clicking shutdown button 
> -
>
> Key: DRILL-6243
> URL: https://issues.apache.org/jira/browse/DRILL-6243
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Venkata Jyothsna Donapati
>Assignee: Venkata Jyothsna Donapati
>Priority: Minor
> Fix For: 1.14.0
>
>






[jira] [Commented] (DRILL-6243) Alert box to confirm shutdown of drillbit after clicking shutdown button

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405509#comment-16405509
 ] 

ASF GitHub Bot commented on DRILL-6243:
---

Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/1169#discussion_r175598108
  
--- Diff: exec/java-exec/src/main/resources/rest/index.ftl ---
@@ -272,17 +272,19 @@
       }
     <#if model.shouldShowAdminInfo() || !model.isAuthEnabled()>
       function shutdown(button) {
-      var requestPath = "/gracefulShutdown";
-      var url = getRequestUrl(requestPath);
-      var result = $.ajax({
-        type: 'POST',
-        url: url,
-        contentType : 'text/plain',
-        complete: function(data) {
-          alert(data.responseJSON["response"]);
-          button.prop('disabled',true).css('opacity',0.5);
-        }
-      });
+      if (confirm("Click ok to shutdown")) {
+          var requestPath = "/gracefulShutdown";
+          var url = getRequestUrl(requestPath);
+          var result = $.ajax({
+            type: 'POST',
+            url: url,
+            contentType : 'text/plain',
+            complete: function(data) {
+              alert(data.responseJSON["response"]);
+              button.prop('disabled',true).css('opacity',0.5);
+            }
--- End diff --

Please fix the indentation here and below. Also add the `error:` callback for 
the Ajax request, e.g. an alert with the received error?


> Alert box to confirm shutdown of drillbit after clicking shutdown button 
> -
>
> Key: DRILL-6243
> URL: https://issues.apache.org/jira/browse/DRILL-6243
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Venkata Jyothsna Donapati
>Assignee: Venkata Jyothsna Donapati
>Priority: Minor
> Fix For: 1.14.0
>
>






[jira] [Updated] (DRILL-6262) IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector

2018-03-19 Thread Sorabh Hamirwasia (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sorabh Hamirwasia updated DRILL-6262:
-
Labels: ready-to-commit  (was: )

> IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector
> -
>
> Key: DRILL-6262
> URL: https://issues.apache.org/jira/browse/DRILL-6262
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> ColumnSize inside RecordBatchSizer throws an IndexOutOfBoundException while 
> computing the totalDataSize for a VariableWidthVector when the underlying 
> vector is empty, without any allocated memory.
> This happens because totalDataSize is computed using the offsetVector value 
> at index n, where n is the total number of records in the vector. When the 
> vector is empty, n=0 and the offsetVector drillbuf is empty as well, so the 
> exception is thrown while retrieving the value at index 0 from the 
> offsetVector. 





[jira] [Updated] (DRILL-6275) drillbit direct_current memory usage is not populated/updated

2018-03-19 Thread Pritesh Maker (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6275:
-
Fix Version/s: 1.14.0

> drillbit direct_current memory usage is not populated/updated
> -
>
> Key: DRILL-6275
> URL: https://issues.apache.org/jira/browse/DRILL-6275
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.13.0
>Reporter: Chun Chang
>Assignee: Timothy Farkas
>Priority: Minor
> Fix For: 1.14.0
>
>
> We used to keep track of Drill memory usage in sys.memory, which was useful 
> in detecting memory leaks. This feature seems broken: the direct_current 
> memory usage is not populated or updated.
> {noformat}
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.memory;
> | hostname      | user_port  | heap_current  | heap_max    | direct_current  | jvm_direct_current  | direct_max   |
> | 10.10.30.168  | 31010      | 1162636800    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.169  | 31010      | 1301175040    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.166  | 31010      | 989448872     | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.167  | 31010      | 1767205312    | 2147483648  | 0               | 22096               | 10737418240  |
> 4 rows selected (1.564 seconds)
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.version;
> | version          | commit_id                                 | commit_message                                        | commit_time                | build_email         | build_time                 |
> | 1.13.0-SNAPSHOT  | 534212456cc25a49272838cba91c223f63df7fd2  | Cleanup when closing, and cleanup spill after a kill  | 07.03.2018 @ 16:18:27 PST  | inram...@gmail.com  | 08.03.2018 @ 10:09:28 PST  |
> {noformat}





[jira] [Comment Edited] (DRILL-6252) Foreman node is going down when the non foreman node is stopped

2018-03-19 Thread Venkata Jyothsna Donapati (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405472#comment-16405472
 ] 

Venkata Jyothsna Donapati edited comment on DRILL-6252 at 3/19/18 9:21 PM:
---

[~vrozov] I have attached the corresponding logs.


was (Author: vdonapati):
[~vrozov] Please look for attached logs.

> Foreman node is going down when the non foreman node is stopped
> ---
>
> Key: DRILL-6252
> URL: https://issues.apache.org/jira/browse/DRILL-6252
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Venkata Jyothsna Donapati
>Assignee: Vlad Rozov
>Priority: Major
> Fix For: 1.14.0
>
> Attachments: foreman_drillbit.log, nonforeman_drillbit.log
>
>
> Two drillbits are running. I'm running a join query over parquet and tried to 
> stop the non-foreman node using drillbit.sh stop. The query fails with 
> *"Error: DATA_READ ERROR: Exception occurred while reading from disk".* The 
> non-foreman node goes down. The foreman node also goes down. When I looked at 
> the drillbit.log of both foreman and non-foreman I found that there is a 
> memory leak: "Memory was leaked by query. Memory leaked: 
> (2097152)\nAllocator(op:2:0:0:HashPartitionSender) 
> 100/6291456/6832128/100 (res/actual/peak/limit)\n". Following are 
> the stack traces for the memory leaks: 
> {noformat} 
> [Error Id: 0d9a2799-7e97-46b3-953b-1f8d0dd87a04 on qa102-34.qa.lab:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IllegalStateException: Memory was leaked by query. Memory leaked: (3145728)
> Allocator(op:2:1:0:HashPartitionSender) 100/6291456/6291456/100 
> (res/actual/peak/limit)
>  
>  
> Fragment 2:1 
> [Error Id: 0d9a2799-7e97-46b3-953b-1f8d0dd87a04 on qa102-34.qa.lab:31010]
>         at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633)
>  ~[drill-common-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:297)
>  [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160)
>  [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:266)
>  [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>         at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_161]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_161]
>         at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
> Caused by: java.lang.IllegalStateException: Memory was leaked by query. 
> Memory leaked: (3145728)
> Allocator(op:2:1:0:HashPartitionSender) 100/6291456/6291456/100 
> (res/actual/peak/limit)
> {noformat} 
>  
> Ping me for the logs and more information.
>  





[jira] [Commented] (DRILL-6252) Foreman node is going down when the non foreman node is stopped

2018-03-19 Thread Venkata Jyothsna Donapati (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405472#comment-16405472
 ] 

Venkata Jyothsna Donapati commented on DRILL-6252:
--

[~vrozov] Please look for attached logs.

> Foreman node is going down when the non foreman node is stopped
> ---
>
> Key: DRILL-6252
> URL: https://issues.apache.org/jira/browse/DRILL-6252
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Venkata Jyothsna Donapati
>Assignee: Vlad Rozov
>Priority: Major
> Fix For: 1.14.0
>
> Attachments: foreman_drillbit.log, nonforeman_drillbit.log
>
>
> Two drillbits are running. I'm running a join query over parquet and tried to 
> stop the non-foreman node using drillbit.sh stop. The query fails with 
> *"Error: DATA_READ ERROR: Exception occurred while reading from disk".* The 
> non-foreman node goes down. The foreman node also goes down. When I looked at 
> the drillbit.log of both foreman and non-foreman I found that there is a 
> memory leak: "Memory was leaked by query. Memory leaked: 
> (2097152)\nAllocator(op:2:0:0:HashPartitionSender) 
> 100/6291456/6832128/100 (res/actual/peak/limit)\n". Following are 
> the stack traces for the memory leaks: 
> {noformat} 
> [Error Id: 0d9a2799-7e97-46b3-953b-1f8d0dd87a04 on qa102-34.qa.lab:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IllegalStateException: Memory was leaked by query. Memory leaked: (3145728)
> Allocator(op:2:1:0:HashPartitionSender) 100/6291456/6291456/100 
> (res/actual/peak/limit)
>  
>  
> Fragment 2:1 
> [Error Id: 0d9a2799-7e97-46b3-953b-1f8d0dd87a04 on qa102-34.qa.lab:31010]
>         at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633)
>  ~[drill-common-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:297)
>  [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160)
>  [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:266)
>  [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>         at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_161]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_161]
>         at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
> Caused by: java.lang.IllegalStateException: Memory was leaked by query. 
> Memory leaked: (3145728)
> Allocator(op:2:1:0:HashPartitionSender) 100/6291456/6291456/100 
> (res/actual/peak/limit)
> {noformat} 
>  
> Ping me for the logs and more information.
>  





[jira] [Commented] (DRILL-6262) IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405468#comment-16405468
 ] 

ASF GitHub Bot commented on DRILL-6262:
---

Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1175
  
LGTM. +1.


> IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector
> -
>
> Key: DRILL-6262
> URL: https://issues.apache.org/jira/browse/DRILL-6262
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
> Fix For: 1.14.0
>
>
> ColumnSize inside RecordBatchSizer throws an IndexOutOfBoundException while 
> computing the totalDataSize for a VariableWidthVector when the underlying 
> vector is empty, without any allocated memory.
> This happens because totalDataSize is computed using the offsetVector value 
> at index n, where n is the total number of records in the vector. When the 
> vector is empty, n=0 and the offsetVector drillbuf is empty as well, so the 
> exception is thrown while retrieving the value at index 0 from the 
> offsetVector. 





[jira] [Commented] (DRILL-6275) drillbit direct_current memory usage is not populated/updated

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405460#comment-16405460
 ] 

ASF GitHub Bot commented on DRILL-6275:
---

Github user ilooner commented on the issue:

https://github.com/apache/drill/pull/1176
  
Tested fix manually on my laptop.


> drillbit direct_current memory usage is not populated/updated
> -
>
> Key: DRILL-6275
> URL: https://issues.apache.org/jira/browse/DRILL-6275
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.13.0
>Reporter: Chun Chang
>Assignee: Timothy Farkas
>Priority: Minor
>
> We used to keep track of Drill memory usage in sys.memory, which was useful 
> in detecting memory leaks. This feature seems broken: the direct_current 
> memory usage is not populated or updated.
> {noformat}
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.memory;
> | hostname      | user_port  | heap_current  | heap_max    | direct_current  | jvm_direct_current  | direct_max   |
> | 10.10.30.168  | 31010      | 1162636800    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.169  | 31010      | 1301175040    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.166  | 31010      | 989448872     | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.167  | 31010      | 1767205312    | 2147483648  | 0               | 22096               | 10737418240  |
> 4 rows selected (1.564 seconds)
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.version;
> | version          | commit_id                                 | commit_message                                        | commit_time                | build_email         | build_time                 |
> | 1.13.0-SNAPSHOT  | 534212456cc25a49272838cba91c223f63df7fd2  | Cleanup when closing, and cleanup spill after a kill  | 07.03.2018 @ 16:18:27 PST  | inram...@gmail.com  | 08.03.2018 @ 10:09:28 PST  |
> {noformat}





[jira] [Commented] (DRILL-6275) drillbit direct_current memory usage is not populated/updated

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405458#comment-16405458
 ] 

ASF GitHub Bot commented on DRILL-6275:
---

GitHub user ilooner opened a pull request:

https://github.com/apache/drill/pull/1176

DRILL-6275: Fixed direct memory reporting in sys.memory.

@kkhatua Thanks for pinpointing the root cause! Please review. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ilooner/drill DRILL-6275

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1176.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1176


commit 7f65b7d4b4b9e42dc3597ac9758c39c6ce0903b7
Author: Timothy Farkas 
Date:   2018-03-19T20:16:37Z

DRILL-6275: Fixed direct memory reporting in sys.memory.




> drillbit direct_current memory usage is not populated/updated
> -
>
> Key: DRILL-6275
> URL: https://issues.apache.org/jira/browse/DRILL-6275
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.13.0
>Reporter: Chun Chang
>Assignee: Timothy Farkas
>Priority: Minor
>
> We used to keep track of Drill memory usage in sys.memory, which was useful 
> in detecting memory leaks. This feature seems broken: the direct_current 
> memory usage is not populated or updated.
> {noformat}
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.memory;
> | hostname      | user_port  | heap_current  | heap_max    | direct_current  | jvm_direct_current  | direct_max   |
> | 10.10.30.168  | 31010      | 1162636800    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.169  | 31010      | 1301175040    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.166  | 31010      | 989448872     | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.167  | 31010      | 1767205312    | 2147483648  | 0               | 22096               | 10737418240  |
> 4 rows selected (1.564 seconds)
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.version;
> | version          | commit_id                                 | commit_message                                        | commit_time                | build_email         | build_time                 |
> | 1.13.0-SNAPSHOT  | 534212456cc25a49272838cba91c223f63df7fd2  | Cleanup when closing, and cleanup spill after a kill  | 07.03.2018 @ 16:18:27 PST  | inram...@gmail.com  | 08.03.2018 @ 10:09:28 PST  |
> {noformat}





[jira] [Assigned] (DRILL-6276) Drill CTAS creates parquet file having page greater than 200 MB.

2018-03-19 Thread Robert Hou (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Hou reassigned DRILL-6276:
-

Assignee: Pritesh Maker

> Drill CTAS creates parquet file having page greater than 200 MB.
> 
>
> Key: DRILL-6276
> URL: https://issues.apache.org/jira/browse/DRILL-6276
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.13.0
>Reporter: Robert Hou
>Assignee: Pritesh Maker
>Priority: Major
> Attachments: alltypes_asc_16MB.json
>
>
> I used this CTAS to create a parquet file from a json file:
> {noformat}
> create table `alltypes.parquet` as select cast(BigIntValue as BigInt) 
> BigIntValue, cast(BooleanValue as Boolean) BooleanValue, cast (DateValue as 
> Date) DateValue, cast (FloatValue as Float) FloatValue, cast (DoubleValue as 
> Double) DoubleValue, cast (IntegerValue as Integer) IntegerValue, cast 
> (TimeValue as Time) TimeValue, cast (TimestampValue as Timestamp) 
> TimestampValue, cast (IntervalYearValue as INTERVAL YEAR) IntervalYearValue, 
> cast (IntervalDayValue as INTERVAL DAY) IntervalDayValue, cast 
> (IntervalSecondValue as INTERVAL SECOND) IntervalSecondValue, cast 
> (BinaryValue as binary) Binaryvalue, cast (VarcharValue as varchar) 
> VarcharValue from `alltypes.json`;
> {noformat}
> I ran parquet-tools/parquet-dump:
> VarcharValue TV=6885 RL=0 DL=1
> 
> 
> page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:17240317 VC:6885
> The page size is 16 MB; this is with a 16 MB data set. When I try a similar 
> 1 GB data set, the page size starts at over 200 MB and decreases to 1 MB.
> VarcharValue TV=208513 RL=0 DL=1
> 
> 
> page 0:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
> page 1:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
> page 2:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:52501154 VC:21859
> page 3:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:27725498 VC:10930
> page 4:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:12181241 VC:5466
> page 5:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:11005971 VC:2734
> page 6:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1133237 VC:1797
> page 7:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1462803 VC:899
> page 8:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050967 VC:490
> page 9:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1051603 VC:424
> page 10:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050919 VC:378
> page 11:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050487 VC:345
> page 12:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050783 VC:319
> page 13:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1052303 VC:299
> page 14:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1053235 VC:282
> page 15:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268
> The column is a varchar, and its size varies from 2 bytes to 5000 bytes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6276) Drill CTAS creates parquet file having page greater than 200 MB.

2018-03-19 Thread Robert Hou (JIRA)
Robert Hou created DRILL-6276:
-

 Summary: Drill CTAS creates parquet file having page greater than 
200 MB.
 Key: DRILL-6276
 URL: https://issues.apache.org/jira/browse/DRILL-6276
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Affects Versions: 1.13.0
Reporter: Robert Hou
 Attachments: alltypes_asc_16MB.json

I used this CTAS to create a parquet file from a json file:
{noformat}
create table `alltypes.parquet` as select cast(BigIntValue as BigInt) 
BigIntValue, cast(BooleanValue as Boolean) BooleanValue, cast (DateValue as 
Date) DateValue, cast (FloatValue as Float) FloatValue, cast (DoubleValue as 
Double) DoubleValue, cast (IntegerValue as Integer) IntegerValue, cast 
(TimeValue as Time) TimeValue, cast (TimestampValue as Timestamp) 
TimestampValue, cast (IntervalYearValue as INTERVAL YEAR) IntervalYearValue, 
cast (IntervalDayValue as INTERVAL DAY) IntervalDayValue, cast 
(IntervalSecondValue as INTERVAL SECOND) IntervalSecondValue, cast (BinaryValue 
as binary) Binaryvalue, cast (VarcharValue as varchar) VarcharValue from 
`alltypes.json`;
{noformat}

I ran parquet-tools/parquet-dump:

VarcharValue TV=6885 RL=0 DL=1


page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:17240317 VC:6885

The page size is 16 MB; this is with a 16 MB data set. When I try a similar 
1 GB data set, the page size starts at over 200 MB and decreases to 1 MB.

VarcharValue TV=208513 RL=0 DL=1


page 0:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:215243750 VC:87433
page 1:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:112350266 VC:43717
page 2:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:52501154 VC:21859
page 3:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:27725498 VC:10930
page 4:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:12181241 VC:5466
page 5:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:11005971 VC:2734
page 6:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1133237 VC:1797
page 7:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1462803 VC:899
page 8:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050967 VC:490
page 9:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1051603 VC:424
page 10:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050919 VC:378
page 11:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050487 VC:345
page 12:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1050783 VC:319
page 13:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1052303 VC:299
page 14:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1053235 VC:282
page 15:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1055979 VC:268

The column is a varchar, and its size varies from 2 bytes to 5000 bytes.
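For experimentation, Drill's `store.parquet.page-size` option (in bytes, 1 MB default)
is supposed to cap the target page size the parquet writer produces; whether CTAS honors
it for the first pages of a large data set is exactly what this report calls into
question. A minimal sketch of setting it before re-running the CTAS, where the
connection string is an assumption:

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SetParquetPageSize {
  public static void main(String[] args) throws Exception {
    // Assumption: a drillbit is reachable on localhost; adjust as needed.
    try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement()) {
      // store.parquet.page-size is Drill's target parquet page size in bytes.
      stmt.execute("ALTER SESSION SET `store.parquet.page-size` = 1048576");
      // Re-run the CTAS from this report here, then re-check with parquet-tools.
    }
  }
}
{code}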



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6262) IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405425#comment-16405425
 ] 

ASF GitHub Bot commented on DRILL-6262:
---

Github user sohami commented on the issue:

https://github.com/apache/drill/pull/1175
  
@bitblender - Updated the test. Please review.


> IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector
> -
>
> Key: DRILL-6262
> URL: https://issues.apache.org/jira/browse/DRILL-6262
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
> Fix For: 1.14.0
>
>
> ColumnSize inside RecordBatchSizer throws an IndexOutOfBoundException while 
> computing the totalDataSize for a VariableWidthVector when the underlying 
> vector is empty, with no memory allocated.
> This happens because totalDataSize is computed from the offsetVector value 
> at index n, where n is the total number of records in the vector. When the 
> vector is empty, n=0 and the offsetVector DrillBuf is empty as well, so 
> retrieving the value at index 0 from the offsetVector throws the exception.
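
The description above implies the fix is a guard for the empty case before indexing the
offset vector; a minimal sketch of that idea, where the names approximate Drill's vector
APIs and are illustrative rather than the actual RecordBatchSizer change:

{code:java}
// Illustrative sketch only; see PR #1175 for the real fix.
private static int totalDataSize(VariableWidthVector vector) {
  int valueCount = vector.getAccessor().getValueCount();
  if (valueCount == 0) {
    // Empty vector: the offset DrillBuf may be unallocated, so don't index it.
    return 0;
  }
  // Offsets hold valueCount + 1 entries; the entry at valueCount is the total data size.
  return vector.getOffsetVector().getAccessor().get(valueCount);
}
{code}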



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6238) Batch sizing for operators

2018-03-19 Thread Padma Penumarthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Padma Penumarthy updated DRILL-6238:

Description: 
*Batch Sizing For Operators*

This document describes the approach we are taking to limit batch sizes for 
operators other than scan.

*Motivation*

The main goals are:
 # Improve concurrency - by having each query use less memory, i.e. stay 
within its budget, we can run more queries concurrently.
 # Reduce query failures caused by out-of-memory errors

To accomplish these goals, we need to make queries execute within a specified 
memory budget. To enforce a per-query memory limit, we need to be able to 
enforce per-fragment and per-operator memory limits. Controlling individual 
operators' batch sizes is the first step towards all this.

*Background*

In Drill, different operators have different limits w.r.t. outgoing batches. 
Some use hard-coded row counts, some use hard-coded memory, and some have no 
limit at all. Based on the input data size and what the operator is doing, the 
memory used by the outgoing batch can vary widely, since no limits are 
imposed. Queries fail because we are not able to allocate the memory needed. 
Some operators produce very large batches, causing blocking operators like 
sort and hash aggregate, which have to work under tight memory constraints, to 
fail. The size of batches should be a function of available memory rather than 
of input data size and/or what the operator does. Please refer to the table at 
the end of this document for details on what each operator does today.

*Design*

The goal is to have all operators behave the same way, i.e. produce batches no 
larger than the configured outgoing batch size, with a minimum of 1 row and a 
maximum of 64K rows per batch. A new system option 
‘drill.exec.memory.operator.output_batch_size’ is added, with a default value 
of 16 MB.

The basic idea is to limit the size of the outgoing batch by deciding how many 
rows it can hold, based on the average entry size of each outgoing column and 
taking into account the actual data size plus the metadata vector overhead we 
add on top for tracking variable length, mode (repeated, optional, required), 
etc. The calculation of the number of rows is different for each operator and 
is based on:
 # What the operator is doing
 # The incoming batch size, which includes information on the type and average 
size of each column
 # What is being projected out

By taking this adaptive approach based on actual average data sizes, operators 
that previously limited batch size to less than 64K rows can fit many more 
rows (up to 64K) in a batch, as long as memory stays within the budget. For 
example, flatten and joins have a batch size of 4K rows, which was probably 
chosen to be conservative w.r.t. memory usage. Letting these operators go up 
to 64K rows, as long as they stay within the memory budget, should help 
improve performance.

Also, to improve performance and utilize memory more efficiently, we will:
 # Allocate memory for value vectors upfront. Since we know the number of rows 
and the sizing information for each column in the outgoing batch, we will use 
that information to allocate value-vector memory upfront. Currently, we either 
do an initial allocation for 4K values and keep doubling every time we need 
more, or we allocate the maximum needed upfront. Pre-allocating memory based 
on the sizing calculation improves performance by avoiding the memory copies 
and the zeroing of the new half that every doubling incurs, and it saves 
memory in cases where we were over-allocating before.
 # Round down the number of rows in the outgoing batch to a power of two. 
Since memory is allocated in powers of two, this helps us pack the value 
vectors densely, reducing the amount of memory wasted by the doubling effect.

So, to summarize, the benefits are improved memory utilization, better 
performance, higher concurrency, and fewer queries dying because of 
out-of-memory errors.

Note: Since these sizing calculations are based on averages, strict 
memory-usage enforcement is not possible. There could be pathological cases 
where, because of uneven data distribution, we exceed the configured output 
batch size, potentially causing OOM errors and problems in downstream 
operators.

Other issues that will be addressed:
 * We are adding extra processing for each batch in each operator to figure 
out the sizing information. This overhead can be reduced by passing the 
information along with the batch between operators.
 * For some operators, it will be complex to figure out the average size of 
outgoing columns, especially if we have to evaluate complex expression trees 
and UDFs to figure out the transformation applied to incoming batches. We will 
use approximations as appropriate.

The following table summarizes the limits we have today for each operator.
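
To make the row-count logic concrete, here is a minimal sketch of the calculation
described above; the constants and names are illustrative, not the actual Drill
memory-manager code:

{code:java}
// Illustrative sketch of the batch-sizing rule described in this document.
public final class OutputRowCount {
  static final int MIN_ROWS = 1;
  static final int MAX_ROWS = 64 * 1024;  // 64K-row ceiling per batch

  /**
   * @param outputBatchSize     configured limit in bytes (16 MB by default)
   * @param avgOutgoingRowWidth average bytes per outgoing row, including the
   *                            offset/bits vector overhead of each column
   */
  static int compute(long outputBatchSize, int avgOutgoingRowWidth) {
    if (avgOutgoingRowWidth <= 0) {
      return MAX_ROWS;  // no sizing info yet; fall back to the row ceiling
    }
    long rows = outputBatchSize / avgOutgoingRowWidth;
    rows = Math.max(MIN_ROWS, Math.min(MAX_ROWS, rows));
    // Round down to a power of two so pre-allocated vectors pack densely.
    return Integer.highestOneBit((int) rows);
  }
}
{code}

For example, with the 16 MB default and an average row width of 300 bytes, the raw
quotient is 55,924 rows, which rounds down to 32,768 (2^15).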


[jira] [Updated] (DRILL-6275) drillbit direct_current memory usage is not populated/updated

2018-03-19 Thread Timothy Farkas (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Farkas updated DRILL-6275:
--
Reviewer: Kunal Khatua

> drillbit direct_current memory usage is not populated/updated
> -
>
> Key: DRILL-6275
> URL: https://issues.apache.org/jira/browse/DRILL-6275
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.13.0
>Reporter: Chun Chang
>Assignee: Timothy Farkas
>Priority: Minor
>
> We used to keep track of Drill memory usage in sys.memory, which was useful 
> in detecting memory leaks. This feature seems broken: the direct_current 
> memory usage is not populated or updated.
> {noformat}
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.memory;
> +---------------+------------+---------------+-------------+-----------------+---------------------+--------------+
> | hostname      | user_port  | heap_current  | heap_max    | direct_current  | jvm_direct_current  | direct_max   |
> +---------------+------------+---------------+-------------+-----------------+---------------------+--------------+
> | 10.10.30.168  | 31010      | 1162636800    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.169  | 31010      | 1301175040    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.166  | 31010      | 989448872     | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.167  | 31010      | 1767205312    | 2147483648  | 0               | 22096               | 10737418240  |
> +---------------+------------+---------------+-------------+-----------------+---------------------+--------------+
> 4 rows selected (1.564 seconds)
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.version;
> +------------------+------------------------------------------+------------------------------------------------------+----------------------------+---------------------+----------------------------+
> | version          | commit_id                                | commit_message                                       | commit_time                | build_email         | build_time                 |
> +------------------+------------------------------------------+------------------------------------------------------+----------------------------+---------------------+----------------------------+
> | 1.13.0-SNAPSHOT  | 534212456cc25a49272838cba91c223f63df7fd2 | Cleanup when closing, and cleanup spill after a kill | 07.03.2018 @ 16:18:27 PST  | inram...@gmail.com  | 08.03.2018 @ 10:09:28 PST  |
> +------------------+------------------------------------------+------------------------------------------------------+----------------------------+---------------------+----------------------------+
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6238) Batch sizing for operators

2018-03-19 Thread Padma Penumarthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Padma Penumarthy updated DRILL-6238:

Description: 
*Batch Sizing For Operators*

This document describes the approach we are taking to limit batch sizes for 
operators other than scan.

*Motivation*

The main goals are:
 # Improve concurrency
 # Reduce query failures caused by out-of-memory errors

To accomplish these goals, we need to make queries execute within a specified 
memory budget. To enforce a per-query memory limit, we need to be able to 
enforce per-fragment and per-operator memory limits. Controlling individual 
operators' batch sizes is the first step towards all this.

*Background*

In Drill, different operators have different limits w.r.t. outgoing batches. 
Some use hard-coded row counts, some use hard-coded memory, and some have no 
limit at all. Based on the input data size and what the operator is doing, the 
memory used by the outgoing batch can vary widely, since no limits are 
imposed. Queries fail because we are not able to allocate the memory needed. 
Some operators produce very large batches, causing blocking operators like 
sort and hash aggregate, which have to work under tight memory constraints, to 
fail. The size of batches should be a function of available memory rather than 
of input data size and/or what the operator does. Please refer to the table at 
the end of this document for details on what each operator does today.

*Design*

The goal is to have all operators behave the same way, i.e. produce batches no 
larger than the configured outgoing batch size, with a minimum of 1 row and a 
maximum of 64K rows per batch. A new system option 
‘drill.exec.memory.operator.output_batch_size’ is added, with a default value 
of 16 MB.

The basic idea is to limit the size of the outgoing batch by deciding how many 
rows it can hold, based on the average entry size of each outgoing column and 
taking into account the actual data size plus the metadata vector overhead we 
add on top for tracking variable length, mode (repeated, optional, required), 
etc. The calculation of the number of rows is different for each operator and 
is based on:
 # What the operator is doing
 # The incoming batch size, which includes information on the type and average 
size of each column
 # What is being projected out

By taking this adaptive approach based on actual average data sizes, operators 
that previously limited batch size to less than 64K rows can fit many more 
rows (up to 64K) in a batch, as long as memory stays within the budget. For 
example, flatten and joins have a batch size of 4K rows, which was probably 
chosen to be conservative w.r.t. memory usage. Letting these operators go up 
to 64K rows, as long as they stay within the memory budget, should help 
improve performance.

Also, to improve performance and utilize memory more efficiently, we will:
 # Allocate memory for value vectors upfront. Since we know the number of rows 
and the sizing information for each column in the outgoing batch, we will use 
that information to allocate value-vector memory upfront. Currently, we either 
do an initial allocation for 4K values and keep doubling every time we need 
more, or we allocate the maximum needed upfront. Pre-allocating memory based 
on the sizing calculation improves performance by avoiding the memory copies 
and the zeroing of the new half that every doubling incurs, and it saves 
memory in cases where we were over-allocating before.
 # Round down the number of rows in the outgoing batch to a power of two. 
Since memory is allocated in powers of two, this helps us pack the value 
vectors densely, reducing the amount of memory wasted by the doubling effect.

So, to summarize, the benefits are improved memory utilization, better 
performance, higher concurrency, and fewer queries dying because of 
out-of-memory errors.

Note: Since these sizing calculations are based on averages, strict 
memory-usage enforcement is not possible. There could be pathological cases 
where, because of uneven data distribution, we exceed the configured output 
batch size, potentially causing OOM errors and problems in downstream 
operators.

Other issues that will be addressed:
 * We are adding extra processing for each batch in each operator to figure 
out the sizing information. This overhead can be reduced by passing the 
information along with the batch between operators.
 * For some operators, it will be complex to figure out the average size of 
outgoing columns, especially if we have to evaluate complex expression trees 
and UDFs to figure out the transformation applied to incoming batches. We will 
use approximations as appropriate.

The following table summarizes the limits we have today for each operator.

Flatten, merge join, and external sort adhere to the batch size limits 
described in this document as of drill 

[jira] [Assigned] (DRILL-6275) drillbit direct_current memory usage is not populated/updated

2018-03-19 Thread Chun Chang (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chang reassigned DRILL-6275:
-

Assignee: Timothy Farkas

> drillbit direct_current memory usage is not populated/updated
> -
>
> Key: DRILL-6275
> URL: https://issues.apache.org/jira/browse/DRILL-6275
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.13.0
>Reporter: Chun Chang
>Assignee: Timothy Farkas
>Priority: Minor
>
> We used to keep track of Drill memory usage in sys.memory, which was useful 
> in detecting memory leaks. This feature seems broken: the direct_current 
> memory usage is not populated or updated.
> {noformat}
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.memory;
> +---------------+------------+---------------+-------------+-----------------+---------------------+--------------+
> | hostname      | user_port  | heap_current  | heap_max    | direct_current  | jvm_direct_current  | direct_max   |
> +---------------+------------+---------------+-------------+-----------------+---------------------+--------------+
> | 10.10.30.168  | 31010      | 1162636800    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.169  | 31010      | 1301175040    | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.166  | 31010      | 989448872     | 2147483648  | 0               | 22096               | 10737418240  |
> | 10.10.30.167  | 31010      | 1767205312    | 2147483648  | 0               | 22096               | 10737418240  |
> +---------------+------------+---------------+-------------+-----------------+---------------------+--------------+
> 4 rows selected (1.564 seconds)
> 0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.version;
> +------------------+------------------------------------------+------------------------------------------------------+----------------------------+---------------------+----------------------------+
> | version          | commit_id                                | commit_message                                       | commit_time                | build_email         | build_time                 |
> +------------------+------------------------------------------+------------------------------------------------------+----------------------------+---------------------+----------------------------+
> | 1.13.0-SNAPSHOT  | 534212456cc25a49272838cba91c223f63df7fd2 | Cleanup when closing, and cleanup spill after a kill | 07.03.2018 @ 16:18:27 PST  | inram...@gmail.com  | 08.03.2018 @ 10:09:28 PST  |
> +------------------+------------------------------------------+------------------------------------------------------+----------------------------+---------------------+----------------------------+
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6103) lsb_release: command not found

2018-03-19 Thread Kunal Khatua (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405338#comment-16405338
 ] 

Kunal Khatua commented on DRILL-6103:
-

Thanks [~sanel].

For CentOS, I see:

{{[root@kk127 3rdparty]# grep OS /etc/centos-release}}
{{CentOS release 6.4 (Final)}}

So I'll create a PR for a generic version of your patch that should apply to 
other flavors too.

> lsb_release: command not found
> --
>
> Key: DRILL-6103
> URL: https://issues.apache.org/jira/browse/DRILL-6103
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Chunhui Shi
>Priority: Major
> Fix For: 1.14.0
>
> Attachments: drill-config.sh.patch
>
>
> Got this error when running drillbit.sh:
>  
> $ bin/drillbit.sh restart
> bin/drill-config.sh: line 317: lsb_release: command not found



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-6103) lsb_release: command not found

2018-03-19 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua reassigned DRILL-6103:
---

Assignee: Kunal Khatua

> lsb_release: command not found
> --
>
> Key: DRILL-6103
> URL: https://issues.apache.org/jira/browse/DRILL-6103
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Chunhui Shi
>Assignee: Kunal Khatua
>Priority: Major
> Fix For: 1.14.0
>
> Attachments: drill-config.sh.patch
>
>
> Got this error when running drillbit.sh:
>  
> $ bin/drillbit.sh restart
> bin/drill-config.sh: line 317: lsb_release: command not found



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6103) lsb_release: command not found

2018-03-19 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-6103:

Fix Version/s: 1.14.0

> lsb_release: command not found
> --
>
> Key: DRILL-6103
> URL: https://issues.apache.org/jira/browse/DRILL-6103
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Chunhui Shi
>Priority: Major
> Fix For: 1.14.0
>
> Attachments: drill-config.sh.patch
>
>
> Got this error when running drillbit.sh:
>  
> $ bin/drillbit.sh restart
> bin/drill-config.sh: line 317: lsb_release: command not found



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6275) drillbit direct_current memory usage is not populated/updated

2018-03-19 Thread Chun Chang (JIRA)
Chun Chang created DRILL-6275:
-

 Summary: drillbit direct_current memory usage is not 
populated/updated
 Key: DRILL-6275
 URL: https://issues.apache.org/jira/browse/DRILL-6275
 Project: Apache Drill
  Issue Type: Bug
  Components: Metadata
Affects Versions: 1.13.0
Reporter: Chun Chang


We used to keep track of Drill memory usage in sys.memory, which was useful in 
detecting memory leaks. This feature seems broken: the direct_current memory 
usage is not populated or updated.

{noformat}
0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.memory;
+---------------+------------+---------------+-------------+-----------------+---------------------+--------------+
| hostname      | user_port  | heap_current  | heap_max    | direct_current  | jvm_direct_current  | direct_max   |
+---------------+------------+---------------+-------------+-----------------+---------------------+--------------+
| 10.10.30.168  | 31010      | 1162636800    | 2147483648  | 0               | 22096               | 10737418240  |
| 10.10.30.169  | 31010      | 1301175040    | 2147483648  | 0               | 22096               | 10737418240  |
| 10.10.30.166  | 31010      | 989448872     | 2147483648  | 0               | 22096               | 10737418240  |
| 10.10.30.167  | 31010      | 1767205312    | 2147483648  | 0               | 22096               | 10737418240  |
+---------------+------------+---------------+-------------+-----------------+---------------------+--------------+
4 rows selected (1.564 seconds)
0: jdbc:drill:zk=10.10.30.166:5181> select * from sys.version;
+------------------+------------------------------------------+------------------------------------------------------+----------------------------+---------------------+----------------------------+
| version          | commit_id                                | commit_message                                       | commit_time                | build_email         | build_time                 |
+------------------+------------------------------------------+------------------------------------------------------+----------------------------+---------------------+----------------------------+
| 1.13.0-SNAPSHOT  | 534212456cc25a49272838cba91c223f63df7fd2 | Cleanup when closing, and cleanup spill after a kill | 07.03.2018 @ 16:18:27 PST  | inram...@gmail.com  | 08.03.2018 @ 10:09:28 PST  |
+------------------+------------------------------------------+------------------------------------------------------+----------------------------+---------------------+----------------------------+
{noformat}
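
To reproduce the observation above, a small JDBC client that polls sys.memory while
queries run elsewhere is enough; this is a minimal sketch, assuming the Drill JDBC
driver is on the classpath and the ZooKeeper address from this report:

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SysMemoryPoll {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=10.10.30.166:5181");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT hostname, direct_current, jvm_direct_current FROM sys.memory")) {
      while (rs.next()) {
        // direct_current stuck at 0 while jvm_direct_current moves is the symptom above.
        System.out.printf("%s direct_current=%d jvm_direct_current=%d%n",
            rs.getString("hostname"), rs.getLong("direct_current"),
            rs.getLong("jvm_direct_current"));
      }
    }
  }
}
{code}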



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6274) MergeJoin Memory Manager is still using Fragmentation Factor

2018-03-19 Thread Sorabh Hamirwasia (JIRA)
Sorabh Hamirwasia created DRILL-6274:


 Summary: MergeJoin Memory Manager is still using Fragmentation 
Factor
 Key: DRILL-6274
 URL: https://issues.apache.org/jira/browse/DRILL-6274
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Flow
Affects Versions: 1.13.0
Reporter: Sorabh Hamirwasia
Assignee: Padma Penumarthy
 Fix For: 1.14.0


MergeJoinMemoryManager is using 
[WORST_CASE_FRAGMENTATION_FACTOR|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java#L156]
 for the memory computation of the outgoing batch. It needs to be updated to 
no longer use that factor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6199) Filter push down doesn't work with more than one nested subqueries

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405227#comment-16405227
 ] 

ASF GitHub Bot commented on DRILL-6199:
---

Github user priteshm commented on the issue:

https://github.com/apache/drill/pull/1152
  
Thanks, @chunhui-shi - marked it as ready-to-commit since the original 
feature was already merged to 1.13. The batch committer this week can take 
another look as well.


> Filter push down doesn't work with more than one nested subqueries
> --
>
> Key: DRILL-6199
> URL: https://issues.apache.org/jira/browse/DRILL-6199
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Anton Gozhiy
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
> Attachments: DRILL_6118_data_source.csv
>
>
> *Data set:*
> The data is generated using the attached file: *DRILL_6118_data_source.csv*
> Data gen commands:
> {code:sql}
> create table dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders/d1` (c1, c2, 
> c3, c4, c5) as select cast(columns[0] as int) c1, columns[1] c2, columns[2] 
> c3, columns[3] c4, columns[4] c5 from dfs.tmp.`DRILL_6118_data_source.csv` 
> where columns[0] in (1, 3);
> create table dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders/d2` (c1, c2, 
> c3, c4, c5) as select cast(columns[0] as int) c1, columns[1] c2, columns[2] 
> c3, columns[3] c4, columns[4] c5 from dfs.tmp.`DRILL_6118_data_source.csv` 
> where columns[0]=2;
> create table dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders/d3` (c1, c2, 
> c3, c4, c5) as select cast(columns[0] as int) c1, columns[1] c2, columns[2] 
> c3, columns[3] c4, columns[4] c5 from dfs.tmp.`DRILL_6118_data_source.csv` 
> where columns[0]>3;
> {code}
> *Steps:*
> # Execute the following query:
> {code:sql}
> explain plan for select * from (select * from (select * from 
> dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders`)) where c1<3
> {code}
> *Expected result:*
> numFiles=2, numRowGroups=2, only files from the folders d1 and d2 should be 
> scanned.
> *Actual result:*
> Filter push down doesn't work:
> numFiles=3, numRowGroups=3, scanning from all files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6199) Filter push down doesn't work with more than one nested subqueries

2018-03-19 Thread Pritesh Maker (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6199:
-
Labels: ready-to-commit  (was: )

> Filter push down doesn't work with more than one nested subqueries
> --
>
> Key: DRILL-6199
> URL: https://issues.apache.org/jira/browse/DRILL-6199
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Anton Gozhiy
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
> Attachments: DRILL_6118_data_source.csv
>
>
> *Data set:*
> The data is generated using the attached file: *DRILL_6118_data_source.csv*
> Data gen commands:
> {code:sql}
> create table dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders/d1` (c1, c2, 
> c3, c4, c5) as select cast(columns[0] as int) c1, columns[1] c2, columns[2] 
> c3, columns[3] c4, columns[4] c5 from dfs.tmp.`DRILL_6118_data_source.csv` 
> where columns[0] in (1, 3);
> create table dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders/d2` (c1, c2, 
> c3, c4, c5) as select cast(columns[0] as int) c1, columns[1] c2, columns[2] 
> c3, columns[3] c4, columns[4] c5 from dfs.tmp.`DRILL_6118_data_source.csv` 
> where columns[0]=2;
> create table dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders/d3` (c1, c2, 
> c3, c4, c5) as select cast(columns[0] as int) c1, columns[1] c2, columns[2] 
> c3, columns[3] c4, columns[4] c5 from dfs.tmp.`DRILL_6118_data_source.csv` 
> where columns[0]>3;
> {code}
> *Steps:*
> # Execute the following query:
> {code:sql}
> explain plan for select * from (select * from (select * from 
> dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders`)) where c1<3
> {code}
> *Expected result:*
> numFiles=2, numRowGroups=2, only files from the folders d1 and d2 should be 
> scanned.
> *Actual result:*
> Filter push down doesn't work:
> numFiles=3, numRowGroups=3, scanning from all files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6199) Filter push down doesn't work with more than one nested subqueries

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405210#comment-16405210
 ] 

ASF GitHub Bot commented on DRILL-6199:
---

Github user chunhui-shi commented on the issue:

https://github.com/apache/drill/pull/1152
  
+1, looks good to me.


> Filter push down doesn't work with more than one nested subqueries
> --
>
> Key: DRILL-6199
> URL: https://issues.apache.org/jira/browse/DRILL-6199
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Anton Gozhiy
>Assignee: Arina Ielchiieva
>Priority: Major
> Fix For: 1.14.0
>
> Attachments: DRILL_6118_data_source.csv
>
>
> *Data set:*
> The data is generated using the attached file: *DRILL_6118_data_source.csv*
> Data gen commands:
> {code:sql}
> create table dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders/d1` (c1, c2, 
> c3, c4, c5) as select cast(columns[0] as int) c1, columns[1] c2, columns[2] 
> c3, columns[3] c4, columns[4] c5 from dfs.tmp.`DRILL_6118_data_source.csv` 
> where columns[0] in (1, 3);
> create table dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders/d2` (c1, c2, 
> c3, c4, c5) as select cast(columns[0] as int) c1, columns[1] c2, columns[2] 
> c3, columns[3] c4, columns[4] c5 from dfs.tmp.`DRILL_6118_data_source.csv` 
> where columns[0]=2;
> create table dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders/d3` (c1, c2, 
> c3, c4, c5) as select cast(columns[0] as int) c1, columns[1] c2, columns[2] 
> c3, columns[3] c4, columns[4] c5 from dfs.tmp.`DRILL_6118_data_source.csv` 
> where columns[0]>3;
> {code}
> *Steps:*
> # Execute the following query:
> {code:sql}
> explain plan for select * from (select * from (select * from 
> dfs.tmp.`DRILL_6118_parquet_partitioned_by_folders`)) where c1<3
> {code}
> *Expected result:*
> numFiles=2, numRowGroups=2, only files from the folders d1 and d2 should be 
> scanned.
> *Actual result:*
> Filter push down doesn't work:
> numFiles=3, numRowGroups=3, scanning from all files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6262) IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405168#comment-16405168
 ] 

ASF GitHub Bot commented on DRILL-6262:
---

Github user sohami commented on the issue:

https://github.com/apache/drill/pull/1175
  
@ppadma - Please review


> IndexOutOfBoundException in RecordBatchSize for empty variableWidthVector
> -
>
> Key: DRILL-6262
> URL: https://issues.apache.org/jira/browse/DRILL-6262
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
> Fix For: 1.14.0
>
>
> ColumnSize inside RecordBatchSizer throws an IndexOutOfBoundException while 
> computing the totalDataSize for a VariableWidthVector when the underlying 
> vector is empty, with no memory allocated.
> This happens because totalDataSize is computed from the offsetVector value at 
> index n, where n is the total number of records in the vector. When the 
> vector is empty, n=0 and the offsetVector's drillbuf is empty as well, so 
> retrieving the value at index 0 from the offsetVector throws the exception.
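>
> A minimal sketch of the guard such a fix implies, assuming the usual 
> variable-width layout in which offsets[recordCount] marks the end of the 
> last record (names are illustrative, not Drill's exact API):
> {code:java}
> public final class OffsetSizerSketch {
>
>   // offsets[i] is the start of record i; offsets[recordCount] is the end
>   // of the last record, so the total data size is offsets[recordCount].
>   public static int totalDataSize(int[] offsets, int recordCount) {
>     // Guard: an empty vector may have no allocated offsets at all, so
>     // even offsets[0] would be out of bounds.
>     if (recordCount == 0 || offsets.length == 0) {
>       return 0;
>     }
>     return offsets[recordCount];
>   }
>
>   public static void main(String[] args) {
>     System.out.println(totalDataSize(new int[] {0, 3, 7}, 2)); // 7
>     System.out.println(totalDataSize(new int[] {}, 0));        // 0, no exception
>   }
> }
> {code}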





[jira] [Created] (DRILL-6273) Remove dependency licensed under Category X

2018-03-19 Thread Vlad Rozov (JIRA)
Vlad Rozov created DRILL-6273:
-

 Summary: Remove dependency licensed under Category X
 Key: DRILL-6273
 URL: https://issues.apache.org/jira/browse/DRILL-6273
 Project: Apache Drill
  Issue Type: Task
Reporter: Vlad Rozov
 Fix For: 1.14.0








[jira] [Updated] (DRILL-6272) Remove binary jar files from source distribution

2018-03-19 Thread Vlad Rozov (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vlad Rozov updated DRILL-6272:
--
Priority: Critical  (was: Major)

> Remove binary jar files from source distribution
> -
>
> Key: DRILL-6272
> URL: https://issues.apache.org/jira/browse/DRILL-6272
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Vlad Rozov
>Priority: Critical
> Fix For: 1.14.0
>
>






[jira] [Created] (DRILL-6272) Remove binary jar files from source distribution

2018-03-19 Thread Vlad Rozov (JIRA)
Vlad Rozov created DRILL-6272:
-

 Summary: Remove binary jar files from source distribution
 Key: DRILL-6272
 URL: https://issues.apache.org/jira/browse/DRILL-6272
 Project: Apache Drill
  Issue Type: Task
Reporter: Vlad Rozov
 Fix For: 1.14.0








[jira] [Created] (DRILL-6271) Update copyright range in NOTICE

2018-03-19 Thread Vlad Rozov (JIRA)
Vlad Rozov created DRILL-6271:
-

 Summary: Update copyright range in NOTICE
 Key: DRILL-6271
 URL: https://issues.apache.org/jira/browse/DRILL-6271
 Project: Apache Drill
  Issue Type: Task
Reporter: Vlad Rozov
 Fix For: 1.14.0








[jira] [Commented] (DRILL-6103) lsb_release: command not found

2018-03-19 Thread Sanel Zukan (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404868#comment-16404868
 ] 

Sanel Zukan commented on DRILL-6103:


Here is a small patch that checks for the _/etc/fedora-release_ path. This is 
more common on Linux distros than the _lsb_release_ command.

[^drill-config.sh.patch]
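
The attached patch is authoritative; below is a minimal sketch of the kind of check it describes, with illustrative variable names:
{code:bash}
# Prefer the release file; fall back to lsb_release when it is installed.
if [ -f /etc/fedora-release ]; then
  DISTRO=$(cat /etc/fedora-release)
elif command -v lsb_release >/dev/null 2>&1; then
  DISTRO=$(lsb_release -ds)
else
  DISTRO=unknown
fi
echo "Detected distribution: $DISTRO"
{code}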

> lsb_release: command not found
> --
>
> Key: DRILL-6103
> URL: https://issues.apache.org/jira/browse/DRILL-6103
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Chunhui Shi
>Priority: Major
> Attachments: drill-config.sh.patch
>
>
> Got this error when running drillbit.sh:
>  
> $ bin/drillbit.sh restart
> bin/drill-config.sh: line 317: lsb_release: command not found





[jira] [Updated] (DRILL-6103) lsb_release: command not found

2018-03-19 Thread Sanel Zukan (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanel Zukan updated DRILL-6103:
---
Attachment: drill-config.sh.patch



[jira] [Updated] (DRILL-6231) Fix memory allocation for repeated list vector

2018-03-19 Thread Padma Penumarthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Padma Penumarthy updated DRILL-6231:

  Labels: ready-to-commit  (was: )
Reviewer: Paul Rogers

> Fix memory allocation for repeated list vector
> --
>
> Key: DRILL-6231
> URL: https://issues.apache.org/jira/browse/DRILL-6231
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Flow
>Affects Versions: 1.13.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
>Priority: Critical
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> Vector allocation in the record batch sizer can be enhanced to allocate 
> memory for the repeated list vector more accurately, rather than relying on 
> the default allocation functions.
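>
> A minimal sketch of the idea, assuming allocation is driven by the observed 
> average list cardinality rather than a fixed default (names and numbers are 
> illustrative, not Drill's API):
> {code:java}
> public final class RepeatedListAllocationSketch {
>
>   // Estimate child-element slots as records * observed average entries
>   // per record, instead of a one-size-fits-all default.
>   public static int estimateChildCount(int recordCount, double avgEntriesPerRecord) {
>     return (int) Math.ceil(recordCount * avgEntriesPerRecord);
>   }
>
>   public static void main(String[] args) {
>     // 4096 records averaging 2.5 list entries each -> 10240 slots.
>     System.out.println(estimateChildCount(4096, 2.5));
>   }
> }
> {code}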





[jira] [Created] (DRILL-6270) Add debug startup option flag for drill in embedded and server mode

2018-03-19 Thread Volodymyr Tkach (JIRA)
Volodymyr Tkach created DRILL-6270:
--

 Summary: Add debug startup option flag for drill in embedded and 
server mode
 Key: DRILL-6270
 URL: https://issues.apache.org/jira/browse/DRILL-6270
 Project: Apache Drill
  Issue Type: Task
Reporter: Volodymyr Tkach
Assignee: Anton Gozhiy


Add the possibility to run the sqlline.sh and drillbit.sh scripts with a -- 
flag that enables the standard Java remote debug options, with the ability to 
override the port.

Example: drillbit.sh start - 50001





[jira] [Updated] (DRILL-6250) Sqlline start command with password appears in the sqlline.log

2018-03-19 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6250:

Labels: ready-to-commit  (was: )

> Sqlline start command with password appears in the sqlline.log
> --
>
> Key: DRILL-6250
> URL: https://issues.apache.org/jira/browse/DRILL-6250
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Anton Gozhiy
>Assignee: Volodymyr Tkach
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> *Prerequisites:*
>  *1.* Log level is set to "all" in the conf/logback.xml:
> {code:xml}
> <logger name="org.apache.drill" additivity="false">
>   <level value="all"/>
>   <appender-ref ref="FILE"/>
> </logger>
> {code}
> *2.* PLAIN authentication mechanism is configured:
> {code:java}
>   security.user.auth: {
>   enabled: true,
>   packages += "org.apache.drill.exec.rpc.user.security",
>   impl: "pam",
>   pam_profiles: [ "sudo", "login" ]
>   }
> {code}
> *Steps:*
>  *1.* Start the drillbits
>  *2.* Connect by sqlline:
> {noformat}
> /opt/mapr/drill/drill-1.13.0/bin/sqlline -u "jdbc:drill:zk=node1:5181;" -n 
> user1 -p 1234
> {noformat}
> *3.* Check the sqlline logs:
> {noformat}
> tail -F log/sqlline.log|grep 1234 -a5 -b5
> {noformat}
> *Expected result:* Logs shouldn't contain clear-text passwords
> *Actual result:* The logs contain the sqlline start command with the password:
> {noformat}
> # system properties
> 35333-"java" : {
> 35352-# system properties
> 35384:"command" : "sqlline.SqlLine -d 
> org.apache.drill.jdbc.Driver --maxWidth=1 --color=true -u 
> jdbc:drill:zk=node1:5181; -n user1 -p 1234",
> 35535-# system properties
> 35567-"launcher" : "SUN_STANDARD"
> 35607-}
> {noformat}





[jira] [Commented] (DRILL-6250) Sqlline start command with password appears in the sqlline.log

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404625#comment-16404625
 ] 

ASF GitHub Bot commented on DRILL-6250:
---

Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1174
  
+1




[jira] [Commented] (DRILL-6250) Sqlline start command with password appears in the sqlline.log

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404614#comment-16404614
 ] 

ASF GitHub Bot commented on DRILL-6250:
---

Github user vladimirtkach commented on the issue:

https://github.com/apache/drill/pull/1174
  
@arina-ielchiieva addressed code review comments




[jira] [Commented] (DRILL-6250) Sqlline start command with password appears in the sqlline.log

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404541#comment-16404541
 ] 

ASF GitHub Bot commented on DRILL-6250:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/1174#discussion_r175368114
  
--- Diff: 
common/src/main/java/org/apache/drill/common/config/DrillConfig.java ---
@@ -52,8 +52,8 @@
   public DrillConfig(Config config) {
 super(config);
 logger.debug("Setting up DrillConfig object.");
-logger.trace("Given Config object is:\n{}",
- config.root().render(ConfigRenderOptions.defaults()));
+logger.trace("Given Config object is:\n{}", 
config.withoutPath("password").withoutPath("sun.java.command")
--- End diff --

Please add a comment explaining why we exclude `sun.java.command`.
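
For context, Typesafe Config's {{withoutPath}} returns a copy of the config with the given path removed, so the sensitive entries never reach the rendered log line. A standalone sketch (config keys and values invented for illustration):

{code:java}
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import com.typesafe.config.ConfigRenderOptions;

public class ScrubConfigSketch {
  public static void main(String[] args) {
    Config config = ConfigFactory.parseString(
        "password = \"1234\"\n"
        + "sun.java.command = \"sqlline.SqlLine ... -p 1234\"\n"
        + "drill.exec.rpc.user.timeout = 30");
    // Drop the sensitive paths before rendering, as the fix above does.
    String safe = config.withoutPath("password")
        .withoutPath("sun.java.command")
        .root().render(ConfigRenderOptions.defaults());
    System.out.println(safe); // neither the password nor the command line appears
  }
}
{code}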




[jira] [Updated] (DRILL-6250) Sqlline start command with password appears in the sqlline.log

2018-03-19 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6250:

Fix Version/s: 1.14.0
