[jira] [Assigned] (DRILL-7391) Wrong result when doing left outer join on CSV table

2019-10-18 Thread Aman Sinha (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha reassigned DRILL-7391:
-

Assignee: Vova Vysotskyi  (was: Aman Sinha)

> Wrong result when doing left outer join on CSV table
> 
>
> Key: DRILL-7391
> URL: https://issues.apache.org/jira/browse/DRILL-7391
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.16.0
>Reporter: Aman Sinha
>Assignee: Vova Vysotskyi
>Priority: Major
> Fix For: 1.17.0
>
> Attachments: tt5.tar.gz, tt6.tar.gz



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7391) Wrong result when doing left outer join on CSV table

2019-10-18 Thread Aman Sinha (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954921#comment-16954921
 ] 

Aman Sinha commented on DRILL-7391:
---

[~volodymyr], I assume this fix would go into 1.17 since it is a wrong-result 
bug.  Since I will be offline for a few days, I am assigning this to you to 
update the Calcite version in the pom.xml file once CALCITE-3390 is merged, and 
to add the following unit test to TestExampleQueries.java:
{noformat}

  @Test  // DRILL-7391
  public void testItemPushdownPastLeftOuterJoin() throws Exception {
    String query = "select t1.columns[0] as a, t2.columns[0] as b from cp.`store/text/data/regions.csv` t1"
        + " left outer join cp.`store/text/data/regions.csv` t2 on t1.columns[0] = t2.columns[0]";

    PlanTestBase.testPlanMatchingPatterns(query,
        new String[] {},
        // exclude pattern where Project is projecting the 'columns' field
        new String[] {"Project.*columns"});
  }
{noformat}

> Wrong result when doing left outer join on CSV table
> 
>
> Key: DRILL-7391
> URL: https://issues.apache.org/jira/browse/DRILL-7391
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.16.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.17.0
>
> Attachments: tt5.tar.gz, tt6.tar.gz



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7391) Wrong result when doing left outer join on CSV table

2019-10-18 Thread Aman Sinha (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954729#comment-16954729
 ] 

Aman Sinha commented on DRILL-7391:
---

Adding a link to CALCITE-3390, on which this fix depends. 

> Wrong result when doing left outer join on CSV table
> 
>
> Key: DRILL-7391
> URL: https://issues.apache.org/jira/browse/DRILL-7391
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.16.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.17.0
>
> Attachments: tt5.tar.gz, tt6.tar.gz



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7391) Wrong result when doing left outer join on CSV table

2019-09-30 Thread Aman Sinha (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941377#comment-16941377
 ] 

Aman Sinha commented on DRILL-7391:
---

I narrowed it down to the Calcite upgrade [1] that was done in Drill 1.13.0. 
Just before the Calcite upgrade, the logical plan and the results are correct 
(see below). After the upgrade, the logical plan has an extra 'columns' field 
in the Project on the right input of the Left Outer Join, which is what is 
causing the problem.  The 'columns' field is present because the ITEM 
expression above the LOJ was pushed down only to the left side of the join, 
not to the right side.
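
To make the intended plan shape concrete, here is a rough SQL sketch 
(hypothetical tables t1 and t2 with CSV 'columns' arrays) of the rewrite the 
planner should produce, with the ITEM expression evaluated below both join 
inputs:

{noformat}
-- Sketch only: both inputs project columns[0] before the join, so the join
-- condition and the right-side output column come from the same pushed-down
-- ITEM expression instead of re-reading the whole 'columns' array on the right.
SELECT lhs.c0, rhs.c0
FROM (SELECT columns[0] AS c0 FROM t1) lhs
LEFT OUTER JOIN (SELECT columns[0] AS c0 FROM t2) rhs
  ON lhs.c0 = rhs.c0;
{noformat}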

Here's the sequence, which should clarify (here I am using two different 
tables compared to what I used in the Description, but the result is similar):

{noformat}
// Before calcite upgrade in Drill 1.13.0
// RIGHT result

0: jdbc:drill:zk=local> select * from sys.version;
+-----------------+------------------------------------------+----------------------+---------------------------+---------------------+---------------------------+
| version         | commit_id                                | commit_message       | commit_time               | build_email         | build_time                |
+-----------------+------------------------------------------+----------------------+---------------------------+---------------------+---------------------------+
| 1.13.0-SNAPSHOT | 450e67094eb6e9a6484d7f86c49b51c77a08d7b2 | REVERTED: DRILL-5089 | 16.01.2018 @ 02:10:13 PST | asi...@maprtech.com | 27.09.2019 @ 11:25:45 PDT |
+-----------------+------------------------------------------+----------------------+---------------------------+---------------------+---------------------------+

0: jdbc:drill:zk=local> explain plan without implementation for select tt7.columns[0], tt8.columns[0] as x from tt7 left outer join tt8 on tt7.columns[0] = tt8.columns[0];
+------+------+
| text | json |
+------+------+
| DrillScreenRel
  DrillProjectRel(EXPR$0=[$1], x=[$3])
    DrillJoinRel(condition=[=($0, $2)], joinType=[left])
      DrillProjectRel($f2=[ITEM($0, 0)], ITEM=[ITEM($0, 0)])
        DrillScanRel(table=[[dfs, tmp, tt7]], groupscan=[EasyGroupScan [selectionRoot=file:/tmp/tt7, numFiles=1, columns=[`columns`[0]], files=[file:/tmp/tt7/0_0_0.csv]]])
      DrillProjectRel($f2=[ITEM($0, 0)], ITEM=[ITEM($0, 0)])
        DrillScanRel(table=[[dfs, tmp, tt8]], groupscan=[EasyGroupScan [selectionRoot=file:/tmp/tt8, numFiles=1, columns=[`columns`[0]], files=[file:/tmp/tt8/0_0_0.csv]]])


0: jdbc:drill:zk=local> select tt7.columns[0], tt8.columns[0] as x from tt7 left outer join tt8 on tt7.columns[0] = tt8.columns[0];
+---------+-------+
| EXPR$0  |   x   |
+---------+-------+
| 1414    | 1414  |
| 455     | null  |
| 555     | null  |
| 1414    | 1414  |
| 455     | null  |
| 9669    | 9669  |
| 555     | null  |
+---------+-------+


// After Calcite upgrade in Drill 1.13.0
// WRONG result

0: jdbc:drill:zk=local> select * from sys.version;
+-----------------+------------------------------------------+-------------------------------+---------------------------+---------------------+---------------------------+
| version         | commit_id                                | commit_message                | commit_time               | build_email         | build_time                |
+-----------------+------------------------------------------+-------------------------------+---------------------------+---------------------+---------------------------+
| 1.13.0-SNAPSHOT | 3f0e517fb62a1ebad92fb473e787d343152920d6 | DRILL-3993: Resolve conflicts | 16.01.2018 @ 02:10:13 PST | asi...@maprtech.com | 27.09.2019 @ 11:53:51 PDT |
+-----------------+------------------------------------------+-------------------------------+---------------------------+---------------------+---------------------------+

0: jdbc:drill:zk=local> explain plan without implementation for select tt7.columns[0], tt8.columns[0] as x from tt7 left outer join tt8 on tt7.columns[0] = tt8.columns[0];
+------+------+
| text | json |
+------+------+
| DrillScreenRel
  DrillProjectRel(EXPR$0=[$1], x=[ITEM($2, 0)])
    DrillJoinRel(condition=[=($0, $3)], joinType=[left])
      DrillProjectRel($f2=[ITEM($0, 0)], ITEM=[ITEM($0, 0)])
        DrillScanRel(table=[[dfs, tmp, tt7]], groupscan=[EasyGroupScan [selectionRoot=file:/tmp/tt7, numFiles=1, columns=[`columns`[0]], files=[file:/tmp/tt7/0_0_0.csv]]])
      DrillProjectRel(columns=[$0], $f2=[ITEM($0, 0)])
        DrillScanRel(table=[[dfs, tmp, tt8]], groupscan=[EasyGroupScan [selectionRoot=file:/tmp/tt8, numFiles=1, columns=[`columns`, `columns`[0]], files=[file:/tmp/tt8/0_0_0.csv]]])

0: jdbc:drill:zk=local> select tt7.columns[0], tt8.columns[0] as x from tt7 left outer join tt8 on tt7.columns[0] = tt8.columns[0];
+---------+-------+
| 

[jira] [Updated] (DRILL-7391) Wrong result when doing left outer join on CSV table

2019-09-30 Thread Aman Sinha (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7391:
--
Attachment: tt5.tar.gz

> Wrong result when doing left outer join on CSV table
> 
>
> Key: DRILL-7391
> URL: https://issues.apache.org/jira/browse/DRILL-7391
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.16.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.17.0
>
> Attachments: tt5.tar.gz, tt6.tar.gz



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7391) Wrong result when doing left outer join on CSV table

2019-09-30 Thread Aman Sinha (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7391:
--
Attachment: tt6.tar.gz

> Wrong result when doing left outer join on CSV table
> 
>
> Key: DRILL-7391
> URL: https://issues.apache.org/jira/browse/DRILL-7391
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.16.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.17.0
>
> Attachments: tt5.tar.gz, tt6.tar.gz



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7391) Wrong result when doing left outer join on CSV table

2019-09-30 Thread Aman Sinha (Jira)
Aman Sinha created DRILL-7391:
-

 Summary: Wrong result when doing left outer join on CSV table
 Key: DRILL-7391
 URL: https://issues.apache.org/jira/browse/DRILL-7391
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Affects Versions: 1.16.0
Reporter: Aman Sinha
Assignee: Aman Sinha
 Fix For: 1.17.0


The following query returns one row that is incorrect.  For the non-null rows, 
both columns should have the same value.  This is on CSV sample data (I will 
attach the files). 

{noformat}
apache drill (dfs.tmp)> select tt5.columns[0], tt6.columns[0] from tt5 left outer join tt6 on tt5.columns[0] = tt6.columns[0];
+--------+--------+
| EXPR$0 | EXPR$1 |
+--------+--------+
| 455    | null   |
| 455    | null   |
| 555    | null   |
| 1414   | 1414   |
| 455    | null   |
| 580    | null   |
|        | null   |
| 555    | null   |
| 455    | null   |
| 455    | null   |
| 455    | null   |
| 455    | null   |
| 455    | null   |
| 555    | null   |
| 455    | null   |
| 455    | null   |
| 455    | null   |
| 580    | null   |
| 6767   | null   |
| 455    | null   |
| 555    | null   |
| 455    | null   |
| 555    | null   |
| 555    | null   |
| 555    | null   |
| 455    | null   |
| 555    | null   |
| 455    | null   |
| 455    | null   |
| 455    | null   |
| 6767   | null   |
| 555    | null   |
| 555    | null   |
| 455    | null   |
| 555    | null   |
| 555    | null   |
| 1414   | 1414   |
| 455    | null   |
| 555    | null   |
| 555    | null   |
| 455    | null   |
| 455    | null   |
| 555    | null   |
| 455    | null   |
| 555    | null   |
| 555    | null   |
| 455    | null   |
| 455    | null   |
| 9669   | 1414   |  <--- Wrong result
| 555    | null   |
| 455    | null   |
| 455    | null   |
| 455    | null   |
| 555    | null   |
| 580    | null   |
| 455    | null   |
| 555    | null   |
| 455    | null   |
| 555    | null   |
| 455    | null   |
| 455    | null   |
| 409    | null   |
| 455    | null   |
| 555    | null   |
| 555    | null   |
| 455    | null   |
| 455    | null   |
| 555    | null   |
| 455    | null   |
| 555    | null   |
| 1414   | 1414   |
| 455    | null   |
| 555    | null   |
| 555    | null   |
| 555    | null   |
+--------+--------+
75 rows selected 
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7302) Bump Apache Avro from 1.8.2 to 1.9.0

2019-06-23 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870648#comment-16870648
 ] 

Aman Sinha commented on DRILL-7302:
---

[~Fokko] thanks for the contribution. For some reason I am unable to assign 
the JIRA to you; please assign it to yourself. 

> Bump Apache Avro from 1.8.2 to 1.9.0
> 
>
> Key: DRILL-7302
> URL: https://issues.apache.org/jira/browse/DRILL-7302
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Fokko Driesprong
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7297) Query hangs in planning stage when Error is thrown

2019-06-23 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7297:
--
Reviewer: Bohdan Kazydub

> Query hangs in planning stage when Error is thrown
> --
>
> Key: DRILL-7297
> URL: https://issues.apache.org/jira/browse/DRILL-7297
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> Query hangs in the planning stage when an Error (not an OOM or 
> AssertionError) is thrown during query planning. After cancelling the query, 
> it will stay in the Cancellation Requested state.
> Such an error may be thrown due to a mistake in the code, including in a 
> UDF. Since users may provide custom UDFs, Drill should be able to handle 
> such cases as well.
> Steps to reproduce this issue:
> 1. Create a UDF which throws an Error in either the {{eval()}} or 
> {{setup()}} method (instructions on how to create a custom UDF may be found 
> [here|https://drill.apache.org/docs/tutorial-develop-a-simple-function/]).
>  2. Register the custom UDF which throws the error (instructions are 
> [here|https://drill.apache.org/docs/adding-custom-functions-to-drill-introduction/]).
>  3. Run a query with this UDF.
> After submitting the query, the following stack trace is printed:
> {noformat}
> Exception in thread "drill-executor-1" java.lang.Error
>   at 
> org.apache.drill.contrib.function.FunctionExample.setup(FunctionExample.java:19)
>   at 
> org.apache.drill.exec.expr.fn.interpreter.InterpreterEvaluator.evaluateFunction(InterpreterEvaluator.java:139)
>   at 
> org.apache.drill.exec.expr.fn.interpreter.InterpreterEvaluator$EvalVisitor.visitFunctionHolderExpression(InterpreterEvaluator.java:355)
>   at 
> org.apache.drill.exec.expr.fn.interpreter.InterpreterEvaluator$EvalVisitor.visitFunctionHolderExpression(InterpreterEvaluator.java:204)
>   at 
> org.apache.drill.common.expression.FunctionHolderExpression.accept(FunctionHolderExpression.java:53)
>   at 
> org.apache.drill.exec.expr.fn.interpreter.InterpreterEvaluator.evaluateConstantExpr(InterpreterEvaluator.java:70)
>   at 
> org.apache.drill.exec.planner.logical.DrillConstExecutor.reduce(DrillConstExecutor.java:152)
>   at 
> org.apache.calcite.rel.rules.ReduceExpressionsRule.reduceExpressionsInternal(ReduceExpressionsRule.java:620)
>   at 
> org.apache.calcite.rel.rules.ReduceExpressionsRule.reduceExpressions(ReduceExpressionsRule.java:541)
>   at 
> org.apache.calcite.rel.rules.ReduceExpressionsRule$ProjectReduceExpressionsRule.onMatch(ReduceExpressionsRule.java:288)
>   at 
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
>   at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:643)
>   at 
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:339)
>   at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:430)
>   at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:370)
>   at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToRawDrel(DefaultSqlHandler.java:250)
>   at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:319)
>   at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan(DefaultSqlHandler.java:177)
>   at 
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan(DrillSqlWorker.java:226)
>   at 
> org.apache.drill.exec.planner.sql.DrillSqlWorker.convertPlan(DrillSqlWorker.java:124)
>   at 
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:90)
>   at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:593)
>   at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:276)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> 4. Check that the query is still in an in-progress state, then cancel the query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7242) Query with range predicate hits IOBE when accessing histogram buckets

2019-05-14 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7242:
--
  Labels: ready-to-commit  (was: )
Reviewer: Gautam Parai

> Query with range predicate hits IOBE when accessing histogram buckets
> -
>
> Key: DRILL-7242
> URL: https://issues.apache.org/jira/browse/DRILL-7242
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.16.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> The following query hits an IOBE during histogram access (make sure to run 
> the ANALYZE command before running this query): 
> {noformat}
>  select 1 from dfs.tmp.employee where store_id > 24;
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 11
>   at 
> org.apache.drill.exec.planner.common.NumericEquiDepthHistogram.getSelectedRows(NumericEquiDepthHistogram.java:215)
>  ~[drill-java-exec-1.16.0.0-mapr.jar:1.16.0.0-mapr]
>   at 
> org.apache.drill.exec.planner.common.NumericEquiDepthHistogram.estimatedSelectivity(NumericEquiDepthHistogram.java:130)
>  ~[drill-java-exec-1.16.0.0-mapr.jar:1.16.0.0-mapr]
>   at 
> org.apache.drill.exec.planner.cost.DrillRelMdSelectivity.computeRangeSelectivity(DrillRelMd
> {noformat}
> Here, 24.0 is the end point of the last histogram bucket, and the boundary 
> condition is not being handled correctly. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7242) Query with range predicate hits IOBE when accessing histogram buckets

2019-05-06 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-7242:
-

 Summary: Query with range predicate hits IOBE when accessing 
histogram buckets
 Key: DRILL-7242
 URL: https://issues.apache.org/jira/browse/DRILL-7242
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Affects Versions: 1.16.0
Reporter: Aman Sinha
Assignee: Aman Sinha
 Fix For: 1.17.0


The following query hits an IOBE during histogram access (make sure to run 
the ANALYZE command before running this query): 

{noformat}
 select 1 from dfs.tmp.employee where store_id > 24;

Caused by: java.lang.ArrayIndexOutOfBoundsException: 11
at 
org.apache.drill.exec.planner.common.NumericEquiDepthHistogram.getSelectedRows(NumericEquiDepthHistogram.java:215)
 ~[drill-java-exec-1.16.0.0-mapr.jar:1.16.0.0-mapr]
at 
org.apache.drill.exec.planner.common.NumericEquiDepthHistogram.estimatedSelectivity(NumericEquiDepthHistogram.java:130)
 ~[drill-java-exec-1.16.0.0-mapr.jar:1.16.0.0-mapr]
at 
org.apache.drill.exec.planner.cost.DrillRelMdSelectivity.computeRangeSelectivity(DrillRelMd
{noformat}

Here, 24.0 is the end point of the last histogram bucket, and the boundary 
condition is not being handled correctly. 
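
For reference, a minimal reproduction sketch (assuming the dfs.tmp.employee 
table from the query above; statistics must exist for the histogram to be 
consulted):

{noformat}
-- Build table statistics (including the equi-depth histogram), then filter
-- on a value equal to the end point of the last bucket.
ANALYZE TABLE dfs.tmp.`employee` COMPUTE STATISTICS;
SELECT 1 FROM dfs.tmp.`employee` WHERE store_id > 24;
{noformat}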




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7228) Histogram end points show high deviation for a sample data set

2019-05-01 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7228:
--
  Labels: ready-to-commit  (was: )
Reviewer: Gautam Parai

> Histogram end points show high deviation for a sample data set
> --
>
> Key: DRILL-7228
> URL: https://issues.apache.org/jira/browse/DRILL-7228
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7171) Count(*) query on leaf level directory is not reading summary cache file.

2019-04-30 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7171:
--
Labels: ready-to-commit  (was: )

> Count(*) query on leaf level directory is not reading summary cache file.
> -
>
> Key: DRILL-7171
> URL: https://issues.apache.org/jira/browse/DRILL-7171
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Venkata Jyothsna Donapati
>Assignee: Venkata Jyothsna Donapati
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Since the leaf-level directory doesn't store the metadata directories file, 
> the summary-reading code assumes the cache is possibly corrupt when the 
> directories cache file is not present, and skips reading the summary cache 
> file. The metadata directories cache file should be created at the leaf 
> level.
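
For context, a sketch of the affected scenario (hypothetical paths): metadata 
is refreshed at the table root, and a count is then issued against a 
leaf-level subdirectory, which is the case where the summary cache read was 
being skipped:

{noformat}
REFRESH TABLE METADATA dfs.tmp.`orders`;
-- Querying a leaf-level directory should still read the summary cache file
-- rather than treating the missing directories file as a corrupt cache.
SELECT count(*) FROM dfs.tmp.`orders/2019/04`;
{noformat}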



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-30 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7187:
--
Reviewer: Gautam Parai

> Improve selectivity estimates for range predicates when using histogram
> ---
>
> Key: DRILL-7187
> URL: https://issues.apache.org/jira/browse/DRILL-7187
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> Two types of selectivity estimation improvements need to be done:
> 1. For range predicates on the same column, we need to collect all such 
> predicates in one group and do a single histogram lookup for them together. 
> For instance: 
> {noformat}
>  WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50
> {noformat}
> Currently, the Drill behavior is to treat each of the conjuncts 
> independently and multiply the individual selectivities. However, that will 
> not give accurate estimates. Here, we want to group the predicates on 'a' 
> together and do a single lookup, and similarly for 'b'.
> 2. NULLs are not maintained by the histogram, but when doing the selectivity 
> calculations the histogram should use the totalRowCount as the denominator 
> rather than the non-null count. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7225) Merging of columnTypeInfo for file with different schema throws NullPointerException during refresh metadata

2019-04-30 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7225:
--
Labels: ready-to-commit  (was: )

> Merging of columnTypeInfo for file with different schema throws 
> NullPointerException during refresh metadata
> 
>
> Key: DRILL-7225
> URL: https://issues.apache.org/jira/browse/DRILL-7225
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Venkata Jyothsna Donapati
>Assignee: Venkata Jyothsna Donapati
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> Merging of columnTypeInfo from two files with different schemas throws a 
> NullPointerException. For example, if a directory Orders has two files:
>  * orders.parquet (with columns order_id, order_name, order_date)
>  * orders_with_address.parquet (with columns order_id, order_name, address)
> When refresh table metadata is triggered, metadata such as total_null_count 
> for columns in both files is aggregated and updated in the ColumnTypeInfo. 
> Initially, ColumnTypeInfo is initialized with the first file's ColumnTypeInfo 
> (i.e., order_id, order_name, order_date). While aggregating, the existing 
> ColumnTypeInfo is looked up for columns in the second file, and since some of 
> them don't exist in the ColumnTypeInfo, an NPE is thrown. This can be fixed 
> by initializing ColumnTypeInfo for columns that are not yet present.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-30 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7187:
--
Labels: ready-to-commit  (was: )

> Improve selectivity estimates for range predicates when using histogram
> ---
>
> Key: DRILL-7187
> URL: https://issues.apache.org/jira/browse/DRILL-7187
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> Two types of selectivity estimation improvements need to be done:
> 1. For range predicates on the same column, we need to collect all such 
> predicates in one group and do a single histogram lookup for them together. 
> For instance: 
> {noformat}
>  WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50
> {noformat}
> Currently, the Drill behavior is to treat each of the conjuncts 
> independently and multiply the individual selectivities. However, that will 
> not give accurate estimates. Here, we want to group the predicates on 'a' 
> together and do a single lookup, and similarly for 'b'.
> 2. NULLs are not maintained by the histogram, but when doing the selectivity 
> calculations the histogram should use the totalRowCount as the denominator 
> rather than the non-null count. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7228) Histogram end points show high deviation for a sample data set

2019-04-30 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-7228:
-

 Summary: Histogram end points show high deviation for a sample 
data set
 Key: DRILL-7228
 URL: https://issues.apache.org/jira/browse/DRILL-7228
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Aman Sinha
Assignee: Aman Sinha
 Fix For: 1.17.0


There are a couple of scenarios where the histogram bucket end points show 
high deviation for the attached sample data set. 

+Scenario 1:+
There are a total of 100 rows in the sample. Here are the first 10 values of 
the c_float column, ordered by the column.  

{noformat}
select c_float from `table_stats/alltypes_with_nulls` order by c_float;
+--------------+
|   c_float    |
+--------------+
| -4.6873795E9 |
| 8.1855632E7  |
| 2.65311632E8 |
| 4.50677952E8 |
| 4.6864464E8  |
| 5.7848493E8  |
| 6.6793114E8  |
| 7.1175571E8  |
| 9.0065581E8  |
| 9.2245773E8  |
...
...
<100 rows>
{noformat}

Here the minimum value is a negative number.  Here's the output of the 
histogram after running the ANALYZE command: 

{noformat}
 "buckets" : [ 8.1855488E7, 9.13736816E8, 1.720863011198E9, 3.2401755232E9, 
4.6546719328E9, 5.130497904E9, 5.9901393504E9, 6.779930992E9, 7.998626672E9, 
8.69159614398E9, 9.983783792E9 ]
{noformat}

Note that the starting end point of bucket 0 is actually the 2nd value in the 
ordered list, and the negative minimum value is not represented in the 
histogram at all. 

+Scenario 2:+
Histogram for the c_bigint column is as below: 
{noformat}
 {
  "column" : "`c_bigint`",
  "majortype" : {
"type" : "BIGINT",
"mode" : "OPTIONAL"
  },
  "schema" : 1.0,
  "rowcount" : 100.0,
  "nonnullrowcount" : 87.0,
  "ndv" : 46,
  "avgwidth" : 8.0,
  "histogram" : {
"category" : "numeric-equi-depth",
"numRowsPerBucket" : 8,
"buckets" : [ -8.6390506354062131E18, -7.679478802017577E18, 
-5.8389791200382024E18, -2.9165328693138038E18, -1.77746633649836621E18, 
2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 
8.848383132345303E17, 4.6441480083157811E18 ]
  }
}
{noformat}

This indicates that there are duplicate rows with the value close to 2.83E11, 
which is not true when we analyze the source data.
This is the output of the ntile function:

{noformat}
SELECT bucket_num,
       min(c_bigint) as min_amount,
       max(c_bigint) as max_amount,
       count(*) as total_count
FROM (
  SELECT c_bigint,
         NTILE(10) OVER (ORDER BY c_bigint) as bucket_num
  FROM `table_stats/alltypes_with_nulls`
)
GROUP BY bucket_num
ORDER BY bucket_num;
+------------+----------------------+----------------------+-------------+
| bucket_num |      min_amount      |      max_amount      | total_count |
+------------+----------------------+----------------------+-------------+
| 1          | -8804872880253829120 | -6983033704176156672 | 10          |
| 2          | -6772904422084182016 | -5326061597989273600 | 10          |
| 3          | -5111449881868763136 | -2561061038367703040 | 10          |
| 4          | -2424523650070740992 | -449093763428515840  | 10          |
| 5          | 0                    | 0                    | 10          |
| 6          | 0                    | 0                    | 10          |
| 7          | 0                    | 0                    | 10          |
| 8          | 0                    | 884838034226544640   | 10          |
| 9          | 884838034226544640   | 4644147690488201216  | 10          |
| 10         | null                 | null                 | 10          |
+------------+----------------------+----------------------+-------------+
{noformat}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7225) Merging of columnTypeInfo for file with different schema throws NullPointerException during refresh metadata

2019-04-29 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7225:
--
 Reviewer: Aman Sinha
Fix Version/s: 1.17.0

Technically this is a regression from 1.15, but since Case 1 is the most 
common case and it is working, I am marking the fix version as 1.17.  

> Merging of columnTypeInfo for file with different schema throws 
> NullPointerException during refresh metadata
> 
>
> Key: DRILL-7225
> URL: https://issues.apache.org/jira/browse/DRILL-7225
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Venkata Jyothsna Donapati
>Assignee: Venkata Jyothsna Donapati
>Priority: Major
> Fix For: 1.17.0
>
>
> Merging of columnTypeInfo from two files with different schemas throws a 
> NullPointerException. For example, if a directory Orders has two files:
>  * orders.parquet (with columns order_id, order_name, order_date)
>  * orders_with_address.parquet (with columns order_id, order_name, address)
> When refresh table metadata is triggered, metadata such as total_null_count 
> for columns in both files is aggregated and updated in the ColumnTypeInfo. 
> Initially, ColumnTypeInfo is initialized with the first file's ColumnTypeInfo 
> (i.e., order_id, order_name, order_date). While aggregating, the existing 
> ColumnTypeInfo is looked up for columns in the second file, and since some of 
> them don't exist in the ColumnTypeInfo, an NPE is thrown. This can be fixed 
> by initializing ColumnTypeInfo for columns that are not yet present.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7225) Merging of columnTypeInfo for file with different schema throws NullPointerException during refresh metadata

2019-04-29 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829814#comment-16829814
 ] 

Aman Sinha commented on DRILL-7225:
---

[~vdonapati], let's clarify this further: 
{noformat}
Case 1:
   /a/b contains files with Schema A
   /a/c contains files with Schema B
Case 2:
   /a/b contains files with Schema A and Schema B
{noformat}

Does the existing code work correctly for Case 1 and fail for Case 2?  Case 1 
is the more common scenario; Case 2 is much less common. 

> Merging of columnTypeInfo for file with different schema throws 
> NullPointerException during refresh metadata
> 
>
> Key: DRILL-7225
> URL: https://issues.apache.org/jira/browse/DRILL-7225
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Venkata Jyothsna Donapati
>Assignee: Venkata Jyothsna Donapati
>Priority: Major
>
> Merging of columnTypeInfo from two files with different schemas throws a 
> NullPointerException. For example, if a directory Orders has two files:
>  * orders.parquet (with columns order_id, order_name, order_date)
>  * orders_with_address.parquet (with columns order_id, order_name, address)
> When refresh table metadata is triggered, metadata such as total_null_count 
> for columns in both files is aggregated and updated in the ColumnTypeInfo. 
> Initially, ColumnTypeInfo is initialized with the first file's ColumnTypeInfo 
> (i.e., order_id, order_name, order_date). While aggregating, the existing 
> ColumnTypeInfo is looked up for columns in the second file, and since some of 
> them don't exist in the ColumnTypeInfo, an NPE is thrown. This can be fixed 
> by initializing ColumnTypeInfo for columns that are not yet present.
>  
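
As a sketch, the command that triggers the merge described above (using the 
hypothetical /a directory from the comment) is simply:

{noformat}
-- Case 1: /a/b and /a/c are each internally homogeneous.
-- Case 2: /a/b itself mixes Schema A and Schema B files.
REFRESH TABLE METADATA dfs.tmp.`a`;
{noformat}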



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7223) Make the timeout in TimedCallable a configurable boot time parameter

2019-04-27 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-7223:
-

 Summary: Make the timeout in TimedCallable a configurable boot 
time parameter
 Key: DRILL-7223
 URL: https://issues.apache.org/jira/browse/DRILL-7223
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Aman Sinha
 Fix For: 1.17.0


The 
[TimedCallable.TIMEOUT_PER_RUNNABLE_IN_MSECS|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/TimedCallable.java#L52]
 timeout is currently an internal Drill constant defined as 15 seconds; it has 
been there since day one. Drill's TimedCallable implements the Java concurrency 
Callable interface to create timed threads. It is used by the REFRESH METADATA 
command, which creates multiple threads on the Foreman node to gather Parquet 
metadata to build the metadata cache.
Depending on the load on the system, or at a very large scale (millions of 
Parquet files), it is possible to exceed this timeout. While the exact root 
cause of exceeding the timeout is being investigated, it makes sense to make 
the timeout a configurable parameter to aid large-scale testing. This JIRA is 
to make it a configurable bootstrapping option in drill-override.conf.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7198) Issuing a control-C in Sqlline exits the session (it does cancel the query)

2019-04-23 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-7198:
-

 Summary: Issuing a control-C in Sqlline exits the session (it does 
cancel the query)
 Key: DRILL-7198
 URL: https://issues.apache.org/jira/browse/DRILL-7198
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.15.0, 1.16.0
Reporter: Aman Sinha


This behavior is observed in both Drill 1.15.0 and RC1 of 1.16.0. Run a 
long-running query in sqlline and cancel it using control-C: the sqlline 
session exits, although the query is cancelled. The behavior is seen in both 
embedded and distributed modes. If the query is submitted through sqlline and 
cancelled from the Web UI, it behaves correctly: the session does not get 
killed and subsequent queries can be submitted in the same sqlline session. 

The same query in Drill 1.14.0 works correctly and returns the column headers 
when the query is cancelled. 

Since the query can be cancelled just fine through the Web UI, I am not 
considering this a blocker for 1.16. Very likely the sqlline upgrade in 1.15.0 
changed the behavior.  




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7195) Query returns incorrect result or does not fail when cast with is null is used in filter condition

2019-04-23 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824243#comment-16824243
 ] 

Aman Sinha commented on DRILL-7195:
---

[~vvysotskyi] regarding this issue, it is worth debating whether or not it 
really is a blocker. The first scenario, where no error is shown, does not 
seem like a blocker. The second case, with the wrong result, occurs only when 
cast_empty_string_to_null is set to true, which is not the default setting. 
Looking at the history of IS NULL simplification in Calcite, this is a 
somewhat complex issue, and I feel that any fix we do could potentially 
introduce other side effects which would prolong the release. Can we document 
this for 1.16 and fix it in master? 

> Query returns incorrect result or does not fail when cast with is null is 
> used in filter condition
> --
>
> Key: DRILL-7195
> URL: https://issues.apache.org/jira/browse/DRILL-7195
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Blocker
> Fix For: 1.16.0
>
>
> 1. For the case when a query's filter applies {{is null}} to a {{cast}} that 
> cannot be performed, the query does not fail:
> {code:sql}
> select * from dfs.tmp.`a.json` as t where cast(t.a as integer) is null;
> +---+
> | a |
> +---+
> +---+
> No rows selected (0.142 seconds)
> {code}
> where
> {noformat}
> cat /tmp/a.json
> {"a":"aaa"}
> {noformat}
> But when this condition is specified in the project list, the query fails, 
> as expected:
> {code:sql}
> select cast(t.a as integer) is null from dfs.tmp.`a.json` t;
> Error: SYSTEM ERROR: NumberFormatException: aaa
> Fragment 0:0
> Please, refer to logs for more information.
> [Error Id: ed3982ce-a12f-4d63-bc6e-cafddf28cc24 on user515050-pc:31010] 
> (state=,code=0)
> {code}
> This is a regression: in Drill 1.15 both the first and the second queries 
> failed:
> {code:sql}
> select * from dfs.tmp.`a.json` as t where cast(t.a as integer) is null;
> Error: SYSTEM ERROR: NumberFormatException: aaa
> Fragment 0:0
> Please, refer to logs for more information.
> [Error Id: 2f878f15-ddaa-48cd-9dfb-45c04db39048 on user515050-pc:31010] 
> (state=,code=0)
> {code}
> 2. For the case when {{drill.exec.functions.cast_empty_string_to_null}} is 
> enabled, this issue will cause wrong results:
> {code:sql}
> alter system set `drill.exec.functions.cast_empty_string_to_null`=true;
> select * from dfs.tmp.`a1.json` t where cast(t.a as integer) is null;
> +---+
> | a |
> +---+
> +---+
> No rows selected (1.759 seconds)
> {code}
> where
> {noformat}
> cat /tmp/a1.json 
> {"a":"1"}
> {"a":""}
> {noformat}
> Result for Drill 1.15.0:
> {code:sql}
> select * from dfs.tmp.`a1.json` t where cast(t.a as integer) is null;
> +----+
> | a  |
> +----+
> |    |
> +----+
> 1 row selected (1.724 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-19 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7187:
--
Fix Version/s: 1.17.0

> Improve selectivity estimates for range predicates when using histogram
> ---
>
> Key: DRILL-7187
> URL: https://issues.apache.org/jira/browse/DRILL-7187
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.17.0
>
>
> Two types of selectivity estimation improvements need to be done:
> 1. For range predicates on the same column, we need to collect all such 
> predicates in one group and do a single histogram lookup for them together. 
> For instance: 
> {noformat}
>  WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50
> {noformat}
> Currently, the Drill behavior is to treat each of the conjuncts 
> independently and multiply the individual selectivities. However, that will 
> not give accurate estimates. Here, we want to group the predicates on 'a' 
> together and do a single lookup, and similarly for 'b'.
> 2. NULLs are not maintained by the histogram, but when doing the selectivity 
> calculations the histogram should use the totalRowCount as the denominator 
> rather than the non-null count. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-19 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-7187:
-

 Summary: Improve selectivity estimates for range predicates when 
using histogram
 Key: DRILL-7187
 URL: https://issues.apache.org/jira/browse/DRILL-7187
 Project: Apache Drill
  Issue Type: Bug
Reporter: Aman Sinha
Assignee: Aman Sinha


Two types of selectivity estimation improvements need to be done:

1. For range predicates on the same column, we need to collect all such 
predicates in one group and do a single histogram lookup for them together. 
For instance: 
{noformat}
 WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50
{noformat}
Currently, the Drill behavior is to treat each of the conjuncts independently 
and multiply the individual selectivities. However, that will not give 
accurate estimates. Here, we want to group the predicates on 'a' together and 
do a single lookup, and similarly for 'b'.

2. NULLs are not maintained by the histogram, but when doing the selectivity 
calculations the histogram should use the totalRowCount as the denominator 
rather than the non-null count. 
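
As a sketch of the intended estimation for the example predicate (hypothetical 
table t with statistics collected):

{noformat}
SELECT count(*) FROM t
WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50;
-- Desired estimation, grouping the conjuncts per column:
--   sel(a): one histogram lookup over the combined range (10, 50]
--   sel(b): one histogram lookup over the range below 20 (b < 50 is subsumed)
--   sel(c): one lookup for the equality c = 100
-- Each selectivity uses totalRowCount as the denominator rather than the
-- non-null row count, since the histogram does not track NULLs.
{noformat}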



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (DRILL-3929) Support the ability to query database tables using external indices

2019-04-19 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha resolved DRILL-3929.
---
       Resolution: Fixed
Fix Version/s: 1.15.0

This feature was done in the scope of DRILL-6381. 

> Support the ability to query database tables using external indices   
> --
>
> Key: DRILL-3929
> URL: https://issues.apache.org/jira/browse/DRILL-3929
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Execution - Relational Operators, Query Planning & Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.15.0
>
>
> This is a placeholder for adding support in Drill to query database tables 
> using external indices.  I will add more details about the use case and a 
> preliminary design proposal.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7063) Create separate summary file for schema, totalRowCount, totalNullCount (includes maintenance)

2019-04-12 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816410#comment-16816410
 ] 

Aman Sinha commented on DRILL-7063:
---

Thanks, [~vvysotskyi]. I will work with [~vdonapati] to add the full 
description. Although the details are in the design doc, we need a simplified 
version for the documentation. 

> Create separate summary file for schema, totalRowCount, totalNullCount 
> (includes maintenance)
> -
>
> Key: DRILL-7063
> URL: https://issues.apache.org/jira/browse/DRILL-7063
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Metadata
>Reporter: Venkata Jyothsna Donapati
>Assignee: Venkata Jyothsna Donapati
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.16.0
>
>   Original Estimate: 252h
>  Remaining Estimate: 252h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7166) Count(*) queries with wildcards in table name are reading metadata cache and returning wrong results

2019-04-11 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7166:
--
Labels: ready-to-commit  (was: )

> Count(*) queries with wildcards in table name are reading metadata cache and 
> returning wrong results
> 
>
> Key: DRILL-7166
> URL: https://issues.apache.org/jira/browse/DRILL-7166
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.16.0
>Reporter: Abhishek Girish
>Assignee: Venkata Jyothsna Donapati
>Priority: Blocker
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> Tests:
> {code}
> Functional/metadata_caching/data/drill4376_1.q
> Functional/metadata_caching/data/drill4376_2.q
> Functional/metadata_caching/data/drill4376_3.q
> Functional/metadata_caching/data/drill4376_4.q
> Functional/metadata_caching/data/drill4376_5.q
> Functional/metadata_caching/data/drill4376_6.q
> Functional/metadata_caching/data/drill4376_8.q
> {code}
> Example pattern of queries:
> {code}
> select count(*) from `lineitem_hierarchical_intint/*8*/3*`;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7166) Count(*) queries with wildcards in table name are reading metadata cache and returning wrong results

2019-04-11 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7166:
--
Reviewer: Aman Sinha

> Count(*) queries with wildcards in table name are reading metadata cache and 
> returning wrong results
> 
>
> Key: DRILL-7166
> URL: https://issues.apache.org/jira/browse/DRILL-7166
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.16.0
>Reporter: Abhishek Girish
>Assignee: Venkata Jyothsna Donapati
>Priority: Blocker
> Fix For: 1.16.0
>
>
> Tests:
> {code}
> Functional/metadata_caching/data/drill4376_1.q
> Functional/metadata_caching/data/drill4376_2.q
> Functional/metadata_caching/data/drill4376_3.q
> Functional/metadata_caching/data/drill4376_4.q
> Functional/metadata_caching/data/drill4376_5.q
> Functional/metadata_caching/data/drill4376_6.q
> Functional/metadata_caching/data/drill4376_8.q
> {code}
> Example pattern of queries:
> {code}
> select count(*) from `lineitem_hierarchical_intint/*8*/3*`;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6992) Support column histogram statistics

2019-04-08 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812899#comment-16812899
 ] 

Aman Sinha commented on DRILL-6992:
---

Histogram creation and usage for the following data types is supported as of 
commit 849f896: INT, BIGINT, FLOAT, DOUBLE, TIME, DATE, TIMESTAMP, BOOLEAN.  
Marking this umbrella JIRA as fixed.  For other data types, in particular 
VARCHAR and VARBINARY, I will open a separate enhancement JIRA for a future 
release. 

> Support column histogram statistics
> ---
>
> Key: DRILL-6992
> URL: https://issues.apache.org/jira/browse/DRILL-6992
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Query Planning  Optimization
>Affects Versions: 1.15.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> As a follow-up to 
> [DRILL-1328|https://issues.apache.org/jira/browse/DRILL-1328], which adds 
> NDV (num distinct values) support and creates the framework for statistics, 
> we also need histograms.  These are needed for range predicate selectivity 
> estimation, as well as for equality predicates when the data distribution is 
> non-uniform.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7063) Create separate summary file for schema, totalRowCount, totalNullCount (includes maintenance)

2019-04-08 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7063:
--
Labels: ready-to-commit  (was: )

> Create separate summary file for schema, totalRowCount, totalNullCount 
> (includes maintenance)
> -
>
> Key: DRILL-7063
> URL: https://issues.apache.org/jira/browse/DRILL-7063
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Metadata
>Reporter: Venkata Jyothsna Donapati
>Assignee: Venkata Jyothsna Donapati
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>   Original Estimate: 252h
>  Remaining Estimate: 252h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7119) Modify selectivity calculations to use histograms for supported data types

2019-04-08 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7119:
--
Labels: ready-to-commit  (was: )

> Modify selectivity calculations to use histograms for supported data types
> --
>
> Key: DRILL-7119
> URL: https://issues.apache.org/jira/browse/DRILL-7119
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> (Please see parent JIRA for the design document)
> Once the t-digest based histogram is created, we need to read it back and 
> modify the selectivity calculations such that they use the histogram buckets 
> for range conditions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7119) Modify selectivity calculations to use histograms for supported data types

2019-04-08 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7119:
--
Summary: Modify selectivity calculations to use histograms for supported 
data types  (was: Modify selectivity calculations to use histograms)

> Modify selectivity calculations to use histograms for supported data types
> --
>
> Key: DRILL-7119
> URL: https://issues.apache.org/jira/browse/DRILL-7119
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> (Please see parent JIRA for the design document)
> Once the t-digest based histogram is created, we need to read it back and 
> modify the selectivity calculations such that they use the histogram buckets 
> for range conditions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-3846) Metadata Caching : A count(*) query took more time with the cache in place

2019-04-08 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812807#comment-16812807
 ] 

Aman Sinha commented on DRILL-3846:
---

Let's try this after DRILL-7064 is fixed. 

> Metadata Caching : A count(*) query took more time with the cache in place
> --
>
> Key: DRILL-3846
> URL: https://issues.apache.org/jira/browse/DRILL-3846
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Reporter: Rahul Challapalli
>Assignee: Aman Sinha
>Priority: Critical
> Fix For: 1.16.0
>
>
> git.commit.id.abbrev=3c89b30
> I have a folder with 10k complex files. The generated cache file is around 
> 486 MB. The below numbers indicate that we regressed in terms of performance 
> when we generated the metadata cache
> {code}
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from 
> `complex_sparse_5files`;
> +--+
> |  EXPR$0  |
> +--+
> | 100  |
> +--+
> 1 row selected (30.835 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> refresh table metadata 
> `complex_sparse_5files`;
> +---+-+
> |  ok   |   summary   |
> +---+-+
> | true  | Successfully updated metadata for table complex_sparse_5files.  |
> +---+-+
> 1 row selected (10.69 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from 
> `complex_sparse_5files`;
> +--+
> |  EXPR$0  |
> +--+
> | 100  |
> +--+
> 1 row selected (47.614 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-3846) Metadata Caching : A count(*) query took more time with the cache in place

2019-04-08 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha reassigned DRILL-3846:
-

Assignee: Aman Sinha  (was: Venkata Jyothsna Donapati)

> Metadata Caching : A count(*) query took more time with the cache in place
> --
>
> Key: DRILL-3846
> URL: https://issues.apache.org/jira/browse/DRILL-3846
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Reporter: Rahul Challapalli
>Assignee: Aman Sinha
>Priority: Critical
> Fix For: 1.16.0
>
>
> git.commit.id.abbrev=3c89b30
> I have a folder with 10k complex files. The generated cache file is around 
> 486 MB. The below numbers indicate that we regressed in terms of performance 
> when we generated the metadata cache
> {code}
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from 
> `complex_sparse_5files`;
> +--+
> |  EXPR$0  |
> +--+
> | 100  |
> +--+
> 1 row selected (30.835 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> refresh table metadata 
> `complex_sparse_5files`;
> +---+-+
> |  ok   |   summary   |
> +---+-+
> | true  | Successfully updated metadata for table complex_sparse_5files.  |
> +---+-+
> 1 row selected (10.69 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from 
> `complex_sparse_5files`;
> +--+
> |  EXPR$0  |
> +--+
> | 100  |
> +--+
> 1 row selected (47.614 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7114) ANALYZE command generates warnings for stats file and materialization

2019-04-05 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811076#comment-16811076
 ] 

Aman Sinha commented on DRILL-7114:
---

[~gparai], for tracking purposes, please provide the commit id for the fix, 
since it was fixed as part of another PR.  

> ANALYZE command generates warnings for stats file and materialization
> -
>
> Key: DRILL-7114
> URL: https://issues.apache.org/jira/browse/DRILL-7114
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Gautam Parai
>Priority: Minor
> Fix For: 1.16.0
>
>
> When I run ANALYZE, I see warnings in the log file as shown below. The 
> ANALYZE command should not try to read the stats file or materialize the 
> stats.  
> {noformat}
> 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file.
> java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does 
> not exist
> 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. 
> Continuing without stats.
> java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does 
> not exist
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7119) Modify selectivity calculations to use histograms

2019-04-04 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7119:
--
Reviewer: Gautam Parai

> Modify selectivity calculations to use histograms
> -
>
> Key: DRILL-7119
> URL: https://issues.apache.org/jira/browse/DRILL-7119
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> (Please see parent JIRA for the design document)
> Once the t-digest based histogram is created, we need to read it back and 
> modify the selectivity calculations such that they use the histogram buckets 
> for range conditions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (DRILL-7152) Histogram creation throws exception for all nulls column

2019-04-02 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha resolved DRILL-7152.
---
Resolution: Fixed

Fixed in 54384a9. 

> Histogram creation throws exception for all nulls column
> 
>
> Key: DRILL-7152
> URL: https://issues.apache.org/jira/browse/DRILL-7152
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> The ANALYZE command fails when creating a histogram for a table with a 
> column containing all NULLs. 
> Analyze table `table_stats/parquet_col_nulls` compute statistics;
> {noformat}
> Error: SYSTEM ERROR: NullPointerException
>   (org.apache.drill.common.exceptions.DrillRuntimeException) Failed to get 
> TDigest output
> 
> org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputRecordValues():1085
> 
> org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputToBatchPrev():492
> org.apache.drill.exec.test.generated.StreamingAggregatorGen32.doWork():224
> 
> org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():288
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> 
> org.apache.drill.exec.physical.impl.statistics.StatisticsMergeBatch.innerNext():358
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> 
> org.apache.drill.exec.physical.impl.unpivot.UnpivotMapsRecordBatch.innerNext():106
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> 
> org.apache.drill.exec.physical.impl.StatisticsWriterRecordBatch.innerNext():96
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> 
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():83
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1669
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> java.lang.Thread.run():748
> {noformat}
> This table has 1 column with all NULL values:
> {noformat}
> apache drill (dfs.drilltestdir)> select * from 
> `table_stats/parquet_col_nulls` limit 20;
> +--+--+
> | col1 | col2 |
> +--+--+
> | 0| null |
> | 1| null |
> | 2| null |
> | 3| null |
> | 4| null |
> | 5| null |
> | 6| null |
> | 7| null |
> | 8| null |
> | 9| null |
> | 10   | null |
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7152) Histogram creation throws exception for all nulls column

2019-04-02 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7152:
--
Reviewer: Gautam Parai

> Histogram creation throws exception for all nulls column
> 
>
> Key: DRILL-7152
> URL: https://issues.apache.org/jira/browse/DRILL-7152
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> The ANALYZE command fails when creating a histogram for a table with a 
> column containing all NULLs. 
> Analyze table `table_stats/parquet_col_nulls` compute statistics;
> {noformat}
> Error: SYSTEM ERROR: NullPointerException
>   (org.apache.drill.common.exceptions.DrillRuntimeException) Failed to get 
> TDigest output
> 
> org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputRecordValues():1085
> 
> org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputToBatchPrev():492
> org.apache.drill.exec.test.generated.StreamingAggregatorGen32.doWork():224
> 
> org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():288
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> 
> org.apache.drill.exec.physical.impl.statistics.StatisticsMergeBatch.innerNext():358
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> 
> org.apache.drill.exec.physical.impl.unpivot.UnpivotMapsRecordBatch.innerNext():106
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> 
> org.apache.drill.exec.physical.impl.StatisticsWriterRecordBatch.innerNext():96
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> 
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():83
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1669
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> java.lang.Thread.run():748
> {noformat}
> This table has 1 column with all NULL values:
> {noformat}
> apache drill (dfs.drilltestdir)> select * from 
> `table_stats/parquet_col_nulls` limit 20;
> +--+--+
> | col1 | col2 |
> +--+--+
> | 0| null |
> | 1| null |
> | 2| null |
> | 3| null |
> | 4| null |
> | 5| null |
> | 6| null |
> | 7| null |
> | 8| null |
> | 9| null |
> | 10   | null |
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7152) Histogram creation throws exception for all nulls column

2019-04-02 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-7152:
-

 Summary: Histogram creation throws exception for all nulls column
 Key: DRILL-7152
 URL: https://issues.apache.org/jira/browse/DRILL-7152
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning  Optimization
Reporter: Aman Sinha
Assignee: Aman Sinha
 Fix For: 1.16.0


The ANALYZE command fails when creating a histogram for a table with a column 
containing all NULLs. 

Analyze table `table_stats/parquet_col_nulls` compute statistics;

{noformat}
Error: SYSTEM ERROR: NullPointerException
  (org.apache.drill.common.exceptions.DrillRuntimeException) Failed to get 
TDigest output

org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputRecordValues():1085

org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputToBatchPrev():492
org.apache.drill.exec.test.generated.StreamingAggregatorGen32.doWork():224

org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():288
org.apache.drill.exec.record.AbstractRecordBatch.next():186
org.apache.drill.exec.record.AbstractRecordBatch.next():126
org.apache.drill.exec.record.AbstractRecordBatch.next():116

org.apache.drill.exec.physical.impl.statistics.StatisticsMergeBatch.innerNext():358
org.apache.drill.exec.record.AbstractRecordBatch.next():186
org.apache.drill.exec.record.AbstractRecordBatch.next():126
org.apache.drill.exec.record.AbstractRecordBatch.next():116

org.apache.drill.exec.physical.impl.unpivot.UnpivotMapsRecordBatch.innerNext():106
org.apache.drill.exec.record.AbstractRecordBatch.next():186
org.apache.drill.exec.record.AbstractRecordBatch.next():126
org.apache.drill.exec.record.AbstractRecordBatch.next():116

org.apache.drill.exec.physical.impl.StatisticsWriterRecordBatch.innerNext():96
org.apache.drill.exec.record.AbstractRecordBatch.next():186
org.apache.drill.exec.record.AbstractRecordBatch.next():126
org.apache.drill.exec.record.AbstractRecordBatch.next():116
org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
org.apache.drill.exec.record.AbstractRecordBatch.next():186
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():83
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1669
org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1149
java.util.concurrent.ThreadPoolExecutor$Worker.run():624
java.lang.Thread.run():748
{noformat}

This table has 1 column with all NULL values:

{noformat}
apache drill (dfs.drilltestdir)> select * from `table_stats/parquet_col_nulls` 
limit 20;
+--+--+
| col1 | col2 |
+--+--+
| 0| null |
| 1| null |
| 2| null |
| 3| null |
| 4| null |
| 5| null |
| 6| null |
| 7| null |
| 8| null |
| 9| null |
| 10   | null |
{noformat}
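The failure occurs when the streaming aggregate emits t-digest output for a 
column that has zero non-null values. Below is a hedged sketch of the kind of 
guard needed; it uses the open-source t-digest library (com.tdunning) that the 
histogram design is based on, and the class, method, and variable names are 
illustrative assumptions, not the generated aggregator code that actually 
threw:

{code:java}
// Hedged sketch: guard the t-digest output path against an all-NULL column.
// With zero non-null inputs there is no digest state to serialize, so emit
// an empty result instead of dereferencing it. Names are illustrative.
import java.nio.ByteBuffer;

import com.tdunning.math.stats.TDigest;

class TDigestOutputGuard {
  static byte[] output(TDigest digest, long nonNullCount) {
    if (digest == null || nonNullCount == 0) {
      return new byte[0];           // no histogram for an all-NULL column
    }
    ByteBuffer buf = ByteBuffer.allocate(digest.byteSize());
    digest.asBytes(buf);            // serialize the digest state
    return buf.array();
  }
}
{code}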



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal) and date/time/timestamp

2019-04-01 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7117:
--
Labels: doc-impacting  (was: )

> Support creation of histograms for numeric data types (except Decimal) and 
> date/time/timestamp
> --
>
> Key: DRILL-7117
> URL: https://issues.apache.org/jira/browse/DRILL-7117
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.16.0
>
>
> This JIRA is specific to creating histograms for numeric data types: INT, 
> BIGINT, FLOAT4, FLOAT8  and their corresponding nullable/non-nullable 
> versions.  Additionally, since DATE/TIME/TIMESTAMP are internally stored as 
> longs, we should allow the same numeric type histogram creation for these 
> data types as well. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7064) Leverage the summary's totalRowCount and totalNullCount for COUNT() queries (also prevent eager expansion of files)

2019-03-31 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7064:
--
Description: 
This sub-task is meant to leverage the Parquet metadata cache's summary stats: 
totalRowCount (across all files and row groups) and the per-column 
totalNullCount (across all files and row groups) to answer plain COUNT 
aggregation queries without Group-By.  Such queries are currently converted to 
a DirectScan by the ConvertCountToDirectScanRule, which utilizes the row group 
metadata; however, this rule is applied on Drill logical rels and converts the 
logical plan to a physical plan with DirectScanPrel. This is too late, since 
the DrillScanRel created during logical planning has already read the entire 
metadata cache file along with its full list of row group entries. The 
metadata cache file can grow quite large, so this does not scale. 

The solution is to use the Metadata Summary file created in DRILL-7063 and to 
add a new rule that applies early on, operating on the Calcite logical rels 
instead of the Drill logical rels, thereby preventing eager expansion of the 
list of files/row groups.   

We will not remove the existing rule. The existing rule will continue to 
operate as before because it is possible that after some transformations, we 
still want to apply the optimizations for COUNT queries. 
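To illustrate the intent, here is a minimal sketch of how a plain COUNT 
aggregate can be answered from the summary stats alone; the MetadataSummary 
class and its fields are hypothetical names based on the description above, 
not Drill's actual metadata classes:

{code:java}
// Minimal sketch: answering plain COUNT aggregates from summary metadata.
// COUNT(*) is just the summary's totalRowCount; COUNT(col) subtracts the
// per-column null count. Class and field names here are assumptions.
import java.util.Map;

class MetadataSummary {
  long totalRowCount;                 // across all files and row groups
  Map<String, Long> totalNullCount;   // per-column, across all row groups

  long countStar() {
    return totalRowCount;             // COUNT(*) counts every row
  }

  long countColumn(String column) {
    // COUNT(col) counts only non-null values of the column
    Long nulls = totalNullCount.get(column);
    return nulls == null ? totalRowCount : totalRowCount - nulls;
  }
}
{code}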

> Leverage the summary's totalRowCount and totalNullCount for COUNT() queries 
> (also prevent eager expansion of files)
> ---
>
> Key: DRILL-7064
> URL: https://issues.apache.org/jira/browse/DRILL-7064
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Metadata
>Reporter: Venkata Jyothsna Donapati
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> This sub-task is meant to leverage the Parquet metadata cache's summary 
> stats: totalRowCount (across all files and row groups) and the per-column 
> totalNullCount (across all files and row groups) to answer plain COUNT 
> aggregation queries without Group-By.  Such queries are currently converted 
> to a DirectScan by the ConvertCountToDirectScanRule, which utilizes the row 
> group metadata; however, this rule is applied on Drill logical rels and 
> converts the logical plan to a physical plan with DirectScanPrel. This is too 
> late, since the DrillScanRel created during logical planning has already read 
> the entire metadata cache file along with its full list of row group entries. 
> The metadata cache file can grow quite large, so this does not scale. 
> The solution is to use the Metadata Summary file created in DRILL-7063 and to 
> add a new rule that applies early on, operating on the Calcite logical rels 
> instead of the Drill logical rels, thereby preventing eager expansion of the 
> list of files/row groups.   
> We will not remove the existing rule. The existing rule will continue to 
> operate as before because it is possible that after some transformations, we 
> still want to apply the optimizations for COUNT queries. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-7064) Leverage the summary's totalRowCount and totalNullCount for COUNT() queries (also prevent eager expansion of files)

2019-03-31 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha reassigned DRILL-7064:
-

Assignee: Aman Sinha  (was: Venkata Jyothsna Donapati)

> Leverage the summary's totalRowCount and totalNullCount for COUNT() queries 
> (also prevent eager expansion of files)
> ---
>
> Key: DRILL-7064
> URL: https://issues.apache.org/jira/browse/DRILL-7064
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Metadata
>Reporter: Venkata Jyothsna Donapati
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal) and date/time/timestamp

2019-03-23 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7117:
--
Reviewer: Gautam Parai

> Support creation of histograms for numeric data types (except Decimal) and 
> date/time/timestamp
> --
>
> Key: DRILL-7117
> URL: https://issues.apache.org/jira/browse/DRILL-7117
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> This JIRA is specific to creating histograms for numeric data types: INT, 
> BIGINT, FLOAT4, FLOAT8  and their corresponding nullable/non-nullable 
> versions.  Additionally, since DATE/TIME/TIMESTAMP are internally stored as 
> longs, we should allow the same numeric type histogram creation for these 
> data types as well. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal) and date/time/timestamp

2019-03-23 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7117:
--
Description: This JIRA is specific to creating histograms for numeric data 
types: INT, BIGINT, FLOAT4, FLOAT8  and their corresponding 
nullable/non-nullable versions.  Additionally, since DATE/TIME/TIMESTAMP are 
internally stored as longs, we should allow the same numeric type histogram 
creation for these data types as well.   (was: This JIRA is specific to 
creating histograms for numeric data types: INT, BIGINT, FLOAT4, FLOAT8  and 
their corresponding nullable/non-nullable versions. )

> Support creation of histograms for numeric data types (except Decimal) and 
> date/time/timestamp
> --
>
> Key: DRILL-7117
> URL: https://issues.apache.org/jira/browse/DRILL-7117
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> This JIRA is specific to creating histograms for numeric data types: INT, 
> BIGINT, FLOAT4, FLOAT8  and their corresponding nullable/non-nullable 
> versions.  Additionally, since DATE/TIME/TIMESTAMP are internally stored as 
> longs, we should allow the same numeric type histogram creation for these 
> data types as well. 
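To make the last point concrete, here is a hedged sketch of the reduction; it 
assumes a millis-based encoding for the temporal types and uses plain 
java.time for illustration rather than Drill's vector readers:

{code:java}
// Hedged sketch: DATE/TIME/TIMESTAMP reduce to longs, so the numeric
// histogram path can ingest them unchanged. Assumes a millis-based
// encoding; illustrative only, not Drill's vector code.
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;

class TemporalAsLong {
  static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

  static long fromDate(LocalDate d) {
    return d.toEpochDay() * MILLIS_PER_DAY;   // days since epoch, in millis
  }

  static long fromTime(LocalTime t) {
    return t.toNanoOfDay() / 1_000_000L;      // millis within the day
  }

  static long fromTimestamp(LocalDateTime ts) {
    return ts.toInstant(ZoneOffset.UTC).toEpochMilli();
  }
}
{code}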



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal) and date/time/timestamp

2019-03-23 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7117:
--
Summary: Support creation of histograms for numeric data types (except 
Decimal) and date/time/timestamp  (was: Support creation of histograms for 
numeric data types (except Decimal))

> Support creation of histograms for numeric data types (except Decimal) and 
> date/time/timestamp
> --
>
> Key: DRILL-7117
> URL: https://issues.apache.org/jira/browse/DRILL-7117
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> This JIRA is specific to creating histograms for numeric data types: INT, 
> BIGINT, FLOAT4, FLOAT8  and their corresponding nullable/non-nullable 
> versions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal)

2019-03-23 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799742#comment-16799742
 ] 

Aman Sinha edited comment on DRILL-7117 at 3/23/19 6:16 PM:


I have a work-in-progress implementation, and initial results for the 
histogram using the proposed design (see design doc in parent JIRA) are 
encouraging.  I analyzed a few columns from the TPC-DS table Web_Sales; in 
particular, for the skewed column ws_ext_tax, I compared the output of the 
NTILE() function with what is generated by the histogram implementation.  They 
are very close. 

Here's the NTILE() output (this is the baseline): 
{noformat}

select x as ntile, min(ws_ext_tax) as min, max(ws_ext_tax) as max, count(*) as 
count
from (select ws_ext_tax, ntile(10) over (order by ws_ext_tax) as x  from 
dfs.tmp.web_sales1)
 group by x order by x;

|ntile  | min   |   max| count  |
| 1  | 0.0 | 0.0  | 71939  |
| 2  | 0.0 | 3.05 | 71939  |
| 3  | 3.05| 9.28 | 71939  |
| 4  | 9.28| 19.24| 71939  |
| 5  | 19.24   | 34.51| 71938  |
| 6  | 34.51   | 57.96| 71938  |
| 7  | 57.96   | 95.01| 71938  |
| 8  | 95.01   | 158.71   | 71938  |
| 9  | 158.71  | 292.44   | 71938  |
| 10 | 292.44  | 2433.41  | 71938  |

{noformat}

Here's the Histogram output after running ANALYZE and using 10 buckets: 

{noformat}
{
  "column" : "`ws_ext_tax`",
  "majortype" : {
"type" : "FLOAT4",
"mode" : "OPTIONAL"
  },
  "schema" : 1.0,
  "rowcount" : 719384.0,
  "nonnullrowcount" : 719200.0,
  "ndv" : 68400,
  "avgwidth" : 4.0,
  "histogram" : {
"category" : "numeric-histogram",
"numRowsPerBucket" : 71920,
"buckets" : [ 0.0, 0.0, 3.0012382450412556, 9.294927927208075, 
19.451534502867155, 34.26953338960548, 57.30967536502495, 95.25307998787451, 
158.88263313484484, 289.2295684860717, 24
33.409910105169 ]
  }
}
{noformat}

Note that the bucket boundaries closely align with the baseline and there's 
non-uniformity in the data distribution. 
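For readers following along, here is a minimal sketch of how equi-depth bucket 
boundaries like those above can be derived from a t-digest; it uses the 
open-source t-digest library (com.tdunning) that the design references, and is 
an illustration rather than the actual ANALYZE code path:

{code:java}
// Minimal sketch: derive equi-depth bucket boundaries from a t-digest.
// Uses the open-source com.tdunning t-digest library referenced by the
// design; illustration only, not Drill's ANALYZE implementation.
import com.tdunning.math.stats.TDigest;

public class EquiDepthSketch {
  public static double[] buckets(double[] values, int numBuckets) {
    TDigest digest = TDigest.createDigest(100);   // compression parameter
    for (double v : values) {
      digest.add(v);                              // NULLs would be skipped
    }
    // n equi-depth buckets need n+1 boundaries at quantiles 0, 1/n, ..., 1
    double[] bounds = new double[numBuckets + 1];
    for (int i = 0; i <= numBuckets; i++) {
      bounds[i] = digest.quantile((double) i / numBuckets);
    }
    return bounds;
  }
}
{code}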


was (Author: amansinha100):
I have a work-in-progress implementation, and initial results for the 
histogram using the proposed design (see design doc in parent JIRA) are 
encouraging.  I analyzed a few columns from the TPC-DS table Web_Sales; in 
particular, for the skewed column ws_ext_tax, I compared the output of the 
NTILE() function with what is generated by the histogram implementation.  They 
are very close. 

Here's the NTILE() output (this is the baseline): 
{noformat}


select x as ntile, min(ws_ext_tax) as min, max(ws_ext_tax) as max, count(*) as 
count
from (select ws_ext_tax, ntile(10) over (order by ws_ext_tax) as x  from 
dfs.tmp.web_sales1)
 group by x order by x;

||ntile  || min   ||   max|| count  ||
| 1  | 0.0 | 0.0  | 71939  |
| 2  | 0.0 | 3.05 | 71939  |
| 3  | 3.05| 9.28 | 71939  |
| 4  | 9.28| 19.24| 71939  |
| 5  | 19.24   | 34.51| 71938  |
| 6  | 34.51   | 57.96| 71938  |
| 7  | 57.96   | 95.01| 71938  |
| 8  | 95.01   | 158.71   | 71938  |
| 9  | 158.71  | 292.44   | 71938  |
| 10 | 292.44  | 2433.41  | 71938  |

{noformat}

Here's the Histogram output after running ANALYZE and using 10 buckets: 

{noformat}
{
  "column" : "`ws_ext_tax`",
  "majortype" : {
"type" : "FLOAT4",
"mode" : "OPTIONAL"
  },
  "schema" : 1.0,
  "rowcount" : 719384.0,
  "nonnullrowcount" : 719200.0,
  "ndv" : 68400,
  "avgwidth" : 4.0,
  "histogram" : {
"category" : "numeric-histogram",
"numRowsPerBucket" : 71920,
"buckets" : [ 0.0, 0.0, 3.0012382450412556, 9.294927927208075, 
19.451534502867155, 34.26953338960548, 57.30967536502495, 95.25307998787451, 
158.88263313484484, 289.2295684860717, 24
33.409910105169 ]
  }
}
{noformat}

Note that the bucket boundaries closely align with the baseline and there's 
non-uniformity in the data distribution. 

> Support creation of histograms for numeric data types (except Decimal)
> --
>
> Key: DRILL-7117
> URL: https://issues.apache.org/jira/browse/DRILL-7117
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> This JIRA is specific to creating histograms for numeric data types: INT, 
> BIGINT, FLOAT4, FLOAT8  and their corresponding nullable/non-nullable 
> versions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal)

2019-03-23 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799742#comment-16799742
 ] 

Aman Sinha commented on DRILL-7117:
---

I have a work-in-progress implementation, and initial results for the 
histogram using the proposed design (see design doc in parent JIRA) are 
encouraging.  I analyzed a few columns from the TPC-DS table Web_Sales; in 
particular, for the skewed column ws_ext_tax, I compared the output of the 
NTILE() function with what is generated by the histogram implementation.  They 
are very close. 

Here's the NTILE() output (this is the baseline): 
{noformat}


select x as ntile, min(ws_ext_tax) as min, max(ws_ext_tax) as max, count(*) as 
count
from (select ws_ext_tax, ntile(10) over (order by ws_ext_tax) as x  from 
dfs.tmp.web_sales1)
 group by x order by x;

||ntile  || min   ||   max|| count  ||
| 1  | 0.0 | 0.0  | 71939  |
| 2  | 0.0 | 3.05 | 71939  |
| 3  | 3.05| 9.28 | 71939  |
| 4  | 9.28| 19.24| 71939  |
| 5  | 19.24   | 34.51| 71938  |
| 6  | 34.51   | 57.96| 71938  |
| 7  | 57.96   | 95.01| 71938  |
| 8  | 95.01   | 158.71   | 71938  |
| 9  | 158.71  | 292.44   | 71938  |
| 10 | 292.44  | 2433.41  | 71938  |

{noformat}

Here's the Histogram output after running ANALYZE and using 10 buckets: 

{noformat}
{
  "column" : "`ws_ext_tax`",
  "majortype" : {
"type" : "FLOAT4",
"mode" : "OPTIONAL"
  },
  "schema" : 1.0,
  "rowcount" : 719384.0,
  "nonnullrowcount" : 719200.0,
  "ndv" : 68400,
  "avgwidth" : 4.0,
  "histogram" : {
"category" : "numeric-histogram",
"numRowsPerBucket" : 71920,
"buckets" : [ 0.0, 0.0, 3.0012382450412556, 9.294927927208075, 
19.451534502867155, 34.26953338960548, 57.30967536502495, 95.25307998787451, 
158.88263313484484, 289.2295684860717, 24
33.409910105169 ]
  }
}
{noformat}

Note that the bucket boundaries closely align with the baseline and there's 
non-uniformity in the data distribution. 

> Support creation of histograms for numeric data types (except Decimal)
> --
>
> Key: DRILL-7117
> URL: https://issues.apache.org/jira/browse/DRILL-7117
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> This JIRA is specific to creating histograms for numeric data types: INT, 
> BIGINT, FLOAT4, FLOAT8  and their corresponding nullable/non-nullable 
> versions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7109) Statistics adds external sort, which spills to disk

2019-03-22 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799215#comment-16799215
 ] 

Aman Sinha commented on DRILL-7109:
---

[~rhou], for all such planning-related issues, please add the EXPLAIN plan 
with and without statistics for faster diagnosis.  

> Statistics adds external sort, which spills to disk
> ---
>
> Key: DRILL-7109
> URL: https://issues.apache.org/jira/browse/DRILL-7109
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.16.0
>
>
> TPCH query 4 with sf 100 runs many times slower.  One issue is that an extra 
> external sort has been added, and both external sorts spill to disk.
> Also, the hash join sees 100x more data.
> Here is the query:
> {noformat}
> select
>   o.o_orderpriority,
>   count(*) as order_count
> from
>   orders o
> where
>   o.o_orderdate >= date '1996-10-01'
>   and o.o_orderdate < date '1996-10-01' + interval '3' month
>   and 
>   exists (
> select
>   *
> from
>   lineitem l
> where
>   l.l_orderkey = o.o_orderkey
>   and l.l_commitdate < l.l_receiptdate
>   )
> group by
>   o.o_orderpriority
> order by
>   o.o_orderpriority;
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6852) Adapt current Parquet Metadata cache implementation to use Drill Metastore API

2019-03-20 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6852:
--
Labels: ready-to-commit  (was: )

> Adapt current Parquet Metadata cache implementation to use Drill Metastore API
> --
>
> Key: DRILL-6852
> URL: https://issues.apache.org/jira/browse/DRILL-6852
> Project: Apache Drill
>  Issue Type: Sub-task
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> According to the design document for DRILL-6552, existing metadata cache API 
> should be adapted to use generalized API for metastore and parquet metadata 
> cache will be presented as the implementation of metastore API.
> The aim of this Jira is to refactor Parquet Metadata cache implementation and 
> adapt it to use Drill Metastore API.
> Execution plan:
>  - Refactor AbstractParquetGroupScan and its implementations to use metastore 
> metadata classes. Store Drill data types in metadata files for Parquet tables.
>  - Store the least restrictive type instead of the current first file's 
> column data type.
>  - Rework logic in AbstractParquetGroupScan to allow filtering at different 
> metadata layers: partition, file, row group, etc. The same for pushing the 
> limit.
>  - Implement logic to convert existing parquet metadata to metastore metadata 
> to preserve backward compatibility.
>  - Implement fetching metadata only when it is needed (for filtering, limit, 
> count(*) etc.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7113) Issue with filtering null values from MapRDB-JSON

2019-03-19 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7113:
--
Labels: ready-to-commit  (was: )

> Issue with filtering null values from MapRDB-JSON
> -
>
> Key: DRILL-7113
> URL: https://issues.apache.org/jira/browse/DRILL-7113
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.15.0
>Reporter: Hanumath Rao Maduri
>Assignee: Aman Sinha
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> When Drill queries documents from MapRDB-JSON that contain fields with a 
> null value, it returns the wrong result.
>  The issue was reproduced locally.
> Please find the repro steps:
>  [1] Create a MaprDBJSON table. Say '/tmp/dmdb2/'.
> [2] Insert the following sample records to table:
> {code:java}
> insert --table /tmp/dmdb2/ --value '{"_id": "1", "label": "person", 
> "confidence": 0.24}'
> insert --table /tmp/dmdb2/ --value '{"_id": "2", "label": "person2"}'
> insert --table /tmp/dmdb2/ --value '{"_id": "3", "label": "person3", 
> "confidence": 0.54}'
> insert --table /tmp/dmdb2/ --value '{"_id": "4", "label": "person4", 
> "confidence": null}'
> {code}
> We can see that for field 'confidence' document 1 has value 0.24, document 3 
> has value 0.54, document 2 does not have the field and document 4 has the 
> field with value null.
> [3] Query the table from DRILL.
>  *Query 1:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person   | 0.24|
> | person2  | null|
> | person3  | 0.54|
> | person4  | null|
> +--+-+
> 4 rows selected (0.2 seconds)
> {code}
> *Query 2:*
> {code:java}
> 0: jdbc:drill:> select * from dfs.tmp.dmdb2;
> +--+-+--+
> | _id  | confidence  |  label   |
> +--+-+--+
> | 1| 0.24| person   |
> | 2| null| person2  |
> | 3| 0.54| person3  |
> | 4| null| person4  |
> +--+-+--+
> 4 rows selected (0.174 seconds)
> {code}
> *Query 3:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence 
> is not null;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person   | 0.24|
> | person3  | 0.54|
> | person4  | null|
> +--+-+
> 3 rows selected (0.192 seconds)
> {code}
> *Query 4:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence 
> is  null;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person2  | null|
> +--+-+
> 1 row selected (0.262 seconds)
> {code}
> As you can see, Query 3, which queries for all documents where confidence is 
> not null, returns a document with a null value.
> *Other observation:*
>  Querying the same data using DRILL without MapRDB provides the correct 
> result.
>  For example, create 4 different JSON files with following data:
> {"label": "person", "confidence": 0.24} \{"label": "person2"} \{"label": 
> "person3", "confidence": 0.54} \{"label": "person4", "confidence": null}
> Query it directly using DRILL:
> *Query 5:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.t2;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person4  | null|
> | person3  | 0.54|
> | person2  | null|
> | person   | 0.24|
> +--+-+
> 4 rows selected (0.203 seconds)
> {code}
> *Query 6:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is 
> null;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person4  | null|
> | person2  | null|
> +--+-+
> 2 rows selected (0.352 seconds)
> {code}
> *Query 7:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is 
> not null;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person3  | 0.54|
> | person   | 0.24|
> +--+-+
> 2 rows selected (0.265 seconds)
> {code}
> As seen in query 6 & 7, it returns the correct result.
> I believe the issue is at the MapRDB layer where it is fetching the results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7119) Modify selectivity calculations to use histograms

2019-03-19 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-7119:
-

 Summary: Modify selectivity calculations to use histograms
 Key: DRILL-7119
 URL: https://issues.apache.org/jira/browse/DRILL-7119
 Project: Apache Drill
  Issue Type: Sub-task
  Components: Query Planning  Optimization
Reporter: Aman Sinha
Assignee: Aman Sinha
 Fix For: 1.16.0


(Please see parent JIRA for the design document)
Once the t-digest based histogram is created, we need to read it back and 
modify the selectivity calculations such that they use the histogram buckets 
for range conditions.
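As a rough illustration of what using the buckets for a range condition means, 
here is a hedged sketch over equi-depth bucket boundaries (each bucket holding 
roughly the same number of rows); this is a simplified model, not Drill's 
actual selectivity code:

{code:java}
// Sketch of range-predicate selectivity over an equi-depth histogram.
// Each of the n buckets holds ~1/n of the rows, so selectivity of
// (col <= x) is the count of fully covered buckets plus a linearly
// interpolated fraction of the bucket containing x. Illustrative only.
public class HistogramSelectivity {
  public static double lessOrEqual(double[] bounds, double x) {
    int n = bounds.length - 1;              // number of buckets
    if (x < bounds[0]) {
      return 0.0;                           // below the histogram range
    }
    if (x >= bounds[n]) {
      return 1.0;                           // above the histogram range
    }
    double covered = 0.0;
    for (int i = 0; i < n; i++) {
      if (x >= bounds[i + 1]) {
        covered += 1.0;                     // bucket fully below x
      } else {
        double width = bounds[i + 1] - bounds[i];
        if (width > 0) {                    // interpolate within the bucket
          covered += (x - bounds[i]) / width;
        }
        break;
      }
    }
    return covered / n;                     // estimated fraction of rows
  }
}
{code}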



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal)

2019-03-19 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-7117:
-

 Summary: Support creation of histograms for numeric data types 
(except Decimal)
 Key: DRILL-7117
 URL: https://issues.apache.org/jira/browse/DRILL-7117
 Project: Apache Drill
  Issue Type: Sub-task
  Components: Query Planning  Optimization
Reporter: Aman Sinha
Assignee: Aman Sinha
 Fix For: 1.16.0


This JIRA is specific to creating histograms for numeric data types: INT, 
BIGINT, FLOAT4, FLOAT8  and their corresponding nullable/non-nullable versions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7063) Create separate summary file for schema, totalRowCount, totalNullCount (includes maintenance)

2019-03-19 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7063:
--
Reviewer: Aman Sinha

> Create separate summary file for schema, totalRowCount, totalNullCount 
> (includes maintenance)
> -
>
> Key: DRILL-7063
> URL: https://issues.apache.org/jira/browse/DRILL-7063
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Metadata
>Reporter: Venkata Jyothsna Donapati
>Assignee: Venkata Jyothsna Donapati
>Priority: Major
> Fix For: 1.16.0
>
>   Original Estimate: 252h
>  Remaining Estimate: 252h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7113) Issue with filtering null values from MapRDB-JSON

2019-03-19 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7113:
--
Fix Version/s: (was: 1.17.0)

> Issue with filtering null values from MapRDB-JSON
> -
>
> Key: DRILL-7113
> URL: https://issues.apache.org/jira/browse/DRILL-7113
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.15.0
>Reporter: Hanumath Rao Maduri
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> When Drill queries documents from MapRDB-JSON that contain fields with a 
> null value, it returns the wrong result.
>  The issue was reproduced locally.
> Please find the repro steps:
>  [1] Create a MaprDBJSON table. Say '/tmp/dmdb2/'.
> [2] Insert the following sample records to table:
> {code:java}
> insert --table /tmp/dmdb2/ --value '{"_id": "1", "label": "person", 
> "confidence": 0.24}'
> insert --table /tmp/dmdb2/ --value '{"_id": "2", "label": "person2"}'
> insert --table /tmp/dmdb2/ --value '{"_id": "3", "label": "person3", 
> "confidence": 0.54}'
> insert --table /tmp/dmdb2/ --value '{"_id": "4", "label": "person4", 
> "confidence": null}'
> {code}
> We can see that for field 'confidence' document 1 has value 0.24, document 3 
> has value 0.54, document 2 does not have the field and document 4 has the 
> field with value null.
> [3] Query the table from DRILL.
>  *Query 1:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person   | 0.24|
> | person2  | null|
> | person3  | 0.54|
> | person4  | null|
> +--+-+
> 4 rows selected (0.2 seconds)
> {code}
> *Query 2:*
> {code:java}
> 0: jdbc:drill:> select * from dfs.tmp.dmdb2;
> +--+-+--+
> | _id  | confidence  |  label   |
> +--+-+--+
> | 1| 0.24| person   |
> | 2| null| person2  |
> | 3| 0.54| person3  |
> | 4| null| person4  |
> +--+-+--+
> 4 rows selected (0.174 seconds)
> {code}
> *Query 3:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence 
> is not null;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person   | 0.24|
> | person3  | 0.54|
> | person4  | null|
> +--+-+
> 3 rows selected (0.192 seconds)
> {code}
> *Query 4:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence 
> is  null;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person2  | null|
> +--+-+
> 1 row selected (0.262 seconds)
> {code}
> As you can see, Query 3, which queries for all documents where confidence is 
> not null, returns a document with a null value.
> *Other observation:*
>  Querying the same data using DRILL without MapRDB provides the correct 
> result.
>  For example, create 4 different JSON files with following data:
> {"label": "person", "confidence": 0.24} \{"label": "person2"} \{"label": 
> "person3", "confidence": 0.54} \{"label": "person4", "confidence": null}
> Query it directly using DRILL:
> *Query 5:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.t2;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person4  | null|
> | person3  | 0.54|
> | person2  | null|
> | person   | 0.24|
> +--+-+
> 4 rows selected (0.203 seconds)
> {code}
> *Query 6:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is 
> null;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person4  | null|
> | person2  | null|
> +--+-+
> 2 rows selected (0.352 seconds)
> {code}
> *Query 7:*
> {code:java}
> 0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is 
> not null;
> +--+-+
> |  label   | confidence  |
> +--+-+
> | person3  | 0.54|
> | person   | 0.24|
> +--+-+
> 2 rows selected (0.265 seconds)
> {code}
> As seen in query 6 & 7, it returns the correct result.
> I believe the issue is at the MapRDB layer where it is fetching the results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7114) ANALYZE command generates warnings for stats file and materialization

2019-03-18 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7114:
--
Description: 
When I run ANALYZE, I see warnings in the log file as shown below. The ANALYZE 
command should not try to read the stats file or materialize the stats.  

{noformat}
12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file.
java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not 
exist

12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. Continuing 
without stats.
java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not 
exist
{noformat}


  was:
When I run ANALYZE, I see warnings shown below. The ANALYZE command should not 
try to read the stats file or materialize the stats.  

{noformat}
12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file.
java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not 
exist

12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. Continuing 
without stats.
java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not 
exist
{noformat}



> ANALYZE command generates warnings for stats file and materialization
> -
>
> Key: DRILL-7114
> URL: https://issues.apache.org/jira/browse/DRILL-7114
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Gautam Parai
>Priority: Minor
> Fix For: 1.16.0
>
>
> When I run ANALYZE, I see warnings in the log file as shown below. The 
> ANALYZE command should not try to read the stats file or materialize the 
> stats.  
> {noformat}
> 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file.
> java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does 
> not exist
> 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. 
> Continuing without stats.
> java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does 
> not exist
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7114) ANALYZE command generates warnings for stats file and materialization

2019-03-18 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-7114:
-

 Summary: ANALYZE command generates warnings for stats file and 
materialization
 Key: DRILL-7114
 URL: https://issues.apache.org/jira/browse/DRILL-7114
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning  Optimization
Reporter: Aman Sinha
Assignee: Gautam Parai
 Fix For: 1.16.0


When I run ANALYZE, I see warnings shown below. The ANALYZE command should not 
try to read the stats file or materialize the stats.  

{noformat}
12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file.
java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not 
exist

12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. Continuing 
without stats.
java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not 
exist
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7108) With statistics enabled TPCH 16 has two additional exchange operators

2019-03-18 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7108:
--
Reviewer: Aman Sinha
 Summary: With statistics enabled TPCH 16 has two additional exchange 
operators  (was: Statistics adds two exchange operators)

> With statistics enabled TPCH 16 has two additional exchange operators
> -
>
> Key: DRILL-7108
> URL: https://issues.apache.org/jira/browse/DRILL-7108
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.16.0
>
>
> TPCH 16 with sf 100 runs 14% slower.  Here is the query:
> {noformat}
> select
>   p.p_brand,
>   p.p_type,
>   p.p_size,
>   count(distinct ps.ps_suppkey) as supplier_cnt
> from
>   partsupp ps,
>   part p
> where
>   p.p_partkey = ps.ps_partkey
>   and p.p_brand <> 'Brand#21'
>   and p.p_type not like 'MEDIUM PLATED%'
>   and p.p_size in (38, 2, 8, 31, 44, 5, 14, 24)
>   and ps.ps_suppkey not in (
> select
>   s.s_suppkey
> from
>   supplier s
> where
>   s.s_comment like '%Customer%Complaints%'
>   )
> group by
>   p.p_brand,
>   p.p_type,
>   p.p_size
> order by
>   supplier_cnt desc,
>   p.p_brand,
>   p.p_type,
>   p.p_size;
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7108) Statistics adds two exchange operators

2019-03-18 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795153#comment-16795153
 ] 

Aman Sinha commented on DRILL-7108:
---

[~rhou], the description seems incomplete and does not match the title, which 
says there are extraneous exchange operators.  Can you add the textual plan 
with and without statistics? 

> Statistics adds two exchange operators
> --
>
> Key: DRILL-7108
> URL: https://issues.apache.org/jira/browse/DRILL-7108
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.16.0
>
>
> TPCH 16 with sf 100 runs 14% slower.  Here is the query:
> {noformat}
> select
>   p.p_brand,
>   p.p_type,
>   p.p_size,
>   count(distinct ps.ps_suppkey) as supplier_cnt
> from
>   partsupp ps,
>   part p
> where
>   p.p_partkey = ps.ps_partkey
>   and p.p_brand <> 'Brand#21'
>   and p.p_type not like 'MEDIUM PLATED%'
>   and p.p_size in (38, 2, 8, 31, 44, 5, 14, 24)
>   and ps.ps_suppkey not in (
> select
>   s.s_suppkey
> from
>   supplier s
> where
>   s.s_comment like '%Customer%Complaints%'
>   )
> group by
>   p.p_brand,
>   p.p_type,
>   p.p_size
> order by
>   supplier_cnt desc,
>   p.p_brand,
>   p.p_type,
>   p.p_size;
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7019) Add check for redundant imports

2019-02-01 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-7019:
--
Labels:   (was: ready-to-commit)

> Add check for redundant imports
> ---
>
> Key: DRILL-7019
> URL: https://issues.apache.org/jira/browse/DRILL-7019
> Project: Apache Drill
>  Issue Type: Task
>  Components: Tools, Build & Test
>Affects Versions: 1.15.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Minor
> Fix For: 1.16.0
>
>
> Currently, only the {{UnusedImports}} check is used, which does not prevent 
> duplicate imports or imports from the same package.
> The goal of this Jira is to add the {{RedundantImport}} check and fix the 
> resulting checkstyle errors.
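
For illustration (this snippet is not from the Drill codebase, and {{Helper}} is 
a hypothetical same-package class), these are the kinds of imports the 
{{RedundantImport}} check flags:
{code:java}
package org.apache.drill.example;

import java.util.List;
import java.util.List;                  // duplicate import
import java.lang.String;                // java.lang is implicitly imported
import org.apache.drill.example.Helper; // same package as this class

public class RedundantImportExample {
  List<String> names;  // the types are used, so UnusedImports stays quiet
  Helper helper;       // Helper is a hypothetical same-package class
}
{code}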



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6992) Support column histogram statistics

2019-01-31 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757549#comment-16757549
 ] 

Aman Sinha commented on DRILL-6992:
---

Added a link to the histogram design doc. 

> Support column histogram statistics
> ---
>
> Key: DRILL-6992
> URL: https://issues.apache.org/jira/browse/DRILL-6992
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Query Planning & Optimization
>Affects Versions: 1.15.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
>
> As a follow-up to 
> [DRILL-1328|https://issues.apache.org/jira/browse/DRILL-1328], which is adding 
> NDV (num distinct values) support and creating the framework for statistics, 
> we also need histograms.  These are needed for range-predicate selectivity 
> estimation, as well as for equality predicates when there is a non-uniform 
> distribution of data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6992) Support column histogram statistics

2019-01-22 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-6992:
-

 Summary: Support column histogram statistics
 Key: DRILL-6992
 URL: https://issues.apache.org/jira/browse/DRILL-6992
 Project: Apache Drill
  Issue Type: New Feature
  Components: Query Planning & Optimization
Affects Versions: 1.15.0
Reporter: Aman Sinha
Assignee: Aman Sinha


As a follow-up to [DRILL-1328|https://issues.apache.org/jira/browse/DRILL-1328], 
which is adding NDV (num distinct values) support and creating the framework 
for statistics, we also need histograms.  These are needed for range-predicate 
selectivity estimation, as well as for equality predicates when there is a 
non-uniform distribution of data.
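
For a sense of how a histogram helps with range-predicate selectivity, here is a 
minimal equi-depth sketch (illustrative only; this is not the design being 
implemented for Drill):
{code:java}
public class EquiDepthHistogram {
  private final double[] boundaries; // n+1 sorted boundaries => n equal-height buckets

  public EquiDepthHistogram(double[] boundaries) {
    this.boundaries = boundaries;
  }

  // Estimated selectivity of the range predicate: col <= value.
  public double estimateLessOrEqual(double value) {
    int n = boundaries.length - 1;
    if (value < boundaries[0]) {
      return 0.0;
    }
    for (int i = 0; i < n; i++) {
      if (value < boundaries[i + 1]) {
        // Full buckets below bucket i, plus linear interpolation within it.
        double frac = (value - boundaries[i]) / (boundaries[i + 1] - boundaries[i]);
        return (i + frac) / n;
      }
    }
    return 1.0;
  }
}
{code}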



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-519) Some queries fail with complaint about cartesian product when cartesian product not required

2019-01-16 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744472#comment-16744472
 ] 

Aman Sinha commented on DRILL-519:
--

[~vitalii] there's a slight variant of query 19 called 19_1.sql that we do run 
as part of 
[TestTpchDistributed.tpch19_1()|https://github.com/apache/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/TestTpchDistributed.java#L131].
  The main difference is that one join condition which was present in all the 
OR blocks is pulled up outside the OR blocks, so semantically it is the same. 
However, if we want to address the original query, you might want to file a new 
JIRA and do some investigation to see whether it's a Drill issue or a Calcite one.

> Some queries fail with complaint about cartesian product when cartesian 
> product not required
> 
>
> Key: DRILL-519
> URL: https://issues.apache.org/jira/browse/DRILL-519
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Reporter: Jacques Nadeau
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 0.4.0
>
>
> Failure for TPCH21:
> org.eigenbase.relopt.RelOptPlanner$CannotPlanException: Node 
> [rel#1080:Subset#89.PHYSICAL.SINGLETON([]).[1 DESC, 0]] could not be 
> implemented; planner state:
> Root: rel#1080:Subset#89.PHYSICAL.SINGLETON([]).[1 DESC, 0]
> Original rel:
> AbstractConverter(subset=[rel#1080:Subset#89.PHYSICAL.SINGLETON([]).[1 DESC, 
> 0]], convention=[PHYSICAL], DrillDistributionTraitDef=[SINGLETON([])], 
> sort=[[1 DESC, 0]]): rowcount = 5.7017433643341056E16, cumulative cost = 
> {inf}, id = 1082
>   DrillScreenRel(subset=[rel#1079:Subset#89.LOGICAL.ANY([]).[1 DESC, 0]]): 
> rowcount = 5.7017433643341056E16, cumulative cost = {5.701743364334106E15 
> rows, 5.701743364334106E15 cpu, 0.0 io}, id = 1078
> DrillLimitRel(subset=[rel#1077:Subset#88.LOGICAL.ANY([]).[1 DESC, 0]], 
> fetch=[100]): rowcount = 5.7017433643341056E16, cumulative cost = 
> {5.7017433643341056E16 rows, 5.7017433643341056E16 cpu, 0.0 io}, id = 1076
>   DrillSortRel(subset=[rel#1075:Subset#87.LOGICAL.ANY([]).[1 DESC, 0]], 
> sort0=[$1], sort1=[$0], dir0=[DESC], dir1=[ASC]): rowcount = 
> 5.7017433643341056E16, cumulative cost = {1.7598833879268286E19 rows, 
> 5.7017433643341056E16 cpu, 0.0 io}, id = 1074
> DrillAggregateRel(subset=[rel#1073:Subset#86.LOGICAL.ANY([]).[]], 
> group=[{0}], numwait=[COUNT()]): rowcount = 5.7017433643341056E16, cumulative 
> cost = {5.7017433643341056E16 rows, 0.0 cpu, 0.0 io}, id = 1072
>   DrillProjectRel(subset=[rel#1071:Subset#85.LOGICAL.ANY([]).[]], 
> s_name=[$0]): rowcount = 5.7017433643341056E17, cumulative cost = 
> {5.7017433643341056E16 rows, 5.7017433643341056E16 cpu, 0.0 io}, id = 1070
> DrillFilterRel(subset=[rel#1069:Subset#84.LOGICAL.ANY([]).[]], 
> condition=[NOT(IS TRUE($14))]): rowcount = 5.7017433643341056E17, cumulative 
> cost = {5.7017433643341056E16 rows, 2.28069734573364224E17 cpu, 0.0 io}, id = 
> 1068
>   DrillJoinRel(subset=[rel#1067:Subset#83.LOGICAL.ANY([]).[]], 
> condition=[AND(=($4, $12), =($5, $13))], joinType=[left]): rowcount = 
> 2.28069734573364224E18, cumulative cost = {2.28069734573364224E18 rows, 0.0 
> cpu, 0.0 io}, id = 1066
> 
> DrillProjectRel(subset=[rel#1038:Subset#75.LOGICAL.ANY([]).[]], s_name=[$0], 
> s_suppkey=[$1], s_nationkey=[$2], l_receiptdate=[$3], l_orderkey=[$4], 
> l_suppkey=[$5], l_commitdate=[$6], o_orderkey=[$7], o_orderstatus=[$8], 
> n_name=[$9], n_nationkey=[$10], $f0=[$13]): rowcount = 1.601806640625E8, 
> cumulative cost = {1.601806640625E7 rows, 1.92216796875E8 cpu, 0.0 io}, id = 
> 1037
>   
> DrillFilterRel(subset=[rel#1036:Subset#74.LOGICAL.ANY([]).[]], condition=[IS 
> TRUE($13)]): rowcount = 1.601806640625E8, cumulative cost = {1.601806640625E7 
> rows, 6.4072265625E7 cpu, 0.0 io}, id = 1035
> 
> DrillJoinRel(subset=[rel#1034:Subset#73.LOGICAL.ANY([]).[]], 
> condition=[AND(=($4, $11), =($5, $12))], joinType=[left]): rowcount = 
> 6.4072265625E8, cumulative cost = {6.4072265625E8 rows, 0.0 cpu, 0.0 io}, id 
> = 1033
>   
> DrillJoinRel(subset=[rel#1014:Subset#63.LOGICAL.ANY([]).[]], condition=[=($2, 
> $10)], joinType=[inner]): rowcount = 3796.875, cumulative cost = {3796.875 
> rows, 0.0 cpu, 0.0 io}, id = 1013
> 
> DrillProjectRel(subset=[rel#1009:Subset#60.LOGICAL.ANY([]).[]], $f0=[$6], 
> $f1=[$7], $f2=[$8], $f3=[$2], $f4=[$3], $f5=[$4], $f6=[$5], $f7=[$0], 
> $f8=[$1]): rowcount = 1687.5, cumulative cost = {168.75 rows, 1518.75 cpu, 
> 0.0 io}, id = 1008
>   
> DrillJoinRel(subset=[rel#1007:Subset#59.LOGICAL.ANY([]).[]], condition=[=($7, 
> 

[jira] [Resolved] (DRILL-6897) TPCH 13 has regressed

2019-01-11 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha resolved DRILL-6897.
---
Resolution: Duplicate

Duplicate of DRILL-6896. 

> TPCH 13 has regressed
> -
>
> Key: DRILL-6897
> URL: https://issues.apache.org/jira/browse/DRILL-6897
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.15.0
>Reporter: Robert Hou
>Assignee: Karthikeyan Manivannan
>Priority: Major
> Attachments: 240099ed-ef2a-a23a-4559-f1b2e0809e72.sys.drill, 
> 2400be84-c024-cb92-8743-3211589e0247.sys.drill
>
>
> I ran TPCH query 13 at both scale factor 100 and 1000, executing each 3x to 
> get a warm start and twice more to verify the regression. It is 
> regressing between 26 and 33%.
> Here is the query:
> {noformat}
> select
>   c_count,
>   count(*) as custdist
> from
>   (
> select
>   c.c_custkey,
>   count(o.o_orderkey)
> from
>   customer c 
>   left outer join orders o 
> on c.c_custkey = o.o_custkey
> and o.o_comment not like '%special%requests%'
> group by
>   c.c_custkey
>   ) as orders (c_custkey, c_count)
> group by
>   c_count
> order by
>   custdist desc,
>   c_count desc;
> {noformat}
> I have attached two profiles. 240099ed-ef2a-a23a-4559-f1b2e0809e72 is for 
> Drill 1.15. 2400be84-c024-cb92-8743-3211589e0247 is for Drill 1.14. The 
> commit for Drill 1.15 is 596227bbbecfb19bdb55dd8ea58159890f83bc9c. The commit 
> for Drill 1.14 is 0508a128853ce796ca7e99e13008e49442f83147.
> The two plans are nearly the same. One difference is that Drill 1.15 is using 
> four times more memory in operator 07-01 Unordered Mux Exchange. I think the 
> problem may be in operator 09-01 Project: Drill 1.15 projects the 
> comment field while Drill 1.14 does not.
> Another issue is that Drill 1.15 takes more processing time to filter the 
> order table. Filter operator 09-03 takes an average of 19.3s. For Drill 1.14, 
> filter operator 09-04 takes an average of 15.6s. They process the same number 
> of rows, and have the same number of minor fragments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6912) NPE when other drillbit is already running

2018-12-18 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724760#comment-16724760
 ] 

Aman Sinha commented on DRILL-6912:
---

[~vvysotskyi] Yes, certainly the bug exists as you mentioned and affects both 
distributed and embedded. It is up to the release manager to decide whether to 
consider this for this release but it is sounding like a blocker.  In case you 
have a tentative fix, you can contribute that :) 
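
A hedged sketch of the short-message behavior suggested in the description 
(this is not Drill's actual startup path; names are illustrative):
{code:java}
public final class StartupErrorReporter {
  // Walk to the root cause (e.g. the java.net.BindException) and print one
  // line instead of a full stack trace.
  public static void report(Throwable e) {
    Throwable root = e;
    while (root.getCause() != null) {
      root = root.getCause();
    }
    System.err.println("Failure in starting embedded Drillbit: " + root);
  }

  public static void main(String[] args) {
    // Usage illustration with a synthetic failure.
    report(new RuntimeException("startup failed",
        new java.net.BindException("Address already in use")));
  }
}
{code}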

> NPE when other drillbit is already running
> --
>
> Key: DRILL-6912
> URL: https://issues.apache.org/jira/browse/DRILL-6912
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Vitalii Diravka
>Assignee: Boaz Ben-Zvi
>Priority: Critical
>
> If a user tries to run a second drillbit process, the following output will 
> be obtained:
> {code:java}
> vitalii@vitalii-pc:/tmp/apache-drill-1.15.0$ bin/drill-embedded 
> java.lang.NullPointerException
> Apache Drill 1.15.0
> "This isn't your grandfather's SQL."
> sqlline> select * from (values(1));
> No current connection
> sqlline> !q
> {code}
> For 1.14.0 drill version the output was correct (but too long):
> {code:java}
> ./bin/drill-embedded 
> Dec 18, 2018 7:58:47 PM org.glassfish.jersey.server.ApplicationHandler 
> initialize
> INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 
> 01:25:26...
> Error: Failure in starting embedded Drillbit: java.net.BindException: Address 
> already in use (state=,code=0)
> java.sql.SQLException: Failure in starting embedded Drillbit: 
> java.net.BindException: Address already in use
>  at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:143)
>  at 
> org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:72)
>  at 
> org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:68)
>  at 
> org.apache.calcite.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:138)
>  at org.apache.drill.jdbc.Driver.connect(Driver.java:72)
>  at sqlline.DatabaseConnection.connect(DatabaseConnection.java:167)
>  at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:213)
>  at sqlline.Commands.connect(Commands.java:1083)
>  at sqlline.Commands.connect(Commands.java:1015)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36)
>  at sqlline.SqlLine.dispatch(SqlLine.java:742)
>  at sqlline.SqlLine.initArgs(SqlLine.java:528)
>  at sqlline.SqlLine.begin(SqlLine.java:596)
>  at sqlline.SqlLine.start(SqlLine.java:375)
>  at sqlline.SqlLine.main(SqlLine.java:268)
> Caused by: java.net.BindException: Address already in use
>  at sun.nio.ch.Net.bind0(Native Method)
>  at sun.nio.ch.Net.bind(Net.java:433)
>  at sun.nio.ch.Net.bind(Net.java:425)
>  at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>  at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>  at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:279)
>  at 
> org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>  at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:218)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>  at org.eclipse.jetty.server.Server.doStart(Server.java:337)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>  at org.apache.drill.exec.server.rest.WebServer.start(WebServer.java:155)
>  at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:200)
>  at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:134)
>  ... 18 more
> apache drill 1.14.0 
> "just drill it"
> 0: jdbc:drill:zk=local> !q{code}
> Looks like it is fine to have a short message in the console about the reason 
> for the error, similar to:
> {code:java}
> java.sql.SQLException: Failure in starting embedded Drillbit: 
> java.net.BindException: Address already in use
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6912) NPE when other drillbit is already running

2018-12-18 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724656#comment-16724656
 ] 

Aman Sinha commented on DRILL-6912:
---

[~ben-zvi]  Thanks, I suppose we should document that env variable (currently 
it is set internally in the drill_embedded script)?  I would think quite a few 
people (me included) run sqlline directly in embedded mode.  I am curious 
though why [~vitalii]'s command failed since he ran drill_embedded.  Let's 
wait for his confirmation to see if it was an environment issue.

> NPE when other drillbit is already running
> --
>
> Key: DRILL-6912
> URL: https://issues.apache.org/jira/browse/DRILL-6912
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Vitalii Diravka
>Assignee: Boaz Ben-Zvi
>Priority: Critical
>
> If a user tries to run a second drillbit process, the following output will 
> be obtained:
> {code:java}
> vitalii@vitalii-pc:/tmp/apache-drill-1.15.0$ bin/drill-embedded 
> java.lang.NullPointerException
> Apache Drill 1.15.0
> "This isn't your grandfather's SQL."
> sqlline> select * from (values(1));
> No current connection
> sqlline> !q
> {code}
> For 1.14.0 drill version the output was correct (but too long):
> {code:java}
> ./bin/drill-embedded 
> Dec 18, 2018 7:58:47 PM org.glassfish.jersey.server.ApplicationHandler 
> initialize
> INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 
> 01:25:26...
> Error: Failure in starting embedded Drillbit: java.net.BindException: Address 
> already in use (state=,code=0)
> java.sql.SQLException: Failure in starting embedded Drillbit: 
> java.net.BindException: Address already in use
>  at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:143)
>  at 
> org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:72)
>  at 
> org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:68)
>  at 
> org.apache.calcite.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:138)
>  at org.apache.drill.jdbc.Driver.connect(Driver.java:72)
>  at sqlline.DatabaseConnection.connect(DatabaseConnection.java:167)
>  at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:213)
>  at sqlline.Commands.connect(Commands.java:1083)
>  at sqlline.Commands.connect(Commands.java:1015)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36)
>  at sqlline.SqlLine.dispatch(SqlLine.java:742)
>  at sqlline.SqlLine.initArgs(SqlLine.java:528)
>  at sqlline.SqlLine.begin(SqlLine.java:596)
>  at sqlline.SqlLine.start(SqlLine.java:375)
>  at sqlline.SqlLine.main(SqlLine.java:268)
> Caused by: java.net.BindException: Address already in use
>  at sun.nio.ch.Net.bind0(Native Method)
>  at sun.nio.ch.Net.bind(Net.java:433)
>  at sun.nio.ch.Net.bind(Net.java:425)
>  at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>  at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>  at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:279)
>  at 
> org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>  at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:218)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>  at org.eclipse.jetty.server.Server.doStart(Server.java:337)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>  at org.apache.drill.exec.server.rest.WebServer.start(WebServer.java:155)
>  at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:200)
>  at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:134)
>  ... 18 more
> apache drill 1.14.0 
> "just drill it"
> 0: jdbc:drill:zk=local> !q{code}
> Looks like it is fine to have a short message in the console about the reason 
> for the error, similar to:
> {code:java}
> java.sql.SQLException: Failure in starting embedded Drillbit: 
> java.net.BindException: Address already in use
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6912) NPE when other drillbit is already running

2018-12-18 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724622#comment-16724622
 ] 

Aman Sinha commented on DRILL-6912:
---

Odd...I do see the same NPE that [~vitalii] reported: 
{noformat} 
[root@aman1 ~]# which sqlline
/root/apache-drill-1.15.0-src/distribution/target/apache-drill-1.15.0/apache-drill-1.15.0/bin/sqlline

[root@aman1 ~]# sqlline -u jdbc:drill:zk=local -n admin --maxWidth=10
java.lang.NullPointerException
Apache Drill 1.15.0
{noformat}


> NPE when other drillbit is already running
> --
>
> Key: DRILL-6912
> URL: https://issues.apache.org/jira/browse/DRILL-6912
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Vitalii Diravka
>Assignee: Boaz Ben-Zvi
>Priority: Critical
>
> If a user tries to run a second drillbit process, the following output will 
> be obtained:
> {code:java}
> vitalii@vitalii-pc:/tmp/apache-drill-1.15.0$ bin/drill-embedded 
> java.lang.NullPointerException
> Apache Drill 1.15.0
> "This isn't your grandfather's SQL."
> sqlline> select * from (values(1));
> No current connection
> sqlline> !q
> {code}
> For 1.14.0 drill version the output was correct (but too long):
> {code:java}
> ./bin/drill-embedded 
> Dec 18, 2018 7:58:47 PM org.glassfish.jersey.server.ApplicationHandler 
> initialize
> INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 
> 01:25:26...
> Error: Failure in starting embedded Drillbit: java.net.BindException: Address 
> already in use (state=,code=0)
> java.sql.SQLException: Failure in starting embedded Drillbit: 
> java.net.BindException: Address already in use
>  at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:143)
>  at 
> org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:72)
>  at 
> org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:68)
>  at 
> org.apache.calcite.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:138)
>  at org.apache.drill.jdbc.Driver.connect(Driver.java:72)
>  at sqlline.DatabaseConnection.connect(DatabaseConnection.java:167)
>  at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:213)
>  at sqlline.Commands.connect(Commands.java:1083)
>  at sqlline.Commands.connect(Commands.java:1015)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36)
>  at sqlline.SqlLine.dispatch(SqlLine.java:742)
>  at sqlline.SqlLine.initArgs(SqlLine.java:528)
>  at sqlline.SqlLine.begin(SqlLine.java:596)
>  at sqlline.SqlLine.start(SqlLine.java:375)
>  at sqlline.SqlLine.main(SqlLine.java:268)
> Caused by: java.net.BindException: Address already in use
>  at sun.nio.ch.Net.bind0(Native Method)
>  at sun.nio.ch.Net.bind(Net.java:433)
>  at sun.nio.ch.Net.bind(Net.java:425)
>  at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>  at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>  at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:279)
>  at 
> org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>  at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:218)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>  at org.eclipse.jetty.server.Server.doStart(Server.java:337)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>  at org.apache.drill.exec.server.rest.WebServer.start(WebServer.java:155)
>  at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:200)
>  at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:134)
>  ... 18 more
> apache drill 1.14.0 
> "just drill it"
> 0: jdbc:drill:zk=local> !q{code}
> Looks like it is fine to have a short message in the console about the reason 
> for the error, similar to:
> {code:java}
> java.sql.SQLException: Failure in starting embedded Drillbit: 
> java.net.BindException: Address already in use
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6896) Extraneous columns being projected past a join

2018-12-11 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6896:
--
Fix Version/s: 1.16.0

> Extraneous columns being projected past a join
> --
>
> Key: DRILL-6896
> URL: https://issues.apache.org/jira/browse/DRILL-6896
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Karthikeyan Manivannan
>Assignee: Aman Sinha
>Priority: Major
> Fix For: 1.16.0
>
>
> [~rhou] noted that TPCH13 on Drill 1.15 was running slower than Drill 1.14. 
> Analysis revealed that an extra column was being projected in 1.15 and the 
> slowdown was because the extra column was being unnecessarily pushed across 
> an exchange.
> Here is a simplified query written by [~amansinha100] that exhibits the same 
> problem :
> In the first plan, o_custkey and o_comment are both extraneous projections. 
>  In the second plan (on 1.14.0) there is also an extraneous projection: 
> o_custkey but not o_comment.
> On 1.15.0:
> -
> {noformat}
> explain plan without implementation for 
> select
>   c.c_custkey
> from
>cp.`tpch/customer.parquet` c 
>  left outer join cp.`tpch/orders.parquet` o 
>   on c.c_custkey = o.o_custkey
>  and o.o_comment not like '%special%requests%'
>;
> DrillScreenRel
>   DrillProjectRel(c_custkey=[$0])
> DrillProjectRel(c_custkey=[$2], o_custkey=[$0], o_comment=[$1])
>   DrillJoinRel(condition=[=($2, $0)], joinType=[right])
> DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))])
>   DrillScanRel(table=[[cp, tpch/orders.parquet]], 
> groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=classpath:/tpch/orders.parquet]], 
> selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]])
> DrillScanRel(table=[[cp, tpch/customer.parquet]], 
> groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=classpath:/tpch/customer.parquet]], 
> selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`c_custkey`]]])
> {noformat}
> On 1.14.0:
> -
> {noformat}
> DrillScreenRel
>   DrillProjectRel(c_custkey=[$0])
> DrillProjectRel(c_custkey=[$1], o_custkey=[$0])
>   DrillJoinRel(condition=[=($1, $0)], joinType=[right])
> DrillProjectRel(o_custkey=[$0])
>   DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))])
> DrillScanRel(table=[[cp, tpch/orders.parquet]], 
> groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=classpath:/tpch/orders.parquet]], 
> selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]])
> DrillScanRel(table=[[cp, tpch/customer.parquet]], 
> groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=classpath:/tpch/customer.parquet]], 
> selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`c_custkey`]]])
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6896) Extraneous columns being projected past a join

2018-12-11 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718317#comment-16718317
 ] 

Aman Sinha commented on DRILL-6896:
---

On further analysis, I can see a similar (though not identical) extraneous column 
projection even on 1.14.0, with an inner join query whose filter 
conditions are in the WHERE clause (instead of the ON clause).

On 1.14.0: 
{noformat}
 explain plan without implementation for 
 select
   c.c_custkey
from
   cp.`tpch/customer.parquet` c 
   inner join cp.`tpch/orders.parquet` o 
on c.c_custkey = o.o_custkey
 where o.o_orderkey < 10 and c.c_name like 'abc%'
 ;
+--+--+
| text | json |
+--+--+
DrillScreenRel
  DrillProjectRel(c_custkey=[$0])
DrillProjectRel(c_custkey=[$2], c_name=[$3], o_custkey=[$0], 
o_orderkey=[$1])
  DrillJoinRel(condition=[=($2, $0)], joinType=[inner])
DrillFilterRel(condition=[<($1, 10)])
  DrillScanRel(table=[[cp, tpch/orders.parquet]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=classpath:/tpch/orders.parquet]], 
selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`o_custkey`, `o_orderkey`]]])
DrillFilterRel(condition=[LIKE($1, 'abc%')])
  DrillScanRel(table=[[cp, tpch/customer.parquet]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=classpath:/tpch/customer.parquet]], 
selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`c_custkey`, `c_name`]]])
{noformat}

Notice above that the DrillProjectRel above the Join is projecting both c_name 
and o_orderkey even though both these columns are only needed by the filter 
conditions below the DrillJoinRel.  There are 2 possible reasons why they were 
not pushed down: 
1.  Costing.  For whatever reason, the projection pushdown did not change the 
cost, so Calcite preserved the original non-pushdown plan.  This is less likely 
compared to #2 below. 
2.  Potential issues with rule application.  There were some issues with 
Calcite's ProjectMergeRule (see 
https://issues.apache.org/jira/browse/CALCITE-2223) which required a workaround 
implemented in [DRILL-6212|https://issues.apache.org/jira/browse/DRILL-6212].  
A second JIRA: [DRILL-6501|https://issues.apache.org/jira/browse/DRILL-6501] 
was filed to revert to using ProjectMergeRule later. 

Considering that 1.14.0 also shows a similar issue (albeit not for the same 
query), I am inclined to move this JIRA to the 1.16.0 release.

> Extraneous columns being projected past a join
> --
>
> Key: DRILL-6896
> URL: https://issues.apache.org/jira/browse/DRILL-6896
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Karthikeyan Manivannan
>Assignee: Aman Sinha
>Priority: Major
>
> [~rhou] noted that TPCH13 on Drill 1.15 was running slower than Drill 1.14. 
> Analysis revealed that an extra column was being projected in 1.15 and the 
> slowdown was because the extra column was being unnecessarily pushed across 
> an exchange.
> Here is a simplified query written by [~amansinha100] that exhibits the same 
> problem :
> In the first plan, o_custkey and o_comment are both extraneous projections. 
>  In the second plan (on 1.14.0) there is also an extraneous projection: 
> o_custkey but not o_comment.
> On 1.15.0:
> -
> {noformat}
> explain plan without implementation for 
> select
>   c.c_custkey
> from
>cp.`tpch/customer.parquet` c 
>  left outer join cp.`tpch/orders.parquet` o 
>   on c.c_custkey = o.o_custkey
>  and o.o_comment not like '%special%requests%'
>;
> DrillScreenRel
>   DrillProjectRel(c_custkey=[$0])
> DrillProjectRel(c_custkey=[$2], o_custkey=[$0], o_comment=[$1])
>   DrillJoinRel(condition=[=($2, $0)], joinType=[right])
> DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))])
>   DrillScanRel(table=[[cp, tpch/orders.parquet]], 
> groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=classpath:/tpch/orders.parquet]], 
> selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]])
> DrillScanRel(table=[[cp, tpch/customer.parquet]], 
> groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=classpath:/tpch/customer.parquet]], 
> selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`c_custkey`]]])
> {noformat}
> On 1.14.0:
> -
> {noformat}
> DrillScreenRel
>   DrillProjectRel(c_custkey=[$0])
> DrillProjectRel(c_custkey=[$1], o_custkey=[$0])
>   DrillJoinRel(condition=[=($1, $0)], joinType=[right])
> 

[jira] [Updated] (DRILL-6896) Extraneous columns being projected past a join

2018-12-11 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6896:
--
Description: 
[~rhou] noted that TPCH13 on Drill 1.15 was running slower than Drill 1.14. 
Analysis revealed that an extra column was being projected in 1.15 and the 
slowdown was because the extra column was being unnecessarily pushed across an 
exchange.

Here is a simplified query written by [~amansinha100] that exhibits the same 
problem :

In the first plan, o_custkey and o_comment are both extraneous projections. 
In the second plan (on 1.14.0) there is also an extraneous projection: 
o_custkey but not o_comment.

On 1.15.0:
-
{noformat}
explain plan without implementation for 
select
  c.c_custkey
from
   cp.`tpch/customer.parquet` c 
 left outer join cp.`tpch/orders.parquet` o 
  on c.c_custkey = o.o_custkey
 and o.o_comment not like '%special%requests%'
   ;

DrillScreenRel
  DrillProjectRel(c_custkey=[$0])
DrillProjectRel(c_custkey=[$2], o_custkey=[$0], o_comment=[$1])
  DrillJoinRel(condition=[=($2, $0)], joinType=[right])
DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))])
  DrillScanRel(table=[[cp, tpch/orders.parquet]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=classpath:/tpch/orders.parquet]], 
selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]])
DrillScanRel(table=[[cp, tpch/customer.parquet]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=classpath:/tpch/customer.parquet]], 
selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`c_custkey`]]])
{noformat}


On 1.14.0:
-
{noformat}
DrillScreenRel
  DrillProjectRel(c_custkey=[$0])
DrillProjectRel(c_custkey=[$1], o_custkey=[$0])
  DrillJoinRel(condition=[=($1, $0)], joinType=[right])
DrillProjectRel(o_custkey=[$0])
  DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))])
DrillScanRel(table=[[cp, tpch/orders.parquet]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=classpath:/tpch/orders.parquet]], 
selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]])
DrillScanRel(table=[[cp, tpch/customer.parquet]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=classpath:/tpch/customer.parquet]], 
selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`c_custkey`]]])
{noformat}

  was:
[~rhou] noted that TPCH13 on Drill 1.15 was running slower than Drill 1.14. 
Analysis revealed that an extra column was being projected in 1.15 and the 
slowdown was because the extra column was being unnecessarily pushed across an 
exchange.

Here is a simplified query written by [~amansinha100] that exhibits the same 
problem :

In the first plan, o_custkey and o_comment are both extraneous projections. 
In the second plan (on 1.14.0) there is also an extraneous projection: 
o_custkey but not o_comment.

On 1.15.0:

-

explain plan without implementation for 
 select
 c.c_custkey
 from
 cp.`tpch/customer.parquet` c 
 left outer join cp.`tpch/orders.parquet` o 
 on c.c_custkey = o.o_custkey
 and o.o_comment not like '%special%requests%'
 ;

DrillScreenRel

DrillProjectRel(c_custkey=[$0])

DrillProjectRel(c_custkey=[$2], o_custkey=[$0], o_comment=[$1])

DrillJoinRel(condition=[=($2, $0)], joinType=[right])

DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))])

DrillScanRel(table=[[cp, tpch/orders.parquet]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], 
selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]])

DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/customer.parquet]], 
selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`c_custkey`]]])

On 1.14.0:

-

DrillScreenRel

DrillProjectRel(c_custkey=[$0])

DrillProjectRel(c_custkey=[$1], o_custkey=[$0])

DrillJoinRel(condition=[=($1, $0)], joinType=[right])

DrillProjectRel(o_custkey=[$0])

DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))])

DrillScanRel(table=[[cp, tpch/orders.parquet]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], 
selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]])

DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan 

[jira] [Updated] (DRILL-6896) Extraneous columns being projected past a join

2018-12-11 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6896:
--
Summary: Extraneous columns being projected past a join  (was: Extraneous 
columns being projected in Drill 1.15)

> Extraneous columns being projected past a join
> --
>
> Key: DRILL-6896
> URL: https://issues.apache.org/jira/browse/DRILL-6896
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Karthikeyan Manivannan
>Assignee: Aman Sinha
>Priority: Major
>
> [~rhou] noted that TPCH13 on Drill 1.15 was running slower than Drill 1.14. 
> Analysis revealed that an extra column was being projected in 1.15 and the 
> slowdown was because the extra column was being unnecessarily pushed across 
> an exchange.
> Here is a simplified query written by [~amansinha100] that exhibits the same 
> problem :
> In the first plan, o_custkey and o_comment are both extraneous projections. 
>  In the second plan (on 1.14.0) there is also an extraneous projection: 
> o_custkey but not o_comment.
> On 1.15.0:
> -
> explain plan without implementation for 
>  select
>  c.c_custkey
>  from
>  cp.`tpch/customer.parquet` c 
>  left outer join cp.`tpch/orders.parquet` o 
>  on c.c_custkey = o.o_custkey
>  and o.o_comment not like '%special%requests%'
>  ;
> DrillScreenRel
> DrillProjectRel(c_custkey=[$0])
> DrillProjectRel(c_custkey=[$2], o_custkey=[$0], o_comment=[$1])
> DrillJoinRel(condition=[=($2, $0)], joinType=[right])
> DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))])
> DrillScanRel(table=[[cp, tpch/orders.parquet]], groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], 
> selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]])
> DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=classpath:/tpch/customer.parquet]], 
> selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`c_custkey`]]])
> On 1.14.0:
> -
> DrillScreenRel
> DrillProjectRel(c_custkey=[$0])
> DrillProjectRel(c_custkey=[$1], o_custkey=[$0])
> DrillJoinRel(condition=[=($1, $0)], joinType=[right])
> DrillProjectRel(o_custkey=[$0])
> DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))])
> DrillScanRel(table=[[cp, tpch/orders.parquet]], groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], 
> selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]])
> DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=classpath:/tpch/customer.parquet]], 
> selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`c_custkey`]]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6876) Enable CircleCI builds for JDK 9-11

2018-12-08 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6876:
--
Fix Version/s: (was: Future)
   1.15.0

> Enable CircleCI builds for JDK 9-11
> ---
>
> Key: DRILL-6876
> URL: https://issues.apache.org/jira/browse/DRILL-6876
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Tools, Build & Test
>Affects Versions: 1.14.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.15.0
>
>
> Currently, CircleCI runs tests for JDK 8, but Drill also supports JDK 9-11.
> The aim of this Jira is to allow running tests for JDK9-11.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6882) Handle the cases where RowKeyJoin's left pipeline being called multiple times.

2018-12-06 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6882:
--
Labels: ready-to-commit  (was: )

> Handle the cases where RowKeyJoin's left pipeline being called multiple times.
> --
>
> Key: DRILL-6882
> URL: https://issues.apache.org/jira/browse/DRILL-6882
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Blocker
>  Labels: ready-to-commit
> Fix For: 1.15.0
>
>
> The RowKeyJoin operator's leftStream pipeline uses the IterOutcome.NONE state to 
> repeat the pipeline multiple times. This should be avoided, as 
> IterOutcome.NONE will trip assertion checks in the validator. We ran into this 
> issue when trying to fix DRILL-6878.
> Sample unit test failure from DRILL-6878
> select t1.id.ssn as ssn from hbase.`index_test_primary` t1, 
> hbase.`index_test_primary` t2 where t1._id = t2.rowid and 
> cast(t2.activity.irs.firstlogin as timestamp) = to_timestamp('2013-02-04 
> 22:34:38.0', '-MM-dd HH:mm:ss.S')
>  at 
> org.apache.drill.test.DrillTestWrapper.compareMergedOnHeapVectors(DrillTestWrapper.java:630)
>  ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]
>  at 
> org.apache.drill.test.DrillTestWrapper.compareOrderedResults(DrillTestWrapper.java:584)
>  ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]
>  at org.apache.drill.test.DrillTestWrapper.run(DrillTestWrapper.java:157) 
> ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]
>  at org.apache.drill.test.TestBuilder.go(TestBuilder.java:139) 
> ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]
>  at 
> com.mapr.drill.maprdb.tests.index.IndexPlanTest.testRowkeyJoinPushdown_7(IndexPlanTest.java:1828)
>  ~[test-classes/:na]
>  at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_141]
> Caused by: org.apache.drill.exec.rpc.RpcException: 
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
> IllegalStateException: next() [on #851, ScanBatch] called again after it 
> returned NONE.  Caller should not have called next() again.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6839) Failed to plan (aggregate + Hash or NL join) when slice target is low

2018-11-16 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690149#comment-16690149
 ] 

Aman Sinha commented on DRILL-6839:
---

[~vvysotskyi] it is certainly possible to change the `StreamAggPrule` to 
generate a single-phase agg if for some reason the 2-phase agg is not possible.  
It would be a workaround, but the performance will be poor because after a 
`NestedLoopJoin` the output row count would increase substantially due to the 
cross join, so we should ideally do local aggregation first before sending all 
those rows to the foreman node.  After doing local aggregation, we are only 
sending M rows to the foreman, where M is the number of minor fragments, whereas 
otherwise it will send `O(N^2)` rows to the foreman, where N is the number of 
input rows (for example, with two 1,000-row inputs and 10 minor fragments, that 
is 10 rows instead of up to 1,000,000).

So, it would be useful to determine why the 2-phase StreamingAgg was not 
created in this example.  Looking at the code, it is creating the distribution 
traits using the distribution traits of the child.  If there is no GROUP-BY in 
the aggregation, then I would expect that DrillDistributionTrait.ANY should be 
sufficient.  It would be good to check if that is happening.

> Failed to plan (aggregate + Hash or NL join) when slice target is low 
> --
>
> Key: DRILL-6839
> URL: https://issues.apache.org/jira/browse/DRILL-6839
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Igor Guzenko
>Priority: Major
> Fix For: 1.16.0
>
>
> *Case 1.* When nested loop join is about to be used:
>  - Option "_planner.enable_nljoin_for_scalar_only_" is set to false
>  - Option "_planner.slice_target_" is set to low value for imitation of big 
> input tables
>  
> {code:java}
> @Category(SqlTest.class)
> public class CrossJoinTest extends ClusterTest {
>  @BeforeClass
>  public static void setUp() throws Exception {
>  startCluster(ClusterFixture.builder(dirTestWatcher));
>  }
>  @Test
>  public void testCrossJoinSucceedsForLowSliceTarget() throws Exception {
>try {
>  client.alterSession(PlannerSettings.NLJOIN_FOR_SCALAR.getOptionName(), 
> false);
>  client.alterSession(ExecConstants.SLICE_TARGET, 1);
>  queryBuilder().sql(
> "SELECT COUNT(l.nation_id) " +
> "FROM cp.`tpch/nation.parquet` l " +
> ", cp.`tpch/region.parquet` r")
>  .run();
>} finally {
> client.resetSession(ExecConstants.SLICE_TARGET);
> client.resetSession(PlannerSettings.NLJOIN_FOR_SCALAR.getOptionName());
>}
>  }
> }{code}
>  
> *Case 2.* When hash join is about to be used:
>  - Option "planner.enable_mergejoin" is set to false, so hash join will be 
> used instead
>  - Option "planner.slice_target" is set to low value for imitation of big 
> input tables
>  - Comment out //ruleList.add(HashJoinPrule.DIST_INSTANCE); in 
> PlannerPhase.getPhysicalRules method
> {code:java}
> @Category(SqlTest.class)
> public class CrossJoinTest extends ClusterTest {
>  @BeforeClass
>  public static void setUp() throws Exception {
>startCluster(ClusterFixture.builder(dirTestWatcher));
>  }
>  @Test
>  public void testInnerJoinSucceedsForLowSliceTarget() throws Exception {
>try {
> client.alterSession(PlannerSettings.MERGEJOIN.getOptionName(), false);
> client.alterSession(ExecConstants.SLICE_TARGET, 1);
> queryBuilder().sql(
>   "SELECT COUNT(l.nation_id) " +
>   "FROM cp.`tpch/nation.parquet` l " +
>   "INNER JOIN cp.`tpch/region.parquet` r " +
>   "ON r.nation_id = l.nation_id")
> .run();
>} finally {
> client.resetSession(ExecConstants.SLICE_TARGET);
> client.resetSession(PlannerSettings.MERGEJOIN.getOptionName());
>}
>  }
> }
> {code}
>  
> *Workaround:* To avoid the exception we need to set option 
> "_planner.enable_multiphase_agg_" to false. By doing this we avoid 
> unsuccessful attempts to create 2 phase aggregation plan in StreamAggPrule 
> and guarantee that logical aggregate will be converted to physical one. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6829) Handle schema change in ExternalSort

2018-11-06 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677656#comment-16677656
 ] 

Aman Sinha commented on DRILL-6829:
---

[~Paul.Rogers] I want to clarify: by 'union' of two incompatible schemas I 
did not mean using the union type.  I meant the union-like operation that we 
normally do for record batches.  Step #7 in my first comment is about doing 
this cross-schema-union.  Suppose there are 3 record batches each with 
different schema for the sort key.  These will be sitting in separate internal 
queues of the blocking operator and each will be individually sorted. The 
cross-schema-union will traverse these queues in a certain order (e.g all 
Numeric types appear first, followed by all String types, followed by Date 
types) consuming all batches from the first queue and emitting them, followed 
by second queue and so on.  

> Handle schema change in ExternalSort
> 
>
> Key: DRILL-6829
> URL: https://issues.apache.org/jira/browse/DRILL-6829
> Project: Apache Drill
>  Issue Type: New Feature
>Reporter: Aman Sinha
>Priority: Major
>
> While we continue to enhance the schema provision and metastore aspects in 
> Drill, we also should explore what it means to be truly schema-less such that 
> we can better handle \{semi, un}structured data, data sitting in DBs that 
> store JSON documents (e.g Mongo, MapR-DB). 
>  
> The blocking operators are the main hurdles in this goal (other operators 
> also need to be smarter about this but the problem is harder for the blocking 
> operators).   This Jira is specifically about ExternalSort. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6829) Handle schema change in ExternalSort

2018-11-06 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677245#comment-16677245
 ] 

Aman Sinha commented on DRILL-6829:
---

Hi [~Paul.Rogers], the fundamental assumption you are making is that all input 
types need to be converted to a common type in order to generate a global sort. 
 But let's drop the requirement of a global sort for the moment.  If you 
examine the approach I described, I am proposing that each distinct schema be 
sorted independently and only compatible types need to be merged together to 
produce a common output schema.  Incompatible types just need a 'union'.  

It just occurred to me to check MongoDB's sort behavior, and it does not seem 
much different from what I am proposing. Please see [1].  So, if a user is 
querying a NoSQL DB such as Mongo, their client applications already must be 
aware of schema changes.  How can we provide the same functionality through 
SQL?  Since Drill's key differentiator is providing a SQL capability to a 
wide variety of data sources, this seems like a must-have capability; otherwise, 
why would a Mongo user migrate to Drill+Mongo, even given the enormous benefits 
of using a SQL interface?  (BTW, I am just using MongoDB as an example here.)

That said, I am under no illusion that it is a much more complex task to do 
this in Drill.  But let's chip away at the layers and see what is feasible.

[1] 
https://docs.mongodb.com/manual/reference/bson-type-comparison-order/#bson-types-comparison-order

> Handle schema change in ExternalSort
> 
>
> Key: DRILL-6829
> URL: https://issues.apache.org/jira/browse/DRILL-6829
> Project: Apache Drill
>  Issue Type: New Feature
>Reporter: Aman Sinha
>Priority: Major
>
> While we continue to enhance the schema provision and metastore aspects in 
> Drill, we also should explore what it means to be truly schema-less such that 
> we can better handle \{semi, un}structured data, data sitting in DBs that 
> store JSON documents (e.g Mongo, MapR-DB). 
>  
> The blocking operators are the main hurdles in this goal (other operators 
> also need to be smarter about this but the problem is harder for the blocking 
> operators).   This Jira is specifically about ExternalSort. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6829) Handle schema change in ExternalSort

2018-11-05 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676056#comment-16676056
 ] 

Aman Sinha commented on DRILL-6829:
---

Here's one approach.  Other approaches are also welcome and we can consider the 
pros and cons.  

 
Consider a Sort on (a1, b1)
 
*Scenario 1*: schema change from one numeric type to another numeric type
Batch 1: a1: INT, b1: INT       (schema 1)
Batch 2: a1: FLOAT, b1: INT     (schema 2)
 
*Scenario 2*: schema change from numeric to string type 
Batch 1: a1: INT, b1: INT       (schema 1)
Batch 2: a1: STRING, b1: INT    (schema 3)
 
*Scenario 3:* schema change from numeric to string to float 
Batch 1: a1: INT, b1: INT       (schema 1)
Batch 2: a1: STRING, b1: INT    (schema 3)
Batch 3: a1: FLOAT, b1: INT     (schema 2)
 
Let's define *compatible schemas:* 
 Schema 1 is compatible with schema 2, since INT and FLOAT are both numeric 
types. 
 Schemas 1 and 2 are not compatible with schema 3: although technically one could 
cast the numeric types to string, if the user has not explicitly done a CAST 
then we cannot impose it. 
 
Define *sort order* among different schemas:
For compatible schemas, this is straightforward. For instance, the sort order among 
all numeric types: INT, BIGINT, FLOAT, DOUBLE.  (TODO: need to define behavior 
for Decimal type). 
What's the sort order between incompatible schemas?  We can have a policy that 
says by default numeric values appear first, followed by String values, followed 
by Date values, etc.  This is similar to NULLS first/NULLS last. This policy 
will appear in external documentation so users and BI tools are aware of the 
expected order.  In the future we may add a declarative way of changing the 
default sort order between incompatible schemas. 
 
Define the *output type* for compatible schemas:
The output type of compatible schemas will be the 'higher' data type, for 
example: 
   (INT, FLOAT) ==> FLOAT
   (DATE, TIMESTAMP) ==> TIMESTAMP
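
A minimal sketch of this promotion rule (illustrative only; the enum ordering 
below is a stand-in, not Drill's actual type lattice, and it assumes the two 
inputs were already judged compatible):
{code:java}
public class CompatibleTypePromotion {
  // Order within each category encodes the 'higher' type.
  enum SortKeyType { INT, BIGINT, FLOAT, DOUBLE, DATE, TIMESTAMP }

  static SortKeyType outputType(SortKeyType a, SortKeyType b) {
    return a.ordinal() >= b.ordinal() ? a : b; // e.g. (INT, FLOAT) ==> FLOAT
  }
}
{code}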
 
*Execution steps*
On each minor fragment that contains Sort, the algorithm would be as follows:
 # Start with a single sorter internally based on the input record batch's 
schema. If the input schema changes in a new record batch, start a new sorter.  
Thus, there will be N sorters internally within the Sort operator corresponding 
to the N schemas. NOTE: in most practical cases, N is quite small.  We can have 
a default 'schema change threshold' of 3. Beyond that, we would still throw a 
SchemaChangeException.
 # Each sorter maintains a separate list of batches for each schema. 
 # Each sorter sorts the rows for all batches belonging to that schema, 
spilling as necessary.
 # Create as many 'mergers' as the number of compatible schema categories. Thus 
in scenario 1 there will be 1 merger while scenario 3 will have 2 mergers. 
 # Do a local merge of the data within each compatible category. The generated 
code that does the comparison will need to handle comparisons of INT vs BIGINT, 
vs FLOAT etc.
 # This also means that an operator needs to be able to use more than 1 version 
of generated code.  The exact mechanics of this need to be determined.  If it 
is not feasible, then we may consider running in non-code-gen mode when we 
encounter a schema change; it will be slower but at least the query will 
complete. 
 # Once the individual merges have been done within each compatible schema 
category, do a cross-schema-union.  This simply unions the results of the prior 
merge. The union will preserve the relative ordering among incompatible 
schemas. 
 # After the union, each Output Batch that is created will have the schema it 
inherits from each side of the union. So if there are 3 inputs to the union, 
there will be at least 3 output batches created each with a different schema. 
 
Here are the steps that would need to happen internally in the Sort operator 
for Scenario 3: 
{noformat}
                Output batch
                     |
            Cross-Schema-Union
              /            \
          Merge        Sort (schema 3)
         /      \
      Sort      Sort
  (schema 1) (schema 2)
{noformat}
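
A minimal sketch of the per-schema sorter bookkeeping from steps 1 and 2 above 
(all class names are illustrative stand-ins, not Drill's actual operator 
classes):
{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiSchemaSortSketch {
  static final int SCHEMA_CHANGE_THRESHOLD = 3; // default proposed in step 1

  static class PerSchemaSorter {
    final List<Object> batches = new ArrayList<>(); // step 2: per-schema batch list
    void add(Object batch) { batches.add(batch); }
  }

  final Map<String, PerSchemaSorter> sorters = new LinkedHashMap<>();

  void onIncomingBatch(String schemaSignature, Object batch) {
    PerSchemaSorter sorter = sorters.get(schemaSignature);
    if (sorter == null) {                 // step 1: new schema => new sorter
      if (sorters.size() >= SCHEMA_CHANGE_THRESHOLD) {
        throw new IllegalStateException("schema change threshold exceeded");
      }
      sorter = new PerSchemaSorter();
      sorters.put(schemaSignature, sorter);
    }
    sorter.add(batch);                    // sorting/spilling (step 3) happens later
  }
}
{code}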
 
On the foreman node that does the final Merge, there will be again a 2 step 
process: first do the individual merge of compatible schemas coming from 
separate minor fragments, followed by the Cross-schema-Union. 
  
*Schema change due to NULLABILITY:*
In all the above scenarios, the data type may be nullable or non-nullable. If 
the schema change occurs because a type changed from non-nullable to nullable, 
it would be handled the same way as we would handle a 'compatible' schema 
change since a nullable INT type is compatible with a non-nullable INT type. 
 
*Schema change from Scalar type to Array type:*
This may occur for example when a single phone number changes to an array of 
phone numbers in a customer table. Doing a Sort on an array does not make sense 
but 

[jira] [Created] (DRILL-6829) Handle schema change in ExternalSort

2018-11-05 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-6829:
-

 Summary: Handle schema change in ExternalSort
 Key: DRILL-6829
 URL: https://issues.apache.org/jira/browse/DRILL-6829
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Aman Sinha


While we continue to enhance the schema provisioning and metastore aspects of 
Drill, we should also explore what it means to be truly schema-less, so that 
we can better handle \{semi, un}structured data and data sitting in DBs that store 
JSON documents (e.g. Mongo, MapR-DB). 
 
The blocking operators are the main hurdles toward this goal (other operators also 
need to be smarter about this, but the problem is harder for the blocking 
operators).  This Jira is specifically about ExternalSort. 
 
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6818) Add descriptions to secondary index options

2018-11-05 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6818:
--
Labels: doc-impacting  (was: )

> Add descriptions to secondary index options
> ---
>
> Key: DRILL-6818
> URL: https://issues.apache.org/jira/browse/DRILL-6818
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.15.0
>Reporter: Arina Ielchiieva
>Assignee: Aman Sinha
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.15.0
>
>
> Add descriptions to secondary index options in ExecConstants and 
> PlannerSettings:
> == index plan related options ==
> planner.use_simple_optimizer: false,
> planner.enable_index_planning: true,
> planner.enable_statistics: true,
> planner.disable_full_table_scan: false,
> planner.index.max_chosen_indexes_per_table: 5,
> planner.index.force_sort_noncovering: false,
> planner.index.use_hashjoin_noncovering: false,
> planner.index.covering_selectivity_threshold: 1.0,
> planner.index.noncovering_selectivity_threshold: 0.025,
> planner.index.rowkeyjoin_cost_factor: 0.1,
> planner.index.statistics_rowcount_scaling_factor: 0.1,
> planner.index.prefer_intersect_plans: false,
> planner.index.max_indexes_to_intersect: 5,
> exec.query.rowkeyjoin_batchsize: 128,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6822) Fix imports for DRILL-6381

2018-10-31 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6822:
--
Fix Version/s: 1.15.0

> Fix imports for DRILL-6381
> --
>
> Key: DRILL-6822
> URL: https://issues.apache.org/jira/browse/DRILL-6822
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Venkata Jyothsna Donapati
>Assignee: Venkata Jyothsna Donapati
>Priority: Major
> Fix For: 1.15.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6381) Add capability to do index based planning and execution

2018-10-25 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6381:
--
Reviewer: Vitalii Diravka

> Add capability to do index based planning and execution
> ---
>
> Key: DRILL-6381
> URL: https://issues.apache.org/jira/browse/DRILL-6381
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.15.0
>
>
> If the underlying data source supports indexes (primary and secondary 
> indexes), Drill should leverage those during planning and execution in order 
> to improve query performance.  
> On the planning side, the Drill planner should be enhanced to provide an 
> abstraction layer which expresses the index metadata and statistics.  Further, 
> cost-based index selection is needed to decide which index(es) are 
> suitable.  
> On the execution side, appropriate operator enhancements would be needed to 
> handle different categories of indexes, such as covering and non-covering 
> indexes, taking into consideration that the index data may not be co-located with 
> the primary table, i.e. a global index.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6381) Add capability to do index based planning and execution

2018-10-25 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6381:
--
Labels: ready-to-commit  (was: )

> Add capability to do index based planning and execution
> ---
>
> Key: DRILL-6381
> URL: https://issues.apache.org/jira/browse/DRILL-6381
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.15.0
>
>
> If the underlying data source supports indexes (primary and secondary 
> indexes), Drill should leverage those during planning and execution in order 
> to improve query performance.  
> On the planning side, the Drill planner should be enhanced to provide an 
> abstraction layer which expresses the index metadata and statistics.  Further, 
> cost-based index selection is needed to decide which index(es) are 
> suitable.  
> On the execution side, appropriate operator enhancements would be needed to 
> handle different categories of indexes, such as covering and non-covering 
> indexes, taking into consideration that the index data may not be co-located with 
> the primary table, i.e. a global index.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (DRILL-6381) Add capability to do index based planning and execution

2018-09-21 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613710#comment-16613710
 ] 

Aman Sinha edited comment on DRILL-6381 at 9/21/18 3:28 PM:


The index planning and execution feature has been in development in a private 
repository at MapR. We would like to contribute it to Apache Drill and hope 
that it spurs further development and adoption by the community. The main ideas 
are described below.

The feature is divided into two broad categories:
 # A comprehensive index planning and execution framework which supports 
distributed covering and non-covering indexes. Index planning is done for WHERE 
clause filters, sort-based operations such as ORDER BY and GROUP BY (using 
StreamingAggregate), and joins (using MergeJoin). The framework is intended to 
be agnostic to the storage plugins. It provides a clean abstraction layer that 
allows the Drill planner and executor to work with only core Drill artifacts 
while storage plugins provide concrete implementations of the interfaces.
 # A reference implementation with the MapR-DB JSON plugin, whose backend supports 
secondary indexing. Other similar DB plugins whose backends support secondary 
indexes could potentially use the reference implementation as a guide.

Note that Drill is a query engine; it does not provide Create/Update/Delete 
operations on either the primary table or the indexes. These are assumed to be 
maintained by the respective backend servers with which Drill communicates via 
storage/format plugins.

+*Key design concepts:*+

*Covering index:*

An index whose index fields plus non-index fields (a.k.a. included fields) 
'cover' all the columns referenced in the query. The Drill planner will 
generate a covering (a.k.a. index-only) plan where all the columns 
are retrieved from the index after pushing down the relevant filter conditions to 
the index scan.
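
For example, with index keys \{a, b} and included field \{c} on a hypothetical 
table T, the following query can be answered entirely from the index:
{noformat}
-- Covering: all referenced columns (a, b, c) exist in the index, so the plan
-- scans only the index and never joins back to the primary table.
SELECT a, b, c FROM T WHERE a > 10 AND b < 20;
{noformat}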

*Non-covering index:*

An index whose index fields plus included fields only partially cover the columns 
referenced in the query. For instance, suppose the index is created with 
index keys \{a, b} and included fields \{c}, and the query is SELECT d, e FROM T 
WHERE a > 10 AND b < 20. Since columns d and e are not present in 
the index at all, this is a non-covering index. For such indexes, the Drill 
planner will generate a non-covering plan where only the row ids are fetched 
from the index by pushing down the WHERE clause filters, and the rest of the 
columns are fetched after a join-back to the primary table. The join-back is 
performed using the row ids. A related notion is that of a global index: the Drill 
planner assumes indexes are global in nature, i.e. the index blocks are not 
necessarily co-located with the primary table's data blocks. This is the most 
general case, since an index may be quite large, and in order to fully utilize 
the cluster resources it is best to have it fully distributed.

*Functional index:*

An index which is created not on the base table columns but on 
functions/expressions. Currently, only CAST functions have been tested, since 
these are most commonly used in Drill views. If the filter condition is 'WHERE 
CAST(zip_code AS BIGINT) = 95120' and a functional index exists on CAST(zip_code 
AS BIGINT), then the Drill planner will leverage such an index as long as it is 
exposed through the appropriate metadata interfaces.
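
For instance (a hypothetical table and view; the functional index itself is 
created and maintained by the backend, not by Drill):
{noformat}
-- A view that wraps the base column in a CAST, as is common for Drill views:
CREATE VIEW cust_v AS
SELECT CAST(zip_code AS BIGINT) AS zip_code FROM T;

-- If the backend maintains a functional index on CAST(zip_code AS BIGINT),
-- the planner can match this filter to that index expression:
SELECT * FROM cust_v WHERE zip_code = 95120;
{noformat}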

*Range partitioning:*

This is applicable to non-covering indexes. Since the index is typically 
distributed across multiple nodes in the cluster, once we retrieve the row ids 
from a particular node, we have to send each row id to the appropriate 
destination node, which contains the primary table data for that row. This is done 
through range partitioning. The Drill executor contains a special operator 
(RangePartitionRecordBatch) and a corresponding exchange 
(RangePartitionExchange) for this purpose. For example, suppose the filter 
condition is WHERE state IN ('CA', 'TX', 'NY'). Since the primary table data for 
these states may be spread out over multiple nodes, once the row ids are 
fetched from the index, they have to be grouped into separate 'ranges' such 
that co-located row ids can be sent to the same node.
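
A minimal sketch of that routing step (the boundary layout and names are 
illustrative, not the actual RangePartitionRecordBatch code):
{noformat}
// Route each row id to the node hosting its primary-table key range.
// boundaries[i] is the maximum rowkey served by node i, sorted ascending.
class RangePartitioner {
  private final byte[][] boundaries;

  RangePartitioner(byte[][] boundaries) { this.boundaries = boundaries; }

  int destinationNode(byte[] rowId) {
    for (int i = 0; i < boundaries.length; i++) {
      if (compare(rowId, boundaries[i]) <= 0) {
        return i;  // first range whose upper bound covers this row id
      }
    }
    return boundaries.length - 1;  // the last node takes the tail
  }

  private static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}
{noformat}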

*Rowkey Join and Restricted Scan (skip scan):*

A rowkey (a.k.a. rowid) join is used whenever we need to fetch rows from the 
primary table for a set of row ids retrieved from the index. Note that this is 
not a real join but more of a lookup based on the rowid. This is a random-I/O 
operation on the primary table, since each row id may access a separate data 
block of the table. The Drill executor contains a special operator 
(RowKeyJoinBatch) for doing the join-back. The right side of the RowKeyJoin is 
the index sub-plan. The left side contains a special type of scan called a 
Restricted Scan (or skip scan). The storage plugin must implement this scan, 
where a list of row ids can be submitted at 

[jira] [Commented] (DRILL-6381) Add capability to do index based planning and execution

2018-09-13 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613710#comment-16613710
 ] 

Aman Sinha commented on DRILL-6381:
---

The index planning and execution feature has been in development in a private 
repository at MapR. We would like to contribute it to Apache Drill and hope 
that it spurs further development and adoption by the community. The main ideas 
are described below.

The feature is divided into two broad categories:
 # A comprehensive index planning and execution framework which supports 
distributed covering and non-covering indexes. Index planning is done for WHERE 
clause filters, sort-based operations such as ORDER BY and GROUP BY (using 
StreamingAggregate), and joins (using MergeJoin). The framework is intended to 
be agnostic to the storage plugins. It provides a clean abstraction layer that 
allows the Drill planner and executor to work with only core Drill artifacts 
while storage plugins provide concrete implementations of the interfaces.
 # A reference implementation with the MapR-DB JSON plugin, whose backend supports 
secondary indexing. Other similar DB plugins whose backends support secondary 
indexes could potentially use the reference implementation as a guide.

Note that Drill is a query engine; it does not provide CRUD operations on either 
the primary table or the indexes. These are assumed to be maintained by the 
respective backend servers with which Drill communicates via storage/format 
plugins.

+*Key design concepts:*+

*Covering index:* 

An index whose index fields plus non-index fields (a.k.a. included fields) 
'cover' all the columns referenced in the query. The Drill planner will 
generate a covering (a.k.a. index-only) plan where all the columns 
are retrieved from the index after pushing down the relevant filter conditions to 
the index scan.

*Non-covering index:* 

An index whose index fields plus included fields only partially cover the columns 
referenced in the query. For instance, suppose the index is created with 
index keys \{a, b} and included fields \{c}, and the query is SELECT d, e FROM T 
WHERE a > 10 AND b < 20. Since columns d and e are not present in 
the index at all, this is a non-covering index. For such indexes, the Drill 
planner will generate a non-covering plan where only the row ids are fetched 
from the index by pushing down the WHERE clause filters, and the rest of the 
columns are fetched after a join-back to the primary table. The join-back is 
performed using the row ids. A related notion is that of a global index: the Drill 
planner assumes indexes are global in nature, i.e. the index blocks are not 
necessarily co-located with the primary table's data blocks. This is the most 
general case, since an index may be quite large, and in order to fully utilize 
the cluster resources it is best to have it fully distributed.

*Functional index:* 

An index which is created not on the base table columns but on 
functions/expressions. Currently, only CAST functions have been tested, since 
these are most commonly used in Drill views. If the filter condition is 'WHERE 
CAST(zip_code AS BIGINT) = 95120' and a functional index exists on CAST(zip_code 
AS BIGINT), then the Drill planner will leverage such an index as long as it is 
exposed through the appropriate metadata interfaces.

*Range partitioning:* 

This is applicable to non-covering indexes. Since the index is typically 
distributed across multiple nodes in the cluster, once we retrieve the row ids 
from a particular node, we have to send each row id to the appropriate 
destination node, which contains the primary table data for that row. This is done 
through range partitioning. The Drill executor contains a special operator 
(RangePartitionRecordBatch) and a corresponding exchange 
(RangePartitionExchange) for this purpose. For example, suppose the filter 
condition is WHERE state IN ('CA', 'TX', 'NY'). Since the primary table data for 
these states may be spread out over multiple nodes, once the row ids are 
fetched from the index, they have to be grouped into separate 'ranges' such 
that co-located row ids can be sent to the same node.

*Rowkey Join and Restricted Scan (skip scan):* 

A rowkey (a.k.a. rowid) join is used whenever we need to fetch rows from the 
primary table for a set of row ids retrieved from the index. Note that this is 
not a real join but more of a lookup based on the rowid. This is a random-I/O 
operation on the primary table, since each row id may access a separate data 
block of the table. The Drill executor contains a special operator 
(RowKeyJoinBatch) for doing the join-back. The right side of the RowKeyJoin is 
the index sub-plan. The left side contains a special type of scan called a 
Restricted Scan (or skip scan). The storage plugin must implement this scan, 
where a list of row ids can be submitted at a time if the backend supports bulk 
fetch. Otherwise, single 

[jira] [Updated] (DRILL-6651) Compilation error in IDE due to missing package name

2018-07-31 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6651:
--
Labels: ready-to-commit  (was: )

> Compilation error in IDE due to missing package name
> 
>
> Key: DRILL-6651
> URL: https://issues.apache.org/jira/browse/DRILL-6651
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Aman Sinha
>Assignee: Boaz Ben-Zvi
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0, 1.15.0
>
>
> I am seeing the following compilation errors in my Eclipse build (only in the 
> IDE; this does not happen on the Maven command line): 
> {noformat}
> The declared package "" does not match the expected package 
> "org.apache.drill.exec.store.parquet" TestComplexColumnInSchema.java  
> /drill-java-exec/src/test/java/org/apache/drill/exec/store/parquet  line 1
> The declared package "" does not match the expected package 
> "org.apache.drill.exec.store.parquet" TestParquetReaderUtility.java   
> /drill-java-exec/src/test/java/org/apache/drill/exec/store/parquet  line 
> 1{noformat}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6650) Remove Stray Semicolon in Printing Results Listener

2018-07-31 Thread Aman Sinha (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6650:
--
Labels: ready-to-commit  (was: )

> Remove Stray Semicolon in Printing Results Listener
> ---
>
> Key: DRILL-6650
> URL: https://issues.apache.org/jira/browse/DRILL-6650
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
>  Labels: ready-to-commit
>
> Having empty import statements with stray semicolons can cause compilation in 
> Eclipse to fail. I investigated adding a checkstyle check to prevent empty 
> import declarations, but apparently there is no way to do this. In fact, 
> having multiple semicolons in an import statement is technically not 
> supported by the Java spec 
> https://stackoverflow.com/questions/8125558/eclipse-double-semi-colon-on-an-import.
>  There is also a bug filed against the JDK for not throwing an error in this case:
> https://bugs.openjdk.java.net/browse/JDK-8072390
> Since there is no way to automate this check, I am just going to manually 
> remove the error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6651) Compilation error in Eclipse IDE due to missing package name

2018-07-31 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-6651:
-

 Summary: Compilation error in Eclipse IDE due to missing package 
name
 Key: DRILL-6651
 URL: https://issues.apache.org/jira/browse/DRILL-6651
 Project: Apache Drill
  Issue Type: Bug
Reporter: Aman Sinha


I am seeing the following compilation errors in my Eclipse build (only in the 
IDE; this does not happen on the Maven command line): 
{noformat}
The declared package "" does not match the expected package 
"org.apache.drill.exec.store.parquet"   TestComplexColumnInSchema.java  
/drill-java-exec/src/test/java/org/apache/drill/exec/store/parquet  line 1
The declared package "" does not match the expected package 
"org.apache.drill.exec.store.parquet"   TestParquetReaderUtility.java   
/drill-java-exec/src/test/java/org/apache/drill/exec/store/parquet  line 
1{noformat}
 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-07-18 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547976#comment-16547976
 ] 

Aman Sinha commented on DRILL-6385:
---

The design discussions are getting lost among all the GitHub comments.  I am 
not sure how to avoid this in the JIRA.  In any case, [~weijie] regarding the 
previous comments about broadcast hash join, my point was that the BF does not 
save network I/O on the probe side (since there is no network transfer) and it does not 
save memory (since blocking operators won't let the BF pass through).  So, the 
question is: how much does the reduction in row count help the hash join probe, 
given that there is a CPU cost associated with creating and evaluating a BF?  
{noformat}
Suppose the original row count from the table is N.
Let the selectivity of the BF be s1 (a fraction between 0 and 1).

Original hash join probe-side cost:
  Cost_Orig = I/O cost of scanning N rows + probe cost of N rows

With the BF, the new hash join probe-side cost:
  Cost_BF_Probe = I/O cost of scanning N rows
                + CPU cost of evaluating the bloom filter on N rows
                + probe cost of (s1 * N) rows

With the BF, let the hash join build-side cost be Cost_BF_Build.

Total cost with the BF:
  Total_HJ_BF_Cost = Cost_BF_Probe + Cost_BF_Build
{noformat}
 Whether or not the second total cost is cheaper than the original depends on the 
selectivity and the actual cost of creating/evaluating the bloom filter.  So, all 
I was saying previously was that for a broadcast HJ, the expected benefits would 
be smaller. We would only find out through some performance runs.   
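
To make the trade-off concrete, a small sketch of the cost comparison above (the 
per-row constants and the build-side overhead are illustrative assumptions, not 
measured Drill values):
{noformat}
public class BloomFilterCostModel {
  static final double IO_COST = 1.0;       // I/O cost per scanned row
  static final double PROBE_COST = 0.5;    // hash join probe cost per row
  static final double BF_EVAL_COST = 0.1;  // bloom filter evaluation cost per row

  // Probe-side cost without a bloom filter: Cost_Orig.
  static double costOrig(long n) {
    return n * IO_COST + n * PROBE_COST;
  }

  // Probe-side cost with a bloom filter of selectivity s1: Cost_BF_Probe.
  static double costBfProbe(long n, double s1) {
    return n * IO_COST + n * BF_EVAL_COST + s1 * n * PROBE_COST;
  }

  public static void main(String[] args) {
    long n = 100_000_000L;
    double bfBuildOverhead = 2_000_000;    // assumed build-side + foreman overhead
    for (double s1 : new double[] {0.01, 0.5, 0.9}) {
      double withBf = costBfProbe(n, s1) + bfBuildOverhead;
      System.out.printf("s1=%.2f  orig=%.3e  withBF=%.3e  cheaper=%b%n",
          s1, costOrig(n), withBf, withBf < costOrig(n));
    }
  }
}
{noformat}
With these (assumed) constants, the filter pays off at high selectivity (low s1) 
and loses as s1 approaches 1, which matches the expectation for broadcast HJ above.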

 

> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Affects Versions: 1.14.0
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature adds support for JPPD (Join Predicate Push Down). It will 
> benefit HashJoin and Broadcast HashJoin performance by reducing the number 
> of rows sent across the network and the memory consumed. This feature is 
> already supported by Impala, which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter from the HashJoin node to 
> Parquet’s scan node. The proposed basic procedure is as follows:
>  # The HashJoin build side accumulates the equal-join-condition rows to 
> construct a bloom filter, then sends the bloom filter to the foreman 
> node.
>  # The foreman node passively accepts the bloom filters from all the fragments 
> that have the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe-side 
> scan nodes, which may already have sent out partial data to the hash join 
> nodes (currently the hash join node will prefetch one batch from both sides).
>       4.  The scan node accepts the global bloom filter from the foreman node 
> and filters the remaining rows against the bloom filter.
>  
> To implement the above execution flow, the main new notions are described below:
>       1. RuntimeFilter
> A filter container which may contain a BloomFilter or a MinMaxFilter.
>       2. RuntimeFilterReporter
> Wraps the logic to send the hash join's bloom filter to the foreman. The 
> serialized bloom filter is sent out through the data tunnel. This object 
> is instantiated by the FragmentExecutor and passed to the 
> FragmentContext, so the HashJoin operator can obtain it through the 
> FragmentContext.
>      3. RuntimeFilterRequestHandler
> It is responsible for accepting a SendRuntimeFilterRequest RPC and stripping the 
> actual BloomFilter from the network. It then passes this filter to the 
> WorkerBee's new registerRuntimeFilter interface.
> Another RPC type is BroadcastRuntimeFilterRequest. It registers the 
> accepted global bloom filter with the WorkerBee via the registerRuntimeFilter 
> method and then propagates it to the FragmentContext, through which the probe-side 
> scan node can fetch the aggregated bloom filter.
>       4. RuntimeFilterManager
> The foreman instantiates a RuntimeFilterManager. It indirectly obtains 
> every RuntimeFilter via the WorkerBee. Once all the BloomFilters have been 
> accepted and aggregated, it broadcasts the aggregated bloom filter to 
> all the probe-side scan nodes through the data tunnel via a 
> BroadcastRuntimeFilterRequest RPC.
>      5. RuntimeFilterEnableOption 
>  A global option will be added to decide whether to enable this new feature.
>  
> Suggestions and advice are welcome. The related PR will be presented as 
> soon as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6606) Hash Join returns incorrect data types when joining subqueries with limit 0

2018-07-13 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543541#comment-16543541
 ] 

Aman Sinha commented on DRILL-6606:
---

Agree that a subquery with a filter producing 0 rows is common and we should 
address the issue.  I was mainly referring to the Tableau-generated LIMIT 0 
queries. 

> Hash Join returns incorrect data types when joining subqueries with limit 0
> ---
>
> Key: DRILL-6606
> URL: https://issues.apache.org/jira/browse/DRILL-6606
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Bohdan Kazydub
>Assignee: Timothy Farkas
>Priority: Blocker
> Fix For: 1.14.0
>
>
> PreparedStatement for query
> {code:sql}
> SELECT l.l_quantity, l.l_shipdate, o.o_custkey
> FROM (SELECT * FROM cp.`tpch/lineitem.parquet` LIMIT 0) l
>     JOIN (SELECT * FROM cp.`tpch/orders.parquet` LIMIT 0) o 
>     ON l.l_orderkey = o.o_orderkey
> LIMIT 0
> {code}
>  is created with wrong types (nullable INTEGER) for all selected columns, no 
> matter what their actual types are. This behavior reproduces with hash join 
> only and is very likely caused by DRILL-6027, as the query worked fine 
> before that feature was implemented.
> To reproduce the problem, put the aforementioned query into the 
> TestPreparedStatementProvider#joinOrderByQuery() test method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6453) TPC-DS query 72 has regressed

2018-07-13 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543351#comment-16543351
 ] 

Aman Sinha commented on DRILL-6453:
---

[~khfaraaz]  can you also try running a simplified version of the query with, 
say, the first 3 joins (starting from the leaf level in the plan)?  We should 
see what the behavior is with patterns like a hash-partitioned HJ followed by 
broadcast HJs.

> TPC-DS query 72 has regressed
> -
>
> Key: DRILL-6453
> URL: https://issues.apache.org/jira/browse/DRILL-6453
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.14.0
>Reporter: Khurram Faraaz
>Assignee: Boaz Ben-Zvi
>Priority: Blocker
> Fix For: 1.14.0
>
> Attachments: 24f75b18-014a-fb58-21d2-baeab5c3352c.sys.drill, 
> jstack_29173_June_10_2018.txt, jstack_29173_June_10_2018.txt, 
> jstack_29173_June_10_2018_b.txt, jstack_29173_June_10_2018_b.txt, 
> jstack_29173_June_10_2018_c.txt, jstack_29173_June_10_2018_c.txt, 
> jstack_29173_June_10_2018_d.txt, jstack_29173_June_10_2018_d.txt, 
> jstack_29173_June_10_2018_e.txt, jstack_29173_June_10_2018_e.txt
>
>
> TPC-DS query 72 seems to have regressed; the query profile for the case where it 
> was canceled after 2 hours on Drill 1.14.0 is attached here.
> {noformat}
> On, Drill 1.14.0-SNAPSHOT 
> commit : 931b43e (TPC-DS query 72 executed successfully on this commit, took 
> around 55 seconds to execute)
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> TPC-DS query 72 executed successfully & took 47 seconds to complete execution.
> {noformat}
> {noformat}
> TPC-DS data in the below run has date values stored as DATE datatype and not 
> VARCHAR type
> On, Drill 1.14.0-SNAPSHOT
> commit : 82e1a12
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> and
> alter system set `exec.hashjoin.num_partitions` = 1;
> TPC-DS query 72 executed for 2 hrs and 11 mins and did not complete, I had to 
> Cancel it by stopping the Foreman drillbit.
> As a result several minor fragments are reported to be in 
> CANCELLATION_REQUESTED state on UI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6606) Hash Join returns incorrect data types when joining subqueries with limit 0

2018-07-13 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543323#comment-16543323
 ] 

Aman Sinha commented on DRILL-6606:
---

I don't think LIMIT 0 in the subqueries or views is common. For instance, 
Tableau generates a wrapper LIMIT 0 on the entire query, not within each 
subquery.  What is the data type of the columns if you only have the outer LIMIT 0 
after the join of the subqueries?

> Hash Join returns incorrect data types when joining subqueries with limit 0
> ---
>
> Key: DRILL-6606
> URL: https://issues.apache.org/jira/browse/DRILL-6606
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Bohdan Kazydub
>Assignee: Timothy Farkas
>Priority: Blocker
> Fix For: 1.14.0
>
>
> PreparedStatement for query
> {code:sql}
> SELECT l.l_quantity, l.l_shipdate, o.o_custkey
> FROM (SELECT * FROM cp.`tpch/lineitem.parquet` LIMIT 0) l
>     JOIN (SELECT * FROM cp.`tpch/orders.parquet` LIMIT 0) o 
>     ON l.l_orderkey = o.o_orderkey
> LIMIT 0
> {code}
>  is created with wrong types (nullable INTEGER) for all selected columns, no 
> matter what their actual types are. This behavior reproduces with hash join 
> only and is very likely caused by DRILL-6027, as the query worked fine 
> before that feature was implemented.
> To reproduce the problem, put the aforementioned query into the 
> TestPreparedStatementProvider#joinOrderByQuery() test method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6453) TPC-DS query 72 has regressed

2018-07-12 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542476#comment-16542476
 ] 

Aman Sinha commented on DRILL-6453:
---

[~ben-zvi], [~dgu-atmapr]  looking at the query profile and plan, the lowermost 
HashJoin is the only one that does a hash-partitioned join, while the other 8 
joins are broadcast joins.  Since the hang happens when doing a next() on an 
UnorderedReceiver (associated with the hash exchange), can we see if changing 
this to a broadcast join helps avoid the issue?  The reason the hash-partitioned 
join is picked is that the right side's input (Project # 04-29) estimated row 
count is 11M, which is just over the 10M broadcast threshold.  Let's see if 
bumping this threshold up to 20M helps.  This could be a temporary workaround 
until we address the underlying issue.
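
For reference, a session-level way to try this workaround (this assumes the 
standard planner.broadcast_threshold option, whose 10M-row default is the 
threshold mentioned above):
{noformat}
ALTER SESSION SET `planner.broadcast_threshold` = 20000000;
-- re-run TPC-DS query 72, then restore the default:
ALTER SESSION RESET `planner.broadcast_threshold`;
{noformat}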

> TPC-DS query 72 has regressed
> -
>
> Key: DRILL-6453
> URL: https://issues.apache.org/jira/browse/DRILL-6453
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.14.0
>Reporter: Khurram Faraaz
>Assignee: Boaz Ben-Zvi
>Priority: Blocker
> Fix For: 1.14.0
>
> Attachments: 24f75b18-014a-fb58-21d2-baeab5c3352c.sys.drill, 
> jstack_29173_June_10_2018.txt, jstack_29173_June_10_2018.txt, 
> jstack_29173_June_10_2018_b.txt, jstack_29173_June_10_2018_b.txt, 
> jstack_29173_June_10_2018_c.txt, jstack_29173_June_10_2018_c.txt, 
> jstack_29173_June_10_2018_d.txt, jstack_29173_June_10_2018_d.txt, 
> jstack_29173_June_10_2018_e.txt, jstack_29173_June_10_2018_e.txt
>
>
> TPC-DS query 72 seems to have regressed; the query profile for the case where it 
> was canceled after 2 hours on Drill 1.14.0 is attached here.
> {noformat}
> On, Drill 1.14.0-SNAPSHOT 
> commit : 931b43e (TPC-DS query 72 executed successfully on this commit, took 
> around 55 seconds to execute)
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> TPC-DS query 72 executed successfully & took 47 seconds to complete execution.
> {noformat}
> {noformat}
> TPC-DS data in the below run has date values stored as DATE datatype and not 
> VARCHAR type
> On, Drill 1.14.0-SNAPSHOT
> commit : 82e1a12
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> and
> alter system set `exec.hashjoin.num_partitions` = 1;
> TPC-DS query 72 executed for 2 hrs and 11 mins and did not complete, I had to 
> Cancel it by stopping the Foreman drillbit.
> As a result several minor fragments are reported to be in 
> CANCELLATION_REQUESTED state on UI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6588) System table columns incorrectly marked as non-nullable

2018-07-09 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-6588:
-

 Summary: System table columns incorrectly marked as non-nullable 
 Key: DRILL-6588
 URL: https://issues.apache.org/jira/browse/DRILL-6588
 Project: Apache Drill
  Issue Type: Bug
  Components: Metadata
Affects Versions: 1.13.0
Reporter: Aman Sinha


System table columns can contain null values, but they are incorrectly marked as 
non-nullable, as shown in the example below:  
{noformat}
0: jdbc:drill:drillbit=10.10.10.191> describe sys.boot;
+---++--+
|    COLUMN_NAME    |     DATA_TYPE      | IS_NULLABLE  |
+---++--+
| name              | CHARACTER VARYING  | NO           |
| kind              | CHARACTER VARYING  | NO           |
| accessibleScopes  | CHARACTER VARYING  | NO           |
| optionScope       | CHARACTER VARYING  | NO           |
| status            | CHARACTER VARYING  | NO           |
| num_val           | BIGINT             | NO           |
| string_val        | CHARACTER VARYING  | NO           |
| bool_val          | BOOLEAN            | NO           |
| float_val         | DOUBLE             | NO           |
+---++--+{noformat}
 

Note that several columns contain nulls: 
{noformat}
name:              drill.exec.options.exec.udf.enable_dynamic_support
kind:              BOOLEAN
accessibleScopes:  BOOT
optionScope:       BOOT
status:            BOOT
num_val:           null
string_val:        null
bool_val:          true
float_val:         null
{noformat}
 

Because of the not-null metadata, predicates on these tables such as `WHERE 
<column> IS NULL` evaluate to FALSE, which is incorrect. 
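
An illustrative repro against the sys.boot row shown above (hypothetical output):
{noformat}
SELECT name, string_val FROM sys.boot WHERE string_val IS NULL;
-- Expected: rows such as drill.exec.options.exec.udf.enable_dynamic_support
-- Actual: 0 rows, because the not-null metadata lets the planner fold the
-- IS NULL predicate to FALSE
{noformat}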

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-07-05 Thread Aman Sinha (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534163#comment-16534163
 ] 

Aman Sinha commented on DRILL-6385:
---

A couple of design items to discuss, hence I am adding them here instead of in 
the PR: 
 * For broadcast hash join, the bloom filter will not save on network I/O.  It 
will reduce the number of rows in the pipeline below the hash join probe, but since 
we are not applying the bloom filter if there is a blocking operator on the probe 
side, there is not much scope for memory savings.  There is a potential benefit if 
there is some CPU-intensive operation on the probe side, which is somewhat 
unlikely, since typically such operations would be done after the join.  That 
leaves only one other possibility: the bloom filter could allow partition/rowgroup 
pruning in the scan itself, but the current implementation does not support 
that.  So, my conclusion is that for broadcast hash join the bloom filter will 
add overhead (creation of the BF, plus communication with the foreman, plus the 
foreman's communication with the drillbits) without much benefit.  Do you have 
thoughts about this?  At the very least, I suggest adding a session option 
such as 'enable_runtime_filter_broadcast_hashjoin' whose default value is FALSE, 
and looking at a performance evaluation on a large data set.   
 * For 3 or more table joins, consider a left-deep join plan:    

{noformat}
        HJ2
       /   \
     HJ1    T3
    /   \
  T1     T2
{noformat}
             Here, I was envisaging that there would be a bloom filter associated 
with the join condition of HJ1 and another bloom filter for HJ2.  However, 
since your implementation checks for blocking operators on the probe side, it 
will prevent HJ2 from creating a BF. Is that correct?

This is quite a common use case, so I was hoping that we would support it.  
From a design perspective, in this case we should first make sure that the 
bloom filter for HJ2 is only created if the join is between T1 and T3, not 
between T2 and T3, since T2 is on the build side.  Second, in terms of 
aggregating the BFs, it could behave the same way as a 2-column join between 
T1 and T2.  In other words, conceptually the following 2 join scenarios are similar 
from a bloom filter aggregation perspective: 
{noformat}
 // 2 column join between T1-T2
 T1.a1 = T2.a2 AND T1.b1 = T2.b2

 // 1 column join between T1-T2 and 1 column join between T1-T3
 T1.a1 = T2.a2 AND T1.b1 = T3.b3{noformat}

> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Affects Versions: 1.14.0
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature adds support for JPPD (Join Predicate Push Down). It will 
> benefit HashJoin and Broadcast HashJoin performance by reducing the number 
> of rows sent across the network and the memory consumed. This feature is 
> already supported by Impala, which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter from the HashJoin node to 
> Parquet’s scan node. The proposed basic procedure is as follows:
>  # The HashJoin build side accumulates the equal-join-condition rows to 
> construct a bloom filter, then sends the bloom filter to the foreman 
> node.
>  # The foreman node passively accepts the bloom filters from all the fragments 
> that have the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe-side 
> scan nodes, which may already have sent out partial data to the hash join 
> nodes (currently the hash join node will prefetch one batch from both sides).
>       4.  The scan node accepts the global bloom filter from the foreman node 
> and filters the remaining rows against the bloom filter.
>  
> To implement the above execution flow, the main new notions are described below:
>       1. RuntimeFilter
> A filter container which may contain a BloomFilter or a MinMaxFilter.
>       2. RuntimeFilterReporter
> Wraps the logic to send the hash join's bloom filter to the foreman. The 
> serialized bloom filter is sent out through the data tunnel. This object 
> is instantiated by the FragmentExecutor and passed to the 
> FragmentContext, so the HashJoin operator can obtain it through the 
> FragmentContext.
>      3. RuntimeFilterRequestHandler
> It is responsible for accepting a SendRuntimeFilterRequest RPC and stripping the 
> actual BloomFilter from the network. It then passes this filter to the 
> 

[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-05-07 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466808#comment-16466808
 ] 

Aman Sinha commented on DRILL-6385:
---

Sending a link to a short design overview doc [2] I had proposed during the 
Drill design hackathon [1] in September 2017.  The proposal was to send the 
bloom filter past the exchange boundary rather than sending it to the foreman. 
However, this is not implemented, so your contribution would be welcome.  I 
think doing the hash-partitioned hash join first seems fine since that's the 
one that would benefit the most.  Looking forward to your pull request!

[[1] 
https://lists.apache.org/thread.html/74cf48dd78d323535dc942c969e72008884e51f8715f4a20f6f8fb66@%3Cdev.drill.apache.org%3E|https://lists.apache.org/thread.html/74cf48dd78d323535dc942c969e72008884e51f8715f4a20f6f8fb66@%3Cdev.drill.apache.org%3E]

[2] 
[https://docs.google.com/document/d/1cNznfv60wwuFJlbKwkVbCBNGSBlY5QbjYNgglPw8JQ0/edit?usp=sharing]

 

> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature adds support for JPPD (Join Predicate Push Down). It will 
> benefit HashJoin and Broadcast HashJoin performance by reducing the number 
> of rows sent across the network and the memory consumed. This feature is 
> already supported by Impala, which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter from the HashJoin node to 
> Parquet’s scan node. The proposed basic procedure is as follows:
>  # The HashJoin build side accumulates the equal-join-condition rows to 
> construct a bloom filter, then sends the bloom filter to the foreman 
> node.
>  # The foreman node passively accepts the bloom filters from all the fragments 
> that have the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe-side 
> scan nodes, which may already have sent out partial data to the hash join 
> nodes (currently the hash join node will prefetch one batch from both sides).
>       4.  The scan node accepts the global bloom filter from the foreman node 
> and filters the remaining rows against the bloom filter.
>  
> To implement the above execution flow, the main new notions are described below:
>       1. RuntimeFilter
> A filter container which may contain a BloomFilter or a MinMaxFilter.
>       2. RuntimeFilterReporter
> Wraps the logic to send the hash join's bloom filter to the foreman. The 
> serialized bloom filter is sent out through the data tunnel. This object 
> is instantiated by the FragmentExecutor and passed to the 
> FragmentContext, so the HashJoin operator can obtain it through the 
> FragmentContext.
>      3. RuntimeFilterRequestHandler
> It is responsible for accepting a SendRuntimeFilterRequest RPC and stripping the 
> actual BloomFilter from the network. It then passes this filter to the 
> WorkerBee's new registerRuntimeFilter interface.
> Another RPC type is BroadcastRuntimeFilterRequest. It registers the 
> accepted global bloom filter with the WorkerBee via the registerRuntimeFilter 
> method and then propagates it to the FragmentContext, through which the probe-side 
> scan node can fetch the aggregated bloom filter.
>       4. RuntimeFilterManager
> The foreman instantiates a RuntimeFilterManager. It indirectly obtains 
> every RuntimeFilter via the WorkerBee. Once all the BloomFilters have been 
> accepted and aggregated, it broadcasts the aggregated bloom filter to 
> all the probe-side scan nodes through the data tunnel via a 
> BroadcastRuntimeFilterRequest RPC.
>      5. RuntimeFilterEnableOption 
>  A global option will be added to decide whether to enable this new feature.
>  
> Suggestions and advice are welcome. The related PR will be presented as 
> soon as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-05-07 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466032#comment-16466032
 ] 

Aman Sinha commented on DRILL-6385:
---

[~weijie] thanks for working on this.  It sounds like you are far along in the 
implementation.  Just as a future reference, it would be good to create the 
Jira sooner or inform the dev list about the ongoing work so that others in 
the community are aware.  

Regarding the proposal, a couple of thoughts: is a global bloom filter always 
needed, or will a local bloom filter suffice in certain cases?  In the case 
where we are doing a broadcast hash join, the probe side is never distributed, so 
once the build is done on each minor fragment, the bloom filter can be passed 
to the Scan operator locally without contacting the foreman node.  A second 
related thought: for a hash-distributed hash join where both probe and build 
sides are hash distributed, does it mean that a 'global bloom filter' is a 
synchronization point in your proposal? In other words, suppose there are 20 
minor fragments and one of them is slow in completing the build phase; will the 
other 19 probes continue at their own pace?

> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature is to support the JPPD (Join Predicate Push Down). It will 
> benefit the HashJoin ,Broadcast HashJoin performance by reducing the number 
> of rows to send across the network ,the memory consumed. This feature is 
> already supported by Impala which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter of HashJoin node to 
> Parquet’s scan node.   The propose basic procedure is described as follow:
>  # The HashJoin build side accumulate the equal join condition rows to 
> construct a bloom filter. Then it sends out the bloom filter to the foreman 
> node.
>  # The foreman node accept the bloom filters passively from all the fragments 
> that has the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe side 
> scan nodes which maybe already have send out partial data to the hash join 
> nodes(currently the hash join node will prefetch one batch from both sides ).
>       4.  The scan node accepts a global bloom filter from the foreman node. 
> It will filter the rest rows satisfying the bloom filter.
>  
> To implement above execution flow, some main new notion described as below:
>       1. RuntimeFilter
> It’s a filter container which may contain BloomFilter or MinMaxFilter.
>       2. RuntimeFilterReporter
> It wraps the logic to send hash join’s bloom filter to the foreman.The 
> serialized bloom filter will be sent out through the data tunnel.This object 
> will be instanced by the FragmentExecutor and passed to the 
> FragmentContext.So the HashJoin operator can obtain it through the 
> FragmentContext.
>      3. RuntimeFilterRequestHandler
> It is responsible to accept a SendRuntimeFilterRequest RPC to strip the 
> actual BloomFilter from the network. It then translates this filter to the 
> WorkerBee’s new interface registerRuntimeFilter.
> Another RPC type is BroadcastRuntimeFilterRequest. It will register the 
> accepted global bloom filter to the WorkerBee by the registerRuntimeFilter 
> method and then propagate to the FragmentContext through which the probe side 
> scan node can fetch the aggregated bloom filter.
>       4.RuntimeFilterManager
> The foreman will instance a RuntimeFilterManager .It will indirectly get 
> every RuntimeFilter by the WorkerBee. Once all the BloomFilters have been 
> accepted and aggregated . It will broadcast the aggregated bloom filter to 
> all the probe side scan nodes through the data tunnel by a 
> BroadcastRuntimeFilterRequest RPC.
>      5. RuntimeFilterEnableOption 
>  A global option will be added to decide whether to enable this new feature.
>  
> Welcome suggestion and advice from you.The related PR will be presented as 
> soon as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6383) View column types, modes are plan-time guesses, not actual types

2018-05-04 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463417#comment-16463417
 ] 

Aman Sinha commented on DRILL-6383:
---

This link [1] has some details of views being used by Tableau.  My 
understanding is that in the data exploration phase, the user previews a sample 
of the data and defines a view (this can also be done through Drill Explorer).  
So in that sense, there is indeed a query execution phase before defining the 
view, but it is initiated by the user.  Since Drill does not have a CREATE TABLE 
DDL, CREATE VIEW is the closest abstraction, which is what Tableau uses.   
One option to improve the accuracy of views is to extend the view-creation 
syntax to include a sampling clause, which would basically mimic what the user 
was doing in the data exploration phase. 

I don't have experience with the client side, so I am not sure what the 
impact of removing some pieces of information from the view definition would be.   
At the very least, column names should be there.  For the data type, if a CAST is 
specified, then the data type should be there.  If we want to make nullability 
more accurate, we could mark it UNKNOWN by default, but since the user knows 
best (e.g. the rowkey column in HBase is non-nullable), they could mark it 
non-nullable. 

Also cc-ing [~aengelbrecht] who has worked with the Drill+BI tools to shed more 
light. 

[1] [http://drill.apache.org/docs/tableau-examples]

> View column types, modes are plan-time guesses, not actual types
> 
>
> Key: DRILL-6383
> URL: https://issues.apache.org/jira/browse/DRILL-6383
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Priority: Minor
>
> Create a view views and look at the list of columns within the view. You'll 
> see that they are often wrong in name, type and mode.
> Consider a very simple CSV file with headers:
> {noformat}
> custId,name,balance,status
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> 128,Barney,1.23,PAST DUE,30
> {noformat}
> Define the simplest possible view:
> {noformat}
> CREATE VIEW myView2 AS SELECT * FROM `csvh/cust.csvh`;
> {noformat}
> Then look at the view file:
> {noformat}
> {
>   "name" : "myView2",
>   "sql" : "SELECT *\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
> "name" : "**",
> "type" : "DYNAMIC_STAR",
> "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> It is clear that the view simply captured the plan-time list of the new 
> double-star for the wildcard. Since this is not a true type, it should not 
> have an `isNullable` attribute.
> OK, we have to spell out the columns:
> {noformat}
> CREATE VIEW myView3 AS SELECT custId  FROM `csvh/cust.csvh`;
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView3",
>   "sql" : "SELECT `custId`\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
> "name" : "custId",
> "type" : "ANY",
> "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> The name is correct. The type is `ANY`, which is wrong. Since this is a CSV 
> file, the column type is `VARCHAR`. Further, because this is a CSV file which 
> headers, the mode is REQUIRED, but is listed as nullable. To verify:
> {noformat}
> SELECT sqlTypeOf(custId), modeOf(custId) FROM myView3 LIMIT 1;
> ++---+
> |   EXPR$0   |  EXPR$1   |
> ++---+
> | CHARACTER VARYING  | NOT NULL  |
> ++---+
> {noformat}
> Now, let's try a CSV file without headers:
> {noformat}
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> {noformat}
> {noformat}
> CREATE VIEW myView4 AS SELECT columns FROM `csv/cust.csv`;
> SELECT * FROM myView4;
> ++
> |columns |
> ++
> | ["123","Fred","456.78"]|
> | ["125","Betty","98.76","VIP"]  |
> ++
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView4",
>   "sql" : "SELECT `columns`\nFROM `csv/cust.csv`",
>   "fields" : [ {
> "name" : "columns",
> "type" : "ANY",
> "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> This is almost non-sensical. `columns` is reported as type `ANY` and 
> nullable. But, `columns` is Repeated `VARCHAR` and repeated types cannot be 
> nullable.
> The conclusion is that the type information is virtually worthless and the 
> `isNullable` information is worse than worthless: it is plain wrong.
> The type information is valid only if the planner can inver types:
> {noformat}
> CREATE VIEW myView5 AS
>   SELECT CAST(custId AS INTEGER) AS custId FROM `csvh/cust.csvh`;
> 

[jira] [Commented] (DRILL-6383) View column types, modes are plan-time guesses, not actual types

2018-05-03 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461973#comment-16461973
 ] 

Aman Sinha commented on DRILL-6383:
---

Drill views don't execute the query (not even a LIMIT 0), which is by design, 
afaik.  So the primary use case for views is the one with CAST expressions, 
which, as your myView5 shows, has the correct type inference.  Nullability is 
not expressible in a CAST, so Drill uses the default of 'less strict' 
nullability, i.e. always assume nullable.  For schema-less systems, I would think 
that this is a reasonable choice. 

One could argue that even doing a LIMIT 0 query is no guarantee that the schema 
would match the type and mode of what is returned at run time, since it depends 
on the order in which files are read.  

Note that for BI tools we recommend creating views with CASTs, and the tool (at 
least Tableau) wraps the entire complex query with a LIMIT 0, which hints to Drill 
to use type inference directly at planning time without going through 
even the build-schema phase of execution.  
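
A minimal sketch of that recommended pattern, reusing the cust.csvh file from 
the issue description (the view name here is illustrative):
{noformat}
-- A view with explicit CASTs, so column types are inferable at plan time:
CREATE VIEW bi_cust AS
SELECT CAST(custId AS INTEGER) AS custId,
       CAST(balance AS DOUBLE) AS balance
FROM `csvh/cust.csvh`;

-- A BI tool such as Tableau then probes metadata with a wrapper like:
SELECT * FROM bi_cust LIMIT 0;
{noformat}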

> View column types, modes are plan-time guesses, not actual types
> 
>
> Key: DRILL-6383
> URL: https://issues.apache.org/jira/browse/DRILL-6383
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Priority: Minor
>
> Create a view views and look at the list of columns within the view. You'll 
> see that they are often wrong in name, type and mode.
> Consider a very simple CSV file with headers:
> {noformat}
> custId,name,balance,status
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> 128,Barney,1.23,PAST DUE,30
> {noformat}
> Define the simplest possible view:
> {noformat}
> CREATE VIEW myView2 AS SELECT * FROM `csvh/cust.csvh`;
> {noformat}
> Then look at the view file:
> {noformat}
> {
>   "name" : "myView2",
>   "sql" : "SELECT *\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
> "name" : "**",
> "type" : "DYNAMIC_STAR",
> "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> It is clear that the view simply captured the plan-time list of the new 
> double-star for the wildcard. Since this is not a true type, it should not 
> have an `isNullable` attribute.
> OK, we have to spell out the columns:
> {noformat}
> CREATE VIEW myView3 AS SELECT custId  FROM `csvh/cust.csvh`;
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView3",
>   "sql" : "SELECT `custId`\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
> "name" : "custId",
> "type" : "ANY",
> "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> The name is correct. The type is `ANY`, which is wrong. Since this is a CSV 
> file, the column type is `VARCHAR`. Further, because this is a CSV file which 
> headers, the mode is REQUIRED, but is listed as nullable. To verify:
> {noformat}
> SELECT sqlTypeOf(custId), modeOf(custId) FROM myView3 LIMIT 1;
> ++---+
> |   EXPR$0   |  EXPR$1   |
> ++---+
> | CHARACTER VARYING  | NOT NULL  |
> ++---+
> {noformat}
> Now, let's try a CSV file without headers:
> {noformat}
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> {noformat}
> {noformat}
> CREATE VIEW myView4 AS SELECT columns FROM `csv/cust.csv`;
> SELECT * FROM myView4;
> ++
> |columns |
> ++
> | ["123","Fred","456.78"]|
> | ["125","Betty","98.76","VIP"]  |
> ++
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView4",
>   "sql" : "SELECT `columns`\nFROM `csv/cust.csv`",
>   "fields" : [ {
> "name" : "columns",
> "type" : "ANY",
> "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> This is almost nonsensical. `columns` is reported as type `ANY` and 
> nullable. But `columns` is Repeated `VARCHAR`, and repeated types cannot be 
> nullable.
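> One way to check (a hypothetical probe in the spirit of the verification 
> above; exact output not shown since it may vary):
> {noformat}
> SELECT sqlTypeOf(columns), modeOf(columns) FROM myView4 LIMIT 1;
> {noformat}
> A repeated column should report an array mode rather than NULLABLE or NOT 
> NULL, which is exactly what the view file fails to capture.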
> The conclusion is that the type information is virtually worthless, and the 
> `isNullable` information is worse than worthless: it is plain wrong.
> The type information is valid only if the planner can infer types:
> {noformat}
> CREATE VIEW myView5 AS
>   SELECT CAST(custId AS INTEGER) AS custId FROM `csvh/cust.csvh`;
> {noformat}
> The view file:
> {noformat}
> {
>   "name" : "myView5",
>   "sql" : "SELECT CAST(`custId` AS INTEGER) AS `custId`\nFROM 
> `csvh/cust.csvh`",
>   "fields" : [ {
> "name" : "custId",
> "type" : "INTEGER",
> "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> Note that the `type` is inferred from the cast, but `isNullable` is wrong. 
> 

[jira] [Created] (DRILL-6381) Add capability to do index based planning and execution

2018-05-02 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-6381:
-

 Summary: Add capability to do index based planning and execution
 Key: DRILL-6381
 URL: https://issues.apache.org/jira/browse/DRILL-6381
 Project: Apache Drill
  Issue Type: New Feature
  Components: Execution - Relational Operators, Query Planning & Optimization
Reporter: Aman Sinha
Assignee: Aman Sinha


If the underlying data source supports indexes (primary and secondary), Drill 
should leverage those during planning and execution in order to improve query 
performance.  

On the planning side, the Drill planner should be enhanced to provide an 
abstraction layer that expresses the index metadata and statistics.  Further, 
cost-based index selection is needed to decide which index(es) are suitable.  

On the execution side, appropriate operator enhancements would be needed to 
handle different categories of indexes, such as covering and non-covering 
indexes, taking into consideration that the index data may not be co-located 
with the primary table, i.e. a global index.
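
As an illustration of the kind of query this feature targets (the table and 
index below are hypothetical, not part of any spec):
{noformat}
-- Assume a secondary index on customers(custId).  Cost-based index
-- selection could pick an index scan over a full table scan here:
SELECT name, balance
FROM customers
WHERE custId = 123;

-- If the index covers (custId, name, balance), the query can be
-- answered from the index alone; a non-covering index must also fetch
-- rows from the primary table, which may live on another node
-- (the global-index case mentioned above).
{noformat}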



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6341) Mongo Tests Fail on OSX 10.13.4

2018-04-18 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443265#comment-16443265
 ] 

Aman Sinha commented on DRILL-6341:
---

Hey [~timothyfarkas], thanks for chasing this down.  I am tagging 
[~akumarb2010] or [~kam_iitkgp] to see if they can help since they wrote the 
MongoDB plugin.  

> Mongo Tests Fail on OSX 10.13.4
> ---
>
> Key: DRILL-6341
> URL: https://issues.apache.org/jira/browse/DRILL-6341
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
>
> [mongod output] [mongoimport error]failed MSpanList_Insert 0x604000 
> 0x344eb6bdc31 0x0
> [mongoimport error] fatal error: MSpanList_Insert
> [mongoimport error] 
> [mongoimport error] runtime stack:
> [mongoimport error] runtime.throw(0x468d0b)
> [mongoimport error]   /usr/local/go/src/runtime/panic.go:491 +0xad 
> fp=0x7ffeefbff000 sp=0x7ffeefbfefd0
> [mongoimport error] runtime.MSpanList_Insert(0x48a0c8, 0x604000)
> [mongoimport error]   /usr/local/go/src/runtime/mheap.c:692 +0x8f 
> fp=0x7ffeefbff028 sp=0x7ffeefbff000
> [mongoimport error] MHeap_FreeSpanLocked(0x486cc0, 0x604000, 0x100)
> [mongoimport error]   /usr/local/go/src/runtime/mheap.c:583 +0x163 
> fp=0x7ffeefbff068 sp=0x7ffeefbff028
> [mongoimport error] MHeap_Grow(0x486cc0, 0x8, 0x0)
> [mongoimport error]   /usr/local/go/src/runtime/mheap.c:420 +0x1a8 
> fp=0x7ffeefbff0a8 sp=0x7ffeefbff068
> [mongoimport error] MHeap_AllocSpanLocked(0x486cc0, 0x1, 0x0)
> [mongoimport error]   /usr/local/go/src/runtime/mheap.c:298 +0x365 
> fp=0x7ffeefbff0e8 sp=0x7ffeefbff0a8
> [mongoimport error] mheap_alloc(0x486cc0, 0x1, 0x12, 0x0)
> [mongoimport error]   /usr/local/go/src/runtime/mheap.c:190 +0x121 
> fp=0x7ffeefbff110 sp=0x7ffeefbff0e8
> [mongoimport error] runtime.MHeap_Alloc(0x486cc0, 0x1, 0x112, 0xdbc9)
> [mongoimport error]   /usr/local/go/src/runtime/mheap.c:240 +0x66 
> fp=0x7ffeefbff148 sp=0x7ffeefbff110
> [mongoimport error] MCentral_Grow(0x48ea38, 0x0)
> [mongoimport error]   /usr/local/go/src/runtime/mcentral.c:197 +0x8b 
> fp=0x7ffeefbff1b0 sp=0x7ffeefbff148
> [mongoimport error] runtime.MCentral_CacheSpan(0x48ea38, 0x0)
> [mongoimport error]   /usr/local/go/src/runtime/mcentral.c:85 +0x167 
> fp=0x7ffeefbff1e8 sp=0x7ffeefbff1b0
> [mongoimport error] runtime.MCache_Refill(0x60, 0x12, 0x0)
> [mongoimport error]   /usr/local/go/src/runtime/mcache.c:90 +0xa0 
> fp=0x7ffeefbff210 sp=0x7ffeefbff1e8
> [mongoimport error] runtime.mcacheRefill_m()
> [mongoimport error]   /usr/local/go/src/runtime/malloc.c:368 +0x57 
> fp=0x7ffeefbff230 sp=0x7ffeefbff210
> [mongoimport error] runtime.onM(0x3917f8)
> [mongoimport error]   /usr/local/go/src/runtime/asm_amd64.s:273 +0x9a 
> fp=0x7ffeefbff238 sp=0x7ffeefbff230
> [mongoimport error] runtime.mallocgc(0x120, 0x2f1120, 0x0, 0x0)
> [mongoimport error]   /usr/local/go/src/runtime/malloc.go:178 +0x849 
> fp=0x7ffeefbff2e8 sp=0x7ffeefbff238
> [mongoimport error] runtime.newobject(0x2f1120, 0x60)
> [mongoimport error]   /usr/local/go/src/runtime/malloc.go:353 +0x49 
> fp=0x7ffeefbff310 sp=0x7ffeefbff2e8
> [mongoimport error] runtime.newG(0x29c0a)
> [mongoimport error]   /usr/local/go/src/runtime/proc.go:233 +0x2a 
> fp=0x7ffeefbff328 sp=0x7ffeefbff310
> [mongoimport error] allocg(0x478540)
> [mongoimport error]   /usr/local/go/src/runtime/proc.c:925 +0x1f 
> fp=0x7ffeefbff338 sp=0x7ffeefbff328
> [mongoimport error] runtime.malg(0x8000, 0x4785e0)
> [mongoimport error]   /usr/local/go/src/runtime/proc.c:2106 +0x1f 
> fp=0x7ffeefbff368 sp=0x7ffeefbff338
> [mongoimport error] runtime.mpreinit(0x478bc0)
> [mongoimport error]   /usr/local/go/src/runtime/os_darwin.c:137 +0x27 
> fp=0x7ffeefbff380 sp=0x7ffeefbff368
> [mongoimport error] mcommoninit(0x478bc0)
> [mongoimport error]   /usr/local/go/src/runtime/proc.c:201 +0xc9 
> fp=0x7ffeefbff3a8 sp=0x7ffeefbff380
> [mongoimport error] runtime.schedinit()
> [mongoimport error]   /usr/local/go/src/runtime/proc.c:138 +0x55 
> fp=0x7ffeefbff3d0 sp=0x7ffeefbff3a8
> [mongoimport error] runtime.rt0_go(0x7ffeefbff400, 0xd, 0x7ffeefbff400, 0x0, 
> 0xd, 0x7ffeefbff6c0, 0x7ffeefbff73b, 0x7ffeefbff742, 0x7ffeefbff748, 
> 0x7ffeefbff74d, ...)
> [mongoimport error]   /usr/local/go/src/runtime/asm_amd64.s:95 +0x116 
> fp=0x7ffeefbff3d8 sp=0x7ffeefbff3d0
> [mongoimport error] 
> 12:45:37.991 [main] ERROR d.f.e.p.runtime.AbstractProcess - failed to call 
> onAfterProcessStart()
> [mongod output] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time 
> elapsed: -635,185.335 sec <<< FAILURE! - in 
> org.apache.drill.exec.store.mongo.MongoTestSuit
> org.apache.drill.exec.store.mongo.MongoTestSuit  Time elapsed: -635,185.336 
> sec  <<< ERROR!
> java.io.IOException: Could not start process: 
>   at 
> 

[jira] [Updated] (DRILL-6103) lsb_release: command not found

2018-03-31 Thread Aman Sinha (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-6103:
--
Labels: ready-to-commit  (was: )

> lsb_release: command not found
> --
>
> Key: DRILL-6103
> URL: https://issues.apache.org/jira/browse/DRILL-6103
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Chunhui Shi
>Assignee: Kunal Khatua
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.14.0
>
> Attachments: drill-config.sh.patch
>
>
> Got this error when running drillbit.sh:
>  
> $ bin/drillbit.sh restart
> bin/drill-config.sh: line 317: lsb_release: command not found



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

