[jira] [Assigned] (DRILL-7391) Wrong result when doing left outer join on CSV table
[ https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha reassigned DRILL-7391:
---------------------------------

    Assignee: Vova Vysotskyi  (was: Aman Sinha)

> Wrong result when doing left outer join on CSV table
> ----------------------------------------------------
>
>                 Key: DRILL-7391
>                 URL: https://issues.apache.org/jira/browse/DRILL-7391
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.16.0
>            Reporter: Aman Sinha
>            Assignee: Vova Vysotskyi
>            Priority: Major
>             Fix For: 1.17.0
>
>         Attachments: tt5.tar.gz, tt6.tar.gz
>
>
> The following query shows 1 row that is incorrect. For the non-null rows, both columns should have the same value. This is on CSV sample data (I will attach the files).
> {noformat}
> apache drill (dfs.tmp)> select tt5.columns[0], tt6.columns[0] from tt5 left outer join tt6 on tt5.columns[0] = tt6.columns[0];
> +--------+--------+
> | EXPR$0 | EXPR$1 |
> +--------+--------+
> | 455    | null   |
> | 455    | null   |
> | 555    | null   |
> | 1414   | 1414   |
> | 455    | null   |
> | 580    | null   |
> |        | null   |
> | 555    | null   |
> | 455    | null   |
> | 455    | null   |
> | 455    | null   |
> | 455    | null   |
> | 455    | null   |
> | 555    | null   |
> | 455    | null   |
> | 455    | null   |
> | 455    | null   |
> | 580    | null   |
> | 6767   | null   |
> | 455    | null   |
> | 555    | null   |
> | 455    | null   |
> | 555    | null   |
> | 555    | null   |
> | 555    | null   |
> | 455    | null   |
> | 555    | null   |
> | 455    | null   |
> | 455    | null   |
> | 455    | null   |
> | 6767   | null   |
> | 555    | null   |
> | 555    | null   |
> | 455    | null   |
> | 555    | null   |
> | 555    | null   |
> | 1414   | 1414   |
> | 455    | null   |
> | 555    | null   |
> | 555    | null   |
> | 455    | null   |
> | 455    | null   |
> | 555    | null   |
> | 455    | null   |
> | 555    | null   |
> | 555    | null   |
> | 455    | null   |
> | 455    | null   |
> | 9669   | 1414   | <--- Wrong result
> | 555    | null   |
> | 455    | null   |
> | 455    | null   |
> | 455    | null   |
> | 555    | null   |
> | 580    | null   |
> | 455    | null   |
> | 555    | null   |
> | 455    | null   |
> | 555    | null   |
> | 455    | null   |
> | 455    | null   |
> | 409    | null   |
> | 455    | null   |
> | 555    | null   |
> | 555    | null   |
> | 455    | null   |
> | 455    | null   |
> | 555    | null   |
> | 455    | null   |
> | 555    | null   |
> | 1414   | 1414   |
> | 455    | null   |
> | 555    | null   |
> | 555    | null   |
> | 555    | null   |
> +--------+--------+
> 75 rows selected
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
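The invariant the report relies on ("for the non-null rows, both columns should have the same value") can be sketched as follows. This is a minimal model of a left outer join on the first CSV column, with illustrative values, not the attached tt5/tt6 data; for simplicity, duplicate right-side keys are collapsed, so one output row is produced per left row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LeftJoinInvariant {
  // Left outer join of two single-column tables on their only column.
  static List<String[]> leftOuterJoin(List<String> left, List<String> right) {
    Set<String> rightKeys = new HashSet<>(right);
    List<String[]> out = new ArrayList<>();
    for (String l : left) {
      // A matched row carries the right-side key; an unmatched row is null-extended.
      out.add(new String[] {l, rightKeys.contains(l) ? l : null});
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> tt5 = Arrays.asList("455", "1414", "9669", "555");
    List<String> tt6 = Arrays.asList("1414");
    for (String[] row : leftOuterJoin(tt5, tt6)) {
      // Because the join condition is equality on the same expression,
      // EXPR$1 must either be null or equal EXPR$0; the reported row
      // "9669 | 1414" violates exactly this.
      if (row[1] != null && !row[1].equals(row[0])) {
        throw new AssertionError("wrong result: " + Arrays.toString(row));
      }
      System.out.println(Arrays.toString(row));
    }
  }
}
```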
[jira] [Commented] (DRILL-7391) Wrong result when doing left outer join on CSV table
[ https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954921#comment-16954921 ]

Aman Sinha commented on DRILL-7391:
-----------------------------------

[~volodymyr], I assume this fix would go into 1.17 since it is a wrong result. Since I will be offline for a few days, I am assigning this to you for updating the pom.xml file for the Calcite version once CALCITE-3390 is merged and also adding the following unit test to TestExampleQueries.java:
{noformat}
  @Test  // DRILL-7391
  public void testItemPushdownPastLeftOuterJoin() throws Exception {
    String query = "select t1.columns[0] as a, t2.columns[0] as b from cp.`store/text/data/regions.csv` t1 " +
        " left outer join cp.`store/text/data/regions.csv` t2 on t1.columns[0] = t2.columns[0]";
    PlanTestBase.testPlanMatchingPatterns(query,
        new String[] {},
        // exclude pattern where Project is projecting the 'columns' field
        new String[] {"Project.*columns"});
  }
{noformat}
[jira] [Commented] (DRILL-7391) Wrong result when doing left outer join on CSV table
[ https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954729#comment-16954729 ]

Aman Sinha commented on DRILL-7391:
-----------------------------------

Adding link to CALCITE-3390 on which this fix depends.
[jira] [Commented] (DRILL-7391) Wrong result when doing left outer join on CSV table
[ https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941377#comment-16941377 ]

Aman Sinha commented on DRILL-7391:
-----------------------------------

I narrowed it down to the Calcite upgrade [1] that was done in Drill 1.13.0. Just before the Calcite upgrade, the logical plan and the results are correct (see below). After the upgrade, the logical plan has an extra 'columns' field in the Project on the right input of the Left Outer Join, which is what is causing the problem. The 'columns' is present because the ITEM expression above the LOJ was only pushed down to the left side of the join but not to the right side. Here's the sequence, which should clarify (here I am using 2 different tables compared to what I used in the Description, but the result is similar):
{noformat}
// Before Calcite upgrade in Drill 1.13.0 -- RIGHT result

0: jdbc:drill:zk=local> select * from sys.version;
+------------------+-------------------------------------------+-----------------------+----------------------------+----------------------+----------------------------+
| version          | commit_id                                 | commit_message        | commit_time                | build_email          | build_time                 |
+------------------+-------------------------------------------+-----------------------+----------------------------+----------------------+----------------------------+
| 1.13.0-SNAPSHOT  | 450e67094eb6e9a6484d7f86c49b51c77a08d7b2  | REVERTED: DRILL-5089  | 16.01.2018 @ 02:10:13 PST  | asi...@maprtech.com  | 27.09.2019 @ 11:25:45 PDT  |
+------------------+-------------------------------------------+-----------------------+----------------------------+----------------------+----------------------------+

0: jdbc:drill:zk=local> explain plan without implementation for select tt7.columns[0], tt8.columns[0] as x from tt7 left outer join tt8 on tt7.columns[0] = tt8.columns[0];
+------+------+
| text | json |
+------+------+
DrillScreenRel
  DrillProjectRel(EXPR$0=[$1], x=[$3])
    DrillJoinRel(condition=[=($0, $2)], joinType=[left])
      DrillProjectRel($f2=[ITEM($0, 0)], ITEM=[ITEM($0, 0)])
        DrillScanRel(table=[[dfs, tmp, tt7]], groupscan=[EasyGroupScan [selectionRoot=file:/tmp/tt7, numFiles=1, columns=[`columns`[0]], files=[file:/tmp/tt7/0_0_0.csv]]])
      DrillProjectRel($f2=[ITEM($0, 0)], ITEM=[ITEM($0, 0)])
        DrillScanRel(table=[[dfs, tmp, tt8]], groupscan=[EasyGroupScan [selectionRoot=file:/tmp/tt8, numFiles=1, columns=[`columns`[0]], files=[file:/tmp/tt8/0_0_0.csv]]])

0: jdbc:drill:zk=local> select tt7.columns[0], tt8.columns[0] as x from tt7 left outer join tt8 on tt7.columns[0] = tt8.columns[0];
+---------+-------+
| EXPR$0  | x     |
+---------+-------+
| 1414    | 1414  |
| 455     | null  |
| 555     | null  |
| 1414    | 1414  |
| 455     | null  |
| 9669    | 9669  |
| 555     | null  |
+---------+-------+

// After Calcite upgrade in Drill 1.13.0 -- WRONG result

0: jdbc:drill:zk=local> select * from sys.version;
+------------------+-------------------------------------------+--------------------------------+----------------------------+----------------------+----------------------------+
| version          | commit_id                                 | commit_message                 | commit_time                | build_email          | build_time                 |
+------------------+-------------------------------------------+--------------------------------+----------------------------+----------------------+----------------------------+
| 1.13.0-SNAPSHOT  | 3f0e517fb62a1ebad92fb473e787d343152920d6  | DRILL-3993: Resolve conflicts  | 16.01.2018 @ 02:10:13 PST  | asi...@maprtech.com  | 27.09.2019 @ 11:53:51 PDT  |
+------------------+-------------------------------------------+--------------------------------+----------------------------+----------------------+----------------------------+

0: jdbc:drill:zk=local> explain plan without implementation for select tt7.columns[0], tt8.columns[0] as x from tt7 left outer join tt8 on tt7.columns[0] = tt8.columns[0];
+------+------+
| text | json |
+------+------+
DrillScreenRel
  DrillProjectRel(EXPR$0=[$1], x=[ITEM($2, 0)])
    DrillJoinRel(condition=[=($0, $3)], joinType=[left])
      DrillProjectRel($f2=[ITEM($0, 0)], ITEM=[ITEM($0, 0)])
        DrillScanRel(table=[[dfs, tmp, tt7]], groupscan=[EasyGroupScan [selectionRoot=file:/tmp/tt7, numFiles=1, columns=[`columns`[0]], files=[file:/tmp/tt7/0_0_0.csv]]])
      DrillProjectRel(columns=[$0], $f2=[ITEM($0, 0)])
        DrillScanRel(table=[[dfs, tmp, tt8]], groupscan=[EasyGroupScan [selectionRoot=file:/tmp/tt8, numFiles=1, columns=[`columns`, `columns`[0]], files=[file:/tmp/tt8/0_0_0.csv]]])

0: jdbc:drill:zk=local> select tt7.columns[0], tt8.columns[0] as x from tt7 left outer join tt8 on tt7.columns[0] = tt8.columns[0];
+---------+-------+
|
[jira] [Updated] (DRILL-7391) Wrong result when doing left outer join on CSV table
[ https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated DRILL-7391:
------------------------------
    Attachment: tt5.tar.gz
[jira] [Updated] (DRILL-7391) Wrong result when doing left outer join on CSV table
[ https://issues.apache.org/jira/browse/DRILL-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated DRILL-7391:
------------------------------
    Attachment: tt6.tar.gz
[jira] [Created] (DRILL-7391) Wrong result when doing left outer join on CSV table
Aman Sinha created DRILL-7391:
---------------------------------

             Summary: Wrong result when doing left outer join on CSV table
                 Key: DRILL-7391
                 URL: https://issues.apache.org/jira/browse/DRILL-7391
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 1.16.0
            Reporter: Aman Sinha
            Assignee: Aman Sinha
             Fix For: 1.17.0


The following query shows 1 row that is incorrect. For the non-null rows, both columns should have the same value. This is on CSV sample data (I will attach the files).
[jira] [Commented] (DRILL-7302) Bump Apache Avro from 1.8.2 to 1.9.0
[ https://issues.apache.org/jira/browse/DRILL-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870648#comment-16870648 ]

Aman Sinha commented on DRILL-7302:
-----------------------------------

[~Fokko] thanks for the contribution. For some reason I am unable to assign the JIRA to you. Please assign it to yourself.

> Bump Apache Avro from 1.8.2 to 1.9.0
> ------------------------------------
>
>                 Key: DRILL-7302
>                 URL: https://issues.apache.org/jira/browse/DRILL-7302
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Fokko Driesprong
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (DRILL-7297) Query hangs in planning stage when Error is thrown
[ https://issues.apache.org/jira/browse/DRILL-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated DRILL-7297:
------------------------------
    Reviewer: Bohdan Kazydub

> Query hangs in planning stage when Error is thrown
> --------------------------------------------------
>
>                 Key: DRILL-7297
>                 URL: https://issues.apache.org/jira/browse/DRILL-7297
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.16.0
>            Reporter: Volodymyr Vysotskyi
>            Assignee: Volodymyr Vysotskyi
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>
>
> The query hangs in the planning stage when an Error (not OOM or AssertionError) is thrown during query planning. After canceling the query, it will stay in the CANCELLATION_REQUESTED state.
> Such an error may be thrown due to a mistake in the code, including in a UDF. Since the user may provide custom UDFs, Drill should be able to handle such cases as well.
> Steps to reproduce this issue:
> 1. Create a UDF which throws an Error in either the {{eval()}} or {{setup()}} method (instructions on how to create a custom UDF may be found [here|https://drill.apache.org/docs/tutorial-develop-a-simple-function/]).
> 2. Register the custom UDF which throws the error (instructions are [here|https://drill.apache.org/docs/adding-custom-functions-to-drill-introduction/]).
> 3. Run a query with this UDF.
> After submitting the query, the following stack trace is printed:
> {noformat}
> Exception in thread "drill-executor-1" java.lang.Error
>     at org.apache.drill.contrib.function.FunctionExample.setup(FunctionExample.java:19)
>     at org.apache.drill.exec.expr.fn.interpreter.InterpreterEvaluator.evaluateFunction(InterpreterEvaluator.java:139)
>     at org.apache.drill.exec.expr.fn.interpreter.InterpreterEvaluator$EvalVisitor.visitFunctionHolderExpression(InterpreterEvaluator.java:355)
>     at org.apache.drill.exec.expr.fn.interpreter.InterpreterEvaluator$EvalVisitor.visitFunctionHolderExpression(InterpreterEvaluator.java:204)
>     at org.apache.drill.common.expression.FunctionHolderExpression.accept(FunctionHolderExpression.java:53)
>     at org.apache.drill.exec.expr.fn.interpreter.InterpreterEvaluator.evaluateConstantExpr(InterpreterEvaluator.java:70)
>     at org.apache.drill.exec.planner.logical.DrillConstExecutor.reduce(DrillConstExecutor.java:152)
>     at org.apache.calcite.rel.rules.ReduceExpressionsRule.reduceExpressionsInternal(ReduceExpressionsRule.java:620)
>     at org.apache.calcite.rel.rules.ReduceExpressionsRule.reduceExpressions(ReduceExpressionsRule.java:541)
>     at org.apache.calcite.rel.rules.ReduceExpressionsRule$ProjectReduceExpressionsRule.onMatch(ReduceExpressionsRule.java:288)
>     at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
>     at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:643)
>     at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:339)
>     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:430)
>     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:370)
>     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToRawDrel(DefaultSqlHandler.java:250)
>     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:319)
>     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan(DefaultSqlHandler.java:177)
>     at org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan(DrillSqlWorker.java:226)
>     at org.apache.drill.exec.planner.sql.DrillSqlWorker.convertPlan(DrillSqlWorker.java:124)
>     at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:90)
>     at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:593)
>     at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:276)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> 4. Check that the query is still in the in-progress state; cancel the query.
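The failure mode described above can be sketched as follows. This is a hypothetical simplification, not Drill's actual Foreman code: a planner task submitted to an executor only signals completion on the normal or Exception path, so an escaping Error leaves the caller waiting forever; catching Throwable in a finally-guarded task lets the query fail instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ErrorHandlingSketch {
  // Returns true if planning succeeded, false if it failed; never hangs,
  // because the latch is counted down in a finally block and the catch
  // covers Throwable (which includes Error), not just Exception.
  static boolean plan(Runnable planner) throws InterruptedException {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CountDownLatch done = new CountDownLatch(1);
    final boolean[] failed = {false};
    pool.execute(() -> {
      try {
        planner.run();
      } catch (Throwable t) {  // an Error from a UDF's setup()/eval() lands here
        failed[0] = true;
      } finally {
        done.countDown();      // always signal completion to the waiting caller
      }
    });
    boolean finished = done.await(5, TimeUnit.SECONDS);
    pool.shutdownNow();
    return finished && !failed[0];
  }

  public static void main(String[] args) throws InterruptedException {
    // A planner step whose UDF throws Error, as in the reproduction steps above.
    System.out.println(plan(() -> { throw new Error("bad UDF"); }));  // false, not a hang
  }
}
```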
[jira] [Updated] (DRILL-7242) Query with range predicate hits IOBE when accessing histogram buckets
[ https://issues.apache.org/jira/browse/DRILL-7242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated DRILL-7242:
------------------------------
      Labels: ready-to-commit  (was: )
    Reviewer: Gautam Parai

> Query with range predicate hits IOBE when accessing histogram buckets
> ---------------------------------------------------------------------
>
>                 Key: DRILL-7242
>                 URL: https://issues.apache.org/jira/browse/DRILL-7242
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.16.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>
>
> The following query hits an IOBE during histogram access (make sure to run the ANALYZE command before running this query):
> {noformat}
> select 1 from dfs.tmp.employee where store_id > 24;
>
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 11
>     at org.apache.drill.exec.planner.common.NumericEquiDepthHistogram.getSelectedRows(NumericEquiDepthHistogram.java:215) ~[drill-java-exec-1.16.0.0-mapr.jar:1.16.0.0-mapr]
>     at org.apache.drill.exec.planner.common.NumericEquiDepthHistogram.estimatedSelectivity(NumericEquiDepthHistogram.java:130) ~[drill-java-exec-1.16.0.0-mapr.jar:1.16.0.0-mapr]
>     at org.apache.drill.exec.planner.cost.DrillRelMdSelectivity.computeRangeSelectivity(DrillRelMd
> {noformat}
> Here, 24.0 is the end point of the last histogram bucket and the boundary condition is not being correctly handled.
[jira] [Created] (DRILL-7242) Query with range predicate hits IOBE when accessing histogram buckets
Aman Sinha created DRILL-7242:
---------------------------------

             Summary: Query with range predicate hits IOBE when accessing histogram buckets
                 Key: DRILL-7242
                 URL: https://issues.apache.org/jira/browse/DRILL-7242
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 1.16.0
            Reporter: Aman Sinha
            Assignee: Aman Sinha
             Fix For: 1.17.0
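The boundary condition behind the IOBE can be sketched as follows. An equi-depth histogram with N buckets stores N+1 end points, so a predicate value equal to the last end point must map to bucket N-1, not to index N. The class and method names here are illustrative, not Drill's NumericEquiDepthHistogram API.

```java
public class BucketLookup {
  // Maps a value to the index of the equi-depth bucket containing it.
  static int bucketOf(double[] endpoints, double value) {
    int numBuckets = endpoints.length - 1;
    for (int i = 0; i < numBuckets; i++) {
      if (value < endpoints[i + 1]) {
        return i;
      }
    }
    // value >= last end point: clamp to the last bucket instead of
    // indexing one past the end (the ArrayIndexOutOfBoundsException above).
    return numBuckets - 1;
  }

  public static void main(String[] args) {
    double[] endpoints = {0.0, 8.0, 16.0, 24.0};    // 3 buckets
    System.out.println(bucketOf(endpoints, 24.0));  // last end point lands in bucket 2, no IOBE
  }
}
```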
[jira] [Updated] (DRILL-7228) Histogram end points show high deviation for a sample data set
[ https://issues.apache.org/jira/browse/DRILL-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated DRILL-7228:
------------------------------
      Labels: ready-to-commit  (was: )
    Reviewer: Gautam Parai

> Histogram end points show high deviation for a sample data set
> --------------------------------------------------------------
>
>                 Key: DRILL-7228
>                 URL: https://issues.apache.org/jira/browse/DRILL-7228
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.16.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>
>
> There are a couple of scenarios where the histogram bucket end points show high deviation for the attached sample data set.
> +Scenario 1:+
> There are a total of 100 rows in the sample. Here are the first 10 values of the c_float column ordered by the column.
> {noformat}
> select c_float from `table_stats/alltypes_with_nulls` order by c_float;
> +---------------+
> | c_float       |
> +---------------+
> | -4.6873795E9  |
> | 8.1855632E7   |
> | 2.65311632E8  |
> | 4.50677952E8  |
> | 4.6864464E8   |
> | 5.7848493E8   |
> | 6.6793114E8   |
> | 7.1175571E8   |
> | 9.0065581E8   |
> | 9.2245773E8   |
> ...
> ...
> <100 rows>
> {noformat}
> Here the minimum value is a small negative number. Here's the output of the histogram after running the ANALYZE command:
> {noformat}
> "buckets" : [ 8.1855488E7, 9.13736816E8, 1.720863011198E9, 3.2401755232E9, 4.6546719328E9, 5.130497904E9, 5.9901393504E9, 6.779930992E9, 7.998626672E9, 8.69159614398E9, 9.983783792E9 ]
> {noformat}
> Note that the starting end point of bucket 0 is actually the 2nd value in the ordered list, and the small negative number is not represented in the histogram at all.
> +Scenario 2:+
> Histogram for the c_bigint column is as below:
> {noformat}
> {
>   "column" : "`c_bigint`",
>   "majortype" : {
>     "type" : "BIGINT",
>     "mode" : "OPTIONAL"
>   },
>   "schema" : 1.0,
>   "rowcount" : 100.0,
>   "nonnullrowcount" : 87.0,
>   "ndv" : 46,
>   "avgwidth" : 8.0,
>   "histogram" : {
>     "category" : "numeric-equi-depth",
>     "numRowsPerBucket" : 8,
>     "buckets" : [ -8.6390506354062131E18, -7.679478802017577E18, -5.8389791200382024E18, -2.9165328693138038E18, -1.77746633649836621E18, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 8.848383132345303E17, 4.6441480083157811E18 ]
>   }
> }
> {noformat}
> This indicates that there are duplicate rows with the value close to 2.83E11, which is not true when we analyze the source data.
> This is the output of the ntile function:
> {noformat}
> SELECT bucket_num,
>        min(c_bigint) as min_amount,
>        max(c_bigint) as max_amount,
>        count(*) as total_count
>   FROM (
>        SELECT c_bigint,
>               NTILE(10) OVER (ORDER BY c_bigint) as bucket_num
>          FROM `table_stats/alltypes_with_nulls`
>        )
> GROUP BY bucket_num
> ORDER BY bucket_num;
> +-------------+-----------------------+-----------------------+--------------+
> | bucket_num  | min_amount            | max_amount            | total_count  |
> +-------------+-----------------------+-----------------------+--------------+
> | 1           | -8804872880253829120  | -6983033704176156672  | 10           |
> | 2           | -6772904422084182016  | -5326061597989273600  | 10           |
> | 3           | -5111449881868763136  | -2561061038367703040  | 10           |
> | 4           | -2424523650070740992  | -449093763428515840   | 10           |
> | 5           | 0                     | 0                     | 10           |
> | 6           | 0                     | 0                     | 10           |
> | 7           | 0                     | 0                     | 10           |
> | 8           | 0                     | 884838034226544640    | 10           |
> | 9           | 884838034226544640    | 4644147690488201216   | 10           |
> | 10          | null                  | null                  | 10           |
> +-------------+-----------------------+-----------------------+--------------+
> {noformat}
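Scenario 1's complaint can be restated as an invariant on equi-depth bucket construction: the first end point must be the true minimum of the sorted sample, so every value (including a lone negative outlier) falls in some bucket. A minimal sketch over a sorted array follows; the quantile-style boundary formula is an illustrative assumption, not Drill's t-digest-based implementation.

```java
import java.util.Arrays;

public class EquiDepthBuckets {
  // Builds numBuckets + 1 end points over a sorted sample so that each
  // bucket covers roughly sorted.length / numBuckets rows.
  static double[] buildEndpoints(double[] sorted, int numBuckets) {
    double[] endpoints = new double[numBuckets + 1];
    endpoints[0] = sorted[0];                          // bucket 0 starts at the true minimum
    endpoints[numBuckets] = sorted[sorted.length - 1]; // last end point is the maximum
    for (int i = 1; i < numBuckets; i++) {
      endpoints[i] = sorted[i * sorted.length / numBuckets];
    }
    return endpoints;
  }

  public static void main(String[] args) {
    // The first 10 c_float values from Scenario 1 above.
    double[] data = {-4.6873795e9, 8.1855632e7, 2.65311632e8, 4.50677952e8, 4.6864464e8,
                     5.7848493e8, 6.6793114e8, 7.1175571e8, 9.0065581e8, 9.2245773e8};
    double[] endpoints = buildEndpoints(data, 5);
    // endpoints[0] is -4.6873795E9: the small negative minimum is represented,
    // unlike the reported histogram whose bucket 0 starts at the 2nd value.
    System.out.println(Arrays.toString(endpoints));
  }
}
```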
[jira] [Updated] (DRILL-7171) Count(*) query on leaf level directory is not reading summary cache file.
[ https://issues.apache.org/jira/browse/DRILL-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated DRILL-7171:
------------------------------
    Labels: ready-to-commit  (was: )

> Count(*) query on leaf level directory is not reading summary cache file.
> -------------------------------------------------------------------------
>
>                 Key: DRILL-7171
>                 URL: https://issues.apache.org/jira/browse/DRILL-7171
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Venkata Jyothsna Donapati
>            Assignee: Venkata Jyothsna Donapati
>            Priority: Minor
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Since the leaf-level directory doesn't store the metadata directories file, and while reading the summary it is assumed that a missing directories cache file means the cache is possibly corrupt, reading of the summary cache file is skipped. The metadata directories cache file should be created at the leaf level.
[jira] [Updated] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram
[ https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated DRILL-7187:
------------------------------
    Reviewer: Gautam Parai

> Improve selectivity estimates for range predicates when using histogram
> -----------------------------------------------------------------------
>
>                 Key: DRILL-7187
>                 URL: https://issues.apache.org/jira/browse/DRILL-7187
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>
>
> 2 types of selectivity estimation improvements need to be done:
> 1. For range predicates on the same column, we need to collect all such predicates into 1 group and do a histogram lookup for them together. For instance:
> {noformat}
> WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50
> {noformat}
> Currently, the Drill behavior is to treat each of the conjuncts independently and multiply the individual selectivities. However, that will not give accurate estimates. Here, we want to group the predicates on 'a' together and do a single lookup. Similarly for 'b'.
> 2. NULLs are not maintained by the histogram, but when doing the selectivity calculations, the histogram should use the totalRowCount as the denominator rather than the non-null count.
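The effect of grouping conjuncts per column, described in item 1 above, can be sketched with a toy uniform-distribution stand-in for the histogram. The column range [0, 100] and the uniform model are illustrative assumptions, not Drill's histogram code: treating "a > 10 AND a <= 50" as one interval gives 40% selectivity, while multiplying the independent estimates 0.9 * 0.5 gives 45%.

```java
public class RangeSelectivity {
  // Selectivity of the interval (lo, hi] on a column uniform over [min, max].
  static double intervalSelectivity(double min, double max, double lo, double hi) {
    double l = Math.max(min, lo);
    double h = Math.min(max, hi);
    return h <= l ? 0.0 : (h - l) / (max - min);
  }

  public static void main(String[] args) {
    double min = 0, max = 100;
    // Grouped: a > 10 AND a <= 50 becomes one lookup over the interval (10, 50].
    double grouped = intervalSelectivity(min, max, 10, 50);
    // Independent: sel(a > 10) * sel(a <= 50), as Drill currently computes it.
    double independent = intervalSelectivity(min, max, 10, max)
        * intervalSelectivity(min, max, min, 50);
    System.out.println(grouped + " vs " + independent);  // 0.4 vs 0.45
  }
}
```

With more conjuncts on the same column the multiplicative error compounds, which is why collecting them into a single interval before the histogram lookup matters.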
[jira] [Updated] (DRILL-7225) Merging of columnTypeInfo for file with different schema throws NullPointerException during refresh metadata
[ https://issues.apache.org/jira/browse/DRILL-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated DRILL-7225:
------------------------------
    Labels: ready-to-commit  (was: )

> Merging of columnTypeInfo for file with different schema throws NullPointerException during refresh metadata
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-7225
>                 URL: https://issues.apache.org/jira/browse/DRILL-7225
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Venkata Jyothsna Donapati
>            Assignee: Venkata Jyothsna Donapati
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>
>
> Merging of columnTypeInfo from two files with different schemas throws a NullPointerException. For example, suppose a directory Orders has two files:
> * orders.parquet (with columns order_id, order_name, order_date)
> * orders_with_address.parquet (with columns order_id, order_name, address)
> When refresh table metadata is triggered, metadata such as total_null_count for columns in both files is aggregated and updated in the ColumnTypeInfo. Initially, ColumnTypeInfo is initialized with the first file's ColumnTypeInfo (i.e., order_id, order_name, order_date). While aggregating, the existing ColumnTypeInfo is looked up for columns in the second file, and since some of them don't exist in the ColumnTypeInfo, an NPE is thrown. This can be fixed by initializing ColumnTypeInfo for columns that are not yet present.
[jira] [Updated] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram
[ https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7187: -- Labels: ready-to-commit (was: ) > Improve selectivity estimates for range predicates when using histogram > --- > > Key: DRILL-7187 > URL: https://issues.apache.org/jira/browse/DRILL-7187 > Project: Apache Drill > Issue Type: Bug >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Labels: ready-to-commit > Fix For: 1.17.0 > > > 2 types of selectivity estimation improvements need to be done: > 1. For range predicates on the same column, we need to collect all such > predicates in 1 group and do a histogram lookup for them together. > For instance: > {noformat} > WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50 > {noformat} > Currently, the Drill behavior is to treat each of the conjuncts > independently and multiply the individual selectivities. However, that will > not give the accurate estimates. Here, we want to group the predicates on 'a' > together and do a single lookup. Similarly for 'b'. > 2. NULLs are not maintained by the histogram but when doing the selectivity > calculations, the histogram should use the totalRowCount as the denominator > rather than the non-null count. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-7228) Histogram end points show high deviation for a sample data set
Aman Sinha created DRILL-7228: - Summary: Histogram end points show high deviation for a sample data set Key: DRILL-7228 URL: https://issues.apache.org/jira/browse/DRILL-7228 Project: Apache Drill Issue Type: Bug Affects Versions: 1.16.0 Reporter: Aman Sinha Assignee: Aman Sinha Fix For: 1.17.0 There are a couple of scenarios where the histogram bucket end points show high deviation for the attached sample data set. +Scenario 1:+ There are a total of 100 rows in the sample. Here are the first 10 values of the c_float column, ordered by that column. {noformat} select c_float from `table_stats/alltypes_with_nulls` order by c_float; +--+ | c_float| +--+ | -4.6873795E9 | | 8.1855632E7 | | 2.65311632E8 | | 4.50677952E8 | | 4.6864464E8 | | 5.7848493E8 | | 6.6793114E8 | | 7.1175571E8 | | 9.0065581E8 | | 9.2245773E8 | ... ... <100 rows> {noformat} Here the minimum value is a small negative number. Here is the histogram output after running the ANALYZE command: {noformat} "buckets" : [ 8.1855488E7, 9.13736816E8, 1.720863011198E9, 3.2401755232E9, 4.6546719328E9, 5.130497904E9, 5.9901393504E9, 6.779930992E9, 7.998626672E9, 8.69159614398E9, 9.983783792E9 ] {noformat} Note that the starting end point of bucket 0 is actually the 2nd value in the ordered list; the small negative minimum is not represented in the histogram at all.
+Scenario 2:+ Histogram for the c_bigint column is as below: {noformat} { "column" : "`c_bigint`", "majortype" : { "type" : "BIGINT", "mode" : "OPTIONAL" }, "schema" : 1.0, "rowcount" : 100.0, "nonnullrowcount" : 87.0, "ndv" : 46, "avgwidth" : 8.0, "histogram" : { "category" : "numeric-equi-depth", "numRowsPerBucket" : 8, "buckets" : [ -8.6390506354062131E18, -7.679478802017577E18, -5.8389791200382024E18, -2.9165328693138038E18, -1.77746633649836621E18, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 8.848383132345303E17, 4.6441480083157811E18 ] } } {noformat} This indicates that there are duplicate rows with the value close to 2.83 which is not true when we analyze the source data. This is the output of the ntile function: {noformat} SELECT bucket_num, min(c_bigint) as min_amount, max(c_bigint) as max_amount, count(*) as total_count FROM ( SELECT c_bigint, NTILE(10) OVER (ORDER BY c_bigint) as bucket_num FROM `table_stats/alltypes_with_nulls` ) GROUP BY bucket_num ORDER BY bucket_num; ++--+--+-+ | bucket_num | min_amount | max_amount | total_count | ++--+--+-+ | 1 | -8804872880253829120 | -6983033704176156672 | 10 | | 2 | -6772904422084182016 | -5326061597989273600 | 10 | | 3 | -5111449881868763136 | -2561061038367703040 | 10 | | 4 | -2424523650070740992 | -449093763428515840 | 10 | | 5 | 0| 0| 10 | | 6 | 0| 0| 10 | | 7 | 0| 0| 10 | | 8 | 0| 884838034226544640 | 10 | | 9 | 884838034226544640 | 4644147690488201216 | 10 | | 10 | null | null | 10 | ++--+--+-+ {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
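Both scenarios above amount to the same expectation: for an equi-depth histogram over the sorted non-null values, bucket 0 must start at the true minimum, and boundaries should track the sorted data the way the NTILE query does. A minimal sketch of exact equi-depth boundaries (illustrative only; Drill's histogram is built from a t-digest approximation, which is where the deviation arises):

```python
# Exact equi-depth bucket boundaries over sorted non-null values.
# num_buckets buckets need num_buckets + 1 boundary points; boundary 0 is
# the true minimum -- the value missing from the histogram in Scenario 1.
def equi_depth_buckets(values, num_buckets):
    vals = sorted(v for v in values if v is not None)  # NULLs excluded
    n = len(vals)
    return [vals[min(round(i * n / num_buckets), n - 1)]
            for i in range(num_buckets + 1)]
```

Comparing such exact boundaries against the t-digest output (as the NTILE query does above) makes the deviation at the end points easy to quantify.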
[jira] [Updated] (DRILL-7225) Merging of columnTypeInfo for file with different schema throws NullPointerException during refresh metadata
[ https://issues.apache.org/jira/browse/DRILL-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7225: -- Reviewer: Aman Sinha Fix Version/s: 1.17.0 Technically this is a regression from 1.15 but since Case 1 is the most common case and that is working, I am marking the fix version as 1.17. > Merging of columnTypeInfo for file with different schema throws > NullPointerException during refresh metadata > > > Key: DRILL-7225 > URL: https://issues.apache.org/jira/browse/DRILL-7225 > Project: Apache Drill > Issue Type: Bug >Reporter: Venkata Jyothsna Donapati >Assignee: Venkata Jyothsna Donapati >Priority: Major > Fix For: 1.17.0 > > > Merging of columnTypeInfo from two files with different schemas throws > nullpointerexception. For example if a directory Orders has two files: > * orders.parquet (with columns order_id, order_name, order_date) > * orders_with_address.parquet (with columns order_id, order_name, address) > When refresh table metadata is triggered, metadata such as total_null_count > for columns in both the files is aggregated and updated in the > ColumnTypeInfo. Initially ColumnTypeInfo is initialized with the first file's > ColumnTypeInfo (i.e., order_id, order_name, order_date). While aggregating, > the existing ColumnTypeInfo is looked up for columns in the second file and > since some of them don't exist in the ColumnTypeInfo, a npe is thrown. This > can be fixed by initializing ColumnTypeInfo for columns that are not yet > present. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7225) Merging of columnTypeInfo for file with different schema throws NullPointerException during refresh metadata
[ https://issues.apache.org/jira/browse/DRILL-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829814#comment-16829814 ] Aman Sinha commented on DRILL-7225: --- [~vdonapati], let's clarify this further: {noformat} Case 1: /a/b contains files with Schema A /a/c contains files with Schema B Case 2: /a/b contains files with Schema A and Schema B {noformat} Does the existing code work correctly for Case 1 and fail for Case 2? Case 1 is the more common scenario; Case 2 is much less common. > Merging of columnTypeInfo for file with different schema throws > NullPointerException during refresh metadata > > > Key: DRILL-7225 > URL: https://issues.apache.org/jira/browse/DRILL-7225 > Project: Apache Drill > Issue Type: Bug >Reporter: Venkata Jyothsna Donapati >Assignee: Venkata Jyothsna Donapati >Priority: Major > > Merging of columnTypeInfo from two files with different schemas throws > nullpointerexception. For example if a directory Orders has two files: > * orders.parquet (with columns order_id, order_name, order_date) > * orders_with_address.parquet (with columns order_id, order_name, address) > When refresh table metadata is triggered, metadata such as total_null_count > for columns in both the files is aggregated and updated in the > ColumnTypeInfo. Initially ColumnTypeInfo is initialized with the first file's > ColumnTypeInfo (i.e., order_id, order_name, order_date). While aggregating, > the existing ColumnTypeInfo is looked up for columns in the second file and > since some of them don't exist in the ColumnTypeInfo, a npe is thrown. This > can be fixed by initializing ColumnTypeInfo for columns that are not yet > present. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-7223) Make the timeout in TimedCallable a configurable boot time parameter
Aman Sinha created DRILL-7223: - Summary: Make the timeout in TimedCallable a configurable boot time parameter Key: DRILL-7223 URL: https://issues.apache.org/jira/browse/DRILL-7223 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.16.0 Reporter: Aman Sinha Fix For: 1.17.0 The [TimedCallable.TIMEOUT_PER_RUNNABLE_IN_MSECS|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/TimedCallable.java#L52] is currently an internal Drill constant defined as 15 secs, and it has been unchanged since it was first introduced. Drill's TimedCallable implements Java concurrency's Callable interface to create timed threads. It is used by the REFRESH METADATA command, which creates multiple threads on the Foreman node to gather Parquet metadata for building the metadata cache. Depending on system load, or with a very large number of Parquet files (millions), it is possible to exceed this timeout. While the exact root cause of exceeding the timeout is being investigated, it makes sense to make this timeout a configurable parameter to aid with large-scale testing. This JIRA is to make it a configurable bootstrapping option in the drill-override file.
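The shape of the proposed change can be sketched as below. Note this is a hedged illustration: the option key `drill.exec.store.timed_callable.timeout_ms` is hypothetical, invented for the example, and is not necessarily the name the actual fix introduced.

```python
# Sketch: replace a hard-coded timeout constant with a boot-config lookup
# that falls back to the old default. The config key below is hypothetical.
DEFAULT_TIMEOUT_MS = 15_000  # the current hard-coded 15-second constant

def timeout_per_runnable_ms(boot_config):
    """Return the per-runnable timeout, honoring a boot-time override."""
    return int(boot_config.get(
        "drill.exec.store.timed_callable.timeout_ms",  # hypothetical key
        DEFAULT_TIMEOUT_MS))
```

With no override the behavior is unchanged; large-scale REFRESH METADATA runs can raise the limit via the boot config without a rebuild.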
[jira] [Created] (DRILL-7198) Issuing a control-C in Sqlline exits the session (it does cancel the query)
Aman Sinha created DRILL-7198: - Summary: Issuing a control-C in Sqlline exits the session (it does cancel the query) Key: DRILL-7198 URL: https://issues.apache.org/jira/browse/DRILL-7198 Project: Apache Drill Issue Type: Bug Affects Versions: 1.15.0, 1.16.0 Reporter: Aman Sinha This behavior is observed both in Drill 1.15.0 and the RC1 of 1.16.0. Run a long-running query in sqlline and cancel it using control-C: the sqlline session exits, although the query is cancelled. The behavior is seen in both embedded mode and distributed mode. If the query is submitted through sqlline and cancelled from the Web UI, it behaves correctly: the session does not get killed and subsequent queries can be submitted in the same sqlline session. The same query in Drill 1.14.0 works correctly and returns the column headers while canceling the query. Since the query can be cancelled just fine through the Web UI, I am not considering this a blocker for 1.16. Very likely the sqlline upgrade in 1.15.0 changed the behavior.
[jira] [Commented] (DRILL-7195) Query returns incorrect result or does not fail when cast with is null is used in filter condition
[ https://issues.apache.org/jira/browse/DRILL-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824243#comment-16824243 ] Aman Sinha commented on DRILL-7195: --- [~vvysotskyi] regarding this issue, it is worth debating whether or not it really is a blocker. The first scenario, where no error is shown, does not seem like a blocker. The second case, with a wrong result, occurs only when cast_empty_string_to_null is set to true, which is not the default setting. Looking at the history of IS NULL simplification in Calcite, it is a somewhat complex issue, and I feel that any fix we do could potentially introduce other side effects which would prolong the release. Can we document this for 1.16 and fix it in master? > Query returns incorrect result or does not fail when cast with is null is > used in filter condition > -- > > Key: DRILL-7195 > URL: https://issues.apache.org/jira/browse/DRILL-7195 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Blocker > Fix For: 1.16.0 > > > 1. For the case when a query contains filter with a {{cast}} which cannot be > done with {{is null}}, the query does not fail: > {code:sql} > select * from dfs.tmp.`a.json` as t where cast(t.a as integer) is null; > +---+ > | a | > +---+ > +---+ > No rows selected (0.142 seconds) > {code} > where > {noformat} > cat /tmp/a.json > {"a":"aaa"} > {noformat} > But for the case when this condition is specified in project, query, as it is > expected, fails: > {code:sql} > select cast(t.a as integer) is null from dfs.tmp.`a.json` t; > Error: SYSTEM ERROR: NumberFormatException: aaa > Fragment 0:0 > Please, refer to logs for more information. 
> [Error Id: ed3982ce-a12f-4d63-bc6e-cafddf28cc24 on user515050-pc:31010] > (state=,code=0) > {code} > This is a regression, for Drill 1.15 the first and the second queries are > failed: > {code:sql} > select * from dfs.tmp.`a.json` as t where cast(t.a as integer) is null; > Error: SYSTEM ERROR: NumberFormatException: aaa > Fragment 0:0 > Please, refer to logs for more information. > [Error Id: 2f878f15-ddaa-48cd-9dfb-45c04db39048 on user515050-pc:31010] > (state=,code=0) > {code} > 2. For the case when {{drill.exec.functions.cast_empty_string_to_null}} is > enabled, this issue will cause wrong results: > {code:sql} > alter system set `drill.exec.functions.cast_empty_string_to_null`=true; > select * from dfs.tmp.`a1.json` t where cast(t.a as integer) is null; > +---+ > | a | > +---+ > +---+ > No rows selected (1.759 seconds) > {code} > where > {noformat} > cat /tmp/a1.json > {"a":"1"} > {"a":""} > {noformat} > Result for Drill 1.15.0: > {code:sql} > select * from dfs.tmp.`a1.json` t where cast(t.a as integer) is null; > ++ > | a | > ++ > || > ++ > 1 row selected (1.724 seconds) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram
[ https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7187: -- Fix Version/s: 1.17.0 > Improve selectivity estimates for range predicates when using histogram > --- > > Key: DRILL-7187 > URL: https://issues.apache.org/jira/browse/DRILL-7187 > Project: Apache Drill > Issue Type: Bug >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.17.0 > > > 2 types of selectivity estimation improvements need to be done: > 1. For range predicates on the same column, we need to collect all such > predicates in 1 group and do a histogram lookup for them together. > For instance: > {noformat} > WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50 > {noformat} > Currently, the Drill behavior is to treat each of the conjuncts > independently and multiply the individual selectivities. However, that will > not give the accurate estimates. Here, we want to group the predicates on 'a' > together and do a single lookup. Similarly for 'b'. > 2. NULLs are not maintained by the histogram but when doing the selectivity > calculations, the histogram should use the totalRowCount as the denominator > rather than the non-null count. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram
Aman Sinha created DRILL-7187: - Summary: Improve selectivity estimates for range predicates when using histogram Key: DRILL-7187 URL: https://issues.apache.org/jira/browse/DRILL-7187 Project: Apache Drill Issue Type: Bug Reporter: Aman Sinha Assignee: Aman Sinha Two types of selectivity estimation improvements need to be done: 1. For range predicates on the same column, we need to collect all such predicates into one group and do a single histogram lookup for them together. For instance: {noformat} WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50 {noformat} Currently, Drill treats each of the conjuncts independently and multiplies the individual selectivities. However, that will not give accurate estimates. Here, we want to group the predicates on 'a' together and do a single lookup; similarly for 'b'. 2. NULLs are not maintained by the histogram, but when doing the selectivity calculations the histogram should use totalRowCount as the denominator rather than the non-null count. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
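The grouping idea in point 1 above can be sketched as follows. This is illustrative only, not Drill's planner code: `combined_selectivity` and the `histogram_lookup` callback are names invented for the example. With `a > 10 AND a <= 50`, the naive approach multiplies two independent selectivities, while the grouped approach does one histogram lookup over the interval (10, 50].

```python
from collections import defaultdict

# Sketch: group conjuncts by column and do one histogram lookup per column,
# instead of multiplying an independent selectivity for every conjunct.
def combined_selectivity(predicates, histogram_lookup):
    """predicates: iterable of (column, op, value) conjuncts.
    histogram_lookup(column, conjuncts) -> selectivity in [0, 1] for the
    conjunction of all conditions on that column (illustrative callback)."""
    by_column = defaultdict(list)
    for column, op, value in predicates:
        by_column[column].append((op, value))
    selectivity = 1.0
    for column, conjuncts in by_column.items():
        selectivity *= histogram_lookup(column, conjuncts)  # one lookup/column
    return selectivity
```

Per point 2, each per-column lookup would also divide matching-row counts by totalRowCount (not the non-null count), so that rows NULL in that column correctly reduce the estimate.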
[jira] [Resolved] (DRILL-3929) Support the ability to query database tables using external indices
[ https://issues.apache.org/jira/browse/DRILL-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha resolved DRILL-3929. --- Resolution: Fixed Fix Version/s: 1.15.0 This feature was done in the scope of DRILL-6381. > Support the ability to query database tables using external indices > -- > > Key: DRILL-3929 > URL: https://issues.apache.org/jira/browse/DRILL-3929 > Project: Apache Drill > Issue Type: New Feature > Components: Execution - Relational Operators, Query Planning > Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.15.0 > > > This is a placeholder for adding support in Drill to query database tables > using external indices. I will add more details about the use case and a > preliminary design proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7063) Create separate summary file for schema, totalRowCount, totalNullCount (includes maintenance)
[ https://issues.apache.org/jira/browse/DRILL-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816410#comment-16816410 ] Aman Sinha commented on DRILL-7063: --- Thanks [~vvysotskyi] . I will work with [~vdonapati] to add the full description. Although the details are in the design doc, we need a simplified version for the documentation. > Create separate summary file for schema, totalRowCount, totalNullCount > (includes maintenance) > - > > Key: DRILL-7063 > URL: https://issues.apache.org/jira/browse/DRILL-7063 > Project: Apache Drill > Issue Type: Sub-task > Components: Metadata >Reporter: Venkata Jyothsna Donapati >Assignee: Venkata Jyothsna Donapati >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.16.0 > > Original Estimate: 252h > Remaining Estimate: 252h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7166) Count(*) queries with wildcards in table name are reading metadata cache and returning wrong results
[ https://issues.apache.org/jira/browse/DRILL-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7166: -- Labels: ready-to-commit (was: ) > Count(*) queries with wildcards in table name are reading metadata cache and > returning wrong results > > > Key: DRILL-7166 > URL: https://issues.apache.org/jira/browse/DRILL-7166 > Project: Apache Drill > Issue Type: Bug > Components: Metadata >Affects Versions: 1.16.0 >Reporter: Abhishek Girish >Assignee: Venkata Jyothsna Donapati >Priority: Blocker > Labels: ready-to-commit > Fix For: 1.16.0 > > > Tests: > {code} > Functional/metadata_caching/data/drill4376_1.q > Functional/metadata_caching/data/drill4376_2.q > Functional/metadata_caching/data/drill4376_3.q > Functional/metadata_caching/data/drill4376_4.q > Functional/metadata_caching/data/drill4376_5.q > Functional/metadata_caching/data/drill4376_6.q > Functional/metadata_caching/data/drill4376_8.q > {code} > Example pattern of queries: > {code} > select count(*) from `lineitem_hierarchical_intint/*8*/3*`; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7166) Count(*) queries with wildcards in table name are reading metadata cache and returning wrong results
[ https://issues.apache.org/jira/browse/DRILL-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7166: -- Reviewer: Aman Sinha > Count(*) queries with wildcards in table name are reading metadata cache and > returning wrong results > > > Key: DRILL-7166 > URL: https://issues.apache.org/jira/browse/DRILL-7166 > Project: Apache Drill > Issue Type: Bug > Components: Metadata >Affects Versions: 1.16.0 >Reporter: Abhishek Girish >Assignee: Venkata Jyothsna Donapati >Priority: Blocker > Fix For: 1.16.0 > > > Tests: > {code} > Functional/metadata_caching/data/drill4376_1.q > Functional/metadata_caching/data/drill4376_2.q > Functional/metadata_caching/data/drill4376_3.q > Functional/metadata_caching/data/drill4376_4.q > Functional/metadata_caching/data/drill4376_5.q > Functional/metadata_caching/data/drill4376_6.q > Functional/metadata_caching/data/drill4376_8.q > {code} > Example pattern of queries: > {code} > select count(*) from `lineitem_hierarchical_intint/*8*/3*`; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6992) Support column histogram statistics
[ https://issues.apache.org/jira/browse/DRILL-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812899#comment-16812899 ] Aman Sinha commented on DRILL-6992: --- Histogram creation and usage for the following data types is supported as of commit # 849f896: INT, BIGINT, FLOAT, DOUBLE, TIME, DATE, TIMESTAMP, BOOLEAN. Marking this umbrella JIRA as fixed. For other data types, in particular VARCHAR, VARBINARY, I will open enhancement JIRA separately for a future release. > Support column histogram statistics > --- > > Key: DRILL-6992 > URL: https://issues.apache.org/jira/browse/DRILL-6992 > Project: Apache Drill > Issue Type: New Feature > Components: Query Planning Optimization >Affects Versions: 1.15.0 >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > As a follow-up to > [DRILL-1328|https://issues.apache.org/jira/browse/DRILL-1328] which is adding > NDV (num distinct values) support and creating the framework for statistics, > we also need Histograms. These are needed for range predicates selectivity > estimation as well as equality predicates when there is non-uniform > distribution of data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7063) Create separate summary file for schema, totalRowCount, totalNullCount (includes maintenance)
[ https://issues.apache.org/jira/browse/DRILL-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7063: -- Labels: ready-to-commit (was: ) > Create separate summary file for schema, totalRowCount, totalNullCount > (includes maintenance) > - > > Key: DRILL-7063 > URL: https://issues.apache.org/jira/browse/DRILL-7063 > Project: Apache Drill > Issue Type: Sub-task > Components: Metadata >Reporter: Venkata Jyothsna Donapati >Assignee: Venkata Jyothsna Donapati >Priority: Major > Labels: ready-to-commit > Fix For: 1.16.0 > > Original Estimate: 252h > Remaining Estimate: 252h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7119) Modify selectivity calculations to use histograms for supported data types
[ https://issues.apache.org/jira/browse/DRILL-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7119: -- Labels: ready-to-commit (was: ) > Modify selectivity calculations to use histograms for supported data types > -- > > Key: DRILL-7119 > URL: https://issues.apache.org/jira/browse/DRILL-7119 > Project: Apache Drill > Issue Type: Sub-task > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Labels: ready-to-commit > Fix For: 1.16.0 > > > (Please see parent JIRA for the design document) > Once the t-digest based histogram is created, we need to read it back and > modify the selectivity calculations such that they use the histogram buckets > for range conditions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7119) Modify selectivity calculations to use histograms for supported data types
[ https://issues.apache.org/jira/browse/DRILL-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7119: -- Summary: Modify selectivity calculations to use histograms for supported data types (was: Modify selectivity calculations to use histograms) > Modify selectivity calculations to use histograms for supported data types > -- > > Key: DRILL-7119 > URL: https://issues.apache.org/jira/browse/DRILL-7119 > Project: Apache Drill > Issue Type: Sub-task > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > (Please see parent JIRA for the design document) > Once the t-digest based histogram is created, we need to read it back and > modify the selectivity calculations such that they use the histogram buckets > for range conditions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-3846) Metadata Caching : A count(*) query took more time with the cache in place
[ https://issues.apache.org/jira/browse/DRILL-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812807#comment-16812807 ] Aman Sinha commented on DRILL-3846: --- Let's try this after DRILL-7064 is fixed. > Metadata Caching : A count(*) query took more time with the cache in place > -- > > Key: DRILL-3846 > URL: https://issues.apache.org/jira/browse/DRILL-3846 > Project: Apache Drill > Issue Type: Bug > Components: Metadata >Reporter: Rahul Challapalli >Assignee: Aman Sinha >Priority: Critical > Fix For: 1.16.0 > > > git.commit.id.abbrev=3c89b30 > I have a folder with 10k complex files. The generated cache file is around > 486 MB. The below numbers indicate that we regressed in terms of performance > when we generated the metadata cache > {code} > 0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from > `complex_sparse_5files`; > +--+ > | EXPR$0 | > +--+ > | 100 | > +--+ > 1 row selected (30.835 seconds) > 0: jdbc:drill:zk=10.10.100.190:5181> refresh table metadata > `complex_sparse_5files`; > +---+-+ > | ok | summary > | > +---+-+ > | true | Successfully updated metadata for table complex_sparse_5files. > | > +---+-+ > 1 row selected (10.69 seconds) > 0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from > `complex_sparse_5files`; > +--+ > | EXPR$0 | > +--+ > | 100 | > +--+ > 1 row selected (47.614 seconds) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (DRILL-3846) Metadata Caching : A count(*) query took more time with the cache in place
[ https://issues.apache.org/jira/browse/DRILL-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha reassigned DRILL-3846: - Assignee: Aman Sinha (was: Venkata Jyothsna Donapati) > Metadata Caching : A count(*) query took more time with the cache in place > -- > > Key: DRILL-3846 > URL: https://issues.apache.org/jira/browse/DRILL-3846 > Project: Apache Drill > Issue Type: Bug > Components: Metadata >Reporter: Rahul Challapalli >Assignee: Aman Sinha >Priority: Critical > Fix For: 1.16.0 > > > git.commit.id.abbrev=3c89b30 > I have a folder with 10k complex files. The generated cache file is around > 486 MB. The below numbers indicate that we regressed in terms of performance > when we generated the metadata cache > {code} > 0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from > `complex_sparse_5files`; > +--+ > | EXPR$0 | > +--+ > | 100 | > +--+ > 1 row selected (30.835 seconds) > 0: jdbc:drill:zk=10.10.100.190:5181> refresh table metadata > `complex_sparse_5files`; > +---+-+ > | ok | summary > | > +---+-+ > | true | Successfully updated metadata for table complex_sparse_5files. > | > +---+-+ > 1 row selected (10.69 seconds) > 0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from > `complex_sparse_5files`; > +--+ > | EXPR$0 | > +--+ > | 100 | > +--+ > 1 row selected (47.614 seconds) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7114) ANALYZE command generates warnings for stats file and materialization
[ https://issues.apache.org/jira/browse/DRILL-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811076#comment-16811076 ] Aman Sinha commented on DRILL-7114: --- [~gparai] for tracking, pls provide the commit id for the fix since this was fixed as part of another PR. > ANALYZE command generates warnings for stats file and materialization > - > > Key: DRILL-7114 > URL: https://issues.apache.org/jira/browse/DRILL-7114 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Gautam Parai >Priority: Minor > Fix For: 1.16.0 > > > When I run ANALYZE, I see warnings in the log file as shown below. The > ANALYZE command should not try to read the stats file or materialize the > stats. > {noformat} > 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN > o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file. > java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does > not exist > 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN > o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. > Continuing without stats. > java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does > not exist > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7119) Modify selectivity calculations to use histograms
[ https://issues.apache.org/jira/browse/DRILL-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7119: -- Reviewer: Gautam Parai > Modify selectivity calculations to use histograms > - > > Key: DRILL-7119 > URL: https://issues.apache.org/jira/browse/DRILL-7119 > Project: Apache Drill > Issue Type: Sub-task > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > (Please see parent JIRA for the design document) > Once the t-digest based histogram is created, we need to read it back and > modify the selectivity calculations such that they use the histogram buckets > for range conditions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (DRILL-7152) Histogram creation throws exception for all nulls column
[ https://issues.apache.org/jira/browse/DRILL-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha resolved DRILL-7152. --- Resolution: Fixed Fixed in 54384a9. > Histogram creation throws exception for all nulls column > > > Key: DRILL-7152 > URL: https://issues.apache.org/jira/browse/DRILL-7152 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > ANALYZE command fails when creating the histogram for a table with 1 column > with all NULLs. > Analyze table `table_stats/parquet_col_nulls` compute statistics; > {noformat} > Error: SYSTEM ERROR: NullPointerException > (org.apache.drill.common.exceptions.DrillRuntimeException) Failed to get > TDigest output > > org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputRecordValues():1085 > > org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputToBatchPrev():492 > org.apache.drill.exec.test.generated.StreamingAggregatorGen32.doWork():224 > > org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():288 > org.apache.drill.exec.record.AbstractRecordBatch.next():186 > org.apache.drill.exec.record.AbstractRecordBatch.next():126 > org.apache.drill.exec.record.AbstractRecordBatch.next():116 > > org.apache.drill.exec.physical.impl.statistics.StatisticsMergeBatch.innerNext():358 > org.apache.drill.exec.record.AbstractRecordBatch.next():186 > org.apache.drill.exec.record.AbstractRecordBatch.next():126 > org.apache.drill.exec.record.AbstractRecordBatch.next():116 > > org.apache.drill.exec.physical.impl.unpivot.UnpivotMapsRecordBatch.innerNext():106 > org.apache.drill.exec.record.AbstractRecordBatch.next():186 > org.apache.drill.exec.record.AbstractRecordBatch.next():126 > org.apache.drill.exec.record.AbstractRecordBatch.next():116 > > org.apache.drill.exec.physical.impl.StatisticsWriterRecordBatch.innerNext():96 > 
org.apache.drill.exec.record.AbstractRecordBatch.next():186 > org.apache.drill.exec.record.AbstractRecordBatch.next():126 > org.apache.drill.exec.record.AbstractRecordBatch.next():116 > org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63 > > org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141 > org.apache.drill.exec.record.AbstractRecordBatch.next():186 > org.apache.drill.exec.physical.impl.BaseRootExec.next():104 > > org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():83 > org.apache.drill.exec.physical.impl.BaseRootExec.next():94 > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296 > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283 > java.security.AccessController.doPrivileged():-2 > javax.security.auth.Subject.doAs():422 > org.apache.hadoop.security.UserGroupInformation.doAs():1669 > org.apache.drill.exec.work.fragment.FragmentExecutor.run():283 > org.apache.drill.common.SelfCleaningRunnable.run():38 > java.util.concurrent.ThreadPoolExecutor.runWorker():1149 > java.util.concurrent.ThreadPoolExecutor$Worker.run():624 > java.lang.Thread.run():748 > {noformat} > This table has 1 column with all NULL values: > {noformat} > apache drill (dfs.drilltestdir)> select * from > `table_stats/parquet_col_nulls` limit 20; > +--+--+ > | col1 | col2 | > +--+--+ > | 0| null | > | 1| null | > | 2| null | > | 3| null | > | 4| null | > | 5| null | > | 6| null | > | 7| null | > | 8| null | > | 9| null | > | 10 | null | > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7152) Histogram creation throws exception for all nulls column
[ https://issues.apache.org/jira/browse/DRILL-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7152: -- Reviewer: Gautam Parai > Histogram creation throws exception for all nulls column > > > Key: DRILL-7152 > URL: https://issues.apache.org/jira/browse/DRILL-7152 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > ANALYZE command fails when creating the histogram for a table with 1 column > with all NULLs. > Analyze table `table_stats/parquet_col_nulls` compute statistics; > {noformat} > Error: SYSTEM ERROR: NullPointerException > (org.apache.drill.common.exceptions.DrillRuntimeException) Failed to get > TDigest output > > org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputRecordValues():1085 > > org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputToBatchPrev():492 > org.apache.drill.exec.test.generated.StreamingAggregatorGen32.doWork():224 > > org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():288 > org.apache.drill.exec.record.AbstractRecordBatch.next():186 > org.apache.drill.exec.record.AbstractRecordBatch.next():126 > org.apache.drill.exec.record.AbstractRecordBatch.next():116 > > org.apache.drill.exec.physical.impl.statistics.StatisticsMergeBatch.innerNext():358 > org.apache.drill.exec.record.AbstractRecordBatch.next():186 > org.apache.drill.exec.record.AbstractRecordBatch.next():126 > org.apache.drill.exec.record.AbstractRecordBatch.next():116 > > org.apache.drill.exec.physical.impl.unpivot.UnpivotMapsRecordBatch.innerNext():106 > org.apache.drill.exec.record.AbstractRecordBatch.next():186 > org.apache.drill.exec.record.AbstractRecordBatch.next():126 > org.apache.drill.exec.record.AbstractRecordBatch.next():116 > > org.apache.drill.exec.physical.impl.StatisticsWriterRecordBatch.innerNext():96 > 
org.apache.drill.exec.record.AbstractRecordBatch.next():186 > org.apache.drill.exec.record.AbstractRecordBatch.next():126 > org.apache.drill.exec.record.AbstractRecordBatch.next():116 > org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63 > > org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141 > org.apache.drill.exec.record.AbstractRecordBatch.next():186 > org.apache.drill.exec.physical.impl.BaseRootExec.next():104 > > org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():83 > org.apache.drill.exec.physical.impl.BaseRootExec.next():94 > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296 > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283 > java.security.AccessController.doPrivileged():-2 > javax.security.auth.Subject.doAs():422 > org.apache.hadoop.security.UserGroupInformation.doAs():1669 > org.apache.drill.exec.work.fragment.FragmentExecutor.run():283 > org.apache.drill.common.SelfCleaningRunnable.run():38 > java.util.concurrent.ThreadPoolExecutor.runWorker():1149 > java.util.concurrent.ThreadPoolExecutor$Worker.run():624 > java.lang.Thread.run():748 > {noformat} > This table has 1 column with all NULL values: > {noformat} > apache drill (dfs.drilltestdir)> select * from > `table_stats/parquet_col_nulls` limit 20; > +--+--+ > | col1 | col2 | > +--+--+ > | 0| null | > | 1| null | > | 2| null | > | 3| null | > | 4| null | > | 5| null | > | 6| null | > | 7| null | > | 8| null | > | 9| null | > | 10 | null | > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-7152) Histogram creation throws exception for all nulls column
Aman Sinha created DRILL-7152: - Summary: Histogram creation throws exception for all nulls column Key: DRILL-7152 URL: https://issues.apache.org/jira/browse/DRILL-7152 Project: Apache Drill Issue Type: Bug Components: Query Planning Optimization Reporter: Aman Sinha Assignee: Aman Sinha Fix For: 1.16.0 ANALYZE command fails when creating the histogram for a table with 1 column with all NULLs. Analyze table `table_stats/parquet_col_nulls` compute statistics; {noformat} Error: SYSTEM ERROR: NullPointerException (org.apache.drill.common.exceptions.DrillRuntimeException) Failed to get TDigest output org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputRecordValues():1085 org.apache.drill.exec.test.generated.StreamingAggregatorGen32.outputToBatchPrev():492 org.apache.drill.exec.test.generated.StreamingAggregatorGen32.doWork():224 org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():288 org.apache.drill.exec.record.AbstractRecordBatch.next():186 org.apache.drill.exec.record.AbstractRecordBatch.next():126 org.apache.drill.exec.record.AbstractRecordBatch.next():116 org.apache.drill.exec.physical.impl.statistics.StatisticsMergeBatch.innerNext():358 org.apache.drill.exec.record.AbstractRecordBatch.next():186 org.apache.drill.exec.record.AbstractRecordBatch.next():126 org.apache.drill.exec.record.AbstractRecordBatch.next():116 org.apache.drill.exec.physical.impl.unpivot.UnpivotMapsRecordBatch.innerNext():106 org.apache.drill.exec.record.AbstractRecordBatch.next():186 org.apache.drill.exec.record.AbstractRecordBatch.next():126 org.apache.drill.exec.record.AbstractRecordBatch.next():116 org.apache.drill.exec.physical.impl.StatisticsWriterRecordBatch.innerNext():96 org.apache.drill.exec.record.AbstractRecordBatch.next():186 org.apache.drill.exec.record.AbstractRecordBatch.next():126 org.apache.drill.exec.record.AbstractRecordBatch.next():116 org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141 org.apache.drill.exec.record.AbstractRecordBatch.next():186 org.apache.drill.exec.physical.impl.BaseRootExec.next():104 org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():83 org.apache.drill.exec.physical.impl.BaseRootExec.next():94 org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296 org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283 java.security.AccessController.doPrivileged():-2 javax.security.auth.Subject.doAs():422 org.apache.hadoop.security.UserGroupInformation.doAs():1669 org.apache.drill.exec.work.fragment.FragmentExecutor.run():283 org.apache.drill.common.SelfCleaningRunnable.run():38 java.util.concurrent.ThreadPoolExecutor.runWorker():1149 java.util.concurrent.ThreadPoolExecutor$Worker.run():624 java.lang.Thread.run():748 {noformat} This table has 1 column with all NULL values: {noformat} apache drill (dfs.drilltestdir)> select * from `table_stats/parquet_col_nulls` limit 20; +--+--+ | col1 | col2 | +--+--+ | 0| null | | 1| null | | 2| null | | 3| null | | 4| null | | 5| null | | 6| null | | 7| null | | 8| null | | 9| null | | 10 | null | {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
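For illustration, here is a minimal Python sketch (not Drill's actual Java fix) of the guard this bug calls for: a statistics pass over a column should detect the all-NULL case and emit no histogram, rather than raise the way the NullPointerException in the stack trace above does.

```python
# Sketch only (assumed behavior, not Drill's StreamingAggregator code):
# tolerate an all-NULL column when building bucket boundaries.

def histogram_or_none(values, buckets=10):
    """Return approximate equi-depth bucket boundaries, or None when every
    value is NULL (represented here as Python None)."""
    non_null = sorted(v for v in values if v is not None)
    if not non_null:
        # all-NULL column: no histogram is possible, but no exception either
        return None
    n = len(non_null)
    # buckets + 1 boundary points: the min, then each bucket's upper edge
    return [non_null[min(n - 1, (i * n) // buckets)] for i in range(buckets + 1)]
```

An all-NULL input such as the `col2` column shown below simply yields `None`, which the caller can record as "no histogram" in the statistics output.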
[jira] [Updated] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal) and date/time/timestamp
[ https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7117: -- Labels: doc-impacting (was: ) > Support creation of histograms for numeric data types (except Decimal) and > date/time/timestamp > -- > > Key: DRILL-7117 > URL: https://issues.apache.org/jira/browse/DRILL-7117 > Project: Apache Drill > Issue Type: Sub-task > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Labels: doc-impacting > Fix For: 1.16.0 > > > This JIRA is specific to creating histograms for numeric data types: INT, > BIGINT, FLOAT4, FLOAT8 and their corresponding nullable/non-nullable > versions. Additionally, since DATE/TIME/TIMESTAMP are internally stored as > longs, we should allow the same numeric type histogram creation for these > data types as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7064) Leverage the summary's totalRowCount and totalNullCount for COUNT() queries (also prevent eager expansion of files)
[ https://issues.apache.org/jira/browse/DRILL-7064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7064: -- Description: This sub-task leverages the Parquet metadata cache's summary stats: totalRowCount (across all files and row groups) and the per-column totalNullCount (across all files and row groups), to answer plain COUNT aggregation queries without a GROUP BY. Such queries are currently converted to a DirectScan by the ConvertCountToDirectScanRule, which utilizes the row group metadata; however, this rule is applied on Drill logical rels and converts the logical plan to a physical plan with a DirectScanPrel. That is too late: the DrillScanRel created during logical planning has already read the entire metadata cache file, along with its full list of row group entries. The metadata cache file can grow quite large, so this approach does not scale. The solution is to use the Metadata Summary file created in DRILL-7063 and add a new rule that applies early, operating on the Calcite logical rels instead of the Drill logical rels, thereby preventing eager expansion of the list of files/row groups. We will not remove the existing rule; it will continue to operate as before, because after some transformations we may still want to apply the optimizations for COUNT queries. 
> Leverage the summary's totalRowCount and totalNullCount for COUNT() queries > (also prevent eager expansion of files) > --- > > Key: DRILL-7064 > URL: https://issues.apache.org/jira/browse/DRILL-7064 > Project: Apache Drill > Issue Type: Sub-task > Components: Metadata >Reporter: Venkata Jyothsna Donapati >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > This sub-task is meant to leverage the Parquet metadata cache's summary > stats: totalRowCount (across all files and row groups) and the per-column > totalNullCount (across all files and row groups) to answer plain COUNT > aggregation queries without Group-By. These are currently converted to a > DirectScan by the ConvertCountToDirectScanRule which utilizes the row group > metadata; however this rule is applied on Drill Logical rels and converts the > logical plan to a physical plan with DirectScanPrel but this is too late > since the DrillScanRel that is already created during logical planning has > already read the entire metadata cache file along with its full list of row > group entries. The metadata cache file can grow quite large and this does not > scale. > The solution is to use the Metadata Summary file that is created in > DRILL-7063 and create a new rule that will apply early on such that it > operates on the Calcite logical rels instead of the Drill logical rels and > prevents eager expansion of the list of files/row groups. > We will not remove the existing rule. The existing rule will continue to > operate as before because it is possible that after some transformations, we > still want to apply the optimizations for COUNT queries. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
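The rewrite described above can be illustrated with a small sketch. The dict keys mirror the summary fields named in the description (totalRowCount, totalNullCount), but the shape is hypothetical and is not Drill's actual metadata classes:

```python
# Sketch of the COUNT-to-metadata rewrite DRILL-7064 describes, with a
# hypothetical summary layout (not Drill's real metadata classes).

def answer_count(summary, column=None):
    """Answer COUNT(*) / COUNT(col) from summary metadata, with no file scan.

    summary: {"totalRowCount": int, "totalNullCount": {column_name: int}}
    """
    if column is None:
        # COUNT(*): every row counts, NULL or not
        return summary["totalRowCount"]
    # COUNT(col) skips NULLs: total rows minus this column's null count
    return summary["totalRowCount"] - summary["totalNullCount"][column]
```

With totalRowCount = 100 and 7 nulls recorded for column `a`, COUNT(*) answers 100 and COUNT(a) answers 93, without expanding the file or row-group lists.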
[jira] [Assigned] (DRILL-7064) Leverage the summary's totalRowCount and totalNullCount for COUNT() queries (also prevent eager expansion of files)
[ https://issues.apache.org/jira/browse/DRILL-7064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha reassigned DRILL-7064: - Assignee: Aman Sinha (was: Venkata Jyothsna Donapati) > Leverage the summary's totalRowCount and totalNullCount for COUNT() queries > (also prevent eager expansion of files) > --- > > Key: DRILL-7064 > URL: https://issues.apache.org/jira/browse/DRILL-7064 > Project: Apache Drill > Issue Type: Sub-task > Components: Metadata >Reporter: Venkata Jyothsna Donapati >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > Original Estimate: 336h > Remaining Estimate: 336h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal) and date/time/timestamp
[ https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7117: -- Reviewer: Gautam Parai > Support creation of histograms for numeric data types (except Decimal) and > date/time/timestamp > -- > > Key: DRILL-7117 > URL: https://issues.apache.org/jira/browse/DRILL-7117 > Project: Apache Drill > Issue Type: Sub-task > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > This JIRA is specific to creating histograms for numeric data types: INT, > BIGINT, FLOAT4, FLOAT8 and their corresponding nullable/non-nullable > versions. Additionally, since DATE/TIME/TIMESTAMP are internally stored as > longs, we should allow the same numeric type histogram creation for these > data types as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal) and date/time/timestamp
[ https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7117: -- Description: This JIRA is specific to creating histograms for numeric data types: INT, BIGINT, FLOAT4, FLOAT8 and their corresponding nullable/non-nullable versions. Additionally, since DATE/TIME/TIMESTAMP are internally stored as longs, we should allow the same numeric type histogram creation for these data types as well. (was: This JIRA is specific to creating histograms for numeric data types: INT, BIGINT, FLOAT4, FLOAT8 and their corresponding nullable/non-nullable versions. ) > Support creation of histograms for numeric data types (except Decimal) and > date/time/timestamp > -- > > Key: DRILL-7117 > URL: https://issues.apache.org/jira/browse/DRILL-7117 > Project: Apache Drill > Issue Type: Sub-task > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > This JIRA is specific to creating histograms for numeric data types: INT, > BIGINT, FLOAT4, FLOAT8 and their corresponding nullable/non-nullable > versions. Additionally, since DATE/TIME/TIMESTAMP are internally stored as > longs, we should allow the same numeric type histogram creation for these > data types as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal) and date/time/timestamp
[ https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7117: -- Summary: Support creation of histograms for numeric data types (except Decimal) and date/time/timestamp (was: Support creation of histograms for numeric data types (except Decimal)) > Support creation of histograms for numeric data types (except Decimal) and > date/time/timestamp > -- > > Key: DRILL-7117 > URL: https://issues.apache.org/jira/browse/DRILL-7117 > Project: Apache Drill > Issue Type: Sub-task > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > This JIRA is specific to creating histograms for numeric data types: INT, > BIGINT, FLOAT4, FLOAT8 and their corresponding nullable/non-nullable > versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal)
[ https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799742#comment-16799742 ] Aman Sinha edited comment on DRILL-7117 at 3/23/19 6:16 PM: I have a work-in-progress implementation, and initial results for the histogram using the proposed design (see design doc in parent JIRA) are encouraging. I analyzed a few columns from the TPC-DS table Web_Sales; in particular, for the skewed column ws_ext_tax I compared the output of the NTILE() function with what the histogram implementation generates. They are very close. Here's the NTILE() output (this is the baseline):
{noformat}
select x as ntile, min(ws_ext_tax) as min, max(ws_ext_tax) as max, count(*) as count
from (select ws_ext_tax, ntile(10) over (order by ws_ext_tax) as x
      from dfs.tmp.web_sales1)
group by x order by x;

| ntile | min    | max     | count |
| 1     | 0.0    | 0.0     | 71939 |
| 2     | 0.0    | 3.05    | 71939 |
| 3     | 3.05   | 9.28    | 71939 |
| 4     | 9.28   | 19.24   | 71939 |
| 5     | 19.24  | 34.51   | 71938 |
| 6     | 34.51  | 57.96   | 71938 |
| 7     | 57.96  | 95.01   | 71938 |
| 8     | 95.01  | 158.71  | 71938 |
| 9     | 158.71 | 292.44  | 71938 |
| 10    | 292.44 | 2433.41 | 71938 |
{noformat}
Here's the histogram output after running ANALYZE and using 10 buckets:
{noformat}
{
  "column" : "`ws_ext_tax`",
  "majortype" : { "type" : "FLOAT4", "mode" : "OPTIONAL" },
  "schema" : 1.0,
  "rowcount" : 719384.0,
  "nonnullrowcount" : 719200.0,
  "ndv" : 68400,
  "avgwidth" : 4.0,
  "histogram" : {
    "category" : "numeric-histogram",
    "numRowsPerBucket" : 71920,
    "buckets" : [ 0.0, 0.0, 3.0012382450412556, 9.294927927208075, 19.451534502867155, 34.26953338960548, 57.30967536502495, 95.25307998787451, 158.88263313484484, 289.2295684860717, 2433.409910105169 ]
  }
}
{noformat}
Note that the bucket boundaries closely align with the baseline even though the data distribution is non-uniform. 
> Support creation of histograms for numeric data types (except Decimal) > -- > > Key: DRILL-7117 > URL: https://issues.apache.org/jira/browse/DRILL-7117 > Project: Apache Drill > Issue Type: Sub-task > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > This JIRA is specific to creating histograms for numeric data types: INT, > BIGINT, FLOAT4, FLOAT8 and their corresponding nullable/non-nullable > versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
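As a reference for the comparison above, the NTILE() baseline can be sketched in a few lines of plain quantile computation (illustrative only; this is not Drill's t-digest implementation, and remainder rows land in later tiles here, whereas SQL's NTILE front-loads them). Each tile's count differs by at most one, matching the 71938/71939 split in the baseline.

```python
# Equi-depth summary in the style of the NTILE() baseline: sort the column,
# then split the sorted rows into `tiles` groups whose sizes differ by at
# most one, reporting min/max/count per group.

def ntile_summary(values, tiles):
    data = sorted(values)
    n = len(data)
    out = []
    for t in range(tiles):
        # group t covers sorted positions [t*n//tiles, (t+1)*n//tiles)
        lo, hi = (t * n) // tiles, ((t + 1) * n) // tiles
        out.append({"ntile": t + 1, "min": data[lo], "max": data[hi - 1],
                    "count": hi - lo})
    return out
```

Comparing each group's max against the corresponding histogram bucket boundary is exactly the check performed in the comment above.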
[jira] [Commented] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal)
[ https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799742#comment-16799742 ] Aman Sinha commented on DRILL-7117: --- I have a work-in-progress implementation, and initial results for the histogram using the proposed design (see design doc in parent JIRA) are encouraging. I analyzed a few columns from the TPC-DS table Web_Sales; in particular, for the skewed column ws_ext_tax I compared the output of the NTILE() function with what the histogram implementation generates. They are very close. Here's the NTILE() output (this is the baseline):
{noformat}
select x as ntile, min(ws_ext_tax) as min, max(ws_ext_tax) as max, count(*) as count
from (select ws_ext_tax, ntile(10) over (order by ws_ext_tax) as x
      from dfs.tmp.web_sales1)
group by x order by x;

| ntile | min    | max     | count |
| 1     | 0.0    | 0.0     | 71939 |
| 2     | 0.0    | 3.05    | 71939 |
| 3     | 3.05   | 9.28    | 71939 |
| 4     | 9.28   | 19.24   | 71939 |
| 5     | 19.24  | 34.51   | 71938 |
| 6     | 34.51  | 57.96   | 71938 |
| 7     | 57.96  | 95.01   | 71938 |
| 8     | 95.01  | 158.71  | 71938 |
| 9     | 158.71 | 292.44  | 71938 |
| 10    | 292.44 | 2433.41 | 71938 |
{noformat}
Here's the histogram output after running ANALYZE and using 10 buckets:
{noformat}
{
  "column" : "`ws_ext_tax`",
  "majortype" : { "type" : "FLOAT4", "mode" : "OPTIONAL" },
  "schema" : 1.0,
  "rowcount" : 719384.0,
  "nonnullrowcount" : 719200.0,
  "ndv" : 68400,
  "avgwidth" : 4.0,
  "histogram" : {
    "category" : "numeric-histogram",
    "numRowsPerBucket" : 71920,
    "buckets" : [ 0.0, 0.0, 3.0012382450412556, 9.294927927208075, 19.451534502867155, 34.26953338960548, 57.30967536502495, 95.25307998787451, 158.88263313484484, 289.2295684860717, 2433.409910105169 ]
  }
}
{noformat}
Note that the bucket boundaries closely align with the baseline even though the data distribution is non-uniform. 
> Support creation of histograms for numeric data types (except Decimal) > -- > > Key: DRILL-7117 > URL: https://issues.apache.org/jira/browse/DRILL-7117 > Project: Apache Drill > Issue Type: Sub-task > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > This JIRA is specific to creating histograms for numeric data types: INT, > BIGINT, FLOAT4, FLOAT8 and their corresponding nullable/non-nullable > versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7109) Statistics adds external sort, which spills to disk
[ https://issues.apache.org/jira/browse/DRILL-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799215#comment-16799215 ] Aman Sinha commented on DRILL-7109: --- [~rhou] for all such issues related to the planning, please add the EXPLAIN plan with and without statistics for faster diagnosis. > Statistics adds external sort, which spills to disk > --- > > Key: DRILL-7109 > URL: https://issues.apache.org/jira/browse/DRILL-7109 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning Optimization >Affects Versions: 1.16.0 >Reporter: Robert Hou >Assignee: Gautam Parai >Priority: Major > Fix For: 1.16.0 > > > TPCH query 4 with sf 100 runs many times slower. One issue is that an extra > external sort has been added, and both external sorts spill to disk. > Also, the hash join sees 100x more data. > Here is the query: > {noformat} > select > o.o_orderpriority, > count(*) as order_count > from > orders o > where > o.o_orderdate >= date '1996-10-01' > and o.o_orderdate < date '1996-10-01' + interval '3' month > and > exists ( > select > * > from > lineitem l > where > l.l_orderkey = o.o_orderkey > and l.l_commitdate < l.l_receiptdate > ) > group by > o.o_orderpriority > order by > o.o_orderpriority; > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6852) Adapt current Parquet Metadata cache implementation to use Drill Metastore API
[ https://issues.apache.org/jira/browse/DRILL-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6852: -- Labels: ready-to-commit (was: ) > Adapt current Parquet Metadata cache implementation to use Drill Metastore API > -- > > Key: DRILL-6852 > URL: https://issues.apache.org/jira/browse/DRILL-6852 > Project: Apache Drill > Issue Type: Sub-task >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Labels: ready-to-commit > Fix For: 1.16.0 > > > According to the design document for DRILL-6552, existing metadata cache API > should be adapted to use generalized API for metastore and parquet metadata > cache will be presented as the implementation of metastore API. > The aim of this Jira is to refactor Parquet Metadata cache implementation and > adapt it to use Drill Metastore API. > Execution plan: > - Refactor AbstractParquetGroupScan and its implementations to use metastore > metadata classes. Store Drill data types in metadata files for Parquet tables. > - Storing the least restrictive type instead of current first file’s column > data type. > - Rework logic in AbstractParquetGroupScan to allow filtering at different > metadata layers: partition, file, row group, etc. The same for pushing the > limit. > - Implement logic to convert existing parquet metadata to metastore metadata > to preserve backward compatibility. > - Implement fetching metadata only when it is needed (for filtering, limit, > count(*) etc.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7113) Issue with filtering null values from MapRDB-JSON
[ https://issues.apache.org/jira/browse/DRILL-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7113: -- Labels: ready-to-commit (was: ) > Issue with filtering null values from MapRDB-JSON > - > > Key: DRILL-7113 > URL: https://issues.apache.org/jira/browse/DRILL-7113 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning Optimization >Affects Versions: 1.15.0 >Reporter: Hanumath Rao Maduri >Assignee: Aman Sinha >Priority: Major > Labels: ready-to-commit > Fix For: 1.16.0 > > > When Drill queries documents from MapR-DB JSON that contain fields with a > null value, it returns a wrong result. > The issue was reproduced locally. > Repro steps: > [1] Create a MapR-DB JSON table, say '/tmp/dmdb2/'. > [2] Insert the following sample records into the table: > {code:java} > insert --table /tmp/dmdb2/ --value '{"_id": "1", "label": "person", > "confidence": 0.24}' > insert --table /tmp/dmdb2/ --value '{"_id": "2", "label": "person2"}' > insert --table /tmp/dmdb2/ --value '{"_id": "3", "label": "person3", > "confidence": 0.54}' > insert --table /tmp/dmdb2/ --value '{"_id": "4", "label": "person4", > "confidence": null}' > {code} > Note that for the field 'confidence', document 1 has value 0.24, document 3 > has value 0.54, document 2 does not have the field, and document 4 has the > field with value null. > [3] Query the table from Drill. 
> *Query 1:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2; > +--+-+ > | label | confidence | > +--+-+ > | person | 0.24| > | person2 | null| > | person3 | 0.54| > | person4 | null| > +--+-+ > 4 rows selected (0.2 seconds) > {code} > *Query 2:* > {code:java} > 0: jdbc:drill:> select * from dfs.tmp.dmdb2; > +--+-+--+ > | _id | confidence | label | > +--+-+--+ > | 1| 0.24| person | > | 2| null| person2 | > | 3| 0.54| person3 | > | 4| null| person4 | > +--+-+--+ > 4 rows selected (0.174 seconds) > {code} > *Query 3:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence > is not null; > +--+-+ > | label | confidence | > +--+-+ > | person | 0.24| > | person3 | 0.54| > | person4 | null| > +--+-+ > 3 rows selected (0.192 seconds) > {code} > *Query 4:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence > is null; > +--+-+ > | label | confidence | > +--+-+ > | person2 | null| > +--+-+ > 1 row selected (0.262 seconds) > {code} > As you can see, Query 3 which queries for all documents with confidence value > 'is not null', returns a document with null value. > *Other observation:* > Querying the same data using DRILL without MapRDB provides the correct > result. 
> For example, create 4 different JSON files with the following data: > {"label": "person", "confidence": 0.24} {"label": "person2"} {"label": > "person3", "confidence": 0.54} {"label": "person4", "confidence": null} > Query it directly using Drill: > *Query 5:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.t2; > +--+-+ > | label | confidence | > +--+-+ > | person4 | null| > | person3 | 0.54| > | person2 | null| > | person | 0.24| > +--+-+ > 4 rows selected (0.203 seconds) > {code} > *Query 6:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is > null; > +--+-+ > | label | confidence | > +--+-+ > | person4 | null| > | person2 | null| > +--+-+ > 2 rows selected (0.352 seconds) > {code} > *Query 7:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is > not null; > +--+-+ > | label | confidence | > +--+-+ > | person3 | 0.54| > | person | 0.24| > +--+-+ > 2 rows selected (0.265 seconds) > {code} > As seen in Queries 6 and 7, this returns the correct result. > I believe the issue is at the MapR-DB layer where it is fetching the results. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
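The expected semantics being tested in Queries 3 through 7 can be pinned down with a tiny sketch: IS NULL and IS NOT NULL partition the rows exactly, so a row such as person4 can never appear in both result sets. Query 3 in the report violates this.

```python
# Reference semantics for IS [NOT] NULL on the repro data: the two
# predicates partition the rows; no row belongs to both results.

rows = [{"label": "person",  "confidence": 0.24},
        {"label": "person2", "confidence": None},
        {"label": "person3", "confidence": 0.54},
        {"label": "person4", "confidence": None}]

# confidence IS NOT NULL -> person, person3 (the correct Query 3 / Query 7 answer)
not_null = [r["label"] for r in rows if r["confidence"] is not None]
# confidence IS NULL -> person2, person4 (the correct Query 4 / Query 6 answer)
is_null = [r["label"] for r in rows if r["confidence"] is None]
```

Note that SQL treats a missing field and an explicit null the same way for these predicates, which is why person2 and person4 belong together in the IS NULL result.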
[jira] [Created] (DRILL-7119) Modify selectivity calculations to use histograms
Aman Sinha created DRILL-7119: - Summary: Modify selectivity calculations to use histograms Key: DRILL-7119 URL: https://issues.apache.org/jira/browse/DRILL-7119 Project: Apache Drill Issue Type: Sub-task Components: Query Planning Optimization Reporter: Aman Sinha Assignee: Aman Sinha Fix For: 1.16.0 (Please see parent JIRA for the design document) Once the t-digest based histogram is created, we need to read it back and modify the selectivity calculations such that they use the histogram buckets for range conditions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
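A plausible shape for the bucket-based range selectivity this sub-task describes (illustrative only; the actual Drill code and its field names may differ): each full bucket below the constant contributes 1/num_buckets, and the bucket containing the constant is linearly interpolated.

```python
# Sketch of range-predicate selectivity over an equi-depth histogram, as
# DRILL-7119 proposes; an assumed boundary-list representation, not Drill's.

def selectivity_less_than(boundaries, value):
    """Estimate sel(col < value) from equi-depth bucket boundaries.

    boundaries: num_buckets + 1 sorted values (min, bucket edges, max);
    each bucket holds an equal share of the rows.
    """
    buckets = len(boundaries) - 1
    if value <= boundaries[0]:
        return 0.0
    if value >= boundaries[-1]:
        return 1.0
    sel = 0.0
    for lo, hi in zip(boundaries, boundaries[1:]):
        if value >= hi:
            sel += 1.0 / buckets                        # whole bucket qualifies
        elif value > lo:
            sel += (value - lo) / (hi - lo) / buckets   # partial bucket, interpolated
    return sel
```

For the 10-bucket ws_ext_tax histogram shown earlier in this digest, a predicate like `ws_ext_tax < 57.3` would land near the 0.6 boundary, rather than the uniform-distribution guess a min/max-only estimate would produce.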
[jira] [Created] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal)
Aman Sinha created DRILL-7117: - Summary: Support creation of histograms for numeric data types (except Decimal) Key: DRILL-7117 URL: https://issues.apache.org/jira/browse/DRILL-7117 Project: Apache Drill Issue Type: Sub-task Components: Query Planning Optimization Reporter: Aman Sinha Assignee: Aman Sinha Fix For: 1.16.0 This JIRA is specific to creating histograms for numeric data types: INT, BIGINT, FLOAT4, FLOAT8 and their corresponding nullable/non-nullable versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7063) Create separate summary file for schema, totalRowCount, totalNullCount (includes maintenance)
[ https://issues.apache.org/jira/browse/DRILL-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7063: -- Reviewer: Aman Sinha > Create separate summary file for schema, totalRowCount, totalNullCount > (includes maintenance) > - > > Key: DRILL-7063 > URL: https://issues.apache.org/jira/browse/DRILL-7063 > Project: Apache Drill > Issue Type: Sub-task > Components: Metadata >Reporter: Venkata Jyothsna Donapati >Assignee: Venkata Jyothsna Donapati >Priority: Major > Fix For: 1.16.0 > > Original Estimate: 252h > Remaining Estimate: 252h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7113) Issue with filtering null values from MapRDB-JSON
[ https://issues.apache.org/jira/browse/DRILL-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7113: -- Fix Version/s: (was: 1.17.0) > Issue with filtering null values from MapRDB-JSON > - > > Key: DRILL-7113 > URL: https://issues.apache.org/jira/browse/DRILL-7113 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning Optimization >Affects Versions: 1.15.0 >Reporter: Hanumath Rao Maduri >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > When the Drill is querying documents from MapRDBJSON that contain fields with > null value, it returns the wrong result. > The issue is locally reproduced. > Please find the repro steps: > [1] Create a MaprDBJSON table. Say '/tmp/dmdb2/'. > [2] Insert the following sample records to table: > {code:java} > insert --table /tmp/dmdb2/ --value '{"_id": "1", "label": "person", > "confidence": 0.24}' > insert --table /tmp/dmdb2/ --value '{"_id": "2", "label": "person2"}' > insert --table /tmp/dmdb2/ --value '{"_id": "3", "label": "person3", > "confidence": 0.54}' > insert --table /tmp/dmdb2/ --value '{"_id": "4", "label": "person4", > "confidence": null}' > {code} > We can see that for field 'confidence' document 1 has value 0.24, document 3 > has value 0.54, document 2 does not have the field and document 4 has the > field with value null. > [3] Query the table from DRILL. 
> *Query 1:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2; > +--+-+ > | label | confidence | > +--+-+ > | person | 0.24| > | person2 | null| > | person3 | 0.54| > | person4 | null| > +--+-+ > 4 rows selected (0.2 seconds) > {code} > *Query 2:* > {code:java} > 0: jdbc:drill:> select * from dfs.tmp.dmdb2; > +--+-+--+ > | _id | confidence | label | > +--+-+--+ > | 1| 0.24| person | > | 2| null| person2 | > | 3| 0.54| person3 | > | 4| null| person4 | > +--+-+--+ > 4 rows selected (0.174 seconds) > {code} > *Query 3:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence > is not null; > +--+-+ > | label | confidence | > +--+-+ > | person | 0.24| > | person3 | 0.54| > | person4 | null| > +--+-+ > 3 rows selected (0.192 seconds) > {code} > *Query 4:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence > is null; > +--+-+ > | label | confidence | > +--+-+ > | person2 | null| > +--+-+ > 1 row selected (0.262 seconds) > {code} > As you can see, Query 3 which queries for all documents with confidence value > 'is not null', returns a document with null value. > *Other observation:* > Querying the same data using DRILL without MapRDB provides the correct > result. 
> For example, create 4 different JSON files with following data: > {"label": "person", "confidence": 0.24} \{"label": "person2"} \{"label": > "person3", "confidence": 0.54} \{"label": "person4", "confidence": null} > Query it directly using DRILL: > *Query 5:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.t2; > +--+-+ > | label | confidence | > +--+-+ > | person4 | null| > | person3 | 0.54| > | person2 | null| > | person | 0.24| > +--+-+ > 4 rows selected (0.203 seconds) > {code} > *Query 6:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is > null; > +--+-+ > | label | confidence | > +--+-+ > | person4 | null| > | person2 | null| > +--+-+ > 2 rows selected (0.352 seconds) > {code} > *Query 7:* > {code:java} > 0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is > not null; > +--+-+ > | label | confidence | > +--+-+ > | person3 | 0.54| > | person | 0.24| > +--+-+ > 2 rows selected (0.265 seconds) > {code} > As seen in query 6 & 7, it returns the correct result. > I believe the issue is at the MapRDB layer where it is fetching the results. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
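The correct semantics demonstrated by queries 5-7 can be sketched in a few lines of plain Python over the sample documents (this is not MapRDB code): a missing field and an explicit null should both satisfy `confidence IS NULL`, and neither should satisfy `IS NOT NULL`.

```python
# Expected IS NULL / IS NOT NULL semantics over the sample documents.
# A missing field and an explicit null are both "null" for filtering.
docs = [
    {"label": "person",  "confidence": 0.24},
    {"label": "person2"},                      # field absent
    {"label": "person3", "confidence": 0.54},
    {"label": "person4", "confidence": None},  # explicit null
]

is_null     = [d["label"] for d in docs if d.get("confidence") is None]
is_not_null = [d["label"] for d in docs if d.get("confidence") is not None]

print(is_null)      # ['person2', 'person4']
print(is_not_null)  # ['person', 'person3']
```

Query 3 returning person4 (explicit null) under `is not null` is exactly the deviation from this expected partition.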
[jira] [Updated] (DRILL-7114) ANALYZE command generates warnings for stats file and materialization
[ https://issues.apache.org/jira/browse/DRILL-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7114: -- Description: When I run ANALYZE, I see warnings in the log file as shown below. The ANALYZE command should not try to read the stats file or materialize the stats. {noformat} 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file. java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not exist 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. Continuing without stats. java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not exist {noformat} was: When I run ANALYZE, I see warnings shown below. The ANALYZE command should not try to read the stats file or materialize the stats. {noformat} 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file. java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not exist 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. Continuing without stats. java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not exist {noformat} > ANALYZE command generates warnings for stats file and materialization > - > > Key: DRILL-7114 > URL: https://issues.apache.org/jira/browse/DRILL-7114 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning Optimization >Reporter: Aman Sinha >Assignee: Gautam Parai >Priority: Minor > Fix For: 1.16.0 > > > When I run ANALYZE, I see warnings in the log file as shown below. The > ANALYZE command should not try to read the stats file or materialize the > stats. 
> {noformat} > 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN > o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file. > java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does > not exist > 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN > o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. > Continuing without stats. > java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does > not exist > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
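One hedged sketch of the fix direction (the helper and file layout below are illustrative, not Drill's actual code): probe for the stats file before reading it, so that its absence is the expected case rather than a logged warning.

```python
# Minimal sketch: treat a missing stats file as the normal case and
# skip the read entirely, so no "Failed to read the stats file" WARN
# is ever emitted. Path layout mirrors the log above; helper name is
# hypothetical.
import os

def load_stats(table_dir):
    stats_path = os.path.join(table_dir, ".stats.drill", "0_0.json")
    if not os.path.exists(stats_path):
        return None  # no stats yet: continue silently
    with open(stats_path) as f:
        return f.read()

print(load_stats("/tmp/nonexistent_table"))  # None, no warning emitted
```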
[jira] [Created] (DRILL-7114) ANALYZE command generates warnings for stats file and materialization
Aman Sinha created DRILL-7114: - Summary: ANALYZE command generates warnings for stats file and materialization Key: DRILL-7114 URL: https://issues.apache.org/jira/browse/DRILL-7114 Project: Apache Drill Issue Type: Bug Components: Query Planning Optimization Reporter: Aman Sinha Assignee: Gautam Parai Fix For: 1.16.0 When I run ANALYZE, I see warnings shown below. The ANALYZE command should not try to read the stats file or materialize the stats. {noformat} 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file. java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not exist 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. Continuing without stats. java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does not exist {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7108) With statistics enabled TPCH 16 has two additional exchange operators
[ https://issues.apache.org/jira/browse/DRILL-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7108: -- Reviewer: Aman Sinha Summary: With statistics enabled TPCH 16 has two additional exchange operators (was: Statistics adds two exchange operators) > With statistics enabled TPCH 16 has two additional exchange operators > - > > Key: DRILL-7108 > URL: https://issues.apache.org/jira/browse/DRILL-7108 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning Optimization >Affects Versions: 1.16.0 >Reporter: Robert Hou >Assignee: Gautam Parai >Priority: Major > Fix For: 1.16.0 > > > TPCH 16 with sf 100 runs 14% slower. Here is the query: > {noformat} > select > p.p_brand, > p.p_type, > p.p_size, > count(distinct ps.ps_suppkey) as supplier_cnt > from > partsupp ps, > part p > where > p.p_partkey = ps.ps_partkey > and p.p_brand <> 'Brand#21' > and p.p_type not like 'MEDIUM PLATED%' > and p.p_size in (38, 2, 8, 31, 44, 5, 14, 24) > and ps.ps_suppkey not in ( > select > s.s_suppkey > from > supplier s > where > s.s_comment like '%Customer%Complaints%' > ) > group by > p.p_brand, > p.p_type, > p.p_size > order by > supplier_cnt desc, > p.p_brand, > p.p_type, > p.p_size; > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7108) Statistics adds two exchange operators
[ https://issues.apache.org/jira/browse/DRILL-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795153#comment-16795153 ] Aman Sinha commented on DRILL-7108: --- [~rhou] the description seems incomplete and does not match the title which says there are extraneous exchange operators. Can you add the textual plan with and without statistics ? > Statistics adds two exchange operators > -- > > Key: DRILL-7108 > URL: https://issues.apache.org/jira/browse/DRILL-7108 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning Optimization >Affects Versions: 1.16.0 >Reporter: Robert Hou >Assignee: Gautam Parai >Priority: Major > Fix For: 1.16.0 > > > TPCH 16 with sf 100 runs 14% slower. Here is the query: > {noformat} > select > p.p_brand, > p.p_type, > p.p_size, > count(distinct ps.ps_suppkey) as supplier_cnt > from > partsupp ps, > part p > where > p.p_partkey = ps.ps_partkey > and p.p_brand <> 'Brand#21' > and p.p_type not like 'MEDIUM PLATED%' > and p.p_size in (38, 2, 8, 31, 44, 5, 14, 24) > and ps.ps_suppkey not in ( > select > s.s_suppkey > from > supplier s > where > s.s_comment like '%Customer%Complaints%' > ) > group by > p.p_brand, > p.p_type, > p.p_size > order by > supplier_cnt desc, > p.p_brand, > p.p_type, > p.p_size; > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-7019) Add check for redundant imports
[ https://issues.apache.org/jira/browse/DRILL-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-7019: -- Labels: (was: ready-to-commit) > Add check for redundant imports > --- > > Key: DRILL-7019 > URL: https://issues.apache.org/jira/browse/DRILL-7019 > Project: Apache Drill > Issue Type: Task > Components: Tools, Build Test >Affects Versions: 1.15.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Minor > Fix For: 1.16.0 > > > Currently, only the {{UnusedImports}} check is used, which does not prevent > duplicate imports or imports from the same package. > The goal of this Jira is to add the {{RedundantImport}} check and fix checkstyle > errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6992) Support column histogram statistics
[ https://issues.apache.org/jira/browse/DRILL-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757549#comment-16757549 ] Aman Sinha commented on DRILL-6992: --- Added a link to the histogram design doc. > Support column histogram statistics > --- > > Key: DRILL-6992 > URL: https://issues.apache.org/jira/browse/DRILL-6992 > Project: Apache Drill > Issue Type: New Feature > Components: Query Planning Optimization >Affects Versions: 1.15.0 >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > > As a follow-up to > [DRILL-1328|https://issues.apache.org/jira/browse/DRILL-1328] which is adding > NDV (num distinct values) support and creating the framework for statistics, > we also need Histograms. These are needed for range predicates selectivity > estimation as well as equality predicates when there is non-uniform > distribution of data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6992) Support column histogram statistics
Aman Sinha created DRILL-6992: - Summary: Support column histogram statistics Key: DRILL-6992 URL: https://issues.apache.org/jira/browse/DRILL-6992 Project: Apache Drill Issue Type: New Feature Components: Query Planning Optimization Affects Versions: 1.15.0 Reporter: Aman Sinha Assignee: Aman Sinha As a follow-up to [DRILL-1328|https://issues.apache.org/jira/browse/DRILL-1328] which is adding NDV (num distinct values) support and creating the framework for statistics, we also need Histograms. These are needed for range predicates selectivity estimation as well as equality predicates when there is non-uniform distribution of data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
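The non-uniform-distribution point above can be made concrete: NDV-only statistics force a uniformity assumption, so every equality predicate gets selectivity 1/NDV, which badly misestimates skewed columns. The values below are illustrative only.

```python
# Why NDV alone is not enough: under the uniformity assumption the
# selectivity of `col = v` is 1/NDV for every v, which misestimates
# skewed columns. Illustrative values, not real statistics.
from collections import Counter

values = [455] * 30 + [555] * 15 + [580] * 3 + [1414] * 2
counts = Counter(values)
ndv = len(counts)                                  # 4 distinct values

uniform_sel = 1 / ndv                              # same for every value
actual_sel = {v: c / len(values) for v, c in counts.items()}

print(uniform_sel)       # 0.25
print(actual_sel[455])   # 0.6, badly underestimated by 1/NDV
```

A histogram records the per-bucket frequencies that the 1/NDV estimate throws away, which is what closes this gap.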
[jira] [Commented] (DRILL-519) Some queries fail with complaint about cartesian product when cartesian product not required
[ https://issues.apache.org/jira/browse/DRILL-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744472#comment-16744472 ] Aman Sinha commented on DRILL-519: -- [~vitalii] there's a slight variant of query 19 called 19_1.sql that we do run as part of [TestTpchDistributed.tpch19_1()|https://github.com/apache/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/TestTpchDistributed.java#L131]. The main difference is that one join condition which was present in all the OR blocks is pulled up outside the OR blocks, so semantically it is the same. However, if we want to address the original query you might want to file a new JIRA and do some investigation to see whether it's a Drill issue or a Calcite one. > Some queries fail with complaint about cartesian product when cartesian > product not required > > > Key: DRILL-519 > URL: https://issues.apache.org/jira/browse/DRILL-519 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning Optimization >Reporter: Jacques Nadeau >Assignee: Aman Sinha >Priority: Major > Fix For: 0.4.0 > > > Failure for TPCH21: > org.eigenbase.relopt.RelOptPlanner$CannotPlanException: Node > [rel#1080:Subset#89.PHYSICAL.SINGLETON([]).[1 DESC, 0]] could not be > implemented; planner state: > Root: rel#1080:Subset#89.PHYSICAL.SINGLETON([]).[1 DESC, 0] > Original rel: > AbstractConverter(subset=[rel#1080:Subset#89.PHYSICAL.SINGLETON([]).[1 DESC, > 0]], convention=[PHYSICAL], DrillDistributionTraitDef=[SINGLETON([])], > sort=[[1 DESC, 0]]): rowcount = 5.7017433643341056E16, cumulative cost = > {inf}, id = 1082 > DrillScreenRel(subset=[rel#1079:Subset#89.LOGICAL.ANY([]).[1 DESC, 0]]): > rowcount = 5.7017433643341056E16, cumulative cost = {5.701743364334106E15 > rows, 5.701743364334106E15 cpu, 0.0 io}, id = 1078 > DrillLimitRel(subset=[rel#1077:Subset#88.LOGICAL.ANY([]).[1 DESC, 0]], > fetch=[100]): rowcount = 5.7017433643341056E16, cumulative cost = > {5.7017433643341056E16 rows, 5.7017433643341056E16 cpu, 0.0 
io}, id = 1076 > DrillSortRel(subset=[rel#1075:Subset#87.LOGICAL.ANY([]).[1 DESC, 0]], > sort0=[$1], sort1=[$0], dir0=[DESC], dir1=[ASC]): rowcount = > 5.7017433643341056E16, cumulative cost = {1.7598833879268286E19 rows, > 5.7017433643341056E16 cpu, 0.0 io}, id = 1074 > DrillAggregateRel(subset=[rel#1073:Subset#86.LOGICAL.ANY([]).[]], > group=[{0}], numwait=[COUNT()]): rowcount = 5.7017433643341056E16, cumulative > cost = {5.7017433643341056E16 rows, 0.0 cpu, 0.0 io}, id = 1072 > DrillProjectRel(subset=[rel#1071:Subset#85.LOGICAL.ANY([]).[]], > s_name=[$0]): rowcount = 5.7017433643341056E17, cumulative cost = > {5.7017433643341056E16 rows, 5.7017433643341056E16 cpu, 0.0 io}, id = 1070 > DrillFilterRel(subset=[rel#1069:Subset#84.LOGICAL.ANY([]).[]], > condition=[NOT(IS TRUE($14))]): rowcount = 5.7017433643341056E17, cumulative > cost = {5.7017433643341056E16 rows, 2.28069734573364224E17 cpu, 0.0 io}, id = > 1068 > DrillJoinRel(subset=[rel#1067:Subset#83.LOGICAL.ANY([]).[]], > condition=[AND(=($4, $12), =($5, $13))], joinType=[left]): rowcount = > 2.28069734573364224E18, cumulative cost = {2.28069734573364224E18 rows, 0.0 > cpu, 0.0 io}, id = 1066 > > DrillProjectRel(subset=[rel#1038:Subset#75.LOGICAL.ANY([]).[]], s_name=[$0], > s_suppkey=[$1], s_nationkey=[$2], l_receiptdate=[$3], l_orderkey=[$4], > l_suppkey=[$5], l_commitdate=[$6], o_orderkey=[$7], o_orderstatus=[$8], > n_name=[$9], n_nationkey=[$10], $f0=[$13]): rowcount = 1.601806640625E8, > cumulative cost = {1.601806640625E7 rows, 1.92216796875E8 cpu, 0.0 io}, id = > 1037 > > DrillFilterRel(subset=[rel#1036:Subset#74.LOGICAL.ANY([]).[]], condition=[IS > TRUE($13)]): rowcount = 1.601806640625E8, cumulative cost = {1.601806640625E7 > rows, 6.4072265625E7 cpu, 0.0 io}, id = 1035 > > DrillJoinRel(subset=[rel#1034:Subset#73.LOGICAL.ANY([]).[]], > condition=[AND(=($4, $11), =($5, $12))], joinType=[left]): rowcount = > 6.4072265625E8, cumulative cost = {6.4072265625E8 rows, 0.0 cpu, 0.0 io}, id > = 1033 > > 
DrillJoinRel(subset=[rel#1014:Subset#63.LOGICAL.ANY([]).[]], condition=[=($2, > $10)], joinType=[inner]): rowcount = 3796.875, cumulative cost = {3796.875 > rows, 0.0 cpu, 0.0 io}, id = 1013 > > DrillProjectRel(subset=[rel#1009:Subset#60.LOGICAL.ANY([]).[]], $f0=[$6], > $f1=[$7], $f2=[$8], $f3=[$2], $f4=[$3], $f5=[$4], $f6=[$5], $f7=[$0], > $f8=[$1]): rowcount = 1687.5, cumulative cost = {168.75 rows, 1518.75 cpu, > 0.0 io}, id = 1008 > > DrillJoinRel(subset=[rel#1007:Subset#59.LOGICAL.ANY([]).[]], condition=[=($7, >
[jira] [Resolved] (DRILL-6897) TPCH 13 has regressed
[ https://issues.apache.org/jira/browse/DRILL-6897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha resolved DRILL-6897. --- Resolution: Duplicate Duplicate of DRILL-6896. > TPCH 13 has regressed > - > > Key: DRILL-6897 > URL: https://issues.apache.org/jira/browse/DRILL-6897 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Relational Operators >Affects Versions: 1.15.0 >Reporter: Robert Hou >Assignee: Karthikeyan Manivannan >Priority: Major > Attachments: 240099ed-ef2a-a23a-4559-f1b2e0809e72.sys.drill, > 2400be84-c024-cb92-8743-3211589e0247.sys.drill > > > I ran TPCH query 13 with both scale factor 100 and 1000, and ran them 3x to > get a warm start, and ran them twice to verify the regression. It is > regressing between 26 and 33%. > Here is the query: > {noformat} > select > c_count, > count(*) as custdist > from > ( > select > c.c_custkey, > count(o.o_orderkey) > from > customer c > left outer join orders o > on c.c_custkey = o.o_custkey > and o.o_comment not like '%special%requests%' > group by > c.c_custkey > ) as orders (c_custkey, c_count) > group by > c_count > order by > custdist desc, > c_count desc; > {noformat} > I have attached two profiles. 240099ed-ef2a-a23a-4559-f1b2e0809e72 is for > Drill 1.15. 2400be84-c024-cb92-8743-3211589e0247 is for Drill 1.14. The > commit for Drill 1.15 is 596227bbbecfb19bdb55dd8ea58159890f83bc9c. The commit > for Drill 1.14 is 0508a128853ce796ca7e99e13008e49442f83147. > The two plans are nearly the same. One difference is that Drill 1.15 is using > four times more memory in operator 07-01 Unordered Mux Exchange. I think the > problem may be in operator 09-01 Project. Drill 1.15 is projecting the > comment field while Drill 1.14 does not project the comment field. > Another issue is that Drill 1.15 takes more processing time to filter the > order table. Filter operator 09-03 takes an average of 19.3s. For Drill 1.14, > filter operator 09-04 takes an average of 15.6s. 
They process the same number > of rows, and have the same number of minor fragments. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
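The extra projected column described above (1.15 carrying the comment field past the join) is the kind of column that projection pruning should remove: a projection only needs to keep the columns some operator above it still references. A minimal, hypothetical sketch of that idea (column names taken from the query above; this is not Drill's planner code):

```python
# Illustrative projection pruning: keep only columns still referenced
# by operators above the projection. Hypothetical helper, not Drill code.
def prune_projection(projected_cols, referenced_above):
    """Drop columns nothing above the projection needs."""
    needed = set(referenced_above)
    return [c for c in projected_cols if c in needed]

# Above the join, only c_custkey (group key) and o_orderkey (counted)
# are referenced; o_comment was consumed by the filter below the join.
cols = ["c_custkey", "o_orderkey", "o_comment"]
print(prune_projection(cols, {"c_custkey", "o_orderkey"}))
# ['c_custkey', 'o_orderkey']
```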
[jira] [Commented] (DRILL-6912) NPE when other drillbit is already running
[ https://issues.apache.org/jira/browse/DRILL-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724760#comment-16724760 ] Aman Sinha commented on DRILL-6912: --- [~vvysotskyi] Yes, certainly the bug exists as you mentioned and affects both distributed and embedded. It is up to the release manager to decide whether to consider this for this release but it is sounding like a blocker. In case you have a tentative fix, you can contribute that :) > NPE when other drillbit is already running > -- > > Key: DRILL-6912 > URL: https://issues.apache.org/jira/browse/DRILL-6912 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.15.0 >Reporter: Vitalii Diravka >Assignee: Boaz Ben-Zvi >Priority: Critical > > If user tries to run the second drillbit process, the following output will > be obtained: > {code:java} > vitalii@vitalii-pc:/tmp/apache-drill-1.15.0$ bin/drill-embedded > java.lang.NullPointerException > Apache Drill 1.15.0 > "This isn't your grandfather's SQL." > sqlline> select * from (values(1)); > No current connection > sqlline> !q > {code} > For 1.14.0 drill version the output was correct (but too long): > {code:java} > ./bin/drill-embedded > Dec 18, 2018 7:58:47 PM org.glassfish.jersey.server.ApplicationHandler > initialize > INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 > 01:25:26... 
> Error: Failure in starting embedded Drillbit: java.net.BindException: Address > already in use (state=,code=0) > java.sql.SQLException: Failure in starting embedded Drillbit: > java.net.BindException: Address already in use > at > org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:143) > at > org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:72) > at > org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:68) > at > org.apache.calcite.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:138) > at org.apache.drill.jdbc.Driver.connect(Driver.java:72) > at sqlline.DatabaseConnection.connect(DatabaseConnection.java:167) > at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:213) > at sqlline.Commands.connect(Commands.java:1083) > at sqlline.Commands.connect(Commands.java:1015) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36) > at sqlline.SqlLine.dispatch(SqlLine.java:742) > at sqlline.SqlLine.initArgs(SqlLine.java:528) > at sqlline.SqlLine.begin(SqlLine.java:596) > at sqlline.SqlLine.start(SqlLine.java:375) > at sqlline.SqlLine.main(SqlLine.java:268) > Caused by: java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:433) > at sun.nio.ch.Net.bind(Net.java:425) > at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:279) > at > org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) > 
at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:218) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.eclipse.jetty.server.Server.doStart(Server.java:337) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.apache.drill.exec.server.rest.WebServer.start(WebServer.java:155) > at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:200) > at > org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:134) > ... 18 more > apache drill 1.14.0 > "just drill it" > 0: jdbc:drill:zk=local> !q{code} > Looks like it is fine to have a short message in console about the reason of > error, similar to: > {code:java} > java.sql.SQLException: Failure in starting embedded Drillbit: > java.net.BindException: Address already in use > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
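The short-message behavior the report asks for can be sketched as follows (Python for illustration only; Drill's actual startup path is Java): attempt the bind, catch the address-in-use error, and surface one concise line instead of an NPE.

```python
# Sketch of the requested error handling: try the bind and report one
# short message rather than a bare NullPointerException. Illustrative
# only; the real fix would live in Drill's Java startup code.
import errno
import socket

def try_bind(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        return s  # caller owns the bound socket
    except OSError as e:
        s.close()
        if e.errno == errno.EADDRINUSE:
            raise RuntimeError(
                f"Failure in starting embedded Drillbit: port {port} "
                "already in use") from e
        raise

first = try_bind(0)                 # let the OS pick a free port
port = first.getsockname()[1]
try:
    try_bind(port)                  # a second "drillbit" on the same port
except RuntimeError as e:
    print(e)                        # one concise line, not an NPE
first.close()
```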
[jira] [Commented] (DRILL-6912) NPE when other drillbit is already running
[ https://issues.apache.org/jira/browse/DRILL-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724656#comment-16724656 ] Aman Sinha commented on DRILL-6912: --- [~ben-zvi] Thanks, I suppose we should document that env variable (currently, it is set internally in the drill_embedded script) ? I would think quite a few people (me included) run sqlline directly in the embedded mode. I am curious though why [~vitalii]'s command failed since he ran drill_embedded. Let's wait for his confirmation to see if it was an environment issue. > NPE when other drillbit is already running > -- > > Key: DRILL-6912 > URL: https://issues.apache.org/jira/browse/DRILL-6912 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.15.0 >Reporter: Vitalii Diravka >Assignee: Boaz Ben-Zvi >Priority: Critical > > If user tries to run the second drillbit process, the following output will > be obtained: > {code:java} > vitalii@vitalii-pc:/tmp/apache-drill-1.15.0$ bin/drill-embedded > java.lang.NullPointerException > Apache Drill 1.15.0 > "This isn't your grandfather's SQL." > sqlline> select * from (values(1)); > No current connection > sqlline> !q > {code} > For 1.14.0 drill version the output was correct (but too long): > {code:java} > ./bin/drill-embedded > Dec 18, 2018 7:58:47 PM org.glassfish.jersey.server.ApplicationHandler > initialize > INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 > 01:25:26... 
> Error: Failure in starting embedded Drillbit: java.net.BindException: Address > already in use (state=,code=0) > java.sql.SQLException: Failure in starting embedded Drillbit: > java.net.BindException: Address already in use > at > org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:143) > at > org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:72) > at > org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:68) > at > org.apache.calcite.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:138) > at org.apache.drill.jdbc.Driver.connect(Driver.java:72) > at sqlline.DatabaseConnection.connect(DatabaseConnection.java:167) > at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:213) > at sqlline.Commands.connect(Commands.java:1083) > at sqlline.Commands.connect(Commands.java:1015) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36) > at sqlline.SqlLine.dispatch(SqlLine.java:742) > at sqlline.SqlLine.initArgs(SqlLine.java:528) > at sqlline.SqlLine.begin(SqlLine.java:596) > at sqlline.SqlLine.start(SqlLine.java:375) > at sqlline.SqlLine.main(SqlLine.java:268) > Caused by: java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:433) > at sun.nio.ch.Net.bind(Net.java:425) > at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:279) > at > org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) > 
at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:218) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.eclipse.jetty.server.Server.doStart(Server.java:337) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.apache.drill.exec.server.rest.WebServer.start(WebServer.java:155) > at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:200) > at > org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:134) > ... 18 more > apache drill 1.14.0 > "just drill it" > 0: jdbc:drill:zk=local> !q{code} > Looks like it is fine to have a short message in console about the reason of > error, similar to: > {code:java} > java.sql.SQLException: Failure in starting embedded Drillbit: > java.net.BindException: Address already in use > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6912) NPE when other drillbit is already running
[ https://issues.apache.org/jira/browse/DRILL-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724622#comment-16724622 ] Aman Sinha commented on DRILL-6912: --- Odd...I do see the same NPE that [~vitalii] reported: {noformat} [root@aman1 ~]# which sqlline /root/apache-drill-1.15.0-src/distribution/target/apache-drill-1.15.0/apache-drill-1.15.0/bin/sqlline [root@aman1 ~]# sqlline -u jdbc:drill:zk=local -n admin --maxWidth=10 java.lang.NullPointerException Apache Drill 1.15.0 {noformat} > NPE when other drillbit is already running > -- > > Key: DRILL-6912 > URL: https://issues.apache.org/jira/browse/DRILL-6912 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.15.0 >Reporter: Vitalii Diravka >Assignee: Boaz Ben-Zvi >Priority: Critical > > If user tries to run the second drillbit process, the following output will > be obtained: > {code:java} > vitalii@vitalii-pc:/tmp/apache-drill-1.15.0$ bin/drill-embedded > java.lang.NullPointerException > Apache Drill 1.15.0 > "This isn't your grandfather's SQL." > sqlline> select * from (values(1)); > No current connection > sqlline> !q > {code} > For 1.14.0 drill version the output was correct (but too long): > {code:java} > ./bin/drill-embedded > Dec 18, 2018 7:58:47 PM org.glassfish.jersey.server.ApplicationHandler > initialize > INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 > 01:25:26... 
> Error: Failure in starting embedded Drillbit: java.net.BindException: Address > already in use (state=,code=0) > java.sql.SQLException: Failure in starting embedded Drillbit: > java.net.BindException: Address already in use > at > org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:143) > at > org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:72) > at > org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:68) > at > org.apache.calcite.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:138) > at org.apache.drill.jdbc.Driver.connect(Driver.java:72) > at sqlline.DatabaseConnection.connect(DatabaseConnection.java:167) > at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:213) > at sqlline.Commands.connect(Commands.java:1083) > at sqlline.Commands.connect(Commands.java:1015) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36) > at sqlline.SqlLine.dispatch(SqlLine.java:742) > at sqlline.SqlLine.initArgs(SqlLine.java:528) > at sqlline.SqlLine.begin(SqlLine.java:596) > at sqlline.SqlLine.start(SqlLine.java:375) > at sqlline.SqlLine.main(SqlLine.java:268) > Caused by: java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:433) > at sun.nio.ch.Net.bind(Net.java:425) > at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:279) > at > org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) > 
at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:218) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.eclipse.jetty.server.Server.doStart(Server.java:337) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.apache.drill.exec.server.rest.WebServer.start(WebServer.java:155) > at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:200) > at > org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:134) > ... 18 more > apache drill 1.14.0 > "just drill it" > 0: jdbc:drill:zk=local> !q{code} > Looks like it is fine to have a short message in console about the reason of > error, similar to: > {code:java} > java.sql.SQLException: Failure in starting embedded Drillbit: > java.net.BindException: Address already in use > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
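The short-message behavior suggested at the end of that report can be sketched as below. This is an illustrative sketch only, not Drill's actual code; the class and method names (ConciseError, rootCause, shortMessage) are hypothetical:

```java
// Hypothetical sketch: surface only a concise description of a startup
// failure (e.g. the BindException root cause) instead of the full stack trace.
public class ConciseError {

    // Walk the cause chain to the deepest exception.
    static Throwable rootCause(Throwable t) {
        while (t.getCause() != null && t.getCause() != t) {
            t = t.getCause();
        }
        return t;
    }

    // Build a short one-line console message naming the root cause.
    static String shortMessage(Throwable t) {
        Throwable root = rootCause(t);
        return t.getMessage() + " (caused by " + root.getClass().getName() + ")";
    }
}
```

A caller would print shortMessage(e) to the console and keep the full trace in the log file.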
[jira] [Updated] (DRILL-6896) Extraneous columns being projected past a join
[ https://issues.apache.org/jira/browse/DRILL-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6896: -- Fix Version/s: 1.16.0 > Extraneous columns being projected past a join > -- > > Key: DRILL-6896 > URL: https://issues.apache.org/jira/browse/DRILL-6896 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.15.0 >Reporter: Karthikeyan Manivannan >Assignee: Aman Sinha >Priority: Major > Fix For: 1.16.0 > > > [~rhou] noted that TPCH13 on Drill 1.15 was running slower than Drill 1.14. > Analysis revealed that an extra column was being projected in 1.15 and the > slowdown was because the extra column was being unnecessarily pushed across > an exchange. > Here is a simplified query written by [~amansinha100] that exhibits the same > problem : > In first plan, o_custkey and o_comment are both extraneous projections. > In the second plan (on 1.14.0), also, there is an extraneous projection: > o_custkey but not o_comment. > On 1.15.0: > - > {noformat} > explain plan without implementation for > select > c.c_custkey > from >cp.`tpch/customer.parquet` c > left outer join cp.`tpch/orders.parquet` o > on c.c_custkey = o.o_custkey > and o.o_comment not like '%special%requests%' >; > DrillScreenRel > DrillProjectRel(c_custkey=[$0]) > DrillProjectRel(c_custkey=[$2], o_custkey=[$0], o_comment=[$1]) > DrillJoinRel(condition=[=($2, $0)], joinType=[right]) > DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))]) > DrillScanRel(table=[[cp, tpch/orders.parquet]], > groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath > [path=classpath:/tpch/orders.parquet]], > selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, > usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]]) > DrillScanRel(table=[[cp, tpch/customer.parquet]], > groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath > [path=classpath:/tpch/customer.parquet]], > selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, 
numRowGroups=1, > usedMetadataFile=false, columns=[`c_custkey`]]]) > {noformat} > On 1.14.0: > - > {noformat} > DrillScreenRel > DrillProjectRel(c_custkey=[$0]) > DrillProjectRel(c_custkey=[$1], o_custkey=[$0]) > DrillJoinRel(condition=[=($1, $0)], joinType=[right]) > DrillProjectRel(o_custkey=[$0]) > DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))]) > DrillScanRel(table=[[cp, tpch/orders.parquet]], > groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath > [path=classpath:/tpch/orders.parquet]], > selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, > usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]]) > DrillScanRel(table=[[cp, tpch/customer.parquet]], > groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath > [path=classpath:/tpch/customer.parquet]], > selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, > usedMetadataFile=false, columns=[`c_custkey`]]]) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6896) Extraneous columns being projected past a join
[ https://issues.apache.org/jira/browse/DRILL-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718317#comment-16718317 ] Aman Sinha commented on DRILL-6896: --- On further analysis, I can see a similar (not exactly same) extraneous column projection even on 1.14.0 and with an Inner Join query, with the filter condition in the WHERE clause (instead of ON clause). On 1.14.0: {noformat} explain plan without implementation for select c.c_custkey from cp.`tpch/customer.parquet` c inner join cp.`tpch/orders.parquet` o on c.c_custkey = o.o_custkey where o.o_orderkey < 10 and c.c_name like 'abc%' ; +--+--+ | text | json | +--+--+ DrillScreenRel DrillProjectRel(c_custkey=[$0]) DrillProjectRel(c_custkey=[$2], c_name=[$3], o_custkey=[$0], o_orderkey=[$1]) DrillJoinRel(condition=[=($2, $0)], joinType=[inner]) DrillFilterRel(condition=[<($1, 10)]) DrillScanRel(table=[[cp, tpch/orders.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`o_custkey`, `o_orderkey`]]]) DrillFilterRel(condition=[LIKE($1, 'abc%')]) DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/customer.parquet]], selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`c_custkey`, `c_name`]]]) {noformat} Notice above that the DrillProjectRel above the Join is projecting both c_name and o_orderkey even though both these columns are only needed by the filter conditions below the DrillJoinRel. There are 2 possible reasons why they were not pushed down: 1. Costing. For whatever reason, the projection pushdown did not change the cost, so Calcite preserved the original non-pushdown plan. This is less likely compared to #2 below. 2. Potential issues with rule application. 
There were some issues with Calcite's ProjectMergeRule (see https://issues.apache.org/jira/browse/CALCITE-2223) which required a workaround implemented in [DRILL-6212|https://issues.apache.org/jira/browse/DRILL-6212]. A second JIRA: [DRILL-6501|https://issues.apache.org/jira/browse/DRILL-6501] was filed to revert to using ProjectMergeRule later. Considering that 1.14.0 also shows a similar issue (albeit not for the same query), I am inclined to move this JIRA to 1.16.0 release. > Extraneous columns being projected past a join > -- > > Key: DRILL-6896 > URL: https://issues.apache.org/jira/browse/DRILL-6896 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.15.0 >Reporter: Karthikeyan Manivannan >Assignee: Aman Sinha >Priority: Major > > [~rhou] noted that TPCH13 on Drill 1.15 was running slower than Drill 1.14. > Analysis revealed that an extra column was being projected in 1.15 and the > slowdown was because the extra column was being unnecessarily pushed across > an exchange. > Here is a simplified query written by [~amansinha100] that exhibits the same > problem : > In first plan, o_custkey and o_comment are both extraneous projections. > In the second plan (on 1.14.0), also, there is an extraneous projection: > o_custkey but not o_comment. 
> On 1.15.0: > - > {noformat} > explain plan without implementation for > select > c.c_custkey > from >cp.`tpch/customer.parquet` c > left outer join cp.`tpch/orders.parquet` o > on c.c_custkey = o.o_custkey > and o.o_comment not like '%special%requests%' >; > DrillScreenRel > DrillProjectRel(c_custkey=[$0]) > DrillProjectRel(c_custkey=[$2], o_custkey=[$0], o_comment=[$1]) > DrillJoinRel(condition=[=($2, $0)], joinType=[right]) > DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))]) > DrillScanRel(table=[[cp, tpch/orders.parquet]], > groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath > [path=classpath:/tpch/orders.parquet]], > selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, > usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]]) > DrillScanRel(table=[[cp, tpch/customer.parquet]], > groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath > [path=classpath:/tpch/customer.parquet]], > selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, > usedMetadataFile=false, columns=[`c_custkey`]]]) > {noformat} > On 1.14.0: > - > {noformat} > DrillScreenRel > DrillProjectRel(c_custkey=[$0]) > DrillProjectRel(c_custkey=[$1], o_custkey=[$0]) > DrillJoinRel(condition=[=($1, $0)], joinType=[right]) >
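The pruning discussed in this issue amounts to keeping, in the output of the node below a Project, only the columns the Project actually references. A minimal sketch of that idea (illustrative only; not Calcite's or Drill's actual rule code, and ProjectPruner is a hypothetical name):

```java
import java.util.*;

// Sketch of projection pruning past a join: given the columns the parent
// Project references, drop every other column from the child's output
// (e.g. drop o_comment once the join-side filter has consumed it).
public class ProjectPruner {

    static List<String> prune(List<String> childOutput, Set<String> referenced) {
        List<String> kept = new ArrayList<>();
        for (String col : childOutput) {
            if (referenced.contains(col)) {
                kept.add(col);
            }
        }
        return kept;
    }
}
```

In the 1.15.0 plan above, the child output {c_custkey, o_custkey, o_comment} referenced only via c_custkey would prune down to {c_custkey} plus the join key.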
[jira] [Updated] (DRILL-6896) Extraneous columns being projected past a join
[ https://issues.apache.org/jira/browse/DRILL-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6896: -- Description: [~rhou] noted that TPCH13 on Drill 1.15 was running slower than Drill 1.14. Analysis revealed that an extra column was being projected in 1.15 and the slowdown was because the extra column was being unnecessarily pushed across an exchange. Here is a simplified query written by [~amansinha100] that exhibits the same problem : In first plan, o_custkey and o_comment are both extraneous projections. In the second plan (on 1.14.0), also, there is an extraneous projection: o_custkey but not o_comment. On 1.15.0: - {noformat} explain plan without implementation for select c.c_custkey from cp.`tpch/customer.parquet` c left outer join cp.`tpch/orders.parquet` o on c.c_custkey = o.o_custkey and o.o_comment not like '%special%requests%' ; DrillScreenRel DrillProjectRel(c_custkey=[$0]) DrillProjectRel(c_custkey=[$2], o_custkey=[$0], o_comment=[$1]) DrillJoinRel(condition=[=($2, $0)], joinType=[right]) DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))]) DrillScanRel(table=[[cp, tpch/orders.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]]) DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/customer.parquet]], selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`c_custkey`]]]) {noformat} On 1.14.0: - {noformat} DrillScreenRel DrillProjectRel(c_custkey=[$0]) DrillProjectRel(c_custkey=[$1], o_custkey=[$0]) DrillJoinRel(condition=[=($1, $0)], joinType=[right]) DrillProjectRel(o_custkey=[$0]) DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))]) DrillScanRel(table=[[cp, 
tpch/orders.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]]) DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/customer.parquet]], selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`c_custkey`]]]) {noformat} was: [~rhou] noted that TPCH13 on Drill 1.15 was running slower than Drill 1.14. Analysis revealed that an extra column was being projected in 1.15 and the slowdown was because the extra column was being unnecessarily pushed across an exchange. Here is a simplified query written by [~amansinha100] that exhibits the same problem : In first plan, o_custkey and o_comment are both extraneous projections. In the second plan (on 1.14.0), also, there is an extraneous projection: o_custkey but not o_comment. 
On 1.15.0: - explain plan without implementation for select c.c_custkey from cp.`tpch/customer.parquet` c left outer join cp.`tpch/orders.parquet` o on c.c_custkey = o.o_custkey and o.o_comment not like '%special%requests%' ; DrillScreenRel DrillProjectRel(c_custkey=[$0]) DrillProjectRel(c_custkey=[$2], o_custkey=[$0], o_comment=[$1]) DrillJoinRel(condition=[=($2, $0)], joinType=[right]) DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))]) DrillScanRel(table=[[cp, tpch/orders.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]]) DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/customer.parquet]], selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`c_custkey`]]]) On 1.14.0: - DrillScreenRel DrillProjectRel(c_custkey=[$0]) DrillProjectRel(c_custkey=[$1], o_custkey=[$0]) DrillJoinRel(condition=[=($1, $0)], joinType=[right]) DrillProjectRel(o_custkey=[$0]) DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))]) DrillScanRel(table=[[cp, tpch/orders.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]]) DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan
[jira] [Updated] (DRILL-6896) Extraneous columns being projected past a join
[ https://issues.apache.org/jira/browse/DRILL-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6896: -- Summary: Extraneous columns being projected past a join (was: Extraneous columns being projected in Drill 1.15) > Extraneous columns being projected past a join > -- > > Key: DRILL-6896 > URL: https://issues.apache.org/jira/browse/DRILL-6896 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.15.0 >Reporter: Karthikeyan Manivannan >Assignee: Aman Sinha >Priority: Major > > [~rhou] noted that TPCH13 on Drill 1.15 was running slower than Drill 1.14. > Analysis revealed that an extra column was being projected in 1.15 and the > slowdown was because the extra column was being unnecessarily pushed across > an exchange. > Here is a simplified query written by [~amansinha100] that exhibits the same > problem : > In first plan, o_custkey and o_comment are both extraneous projections. > In the second plan (on 1.14.0), also, there is an extraneous projection: > o_custkey but not o_comment. 
> On 1.15.0: > - > explain plan without implementation for > select > c.c_custkey > from > cp.`tpch/customer.parquet` c > left outer join cp.`tpch/orders.parquet` o > on c.c_custkey = o.o_custkey > and o.o_comment not like '%special%requests%' > ; > DrillScreenRel > DrillProjectRel(c_custkey=[$0]) > DrillProjectRel(c_custkey=[$2], o_custkey=[$0], o_comment=[$1]) > DrillJoinRel(condition=[=($2, $0)], joinType=[right]) > DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))]) > DrillScanRel(table=[[cp, tpch/orders.parquet]], groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], > selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, > usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]]) > DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=classpath:/tpch/customer.parquet]], > selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, > usedMetadataFile=false, columns=[`c_custkey`]]]) > On 1.14.0: > - > DrillScreenRel > DrillProjectRel(c_custkey=[$0]) > DrillProjectRel(c_custkey=[$1], o_custkey=[$0]) > DrillJoinRel(condition=[=($1, $0)], joinType=[right]) > DrillProjectRel(o_custkey=[$0]) > DrillFilterRel(condition=[NOT(LIKE($1, '%special%requests%'))]) > DrillScanRel(table=[[cp, tpch/orders.parquet]], groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], > selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, > usedMetadataFile=false, columns=[`o_custkey`, `o_comment`]]]) > DrillScanRel(table=[[cp, tpch/customer.parquet]], groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=classpath:/tpch/customer.parquet]], > selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, > usedMetadataFile=false, columns=[`c_custkey`]]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6876) Enable CircleCI builds for JDK 9-11
[ https://issues.apache.org/jira/browse/DRILL-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6876: -- Fix Version/s: (was: Future) 1.15.0 > Enable CircleCI builds for JDK 9-11 > --- > > Key: DRILL-6876 > URL: https://issues.apache.org/jira/browse/DRILL-6876 > Project: Apache Drill > Issue Type: Sub-task > Components: Tools, Build Test >Affects Versions: 1.14.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Labels: ready-to-commit > Fix For: 1.15.0 > > > Currently, CircleCI runs tests for JDK 8, but Drill also supports JDK 9-11. > The aim of this Jira is to allow running tests for JDK9-11. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6882) Handle the cases where RowKeyJoin's left pipeline being called multiple times.
[ https://issues.apache.org/jira/browse/DRILL-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6882: -- Labels: ready-to-commit (was: ) > Handle the cases where RowKeyJoin's left pipeline being called multiple times. > -- > > Key: DRILL-6882 > URL: https://issues.apache.org/jira/browse/DRILL-6882 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.15.0 >Reporter: Gautam Parai >Assignee: Gautam Parai >Priority: Blocker > Labels: ready-to-commit > Fix For: 1.15.0 > > > RowKeyJoin operator's leftStream pipeline uses IterOutcome.NONE state to > repeat the pipeline multiple times. This should be avoided as > IterOutcome.NONE will trip assertion checks in the validator. We ran into this issue > when trying to fix DRILL-6878. > Sample unit test failure from DRILL-6878 > select t1.id.ssn as ssn from hbase.`index_test_primary` t1, > hbase.`index_test_primary` t2 where t1._id = t2.rowid and > cast(t2.activity.irs.firstlogin as timestamp) = to_timestamp('2013-02-04 > 22:34:38.0', '-MM-dd HH:mm:ss.S') > at > org.apache.drill.test.DrillTestWrapper.compareMergedOnHeapVectors(DrillTestWrapper.java:630) > ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT] > at > org.apache.drill.test.DrillTestWrapper.compareOrderedResults(DrillTestWrapper.java:584) > ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT] > at org.apache.drill.test.DrillTestWrapper.run(DrillTestWrapper.java:157) > ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT] > at org.apache.drill.test.TestBuilder.go(TestBuilder.java:139) > ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT] > at > com.mapr.drill.maprdb.tests.index.IndexPlanTest.testRowkeyJoinPushdown_7(IndexPlanTest.java:1828) > ~[test-classes/:na] > at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_141] > Caused by: org.apache.drill.exec.rpc.RpcException: > org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: > IllegalStateException: next() [on #851, 
ScanBatch] called again after it > returned NONE. Caller should not have called next() again. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
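The protocol rule behind the IllegalStateException above can be sketched as follows. This is an illustrative model only (IterGuard is a hypothetical name, not one of Drill's actual iterator classes): once an upstream returns NONE, calling next() again is a violation that the batch validator flags.

```java
// Sketch of the iterator-protocol rule: a batch source may be polled until it
// returns NONE; any next() call after NONE is a protocol violation.
public class IterGuard {
    enum IterOutcome { OK, NONE }

    private boolean sawNone = false;
    private int remaining;

    IterGuard(int batches) { this.remaining = batches; }

    IterOutcome next() {
        if (sawNone) {
            // Mirrors the validator's complaint quoted in the failure above.
            throw new IllegalStateException(
                "next() called again after it returned NONE");
        }
        if (remaining-- > 0) {
            return IterOutcome.OK;
        }
        sawNone = true;
        return IterOutcome.NONE;
    }
}
```

RowKeyJoin's use of NONE to restart its left pipeline breaks this contract, which is why the fix is to avoid repeating the pipeline via that state.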
[jira] [Commented] (DRILL-6839) Failed to plan (aggregate + Hash or NL join) when slice target is low
[ https://issues.apache.org/jira/browse/DRILL-6839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690149#comment-16690149 ] Aman Sinha commented on DRILL-6839: --- [~vvysotskyi] it is certainly possible to change the `StreamAggPrule` to generate single-phase agg if for some reason the 2 phase agg is not possible. It would be a workaround but the performance will be poor because after a `NestedLoopJoin` the output row count would increase substantially due to the cross join, so we should ideally do local aggregation first before sending all those rows to the foreman node. After doing local aggregation, we are only sending M rows to the foreman where M is the number of minor fragments, whereas otherwise it will send `O(N^2)` rows to the foreman where N is the number of input rows. So, it would be useful to determine why the 2 phase StreamingAgg was not created in this example. Looking at the code, it is creating the distribution traits using the distribution traits of the child. If there is no GROUP-BY in the aggregation, then I would expect that DrillDistributionTrait.ANY should be sufficient. It would be good to check if that is happening. 
> Failed to plan (aggregate + Hash or NL join) when slice target is low > -- > > Key: DRILL-6839 > URL: https://issues.apache.org/jira/browse/DRILL-6839 > Project: Apache Drill > Issue Type: Bug >Reporter: Igor Guzenko >Priority: Major > Fix For: 1.16.0 > > > *Case 1.* When nested loop join is about to be used: > - Option "_planner.enable_nljoin_for_scalar_only_" is set to false > - Option "_planner.slice_target_" is set to low value for imitation of big > input tables > > {code:java} > @Category(SqlTest.class) > public class CrossJoinTest extends ClusterTest { > @BeforeClass > public static void setUp() throws Exception { > startCluster(ClusterFixture.builder(dirTestWatcher)); > } > @Test > public void testCrossJoinSucceedsForLowSliceTarget() throws Exception { >try { > client.alterSession(PlannerSettings.NLJOIN_FOR_SCALAR.getOptionName(), > false); > client.alterSession(ExecConstants.SLICE_TARGET, 1); > queryBuilder().sql( > "SELECT COUNT(l.nation_id) " + > "FROM cp.`tpch/nation.parquet` l " + > ", cp.`tpch/region.parquet` r") > .run(); >} finally { > client.resetSession(ExecConstants.SLICE_TARGET); > client.resetSession(PlannerSettings.NLJOIN_FOR_SCALAR.getOptionName()); >} > } > }{code} > > *Case 2.* When hash join is about to be used: > - Option "planner.enable_mergejoin" is set to false, so hash join will be > used instead > - Option "planner.slice_target" is set to low value for imitation of big > input tables > - Comment out //ruleList.add(HashJoinPrule.DIST_INSTANCE); in > PlannerPhase.getPhysicalRules method > {code:java} > @Category(SqlTest.class) > public class CrossJoinTest extends ClusterTest { > @BeforeClass > public static void setUp() throws Exception { >startCluster(ClusterFixture.builder(dirTestWatcher)); > } > @Test > public void testInnerJoinSucceedsForLowSliceTarget() throws Exception { >try { > client.alterSession(PlannerSettings.MERGEJOIN.getOptionName(), false); > client.alterSession(ExecConstants.SLICE_TARGET, 1); > queryBuilder().sql( > 
"SELECT COUNT(l.nation_id) " + > "FROM cp.`tpch/nation.parquet` l " + > "INNER JOIN cp.`tpch/region.parquet` r " + > "ON r.nation_id = l.nation_id") > .run(); >} finally { > client.resetSession(ExecConstants.SLICE_TARGET); > client.resetSession(PlannerSettings.MERGEJOIN.getOptionName()); >} > } > } > {code} > > *Workaround:* To avoid the exception we need to set option > "_planner.enable_multiphase_agg_" to false. By doing this we avoid > unsuccessful attempts to create 2 phase aggregation plan in StreamAggPrule > and guarantee that logical aggregate will be converted to physical one. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
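The two-phase aggregation argument in the comment above (M partial rows to the foreman instead of O(N^2) joined rows) can be sketched as below. This is an illustration of the idea only, not StreamAggPrule or any Drill operator; the names are hypothetical:

```java
import java.util.List;

// Sketch of two-phase COUNT: each minor fragment counts its own joined rows
// locally (phase 1), and the foreman merely sums the M partial counts
// (phase 2), instead of receiving every cross-joined row.
public class TwoPhaseCount {

    // Phase 1: local count within one minor fragment.
    static long localCount(List<?> fragmentRows) {
        return fragmentRows.size();
    }

    // Phase 2: foreman merges the M partial counts.
    static long mergeCounts(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }
}
```

With single-phase aggregation, mergeCounts would instead have to consume all joined rows on the foreman, which is the performance concern raised above.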
[jira] [Commented] (DRILL-6829) Handle schema change in ExternalSort
[ https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677656#comment-16677656 ] Aman Sinha commented on DRILL-6829: --- [~Paul.Rogers] I want to clarify .. by 'union' of two incompatible schemas I did not mean using the union type. I meant the union like operation that we normally do for record batches. Step #7 in my first comment is about doing this cross-schema-union. Suppose there are 3 record batches each with different schema for the sort key. These will be sitting in separate internal queues of the blocking operator and each will be individually sorted. The cross-schema-union will traverse these queues in a certain order (e.g all Numeric types appear first, followed by all String types, followed by Date types) consuming all batches from the first queue and emitting them, followed by second queue and so on. > Handle schema change in ExternalSort > > > Key: DRILL-6829 > URL: https://issues.apache.org/jira/browse/DRILL-6829 > Project: Apache Drill > Issue Type: New Feature >Reporter: Aman Sinha >Priority: Major > > While we continue to enhance the schema provision and metastore aspects in > Drill, we also should explore what it means to be truly schema-less such that > we can better handle \{semi, un}structured data, data sitting in DBs that > store JSON documents (e.g Mongo, MapR-DB). > > The blocking operators are the main hurdles in this goal (other operators > also need to be smarter about this but the problem is harder for the blocking > operators). This Jira is specifically about ExternalSort. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6829) Handle schema change in ExternalSort
[ https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677245#comment-16677245 ] Aman Sinha commented on DRILL-6829: --- Hi [~Paul.Rogers], the fundamental assumption you are making is that all input types need to be converted to a common type in order to generate a global sort. But let's drop the requirement of a global sort for the moment. If you examine the approach I described, I am proposing that each distinct schema be sorted independently and only compatible types need to be merged together to produce a common output schema. Incompatible types just need a 'union'. It just occurred to me to check MongoDB's sort behavior and it does not seem much different from what I am proposing. Please see [1]. So, if a user is querying NoSQL DB such as Mongo, their client applications already must be aware of schema changes. How can we provide the same functionality through SQL ? Since Drill's key differentiator is to provide a SQL capability to a wide variety of data sources, this seems like a must-have capability otherwise why would a Mongo user migrate to Drill+Mongo despite the enormous benefits of using a SQL interface ? (BTW, I am just using MongoDB as an example here). That said, I am under no illusion that it is a much more complex task to do this in Drill. But, let's chip away the layers and see what is feasible. [1] https://docs.mongodb.com/manual/reference/bson-type-comparison-order/#bson-types-comparison-order > Handle schema change in ExternalSort > > > Key: DRILL-6829 > URL: https://issues.apache.org/jira/browse/DRILL-6829 > Project: Apache Drill > Issue Type: New Feature >Reporter: Aman Sinha >Priority: Major > > While we continue to enhance the schema provision and metastore aspects in > Drill, we also should explore what it means to be truly schema-less such that > we can better handle \{semi, un}structured data, data sitting in DBs that > store JSON documents (e.g Mongo, MapR-DB). 
> > The blocking operators are the main hurdles in this goal (other operators > also need to be smarter about this but the problem is harder for the blocking > operators). This Jira is specifically about ExternalSort. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
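The default cross-type ordering floated in this thread (numeric values first, then strings, then dates, similar in spirit to MongoDB's BSON type comparison order) could look like the comparator below. This is an illustrative sketch, not Drill code; CrossTypeOrder and typeRank are hypothetical names:

```java
import java.util.Comparator;
import java.util.Date;

// Sketch of a cross-type sort order: incompatible types are ordered by a
// fixed type rank; values of comparable types are compared by value.
public class CrossTypeOrder implements Comparator<Object> {

    // Lower rank sorts first: numerics, then strings, then dates.
    static int typeRank(Object v) {
        if (v instanceof Number) return 0;
        if (v instanceof String) return 1;
        if (v instanceof Date)   return 2;
        return 3;
    }

    @Override
    public int compare(Object a, Object b) {
        int ra = typeRank(a), rb = typeRank(b);
        if (ra != rb) {
            return Integer.compare(ra, rb);  // incompatible schemas: type rank decides
        }
        switch (ra) {
            case 0: return Double.compare(((Number) a).doubleValue(),
                                          ((Number) b).doubleValue());
            case 1: return ((String) a).compareTo((String) b);
            case 2: return ((Date) a).compareTo((Date) b);
            default: return 0;
        }
    }
}
```

As the comment notes, such a policy would need to be externally documented so BI tools know what order to expect.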
[jira] [Commented] (DRILL-6829) Handle schema change in ExternalSort
[ https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676056#comment-16676056 ] Aman Sinha commented on DRILL-6829: --- Here's one approach. Other approaches are also welcome and we can consider the pros and cons. Consider a Sort on (a1, b1) *Scenario 1*: schema change from one numeric type to another numeric type Batch 1: a1: INT, b1: INT (schema 1) Batch 2: a1: FLOAT, b1: INT (schema 2) *Scenario 2*: schema change from numeric to string type Batch 1: a1: INT, b1: INT (schema 1) Batch 2: a1: STRING, b1: INT (schema 3) *Scenario 3*: schema change from numeric to string to float Batch 1: a1: INT, b1: INT (schema 1) Batch 2: a1: STRING, b1: INT (schema 3) Batch 3: a1: FLOAT, b1: INT (schema 2) Let's define *compatible schemas:* schema 1 is compatible with schema 2 since INT and FLOAT are both numeric types. schema 1 and 2 are not compatible with schema 3; technically one could cast the numeric types to string, but if the user has not explicitly done a CAST then we cannot impose it. Define *sort order* among different schemas: For compatible schemas, this is straightforward. For instance, sort order among all numeric types: INT, BIGINT, FLOAT, DOUBLE. (TODO: need to define behavior for Decimal type). What's the sort order between incompatible schemas? We can have a policy that says by default numeric values appear first, followed by String values, followed by Date values etc. This is similar to NULLS first/NULLS last. This policy will appear in external documentation so users and BI tools are aware of the expected order. In the future we may add a declarative way of changing the default sort order between incompatible schemas. 
Define the *output type* for compatible schemas: The output type of compatible schemas will be the 'higher' data type, for example: (INT, FLOAT) ==> FLOAT (DATE, TIMESTAMP) ==> TIMESTAMP *Execution steps* On each minor fragment that contains Sort, the algorithm would be as follows: # Start with a single sorter internally based on the input record batch's schema. If the input schema changes in a new record batch, start a new sorter. Thus, there will be N sorters internally within the Sort operator corresponding to the N schemas. NOTE: in most practical cases, N is quite small. We can have a default 'schema change threshold' of 3. Beyond that, we would still throw a SchemaChangeException. # Each sorter maintains a separate list of batches for each schema. # Each sorter sorts the rows for all batches belonging to that schema, spilling as necessary. # Create as many 'mergers' as the number of compatible schema categories. Thus in scenario 1 there will be 1 merger while scenario 3 will have 2 mergers. # Do a local merge of the data within each compatible category. The generated code that does the comparison will need to handle comparisons of INT vs BIGINT, vs FLOAT etc. # This also means that an operator needs to be able to use more than 1 version of generated code. The exact mechanics of this needs to be determined. If it is not feasible, then we may think of running in non-code-gen mode when we encounter schema change.. it will be slower but at least the query will complete. # Once the individual merges have been done within each compatible schema category, do a cross-schema-union. This simply unions the results of the prior merge. The union will preserve the relative ordering among incompatible schemas. # After the union, each Output Batch that is created will have the schema it inherits from each side of the union. So if there are 3 inputs to the union, there will be at least 3 output batches created each with a different schema. 
Here are the steps that would need to happen internally in the Sort operator for Scenario 3: {noformat}
            Output batch
                 |
        Cross-Schema-Union
        /                \
     Merge           Sort (schema 3)
    /     \
  Sort    Sort
(schema 1) (schema 2)
{noformat} On the foreman node that does the final Merge, there will again be a two-step process: first do the individual merge of compatible schemas coming from separate minor fragments, followed by the Cross-schema-Union. *Schema change due to NULLABILITY:* In all the above scenarios, the data type may be nullable or non-nullable. If the schema change occurs because a type changed from non-nullable to nullable, it would be handled the same way as we would handle a 'compatible' schema change since a nullable INT type is compatible with a non-nullable INT type. *Schema change from Scalar type to Array type:* This may occur for example when a single phone number changes to an array of phone numbers in a customer table. Doing a Sort on an array does not make sense but
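The Scenario 3 pipeline described in the steps above can be sketched end to end as follows. This is an illustrative model only (CrossSchemaUnion is a hypothetical name, and real batches are vectors, not Java lists): sort the INT and FLOAT runs separately, merge them as compatible numeric schemas by comparing widened values, then union the merged numeric run with the independently sorted STRING run, numerics first per the proposed default order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of per-schema sort, compatible-schema merge, and cross-schema union.
public class CrossSchemaUnion {

    static List<Object> sortAndUnion(List<Integer> ints,
                                     List<Double> floats,
                                     List<String> strings) {
        // Step 3: each per-schema sorter sorts its own batches.
        List<Integer> s1 = new ArrayList<>(ints);
        List<Double> s2 = new ArrayList<>(floats);
        List<String> s3 = new ArrayList<>(strings);
        Collections.sort(s1);
        Collections.sort(s2);
        Collections.sort(s3);

        // Step 5: merge the compatible numeric schemas; INT vs FLOAT values
        // are compared after widening to double.
        List<Object> merged = new ArrayList<>();
        int i = 0, j = 0;
        while (i < s1.size() && j < s2.size()) {
            if (s1.get(i) <= s2.get(j)) merged.add(s1.get(i++));
            else merged.add(s2.get(j++));
        }
        while (i < s1.size()) merged.add(s1.get(i++));
        while (j < s2.size()) merged.add(s2.get(j++));

        // Step 7: the cross-schema union preserves the relative order between
        // incompatible schemas (numeric values before strings).
        merged.addAll(s3);
        return merged;
    }
}
```

Step 8's observation holds here too: the output alternates schemas only at the union boundary, so each emitted batch carries a single schema.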
[jira] [Created] (DRILL-6829) Handle schema change in ExternalSort
Aman Sinha created DRILL-6829: - Summary: Handle schema change in ExternalSort Key: DRILL-6829 URL: https://issues.apache.org/jira/browse/DRILL-6829 Project: Apache Drill Issue Type: New Feature Reporter: Aman Sinha While we continue to enhance the schema provision and metastore aspects in Drill, we also should explore what it means to be truly schema-less such that we can better handle \{semi, un}structured data, data sitting in DBs that store JSON documents (e.g Mongo, MapR-DB). The blocking operators are the main hurdles in this goal (other operators also need to be smarter about this but the problem is harder for the blocking operators). This Jira is specifically about ExternalSort. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6818) Add descriptions to secondary index options
[ https://issues.apache.org/jira/browse/DRILL-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6818: -- Labels: doc-impacting (was: ) > Add descriptions to secondary index options > --- > > Key: DRILL-6818 > URL: https://issues.apache.org/jira/browse/DRILL-6818 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.15.0 >Reporter: Arina Ielchiieva >Assignee: Aman Sinha >Priority: Major > Labels: doc-impacting > Fix For: 1.15.0 > > > Add descriptions to secondary index options in ExecConstants and > PlannerSettings: > index plan related options == > planner.use_simple_optimizer: false, > planner.enable_index_planning: true, > planner.enable_statistics: true, > planner.disable_full_table_scan: false, > planner.index.max_chosen_indexes_per_table: 5, > planner.index.force_sort_noncovering: false, > planner.index.use_hashjoin_noncovering: false, > planner.index.covering_selectivity_threshold: 1.0, > planner.index.noncovering_selectivity_threshold: 0.025, > planner.index.rowkeyjoin_cost_factor: 0.1, > planner.index.statistics_rowcount_scaling_factor: 0.1, > planner.index.prefer_intersect_plans: false, > planner.index.max_indexes_to_intersect: 5, > exec.query.rowkeyjoin_batchsize: 128, -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6822) Fix imports for DRILL-6381
[ https://issues.apache.org/jira/browse/DRILL-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6822: -- Fix Version/s: 1.15.0 > Fix imports for DRILL-6381 > -- > > Key: DRILL-6822 > URL: https://issues.apache.org/jira/browse/DRILL-6822 > Project: Apache Drill > Issue Type: Task >Reporter: Venkata Jyothsna Donapati >Assignee: Venkata Jyothsna Donapati >Priority: Major > Fix For: 1.15.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6381) Add capability to do index based planning and execution
[ https://issues.apache.org/jira/browse/DRILL-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6381: -- Reviewer: Vitalii Diravka > Add capability to do index based planning and execution > --- > > Key: DRILL-6381 > URL: https://issues.apache.org/jira/browse/DRILL-6381 > Project: Apache Drill > Issue Type: New Feature > Components: Execution - Relational Operators, Query Planning > Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Labels: ready-to-commit > Fix For: 1.15.0 > > > If the underlying data source supports indexes (primary and secondary > indexes), Drill should leverage those during planning and execution in order > to improve query performance. > On the planning side, Drill planner should be enhanced to provide an > abstraction layer which express the index metadata and statistics. Further, > a cost-based index selection is needed to decide which index(es) are > suitable. > On the execution side, appropriate operator enhancements would be needed to > handle different categories of indexes such as covering, non-covering > indexes, taking into consideration the index data may not be co-located with > the primary table, i.e a global index. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6381) Add capability to do index based planning and execution
[ https://issues.apache.org/jira/browse/DRILL-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6381: -- Labels: ready-to-commit (was: ) > Add capability to do index based planning and execution > --- > > Key: DRILL-6381 > URL: https://issues.apache.org/jira/browse/DRILL-6381 > Project: Apache Drill > Issue Type: New Feature > Components: Execution - Relational Operators, Query Planning > Optimization >Reporter: Aman Sinha >Assignee: Aman Sinha >Priority: Major > Labels: ready-to-commit > Fix For: 1.15.0 > > > If the underlying data source supports indexes (primary and secondary > indexes), Drill should leverage those during planning and execution in order > to improve query performance. > On the planning side, Drill planner should be enhanced to provide an > abstraction layer which express the index metadata and statistics. Further, > a cost-based index selection is needed to decide which index(es) are > suitable. > On the execution side, appropriate operator enhancements would be needed to > handle different categories of indexes such as covering, non-covering > indexes, taking into consideration the index data may not be co-located with > the primary table, i.e a global index. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (DRILL-6381) Add capability to do index based planning and execution
[ https://issues.apache.org/jira/browse/DRILL-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613710#comment-16613710 ] Aman Sinha edited comment on DRILL-6381 at 9/21/18 3:28 PM:
The index planning and execution feature has been in development in a private repository at MapR. We would like to contribute it to Apache Drill and hope that it spurs further development and adoption by the community. The main ideas are described below. The feature is divided into two broad categories:
# A comprehensive index planning and execution framework which supports distributed covering and non-covering indices. Index planning is done for WHERE clause filters, sort-based operations such as ORDER BY and GROUP BY (using StreamingAggregate), and joins (using MergeJoin). The framework is intended to be agnostic to the storage plugins. It provides a clean abstraction layer that allows the Drill planner and executor to work with only core Drill artifacts while storage plugins provide concrete implementations of the interfaces.
# A reference implementation with the MapR-DB JSON plugin, whose backend supports secondary indexing. Other similar DB plugins whose backends support secondary indices could potentially use the reference implementation as a guide.
Note that Drill is a query engine; it does not provide the Create/Update/Delete operations on either the primary table or the indexes. These are assumed to be maintained by the respective backend servers with which Drill communicates via storage/format plugins.
+*Key design concepts:*+
*Covering index:* An index whose index fields plus non-index fields (a.k.a. included fields) 'cover' all the columns referenced in the query. The Drill planner will generate a covering (a.k.a. index-only) plan where all the columns are retrieved from the index after pushing down relevant filter conditions to the index scan.
*Non-covering index:* An index whose index fields plus included fields only partially cover the columns referenced in the query. For instance, suppose the index is created as follows: index keys: \{a, b}, included fields: \{c}, and the query is SELECT d, e FROM T WHERE a > 10 AND b < 20. In this case, since columns d, e are not present in the index at all, this is a non-covering index. For such indexes, the Drill planner will generate a non-covering plan where only the row ids are fetched from the index by pushing down the WHERE clause filters, and the rest of the columns are fetched after a join-back to the primary table. The join-back is performed using the row ids. A related notion is that of a Global index: the Drill planner assumes indexes are global in nature, i.e. the index blocks are not necessarily co-located with the primary table's data blocks. This is the most general case, since an index may be quite large and, in order to fully utilize the cluster resources, it is best to have it fully distributed.
*Functional index:* An index which is created not on the base table columns but on functions/expressions. Currently, only CAST functions have been tested since these are most commonly used in Drill views. If the filter condition is 'WHERE CAST(zip_code as BIGINT) = 95120' and a functional index exists on CAST(zip_code as BIGINT), then the Drill planner will leverage such indexes as long as they are exposed through appropriate metadata interfaces.
*Range partitioning:* This is applicable for non-covering indexes. Since the index is typically distributed across multiple nodes in the cluster, once we retrieve the row ids from a particular node, we have to send each row id to the appropriate destination node which contains the primary table data for that row. This is done through range partitioning. The Drill executor contains a special operator (RangePartitionRecordBatch) and a corresponding exchange (RangePartitionExchange) for this purpose.
For example, suppose the filter condition is WHERE state IN ('CA', 'TX', 'NY'). Since the primary table data for these states may be spread out over multiple nodes, once the row ids are fetched from the index, they have to be grouped into separate 'ranges' such that co-located row ids can be sent to the same node.
*Rowkey Join and Restricted Scan (skip scan):* A Rowkey (a.k.a. rowid) join is used whenever we need to fetch the rows from the primary table for a set of row ids retrieved from the index. Note that this is not a real join but more of a lookup based on rowid. This is a random I/O operation on the primary table, since each row id may access a separate data block from the table. The Drill executor contains a special operator (RowKeyJoinBatch) for doing the join-back. The right side of the RowKeyJoin is the index sub-plan. The left side contains a special type of scan called Restricted Scan (or skip scan). The storage plugin must implement this scan where a list of row ids can be submitted at
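The routing decision made by the range-partitioning step above can be sketched as a binary search over per-node range boundaries. Everything here (the boundary representation, the node numbering) is a hypothetical simplification of what RangePartitionRecordBatch would do; in Drill the boundaries would come from the table's region/tablet metadata.

```java
import java.util.Arrays;

// Sketch of routing a row id to the node hosting its primary-table range.
// upperBounds[i] is the (sorted, exclusive) upper bound of node i's range.
public class RangePartitioner {
    private final long[] upperBounds;

    RangePartitioner(long[] upperBounds) {
        this.upperBounds = upperBounds;
    }

    // Destination = index of the first bound strictly greater than rowId.
    int destinationNode(long rowId) {
        int idx = Arrays.binarySearch(upperBounds, rowId);
        // binarySearch returns the match index, or -(insertionPoint) - 1
        return idx >= 0 ? idx + 1 : -idx - 1;
    }
}
```

For bounds {100, 200, 300}, row id 50 routes to node 0, 100 to node 1 (exclusive upper bound), and 250 to node 2.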
[jira] [Commented] (DRILL-6381) Add capability to do index based planning and execution
[ https://issues.apache.org/jira/browse/DRILL-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613710#comment-16613710 ] Aman Sinha commented on DRILL-6381: ---
The index planning and execution feature has been in development in a private repository at MapR. We would like to contribute it to Apache Drill and hope that it spurs further development and adoption by the community. The main ideas are described below. The feature is divided into two broad categories:
# A comprehensive index planning and execution framework which supports distributed covering and non-covering indices. Index planning is done for WHERE clause filters, sort-based operations such as ORDER BY and GROUP BY (using StreamingAggregate), and joins (using MergeJoin). The framework is intended to be agnostic to the storage plugins. It provides a clean abstraction layer that allows the Drill planner and executor to work with only core Drill artifacts while storage plugins provide concrete implementations of the interfaces.
# A reference implementation with the MapR-DB JSON plugin, whose backend supports secondary indexing. Other similar DB plugins whose backends support secondary indices could potentially use the reference implementation as a guide.
Note that Drill is a query engine; it does not provide CRUD operations on either the primary table or the indexes. These are assumed to be maintained by the respective backend servers with which Drill communicates via storage/format plugins.
+*Key design concepts:*+
*Covering index:* An index whose index fields plus non-index fields (a.k.a. included fields) 'cover' all the columns referenced in the query. The Drill planner will generate a covering (a.k.a. index-only) plan where all the columns are retrieved from the index after pushing down relevant filter conditions to the index scan.
*Non-covering index:* An index whose index fields plus included fields only partially cover the columns referenced in the query.
For instance, suppose the index is created as follows: index keys: \{a, b}, included fields: \{c}, and the query is SELECT d, e FROM T WHERE a > 10 AND b < 20. In this case, since columns d, e are not present in the index at all, this is a non-covering index. For such indexes, the Drill planner will generate a non-covering plan where only the row ids are fetched from the index by pushing down the WHERE clause filters, and the rest of the columns are fetched after a join-back to the primary table. The join-back is performed using the row ids. A related notion is that of a Global index: the Drill planner assumes indexes are global in nature, i.e. the index blocks are not necessarily co-located with the primary table's data blocks. This is the most general case, since an index may be quite large and, in order to fully utilize the cluster resources, it is best to have it fully distributed.
*Functional index:* An index which is created not on the base table columns but on functions/expressions. Currently, only CAST functions have been tested since these are most commonly used in Drill views. If the filter condition is 'WHERE CAST(zip_code as BIGINT) = 95120' and a functional index exists on CAST(zip_code as BIGINT), then the Drill planner will leverage such indexes as long as they are exposed through appropriate metadata interfaces.
*Range partitioning:* This is applicable for non-covering indexes. Since the index is typically distributed across multiple nodes in the cluster, once we retrieve the row ids from a particular node, we have to send each row id to the appropriate destination node which contains the primary table data for that row. This is done through range partitioning. The Drill executor contains a special operator (RangePartitionRecordBatch) and a corresponding exchange (RangePartitionExchange) for this purpose. For example, suppose the filter condition is WHERE state IN ('CA', 'TX', 'NY').
Since the primary table data for these states may be spread out over multiple nodes, once the row ids are fetched from the index, they have to be grouped into separate 'ranges' such that co-located row ids can be sent to the same node.
*Rowkey Join and Restricted Scan (skip scan):* A Rowkey (a.k.a. rowid) join is used whenever we need to fetch the rows from the primary table for a set of row ids retrieved from the index. Note that this is not a real join but more of a lookup based on rowid. This is a random I/O operation on the primary table, since each row id may access a separate data block from the table. The Drill executor contains a special operator (RowKeyJoinBatch) for doing the join-back. The right side of the RowKeyJoin is the index sub-plan. The left side contains a special type of scan called Restricted Scan (or skip scan). The storage plugin must implement this scan where a list of row ids can be submitted at a time if the backend supports bulk fetch. Otherwise, single
[jira] [Updated] (DRILL-6651) Compilation error in IDE due to missing package name
[ https://issues.apache.org/jira/browse/DRILL-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6651: -- Labels: ready-to-commit (was: ) > Compilation error in IDE due to missing package name > > > Key: DRILL-6651 > URL: https://issues.apache.org/jira/browse/DRILL-6651 > Project: Apache Drill > Issue Type: Bug >Reporter: Aman Sinha >Assignee: Boaz Ben-Zvi >Priority: Major > Labels: ready-to-commit > Fix For: 1.14.0, 1.15.0 > > > I am seeing the following compilation errors in my Eclipse build (only in the > IDE.. this does not happen on the maven command line): > {noformat} > The declared package "" does not match the expected package > "org.apache.drill.exec.store.parquet" TestComplexColumnInSchema.java > /drill-java-exec/src/test/java/org/apache/drill/exec/store/parquet line 1 > The declared package "" does not match the expected package > "org.apache.drill.exec.store.parquet" TestParquetReaderUtility.java > /drill-java-exec/src/test/java/org/apache/drill/exec/store/parquet line > 1{noformat} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6650) Remove Stray Semicolon in Printing Results Listener
[ https://issues.apache.org/jira/browse/DRILL-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6650: -- Labels: ready-to-commit (was: ) > Remove Stray Semicolon in Printing Results Listener > --- > > Key: DRILL-6650 > URL: https://issues.apache.org/jira/browse/DRILL-6650 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Labels: ready-to-commit > > Having empty import statements with stray semicolons can cause compilation in > eclipse to fail. I investigated adding a checkstyle check to prevent empty > import declarations, but apparently there is no way to do this. In fact > having multiple semicolons in an import statement is technically not > supported by the java spec > https://stackoverflow.com/questions/8125558/eclipse-double-semi-colon-on-an-import. > There is also a bug filed for the jdk for not throwing an error in this case > https://bugs.openjdk.java.net/browse/JDK-8072390 > Since there is no way to automate checking this I am just going to manually > remove this error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6651) Compilation error in Eclipse IDE due to missing package name
Aman Sinha created DRILL-6651: - Summary: Compilation error in Eclipse IDE due to missing package name Key: DRILL-6651 URL: https://issues.apache.org/jira/browse/DRILL-6651 Project: Apache Drill Issue Type: Bug Reporter: Aman Sinha I am seeing the following compilation errors in my Eclipse build (only in the IDE.. this does not happen on the maven command line): {noformat} The declared package "" does not match the expected package "org.apache.drill.exec.store.parquet" TestComplexColumnInSchema.java /drill-java-exec/src/test/java/org/apache/drill/exec/store/parquet line 1 The declared package "" does not match the expected package "org.apache.drill.exec.store.parquet" TestParquetReaderUtility.java /drill-java-exec/src/test/java/org/apache/drill/exec/store/parquet line 1{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547976#comment-16547976 ] Aman Sinha commented on DRILL-6385: --- The design discussions are getting lost among all the GitHub comments. I am not sure how to avoid this in the JIRA. In any case, [~weijie] regarding the previous comments about broadcast hash join, my point was that the BF does not save network I/O on the probe (since there is no network transfer) and it does not save memory (since blocking operators won't let the BF pass through). So, the question is: how much does the reduction in the row count help the hash join probe, given that there is a CPU cost associated with creating and evaluating a BF?
{noformat}
Suppose the original row count from the table is N
Let the selectivity of the BF be s1 (a fraction between 0 and 1)

Original Hash Join probe side cost:
  Cost_Orig = I/O cost of scanning N rows + Probe cost of N rows

With BF, new Hash Join probe side cost:
  Cost_BF_Probe = I/O cost of scanning N rows
                + CPU cost of Bloom Filter on N rows
                + Probe cost of (s1 * N) rows

With BF, assume the hash join build side cost is Cost_BF_build
So total_HJ_BF_Cost = Cost_BF_Probe + Cost_BF_build
{noformat}
Whether or not the second total cost is cheaper than the original depends on the selectivity and the actual cost of creating and evaluating the bloom filter. So, all I was saying previously was that for a broadcast HJ, the expected benefits would be less. We would only find out through some performance runs. > Support JPPD (Join Predicate Push Down) > --- > > Key: DRILL-6385 > URL: https://issues.apache.org/jira/browse/DRILL-6385 > Project: Apache Drill > Issue Type: New Feature > Components: Server, Execution - Flow >Affects Versions: 1.14.0 >Reporter: weijie.tong >Assignee: weijie.tong >Priority: Major > > This feature is to support the JPPD (Join Predicate Push Down). 
It will > benefit the HashJoin ,Broadcast HashJoin performance by reducing the number > of rows to send across the network ,the memory consumed. This feature is > already supported by Impala which calls it RuntimeFilter > ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]). > The first PR will try to push down a bloom filter of HashJoin node to > Parquet’s scan node. The propose basic procedure is described as follow: > # The HashJoin build side accumulate the equal join condition rows to > construct a bloom filter. Then it sends out the bloom filter to the foreman > node. > # The foreman node accept the bloom filters passively from all the fragments > that has the HashJoin operator. It then aggregates the bloom filters to form > a global bloom filter. > # The foreman node broadcasts the global bloom filter to all the probe side > scan nodes which maybe already have send out partial data to the hash join > nodes(currently the hash join node will prefetch one batch from both sides ). > 4. The scan node accepts a global bloom filter from the foreman node. > It will filter the rest rows satisfying the bloom filter. > > To implement above execution flow, some main new notion described as below: > 1. RuntimeFilter > It’s a filter container which may contain BloomFilter or MinMaxFilter. > 2. RuntimeFilterReporter > It wraps the logic to send hash join’s bloom filter to the foreman.The > serialized bloom filter will be sent out through the data tunnel.This object > will be instanced by the FragmentExecutor and passed to the > FragmentContext.So the HashJoin operator can obtain it through the > FragmentContext. > 3. RuntimeFilterRequestHandler > It is responsible to accept a SendRuntimeFilterRequest RPC to strip the > actual BloomFilter from the network. It then translates this filter to the > WorkerBee’s new interface registerRuntimeFilter. > Another RPC type is BroadcastRuntimeFilterRequest. 
It will register the > accepted global bloom filter to the WorkerBee by the registerRuntimeFilter > method and then propagate to the FragmentContext through which the probe side > scan node can fetch the aggregated bloom filter. > 4.RuntimeFilterManager > The foreman will instance a RuntimeFilterManager .It will indirectly get > every RuntimeFilter by the WorkerBee. Once all the BloomFilters have been > accepted and aggregated . It will broadcast the aggregated bloom filter to > all the probe side scan nodes through the data tunnel by a > BroadcastRuntimeFilterRequest RPC. > 5. RuntimeFilterEnableOption > A global option will be added to decide whether to enable this new feature. > > Welcome suggestion and advice from you.The related PR will be presented as > soon as possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
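The probe-side cost comparison sketched in the earlier comment (Cost_Orig vs. total_HJ_BF_Cost) can be made concrete with a back-of-envelope model. The per-row unit costs below are hypothetical placeholders; only the structure of the formulas comes from the discussion.

```java
// Back-of-envelope version of the bloom-filter cost comparison. All unit
// costs are hypothetical placeholders, not Drill planner constants.
public class BloomFilterCostModel {
    static double costOriginal(long n, double ioCostPerRow, double probeCostPerRow) {
        // Cost_Orig = scan N rows + probe N rows
        return n * ioCostPerRow + n * probeCostPerRow;
    }

    static double costWithBloomFilter(long n, double s1, double ioCostPerRow,
                                      double bfCostPerRow, double probeCostPerRow,
                                      double buildSideBfCost) {
        // total_HJ_BF_Cost = scan all N rows, apply the BF to all N,
        // probe only the s1 * N survivors, plus the build-side BF cost
        return n * ioCostPerRow + n * bfCostPerRow
             + s1 * n * probeCostPerRow + buildSideBfCost;
    }
}
```

With these placeholder costs, a selective filter (small s1) beats the original plan, while s1 near 1.0 makes the BF a pure overhead, which is the concern raised for broadcast hash join.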
[jira] [Commented] (DRILL-6606) Hash Join returns incorrect data types when joining subqueries with limit 0
[ https://issues.apache.org/jira/browse/DRILL-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543541#comment-16543541 ] Aman Sinha commented on DRILL-6606: --- Agree that subquery with filter producing 0 rows is common and we should address the issue. I was mainly referring to the tableau generated limit 0 queries. > Hash Join returns incorrect data types when joining subqueries with limit 0 > --- > > Key: DRILL-6606 > URL: https://issues.apache.org/jira/browse/DRILL-6606 > Project: Apache Drill > Issue Type: Bug >Reporter: Bohdan Kazydub >Assignee: Timothy Farkas >Priority: Blocker > Fix For: 1.14.0 > > > PreparedStatement for query > {code:sql} > SELECT l.l_quantity, l.l_shipdate, o.o_custkey > FROM (SELECT * FROM cp.`tpch/lineitem.parquet` LIMIT 0) l > JOIN (SELECT * FROM cp.`tpch/orders.parquet` LIMIT 0) o > ON l.l_orderkey = o.o_orderkey > LIMIT 0 > {code} > is created with wrong types (nullable INTEGER) for all selected columns, no > matter what their actual type is. This behavior reproduces with hash join > only and is very likely to be caused by DRILL-6027 as the query works fine > before this feature was implemented. > To reproduce the problem you can put the aforementioned query into > TestPreparedStatementProvider#joinOrderByQuery() test method. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6453) TPC-DS query 72 has regressed
[ https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543351#comment-16543351 ] Aman Sinha commented on DRILL-6453: --- [~khfaraaz] can you also try running a simplified version of the query with the first say 3 joins (starting from the leaf level in the plan) ? We should see what the behavior is with patterns like hash-partitioned HJ followed by broadcast, broadcast HJ. > TPC-DS query 72 has regressed > - > > Key: DRILL-6453 > URL: https://issues.apache.org/jira/browse/DRILL-6453 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Flow >Affects Versions: 1.14.0 >Reporter: Khurram Faraaz >Assignee: Boaz Ben-Zvi >Priority: Blocker > Fix For: 1.14.0 > > Attachments: 24f75b18-014a-fb58-21d2-baeab5c3352c.sys.drill, > jstack_29173_June_10_2018.txt, jstack_29173_June_10_2018.txt, > jstack_29173_June_10_2018_b.txt, jstack_29173_June_10_2018_b.txt, > jstack_29173_June_10_2018_c.txt, jstack_29173_June_10_2018_c.txt, > jstack_29173_June_10_2018_d.txt, jstack_29173_June_10_2018_d.txt, > jstack_29173_June_10_2018_e.txt, jstack_29173_June_10_2018_e.txt > > > TPC-DS query 72 seems to have regressed, query profile for the case where it > Canceled after 2 hours on Drill 1.14.0 is attached here. > {noformat} > On, Drill 1.14.0-SNAPSHOT > commit : 931b43e (TPC-DS query 72 executed successfully on this commit, took > around 55 seconds to execute) > SF1 parquet data on 4 nodes; > planner.memory.max_query_memory_per_node = 10737418240. > drill.exec.hashagg.fallback.enabled = true > TPC-DS query 72 executed successfully & took 47 seconds to complete execution. > {noformat} > {noformat} > TPC-DS data in the below run has date values stored as DATE datatype and not > VARCHAR type > On, Drill 1.14.0-SNAPSHOT > commit : 82e1a12 > SF1 parquet data on 4 nodes; > planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true > and > alter system set `exec.hashjoin.num_partitions` = 1; > TPC-DS query 72 executed for 2 hrs and 11 mins and did not complete, I had to > Cancel it by stopping the Foreman drillbit. > As a result several minor fragments are reported to be in > CANCELLATION_REQUESTED state on UI. > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6606) Hash Join returns incorrect data types when joining subqueries with limit 0
[ https://issues.apache.org/jira/browse/DRILL-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543323#comment-16543323 ] Aman Sinha commented on DRILL-6606: --- I don't think LIMIT 0 in the subqueries or Views is common. For instance, Tableau generates a wrapper LIMIT 0 on the entire query, not within each subquery. What is the data type of columns if you only have the outer LIMIT 0 after the join of the subqueries ? > Hash Join returns incorrect data types when joining subqueries with limit 0 > --- > > Key: DRILL-6606 > URL: https://issues.apache.org/jira/browse/DRILL-6606 > Project: Apache Drill > Issue Type: Bug >Reporter: Bohdan Kazydub >Assignee: Timothy Farkas >Priority: Blocker > Fix For: 1.14.0 > > > PreparedStatement for query > {code:sql} > SELECT l.l_quantity, l.l_shipdate, o.o_custkey > FROM (SELECT * FROM cp.`tpch/lineitem.parquet` LIMIT 0) l > JOIN (SELECT * FROM cp.`tpch/orders.parquet` LIMIT 0) o > ON l.l_orderkey = o.o_orderkey > LIMIT 0 > {code} > is created with wrong types (nullable INTEGER) for all selected columns, no > matter what their actual type is. This behavior reproduces with hash join > only and is very likely to be caused by DRILL-6027 as the query works fine > before this feature was implemented. > To reproduce the problem you can put the aforementioned query into > TestPreparedStatementProvider#joinOrderByQuery() test method. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6453) TPC-DS query 72 has regressed
[ https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542476#comment-16542476 ] Aman Sinha commented on DRILL-6453: --- [~ben-zvi], [~dgu-atmapr] looking at the query profile and plan, the lower most HashJoin is the only one that does hash partitioned join while the other 8 joins are broadcast joins. Since the hang happens when doing a next() on an UnorderedReceiver (associated with the hash exchange), can we see if changing this to a broadcast join helps avoid the issue ? The reason the hash partition is picked is because the right side's input (Project # 04-29) estimated row count is 11M which is just over the 10M broadcast threshold. Let's see if bumping this up to 20M helps. This could be a temporary workaround until we address the underlying issue > TPC-DS query 72 has regressed > - > > Key: DRILL-6453 > URL: https://issues.apache.org/jira/browse/DRILL-6453 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Flow >Affects Versions: 1.14.0 >Reporter: Khurram Faraaz >Assignee: Boaz Ben-Zvi >Priority: Blocker > Fix For: 1.14.0 > > Attachments: 24f75b18-014a-fb58-21d2-baeab5c3352c.sys.drill, > jstack_29173_June_10_2018.txt, jstack_29173_June_10_2018.txt, > jstack_29173_June_10_2018_b.txt, jstack_29173_June_10_2018_b.txt, > jstack_29173_June_10_2018_c.txt, jstack_29173_June_10_2018_c.txt, > jstack_29173_June_10_2018_d.txt, jstack_29173_June_10_2018_d.txt, > jstack_29173_June_10_2018_e.txt, jstack_29173_June_10_2018_e.txt > > > TPC-DS query 72 seems to have regressed, query profile for the case where it > Canceled after 2 hours on Drill 1.14.0 is attached here. > {noformat} > On, Drill 1.14.0-SNAPSHOT > commit : 931b43e (TPC-DS query 72 executed successfully on this commit, took > around 55 seconds to execute) > SF1 parquet data on 4 nodes; > planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true > TPC-DS query 72 executed successfully & took 47 seconds to complete execution. > {noformat} > {noformat} > TPC-DS data in the below run has date values stored as DATE datatype and not > VARCHAR type > On, Drill 1.14.0-SNAPSHOT > commit : 82e1a12 > SF1 parquet data on 4 nodes; > planner.memory.max_query_memory_per_node = 10737418240. > drill.exec.hashagg.fallback.enabled = true > and > alter system set `exec.hashjoin.num_partitions` = 1; > TPC-DS query 72 executed for 2 hrs and 11 mins and did not complete, I had to > Cancel it by stopping the Foreman drillbit. > As a result several minor fragments are reported to be in > CANCELLATION_REQUESTED state on UI. > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6588) System table columns incorrectly marked as non-nullable
Aman Sinha created DRILL-6588: - Summary: System table columns incorrectly marked as non-nullable Key: DRILL-6588 URL: https://issues.apache.org/jira/browse/DRILL-6588 Project: Apache Drill Issue Type: Bug Components: Metadata Affects Versions: 1.13.0 Reporter: Aman Sinha
System table columns can contain null values but they are incorrectly marked as non-nullable, as shown in the example table below:
{noformat}
0: jdbc:drill:drillbit=10.10.10.191> describe sys.boot;
+-------------------+--------------------+-------------+
| COLUMN_NAME       | DATA_TYPE          | IS_NULLABLE |
+-------------------+--------------------+-------------+
| name              | CHARACTER VARYING  | NO          |
| kind              | CHARACTER VARYING  | NO          |
| accessibleScopes  | CHARACTER VARYING  | NO          |
| optionScope       | CHARACTER VARYING  | NO          |
| status            | CHARACTER VARYING  | NO          |
| num_val           | BIGINT             | NO          |
| string_val        | CHARACTER VARYING  | NO          |
| bool_val          | BOOLEAN            | NO          |
| float_val         | DOUBLE             | NO          |
+-------------------+--------------------+-------------+
{noformat}
Note that several columns are null:
{noformat}
| name                                               | kind    | accessibleScopes | optionScope | status | num_val | string_val | bool_val | float_val |
| drill.exec.options.exec.udf.enable_dynamic_support | BOOLEAN | BOOT             | BOOT        | BOOT   | null    | null       | true     | null      |
{noformat}
Because of the not-null metadata, the predicates on these tables such as `WHERE IS NULL` evaluate to FALSE, which is incorrect. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534163#comment-16534163 ] Aman Sinha commented on DRILL-6385: --- A couple of design items to discuss, hence I am adding them here instead of in the PR:
* For broadcast hash join, the bloom filter will not save on network I/O. It will reduce the number of rows in the pipeline below the hash join probe, but since we are not applying the bloom filter if there is a blocking operator on the probe side, there is not much scope for memory savings. There is a potential benefit if there is some CPU-intensive operation on the probe side, which is somewhat unlikely since typically such operations would be done after the join. That leaves only one other possibility: the bloom filter allowing partition/rowgroup pruning in the scan itself, but the current implementation does not support that. So my conclusion is that for broadcast hash join the bloom filter adds overhead (creation of the BF, plus communication with the foreman, plus the foreman's communication with drillbits) without much benefit. Do you have thoughts about this? At the very least, I suggest adding a session option such as 'enable_runtime_filter_broadcast_hashjoin' whose default value is FALSE, and looking at performance evaluation for a large data set.
* For joins of 3 or more tables, consider a left-deep join plan:
{noformat}
      HJ2
     /   \
   HJ1    T3
   /  \
  T1   T2
{noformat}
Here, I was envisaging that there would be a bloom filter associated with the join condition of HJ1 and another bloom filter for HJ2. However, since your implementation checks for blocking operators on the probe side, it will prevent HJ2 from creating a BF... is that correct? This is quite a common use case, so I was hoping that we would support it. From a design perspective, in this case we should first make sure that the bloom filter for HJ2 is only created if the join is between T1 and T3, not between T2 and T3, since T2 is on the build side.
Second, in terms of aggregating the BFs, it could behave the same way as a 2-column join between T1-T2. In other words, conceptually the following 2 join scenarios are similar from a bloom filter aggregation perspective: {noformat}
// 2 column join between T1-T2
T1.a1 = T2.a2 AND T1.b1 = T2.b2

// 1 column join between T1-T2 and 1 column join between T1-T3
T1.a1 = T2.a2 AND T1.b1 = T3.b3
{noformat} > Support JPPD (Join Predicate Push Down) > --- > > Key: DRILL-6385 > URL: https://issues.apache.org/jira/browse/DRILL-6385 > Project: Apache Drill > Issue Type: New Feature > Components: Server, Execution - Flow >Affects Versions: 1.14.0 >Reporter: weijie.tong >Assignee: weijie.tong >Priority: Major > > This feature is to support JPPD (Join Predicate Push Down). It will > benefit HashJoin and Broadcast HashJoin performance by reducing the number > of rows sent across the network and the memory consumed. This feature is > already supported by Impala, which calls it RuntimeFilter > ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]). > The first PR will try to push down a bloom filter from the HashJoin node to > Parquet's scan node. The proposed basic procedure is as follows: > # The HashJoin build side accumulates the equal join condition rows to > construct a bloom filter. Then it sends the bloom filter to the foreman > node. > # The foreman node passively accepts the bloom filters from all the fragments > that have the HashJoin operator. It then aggregates the bloom filters to form > a global bloom filter. > # The foreman node broadcasts the global bloom filter to all the probe side > scan nodes, which may already have sent out partial data to the hash join > nodes (currently the hash join node will prefetch one batch from both sides). > # The scan node accepts the global bloom filter from the foreman node. > It will filter the remaining rows using the bloom filter.
> > To implement the above execution flow, the main new notions are described below: > 1. RuntimeFilter > It's a filter container which may contain a BloomFilter or MinMaxFilter. > 2. RuntimeFilterReporter > It wraps the logic to send the hash join's bloom filter to the foreman. The > serialized bloom filter will be sent out through the data tunnel. This object > will be instantiated by the FragmentExecutor and passed to the > FragmentContext, so the HashJoin operator can obtain it through the > FragmentContext. > 3. RuntimeFilterRequestHandler > It is responsible for accepting a SendRuntimeFilterRequest RPC to strip the > actual BloomFilter from the network. It then translates this filter to the > WorkerBee's new interface registerRuntimeFilter. > Another RPC type is BroadcastRuntimeFilterRequest. It will register the > accepted global bloom filter with the WorkerBee by the registerRuntimeFilter > method and then propagate it to the FragmentContext, through which the probe side > scan node can fetch the aggregated bloom filter. > 4. RuntimeFilterManager > The foreman will instantiate a RuntimeFilterManager. It will indirectly get > every RuntimeFilter via the WorkerBee. Once all the BloomFilters have been > accepted and aggregated, it will broadcast the aggregated bloom filter to > all the probe side scan nodes through the data tunnel by a > BroadcastRuntimeFilterRequest RPC. > 5. RuntimeFilterEnableOption > A global option will be added to decide whether to enable this new feature. > > Suggestions and advice are welcome. The related PR will be presented as soon as possible.
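The build/probe flow described in the proposal can be sketched with a minimal bloom filter in Java. This is illustrative only (not Drill's RuntimeFilter implementation, and the names are made up): the build side inserts its join keys, and the probe-side scan drops rows whose keys cannot possibly match before they travel further up the plan.

```java
import java.util.BitSet;

// Minimal sketch of the JPPD idea under discussion (illustrative, NOT Drill's
// RuntimeFilter code): the hash-join build side inserts its join keys into a
// bloom filter; the probe-side scan then drops rows whose keys definitely do
// not appear on the build side. False positives are possible, false negatives
// are not, so the join itself still performs the exact match.
public class JppdSketch {
    static final int BITS = 1 << 16;

    /** Build side: hash every join key into the filter with two hash functions. */
    static BitSet buildSideFilter(String[] buildKeys) {
        BitSet bf = new BitSet(BITS);
        for (String k : buildKeys) {
            bf.set(hash1(k));
            bf.set(hash2(k));
        }
        return bf;
    }

    /** Probe side: false means the key is definitely absent from the build side. */
    static boolean mightMatch(BitSet bf, String key) {
        return bf.get(hash1(key)) && bf.get(hash2(key));
    }

    static int hash1(String k) { return Math.floorMod(k.hashCode(), BITS); }
    static int hash2(String k) { return Math.floorMod(k.hashCode() * 31 + 17, BITS); }

    public static void main(String[] args) {
        BitSet bf = buildSideFilter(new String[] { "1414", "9669" });
        // Probe-side scan: only rows that *might* match survive past the scan.
        for (String probeKey : new String[] { "455", "555", "1414", "580" }) {
            System.out.println(probeKey + " -> " + (mightMatch(bf, probeKey) ? "keep" : "drop"));
        }
    }
}
```

Every key inserted on the build side is guaranteed to pass `mightMatch`, which is the property the join relies on; keys that fail it can be dropped at the scan with no risk of losing matches.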
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466808#comment-16466808 ] Aman Sinha commented on DRILL-6385: --- Sending a link to a short design overview doc [2] I had proposed during the Drill design hackathon [1] in September 2017. The proposal was to send the bloom filter past the exchange boundary rather than sending it to the foreman. However, this is not implemented, so your contribution would be welcome. I think doing the hash partitioned hash join first seems fine since that's the one that would benefit the most. Looking forward to your pull request! [1] https://lists.apache.org/thread.html/74cf48dd78d323535dc942c969e72008884e51f8715f4a20f6f8fb66@%3Cdev.drill.apache.org%3E [2] [https://docs.google.com/document/d/1cNznfv60wwuFJlbKwkVbCBNGSBlY5QbjYNgglPw8JQ0/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
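The aggregation step in the JPPD proposal, where the foreman forms a global bloom filter from the per-fragment ones, is at its core a bitwise union: a key present in any fragment's filter must stay "possible" in the global filter. A minimal sketch, assuming BitSet-backed filters of equal size (illustrative; the class and method names are not Drill's RuntimeFilterManager API):

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Sketch of the foreman-side aggregation (illustrative, names are made up):
// OR-merging the partial bloom filters received from each build-side minor
// fragment yields a global filter that never produces a false negative for
// any key seen by any fragment.
public class GlobalFilterMergeSketch {
    static BitSet mergePartials(List<BitSet> partials) {
        BitSet global = new BitSet();
        for (BitSet p : partials) {
            global.or(p);  // union: keep every bit set by any fragment
        }
        return global;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(); a.set(3);   // fragment 1's partial filter
        BitSet b = new BitSet(); b.set(7);   // fragment 2's partial filter
        System.out.println(mergePartials(Arrays.asList(a, b)));  // prints {3, 7}
    }
}
```

The same union rule also covers the multi-column and multi-table cases discussed above: each join column carries its own filter, and each is merged independently, which is why the T1-T2 two-column join and the T1-T2/T1-T3 case aggregate the same way.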
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466032#comment-16466032 ] Aman Sinha commented on DRILL-6385: --- [~weijie] thanks for working on this. It sounds like you are far along in the implementation. Just as a future reference, it would be good to create the Jira sooner or inform the dev list about the ongoing work so that others in the community are aware. Regarding the proposal, a couple of thoughts: is a global bloom filter always needed, or will a local bloom filter suffice in certain cases? In the case where we are doing a broadcast hash join, the probe side is never distributed, so once the build is done on each minor fragment, the bloom filter can be passed to the Scan operator locally without contacting the foreman node. A second related thought: for a hash distributed hash join where both probe and build sides are hash distributed, does it mean that the 'global bloom filter' is a synchronization point in your proposal? In other words, suppose there are 20 minor fragments and one of them is slow in completing the build phase; will the other 19 probes continue at their own pace? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
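One plausible answer to the synchronization question raised above (an assumption about the design, not confirmed behavior of the eventual implementation) is to make the runtime filter advisory: the probe-side scan emits all rows while the global filter has not yet arrived, and only starts dropping rows once the foreman's broadcast lands. A slow build fragment then delays filtering, not the probe itself. A sketch with made-up names:

```java
import java.util.BitSet;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a non-blocking probe (illustrative, names are made up, not Drill
// code): the scan never waits for the global filter; it just starts using it
// whenever the broadcast arrives. Correctness is preserved because the hash
// join still performs the exact match on whatever rows pass through.
public class NonBlockingProbeSketch {
    private final AtomicReference<BitSet> globalFilter = new AtomicReference<>();

    /** Called by the RPC handler when the aggregated filter arrives. */
    void installFilter(BitSet bf) {
        globalFilter.set(bf);
    }

    /** Called per probe row: pass everything until a filter is installed. */
    boolean keepRow(int keyHash) {
        BitSet bf = globalFilter.get();
        return bf == null || bf.get(keyHash);
    }

    public static void main(String[] args) {
        NonBlockingProbeSketch probe = new NonBlockingProbeSketch();
        System.out.println(probe.keepRow(42));  // true: no filter yet, everything passes
        BitSet bf = new BitSet();
        bf.set(7);
        probe.installFilter(bf);                // broadcast arrives mid-scan
        System.out.println(probe.keepRow(7));   // true: key might match
        System.out.println(probe.keepRow(42));  // false: key definitely absent
    }
}
```

Under this scheme the 19 fast probes continue at their own pace; they simply get less pruning benefit until the slowest build fragment reports in.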
[jira] [Commented] (DRILL-6383) View column types, modes are plan-time guesses, not actual types
[ https://issues.apache.org/jira/browse/DRILL-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463417#comment-16463417 ] Aman Sinha commented on DRILL-6383: --- This link [1] has some details of views being used by Tableau. My understanding is that in the data exploration phase, the user previews a sample of the data and defines a view (this can also be done through Drill Explorer). So in that sense, there is indeed a query execution phase before defining the view, but it is initiated by the user. Since Drill does not have a CREATE TABLE DDL, CREATE VIEW is the closest abstraction, which is used by Tableau. One option to improve the accuracy of the views is to extend the syntax of view creation to include a sampling clause, which would basically mimic what the user was doing in the data exploration phase. I don't have experience with the client side, so I am not sure what the impact of removing some pieces of information from the view definition would be. At the very least, column names should be there. For the data type: if a CAST is specified, then the data type should be there. If we want to make nullability more accurate, we could mark it UNKNOWN by default, but since the user knows best - e.g. the rowkey column in HBase is non-nullable - they could mark it non-nullable. Also cc-ing [~aengelbrecht], who has worked with Drill+BI tools, to shed more light. [1] [http://drill.apache.org/docs/tableau-examples] > View column types, modes are plan-time guesses, not actual types > > > Key: DRILL-6383 > URL: https://issues.apache.org/jira/browse/DRILL-6383 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Priority: Minor > > Create a view and look at the list of columns within the view. You'll > see that they are often wrong in name, type and mode.
> Consider a very simple CSV file with headers: > {noformat} > custId,name,balance,status > 123,Fred,456.78 > 125,Betty,98.76,VIP > 128,Barney,1.23,PAST DUE,30 > {noformat} > Define the simplest possible view: > {noformat} > CREATE VIEW myView2 AS SELECT * FROM `csvh/cust.csvh`; > {noformat} > Then look at the view file: > {noformat} > { > "name" : "myView2", > "sql" : "SELECT *\nFROM `csvh/cust.csvh`", > "fields" : [ { > "name" : "**", > "type" : "DYNAMIC_STAR", > "isNullable" : true > } ], > "workspaceSchemaPath" : [ "local", "data" ] > } > {noformat} > It is clear that the view simply captured the plan-time list of the new > double-star for the wildcard. Since this is not a true type, it should not > have an `isNullable` attribute. > OK, we have to spell out the columns: > {noformat} > CREATE VIEW myView3 AS SELECT custId FROM `csvh/cust.csvh`; > {noformat} > Let's look at the view file: > {noformat} > { > "name" : "myView3", > "sql" : "SELECT `custId`\nFROM `csvh/cust.csvh`", > "fields" : [ { > "name" : "custId", > "type" : "ANY", > "isNullable" : true > } ], > "workspaceSchemaPath" : [ "local", "data" ] > } > {noformat} > The name is correct. The type is `ANY`, which is wrong. Since this is a CSV > file, the column type is `VARCHAR`. Further, because this is a CSV file with > headers, the mode is REQUIRED, but it is listed as nullable.
To verify:
> {noformat}
> SELECT sqlTypeOf(custId), modeOf(custId) FROM myView3 LIMIT 1;
> +--------------------+-----------+
> | EXPR$0             | EXPR$1    |
> +--------------------+-----------+
> | CHARACTER VARYING  | NOT NULL  |
> +--------------------+-----------+
> {noformat}
> Now, let's try a CSV file without headers:
> {noformat}
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> {noformat}
> {noformat}
> CREATE VIEW myView4 AS SELECT columns FROM `csv/cust.csv`;
> SELECT * FROM myView4;
> +--------------------------------+
> | columns                        |
> +--------------------------------+
> | ["123","Fred","456.78"]        |
> | ["125","Betty","98.76","VIP"]  |
> +--------------------------------+
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView4",
>   "sql" : "SELECT `columns`\nFROM `csv/cust.csv`",
>   "fields" : [ {
>     "name" : "columns",
>     "type" : "ANY",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> This is almost nonsensical. `columns` is reported as type `ANY` and nullable. But `columns` is Repeated `VARCHAR`, and repeated types cannot be nullable.
> The conclusion is that the type information is virtually worthless and the `isNullable` information is worse than worthless: it is plain wrong.
> The type information is valid only if the planner can infer types:
> {noformat}
> CREATE VIEW myView5 AS
> SELECT CAST(custId AS INTEGER) AS custId FROM `csvh/cust.csvh`;
> {noformat}
> The view file:
> {noformat}
> {
>   "name" : "myView5",
>   "sql" : "SELECT CAST(`custId` AS INTEGER) AS `custId`\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "custId",
>     "type" : "INTEGER",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> Note that the `type` is inferred from the cast, but `isNullable` is wrong.
[jira] [Commented] (DRILL-6383) View column types, modes are plan-time guesses, not actual types
[ https://issues.apache.org/jira/browse/DRILL-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461973#comment-16461973 ] Aman Sinha commented on DRILL-6383: --- Drill views don't execute the query (not even a LIMIT 0), which is by design, afaik. So the primary use case for views is the one with CAST expressions, which, as your view5 shows, has the correct type inference. Nullability is not expressible in a CAST, so Drill uses the default of 'less strict' nullability, i.e. always assume nullable. For schema-less systems, I would think that this is a reasonable choice. One could argue that even doing a LIMIT 0 query is no guarantee that the schema would match the type and mode of what is returned at run time, since it depends on the order in which files are read. Note that for BI tools we recommend creating views with CAST, and the tool (at least Tableau) wraps the entire complex query with a LIMIT 0, which hints Drill to use the type inferencing directly at planning time without going through even the build-schema phase of execution.
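The inference rule being discussed (take the column type from a CAST when one is present, otherwise fall back to ANY, and always default nullability to the less strict "nullable") can be sketched as follows. This is a toy illustration with made-up names, not Drill's actual view persistence logic:

```java
// Toy sketch of plan-time view field inference (illustrative only, NOT Drill
// code): a CAST yields a concrete type; anything else yields ANY; and since
// nullability is not expressible in a CAST, the default is always nullable.
public class ViewFieldSketch {
    /** e.g. "CAST(custId AS INTEGER)" -> "INTEGER"; bare "custId" -> "ANY". */
    static String inferType(String selectExpr) {
        int as = selectExpr.indexOf(" AS ");
        if (selectExpr.startsWith("CAST(") && as > 0) {
            // naive parse, assumes a well-formed single CAST
            return selectExpr.substring(as + 4, selectExpr.indexOf(')', as)).trim();
        }
        return "ANY";
    }

    /** Nullability is unknown at plan time, so assume nullable ("less strict"). */
    static boolean inferNullable() {
        return true;
    }

    public static void main(String[] args) {
        System.out.println(inferType("CAST(custId AS INTEGER)"));  // INTEGER
        System.out.println(inferType("custId"));                   // ANY
        System.out.println(inferNullable());                       // true
    }
}
```

This mirrors the view files shown in the report: myView5 gets "type" : "INTEGER" from its CAST, while myView3 and myView4 fall back to "ANY", and all of them get "isNullable" : true.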
[jira] [Created] (DRILL-6381) Add capability to do index based planning and execution
Aman Sinha created DRILL-6381: - Summary: Add capability to do index based planning and execution Key: DRILL-6381 URL: https://issues.apache.org/jira/browse/DRILL-6381 Project: Apache Drill Issue Type: New Feature Components: Execution - Relational Operators, Query Planning Optimization Reporter: Aman Sinha Assignee: Aman Sinha If the underlying data source supports indexes (primary and secondary indexes), Drill should leverage those during planning and execution in order to improve query performance. On the planning side, the Drill planner should be enhanced to provide an abstraction layer that expresses the index metadata and statistics. Further, cost-based index selection is needed to decide which index(es) are suitable. On the execution side, appropriate operator enhancements would be needed to handle different categories of indexes, such as covering and non-covering indexes, taking into consideration that the index data may not be co-located with the primary table, i.e. a global index. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6341) Mongo Tests Fail on OSX 10.13.4
[ https://issues.apache.org/jira/browse/DRILL-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443265#comment-16443265 ] Aman Sinha commented on DRILL-6341: --- Hey [~timothyfarkas], thanks for chasing this down. I am tagging [~akumarb2010] or [~kam_iitkgp] to see if they can help since they wrote the MongoDB plugin. > Mongo Tests Fail on OSX 10.13.4 > --- > > Key: DRILL-6341 > URL: https://issues.apache.org/jira/browse/DRILL-6341 > Project: Apache Drill > Issue Type: Bug >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > > [mongod output] [mongoimport error]failed MSpanList_Insert 0x604000 > 0x344eb6bdc31 0x0 > [mongoimport error] fatal error: MSpanList_Insert > [mongoimport error] > [mongoimport error] runtime stack: > [mongoimport error] runtime.throw(0x468d0b) > [mongoimport error] /usr/local/go/src/runtime/panic.go:491 +0xad > fp=0x7ffeefbff000 sp=0x7ffeefbfefd0 > [mongoimport error] runtime.MSpanList_Insert(0x48a0c8, 0x604000) > [mongoimport error] /usr/local/go/src/runtime/mheap.c:692 +0x8f > fp=0x7ffeefbff028 sp=0x7ffeefbff000 > [mongoimport error] MHeap_FreeSpanLocked(0x486cc0, 0x604000, 0x100) > [mongoimport error] /usr/local/go/src/runtime/mheap.c:583 +0x163 > fp=0x7ffeefbff068 sp=0x7ffeefbff028 > [mongoimport error] MHeap_Grow(0x486cc0, 0x8, 0x0) > [mongoimport error] /usr/local/go/src/runtime/mheap.c:420 +0x1a8 > fp=0x7ffeefbff0a8 sp=0x7ffeefbff068 > [mongoimport error] MHeap_AllocSpanLocked(0x486cc0, 0x1, 0x0) > [mongoimport error] /usr/local/go/src/runtime/mheap.c:298 +0x365 > fp=0x7ffeefbff0e8 sp=0x7ffeefbff0a8 > [mongoimport error] mheap_alloc(0x486cc0, 0x1, 0x12, 0x0) > [mongoimport error] /usr/local/go/src/runtime/mheap.c:190 +0x121 > fp=0x7ffeefbff110 sp=0x7ffeefbff0e8 > [mongoimport error] runtime.MHeap_Alloc(0x486cc0, 0x1, 0x112, 0xdbc9) > [mongoimport error] /usr/local/go/src/runtime/mheap.c:240 +0x66 > fp=0x7ffeefbff148 sp=0x7ffeefbff110 > [mongoimport error] MCentral_Grow(0x48ea38, 0x0) > 
[mongoimport error] /usr/local/go/src/runtime/mcentral.c:197 +0x8b > fp=0x7ffeefbff1b0 sp=0x7ffeefbff148 > [mongoimport error] runtime.MCentral_CacheSpan(0x48ea38, 0x0) > [mongoimport error] /usr/local/go/src/runtime/mcentral.c:85 +0x167 > fp=0x7ffeefbff1e8 sp=0x7ffeefbff1b0 > [mongoimport error] runtime.MCache_Refill(0x60, 0x12, 0x0) > [mongoimport error] /usr/local/go/src/runtime/mcache.c:90 +0xa0 > fp=0x7ffeefbff210 sp=0x7ffeefbff1e8 > [mongoimport error] runtime.mcacheRefill_m() > [mongoimport error] /usr/local/go/src/runtime/malloc.c:368 +0x57 > fp=0x7ffeefbff230 sp=0x7ffeefbff210 > [mongoimport error] runtime.onM(0x3917f8) > [mongoimport error] /usr/local/go/src/runtime/asm_amd64.s:273 +0x9a > fp=0x7ffeefbff238 sp=0x7ffeefbff230 > [mongoimport error] runtime.mallocgc(0x120, 0x2f1120, 0x0, 0x0) > [mongoimport error] /usr/local/go/src/runtime/malloc.go:178 +0x849 > fp=0x7ffeefbff2e8 sp=0x7ffeefbff238 > [mongoimport error] runtime.newobject(0x2f1120, 0x60) > [mongoimport error] /usr/local/go/src/runtime/malloc.go:353 +0x49 > fp=0x7ffeefbff310 sp=0x7ffeefbff2e8 > [mongoimport error] runtime.newG(0x29c0a) > [mongoimport error] /usr/local/go/src/runtime/proc.go:233 +0x2a > fp=0x7ffeefbff328 sp=0x7ffeefbff310 > [mongoimport error] allocg(0x478540) > [mongoimport error] /usr/local/go/src/runtime/proc.c:925 +0x1f > fp=0x7ffeefbff338 sp=0x7ffeefbff328 > [mongoimport error] runtime.malg(0x8000, 0x4785e0) > [mongoimport error] /usr/local/go/src/runtime/proc.c:2106 +0x1f > fp=0x7ffeefbff368 sp=0x7ffeefbff338 > [mongoimport error] runtime.mpreinit(0x478bc0) > [mongoimport error] /usr/local/go/src/runtime/os_darwin.c:137 +0x27 > fp=0x7ffeefbff380 sp=0x7ffeefbff368 > [mongoimport error] mcommoninit(0x478bc0) > [mongoimport error] /usr/local/go/src/runtime/proc.c:201 +0xc9 > fp=0x7ffeefbff3a8 sp=0x7ffeefbff380 > [mongoimport error] runtime.schedinit() > [mongoimport error] /usr/local/go/src/runtime/proc.c:138 +0x55 > fp=0x7ffeefbff3d0 sp=0x7ffeefbff3a8 > [mongoimport error] 
runtime.rt0_go(0x7ffeefbff400, 0xd, 0x7ffeefbff400, 0x0, > 0xd, 0x7ffeefbff6c0, 0x7ffeefbff73b, 0x7ffeefbff742, 0x7ffeefbff748, > 0x7ffeefbff74d, ...) > [mongoimport error] /usr/local/go/src/runtime/asm_amd64.s:95 +0x116 > fp=0x7ffeefbff3d8 sp=0x7ffeefbff3d0 > [mongoimport error] > 12:45:37.991 [main] ERROR d.f.e.p.runtime.AbstractProcess - failed to call > onAfterProcessStart() > [mongod output] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time > elapsed: -635,185.335 sec <<< FAILURE! - in > org.apache.drill.exec.store.mongo.MongoTestSuit > org.apache.drill.exec.store.mongo.MongoTestSuit Time elapsed: -635,185.336 > sec <<< ERROR! > java.io.IOException: Could not start process: > at >
[jira] [Updated] (DRILL-6103) lsb_release: command not found
[ https://issues.apache.org/jira/browse/DRILL-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-6103: -- Labels: ready-to-commit (was: ) > lsb_release: command not found > -- > > Key: DRILL-6103 > URL: https://issues.apache.org/jira/browse/DRILL-6103 > Project: Apache Drill > Issue Type: Bug >Reporter: Chunhui Shi >Assignee: Kunal Khatua >Priority: Major > Labels: ready-to-commit > Fix For: 1.14.0 > > Attachments: drill-config.sh.patch > > > Got this error when running drillbit.sh: > > $ bin/drillbit.sh restart > bin/drill-config.sh: line 317: lsb_release: command not found -- This message was sent by Atlassian JIRA (v7.6.3#76005)