[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440040#comment-16440040 ]

Bridget Bevens commented on DRILL-6118:
---
Documentation is updated for this JIRA: [https://drill.apache.org/docs/parquet-filter-pushdown/]

> Handle item star columns during project / filter push down and directory pruning
> --
>
> Key: DRILL-6118
> URL: https://issues.apache.org/jira/browse/DRILL-6118
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.12.0
> Reporter: Arina Ielchiieva
> Assignee: Arina Ielchiieva
> Priority: Major
> Labels: doc-impacting, ready-to-commit
> Fix For: 1.13.0
>
> Project push down, filter push down and partition pruning do not work with a dynamically expanded column, which is represented as a star in the ITEM operator: _ITEM($0, 'column_name')_ where $0 is a star.
> This often occurs when a view, sub-select or CTE with a star is issued.
> To solve this issue we can create {{DrillFilterItemStarReWriterRule}}, which will rewrite such ITEM operators before filter push down and directory pruning. Project-into-scan push down logic will be handled separately in the already existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can consider the following queries the same:
> {{select col1 from t}}
> {{select col1 from (select * from t)}}
> *Use cases*
> Since item star columns were not considered during project / filter push down and directory pruning, push down and pruning did not happen. This was causing Drill to read all columns from a file (when only several are needed) or to read all files instead. A view with a star query is the most common example. Such behavior significantly degrades performance for item star queries compared to queries without item star.
> *EXAMPLES*
> *Data set*
> The following will create a table with three files, each in a dedicated sub-folder:
> {noformat}
> use dfs.tmp;
> create table `order_ctas/t1` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-01' and date '1992-01-03';
> create table `order_ctas/t2` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-04' and date '1992-01-06';
> create table `order_ctas/t3` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-07' and date '1992-01-09';
> {noformat}
> *Filter push down*
> {{select * from order_ctas where o_orderdate = date '1992-01-01'}} will read only one file:
> {noformat}
> 00-00  Screen
> 00-01    Project(**=[$0])
> 00-02      Project(T1¦¦**=[$0])
> 00-03        SelectionVectorRemover
> 00-04          Filter(condition=[=($1, 1992-01-01)])
> 00-05            Project(T1¦¦**=[$0], o_orderdate=[$1])
> 00-06              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where o_orderdate = date '1992-01-01'}} will read all three files:
> {noformat}
> 00-00  Screen
> 00-01    Project(**=[$0])
> 00-02      SelectionVectorRemover
> 00-03        Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
> 00-04          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> *Directory pruning*
> {{select * from order_ctas where dir0 = 't1'}} will read data only from one folder:
> {noformat}
> 00-00  Screen
> 00-01    Project(**=[$0])
> 00-02      Project(**=[$0])
> 00-03        Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where dir0 = 't1'}} will read the content of all three folders:
> {noformat}
> 00-00  Screen
> 00-01    Project(**=[$0])
> 00-02      SelectionVectorRemover
> 00-03        Filter(condition=[=(ITEM($0, 'dir0'), 't1')])
> 00-04          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t3/0_0_0.parquet]],
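The rewrite described above (treating an _ITEM($0, 'column_name')_ reference over a star as an ordinary column reference, so later push-down rules can recognize it) can be sketched as follows. This is a purely illustrative model in Python, not Drill's actual {{DrillFilterItemStarReWriterRule}} implementation; the tuple encoding of expressions is an assumption made for the sketch.

```python
# Illustrative sketch, NOT Drill's real code: model the idea behind
# DrillFilterItemStarReWriterRule, which rewrites ITEM(star, 'col')
# into a direct column reference before filter push down and pruning.
#
# Expressions are modelled as nested tuples (an assumed encoding):
#   ("ITEM", "**", "o_orderdate")  - field access into a dynamic star column
#   ("COL", "o_orderdate")         - a direct column reference
#   ("=", lhs, rhs)                - a comparison; literals stay as plain values

def rewrite_item_star(expr):
    """Recursively replace ITEM(**, 'name') with a direct column ref."""
    if not isinstance(expr, tuple):
        return expr  # literal operand, leave unchanged
    if expr[0] == "ITEM" and expr[1] == "**":
        return ("COL", expr[2])
    # Rebuild the node with rewritten operands.
    return (expr[0],) + tuple(rewrite_item_star(a) for a in expr[1:])

# The filter from the example plan: ITEM($0, 'o_orderdate') = '1992-01-01'
before = ("=", ("ITEM", "**", "o_orderdate"), "1992-01-01")
after = rewrite_item_star(before)
print(after)  # ('=', ('COL', 'o_orderdate'), '1992-01-01')
```

After this rewrite the filter looks exactly like the one produced by a query without a nested star select, which is why push down and pruning can then proceed.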
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389368#comment-16389368 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user vvysotskyi commented on the issue: https://github.com/apache/drill/pull/1104

@paul-rogers, as you know, there were some differences between the commits in the older Calcite version (1.4.0-r23) used by Drill and the commits merged into Apache master. One such commit was `[CALCITE-1150] Add dynamic record type and dynamic star for schema-on-read table`. In our older Calcite fork `DYNAMIC_STAR` was `*`, but after the rebase onto Calcite 1.15 it became `**`. As I understand it, `DYNAMIC_STAR` means a "default column" which is added to the table row type when we have a `select *` query and the schema of the underlying table is not known. As for the name of the constant, I think the current `DYNAMIC_STAR` is more suitable, since Calcite uses a similar name for the same string: `DynamicRecordType.DYNAMIC_STAR_PREFIX`. So the current name at least helps to avoid divergence in naming between Drill and Calcite.
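The "default column" idea described in the comment above (a dynamic record type whose fields materialize on first reference when the schema is unknown) can be illustrated with a small hypothetical model. The class and method names here are invented for the sketch; this is not Calcite's `DynamicRecordType` API.

```python
# Hypothetical sketch of the "dynamic record type" concept from
# CALCITE-1150. When a table's schema is not known up front
# (schema-on-read), the row type starts with only a dynamic star
# placeholder column ("**"); any field the query references is added
# on the fly instead of failing with "column not found".
# Names are illustrative, not Calcite's real API.

DYNAMIC_STAR = "**"

class DynamicRowType:
    def __init__(self):
        # Start with only the dynamic star placeholder column.
        self.fields = [DYNAMIC_STAR]

    def field(self, name):
        # Materialize unknown fields on first reference.
        if name not in self.fields:
            self.fields.append(name)
        return name

row_type = DynamicRowType()
row_type.field("o_orderdate")
print(row_type.fields)  # ['**', 'o_orderdate']
```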
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388646#comment-16388646 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1104

I noticed this change when merging with the latest master. What does `DYNAMIC_STAR` mean? The original `WILDCARD` name was more theoretical. How is `DYNAMIC_STAR` different from `STAR` (other than being `**`)? Can we call it `WILDCARD` and `isWildcard()`? The fact that the string is `**`, presumably for internal reasons, seems less worth calling out in the name.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384699#comment-16384699 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1104

Merged into Apache master with commit id 9073aed67d89e8b2188870d6c812706085c9c41b
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384700#comment-16384700 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user arina-ielchiieva closed the pull request at: https://github.com/apache/drill/pull/1104
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375757#comment-16375757 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1104

@amansinha100 rebased and ran pre-commit tests. All good except for 16 expected failures.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375674#comment-16375674 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user amansinha100 commented on the issue: https://github.com/apache/drill/pull/1104

@arina-ielchiieva pls rebase on latest master.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375185#comment-16375185 ] ASF GitHub Bot commented on DRILL-6118: --- Github user amansinha100 commented on the issue: https://github.com/apache/drill/pull/1104 +1
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367537#comment-16367537 ] ASF GitHub Bot commented on DRILL-6118: --- Github user arina-ielchiieva commented on a diff in the pull request: https://github.com/apache/drill/pull/1104#discussion_r168801629 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterItemStarReWriterRule.java --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.planner.logical; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableSet; +import org.apache.calcite.adapter.enumerable.EnumerableTableScan; +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptRuleOperand; +import org.apache.calcite.plan.RelOptTable; +import org.apache.calcite.prepare.RelOptTableImpl; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.core.CorrelationId; +import org.apache.calcite.rel.core.Filter; +import org.apache.calcite.rel.core.Project; +import org.apache.calcite.rel.core.TableScan; +import org.apache.calcite.rel.logical.LogicalFilter; +import org.apache.calcite.rel.logical.LogicalProject; +import org.apache.calcite.rel.type.RelDataType; +import org.apache.calcite.rel.type.RelDataTypeFactory; +import org.apache.calcite.rel.type.RelDataTypeField; +import org.apache.calcite.rex.RexCall; +import org.apache.calcite.rex.RexInputRef; +import org.apache.calcite.rex.RexNode; +import org.apache.calcite.rex.RexVisitorImpl; +import org.apache.calcite.schema.Table; +import org.apache.drill.exec.planner.types.RelDataTypeDrillImpl; +import org.apache.drill.exec.planner.types.RelDataTypeHolder; +import org.apache.drill.exec.util.Utilities; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import static org.apache.drill.exec.planner.logical.FieldsReWriterUtil.DesiredField; +import static org.apache.drill.exec.planner.logical.FieldsReWriterUtil.FieldsReWriter; + +/** + * Rule will transform filter -> project -> scan call with item star fields in filter + * into project -> filter -> project -> scan where item star fields are pushed into scan + * and replaced with actual field references. + * + * This will help partition pruning and push down rules to detect fields that can be pruned or push downed. 
+ * Item star operator appears when sub-select or cte with star are used as source. + */ +public class DrillFilterItemStarReWriterRule extends RelOptRule { + + public static final DrillFilterItemStarReWriterRule INSTANCE = new DrillFilterItemStarReWriterRule( + RelOptHelper.some(Filter.class, RelOptHelper.some(Project.class, RelOptHelper.any( TableScan.class))), + "DrillFilterItemStarReWriterRule"); + + private DrillFilterItemStarReWriterRule(RelOptRuleOperand operand, String id) { +super(operand, id); + } + + @Override + public void onMatch(RelOptRuleCall call) { +Filter filterRel = call.rel(0); +Project projectRel = call.rel(1); +TableScan scanRel = call.rel(2); + +ItemStarFieldsVisitor itemStarFieldsVisitor = new ItemStarFieldsVisitor(filterRel.getRowType().getFieldNames()); --- End diff -- @chunhui-shi added more unit tests. Please review.
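The transformation the rule's javadoc describes — filter → project → scan rewritten into project → filter → project → scan, with the item star fields named in the filter pushed into the scan — can be sketched with a toy plan model (hypothetical {{Node}} class, not Calcite's {{RelNode}}):

```java
import java.util.ArrayList;
import java.util.List;

public class FilterItemStarSketch {
    // Toy plan node: operator name, the columns it exposes, and its single input.
    record Node(String op, List<String> cols, Node input) {}

    // Rewrite Filter(Project(Scan)) where the filter needs `starFields` resolved
    // from the dynamic star: push the named fields into the scan, filter on them,
    // and add a top project that restores the original output shape.
    static Node rewrite(Node filter, List<String> starFields) {
        Node project = filter.input();
        Node scan = project.input();
        // The scan now reads the concrete columns in addition to `**`.
        List<String> scanCols = new ArrayList<>(scan.cols());
        scanCols.addAll(starFields);
        Node newScan = new Node("Scan", scanCols, null);
        Node newProject = new Node("Project", scanCols, newScan);
        Node newFilter = new Node("Filter", scanCols, newProject);
        return new Node("Project", project.cols(), newFilter); // original shape on top
    }

    public static void main(String[] args) {
        Node scan = new Node("Scan", List.of("**"), null);
        Node project = new Node("Project", List.of("**"), scan);
        Node filter = new Node("Filter", List.of("**"), project);
        // Top-down shape after rewrite: Project -> Filter -> Project -> Scan
        for (Node n = rewrite(filter, List.of("o_orderdate")); n != null; n = n.input()) {
            System.out.println(n.op() + " " + n.cols());
        }
    }
}
```

With {{o_orderdate}} visible as a scan column, the existing filter push-down and pruning rules can then match it the same way they match a directly projected column.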
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353918#comment-16353918 ] ASF GitHub Bot commented on DRILL-6118: --- Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/1104#discussion_r166310161 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java --- @@ -596,10 +596,10 @@ private void classifyExpr(final NamedExpression ex, final RecordBatch incoming, final NameSegment ref = ex.getRef().getRootSegment(); final boolean exprHasPrefix = expr.getPath().contains(StarColumnHelper.PREFIX_DELIMITER); final boolean refHasPrefix = ref.getPath().contains(StarColumnHelper.PREFIX_DELIMITER); -final boolean exprIsStar = expr.getPath().equals(SchemaPath.WILDCARD); -final boolean refContainsStar = ref.getPath().contains(SchemaPath.WILDCARD); -final boolean exprContainsStar = expr.getPath().contains(SchemaPath.WILDCARD); -final boolean refEndsWithStar = ref.getPath().endsWith(SchemaPath.WILDCARD); +final boolean exprIsStar = expr.getPath().equals(SchemaPath.DYNAMIC_STAR); --- End diff -- This change became required after Calcite update. With the changes in CALCITE-1150, `*` is replaced by `**` after a query is parsed and `**` is added to the RowType. Therefore WILDCARD can't come from the plan and its usage should be replaced by `**`.
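The point made above about CALCITE-1150 — after parsing, the star reaches the plan as the dynamic star {{**}}, so plan-time checks against the old {{*}} wildcard can never match — can be illustrated with a minimal sketch (constant names mirror the diff; the values follow the comment, not Drill's actual {{SchemaPath}} source):

```java
public class StarCheck {
    // Per the review comment: `*` is the parse-time wildcard, `**` is the
    // dynamic star that CALCITE-1150 places into the RowType after parsing.
    static final String WILDCARD = "*";
    static final String DYNAMIC_STAR = "**";

    // Plan-time check updated to match what the plan actually carries.
    static boolean matchesPlanStar(String pathFromPlan) {
        return pathFromPlan.equals(DYNAMIC_STAR);
    }

    public static void main(String[] args) {
        String columnFromPlan = DYNAMIC_STAR;                // what the plan carries
        System.out.println(columnFromPlan.equals(WILDCARD)); // false: old WILDCARD check misses
        System.out.println(matchesPlanStar(columnFromPlan)); // true: DYNAMIC_STAR check matches
    }
}
```

This is why the diff replaces every {{SchemaPath.WILDCARD}} comparison in {{classifyExpr}} with {{SchemaPath.DYNAMIC_STAR}}.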
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352838#comment-16352838 ] ASF GitHub Bot commented on DRILL-6118: --- Github user chunhui-shi commented on a diff in the pull request: https://github.com/apache/drill/pull/1104#discussion_r166094020 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterItemStarReWriterRule.java --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.planner.logical; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableSet; +import org.apache.calcite.adapter.enumerable.EnumerableTableScan; +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptRuleOperand; +import org.apache.calcite.plan.RelOptTable; +import org.apache.calcite.prepare.RelOptTableImpl; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.core.CorrelationId; +import org.apache.calcite.rel.core.Filter; +import org.apache.calcite.rel.core.Project; +import org.apache.calcite.rel.core.TableScan; +import org.apache.calcite.rel.logical.LogicalFilter; +import org.apache.calcite.rel.logical.LogicalProject; +import org.apache.calcite.rel.type.RelDataType; +import org.apache.calcite.rel.type.RelDataTypeFactory; +import org.apache.calcite.rel.type.RelDataTypeField; +import org.apache.calcite.rex.RexCall; +import org.apache.calcite.rex.RexInputRef; +import org.apache.calcite.rex.RexNode; +import org.apache.calcite.rex.RexVisitorImpl; +import org.apache.calcite.schema.Table; +import org.apache.drill.exec.planner.types.RelDataTypeDrillImpl; +import org.apache.drill.exec.planner.types.RelDataTypeHolder; +import org.apache.drill.exec.util.Utilities; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import static org.apache.drill.exec.planner.logical.FieldsReWriterUtil.DesiredField; +import static org.apache.drill.exec.planner.logical.FieldsReWriterUtil.FieldsReWriter; + +/** + * Rule will transform filter -> project -> scan call with item star fields in filter + * into project -> filter -> project -> scan where item star fields are pushed into scan + * and replaced with actual field references. + * + * This will help partition pruning and push down rules to detect fields that can be pruned or push downed. 
+ * Item star operator appears when sub-select or cte with star are used as source. + */ +public class DrillFilterItemStarReWriterRule extends RelOptRule { + + public static final DrillFilterItemStarReWriterRule INSTANCE = new DrillFilterItemStarReWriterRule( + RelOptHelper.some(Filter.class, RelOptHelper.some(Project.class, RelOptHelper.any( TableScan.class))), + "DrillFilterItemStarReWriterRule"); + + private DrillFilterItemStarReWriterRule(RelOptRuleOperand operand, String id) { +super(operand, id); + } + + @Override + public void onMatch(RelOptRuleCall call) { +Filter filterRel = call.rel(0); +Project projectRel = call.rel(1); +TableScan scanRel = call.rel(2); + +ItemStarFieldsVisitor itemStarFieldsVisitor = new ItemStarFieldsVisitor(filterRel.getRowType().getFieldNames()); --- End diff -- Other test cases that should be covered are: nested field names, references to two different fields under the same parent (e.g. a.b and a.c), and array types referred to in filters and projects.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352837#comment-16352837 ] ASF GitHub Bot commented on DRILL-6118: --- Github user chunhui-shi commented on a diff in the pull request: https://github.com/apache/drill/pull/1104#discussion_r166066830 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java --- @@ -596,10 +596,10 @@ private void classifyExpr(final NamedExpression ex, final RecordBatch incoming, final NameSegment ref = ex.getRef().getRootSegment(); final boolean exprHasPrefix = expr.getPath().contains(StarColumnHelper.PREFIX_DELIMITER); final boolean refHasPrefix = ref.getPath().contains(StarColumnHelper.PREFIX_DELIMITER); -final boolean exprIsStar = expr.getPath().equals(SchemaPath.WILDCARD); -final boolean refContainsStar = ref.getPath().contains(SchemaPath.WILDCARD); -final boolean exprContainsStar = expr.getPath().contains(SchemaPath.WILDCARD); -final boolean refEndsWithStar = ref.getPath().endsWith(SchemaPath.WILDCARD); +final boolean exprIsStar = expr.getPath().equals(SchemaPath.DYNAMIC_STAR); --- End diff -- Why don't we need to handle WILDCARD case anymore?
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346884#comment-16346884 ] ASF GitHub Bot commented on DRILL-6118: --- Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1104 @chunhui-shi please review.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346883#comment-16346883 ] ASF GitHub Bot commented on DRILL-6118: --- GitHub user arina-ielchiieva opened a pull request: https://github.com/apache/drill/pull/1104

DRILL-6118: Handle item star columns during project / filter push down and directory pruning

1. Added DrillFilterItemStarReWriterRule to re-write item star fields to regular field references.
2. Refactored DrillPushProjectIntoScanRule to handle item star fields, factored out helper classes and methods from PrelUtil.class.
3. Fixed issue with dynamic star usage (after Calcite upgrade old usage of star was still present, replaced WILDCARD -> DYNAMIC_STAR for clarity).
4. Added unit tests to check project / filter push down and directory pruning with item star.

Details in [DRILL-6118](https://issues.apache.org/jira/browse/DRILL-6118). You can merge this pull request into a Git repository by running:

$ git pull https://github.com/arina-ielchiieva/drill DRILL-6118

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1104.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1104

commit 4673bfb593ca6422d58fa9e0e6eb281a69f1ed69
Author: Arina Ielchiieva
Date: 2017-12-21T17:31:00Z

DRILL-6118: Handle item star columns during project / filter push down and directory pruning
1. Added DrillFilterItemStarReWriterRule to re-write item star fields to regular field references.
2. Refactored DrillPushProjectIntoScanRule to handle item star fields, factored out helper classes and methods from PrelUtil.class.
3. Fixed issue with dynamic star usage (after Calcite upgrade old usage of star was still present, replaced WILDCARD -> DYNAMIC_STAR for clarity).
4. Added unit tests to check project / filter push down and directory pruning with item star.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345320#comment-16345320 ] Arina Ielchiieva commented on DRILL-6118: - Yes, Paul, you are right: this Jira will cover only plan-time work.
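The plan-time rewrite discussed above can be illustrated with a small sketch: the idea is to replace _ITEM($0, 'column_name')_, where $0 is a star, with a direct reference to the named column, so that filter push down and directory pruning see a concrete column instead of an opaque ITEM call. The following is a hedged Python illustration of that transformation on a toy expression tree; it is not Drill's actual {{DrillFilterItemStarReWriterRule}} (which is a Calcite rule written in Java), and all class names here are hypothetical.

```python
# Illustrative sketch only: rewriting ITEM(star, 'col') into a direct
# column reference so push-down logic can recognize the column.
# These node classes are hypothetical, not Drill's actual RexNode types.
from dataclasses import dataclass


@dataclass
class ColumnRef:
    """A concrete column reference, e.g. o_orderdate."""
    name: str


@dataclass
class Star:
    """The dynamically expanded '*' column ($0 in the plans above)."""


@dataclass
class ItemCall:
    """ITEM(input, 'field'), as produced for queries over a star sub-select."""
    input: object
    field: str


def rewrite(expr):
    """Recursively rewrite ITEM(star, 'col') -> ColumnRef('col')."""
    if isinstance(expr, ItemCall):
        inner = rewrite(expr.input)
        if isinstance(inner, Star):
            # The key rewrite: the ITEM call over a star becomes
            # an ordinary column reference.
            return ColumnRef(expr.field)
        return ItemCall(inner, expr.field)
    return expr


# ITEM($0, 'o_orderdate') where $0 is a star:
print(rewrite(ItemCall(Star(), "o_orderdate")))  # ColumnRef(name='o_orderdate')
```

After this rewrite, the filter condition in the second plan above would reference `o_orderdate` directly, making it equivalent to the first plan, where push down prunes the scan to one file.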
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345307#comment-16345307 ] Paul Rogers commented on DRILL-6118: This ticket discusses plan-time work. Please note that a complete runtime projection implementation already exists in the "batch handling" project; it would be wonderful to reuse/leverage that work. Please see [this writeup|https://github.com/paul-rogers/drill/wiki/BH-Projection-Framework]. Although not covered in this JIRA, the projection code is also designed to better handle file metadata (AKA "implicit") columns so we don't create them unnecessarily. But the planner has to recognize such columns as special, and not combine them with the wildcard at plan time.