[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440040#comment-16440040 ]

Bridget Bevens commented on DRILL-6118:
---
Documentation is updated for this JIRA: [https://drill.apache.org/docs/parquet-filter-pushdown/]

> Handle item star columns during project / filter push down and directory pruning
> --
>
> Key: DRILL-6118
> URL: https://issues.apache.org/jira/browse/DRILL-6118
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.12.0
> Reporter: Arina Ielchiieva
> Assignee: Arina Ielchiieva
> Priority: Major
> Labels: doc-impacting, ready-to-commit
> Fix For: 1.13.0
>
> Project push down, filter push down and partition pruning do not work with a dynamically expanded column, which is represented as a star in the ITEM operator: _ITEM($0, 'column_name')_ where $0 is a star.
> This often occurs when a view, sub-select or CTE with a star is issued.
> To solve this issue we can create {{DrillFilterItemStarReWriterRule}}, which will rewrite such ITEM operators before filter push down and directory pruning. Project-into-scan push down logic will be handled separately in the already existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can consider the following queries the same:
> {{select col1 from t}}
> {{select col1 from (select * from t)}}
> *Use cases*
> Since item star columns were not considered during project / filter push down and directory pruning, push down and pruning did not happen. This was causing Drill to read all columns from a file (when only several are needed) or to read all files instead. A view with a star query is the most common example. Such behavior significantly degrades performance for item star queries compared to queries without item star.
> *EXAMPLES*
> *Data set*
> The following will create a table with three files, each in a dedicated sub-folder:
> {noformat}
> use dfs.tmp;
> create table `order_ctas/t1` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-01' and date '1992-01-03';
> create table `order_ctas/t2` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-04' and date '1992-01-06';
> create table `order_ctas/t3` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-07' and date '1992-01-09';
> {noformat}
> *Filter push down*
> {{select * from order_ctas where o_orderdate = date '1992-01-01'}} will read only one file:
> {noformat}
> 00-00  Screen
> 00-01    Project(**=[$0])
> 00-02      Project(T1¦¦**=[$0])
> 00-03        SelectionVectorRemover
> 00-04          Filter(condition=[=($1, 1992-01-01)])
> 00-05            Project(T1¦¦**=[$0], o_orderdate=[$1])
> 00-06              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where o_orderdate = date '1992-01-01'}} will read all three files:
> {noformat}
> 00-00  Screen
> 00-01    Project(**=[$0])
> 00-02      SelectionVectorRemover
> 00-03        Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
> 00-04          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> *Directory pruning*
> {{select * from order_ctas where dir0 = 't1'}} will read data only from one folder:
> {noformat}
> 00-00  Screen
> 00-01    Project(**=[$0])
> 00-02      Project(**=[$0])
> 00-03        Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where dir0 = 't1'}} will read the content of all three folders:
> {noformat}
> 00-00  Screen
> 00-01    Project(**=[$0])
> 00-02      SelectionVectorRemover
> 00-03        Filter(condition=[=(ITEM($0, 'dir0'), 't1')])
> 00-04          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t3/0_0_0.parquet]],
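The rewrite described above (treating an _ITEM($0, 'column_name')_ reference over a star as an ordinary column reference, so later push-down rules can recognize it) can be sketched as follows. This is a purely illustrative model in Python, not Drill's actual {{DrillFilterItemStarReWriterRule}} implementation; the tuple encoding of expressions is an assumption made for the sketch.

```python
# Illustrative sketch, NOT Drill's real code: model the idea behind
# DrillFilterItemStarReWriterRule, which rewrites ITEM(star, 'col')
# into a direct column reference before filter push down and pruning.
#
# Expressions are modelled as nested tuples (an assumed encoding):
#   ("ITEM", "**", "o_orderdate")  - field access into a dynamic star column
#   ("COL", "o_orderdate")         - a direct column reference
#   ("=", lhs, rhs)                - a comparison; literals stay as plain values

def rewrite_item_star(expr):
    """Recursively replace ITEM(**, 'name') with a direct column ref."""
    if not isinstance(expr, tuple):
        return expr  # literal operand, leave unchanged
    if expr[0] == "ITEM" and expr[1] == "**":
        return ("COL", expr[2])
    # Rebuild the node with rewritten operands.
    return (expr[0],) + tuple(rewrite_item_star(a) for a in expr[1:])

# The filter from the example plan: ITEM($0, 'o_orderdate') = '1992-01-01'
before = ("=", ("ITEM", "**", "o_orderdate"), "1992-01-01")
after = rewrite_item_star(before)
print(after)  # ('=', ('COL', 'o_orderdate'), '1992-01-01')
```

After this rewrite the filter looks exactly like the one produced by a query without a nested star select, which is why push down and pruning can then proceed.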
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389368#comment-16389368 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user vvysotskyi commented on the issue: https://github.com/apache/drill/pull/1104

@paul-rogers, as you know, there were some differences between the commits in the older Calcite version (1.4.0-r23) used by Drill and the commits merged into Apache master. One such commit was `[CALCITE-1150] Add dynamic record type and dynamic star for schema-on-read table`. In our older Calcite fork `DYNAMIC_STAR` was `*`, but after the rebase onto Calcite 1.15 it became `**`. As I understand it, `DYNAMIC_STAR` means a "default column" which is added to the table row type when we have a `select *` query and the schema of the underlying table is not known. As for the name of the constant, I think the current `DYNAMIC_STAR` is more suitable, since Calcite uses a similar name for the same string: `DynamicRecordType.DYNAMIC_STAR_PREFIX`. So the current name at least helps to avoid divergence in naming between Drill and Calcite.
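The "default column" idea described in the comment above (a dynamic record type whose fields materialize on first reference when the schema is unknown) can be illustrated with a small hypothetical model. The class and method names here are invented for the sketch; this is not Calcite's `DynamicRecordType` API.

```python
# Hypothetical sketch of the "dynamic record type" concept from
# CALCITE-1150. When a table's schema is not known up front
# (schema-on-read), the row type starts with only a dynamic star
# placeholder column ("**"); any field the query references is added
# on the fly instead of failing with "column not found".
# Names are illustrative, not Calcite's real API.

DYNAMIC_STAR = "**"

class DynamicRowType:
    def __init__(self):
        # Start with only the dynamic star placeholder column.
        self.fields = [DYNAMIC_STAR]

    def field(self, name):
        # Materialize unknown fields on first reference.
        if name not in self.fields:
            self.fields.append(name)
        return name

row_type = DynamicRowType()
row_type.field("o_orderdate")
print(row_type.fields)  # ['**', 'o_orderdate']
```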
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388646#comment-16388646 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1104

I noticed this change when merging with the latest master. What does `DYNAMIC_STAR` mean? The original `WILDCARD` name was more theoretical. How is `DYNAMIC_STAR` different from `STAR` (other than being `**`)? Can we call it `WILDCARD` and `isWildcard()`? The fact that the string is `**`, presumably for internal reasons, seems less worth calling out in the name.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384699#comment-16384699 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1104

Merged into Apache master with commit id 9073aed67d89e8b2188870d6c812706085c9c41b
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384700#comment-16384700 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user arina-ielchiieva closed the pull request at: https://github.com/apache/drill/pull/1104
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375757#comment-16375757 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1104

@amansinha100 rebased and ran pre-commit tests. All good except for 16 expected failures.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375674#comment-16375674 ]

ASF GitHub Bot commented on DRILL-6118:
---
Github user amansinha100 commented on the issue: https://github.com/apache/drill/pull/1104

@arina-ielchiieva pls rebase on latest master.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375185#comment-16375185 ] ASF GitHub Bot commented on DRILL-6118: --- Github user amansinha100 commented on the issue: https://github.com/apache/drill/pull/1104 +1
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367537#comment-16367537 ] ASF GitHub Bot commented on DRILL-6118: --- Github user arina-ielchiieva commented on a diff in the pull request: https://github.com/apache/drill/pull/1104#discussion_r168801629 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterItemStarReWriterRule.java --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.planner.logical; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableSet; +import org.apache.calcite.adapter.enumerable.EnumerableTableScan; +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptRuleOperand; +import org.apache.calcite.plan.RelOptTable; +import org.apache.calcite.prepare.RelOptTableImpl; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.core.CorrelationId; +import org.apache.calcite.rel.core.Filter; +import org.apache.calcite.rel.core.Project; +import org.apache.calcite.rel.core.TableScan; +import org.apache.calcite.rel.logical.LogicalFilter; +import org.apache.calcite.rel.logical.LogicalProject; +import org.apache.calcite.rel.type.RelDataType; +import org.apache.calcite.rel.type.RelDataTypeFactory; +import org.apache.calcite.rel.type.RelDataTypeField; +import org.apache.calcite.rex.RexCall; +import org.apache.calcite.rex.RexInputRef; +import org.apache.calcite.rex.RexNode; +import org.apache.calcite.rex.RexVisitorImpl; +import org.apache.calcite.schema.Table; +import org.apache.drill.exec.planner.types.RelDataTypeDrillImpl; +import org.apache.drill.exec.planner.types.RelDataTypeHolder; +import org.apache.drill.exec.util.Utilities; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import static org.apache.drill.exec.planner.logical.FieldsReWriterUtil.DesiredField; +import static org.apache.drill.exec.planner.logical.FieldsReWriterUtil.FieldsReWriter; + +/** + * Rule will transform filter -> project -> scan call with item star fields in filter + * into project -> filter -> project -> scan where item star fields are pushed into scan + * and replaced with actual field references. + * + * This will help partition pruning and push down rules to detect fields that can be pruned or push downed. 
+ * Item star operator appears when sub-select or cte with star are used as source. + */ +public class DrillFilterItemStarReWriterRule extends RelOptRule { + + public static final DrillFilterItemStarReWriterRule INSTANCE = new DrillFilterItemStarReWriterRule( + RelOptHelper.some(Filter.class, RelOptHelper.some(Project.class, RelOptHelper.any( TableScan.class))), + "DrillFilterItemStarReWriterRule"); + + private DrillFilterItemStarReWriterRule(RelOptRuleOperand operand, String id) { +super(operand, id); + } + + @Override + public void onMatch(RelOptRuleCall call) { +Filter filterRel = call.rel(0); +Project projectRel = call.rel(1); +TableScan scanRel = call.rel(2); + +ItemStarFieldsVisitor itemStarFieldsVisitor = new ItemStarFieldsVisitor(filterRel.getRowType().getFieldNames()); --- End diff -- @chunhui-shi added more unit tests. Please review.
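The transformation the rule's javadoc describes — filter → project → scan rewritten into project → filter → project → scan, with the item star fields named in the filter pushed into the scan — can be sketched with a toy plan model (hypothetical {{Node}} class, not Calcite's {{RelNode}}):

```java
import java.util.ArrayList;
import java.util.List;

public class FilterItemStarSketch {
    // Toy plan node: operator name, the columns it exposes, and its single input.
    record Node(String op, List<String> cols, Node input) {}

    // Rewrite Filter(Project(Scan)) where the filter needs `starFields` resolved
    // from the dynamic star: push the named fields into the scan, filter on them,
    // and add a top project that restores the original output shape.
    static Node rewrite(Node filter, List<String> starFields) {
        Node project = filter.input();
        Node scan = project.input();
        // The scan now reads the concrete columns in addition to `**`.
        List<String> scanCols = new ArrayList<>(scan.cols());
        scanCols.addAll(starFields);
        Node newScan = new Node("Scan", scanCols, null);
        Node newProject = new Node("Project", scanCols, newScan);
        Node newFilter = new Node("Filter", scanCols, newProject);
        return new Node("Project", project.cols(), newFilter); // original shape on top
    }

    public static void main(String[] args) {
        Node scan = new Node("Scan", List.of("**"), null);
        Node project = new Node("Project", List.of("**"), scan);
        Node filter = new Node("Filter", List.of("**"), project);
        // Top-down shape after rewrite: Project -> Filter -> Project -> Scan
        for (Node n = rewrite(filter, List.of("o_orderdate")); n != null; n = n.input()) {
            System.out.println(n.op() + " " + n.cols());
        }
    }
}
```

With {{o_orderdate}} visible as a scan column, the existing filter push-down and pruning rules can then match it the same way they match a directly projected column.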
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353918#comment-16353918 ] ASF GitHub Bot commented on DRILL-6118: --- Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/1104#discussion_r166310161 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java --- @@ -596,10 +596,10 @@ private void classifyExpr(final NamedExpression ex, final RecordBatch incoming, final NameSegment ref = ex.getRef().getRootSegment(); final boolean exprHasPrefix = expr.getPath().contains(StarColumnHelper.PREFIX_DELIMITER); final boolean refHasPrefix = ref.getPath().contains(StarColumnHelper.PREFIX_DELIMITER); -final boolean exprIsStar = expr.getPath().equals(SchemaPath.WILDCARD); -final boolean refContainsStar = ref.getPath().contains(SchemaPath.WILDCARD); -final boolean exprContainsStar = expr.getPath().contains(SchemaPath.WILDCARD); -final boolean refEndsWithStar = ref.getPath().endsWith(SchemaPath.WILDCARD); +final boolean exprIsStar = expr.getPath().equals(SchemaPath.DYNAMIC_STAR); --- End diff -- This change became required after Calcite update. With the changes in CALCITE-1150, `*` is replaced by `**` after a query is parsed and `**` is added to the RowType. Therefore WILDCARD can't come from the plan and its usage should be replaced by `**`.
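The point made above about CALCITE-1150 — after parsing, the star reaches the plan as the dynamic star {{**}}, so plan-time checks against the old {{*}} wildcard can never match — can be illustrated with a minimal sketch (constant names mirror the diff; the values follow the comment, not Drill's actual {{SchemaPath}} source):

```java
public class StarCheck {
    // Per the review comment: `*` is the parse-time wildcard, `**` is the
    // dynamic star that CALCITE-1150 places into the RowType after parsing.
    static final String WILDCARD = "*";
    static final String DYNAMIC_STAR = "**";

    // Plan-time check updated to match what the plan actually carries.
    static boolean matchesPlanStar(String pathFromPlan) {
        return pathFromPlan.equals(DYNAMIC_STAR);
    }

    public static void main(String[] args) {
        String columnFromPlan = DYNAMIC_STAR;                // what the plan carries
        System.out.println(columnFromPlan.equals(WILDCARD)); // false: old WILDCARD check misses
        System.out.println(matchesPlanStar(columnFromPlan)); // true: DYNAMIC_STAR check matches
    }
}
```

This is why the diff replaces every {{SchemaPath.WILDCARD}} comparison in {{classifyExpr}} with {{SchemaPath.DYNAMIC_STAR}}.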
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352838#comment-16352838 ] ASF GitHub Bot commented on DRILL-6118: --- Github user chunhui-shi commented on a diff in the pull request: https://github.com/apache/drill/pull/1104#discussion_r166094020 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterItemStarReWriterRule.java --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.planner.logical; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableSet; +import org.apache.calcite.adapter.enumerable.EnumerableTableScan; +import org.apache.calcite.plan.RelOptRule; +import org.apache.calcite.plan.RelOptRuleCall; +import org.apache.calcite.plan.RelOptRuleOperand; +import org.apache.calcite.plan.RelOptTable; +import org.apache.calcite.prepare.RelOptTableImpl; +import org.apache.calcite.rel.RelNode; +import org.apache.calcite.rel.core.CorrelationId; +import org.apache.calcite.rel.core.Filter; +import org.apache.calcite.rel.core.Project; +import org.apache.calcite.rel.core.TableScan; +import org.apache.calcite.rel.logical.LogicalFilter; +import org.apache.calcite.rel.logical.LogicalProject; +import org.apache.calcite.rel.type.RelDataType; +import org.apache.calcite.rel.type.RelDataTypeFactory; +import org.apache.calcite.rel.type.RelDataTypeField; +import org.apache.calcite.rex.RexCall; +import org.apache.calcite.rex.RexInputRef; +import org.apache.calcite.rex.RexNode; +import org.apache.calcite.rex.RexVisitorImpl; +import org.apache.calcite.schema.Table; +import org.apache.drill.exec.planner.types.RelDataTypeDrillImpl; +import org.apache.drill.exec.planner.types.RelDataTypeHolder; +import org.apache.drill.exec.util.Utilities; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import static org.apache.drill.exec.planner.logical.FieldsReWriterUtil.DesiredField; +import static org.apache.drill.exec.planner.logical.FieldsReWriterUtil.FieldsReWriter; + +/** + * Rule will transform filter -> project -> scan call with item star fields in filter + * into project -> filter -> project -> scan where item star fields are pushed into scan + * and replaced with actual field references. + * + * This will help partition pruning and push down rules to detect fields that can be pruned or push downed. 
+ * Item star operator appears when sub-select or cte with star are used as source. + */ +public class DrillFilterItemStarReWriterRule extends RelOptRule { + + public static final DrillFilterItemStarReWriterRule INSTANCE = new DrillFilterItemStarReWriterRule( + RelOptHelper.some(Filter.class, RelOptHelper.some(Project.class, RelOptHelper.any( TableScan.class))), + "DrillFilterItemStarReWriterRule"); + + private DrillFilterItemStarReWriterRule(RelOptRuleOperand operand, String id) { +super(operand, id); + } + + @Override + public void onMatch(RelOptRuleCall call) { +Filter filterRel = call.rel(0); +Project projectRel = call.rel(1); +TableScan scanRel = call.rel(2); + +ItemStarFieldsVisitor itemStarFieldsVisitor = new ItemStarFieldsVisitor(filterRel.getRowType().getFieldNames()); --- End diff -- Other test cases that should be covered are: nested field names, references to two different fields under the same parent (e.g. a.b and a.c), and array types referred to in filters and projects.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352837#comment-16352837 ] ASF GitHub Bot commented on DRILL-6118: --- Github user chunhui-shi commented on a diff in the pull request: https://github.com/apache/drill/pull/1104#discussion_r166066830 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java --- @@ -596,10 +596,10 @@ private void classifyExpr(final NamedExpression ex, final RecordBatch incoming, final NameSegment ref = ex.getRef().getRootSegment(); final boolean exprHasPrefix = expr.getPath().contains(StarColumnHelper.PREFIX_DELIMITER); final boolean refHasPrefix = ref.getPath().contains(StarColumnHelper.PREFIX_DELIMITER); -final boolean exprIsStar = expr.getPath().equals(SchemaPath.WILDCARD); -final boolean refContainsStar = ref.getPath().contains(SchemaPath.WILDCARD); -final boolean exprContainsStar = expr.getPath().contains(SchemaPath.WILDCARD); -final boolean refEndsWithStar = ref.getPath().endsWith(SchemaPath.WILDCARD); +final boolean exprIsStar = expr.getPath().equals(SchemaPath.DYNAMIC_STAR); --- End diff -- Why don't we need to handle WILDCARD case anymore?
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346884#comment-16346884 ] ASF GitHub Bot commented on DRILL-6118: --- Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1104 @chunhui-shi please review.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346883#comment-16346883 ] ASF GitHub Bot commented on DRILL-6118: --- GitHub user arina-ielchiieva opened a pull request: https://github.com/apache/drill/pull/1104

DRILL-6118: Handle item star columns during project / filter push down and directory pruning

1. Added DrillFilterItemStarReWriterRule to re-write item star fields to regular field references.
2. Refactored DrillPushProjectIntoScanRule to handle item star fields, factored out helper classes and methods from PrelUtil.class.
3. Fixed issue with dynamic star usage (after Calcite upgrade old usage of star was still present, replaced WILDCARD -> DYNAMIC_STAR for clarity).
4. Added unit tests to check project / filter push down and directory pruning with item star.

Details in [DRILL-6118](https://issues.apache.org/jira/browse/DRILL-6118). You can merge this pull request into a Git repository by running:

$ git pull https://github.com/arina-ielchiieva/drill DRILL-6118

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1104.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1104

commit 4673bfb593ca6422d58fa9e0e6eb281a69f1ed69
Author: Arina Ielchiieva
Date: 2017-12-21T17:31:00Z

DRILL-6118: Handle item star columns during project / filter push down and directory pruning
1. Added DrillFilterItemStarReWriterRule to re-write item star fields to regular field references.
2. Refactored DrillPushProjectIntoScanRule to handle item star fields, factored out helper classes and methods from PrelUtil.class.
3. Fixed issue with dynamic star usage (after Calcite upgrade old usage of star was still present, replaced WILDCARD -> DYNAMIC_STAR for clarity).
4. Added unit tests to check project / filter push down and directory pruning with item star.
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345320#comment-16345320 ] Arina Ielchiieva commented on DRILL-6118: - Yes, Paul, you are right: this Jira will cover only plan-time work.
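The plan-time rewrite discussed above can be illustrated with a small sketch: the idea is to replace _ITEM($0, 'column_name')_, where $0 is a star, with a direct reference to the named column, so that filter push down and directory pruning see a concrete column instead of an opaque ITEM call. The following is a hedged Python illustration of that transformation on a toy expression tree; it is not Drill's actual {{DrillFilterItemStarReWriterRule}} (which is a Calcite rule written in Java), and all class names here are hypothetical.

```python
# Illustrative sketch only: rewriting ITEM(star, 'col') into a direct
# column reference so push-down logic can recognize the column.
# These node classes are hypothetical, not Drill's actual RexNode types.
from dataclasses import dataclass


@dataclass
class ColumnRef:
    """A concrete column reference, e.g. o_orderdate."""
    name: str


@dataclass
class Star:
    """The dynamically expanded '*' column ($0 in the plans above)."""


@dataclass
class ItemCall:
    """ITEM(input, 'field'), as produced for queries over a star sub-select."""
    input: object
    field: str


def rewrite(expr):
    """Recursively rewrite ITEM(star, 'col') -> ColumnRef('col')."""
    if isinstance(expr, ItemCall):
        inner = rewrite(expr.input)
        if isinstance(inner, Star):
            # The key rewrite: the ITEM call over a star becomes
            # an ordinary column reference.
            return ColumnRef(expr.field)
        return ItemCall(inner, expr.field)
    return expr


# ITEM($0, 'o_orderdate') where $0 is a star:
print(rewrite(ItemCall(Star(), "o_orderdate")))  # ColumnRef(name='o_orderdate')
```

After this rewrite, the filter condition in the second plan above would reference `o_orderdate` directly, making it equivalent to the first plan, where push down prunes the scan to one file.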
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345307#comment-16345307 ] Paul Rogers commented on DRILL-6118: This ticket discusses plan-time work. Please note that a complete runtime projection implementation already exists in the "batch handling" project; it would be wonderful to reuse/leverage that work. Please see [this writeup|https://github.com/paul-rogers/drill/wiki/BH-Projection-Framework]. Although not covered in this JIRA, the projection code is also designed to better handle file metadata (AKA "implicit") columns so we don't create them unnecessarily. But the planner has to recognize such columns as special, and not combine them with the wildcard at plan time.