[jira] [Updated] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning

2018-02-20 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6118:

Labels: doc-impacting ready-to-commit  (was: doc-impacting)

> Handle item star columns during project  /  filter push down and directory 
> pruning
> --
>
> Key: DRILL-6118
> URL: https://issues.apache.org/jira/browse/DRILL-6118
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.12.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.13.0
>
>
> Project push down, filter push down and partition pruning do not work with a 
> dynamically expanded column which is represented as a star in the ITEM operator: 
> _ITEM($0, 'column_name')_ where $0 is a star.
>  This often occurs when a view, sub-select or CTE with star is issued.
>  To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which 
> will rewrite such ITEM operators before filter push down and directory 
> pruning. The logic for project into scan push down will be handled separately in 
> the already existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can 
> consider the following queries the same: 
>  {{select col1 from t}}
>  {{select col1 from (select * from t)}}
> *Use cases*
> Since item star columns were not considered during project / filter push 
> down and directory pruning, push down and pruning did not happen. This caused 
> Drill to read all columns from a file (when only several are needed) or to 
> read all files instead. Views with star queries are the most common example. 
> Such behavior significantly degrades performance for item star queries 
> compared to queries without item star.
> *EXAMPLES*
> *Data set* 
> The following statements create a table with three files, each in a dedicated sub-folder:
> {noformat}
> use dfs.tmp;
> create table `order_ctas/t1` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-01' and date '1992-01-03';
> create table `order_ctas/t2` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-04' and date '1992-01-06';
> create table `order_ctas/t3` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-07' and date '1992-01-09';
> {noformat}
> *Filter push down*
> {{select * from order_ctas where o_orderdate = date '1992-01-01'}} will read 
> only one file
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02Project(T1¦¦**=[$0])
> 00-03  SelectionVectorRemover
> 00-04Filter(condition=[=($1, 1992-01-01)])
> 00-05  Project(T1¦¦**=[$0], o_orderdate=[$1])
> 00-06Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], 
> selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where o_orderdate = date 
> '1992-01-01'}} will read all three files
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02SelectionVectorRemover
> 00-03  Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
> 00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> *Directory pruning*
> {{select * from order_ctas where dir0 = 't1'}} will read data only from one 
> folder
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02Project(**=[$0])
> 00-03  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmporder_ctas, 
> numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where dir0 = 't1'}} will read 
> content of all three folders
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02SelectionVectorRemover
> 00-03  Filter(condition=[=(ITEM($0, 'dir0'), 't1')])
> 00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=3, numRowGroups=3, usedMetadataFile=false,

[jira] [Updated] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning

2018-02-19 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6118:

Description: 
Project push down, filter push down and partition pruning do not work with a 
dynamically expanded column which is represented as a star in the ITEM operator: 
_ITEM($0, 'column_name')_ where $0 is a star.
 This often occurs when a view, sub-select or CTE with star is issued.
 To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which 
will rewrite such ITEM operators before filter push down and directory pruning. 
The logic for project into scan push down will be handled separately in the already 
existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can consider the 
following queries the same: 
 {{select col1 from t}}
 {{select col1 from (select * from t)}}
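For illustration only, one can compare how Drill plans both forms with {{EXPLAIN PLAN FOR}} 
(a sketch; {{t}} and {{col1}} are placeholder names, not part of the data set below). Before 
the fix the second form keeps an ITEM expression over the star column in its plan instead of 
a plain column reference:
{noformat}
explain plan for select col1 from t;
explain plan for select col1 from (select * from t);
{noformat}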

*Use cases*
Since item star columns were not considered during project / filter push down 
and directory pruning, push down and pruning did not happen. This caused Drill 
to read all columns from a file (when only several are needed) or to read all 
files instead. Views with star queries are the most common example. Such behavior 
significantly degrades performance for item star queries compared to queries 
without item star.

*EXAMPLES*

*Data set* 
The following statements create a table with three files, each in a dedicated sub-folder:
{noformat}
use dfs.tmp;
create table `order_ctas/t1` as select cast(o_orderdate as date) as o_orderdate 
from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-01' and 
date '1992-01-03';
create table `order_ctas/t2` as select cast(o_orderdate as date) as o_orderdate 
from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-04' and 
date '1992-01-06';
create table `order_ctas/t3` as select cast(o_orderdate as date) as o_orderdate 
from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-07' and 
date '1992-01-09';
{noformat}
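As a quick sanity check (a sketch, assuming the CTAS statements above succeeded and 
{{dfs.tmp}} is still the active schema), the three sub-folders are visible through Drill's 
implicit {{dir0}} column:
{noformat}
-- expected to return the sub-folders t1, t2 and t3 with their row counts
select dir0, count(*) as cnt
from order_ctas
group by dir0;
{noformat}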


*Filter push down*
{{select * from order_ctas where o_orderdate = date '1992-01-01'}} will read 
only one file
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02Project(T1¦¦**=[$0])
00-03  SelectionVectorRemover
00-04Filter(condition=[=($1, 1992-01-01)])
00-05  Project(T1¦¦**=[$0], o_orderdate=[$1])
00-06Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], 
selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`**`]]])

{noformat}
{{select * from (select * from order_ctas) where o_orderdate = date 
'1992-01-01'}} will read all three files
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02SelectionVectorRemover
00-03  Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
{noformat}
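The CTE form mentioned above hits the same issue; a sketch against the same {{order_ctas}} 
data set which, before the rewrite rule, would be expected to plan like the sub-select 
variant (filter on {{ITEM($0, 'o_orderdate')}}, all three files scanned):
{noformat}
with o as (select * from order_ctas)
select * from o where o_orderdate = date '1992-01-01';
{noformat}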

*Directory pruning*
{{select * from order_ctas where dir0 = 't1'}} will read data only from one 
folder
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02Project(**=[$0])
00-03  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmporder_ctas, 
numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
{noformat}

{{select * from (select * from order_ctas) where dir0 = 't1'}} will read 
content of all three folders
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02SelectionVectorRemover
00-03  Filter(condition=[=(ITEM($0, 'dir0'), 't1')])
00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
{noformat}
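Since views with star queries are the most common way to hit this, here is the same 
directory pruning case phrased through a view (a sketch; the view name {{order_ctas_v}} is 
made up for illustration and {{dfs.tmp}} must be a writable workspace). Before the fix it 
would be expected to behave like the sub-select example above and scan all three folders:
{noformat}
create view dfs.tmp.order_ctas_v as select * from order_ctas;
select * from dfs.tmp.order_ctas_v where dir0 = 't1';
{noformat}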

*Project into Scan push down*
{{select o_orderdate, count(1) from order_ctas group by o_orderdate}} will 
read only one column from the files
{noformat}
00-00Screen
00-01  Project(o_orderdate=[$0], EXPR$1=[$1])
00-02HashAgg(group=[{0}], EXPR$1=[COUNT()])
00-03  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`o_orderdate`]]])
{noformat}

{{select o_orderdate, count(1) from (select * from order_ctas) group by 
o_orderdate}} will read all columns from the fi

[jira] [Updated] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning

2018-01-30 Thread Pritesh Maker (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6118:
-
Reviewer: Chunhui Shi

> Handle item star columns during project  /  filter push down and directory 
> pruning
> --
>
> Key: DRILL-6118
> URL: https://issues.apache.org/jira/browse/DRILL-6118
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.12.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.13.0
>
>
> Project push down, filter push down and partition pruning do not work with a 
> dynamically expanded column which is represented as a star in the ITEM operator: 
> _ITEM($0, 'column_name')_ where $0 is a star.
>  This often occurs when a view, sub-select or CTE with star is issued.
>  To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which 
> will rewrite such ITEM operators before filter push down and directory 
> pruning. The logic for project into scan push down will be handled separately in 
> the already existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can 
> consider the following queries the same: 
>  {{select col1 from t}}
>  {{select col1 from (select * from t)}}
> *Use cases*
> Since item star columns were not considered during project / filter push 
> down and directory pruning, push down and pruning did not happen. This caused 
> Drill to read all columns from a file (when only several are needed) or to 
> read all files instead. Views with star queries are the most common example. 
> Such behavior significantly degrades performance for item star queries 
> compared to queries without item star.
> *EXAMPLES*
> *Data set* 
> The following statements create a table with three files, each in a dedicated sub-folder:
> {noformat}
> use dfs.tmp;
> create table `order_ctas/t1` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-01' and date '1992-01-03';
> create table `order_ctas/t2` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-04' and date '1992-01-06';
> create table `order_ctas/t3` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-07' and date '1992-01-09';
> {noformat}
> *Filter push down*
> {{select * from order_ctas where o_orderdate = date '1992-01-01'}} will read 
> only one file
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02Project(T1¦¦**=[$0])
> 00-03  SelectionVectorRemover
> 00-04Filter(condition=[=($1, 1992-01-01)])
> 00-05  Project(T1¦¦**=[$0], o_orderdate=[$1])
> 00-06Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], 
> selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where o_orderdate = date 
> '1992-01-01'}} will read all three files
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02SelectionVectorRemover
> 00-03  Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
> 00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> *Directory pruning*
> {{select * from order_ctas where dir0 = 't1'}} will read data only from one 
> folder
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02Project(**=[$0])
> 00-03  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmporder_ctas, 
> numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where dir0 = 't1'}} will read 
> content of all three folders
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02SelectionVectorRemover
> 00-03  Filter(condition=[=(ITEM($0, 'dir0'), 't1')])
> 00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> *Project into Scan push dow

[jira] [Updated] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning

2018-01-30 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6118:

Description: 
Project push down, filter push down and partition pruning do not work with a 
dynamically expanded column which is represented as a star in the ITEM operator: 
_ITEM($0, 'column_name')_ where $0 is a star.
 This often occurs when a view, sub-select or CTE with star is issued.
 To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which 
will rewrite such ITEM operators before filter push down and directory pruning. 
The logic for project into scan push down will be handled separately in the already 
existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can consider the 
following queries the same: 
 {{select col1 from t}}
 {{select col1 from (select * from t)}}

*Use cases*
Since item star columns were not considered during project / filter push down 
and directory pruning, push down and pruning did not happen. This caused Drill 
to read all columns from a file (when only several are needed) or to read all 
files instead. Views with star queries are the most common example. Such behavior 
significantly degrades performance for item star queries compared to queries 
without item star.

*EXAMPLES*

*Data set* 
The following statements create a table with three files, each in a dedicated sub-folder:
{noformat}
use dfs.tmp;
create table `order_ctas/t1` as select cast(o_orderdate as date) as o_orderdate 
from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-01' and 
date '1992-01-03';
create table `order_ctas/t2` as select cast(o_orderdate as date) as o_orderdate 
from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-04' and 
date '1992-01-06';
create table `order_ctas/t3` as select cast(o_orderdate as date) as o_orderdate 
from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-07' and 
date '1992-01-09';
{noformat}


*Filter push down*
{{select * from order_ctas where o_orderdate = date '1992-01-01'}} will read 
only one file
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02Project(T1¦¦**=[$0])
00-03  SelectionVectorRemover
00-04Filter(condition=[=($1, 1992-01-01)])
00-05  Project(T1¦¦**=[$0], o_orderdate=[$1])
00-06Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], 
selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`**`]]])

{noformat}
{{select * from (select * from order_ctas) where o_orderdate = date 
'1992-01-01'}} will read all three files
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02SelectionVectorRemover
00-03  Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
{noformat}

*Directory pruning*
{{select * from order_ctas where dir0 = 't1'}} will read data only from one 
folder
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02Project(**=[$0])
00-03  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmporder_ctas, 
numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
{noformat}

{{select * from (select * from order_ctas) where dir0 = 't1'}} will read 
content of all three folders
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02SelectionVectorRemover
00-03  Filter(condition=[=(ITEM($0, 'dir0'), 't1')])
00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
{noformat}

*Project into Scan push down*
{{select o_orderdate, count(1) from order_ctas group by o_orderdate}} will 
read only one column from the files
{noformat}
00-00Screen
00-01  Project(o_orderdate=[$0], EXPR$1=[$1])
00-02HashAgg(group=[{0}], EXPR$1=[COUNT()])
00-03  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`o_orderdate`]]])
{noformat}

{{select o_orderdate, count(1) from (select * from order_ctas) group by 
o_orderdate}} will read all columns from the fi

[jira] [Updated] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning

2018-01-30 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6118:

Description: 
Project push down, filter push down and partition pruning do not work with a 
dynamically expanded column which is represented as a star in the ITEM operator: 
_ITEM($0, 'column_name')_ where $0 is a star.
 This often occurs when a view, sub-select or CTE with star is issued.
 To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which 
will rewrite such ITEM operators before filter push down and directory pruning. 
The logic for project into scan push down will be handled separately in the already 
existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can consider the 
following queries the same: 
 {{select col1 from t}}
 {{select col1 from (select * from t)}}

*Use cases*
Since item star columns were not considered during project / filter push down 
and directory pruning, push down and pruning did not happen. This caused Drill 
to read all columns from a file (when only several are needed) or to read all 
files instead. Views with star queries are the most common example. 

*EXAMPLES*

*Data set* 
The following statements create a table with three files, each in a dedicated sub-folder:
{noformat}
use dfs.tmp;
create table `order_ctas/t1` as select cast(o_orderdate as date) as o_orderdate 
from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-01' and 
date '1992-01-03';
create table `order_ctas/t2` as select cast(o_orderdate as date) as o_orderdate 
from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-04' and 
date '1992-01-06';
create table `order_ctas/t3` as select cast(o_orderdate as date) as o_orderdate 
from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-07' and 
date '1992-01-09';
{noformat}


*Filter push down*
{{select * from order_ctas where o_orderdate = date '1992-01-01'}} will read 
only one file
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02Project(T1¦¦**=[$0])
00-03  SelectionVectorRemover
00-04Filter(condition=[=($1, 1992-01-01)])
00-05  Project(T1¦¦**=[$0], o_orderdate=[$1])
00-06Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], 
selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`**`]]])

{noformat}
{{select * from (select * from order_ctas) where o_orderdate = date 
'1992-01-01'}} will read all three files
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02SelectionVectorRemover
00-03  Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
{noformat}

*Directory pruning*
{{select * from order_ctas where dir0 = 't1'}} will read data only from one 
folder
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02Project(**=[$0])
00-03  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmporder_ctas, 
numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
{noformat}

{{select * from (select * from order_ctas) where dir0 = 't1'}} will read 
content of all three folders
{noformat}
00-00Screen
00-01  Project(**=[$0])
00-02SelectionVectorRemover
00-03  Filter(condition=[=(ITEM($0, 'dir0'), 't1')])
00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
{noformat}

*Project into Scan push down*
{{select o_orderdate, count(1) from order_ctas group by o_orderdate}} will 
read only one column from the files
{noformat}
00-00Screen
00-01  Project(o_orderdate=[$0], EXPR$1=[$1])
00-02HashAgg(group=[{0}], EXPR$1=[COUNT()])
00-03  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
[path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`o_orderdate`]]])
{noformat}

{{select o_orderdate, count(1) from (select * from order_ctas) group by 
o_orderdate}} will read all columns from the files
{noformat}
  00-00Screen
00-01  Project(col_vrchr=[$0], EXPR$1=[$1])
00-02StreamAgg(group=[{

[jira] [Updated] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning

2018-01-30 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6118:

Labels: doc-impacting  (was: )

> Handle item star columns during project  /  filter push down and directory 
> pruning
> --
>
> Key: DRILL-6118
> URL: https://issues.apache.org/jira/browse/DRILL-6118
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.12.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.13.0
>
>
> Project push down, filter push down and partition pruning do not work with a 
> dynamically expanded column which is represented as a star in the ITEM operator: 
> _ITEM($0, 'column_name')_ where $0 is a star.
>  This often occurs when a view, sub-select or CTE with star is issued.
>  To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which 
> will rewrite such ITEM operators before filter push down and directory 
> pruning. The logic for project into scan push down will be handled separately in 
> the already existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can 
> consider the following queries the same: 
>  {{select col1 from t}}
>  {{select col1 from (select * from t)}}
> *Use cases*
> Since item star columns were not considered during project / filter push 
> down and directory pruning, push down and pruning did not happen. This caused 
> Drill to read all columns from a file (when only several are needed) or to 
> read all files instead. Views with star queries are the most common example. 
> *EXAMPLES*
> *Data set* 
> The following statements create a table with three files, each in a dedicated sub-folder:
> {noformat}
> use dfs.tmp;
> create table `order_ctas/t1` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-01' and date '1992-01-03';
> create table `order_ctas/t2` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-04' and date '1992-01-06';
> create table `order_ctas/t3` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-07' and date '1992-01-09';
> {noformat}
> *Filter push down*
> {{select * from order_ctas where o_orderdate = date '1992-01-01'}} will read 
> only one file
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02Project(T1¦¦**=[$0])
> 00-03  SelectionVectorRemover
> 00-04Filter(condition=[=($1, 1992-01-01)])
> 00-05  Project(T1¦¦**=[$0], o_orderdate=[$1])
> 00-06Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], 
> selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where o_orderdate = date 
> '1992-01-01'}} will read all three files
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02SelectionVectorRemover
> 00-03  Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
> 00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> *Directory pruning*
> {{select * from order_ctas where dir0 = 't1'}} will read data only from one 
> folder
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02Project(**=[$0])
> 00-03  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmporder_ctas, 
> numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where dir0 = 't1'}} will read 
> content of all three folders
> {noformat}
> 00-00Screen
> 00-01  Project(**=[$0])
> 00-02SelectionVectorRemover
> 00-03  Filter(condition=[=(ITEM($0, 'dir0'), 't1')])
> 00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> *Project into Scan push down*
> {{select o_orderdate, count(1) from order_ctas group by o_orderdate}} will 
> read only one col

[jira] [Updated] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning

2018-01-30 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6118:

Description: 
Project push down, filter push down and partition pruning do not work with a 
dynamically expanded column which is represented as a star in the ITEM operator: 
_ITEM($0, 'column_name')_ where $0 is a star.
 This often occurs when a view, sub-select or CTE with star is issued.
 To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which 
will rewrite such ITEM operators before filter push down and directory pruning. 
The logic for project into scan push down will be handled separately in the already 
existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can consider the 
following queries the same: 
 {{select col1 from t}}
 {{select col1 from (select * from t)}}

*Use cases*
Since item star columns were not considered during 
 

Examples:

*Filter push down*

 

  was:
Project push down, filter push down and partition pruning do not work with a 
dynamically expanded column which is represented as a star in the ITEM operator: 
_ITEM($0, 'column_name')_ where $0 is a star.
This often occurs when a view, sub-select or CTE with star is issued.
To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which 
will rewrite such ITEM operators before filter push down and directory pruning. 
The logic for project into scan push down will be handled separately in the already 
existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can consider the 
following queries the same: 
{{select col1 from t}}
{{select col1 from (select * from t)}}


> Handle item star columns during project  /  filter push down and directory 
> pruning
> --
>
> Key: DRILL-6118
> URL: https://issues.apache.org/jira/browse/DRILL-6118
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.12.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
> Fix For: 1.13.0
>
>
> Project push down, filter push down and partition pruning do not work with a 
> dynamically expanded column which is represented as a star in the ITEM operator: 
> _ITEM($0, 'column_name')_ where $0 is a star.
>  This often occurs when a view, sub-select or CTE with star is issued.
>  To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which 
> will rewrite such ITEM operators before filter push down and directory 
> pruning. The logic for project into scan push down will be handled separately in 
> the already existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can 
> consider the following queries the same: 
>  {{select col1 from t}}
>  {{select col1 from (select * from t)}}
> *Use cases*
> Since item star columns were not considered during 
>  
> Examples:
> *Filter push down*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)