[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caoxuewen updated SPARK-21520:
------------------------------
    Description: 
Currently, the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic projects and filters, so that only the fields the user actually needs are read.

For example, when a project's fields contain a non-deterministic function (the rand function), the optimizer generates this executedPlan:

*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table

HiveTableScan reads all the fields of the table, but only 'd004' is needed. This hurts task performance.

  was:
Currently, the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic projects and filters, so that only the fields the user actually needs are read.


> Improvement a special case for non-deterministic projects and filters in
> optimizer
> ------------------------------------------------------------------------
>
>                 Key: SPARK-21520
>                 URL: https://issues.apache.org/jira/browse/SPARK-21520
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: caoxuewen
>
> Currently, the optimizer does a lot of special handling for non-deterministic
> projects and filters, but it is not good enough. This patch adds a new special
> case for non-deterministic projects and filters, so that only the fields the
> user actually needs are read.
>
> For example, when a project's fields contain a non-deterministic function
> (the rand function), the optimizer generates this executedPlan:
>
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
>    +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table
>
> HiveTableScan reads all the fields of the table, but only 'd004' is needed.
> This hurts task performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
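The pruning the issue asks for can be illustrated with a small self-contained sketch (plain Python, not Spark's Catalyst code; the function names and the string-based column matching are hypothetical simplifications): even though rand() makes the Project non-deterministic, the scan below it only needs the columns the Project's expressions actually reference.

```python
# Hypothetical sketch of the column pruning requested in SPARK-21520.
# NOT Spark's Catalyst implementation -- just the idea: a non-deterministic
# expression like rand() references no table column, so it cannot force the
# scan to stay wide.

def referenced_columns(expr, table_columns):
    """Return the table columns mentioned anywhere in an expression string
    (crude substring matching, for illustration only)."""
    return {c for c in table_columns if c in expr}

def prune_scan(table_columns, project_exprs):
    """Keep only the columns some project expression refers to,
    preserving the table's column order."""
    needed = set()
    for expr in project_exprs:
        needed |= referenced_columns(expr, table_columns)
    return [c for c in table_columns if c in needed]

# The shape of the plan from the issue: many columns in the table,
# but the Project only uses d004 (rand() references no column at all).
table_columns = ["c030", "d004", "d005", "d025", "c010"]  # truncated for brevity
project_exprs = ["d004 AS id", "FLOOR(rand() * 1.0) AS k"]
print(prune_scan(table_columns, project_exprs))  # ['d004']
```

With pruning applied, the HiveTableScan in the plan above would list only d004#607 instead of all 193 fields.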
[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caoxuewen updated SPARK-21520:
------------------------------
    Description: 
Currently, the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic projects and filters, so that only the fields the user actually needs are read.

  was:
Currently, when the rand function is present in a SQL statement, the Hive table scan reads all columns of the table.

e.g.:

select k, k, sum(id)
from (select d004 as id, floor(rand() * 1) as k, ceil(c010) as cceila from XXX_table) a
group by k, k;

This generates the following WholeStageCodegen subtrees:

== Subtree 1 / 2 ==
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table

== Subtree 2 / 2 ==
*HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], output=[k#403L, k#403L, sum(id)#797L])
+- Exchange hashpartitioning(k#403L, 200)
   +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
      +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
         +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table

All columns are scanned in HiveTableScan; consequently, all column data is read from the ORC table.

e.g.:

INFO ReaderImpl: Reading ORC rows from hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with {include: [true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807}

As a result, execution of the SQL statement becomes very slow.


> Improvement a special case for non-deterministic projects and filters in
> optimizer
> ------------------------------------------------------------------------
>
>                 Key: SPARK-21520
>                 URL: https://issues.apache.org/jira/browse/SPARK-21520
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: caoxuewen
>
> Currently, the optimizer does a lot of special handling for non-deterministic
> projects and filters, but it is not good enough. This patch adds a new special
> case for non-deterministic projects and filters, so that only the fields the
> user actually needs are read.
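Why does the scan stay wide in the first place? A simplified model (hypothetical Python, not Catalyst code): many optimizer rewrites are only valid for deterministic expressions, so rules that re-order or merge projections bail out when they see rand(). The premise of this issue is that column pruning neither re-orders nor duplicates evaluation, so it remains safe even under a non-deterministic Project.

```python
# Simplified model of the determinism gate (NOT Spark's Catalyst code).
# Rules that re-order or duplicate expression evaluation must skip plans
# containing non-deterministic expressions. The function-name check below
# is a crude stand-in for Catalyst's Expression.deterministic flag.

NON_DETERMINISTIC_FUNCS = {"rand", "randn", "uuid"}

def is_deterministic(expr):
    """True unless the expression calls a known non-deterministic function."""
    return not any(fn + "(" in expr for fn in NON_DETERMINISTIC_FUNCS)

project_exprs = ["d004 AS id", "FLOOR(rand() * 1.0) AS k", "ceil(c010) AS cceila"]

# A rewrite such as collapsing adjacent Projects checks every expression first:
can_collapse = all(is_deterministic(e) for e in project_exprs)
print(can_collapse)  # False -- the rand() call makes the rule bail out
```

Column pruning needs no such gate, which is why the issue argues it should still fire here.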
[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caoxuewen updated SPARK-21520:
------------------------------
    Summary: Improvement a special case for non-deterministic projects and filters in optimizer  (was: Hivetable scan for all the columns the SQL statement contains the 'rand')

> Improvement a special case for non-deterministic projects and filters in
> optimizer
> ------------------------------------------------------------------------
>
>                 Key: SPARK-21520
>                 URL: https://issues.apache.org/jira/browse/SPARK-21520
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: caoxuewen
>
> Currently, when the rand function is present in a SQL statement, the Hive
> table scan reads all columns of the table.
>
> e.g.:
>
> select k, k, sum(id)
> from (select d004 as id, floor(rand() * 1) as k, ceil(c010) as cceila from XXX_table) a
> group by k, k;
>
> This generates the following WholeStageCodegen subtrees:
>
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
>    +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table
>
> == Subtree 2 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], output=[k#403L, k#403L, sum(id)#797L])
> +- Exchange hashpartitioning(k#403L, 200)
>    +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
>       +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
>          +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table
>
> All columns are scanned in HiveTableScan; consequently, all column data is
> read from the ORC table.
>
> e.g.:
>
> INFO ReaderImpl: Reading ORC rows from hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with {include: [true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807}
>
> As a result, execution of the SQL statement becomes very slow.
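The ReaderImpl log line above shows ORC's "include" mask: one boolean per field of the flattened schema, with index 0 for the root struct, all true because nothing was pruned. A small sketch of how such a mask is built (hypothetical helper, not the real ORC reader API):

```python
# Build an ORC-style "include" mask (hypothetical helper, not the actual ORC
# reader API). Index 0 is the root struct and is always read; indices 1..n
# are the table's columns, 1-based as in ORC's flattened schema.

def orc_include_mask(num_columns, needed_indices):
    mask = [False] * (num_columns + 1)
    mask[0] = True  # root struct is always included
    for i in needed_indices:
        mask[i] = True
    return mask

# Unpruned scan (the shape seen in the log): every flag is true.
assert all(orc_include_mask(120, range(1, 121)))

# Pruned scan: only column 2 (d004 in the example) is actually read.
print(orc_include_mask(5, [2]))  # [True, False, True, False, False, False]
```

With pruning, most flags become False and the ORC reader skips those column streams entirely, which is exactly the speedup this issue is after.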