[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caoxuewen updated SPARK-21520:
------------------------------
    Description: 
Currently, the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic projects and filters, so that only the fields the user actually needs are read.

For example, when a project's fields contain a non-deterministic function (the rand function), the optimizer generates this executedPlan:

*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table

HiveTableScan reads all the fields of the table, but only 'd004' is needed. This hurts task performance.

  was:
Currently, the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic projects and filters, so that only the fields the user actually needs are read.


> Improvement a special case for non-deterministic projects and filters in
> optimizer
> ------------------------------------------------------------------------
>
>                 Key: SPARK-21520
>                 URL: https://issues.apache.org/jira/browse/SPARK-21520
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: caoxuewen
>
> Currently, the optimizer does a lot of special handling for non-deterministic
> projects and filters, but it is not good enough. This patch adds a new special
> case for non-deterministic projects and filters, so that only the fields the
> user actually needs are read.
>
> For example, when a project's fields contain a non-deterministic function
> (the rand function), the optimizer generates this executedPlan:
>
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
>    +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table
>
> HiveTableScan reads all the fields of the table, but only 'd004' is needed.
> This hurts task performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
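The pruning the issue asks for can be illustrated with a small self-contained sketch (plain Python, not Spark's Catalyst code; the function names and the string-based column matching are hypothetical simplifications): even though rand() makes the Project non-deterministic, the scan below it only needs the columns the Project's expressions actually reference.

```python
# Hypothetical sketch of the column pruning requested in SPARK-21520.
# NOT Spark's Catalyst implementation -- just the idea: a non-deterministic
# expression like rand() references no table column, so it cannot force the
# scan to stay wide.

def referenced_columns(expr, table_columns):
    """Return the table columns mentioned anywhere in an expression string
    (crude substring matching, for illustration only)."""
    return {c for c in table_columns if c in expr}

def prune_scan(table_columns, project_exprs):
    """Keep only the columns some project expression refers to,
    preserving the table's column order."""
    needed = set()
    for expr in project_exprs:
        needed |= referenced_columns(expr, table_columns)
    return [c for c in table_columns if c in needed]

# The shape of the plan from the issue: many columns in the table,
# but the Project only uses d004 (rand() references no column at all).
table_columns = ["c030", "d004", "d005", "d025", "c010"]  # truncated for brevity
project_exprs = ["d004 AS id", "FLOOR(rand() * 1.0) AS k"]
print(prune_scan(table_columns, project_exprs))  # ['d004']
```

With pruning applied, the HiveTableScan in the plan above would list only d004#607 instead of all 193 fields.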
[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caoxuewen updated SPARK-21520:
------------------------------
    Description: 
Currently, the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic projects and filters, so that only the fields the user actually needs are read.

  was:
Currently, when the rand function is present in a SQL statement, the Hive table scan reads all columns of the table.

e.g.:

select k, k, sum(id)
from (select d004 as id, floor(rand() * 1) as k, ceil(c010) as cceila from XXX_table) a
group by k, k;

This generates the following WholeStageCodegen subtrees:

== Subtree 1 / 2 ==
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table

== Subtree 2 / 2 ==
*HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], output=[k#403L, k#403L, sum(id)#797L])
+- Exchange hashpartitioning(k#403L, 200)
   +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
      +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
         +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table

All columns are scanned in HiveTableScan; consequently, all column data is read from the ORC table.

e.g.:

INFO ReaderImpl: Reading ORC rows from hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with {include: [true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807}

As a result, execution of the SQL statement becomes very slow.


> Improvement a special case for non-deterministic projects and filters in
> optimizer
> ------------------------------------------------------------------------
>
>                 Key: SPARK-21520
>                 URL: https://issues.apache.org/jira/browse/SPARK-21520
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: caoxuewen
>
> Currently, the optimizer does a lot of special handling for non-deterministic
> projects and filters, but it is not good enough. This patch adds a new special
> case for non-deterministic projects and filters, so that only the fields the
> user actually needs are read.
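Why does the scan stay wide in the first place? A simplified model (hypothetical Python, not Catalyst code): many optimizer rewrites are only valid for deterministic expressions, so rules that re-order or merge projections bail out when they see rand(). The premise of this issue is that column pruning neither re-orders nor duplicates evaluation, so it remains safe even under a non-deterministic Project.

```python
# Simplified model of the determinism gate (NOT Spark's Catalyst code).
# Rules that re-order or duplicate expression evaluation must skip plans
# containing non-deterministic expressions. The function-name check below
# is a crude stand-in for Catalyst's Expression.deterministic flag.

NON_DETERMINISTIC_FUNCS = {"rand", "randn", "uuid"}

def is_deterministic(expr):
    """True unless the expression calls a known non-deterministic function."""
    return not any(fn + "(" in expr for fn in NON_DETERMINISTIC_FUNCS)

project_exprs = ["d004 AS id", "FLOOR(rand() * 1.0) AS k", "ceil(c010) AS cceila"]

# A rewrite such as collapsing adjacent Projects checks every expression first:
can_collapse = all(is_deterministic(e) for e in project_exprs)
print(can_collapse)  # False -- the rand() call makes the rule bail out
```

Column pruning needs no such gate, which is why the issue argues it should still fire here.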
[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caoxuewen updated SPARK-21520:
------------------------------
    Summary: Improvement a special case for non-deterministic projects and filters in optimizer  (was: Hivetable scan for all the columns the SQL statement contains the 'rand')

> Improvement a special case for non-deterministic projects and filters in
> optimizer
> ------------------------------------------------------------------------
>
>                 Key: SPARK-21520
>                 URL: https://issues.apache.org/jira/browse/SPARK-21520
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: caoxuewen
>
> Currently, when the rand function is present in a SQL statement, the Hive
> table scan reads all columns of the table.
>
> e.g.:
>
> select k, k, sum(id)
> from (select d004 as id, floor(rand() * 1) as k, ceil(c010) as cceila from XXX_table) a
> group by k, k;
>
> This generates the following WholeStageCodegen subtrees:
>
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
>    +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table
>
> == Subtree 2 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], output=[k#403L, k#403L, sum(id)#797L])
> +- Exchange hashpartitioning(k#403L, 200)
>    +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L])
>       +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L]
>          +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table
>
> All columns are scanned in HiveTableScan; consequently, all column data is
> read from the ORC table.
>
> e.g.:
>
> INFO ReaderImpl: Reading ORC rows from hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with {include: [true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807}
>
> As a result, execution of the SQL statement becomes very slow.
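The ReaderImpl log line above shows ORC's "include" mask: one boolean per field of the flattened schema, with index 0 for the root struct, all true because nothing was pruned. A small sketch of how such a mask is built (hypothetical helper, not the real ORC reader API):

```python
# Build an ORC-style "include" mask (hypothetical helper, not the actual ORC
# reader API). Index 0 is the root struct and is always read; indices 1..n
# are the table's columns, 1-based as in ORC's flattened schema.

def orc_include_mask(num_columns, needed_indices):
    mask = [False] * (num_columns + 1)
    mask[0] = True  # root struct is always included
    for i in needed_indices:
        mask[i] = True
    return mask

# Unpruned scan (the shape seen in the log): every flag is true.
assert all(orc_include_mask(120, range(1, 121)))

# Pruned scan: only column 2 (d004 in the example) is actually read.
print(orc_include_mask(5, [2]))  # [True, False, True, False, False, False]
```

With pruning, most flags become False and the ORC reader skips those column streams entirely, which is exactly the speedup this issue is after.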