[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2018-01-02 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308995#comment-16308995
 ] 

liyunzhang edited comment on HIVE-17486 at 1/3/18 1:40 AM:
---

[~stakiar]:
The original purpose of changing M->R to M->M->R was to let 
{{CombineEquivalentWorkResolver}} combine identical Maps. For example:
Logical plan:
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
TS[1]-FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}  
Physical plan:
{code}  
Map1:TS[0]
Map2:TS[1]
Map3:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map4:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}
{{CombineEquivalentWorkResolver}} combines identical Maps. In the case above, 
Map2 is removed because TS\[1\] is the same as TS\[0\].  
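
As a rough illustration of the merge described above (not the actual {{CombineEquivalentWorkResolver}} code), the sketch below treats each work as a name plus a hashable signature and folds any work whose signature was already seen into the first one. The function name `combine_equivalent_works`, the signature tuples, and the table name `t` are all assumptions made up for this example.

```python
def combine_equivalent_works(signatures, children):
    """Merge works whose signatures are equal (hypothetical helper).

    signatures: dict mapping work name -> hashable description of its operators.
    children:   dict mapping work name -> list of child work names.
    Returns (kept, merged_children, replaced_by).
    """
    first_with_signature = {}
    replaced_by = {}
    for name, signature in signatures.items():
        if signature in first_with_signature:
            # An equal work was already seen: this one becomes redundant.
            replaced_by[name] = first_with_signature[signature]
        else:
            first_with_signature[signature] = name

    # Re-attach the children of every removed work to the surviving work.
    merged_children = {}
    for name, kids in children.items():
        survivor = replaced_by.get(name, name)
        bucket = merged_children.setdefault(survivor, [])
        for kid in kids:
            kid = replaced_by.get(kid, kid)
            if kid not in bucket:
                bucket.append(kid)

    kept = [name for name in signatures if name not in replaced_by]
    return kept, merged_children, replaced_by


# Assumption: TS[0] and TS[1] scan the same (made-up) table "t",
# so Map1 and Map2 get the same signature.
signatures = {
    "Map1": ("TS", "t"),
    "Map2": ("TS", "t"),
    "Map3": ("FIL[52]", "SEL[2]", "GBY[3]", "RS[4]"),
    "Map4": ("FIL[53]", "SEL[9]", "GBY[10]", "RS[11]"),
}
children = {"Map1": ["Map3"], "Map2": ["Map4"]}
kept, merged_children, replaced_by = combine_equivalent_works(signatures, children)
```

With this input, Map2 is folded into Map1, and Map1 ends up feeding both Map3 and Map4, which is the shape where a cached scan result can be reused.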

But after finishing the code, I found that it is not necessary to combine 
TS\[0\] and TS\[1\] this way. {{MapInput}} is responsible for the TS, so I 
only need to generate the same MapInput for TS\[0\] and TS\[1\]. See 
HIVE-17486.5.patch for more detail.





> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang
> Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, 
> HIVE-17486.3.patch, HIVE-17486.4.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> HIVE-16602 implemented shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. The optimization is carried out at the 
> physical level. Hive on Spark caches the result of a Spark work if that 
> work is used by more than 1 child Spark work. After SharedWorkOptimizer 
> is enabled in the physical plan in HoS, identical table scans are merged into 1 
> table scan. The result of this table scan is used by more than 1 child Spark 
> work, so thanks to the cache mechanism the same computation need not be repeated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-15 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292203#comment-16292203
 ] 

liyunzhang edited comment on HIVE-17486 at 12/15/17 9:15 AM:
-

[~lirui] and [~xuefuz]: I have updated HIVE-17486.2.patch and redesigned the 
approach. A case similar to DS/query28.sql now runs successfully.
This shows that the current design (M-M-R) can use the RDD cache to reduce 
table scans.
The latest design doc includes a simple case. I have added that simple case 
as a test (spark_optimize_shared_work.q), but an exception occurs when running 
the qtest in my local environment. Once I fix the problem, I will trigger 
Hive QA to test.





[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-10 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285558#comment-16285558
 ] 

liyunzhang edited comment on HIVE-17486 at 12/11/17 6:41 AM:
-

[~xuefuz]
{quote}  It seems that we can do so whenever an TS is connected to multiple 
RSs. The split point should happen at the fork. {quote}  I do not fully 
understand this. Currently the split is at the TS, 
for example:
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
-FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}

->
{code}  
Map1: TS[0]
Map2:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map3:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}
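
The split shown above can be sketched roughly as follows. This is only an illustration: `split_at_table_scan` and the list-of-strings representation of operators are hypothetical and are not code from the patch. The shared TS becomes its own Map, each branch up to its first RS becomes a further Map, and the remaining operators of each branch go to a Reducer (an operator already placed, such as JOIN\[48\], is not repeated).

```python
def split_at_table_scan(scan, branches):
    """Split a forked operator tree at the table scan (hypothetical helper).

    scan:     name of the shared TS operator.
    branches: one operator list per branch that hangs off the TS.
    Returns a dict mapping work name -> list of operators.
    """
    works = {"Map1": [scan]}
    reducers = {}
    placed = {scan}
    for i, branch in enumerate(branches, start=1):
        # The map side of a branch ends at its first ReduceSink (RS).
        cut = next(k for k, op in enumerate(branch) if op.startswith("RS[")) + 1
        works[f"Map{i + 1}"] = branch[:cut]
        placed.update(branch[:cut])
        # The rest goes to a reducer, skipping operators another work owns.
        rest = [op for op in branch[cut:] if op not in placed]
        placed.update(rest)
        reducers[f"Reducer{i}"] = rest
    works.update(reducers)
    return works


branches = [
    ["FIL[52]", "SEL[2]", "GBY[3]", "RS[4]", "GBY[5]", "RS[42]",
     "JOIN[48]", "SEL[49]", "LIM[50]", "FS[51]"],
    ["FIL[53]", "SEL[9]", "GBY[10]", "RS[11]", "GBY[12]", "RS[43]", "JOIN[48]"],
]
plan = split_at_table_scan("TS[0]", branches)
```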


was (Author: kellyzly):
[~xuefuz]
{quote}  It seems that we can do so whenever an TS is connected to multiple 
RSs. The split point should happen at the fork. {quote}  not very understand 
about this. Please explain more, thanks!



[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-11-01 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235125#comment-16235125
 ] 

liyunzhang edited comment on HIVE-17486 at 11/2/17 3:35 AM:


[~lirui]:
{quote}
My understanding is HoS also supports one Map connecting to multiple Reducers 
{quote}
In HoS, a Map can contain only 1 RS. It is true that in some cases 1 Map is 
used by two Reducers in HoS. But in HoT, 1 Map may contain 2 RSs, and those 
2 RSs can send different data to 2 different 
Reducers. 
{quote}
The problem here is HoS doesn't merge equivalent works as aggressively as HoT 
does. 
{quote}
Yes.



> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang
> Assignee: liyunzhang
> Priority: Major
> Attachments: scanshare.after.svg, scanshare.before.svg