[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308995#comment-16308995 ]

liyunzhang edited comment on HIVE-17486 at 1/3/18 1:40 AM:
-----------------------------------------------------------

[~stakiar]: the original purpose of changing M->R to M->M->R was to let CombineEquivalentWorkResolver combine identical Maps. For example, the logical plan
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
TS[1]-FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}
becomes the physical plan
{code}
Map1:TS[0]
Map2:TS[1]
Map3:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map4:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}
{{CombineEquivalentWorkResolver}} combines identical Maps: in the case above, Map2 is removed because TS[0] is identical to TS[1]. But after finishing the code, I found there is no need to combine TS[0] and TS[1] this way. {{MapInput}} is responsible for the TS, so I only need to generate the same MapInput for TS[0] and TS[1]. For more detail, see HIVE-17486.5.patch.
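The MapInput-sharing idea above can be sketched as follows. This is a toy illustration in plain Python, not Hive's actual code; the class and method names are made up for the example. The point is that if identical table scans are handed the same MapInput, the underlying data is read only once.

```python
# Toy sketch of the final approach in HIVE-17486.5.patch: instead of splitting
# works, hand out one shared MapInput per distinct table scan, so identical
# scans reuse the same cached input. Names here are hypothetical.

class MapInputRegistry:
    """Hands out one shared 'MapInput' per distinct table scan."""

    def __init__(self):
        self.cache = {}
        self.scan_count = 0

    def map_input_for(self, table, columns):
        key = (table, tuple(columns))
        if key not in self.cache:
            self.scan_count += 1            # the table is actually read here
            self.cache[key] = object()      # stand-in for the scan's RDD
        return self.cache[key]

registry = MapInputRegistry()
# TS[0] and TS[1] scan the same table with the same columns ...
rdd_a = registry.map_input_for("item", ["i_item_sk"])
rdd_b = registry.map_input_for("item", ["i_item_sk"])
# ... so they share one MapInput, and the scan happens only once.
print(rdd_a is rdd_b, registry.scan_count)  # True 1
```

The design choice mirrored here is that deduplication happens at the input level rather than by restructuring the work graph, which is why the M->M->R split became unnecessary.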
> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch, HIVE-17486.3.patch, HIVE-17486.4.patch, explain.28.share.false, explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
> In HIVE-16602, shared scans were implemented for Tez. Given a query plan, the goal is to identify scans on input tables that can be merged so that the data is read only once. The optimization is carried out at the physical level. In Hive on Spark, the result of a spark work is cached if that work is used by more than 1 child spark work. Once SharedWorkOptimizer is enabled in the HoS physical plan, identical table scans are merged into 1 table scan, whose result is then used by more than 1 child spark work. Thus, thanks to the cache mechanism, the same computation need not be repeated.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
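The caching behaviour the description relies on can be sketched like this. This is an illustrative simulation in plain Python (no Spark, and not Hive's actual classes): a work's result is cached when more than one child consumes it, so the merged table scan runs only once.

```python
# Illustrative sketch of the HoS rule described above: cache a work's result
# when more than 1 child spark work consumes it. All names are hypothetical.

class Work:
    def __init__(self, name, compute):
        self.name = name
        self.compute = compute
        self.children = []
        self._cached = None
        self.compute_calls = 0

    def result(self):
        # Cache only if this work feeds more than one child work.
        if len(self.children) > 1:
            if self._cached is None:
                self.compute_calls += 1     # the scan runs here, once
                self._cached = self.compute()
            return self._cached
        self.compute_calls += 1             # uncached path: recompute per call
        return self.compute()

scan = Work("merged TS", lambda: list(range(5)))
scan.children = ["child work 1", "child work 2"]
a = scan.result()   # first child triggers the scan and populates the cache
b = scan.result()   # second child reads from the cache
print(scan.compute_calls)  # 1
```

In real HoS the cached object would be a Spark RDD persisted via `cache()`; the sketch only shows why merging identical scans into one work makes the cache effective.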
[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292203#comment-16292203 ]

liyunzhang edited comment on HIVE-17486 at 12/15/17 9:15 AM:
------------------------------------------------------------

[~lirui] and [~xuefuz]: I have updated HIVE-17486.2.patch and redesigned it. A case similar to DS/query28.sql now runs successfully, which shows that the current design (M-M-R) can use the RDD cache to avoid repeating the table scan. The latest design doc contains a simple case. I have added that case as a qtest (spark_optimize_shared_work.q), but there is currently an exception when running the qtest in my local env. Once the problem is fixed, I will trigger Hive QA to test.
[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285558#comment-16285558 ]

liyunzhang edited comment on HIVE-17486 at 12/11/17 6:41 AM:
------------------------------------------------------------

[~xuefuz]
{quote}
It seems that we can do so whenever a TS is connected to multiple RSs. The split point should happen at the fork.
{quote}
I don't fully understand this. Currently the split is at the TS, for example:
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
     -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
{code}
->
{code}
Map1:TS[0]
Map2:FIL[52]-SEL[2]-GBY[3]-RS[4]
Map3:FIL[53]-SEL[9]-GBY[10]-RS[11]
Reducer1:GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
Reducer2:GBY[12]-RS[43]
{code}
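The split described above can be sketched mechanically. This is a toy Python sketch under the assumption that an operator tree is represented as a shared scan plus its branch chains; it is not Hive's plan representation, only an illustration of cutting the tree at the shared TS so each branch lands in its own child Map.

```python
# Toy sketch of the split at the shared TS: the scan becomes its own Map work,
# and each branch hanging off it becomes a separate child Map work.
# The representation (lists of operator names) is hypothetical.

def split_at_scan(scan, branches):
    """scan: shared table-scan operator; branches: list of operator chains."""
    works = {"Map1": [scan]}
    for i, branch in enumerate(branches, start=2):
        works[f"Map{i}"] = list(branch)
    return works

plan = split_at_scan(
    "TS[0]",
    [["FIL[52]", "SEL[2]", "GBY[3]", "RS[4]"],
     ["FIL[53]", "SEL[9]", "GBY[10]", "RS[11]"]],
)
print(plan["Map1"])  # ['TS[0]']
```

This reproduces only the Map side of the example; the Reducer works are untouched by the split.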
[jira] [Comment Edited] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235125#comment-16235125 ]

liyunzhang edited comment on HIVE-17486 at 11/2/17 3:35 AM:
-----------------------------------------------------------

[~lirui]:
{quote}
My understanding is HoS also supports one Map connecting to multiple Reducers
{quote}
In HoS there is only 1 RS per Map. It is true that there are cases where 1 Map is used by two Reducers in HoS. But in HoT, 2 RSs are allowed in 1 Map, and those 2 different RSs can send different data to 2 different Reducers.
{quote}
The problem here is HoS doesn't merge equivalent works as aggressively as HoT does.
{quote}
Yes.
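The constraint described above can be expressed as a simple check. This is an illustrative sketch, not Hive code: a Tez-style Map work with several ReduceSinks violates the one-RS-per-Map rule of HoS and would have to be split before it can run on Spark.

```python
# Toy check of the constraint discussed above: HoS allows at most one
# ReduceSink (RS) per Map work, while Hive on Tez allows several, each
# feeding a different Reducer. Operator names follow the plans in this thread.

def needs_split_for_hos(map_operators):
    """Return True if this Map work has more than one RS and must be split."""
    rs_count = sum(1 for op in map_operators if op.startswith("RS"))
    return rs_count > 1

# A Tez-style Map with two ReduceSinks is illegal in HoS:
print(needs_split_for_hos(["TS[0]", "FIL[1]", "RS[2]", "RS[3]"]))  # True
# A normal HoS Map with a single ReduceSink is fine:
print(needs_split_for_hos(["TS[0]", "FIL[1]", "RS[2]"]))           # False
```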