[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Attachment: HIVE-17486.5.patch

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch,
> HIVE-17486.3.patch, HIVE-17486.4.patch, HIVE-17486.5.patch,
> explain.28.share.false, explain.28.share.true, scanshare.after.svg,
> scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be
> merged so the data is read only once. Optimization will be carried out at the
> physical level. In Hive on Spark, it caches the result of spark work if the
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer
> is enabled in physical plan in HoS, the identical table scans are merged to 1
> table scan. The result of this table scan will be used by more than 1 child
> spark work. Thus we need not repeat the same computation, because of the
> cache mechanism.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
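The scan-sharing idea in the description can be sketched in a few lines of Python. This is a toy model, not Hive's SharedWorkOptimizer: the `TableScan` class, the work names, and the table and column names are all hypothetical, and two scans count as identical here only when they read the same table and columns, which is a simplifying assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TableScan:
    """Hypothetical stand-in for a table scan operator in a physical plan."""
    table: str
    columns: tuple  # projected columns
    consumer: str   # name of the child work that reads this scan


def merge_identical_scans(scans):
    """Group scans that read the same table and columns.

    Returns a dict mapping (table, columns) to the list of consumers
    that share the single merged scan, in plan order.
    """
    merged = defaultdict(list)
    for scan in scans:
        merged[(scan.table, scan.columns)].append(scan.consumer)
    return dict(merged)


def scans_worth_caching(merged):
    """A merged scan is worth caching when more than one child consumes it."""
    return [key for key, consumers in merged.items() if len(consumers) > 1]


# Hypothetical plan: two identical scans of one table plus an unrelated scan.
plan = [
    TableScan("store_sales", ("ss_list_price",), "Reducer 2"),
    TableScan("store_sales", ("ss_list_price",), "Reducer 4"),
    TableScan("item", ("i_item_sk",), "Reducer 6"),
]
merged = merge_identical_scans(plan)
cached = scans_worth_caching(merged)
print(len(plan), "scans before,", len(merged), "after merging")
print("cached:", cached)
```

In this model the two `store_sales` scans collapse into one entry with two consumers, which is the situation in which the description expects the Spark-side cache to avoid recomputing the scan.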
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Attachment: HIVE-17486.4.patch

Updated HIVE-17486.4.patch to fix the compilation error.

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch,
> HIVE-17486.3.patch, HIVE-17486.4.patch, explain.28.share.false,
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Attachment: HIVE-17486.3.patch

Added spark_optimize_shared_work.q.out and updated HIVE-17486.3.patch. Triggering the QA tests to see whether the patch affects the existing code.

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch,
> HIVE-17486.3.patch, explain.28.share.false, explain.28.share.true,
> scanshare.after.svg, scanshare.before.svg
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Status: Patch Available  (was: Open)

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch,
> HIVE-17486.3.patch, explain.28.share.false, explain.28.share.true,
> scanshare.after.svg, scanshare.before.svg
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Attachment: HIVE-17486.2.patch

[~lirui] and [~xuefuz]: I have updated HIVE-17486.2.patch and reworked the design, because a case similar to DS/query28.sql now runs successfully. This shows that the current design (M-M-R) can use the RDD cache to reduce the number of table scans. The latest design doc includes a simple case. I have added that simple case as a test (spark_optimize_shared_work.q), but running the qtest in my local environment currently throws an exception. Once the problem is fixed, I will trigger Hive QA to test the patch.

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: HIVE-17486.1.patch, HIVE-17486.2.patch,
> explain.28.share.false, explain.28.share.true, scanshare.after.svg,
> scanshare.before.svg
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Attachment: HIVE-17486.1.patch

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: HIVE-17486.1.patch, explain.28.share.false,
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Attachment: explain.28.share.false
                explain.28.share.true

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: explain.28.share.false, explain.28.share.true,
> scanshare.after.svg, scanshare.before.svg
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Attachment:     (was: explain.28.scan.share.true)

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: scanshare.after.svg, scanshare.before.svg
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Attachment:     (was: explain.28.scan.share.false)

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: scanshare.after.svg, scanshare.before.svg
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Attachment: explain.28.scan.share.false
                explain.28.scan.share.true

I set the flag {{hive.spark.optimize.shared.work}} to enable the SharedWorkOptimizer in Hive on Spark. The attached explain.28.scan.share.true is the EXPLAIN output with the flag enabled, and explain.28.scan.share.false is the EXPLAIN output with the flag disabled, for [DS/query28.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query28.sql].

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: explain.28.scan.share.false, explain.28.scan.share.true,
> scanshare.after.svg, scanshare.before.svg
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated HIVE-17486:
------------------------------
    Attachment: scanshare.after.svg
                scanshare.before.svg

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>            Priority: Major
>         Attachments: scanshare.after.svg, scanshare.before.svg
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17486:
------------------------------------
    Description: 
in HIVE-16602, Implement shared scans with Tez.
Given a query plan, the goal is to identify scans on input tables that can be merged so the data is read only once. Optimization will be carried out at the physical level. In Hive on Spark, it caches the result ofsSpark work if the spark work is used by more than 1 child spark work. After sharedWorkOptimizer is enabled in physical plan in HoS, the identical table scans are merged to 1 table scan. This result of table scan will be used by more 1 child spark work. Thus we need not do the same computation because of cache mechanism.

  was:
in HIVE-16602, Implement shared scans with Tez.
Given a query plan, the goal is to identify scans on input tables that can be merged so the data is read only once. Optimization will be carried out at the physical level.

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
[jira] [Updated] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
[ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17486:
------------------------------------
    Description: 
in HIVE-16602, Implement shared scans with Tez.
Given a query plan, the goal is to identify scans on input tables that can be merged so the data is read only once. Optimization will be carried out at the physical level. In Hive on Spark, it caches the result of spark work if the spark work is used by more than 1 child spark work. After sharedWorkOptimizer is enabled in physical plan in HoS, the identical table scans are merged to 1 table scan. This result of table scan will be used by more 1 child spark work. Thus we need not do the same computation because of cache mechanism.

  was:
in HIVE-16602, Implement shared scans with Tez.
Given a query plan, the goal is to identify scans on input tables that can be merged so the data is read only once. Optimization will be carried out at the physical level. In Hive on Spark, it caches the result ofsSpark work if the spark work is used by more than 1 child spark work. After sharedWorkOptimizer is enabled in physical plan in HoS, the identical table scans are merged to 1 table scan. This result of table scan will be used by more 1 child spark work. Thus we need not do the same computation because of cache mechanism.

> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel