[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2018-03-06 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388915#comment-16388915 ] Rui Li commented on HIVE-15104: --- [~stakiar], thanks for trying this out. bq. The HiveKryoRegistrator still

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2018-03-06 Thread Sahil Takiar (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388551#comment-16388551 ] Sahil Takiar commented on HIVE-15104: - Hey [~lirui] I found some time to do some internal testing of

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-31 Thread Lefty Leverenz (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233602#comment-16233602 ] Lefty Leverenz commented on HIVE-15104: --- Good doc, thanks [~lirui]. I removed the TODOC3.0 label.

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-29 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224326#comment-16224326 ] Rui Li commented on HIVE-15104: --- Thanks [~leftylev] for the reminder. I've updated the wiki. > Hive on

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-25 Thread Lefty Leverenz (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219861#comment-16219861 ] Lefty Leverenz commented on HIVE-15104: --- Doc note: This adds *hive.spark.optimize.shuffle.serde* to

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-24 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217849#comment-16217849 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-24 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217431#comment-16217431 ] Xuefu Zhang commented on HIVE-15104: +1 > Hive on Spark generate more shuffle data than hive on mr >

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-18 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208904#comment-16208904 ] Rui Li commented on HIVE-15104: --- The sub-query failures are tracked by HIVE-17823. Others are not related.

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-17 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207563#comment-16207563 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-16 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207028#comment-16207028 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-16 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205863#comment-16205863 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-13 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203867#comment-16203867 ] Xuefu Zhang commented on HIVE-15104: I think it's fairly safe to assume that hive-exec.jar and the new

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-13 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203332#comment-16203332 ] Rui Li commented on HIVE-15104: --- [~xuefuz], we need to locate the jar on Hive side, before we call

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-12 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202897#comment-16202897 ] Xuefu Zhang commented on HIVE-15104: Hi [~lirui], to locate the jar, can we assume that the jar is

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-12 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201695#comment-16201695 ] Rui Li commented on HIVE-15104: --- One correction: the {{NoClassDefFoundError}} is for

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-11 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201405#comment-16201405 ] Rui Li commented on HIVE-15104: --- Hi [~xuefuz], sorry for taking so long to update. I tried out your

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-31 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149777#comment-16149777 ] Xuefu Zhang commented on HIVE-15104: Hi [~lirui], I think creating a trivial package is still better

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148485#comment-16148485 ] Rui Li commented on HIVE-15104: --- [~xuefuz], I'll try if that's feasible. Do you think it's OK to create a

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148351#comment-16148351 ] Xuefu Zhang commented on HIVE-15104: I see. It might be possible to put this class in a new package

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148340#comment-16148340 ] Rui Li commented on HIVE-15104: --- [~xuefuz], my previous

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148174#comment-16148174 ] Xuefu Zhang commented on HIVE-15104: The patch looks good to me. My only concern is about the

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-25 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141431#comment-16141431 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-20 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134639#comment-16134639 ] Rui Li commented on HIVE-15104: --- Thanks [~xuefuz] and take your time. I guess we can also run a round of QA

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-18 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133290#comment-16133290 ] Xuefu Zhang commented on HIVE-15104: [~lirui], I found it difficult to backport HIVE-17114 to our

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-16 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129895#comment-16129895 ] Xuefu Zhang commented on HIVE-15104: Hi [~lirui], thanks for continuing the work here. The improvement

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-16 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129871#comment-16129871 ] Rui Li commented on HIVE-15104: --- Hi [~xuefuz], with HIVE-17114 and HIVE-17321 the benchmark results become

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-07-13 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085465#comment-16085465 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-07-13 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085384#comment-16085384 ] Rui Li commented on HIVE-15104: --- Hi [~xuefuz], I can't reproduce the perf degradation on my side. Some case

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-07-13 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085324#comment-16085324 ] Xuefu Zhang commented on HIVE-15104: [~lirui], I'm wondering if there is anything new (other than

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-07-13 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085276#comment-16085276 ] Rui Li commented on HIVE-15104: --- I also ran another round of TPC-DS. The overall shuffle data is reduced by

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-06-16 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052006#comment-16052006 ] Rui Li commented on HIVE-15104: --- The approach here can cause problems when we cache RDDs, e.g. combining

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-20 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16018507#comment-16018507 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-19 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017621#comment-16017621 ] Rui Li commented on HIVE-15104: --- Patch v3 compiles the registrators at runtime, so that we don't have to

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-19 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017527#comment-16017527 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-15 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010433#comment-16010433 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-12 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009026#comment-16009026 ] Rui Li commented on HIVE-15104: --- [~xuefuz], kryo was relocated in HIVE-5915. So it's not intended for Spark.

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-12 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008502#comment-16008502 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-12 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008107#comment-16008107 ] Xuefu Zhang commented on HIVE-15104: [~lirui], great progress! Thanks for keeping up the effort. As

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-05 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998437#comment-15998437 ] Rui Li commented on HIVE-15104: --- Tried disabling relocation locally. It does solve the AbstractMethodError.

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-05 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998177#comment-15998177 ] Rui Li commented on HIVE-15104: --- I looked at the shuffle writers of Spark and none of them seem to need the

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-04-07 Thread Aihua Xu (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960801#comment-15960801 ] Aihua Xu commented on HIVE-15104: - [~lirui] I didn't have time to work on that. Feel free to take it

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-04-06 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960146#comment-15960146 ] Rui Li commented on HIVE-15104: --- Hi [~aihuaxu], are you still working on this? If not, do you mind if I take

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-07 Thread Aihua Xu (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644296#comment-15644296 ] Aihua Xu commented on HIVE-15104: - I will take a look at Spark to see if it's needed after it's

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-03 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634914#comment-15634914 ] Rui Li commented on HIVE-15104: --- [~xuefuz], [~aihuaxu], both MR and Spark need HiveKey.hashCode to compute

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-03 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634112#comment-15634112 ] Xuefu Zhang commented on HIVE-15104: I checked the source code and it seems that both Spark

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-03 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633820#comment-15633820 ] Xuefu Zhang commented on HIVE-15104: [~lirui], thanks for sharing your findings. Can you confirm that

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-03 Thread Aihua Xu (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15632623#comment-15632623 ] Aihua Xu commented on HIVE-15104: - [~lirui] So what you are saying is, it depends on how Spark shuffles

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-03 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15632244#comment-15632244 ] Rui Li commented on HIVE-15104: --- [~xuefuz], here's what I find so far. Firstly, MR uses HiveKey as the [key

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-02 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15631376#comment-15631376 ] Xuefu Zhang commented on HIVE-15104: This is rather interesting. I know I originally reviewed

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-02 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15631053#comment-15631053 ] Rui Li commented on HIVE-15104: --- We need to use HiveKey because it holds the proper hash code to be used for

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-02 Thread Aihua Xu (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15630435#comment-15630435 ] Aihua Xu commented on HIVE-15104: - This is changed by HIVE-8017. [~lirui] Do you recall what kind of

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-02 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15628605#comment-15628605 ] Rui Li commented on HIVE-15104: --- Seems MR can just serialize the key as BytesWritable instead of HiveKey. We

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-01 Thread wangwenli (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627812#comment-15627812 ] wangwenli commented on HIVE-15104: -- Try select count(distinct col1), count(distinct col2) from table,

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-01 Thread Aihua Xu (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626085#comment-15626085 ] Aihua Xu commented on HIVE-15104: - [~wenli] Can you give an example that I can run and compare? > Hive