[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2018-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Can one of the admins verify this patch?

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2018-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Can one of the admins verify this patch?

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Can one of the admins verify this patch?

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2017-02-09 Thread deswal-ajit
Github user deswal-ajit commented on the issue: https://github.com/apache/spark/pull/15297 hi

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-12-11 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue: https://github.com/apache/spark/pull/15297 retest this please

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-12-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Merged build finished. Test FAILed.

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-12-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69990/ Test FAILed. ---

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-12-11 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #69990 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69990/consoleFull)** for PR 15297 at commit

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-12-11 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #69990 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69990/consoleFull)** for PR 15297 at commit

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-26 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/15297 This is a really big change - and handling skewed data in joins is certainly an important consideration - have you considered making a design document and running it by the dev list? Maybe

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Merged build finished. Test PASSed.

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68448/ Test PASSed. ---

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68448 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68448/consoleFull)** for PR 15297 at commit

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68448 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68448/consoleFull)** for PR 15297 at commit

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68446/ Test FAILed. ---

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Merged build finished. Test FAILed.

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68446/consoleFull)** for PR 15297 at commit

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68446/consoleFull)** for PR 15297 at commit

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Merged build finished. Test FAILed.

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68442/consoleFull)** for PR 15297 at commit

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68442/ Test FAILed. ---

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68442/consoleFull)** for PR 15297 at commit

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-11-08 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/15297 @YuhuWang2002 We should limit the use case for outer joins: for a left outer join, such as A left join B, this implementation currently cannot handle skew in table B. That's because
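
The message above is cut off before it gives the reason. As a hedged illustration only (not the PR's code), the plain-Python sketch below shows one well-known way a left outer join goes wrong when the right-hand table's partition is split and the left-hand rows are replicated to every split: left rows that have no match inside a given split are emitted with a null there, so the combined output contains spurious or duplicated null rows. The tables `A` and `B`, the helper `left_outer_join`, and the two-way split are all hypothetical.

```python
from collections import Counter

def left_outer_join(left, right):
    """Naive left outer join on (key, value) pairs.
    Left rows without a match are paired with None."""
    out = []
    for lk, lv in left:
        matches = [rv for rk, rv in right if rk == lk]
        if matches:
            out.extend((lk, lv, rv) for rv in matches)
        else:
            out.append((lk, lv, None))
    return out

# Hypothetical data: key "k" is skewed in B, key "m" has no match in B.
A = [("k", "a1"), ("j", "a2"), ("m", "a3")]
B = [("k", "b1"), ("k", "b2"), ("k", "b3"), ("j", "b4")]

expected = left_outer_join(A, B)

# Split the skewed right side B into two chunks and replicate A to each.
B_chunks = [B[:2], B[2:]]
split_right = [row for chunk in B_chunks for row in left_outer_join(A, chunk)]

# Rows produced by the split plan that a correct join would not produce.
spurious = Counter(split_right) - Counter(expected)

# Splitting the left side A instead (and replicating B) stays correct.
A_chunks = [A[:2], A[2:]]
split_left = [row for chunk in A_chunks for row in left_outer_join(chunk, B)]
```

In this toy run, splitting B yields two extra null rows, while splitting A reproduces the unsplit result. That matches the limitation described above, where only skew on the left side of a left outer join is safe to split.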

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-25 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue: https://github.com/apache/spark/pull/15297 I ran some performance tests comparing runs with and without the skew join algorithm. I generated 2 tables with 1/5 data skew in table S and 1/1 data skew in table R. Two tables

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-24 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue: https://github.com/apache/spark/pull/15297 The skewed join implementation is suited to both DataFrame and SQL statements; you will get 210 output files.

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-24 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/15297 Ok so are you saying this skewed join implementation doesn't apply to other dataframe operations, something like: val df_pixels = sqlContext.read.parquet("somefile") val

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-22 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue: https://github.com/apache/spark/pull/15297 @tgravescs: In the join case, for example: select count(*) from A join B. If the parameter spark.sql.shuffle.partitions=200, then 200 tasks each output a 'count num'; the output is not in
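
The task arithmetic in this message can be sketched as follows, under stated assumptions. Without splitting, each shuffle partition yields one reduce task, so `spark.sql.shuffle.partitions=200` gives 200 tasks. If a skewed partition is split into several sub-tasks, it contributes that many tasks instead of one. The helper `total_tasks` and the split counts are hypothetical; the thread does not say how many partitions were skewed or how finely they were split, so the second call shows only one possible breakdown that exceeds 200.

```python
def total_tasks(num_partitions, splits_per_skewed_partition):
    """Number of reduce tasks when each skewed partition is replaced
    by the given number of sub-tasks (hypothetical model)."""
    skewed = len(splits_per_skewed_partition)
    return (num_partitions - skewed) + sum(splits_per_skewed_partition)

# No skew handling: one task per shuffle partition.
baseline = total_tasks(200, [])

# Assumed example: a single skewed partition split into 11 sub-tasks.
with_one_split = total_tasks(200, [11])
```

Under this model, any extra tasks (and output files) beyond the configured partition count come only from the split partitions.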

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-21 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/15297 Ok so how does that affect the overall job and # of outputs? I don't know the internals of Spark SQL so sorry if I'm missing something obvious. Basically now you will have multiple tasks

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-20 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue: https://github.com/apache/spark/pull/15297 @tgravescs: Thank you for your response. When a single reduce task handles a huge amount of data, it is slow and unstable, so we split one reduce task into multiple reduce tasks. A single reduce
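
A minimal sketch of the idea described here, splitting one oversized reduce task into several by dividing up the map outputs it fetches. This is not the PR's implementation: the function `split_reduce_partition`, the greedy grouping of consecutive map outputs, and the `target_size` threshold are assumptions made for illustration.

```python
def split_reduce_partition(map_output_sizes, target_size):
    """Group consecutive map outputs for one reduce partition so that
    each group stays at or below target_size where possible.
    Returns a list of groups, each a list of map ids; one group would
    correspond to one sub-task fetching only those map outputs."""
    groups, current, current_size = [], [], 0
    for map_id, size in enumerate(map_output_sizes):
        # Close the current group if adding this output would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(map_id)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Hypothetical sizes of the blocks that six map tasks wrote for one
# skewed reduce partition (arbitrary units).
sizes = [40, 10, 60, 30, 20, 50]
sub_tasks = split_reduce_partition(sizes, target_size=70)
```

Here the single reduce task is replaced by four sub-tasks, each reading a disjoint range of map outputs. A map output larger than the target still gets its own group, so the threshold is a soft bound in this sketch.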

[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-20 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/15297 I haven't looked through the code in detail, but can you clarify the design a bit? The design pretty much just says we are splitting up the fetch of the map outputs, but it doesn't say what