[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user deswal-ajit commented on the issue: https://github.com/apache/spark/pull/15297 hi --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user YuhuWang2002 commented on the issue: https://github.com/apache/spark/pull/15297 retest this please
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Merged build finished. Test FAILed.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69990/ Test FAILed.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #69990 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69990/consoleFull)** for PR 15297 at commit [`99b8305`](https://github.com/apache/spark/commit/99b830584aafb53112b5bdd2d723080fa19baa54). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #69990 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69990/consoleFull)** for PR 15297 at commit [`99b8305`](https://github.com/apache/spark/commit/99b830584aafb53112b5bdd2d723080fa19baa54).
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/15297 This is a really big change - and handling skewed data in joins is certainly an important consideration - have you considered making a design document and running it by the dev list? Maybe something similar to the recently proposed Spark Improvement Proposals process?
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Merged build finished. Test PASSed.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68448/ Test PASSed.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68448 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68448/consoleFull)** for PR 15297 at commit [`1bb158b`](https://github.com/apache/spark/commit/1bb158b3035cd4f69dd2f47c26ef1c67bc5e6a6c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68448 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68448/consoleFull)** for PR 15297 at commit [`1bb158b`](https://github.com/apache/spark/commit/1bb158b3035cd4f69dd2f47c26ef1c67bc5e6a6c).
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68446/ Test FAILed.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Merged build finished. Test FAILed.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68446/consoleFull)** for PR 15297 at commit [`8728d33`](https://github.com/apache/spark/commit/8728d334a79f3cb385937d61956ea47d2e9a4650). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68446/consoleFull)** for PR 15297 at commit [`8728d33`](https://github.com/apache/spark/commit/8728d334a79f3cb385937d61956ea47d2e9a4650).
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Merged build finished. Test FAILed.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68442/consoleFull)** for PR 15297 at commit [`b60f9bc`](https://github.com/apache/spark/commit/b60f9bc76763a0c149cb32bf8b3ab3f318a86635). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68442/ Test FAILed.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15297 **[Test build #68442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68442/consoleFull)** for PR 15297 at commit [`b60f9bc`](https://github.com/apache/spark/commit/b60f9bc76763a0c149cb32bf8b3ab3f318a86635).
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user scwf commented on the issue: https://github.com/apache/spark/pull/15297 @YuhuWang2002 We should limit the use cases for outer joins. For a left outer join such as A left join B, this implementation currently cannot handle skew in table B. That is because the join result depends on all of B's data in the same reduce partition, so B cannot be split across multiple tasks. Similarly, for a right outer join such as A right join B, this implementation currently cannot handle skew in table A. For a full outer join, we cannot use the optimization at all.
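The restriction on outer joins can be checked with a small plain-Python simulation. This is not Spark code, and the helper names `left_outer_join` and `chunks` are made up for illustration. It joins one reduce partition of A with B, first splitting A across tasks and then splitting B across tasks. Splitting the preserved side A gives the same rows, while splitting B emits extra null-padded rows because each task sees only part of B.

```python
from collections import Counter

def left_outer_join(a_rows, b_rows):
    """Left outer join on the first field; unmatched A rows are padded with None."""
    out = []
    for key, a_val in a_rows:
        matches = [b_val for b_key, b_val in b_rows if b_key == key]
        if matches:
            out.extend((key, a_val, b_val) for b_val in matches)
        else:
            out.append((key, a_val, None))
    return out

def chunks(rows, n):
    """Split rows into n roughly equal contiguous chunks (hypothetical splitter)."""
    size = -(-len(rows) // n)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

# One reduce partition: key 1 is the skewed key, key 2 has no match in B.
a = [(1, "a1"), (1, "a2"), (2, "a3")]
b = [(1, "b1"), (1, "b2"), (1, "b3"), (1, "b4")]

expected = Counter(left_outer_join(a, b))

# Splitting A (the preserved side): each task joins its chunk of A with all of B.
split_a = Counter()
for part in chunks(a, 2):
    split_a.update(left_outer_join(part, b))

# Splitting B: each task joins all of A with only a chunk of B.
split_b = Counter()
for part in chunks(b, 2):
    split_b.update(left_outer_join(a, part))

print(split_a == expected)        # True: splitting A preserves the result
print(split_b == expected)        # False: splitting B changes the result
print(split_b[(2, "a3", None)])   # 2: the unmatched row is emitted once per task
```

The same argument applies with the sides swapped for a right outer join, and a full outer join preserves both sides, so neither can be split this way.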
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user YuhuWang2002 commented on the issue: https://github.com/apache/spark/pull/15297
I ran some performance tests comparing the skew join algorithm with the normal join algorithm. I generated two tables, with 1/5 data skew in table S and 1/1 data skew in table R. Both tables are skewed on the same key.

spark.sql.adaptive.skewjoin.threshold 600
spark.sql.adaptive.shuffle.targetPostShuffleInputSize 500
records: S 1000 rows; R 1 rows

sql: select count(*) from R,S where rid=sid and sname>'wang9' and rname > 'zhang9';
skew algorithm: 167.695s
normal algorithm: 303.922s

R2_txt is 1 rows without data skew.
sql: select count(*) from R2_txt,S where rid=sid and sname>'wang' and rname > 'zhang9';
skew algorithm: 38.717s
normal algorithm: 114.21s
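For reference, the speedups implied by these timings can be worked out with a short Python calculation. It uses only the four reported numbers; the dictionary keys are labels chosen here for readability.

```python
# Timings in seconds, copied from the report above.
timings = {
    "R join S": {"skew": 167.695, "normal": 303.922},
    "R2_txt join S": {"skew": 38.717, "normal": 114.21},
}

speedups = {}
for query, t in timings.items():
    # Speedup = time with the normal algorithm / time with the skew algorithm.
    speedups[query] = t["normal"] / t["skew"]
    print(f"{query}: {speedups[query]:.2f}x faster with the skew algorithm")
```

This gives roughly 1.81x for the query where both tables are skewed and roughly 2.95x for the query against R2_txt.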
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user YuhuWang2002 commented on the issue: https://github.com/apache/spark/pull/15297 The skewed join implementation applies to both DataFrame operations and SQL statements. You will get 210 output files.
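One way to picture where a number like 210 could come from, assuming each write task produces one output file and a skewed partition is split into several tasks. The splitting rule below (`ceil(size / target_size)` for partitions over a threshold) and all the sizes are assumptions for illustration, not the PR's actual logic.

```python
import math

def task_count(partition_sizes, skew_threshold, target_size):
    """Count reduce tasks when oversized partitions are split (hypothetical rule)."""
    tasks = 0
    for size in partition_sizes:
        if size > skew_threshold:
            tasks += math.ceil(size / target_size)  # skewed: split into several tasks
        else:
            tasks += 1                              # normal: one task per partition
    return tasks

# 200 shuffle partitions, one of which is heavily skewed (made-up sizes).
sizes = [100] * 199 + [5500]
n_tasks = task_count(sizes, skew_threshold=600, target_size=500)
print(n_tasks)  # 199 normal tasks + 11 tasks for the skewed partition = 210
```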
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/15297
OK, so are you saying this skewed join implementation doesn't apply to other DataFrame operations, something like:

    val df_pixels = sqlContext.read.parquet("somefile")
    val df_pixels_renamed = df_pixels.withColumnRenamed("photo_id", "pixels_photo_id")
    val df_meta = sqlContext.read.parquet("somemeta")
    val df = df_meta.as("meta").join(df_pixels_renamed, $"meta.photo_id" === $"pixels_photo_id", "inner").drop("pixels_photo_id")
    df.write.parquet("someoutputfile")

Where normally spark.sql.shuffle.partitions=X would configure the number of output files. So in my example, if I set spark.sql.shuffle.partitions=200 but the skewed join uses 210, what happens? How many output files would I get?
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user YuhuWang2002 commented on the issue: https://github.com/apache/spark/pull/15297 @tgravescs: In the join case, take something like select count(*) from A join B. If spark.sql.shuffle.partitions=200, then 200 tasks each output a partial count. That output is not written to HDFS but cached in Spark. Summing the outputs of the 200 tasks gives the correct value. With skew handling, we get 210 tasks each outputting a partial count, and the next step processes them in the same way.
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/15297 OK, so how does that affect the overall job and the number of outputs? I don't know the internals of Spark SQL, so sorry if I'm missing something obvious. Basically, you will now have multiple tasks where it used to use one. So let's say I have spark.sql.shuffle.partitions=200 to start with, and the skewed join adds tasks to process some skewed partition, so let's say it runs 210 tasks. If I then save that to HDFS, do I get 210 output files, or does it join those 10 back into 1 again?
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user YuhuWang2002 commented on the issue: https://github.com/apache/spark/pull/15297 @tgravescs: Thank you for your response. When a single reduce task handles a huge amount of data, it is slow and unstable, so we split one reduce task into multiple reduce tasks. Where a single reduce task would do A join B, we split it into multiple tasks: task 1 does A1 join B, task 2 does A2 join B, and so on. A1 is the part of A that is read from a range of map outputs. Spark SQL treats A1 as a separate partition when processing, so multiple executors can run the tasks and spread the processing load.
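The splitting described here can be sketched in plain Python. This is a simulation, not the PR's code; the data layout and the name `join_count` are assumptions. The skewed reduce partition of A arrives as one block per map output, each task takes a range of those blocks and joins it with all of B, and the partial counts sum to the single-task count.

```python
def join_count(a_rows, b_rows):
    """Count the rows of an inner join on the key (first field)."""
    b_keys = {}
    for key, _ in b_rows:
        b_keys[key] = b_keys.get(key, 0) + 1
    return sum(b_keys.get(key, 0) for key, _ in a_rows)

# Blocks of A for one reduce partition, one block per map output (made-up data).
a_map_outputs = [
    [(1, "a0"), (1, "a1")],
    [(1, "a2")],
    [(1, "a3"), (2, "a4")],
    [(1, "a5")],
]
b = [(1, "b0"), (1, "b1"), (2, "b2")]

# Original plan: one reduce task reads every map output of A.
all_a = [row for block in a_map_outputs for row in block]
single_task = join_count(all_a, b)

# Skew plan: task i reads a range of map outputs (A_i) and joins it with all of B.
map_ranges = [(0, 2), (2, 4)]  # hypothetical split into two tasks
partial_counts = []
for start, end in map_ranges:
    a_i = [row for block in a_map_outputs[start:end] for row in block]
    partial_counts.append(join_count(a_i, b))

print(single_task, partial_counts, sum(partial_counts))  # 11 [6, 5] 11
```

In this sketch every task reads all of B, so the split trades repeated reads of B for parallelism over A.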
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/15297 I haven't looked through the code in detail, but can you clarify the design a bit? The design pretty much just says we are splitting up the fetch of the map outputs, but it doesn't say what happens then or how this really solves the problem. You say that instead of doing A join B you are splitting it up to do something like A1 join B + A2 join B + … + An join B. Is it still just one reduce task fetching it in separate chunks? If so, how does this fix the problem? Or is it treating each of those fetches as a separate partition when processing it?