[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95214/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95213 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95213/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95210/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95208/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95202/ Test FAILed. ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95202 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95202/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95207/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95202 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95202/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 To confirm, is everyone OK with merging this PR, or we are just OK with the direction and need more time to review this PR? ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs: > The shuffle simply transfers the bytes its supposed to. Sparks shuffle of those bytes is not consistent in that the order it fetches from can change and without the sort

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95129/ Test PASSed. ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95129 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95129/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95128/ Test FAILed. ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95128 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95128/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95129 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95129/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95128 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95128/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 BTW, I think a cleaner fix is to make shuffle files reliable(e.g. put them on HDFS), so that Spark will never retry a task from a finished shuffle map stage. Then all the problems go away, the

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112 > Without making shuffle output order repeatable, we do not have a way to properly fix this. Perhaps I'm missing it, but you are saying shuffle here, but just shuffle itself can't fix

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 Catching up on discussion ... @cloud-fan > shuffled RDD will never be deterministic unless the shuffle key is the entire record and key ordering is specified. Let me rephrase

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread markhamstra
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/22112 I'm not a fan of the IDEMPOTENT, RANDOM_ORDER, COMPLETE_RANDOM naming. IDEMPOTENT is fine, but I'd prefer UNORDERED and INDETERMINATE to cover the cases of "same values in potentially a

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 FYI I've implemented the support of "repeatable" RDD action in my local branch. It needs to add a new parameter to the public `SparkContext#runJob`, so I'm a little hesitant to push it. Please

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 > how does the user then tell spark that the result stage becomes repeatable because they did the checkpoint? There are 2 concepts here: 1. The random level of the RDD computing

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112 > I'm proposing an option 3: > Retry all the tasks of all the succeeding stages if a stage with repartition/zip failed. All RDD actions should tell Spark if it's "repeatable", which becomes a

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95096/ Test PASSed. ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95096 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95096/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95096 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95096/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 > 1. force a sort on these operations I think this is the most obvious fix, and that's how we fixed the `Dataset#repartition`. Like we discussed before, it's hard to apply it to RDD,

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95023/ Test FAILed. ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95023 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95023/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112 I only see 2 options: 1. force a sort on these operations 2. do nothing and require users to sort or handle someway (checkpoint) if they care. You can possibly make

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112 > 2. ask the output committer to be able to overwrite a committed task. Note that, the output committer here is the FileCommitProtocol interface in Spark, not the hadoop output committer. We

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95023 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95023/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95005/ Test FAILed. ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95005 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95005/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95005 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95005/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94988/ Test FAILed. ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #94988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94988/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 > assuming the ordering from the source RDD is preserved This is the problem we are resolving here. This assumption is incorrect, and the RDD closure should handle it, or use what I

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-21 Thread mengxr
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22112 Then it doesn't meet the requirements for those operations used by MLlib: * sampling * zipWithIndex, zipWithUniqueId * we also use zip, assuming the ordering from the source RDD is

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 > always return the same result with same order when rerun.. maybe the word "idempotent" is not that accurate. Spark doesn't really care about the order, so the requirement is, for the

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread mengxr
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22112 If "always return the same result with same order when rerun." is the definition of "idempotent", then yes, MLlib RDD closures always returns the same result if the input doesn't change. We use

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 So there are 2 options: 1. ask the RDD closure to be idempotent. I'm not sure if it's OK for MLlib, cc @mengxr @WeichenXu123 @yanboliang 2. ask the output committer to be able

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #94988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94988/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112 @mridulm Thanks for pointing out that comment, I hadn't seen it, its a very nice write up. I don't agree that " We actually cannot support random output". Users can do this now in MR and spark

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112 > The FileCommitProtocol is an internal API, and our current implementation does store task-level data temporary in a staging directory (See HadoopMapReduceCommitProtocol). That said, we can fix

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94923/ Test FAILed. ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #94923 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94923/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 I've removed the concept of "order sensitive partitioner" and came up with a better abstraction. Please take a look at the updated PR descrption, thanks! ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs The `FileCommitProtocol` is an internal API, and our current implementation does store task-level data temporary in a staging directory (See `HadoopMapReduceCommitProtocol`). That

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #94923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94923/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112 I can't envision how that would work? You can't change how output committers work. You would have to not store anything until all pass or store it temporarily, both in my opinion are not good.

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 @mridulm shuffled RDD will never be deterministic unless the shuffle key is the entire record and key ordering is specified. The reduce task fetches multiple remote shuffle blocks at the same

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94898/ Test FAILed. ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #94898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94898/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs Please see https://github.com/apache/spark/pull/22112#discussion_r210788359 for a further elaboration. We actually cannot support random order (except for small subset of cases like

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112 @cloud-fan I think you summarized it nicely, but I think you keep forgetting about the ResultTask though.It's one thing to say this is a temporary work around and if you have a failure in a

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #94898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94898/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #94893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94893/testReport)** for PR 22112 at commit

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94893/ Test FAILed. ---

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed):

<    1   2   3   4   >