Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95214 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95214/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95213 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95213/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95210 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95210/testReport)**
for PR 22112 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95208 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95208/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95202/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95202 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95202/testReport)**
for PR 22112 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95207 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95207/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95202 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95202/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
To confirm, is everyone OK with merging this PR, or are we just OK with the
direction and need more time to review it?
---
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
@tgravescs:
> The shuffle simply transfers the bytes it's supposed to. Spark's shuffle of
those bytes is not consistent in that the order it fetches from can change and
without the sort
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95129/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95129 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95129/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95128/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95128 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95128/testReport)**
for PR 22112 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95129 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95129/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95128 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95128/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
BTW, I think a cleaner fix is to make shuffle files reliable (e.g. put them
on HDFS), so that Spark will never retry a task from a finished shuffle map
stage. Then all the problems go away, the
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
> Without making shuffle output order repeatable, we do not have a way to
properly fix this.
Perhaps I'm missing it, but you are saying shuffle here, but just shuffle
itself can't fix
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
Catching up on discussion ...
@cloud-fan
> shuffled RDD will never be deterministic unless the shuffle key is the
entire record and key ordering is specified.
Let me rephrase
Github user markhamstra commented on the issue:
https://github.com/apache/spark/pull/22112
I'm not a fan of the IDEMPOTENT, RANDOM_ORDER, COMPLETE_RANDOM naming.
IDEMPOTENT is fine, but I'd prefer UNORDERED and INDETERMINATE to cover the
cases of "same values in potentially a
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
FYI I've implemented support for "repeatable" RDD actions in my local
branch. It needs to add a new parameter to the public `SparkContext#runJob`, so
I'm a little hesitant to push it. Please
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
> how does the user then tell spark that the result stage becomes
repeatable because they did the checkpoint?
There are 2 concepts here:
1. The random level of the RDD computing
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
> I'm proposing an option 3:
> Retry all the tasks of all the succeeding stages if a stage with
repartition/zip failed. All RDD actions should tell Spark if it's "repeatable",
which becomes a
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95096/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95096 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95096/testReport)**
for PR 22112 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95096 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95096/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
> 1. force a sort on these operations
I think this is the most obvious fix, and that's how we fixed the
`Dataset#repartition`. Like we discussed before, it's hard to apply it to RDD,
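
As an illustration of the sort-first option quoted above, here is a minimal sketch in plain Python rather than Spark. The helper `sorted_round_robin` and the record values are hypothetical; the sketch only shows that sorting before a round-robin assignment makes the layout independent of arrival order, and it assumes the records have a total order (which `sorted` requires).

```python
def sorted_round_robin(records, num_partitions):
    """Sort the records, then assign them to partitions by position."""
    ordered = sorted(records)  # assumption: records are comparable
    partitions = [[] for _ in range(num_partitions)]
    for position, record in enumerate(ordered):
        partitions[position % num_partitions].append(record)
    return partitions

# The same four records arriving in two different orders
# (e.g. a first attempt and a retried task).
first_attempt = sorted_round_robin(["a1", "a2", "b1", "b2"], 2)
retry = sorted_round_robin(["b1", "b2", "a1", "a2"], 2)

# Both attempts produce the same partition layout.
print(first_attempt == retry)
```

The cost of this approach is the extra sort on every such operation, whether or not a retry ever happens.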
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95023/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95023 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95023/testReport)**
for PR 22112 at commit
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
I only see 2 options:
1. force a sort on these operations
2. do nothing and require users to sort or handle it some other way (e.g.
checkpoint) if they care.
You can possibly make
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
> 2. ask the output committer to be able to overwrite a committed task.
Note that, the output committer here is the FileCommitProtocol interface in
Spark, not the hadoop output committer. We
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95023 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95023/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
retest this please
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95005/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95005 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95005/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #95005 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95005/testReport)**
for PR 22112 at commit
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
retest this please
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94988/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #94988 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94988/testReport)**
for PR 22112 at commit
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
> assuming the ordering from the source RDD is preserved
This is the problem we are resolving here. This assumption is incorrect,
and the RDD closure should handle it, or use what I
Github user mengxr commented on the issue:
https://github.com/apache/spark/pull/22112
Then it doesn't meet the requirements for those operations used by MLlib:
* sampling
* zipWithIndex, zipWithUniqueId
* we also use zip, assuming the ordering from the source RDD is
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
> always return the same result with same order when rerun..
Maybe the word "idempotent" is not that accurate. Spark doesn't really care
about the order, so the requirement is, for the
Github user mengxr commented on the issue:
https://github.com/apache/spark/pull/22112
If "always return the same result with same order when rerun." is the
definition of "idempotent", then yes, MLlib RDD closures always return the
same result if the input doesn't change. We use
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
So there are 2 options:
1. ask the RDD closure to be idempotent. I'm not sure if it's OK for MLlib,
cc @mengxr @WeichenXu123 @yanboliang
2. ask the output committer to be able
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #94988 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94988/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
@mridulm Thanks for pointing out that comment, I hadn't seen it; it's a very
nice write-up. I don't agree that "We actually cannot support random output".
Users can do this now in MR and Spark
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
> The FileCommitProtocol is an internal API, and our current implementation
does store task-level data temporarily in a staging directory (See
HadoopMapReduceCommitProtocol). That said, we can fix
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94923/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #94923 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94923/testReport)**
for PR 22112 at commit
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
I've removed the concept of "order sensitive partitioner" and come up with
a better abstraction. Please take a look at the updated PR description, thanks!
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
@tgravescs The `FileCommitProtocol` is an internal API, and our current
implementation does store task-level data temporarily in a staging directory (See
`HadoopMapReduceCommitProtocol`). That
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #94923 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94923/testReport)**
for PR 22112 at commit
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
I can't envision how that would work. You can't change how output
committers work. You would either have to not store anything until all pass or
store it temporarily; in my opinion neither is good.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22112
@mridulm shuffled RDD will never be deterministic unless the shuffle key is
the entire record and key ordering is specified. The reduce task fetches
multiple remote shuffle blocks at the same
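
A minimal sketch of the concern raised in this comment, in plain Python rather than Spark. The helper `round_robin` and the block contents are hypothetical; they only stand in for a reduce-side step that distributes records in whatever order they happen to arrive. The same records fetched in a different order end up in a different partition layout.

```python
from itertools import chain

def round_robin(records, num_partitions):
    """Assign records to partitions strictly by arrival position."""
    partitions = [[] for _ in range(num_partitions)]
    for position, record in enumerate(records):
        partitions[position % num_partitions].append(record)
    return partitions

# Two hypothetical shuffle blocks fetched by the same reduce task.
block_a = ["a1", "a2", "a3"]
block_b = ["b1", "b2", "b3"]

# First attempt: block A arrives before block B.
first_attempt = round_robin(list(chain(block_a, block_b)), 2)
# Retry: the same blocks arrive in the opposite order.
retry = round_robin(list(chain(block_b, block_a)), 2)

# The set of records is identical, but the per-partition layout is not.
same_records = sorted(chain.from_iterable(first_attempt)) == sorted(
    chain.from_iterable(retry)
)
same_layout = first_attempt == retry
print(same_records, same_layout)
```

If a downstream stage has already consumed part of the first layout, recomputing with the second layout can duplicate some records and drop others, which is the kind of inconsistency this thread is discussing.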
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test FAILed.
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94898/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #94898 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94898/testReport)**
for PR 22112 at commit
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
@tgravescs Please see
https://github.com/apache/spark/pull/22112#discussion_r210788359 for a further
elaboration. We actually cannot support random order (except for a small subset
of cases like
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
@cloud-fan I think you summarized it nicely, but I think you keep
forgetting about the ResultTask. It's one thing to say this is a
temporary workaround and if you have a failure in a
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #94898 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94898/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22112
**[Test build #94893 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94893/testReport)**
for PR 22112 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94893/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22112
Test PASSed.
Refer to this link for build results (access rights to CI server needed):