[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22112
  
**[Test build #94855 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94855/testReport)**
 for PR 22112 at commit 
[`d187de8`](https://github.com/apache/spark/commit/d187de804b72ecd71bcd6c04af38af1e6c483ab6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2250/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-16 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22112
  
Personally I don't want to talk about implementation until we decide what 
we want our semantics to be around the unordered operations because that 
affects any implementation. 
If we are saying we need to fix zip and any other unordered operation that 
means we don't really support unordered operations and everything needs to be 
sorted. 

I would propose we fix the things that are using the round robin type 
partitioning (repartition) but then unordered things like zip/MapPartitions 
(via user code) we document or perhaps give the user the option to sort.  

@mridulm  you caught the issues with zip and others and have said they need 
to be fixed, what are your thoughts?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-16 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/22112
  
You are perfectly correct @jiangxb1987, that was a silly mistake on my part 
- and not trivial at all !
It should be shuffle dependency we should rely on when traversing the 
dependency tree, not shuffledrdd.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-16 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/22112
  
> IMO we should traverse the dependency graph and rely on how ShuffledRDD 
is configured

A trivial point here - Since `ShuffleDependency` is also a DeveloperAPI, 
it's possible for users to write a customized RDD that behaves like 
`ShuffleRDD`, so we may want to depend on dependencies rather than RDDs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-16 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/22112
  
I am not sure what the definition of `isIdempotent` here is.

For example, from MapPartitionsRDD :
```
override private[spark] def isIdempotent = {
if (inputOrderSensitive) {
  prev.isIdempotent
} else {
  true
}
  }
```

Consider:
`val rdd1 = rdd.groupBy().map(...).repartition(...).filter(...)`.
By definition above, this would make rdd1 idempotent.
Depending on what the definition of idempotent is (partition level, record 
level, etc) - this can be correct or wrong code.


Similarly, I am not sure why idempotency or ordering is depending on 
`Partitioner`.
IMO we should traverse the dependency graph and rely on how `ShuffledRDD` 
is configured - whether there is a key ordering specified (applies to both 
global sort and per partition sort), whether it is from a checkpoint or marked 
for checkpoint, whether it is from a stable input source, etc.






---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22112
  
need to look at in more detail but if its straight forward could at least 
do this short term for the repartition case.

I guess I question whether we really want to do it for zip and other 
things, see my comment here though: https://github.com/apache/spark/pull/21698


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22112
  
nevermind see you have an abortStage in there for ResultTask


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22112
  
haven't look at code yet, does it just fail with ResultTask then?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2224/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94821/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22112
  
**[Test build #94821 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94821/testReport)**
 for PR 22112 at commit 
[`1f9f6e5`](https://github.com/apache/spark/commit/1f9f6e5b020038be1e7c11b9923010465da385aa).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22112
  
**[Test build #94821 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94821/testReport)**
 for PR 22112 at commit 
[`1f9f6e5`](https://github.com/apache/spark/commit/1f9f6e5b020038be1e7c11b9923010465da385aa).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org