[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-15 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16574
  
I need to survey better Cartesian implementations, especially shuffle-based 
ones. I am closing this PR for now and will reopen it once the new solution 
is ready.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16574
  
@mridulm
Yeah, I know you are worried about the shuffle cost here. Currently, when 
`spark.shuffle.reduceLocality.enabled` is true (the default), each shuffle 
reducer is launched on the node with the largest outputs. So the 
implementation in this PR should get good data locality, and its network 
transfer cost should be similar to that of the current `NarrowDependency` 
implementation, IMO.

But you mention that Cartesian can be implemented more efficiently with a 
shuffle... I would like to research that and consider a better solution. 
Thanks!
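
A rough illustration of the placement rule as it is described in the comment above: each reducer prefers the host that already holds the most map output for it. This is a plain-Python model, not Spark's scheduler code; `preferred_reducer_hosts`, the host names, and the byte counts are hypothetical, and any extra conditions or thresholds the real scheduler may apply are not modeled.

```python
def preferred_reducer_hosts(map_output_bytes):
    """Pick, for each reducer, the host holding the most map output for it.

    map_output_bytes[host][reducer] is the number of bytes that map tasks
    on `host` wrote for `reducer` (hypothetical input layout).
    """
    reducers = set()
    for sizes in map_output_bytes.values():
        reducers.update(sizes)

    preferred = {}
    for reducer in sorted(reducers):
        # Iterate hosts in sorted order so ties resolve deterministically.
        preferred[reducer] = max(
            sorted(map_output_bytes),
            key=lambda host: map_output_bytes[host].get(reducer, 0),
        )
    return preferred


outputs = {
    "node-a": {0: 900, 1: 100},
    "node-b": {0: 50, 1: 700},
}
placement = preferred_reducer_hosts(outputs)
print(placement)
```

In this model most of each reducer's input is already local, which is the intuition behind the claim that the network cost could stay close to the narrow-dependency case.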





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-14 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/16574
  

A couple of points:

a) Can recomputation be expensive? Unfortunately, yes, if not used 
properly. For better or for worse, this has been the implementation in Spark 
since the early days (pre-0.5), and the costs are known. Particularly given 
Apache Spark's ability to cache/checkpoint data, the assumption is that shuffle 
is more expensive. This might not hold anymore, given the improvements since 
1.0, but only redoing the benchmarks will give a better picture.

b) If we were to do a shuffle for cartesian, I would implement it 
differently: take a look at how Apache Pig has implemented it for a more 
efficient way to do it. (Btw, I don't think the impl in the PR actually works, 
but I have not looked at it in detail.)
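
To make the recomputation cost in (a) concrete, here is a small plain-Python model, not Spark code. It assumes a nested-loop cartesian in which each output task pairs one left and one right parent partition, nothing is cached, and the right partition is re-evaluated for every element of the left partition. `cartesian_compute_counts` and the sample partitions are hypothetical.

```python
from collections import Counter


def cartesian_compute_counts(left_parts, right_parts):
    """Build all pairs and count how often each parent partition is computed."""
    counts = Counter()

    def compute(side, index, data):
        # Stands in for recomputing an unpersisted parent partition.
        counts[(side, index)] += 1
        return iter(data)

    pairs = []
    for i, left in enumerate(left_parts):
        for j, right in enumerate(right_parts):
            # One output task per (i, j) pair of parent partitions.
            for x in compute("left", i, left):
                # The right partition is evaluated again for every x.
                for y in compute("right", j, right):
                    pairs.append((x, y))
    return pairs, counts


left = [[1, 2], [3, 4]]                        # 2 partitions, 2 elements each
right = [["a", "b"], ["c", "d"], ["e", "f"]]   # 3 partitions, 2 elements each
pairs, counts = cartesian_compute_counts(left, right)
print(len(pairs), counts[("left", 0)], counts[("right", 0)])
```

Under these assumptions each left partition is computed once per right partition, and each right partition once per left element, so the cost grows quickly unless the parents are cached or checkpointed.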





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16574
  
@mridulm Hmm... so keeping `NarrowDependency` still seems better, but I 
think the recomputation is a serious problem when the parent RDDs are not 
persisted. In that case we should print a warning message to remind 
developers to check their Spark application code... What do you think?
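
A minimal sketch of what such a warning check could look like, in plain Python rather than Spark code. `ParentInfo` is a hypothetical stand-in for an RDD that only records a name and whether it is persisted, and `unpersisted_parent_warnings` is likewise an illustrative helper, not an existing Spark API.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("cartesian_check")


@dataclass
class ParentInfo:
    # Hypothetical stand-in for a parent RDD.
    name: str
    persisted: bool


def unpersisted_parent_warnings(parents):
    """Log and return one warning per parent that is not persisted."""
    messages = []
    for parent in parents:
        if not parent.persisted:
            msg = (
                f"cartesian parent '{parent.name}' is not persisted; "
                "its partitions may be recomputed many times"
            )
            logger.warning(msg)
            messages.append(msg)
    return messages


warnings_found = unpersisted_parent_warnings(
    [ParentInfo("left", persisted=True), ParentInfo("right", persisted=False)]
)
print(warnings_found)
```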





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/16574
  

This is a behavior change and will break the expectations of existing code 
that depends on cartesian not going through a shuffle (particularly when the 
data is already persisted).





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16574
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71328/





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16574
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16574
  
**[Test build #71328 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71328/testReport)**
 for PR 16574 at commit 
[`815063b`](https://github.com/apache/spark/commit/815063b5127857b3e2a76f19ee945ff54d8dd110).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16574
  
**[Test build #71328 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71328/testReport)**
 for PR 16574 at commit 
[`815063b`](https://github.com/apache/spark/commit/815063b5127857b3e2a76f19ee945ff54d8dd110).





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16574
  
Jenkins, test this please.





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16574
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71322/





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16574
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16574
  
**[Test build #71322 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71322/testReport)**
 for PR 16574 at commit 
[`815063b`](https://github.com/apache/spark/commit/815063b5127857b3e2a76f19ee945ff54d8dd110).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16574
  
**[Test build #71322 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71322/testReport)**
 for PR 16574 at commit 
[`815063b`](https://github.com/apache/spark/commit/815063b5127857b3e2a76f19ee945ff54d8dd110).





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16574
  
**[Test build #71321 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71321/testReport)**
 for PR 16574 at commit 
[`e114eed`](https://github.com/apache/spark/commit/e114eeddedd02547e0b57bd9a00291885b116daa).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16574
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71321/





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16574
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16574
  
**[Test build #71321 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71321/testReport)**
 for PR 16574 at commit 
[`e114eed`](https://github.com/apache/spark/commit/e114eeddedd02547e0b57bd9a00291885b116daa).





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16574
  
**[Test build #71320 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71320/testReport)**
 for PR 16574 at commit 
[`14ba3b2`](https://github.com/apache/spark/commit/14ba3b24373d7a1d627bbc8b4b3d60ab6a92da07).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16574
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71320/





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16574
  
Merged build finished. Test FAILed.

