[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-14 Thread jtengyp
Github user jtengyp commented on the issue: https://github.com/apache/spark/pull/17936 I think @ConeyLiu should directly test the Cartesian phase with the following patch:

    val user = model.userFeatures
    val item = model.productFeatures
    val start = System.nanoTime
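The quoted patch is truncated here; a minimal sketch of how such a cartesian-phase timing test might continue (the cartesian-plus-count body and the printing are assumptions, not the author's actual code):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Hypothetical continuation of the truncated snippet: time only the
    // cartesian phase of user x item feature pairing, forcing evaluation
    // with count() so the work is not deferred.
    def timeCartesianPhase(model: MatrixFactorizationModel): Unit = {
      val user = model.userFeatures      // RDD[(Int, Array[Double])]
      val item = model.productFeatures
      val start = System.nanoTime
      val pairs = user.cartesian(item).count()
      println(s"cartesian: $pairs pairs in ${(System.nanoTime - start) / 1e9} s")
    }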

[GitHub] spark pull request #17898: [SPARK-20638][Core]Optimize the CartesianRDD to r...

2017-05-14 Thread jtengyp
Github user jtengyp closed the pull request at: https://github.com/apache/spark/pull/17898

[GitHub] spark issue #17898: Optimize the CartesianRDD to reduce repeatedly data fetc...

2017-05-08 Thread jtengyp
Github user jtengyp commented on the issue: https://github.com/apache/spark/pull/17898 Here is my test. Environment: 3 workers, each with 10 cores, 30G memory, and 1 executor. Test data: 480,189 users, each a 10-dim vector, and 17,770 items, each a 10-dim vector. With
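A hedged sketch of how feature RDDs matching those shapes could be generated for such a test (the counts and the 10-dim vectors come from the description above; the helper itself is an assumption, not the author's script):

    import org.apache.spark.SparkContext
    import scala.util.Random

    // Sketch: build feature RDDs of the described shapes,
    // 480,189 users and 17,770 items, each a 10-dim vector.
    def makeFeatures(sc: SparkContext) = {
      def vec(id: Int) = (id, Array.fill(10)(Random.nextDouble()))
      val users = sc.parallelize(0 until 480189).map(vec)
      val items = sc.parallelize(0 until 17770).map(vec)
      (users, items)
    }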

[GitHub] spark pull request #17898: Optimize the CartesianRDD

2017-05-08 Thread jtengyp
Github user jtengyp commented on a diff in the pull request: https://github.com/apache/spark/pull/17898#discussion_r115199537 --- Diff: core/src/main/scala/org/apache/spark/rdd/CartesianRDD.scala --- @@ -72,8 +72,10 @@ class CartesianRDD[T: ClassTag, U: ClassTag

[GitHub] spark pull request #17898: Update CartesianRDD.scala

2017-05-08 Thread jtengyp
GitHub user jtengyp opened a pull request: https://github.com/apache/spark/pull/17898 Update CartesianRDD.scala In compute, group each iterator into multiple groups, reducing repeated data fetching. ## What changes were proposed in this pull request? In compute
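A minimal sketch of the grouped-iteration idea the description refers to (assumed from the summary above, not the exact patch): buffer the left iterator in blocks, so the right partition's iterator is re-created once per block instead of once per left element.

    // Sketch of grouped cartesian iteration. `makeRight` re-creates the
    // right-side iterator (in CartesianRDD.compute this would correspond
    // to () => rdd2.iterator(currSplit.s2, context), the side that gets
    // repeatedly fetched). groupSize trades memory for fewer re-fetches.
    def groupedCartesian[T, U](
        left: Iterator[T],
        makeRight: () => Iterator[U],
        groupSize: Int = 1000): Iterator[(T, U)] = {
      // Buffer groupSize left elements, then stream the right side once
      // per block, pairing each right element with every buffered one.
      left.grouped(groupSize).flatMap { block =>
        makeRight().flatMap(y => block.iterator.map(x => (x, y)))
      }
    }

Note the pair ordering differs from a naive nested loop, but the full cartesian product is still produced, and the right side is materialized only ceil(leftSize / groupSize) times instead of leftSize times.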

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-04-27 Thread jtengyp
Github user jtengyp commented on the issue: https://github.com/apache/spark/pull/17742 I did some tests with the PR. Here is the cluster configuration: 3 workers, each with 10 cores and 30G memory. With the Netflix dataset (480,189 users and 17,770 movies), the
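A sketch of the kind of end-to-end measurement being described for the recommendForAll path (assuming a trained MatrixFactorizationModel and top-10 recommendations; the helper is illustrative, not the author's benchmark):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Time recommendProductsForUsers, the path SPARK-11968 optimizes;
    // count() forces evaluation of the full recommendation job.
    def benchmarkRecommendForAll(model: MatrixFactorizationModel): Double = {
      val start = System.nanoTime
      model.recommendProductsForUsers(10).count()  // top-10 per user
      (System.nanoTime - start) / 1e9              // elapsed seconds
    }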