[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-30 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-98002797
  
@debasish83 Do you mind me sending you an update?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-30 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-98021307
  
@mengxr please go ahead...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-26 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-96403986
  
was very last few weeks...update it in next few days...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-21 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-95009893
  
@debasish83 Do you have time to update the PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-07 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-90740329
  
@debasish83 In RankingMetrics, the label set doesn't have an ordering. It 
is either hit or miss at each position.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-05 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-89728639
  
@debasish83 do you mean RMSE? it is well-defined but not very useful. MAP 
is the useful metric. I think that only a rank-dependent metric makes sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-05 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-89729377
  
I meant MAP...what's the MAP on netflix dataset you have seen before and 
with what lambda ? I am running MAP experiments with various factorization 
formulations including loglikelihood loss with normalization constraints...also 
how do you define MAP for implicit feedback (binary dataset, click is 1 and no 
click is 0) ? In the label set every rating is 1.0 and so there is no ranking 
defined as such...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-05 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-89729632
  
I don't remember from the last time I tried the Netflix dataset but in 
general MAP is low, like 0.05-0.1, even when recommendations are good 
according to RMSE. 

MAP only requires a ranking and so there is no difference in how you define 
it when the input is ratings or clicks (explicit vs implicit). he input data 
set does not need a ranking, just a notion of relevant or not. The 
recommendation is relevant / correct if the user actually interacted with it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-05 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-89729777
  
agreed with the implicit MAP calculationFor netflix dataset, I got 
0.014...May be I need to use a better regularization...was that 0.05-0.1 number 
from using lambda = 0.065 ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-05 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-89729865
  
That might be a fine score. I remember them being MAP being quite low in 
general and that 0.1 would be very good. I don't recall the number for 
Netflix, so don't take that number as a guide. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-04 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27769592
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -167,23 +169,66 @@ object MovieLensALS {
   .setProductBlocks(params.numProductBlocks)
   .run(training)
 
-val rmse = computeRmse(model, test, params.implicitPrefs)
-
-println(sTest RMSE = $rmse.)
+params.metrics match {
+  case rmse =
+val rmse = computeRmse(model, test, params.implicitPrefs)
+println(sTest RMSE = $rmse)
+  case map =
+val (map, users) = computeRankingMetrics(model, training, test, 
numMovies.toInt)
+println(sTest users $users MAP $map)
+  case _ = println(sMetrics not defined, options are rmse/map)
+}
 
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
-  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean)
-: Double = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
+  def computeRmse(
+model: MatrixFactorizationModel,
+data: RDD[Rating],
+implicitPrefs: Boolean) : Double = {
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
+  }
+  
+  /** Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation */
+  def computeRankingMetrics(
+model: MatrixFactorizationModel,
+train: RDD[Rating],
+test: RDD[Rating],
+n: Int) : (Double, Long) = {
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
--- End diff --

I will update with topByKeyIs there a better place to move this 
function ? may be inside ALS object for example ? That way I can add a 
test-case to guard it ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-04 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-89697236
  
@srowen For netflix dataset what's the MAP you have seen before...I started 
experiments on Netflix dataset...lambda is 0.065 for netflix as well right ? 
For MovieLens 0.065 works well...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-04 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-89706247
  
@coderxiang @mengxr If I have a dataset with implicit (click or 0) then MAP 
is not that well defined right since in label set everything is 1.0 and so 
there is no ordering definedshould we add a rank independent metric for 
implicit datasets ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707377
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -138,14 +141,122 @@ class MatrixFactorizationModel(
   }
 
   private def recommend(
-  recommendToFeatures: Array[Double],
-  recommendableFeatures: RDD[(Int, Array[Double])],
-  num: Int): Array[(Int, Double)] = {
-val scored = recommendableFeatures.map { case (id,features) =
-  (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+recommendToFeatures: Array[Double],
+recommendableFeatures: RDD[(Int, Array[Double])],
+num: Int): Array[(Int, Double)] = {
+val recommendToVector = Vectors.dense(recommendToFeatures)
+val scored = recommendableFeatures.map {
+  case (id, features) =
+(id, BLAS.dot(recommendToVector, Vectors.dense(features)))
 }
 scored.top(num)(Ordering.by(_._2))
   }
+
+  /**
+   * Recommends topK products for all users
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
userID and an array of
+   * rating objects which contains the same userId, recommended productID 
and a score in the
+   * rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommends topK users for all products
+   *
+   * @param num how many users to return for every product.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
productID and an array
+   * of rating objects which contains the recommended userId, same 
productID and a score in the
+   * rating field. Semantics of score is same as recommendUsers API
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
+   *
+   * @param userTopK how many products to return for every user in 
userTopK RDD.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
userID and an array
+   * of rating objects which contains the same userId, recommended 
productID and a score in the
+   * rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(
+userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+val userFeaturesTopK = userFeatures.join(userTopK).map {
+  case (userId, (userFeature, topK)) =
+(userId, FeatureTopK(Vectors.dense(userFeature), topK))
+}
+
+// TO DO: Do a mini-batch on productFeatures.collect if the dimension 
of rows are big
+val productVectors = productFeatures.map {
+  x = (x._1, Vectors.dense(x._2))
+}.collect
--- End diff --

`collect` - `collect()` (We use `()` if this is an action.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707375
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -138,14 +141,122 @@ class MatrixFactorizationModel(
   }
 
   private def recommend(
-  recommendToFeatures: Array[Double],
-  recommendableFeatures: RDD[(Int, Array[Double])],
-  num: Int): Array[(Int, Double)] = {
-val scored = recommendableFeatures.map { case (id,features) =
-  (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+recommendToFeatures: Array[Double],
+recommendableFeatures: RDD[(Int, Array[Double])],
+num: Int): Array[(Int, Double)] = {
+val recommendToVector = Vectors.dense(recommendToFeatures)
+val scored = recommendableFeatures.map {
+  case (id, features) =
+(id, BLAS.dot(recommendToVector, Vectors.dense(features)))
 }
 scored.top(num)(Ordering.by(_._2))
   }
+
+  /**
+   * Recommends topK products for all users
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
userID and an array of
+   * rating objects which contains the same userId, recommended productID 
and a score in the
+   * rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommends topK users for all products
+   *
+   * @param num how many users to return for every product.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
productID and an array
+   * of rating objects which contains the recommended userId, same 
productID and a score in the
+   * rating field. Semantics of score is same as recommendUsers API
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
--- End diff --

This is not necessary if we use a global `num`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707380
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -138,14 +141,122 @@ class MatrixFactorizationModel(
   }
 
   private def recommend(
-  recommendToFeatures: Array[Double],
-  recommendableFeatures: RDD[(Int, Array[Double])],
-  num: Int): Array[(Int, Double)] = {
-val scored = recommendableFeatures.map { case (id,features) =
-  (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+recommendToFeatures: Array[Double],
+recommendableFeatures: RDD[(Int, Array[Double])],
+num: Int): Array[(Int, Double)] = {
+val recommendToVector = Vectors.dense(recommendToFeatures)
+val scored = recommendableFeatures.map {
+  case (id, features) =
+(id, BLAS.dot(recommendToVector, Vectors.dense(features)))
 }
 scored.top(num)(Ordering.by(_._2))
   }
+
+  /**
+   * Recommends topK products for all users
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
userID and an array of
+   * rating objects which contains the same userId, recommended productID 
and a score in the
+   * rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommends topK users for all products
+   *
+   * @param num how many users to return for every product.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
productID and an array
+   * of rating objects which contains the recommended userId, same 
productID and a score in the
+   * rating field. Semantics of score is same as recommendUsers API
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
+   *
+   * @param userTopK how many products to return for every user in 
userTopK RDD.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
userID and an array
+   * of rating objects which contains the same userId, recommended 
productID and a score in the
+   * rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(
+userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+val userFeaturesTopK = userFeatures.join(userTopK).map {
+  case (userId, (userFeature, topK)) =
+(userId, FeatureTopK(Vectors.dense(userFeature), topK))
+}
+
+// TO DO: Do a mini-batch on productFeatures.collect if the dimension 
of rows are big
+val productVectors = productFeatures.map {
+  x = (x._1, Vectors.dense(x._2))
+}.collect
+
+// TO DO: User BLAS dgemm
+userFeaturesTopK.map {
+  case (userId, userFeatureTopK) = {
+val predictions = productVectors.map {
+  case (productId, productVector) =
+Rating(userId, productId,
+  BLAS.dot(userFeatureTopK.feature, productVector))
+}
+(userId, Utils.takeOrdered(predictions.iterator,
+  userFeatureTopK.topK)(ord.reverse).toArray)
--- End diff --

Use the `topByKey` method to simplify the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707373
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -138,14 +141,122 @@ class MatrixFactorizationModel(
   }
 
   private def recommend(
-  recommendToFeatures: Array[Double],
-  recommendableFeatures: RDD[(Int, Array[Double])],
-  num: Int): Array[(Int, Double)] = {
-val scored = recommendableFeatures.map { case (id,features) =
-  (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+recommendToFeatures: Array[Double],
+recommendableFeatures: RDD[(Int, Array[Double])],
+num: Int): Array[(Int, Double)] = {
+val recommendToVector = Vectors.dense(recommendToFeatures)
+val scored = recommendableFeatures.map {
+  case (id, features) =
+(id, BLAS.dot(recommendToVector, Vectors.dense(features)))
 }
 scored.top(num)(Ordering.by(_._2))
   }
+
+  /**
+   * Recommends topK products for all users
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
userID and an array of
+   * rating objects which contains the same userId, recommended productID 
and a score in the
+   * rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
--- End diff --

Is it common to use different `num`s? I think it should be sufficient to 
set `num` globally.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707386
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -138,14 +141,122 @@ class MatrixFactorizationModel(
   }
 
   private def recommend(
-  recommendToFeatures: Array[Double],
-  recommendableFeatures: RDD[(Int, Array[Double])],
-  num: Int): Array[(Int, Double)] = {
-val scored = recommendableFeatures.map { case (id,features) =
-  (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+recommendToFeatures: Array[Double],
+recommendableFeatures: RDD[(Int, Array[Double])],
+num: Int): Array[(Int, Double)] = {
+val recommendToVector = Vectors.dense(recommendToFeatures)
+val scored = recommendableFeatures.map {
+  case (id, features) =
+(id, BLAS.dot(recommendToVector, Vectors.dense(features)))
 }
 scored.top(num)(Ordering.by(_._2))
   }
+
+  /**
+   * Recommends topK products for all users
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
userID and an array of
+   * rating objects which contains the same userId, recommended productID 
and a score in the
+   * rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommends topK users for all products
+   *
+   * @param num how many users to return for every product.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
productID and an array
+   * of rating objects which contains the recommended userId, same 
productID and a score in the
+   * rating field. Semantics of score is same as recommendUsers API
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
+   *
+   * @param userTopK how many products to return for every user in 
userTopK RDD.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
userID and an array
+   * of rating objects which contains the same userId, recommended 
productID and a score in the
+   * rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(
+userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+val userFeaturesTopK = userFeatures.join(userTopK).map {
+  case (userId, (userFeature, topK)) =
+(userId, FeatureTopK(Vectors.dense(userFeature), topK))
+}
+
+// TO DO: Do a mini-batch on productFeatures.collect if the dimension 
of rows are big
+val productVectors = productFeatures.map {
+  x = (x._1, Vectors.dense(x._2))
+}.collect
+
+// TO DO: User BLAS dgemm
+userFeaturesTopK.map {
+  case (userId, userFeatureTopK) = {
+val predictions = productVectors.map {
+  case (productId, productVector) =
+Rating(userId, productId,
+  BLAS.dot(userFeatureTopK.feature, productVector))
+}
+(userId, Utils.takeOrdered(predictions.iterator,
+  userFeatureTopK.topK)(ord.reverse).toArray)
+  }
+}
+  }
+
+  /**
+   * Recommends topK users for all products in productTopK RDD
+   *
+   * @param productTopK how many users to return for every product in 
productTopK RDD
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
productID and an array
+   * of Rating objects which contains the recommended userId, same 
productID and a score in the
+   * rating field. Semantics of score is same as recommendUsers API
+   */
+  def recommendUsersForProducts(
+productTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+val blocks = userFeatures.partitions.size / 2
+
+// TO DO: Do a mini-batch on productFeatures.collect if the dimension 
of rows are big
+val productVectors = productFeatures.join(productTopK).map {
+  case (productId, (productFeature, topK)) =
+(productId, FeatureTopK(Vectors.dense(productFeature), topK))
+}.collect()
+
+// TO DO: Use BLAS dgemm
+userFeatures.mapPartitions { items =
+  val predictions = 

[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707370
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -138,14 +141,122 @@ class MatrixFactorizationModel(
   }
 
   private def recommend(
-  recommendToFeatures: Array[Double],
-  recommendableFeatures: RDD[(Int, Array[Double])],
-  num: Int): Array[(Int, Double)] = {
-val scored = recommendableFeatures.map { case (id,features) =
-  (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+recommendToFeatures: Array[Double],
+recommendableFeatures: RDD[(Int, Array[Double])],
+num: Int): Array[(Int, Double)] = {
+val recommendToVector = Vectors.dense(recommendToFeatures)
+val scored = recommendableFeatures.map {
+  case (id, features) =
+(id, BLAS.dot(recommendToVector, Vectors.dense(features)))
--- End diff --

and same here. Let's keep this code untouched, which is irrelevant to this 
PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707366
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -70,26 +71,28 @@ class MatrixFactorizationModel(
 
   /** Predict the rating of one user for one product. */
   def predict(user: Int, product: Int): Double = {
-val userVector = userFeatures.lookup(user).head
-val productVector = productFeatures.lookup(product).head
-blas.ddot(userVector.length, userVector, 1, productVector, 1)
+val userVector = Vectors.dense(userFeatures.lookup(user).head)
+val productVector = Vectors.dense(productFeatures.lookup(product).head)
+BLAS.dot(userVector, productVector)
--- End diff --

This change is not necessary. It add more wrappers (and temp objects) and 
it doesn't save much code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707360
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -167,23 +169,66 @@ object MovieLensALS {
   .setProductBlocks(params.numProductBlocks)
   .run(training)
 
-val rmse = computeRmse(model, test, params.implicitPrefs)
-
-println(sTest RMSE = $rmse.)
+params.metrics match {
+  case rmse =
+val rmse = computeRmse(model, test, params.implicitPrefs)
+println(sTest RMSE = $rmse)
+  case map =
+val (map, users) = computeRankingMetrics(model, training, test, 
numMovies.toInt)
+println(sTest users $users MAP $map)
+  case _ = println(sMetrics not defined, options are rmse/map)
+}
 
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
-  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean)
-: Double = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
+  def computeRmse(
+model: MatrixFactorizationModel,
+data: RDD[Rating],
+implicitPrefs: Boolean) : Double = {
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
+  }
+  
+  /** Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation */
+  def computeRankingMetrics(
+model: MatrixFactorizationModel,
+train: RDD[Rating],
+test: RDD[Rating],
+n: Int) : (Double, Long) = {
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey().map {
+  case (userId, products) =
+val sortedProducts = products.toArray.sorted(ord.reverse)
+(userId, sortedProducts.map(_._1))
+}
+
+val trainUserLabels = train.map {
+  x = (x.user, x.product)
+}.groupByKey().map{case (userId, products) = (userId, 
products.toArray)}
--- End diff --

`.groupByKey().mapValues(_.toArray)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707363
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -167,23 +169,66 @@ object MovieLensALS {
   .setProductBlocks(params.numProductBlocks)
   .run(training)
 
-val rmse = computeRmse(model, test, params.implicitPrefs)
-
-println(sTest RMSE = $rmse.)
+params.metrics match {
+  case rmse =
+val rmse = computeRmse(model, test, params.implicitPrefs)
+println(sTest RMSE = $rmse)
+  case map =
+val (map, users) = computeRankingMetrics(model, training, test, 
numMovies.toInt)
+println(sTest users $users MAP $map)
+  case _ = println(sMetrics not defined, options are rmse/map)
+}
 
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
-  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean)
-: Double = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
+  def computeRmse(
+model: MatrixFactorizationModel,
+data: RDD[Rating],
+implicitPrefs: Boolean) : Double = {
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
+  }
+  
+  /** Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation */
+  def computeRankingMetrics(
+model: MatrixFactorizationModel,
+train: RDD[Rating],
+test: RDD[Rating],
+n: Int) : (Double, Long) = {
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey().map {
+  case (userId, products) =
+val sortedProducts = products.toArray.sorted(ord.reverse)
+(userId, sortedProducts.map(_._1))
+}
+
+val trainUserLabels = train.map {
+  x = (x.user, x.product)
+}.groupByKey().map{case (userId, products) = (userId, 
products.toArray)}
+
+val rankings = 
model.recommendProductsForUsers(n).join(trainUserLabels).map {
+  case (userId, (pred, train)) = {
+val predictedProducts = pred.map(_.product)
+val trainSet = train.toSet
+(userId, predictedProducts.filterNot { x = trainSet.contains(x) })
+  }
+}.join(testUserLabels).map {
--- End diff --

safer to use `rightOuterJoin`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707367
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -70,26 +71,28 @@ class MatrixFactorizationModel(
 
   /** Predict the rating of one user for one product. */
   def predict(user: Int, product: Int): Double = {
-val userVector = userFeatures.lookup(user).head
-val productVector = productFeatures.lookup(product).head
-blas.ddot(userVector.length, userVector, 1, productVector, 1)
+val userVector = Vectors.dense(userFeatures.lookup(user).head)
+val productVector = Vectors.dense(productFeatures.lookup(product).head)
+BLAS.dot(userVector, productVector)
   }
 
   /**
-* Predict the rating of many users for many products.
-* The output RDD has an element per each element in the input RDD 
(including all duplicates)
-* unless a user or product is missing in the training set.
-*
-* @param usersProducts  RDD of (user, product) pairs.
-* @return RDD of Ratings.
-*/
+   * Predict the rating of many users for many products.
+   * The output RDD has an element per each element in the input RDD 
(including all duplicates)
+   * unless a user or product is missing in the training set.
+   *
+   * @param usersProducts  RDD of (user, product) pairs.
+   * @return RDD of Ratings.
+   */
   def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] = {
-val users = userFeatures.join(usersProducts).map{
+val users = userFeatures.join(usersProducts).map {
   case (user, (uFeatures, product)) = (product, (user, uFeatures))
 }
 users.join(productFeatures).map {
   case (product, ((user, uFeatures), pFeatures)) =
-Rating(user, product, blas.ddot(uFeatures.length, uFeatures, 1, 
pFeatures, 1))
+val userVector = Vectors.dense(uFeatures)
+val productVector = Vectors.dense(pFeatures)
+Rating(user, product, BLAS.dot(userVector, productVector))
--- End diff --

Same here,


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707364
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -167,23 +169,66 @@ object MovieLensALS {
   .setProductBlocks(params.numProductBlocks)
   .run(training)
 
-val rmse = computeRmse(model, test, params.implicitPrefs)
-
-println(sTest RMSE = $rmse.)
+params.metrics match {
+  case rmse =
+val rmse = computeRmse(model, test, params.implicitPrefs)
+println(sTest RMSE = $rmse)
+  case map =
+val (map, users) = computeRankingMetrics(model, training, test, 
numMovies.toInt)
+println(sTest users $users MAP $map)
+  case _ = println(sMetrics not defined, options are rmse/map)
+}
 
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
-  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean)
-: Double = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
+  def computeRmse(
+model: MatrixFactorizationModel,
+data: RDD[Rating],
+implicitPrefs: Boolean) : Double = {
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
+  }
+  
+  /** Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation */
+  def computeRankingMetrics(
+model: MatrixFactorizationModel,
+train: RDD[Rating],
+test: RDD[Rating],
+n: Int) : (Double, Long) = {
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey().map {
+  case (userId, products) =
+val sortedProducts = products.toArray.sorted(ord.reverse)
+(userId, sortedProducts.map(_._1))
+}
+
+val trainUserLabels = train.map {
+  x = (x.user, x.product)
+}.groupByKey().map{case (userId, products) = (userId, 
products.toArray)}
+
+val rankings = 
model.recommendProductsForUsers(n).join(trainUserLabels).map {
+  case (userId, (pred, train)) = {
+val predictedProducts = pred.map(_.product)
+val trainSet = train.toSet
+(userId, predictedProducts.filterNot { x = trainSet.contains(x) })
+  }
+}.join(testUserLabels).map {
+  case (user, (pred, lab)) = (pred, lab)
+}
+
+val metrics = new RankingMetrics(rankings)
+(metrics.meanAveragePrecision, testUserLabels.count)
--- End diff --

If we use `testUserLabels` twice, we should cache it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707362
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -167,23 +169,66 @@ object MovieLensALS {
   .setProductBlocks(params.numProductBlocks)
   .run(training)
 
-val rmse = computeRmse(model, test, params.implicitPrefs)
-
-println(sTest RMSE = $rmse.)
+params.metrics match {
+  case rmse =
+val rmse = computeRmse(model, test, params.implicitPrefs)
+println(sTest RMSE = $rmse)
+  case map =
+val (map, users) = computeRankingMetrics(model, training, test, 
numMovies.toInt)
+println(sTest users $users MAP $map)
+  case _ = println(sMetrics not defined, options are rmse/map)
+}
 
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
-  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean)
-: Double = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
+  def computeRmse(
+model: MatrixFactorizationModel,
+data: RDD[Rating],
+implicitPrefs: Boolean) : Double = {
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
+  }
+  
+  /** Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation */
+  def computeRankingMetrics(
+model: MatrixFactorizationModel,
+train: RDD[Rating],
+test: RDD[Rating],
+n: Int) : (Double, Long) = {
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey().map {
+  case (userId, products) =
+val sortedProducts = products.toArray.sorted(ord.reverse)
+(userId, sortedProducts.map(_._1))
+}
+
+val trainUserLabels = train.map {
+  x = (x.user, x.product)
+}.groupByKey().map{case (userId, products) = (userId, 
products.toArray)}
+
+val rankings = 
model.recommendProductsForUsers(n).join(trainUserLabels).map {
+  case (userId, (pred, train)) = {
+val predictedProducts = pred.map(_.product)
--- End diff --

`pred.map(_.product).diff(train)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27707356
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -167,23 +169,66 @@ object MovieLensALS {
   .setProductBlocks(params.numProductBlocks)
   .run(training)
 
-val rmse = computeRmse(model, test, params.implicitPrefs)
-
-println(sTest RMSE = $rmse.)
+params.metrics match {
+  case rmse =
+val rmse = computeRmse(model, test, params.implicitPrefs)
+println(sTest RMSE = $rmse)
+  case map =
+val (map, users) = computeRankingMetrics(model, training, test, 
numMovies.toInt)
+println(sTest users $users MAP $map)
+  case _ = println(sMetrics not defined, options are rmse/map)
+}
 
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
-  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean)
-: Double = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
+  def computeRmse(
+model: MatrixFactorizationModel,
+data: RDD[Rating],
+implicitPrefs: Boolean) : Double = {
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
+  }
+  
+  /** Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation */
+  def computeRankingMetrics(
+model: MatrixFactorizationModel,
+train: RDD[Rating],
+test: RDD[Rating],
+n: Int) : (Double, Long) = {
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
--- End diff --

Please use the topByKey implementation to compute top items for users: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-89083630
  
@debasish83 The code style still need fixes. On the high level, I think 
having variable `num` is not necessary. Let's do a global num in this PR and 
see whether users ask for variable `num`. @coderxiang added `topByKey` 
recently, so it would be nice if we can reuse the implementation in couple 
places.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-02 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27712646
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -138,14 +141,122 @@ class MatrixFactorizationModel(
   }
 
   private def recommend(
-  recommendToFeatures: Array[Double],
-  recommendableFeatures: RDD[(Int, Array[Double])],
-  num: Int): Array[(Int, Double)] = {
-val scored = recommendableFeatures.map { case (id,features) =
-  (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+recommendToFeatures: Array[Double],
+recommendableFeatures: RDD[(Int, Array[Double])],
+num: Int): Array[(Int, Double)] = {
+val recommendToVector = Vectors.dense(recommendToFeatures)
+val scored = recommendableFeatures.map {
+  case (id, features) =
+(id, BLAS.dot(recommendToVector, Vectors.dense(features)))
 }
 scored.top(num)(Ordering.by(_._2))
   }
+
+  /**
+   * Recommends topK products for all users
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a 
userID and an array of
+   * rating objects which contains the same userId, recommended productID 
and a score in the
+   * rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
--- End diff --

For cross validation we use variable num internally but for final 
recommendation global num is fine...I thought having a topK rdd satisfies both 
use-cases...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27624736
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less 
than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the 
given productID and a
+   *  score in the rating field. Each represents one recommended user, 
and they are sorted
+   *  by score, decreasing. The first returned is the one predicted to be 
most strongly
+   *  recommended to the product. The score is an opaque value that 
indicates how strongly
+   *  recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
--- End diff --

I mean `ord`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-04-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27624801
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -39,14 +39,15 @@ import org.apache.spark.rdd.RDD
 object MovieLensALS {
 
   case class Params(
-  input: String = null,
-  kryo: Boolean = false,
-  numIterations: Int = 20,
-  lambda: Double = 1.0,
-  rank: Int = 10,
-  numUserBlocks: Int = -1,
-  numProductBlocks: Int = -1,
-  implicitPrefs: Boolean = false) extends AbstractParams[Params]
+input: String = null,
--- End diff --

I think it went back to 2-space indentation. The old code has correct 
indentation. Please do not change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88312094
  
  [Test build #29518 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29518/consoleFull)
 for   PR 3098 at commit 
[`98fa424`](https://github.com/apache/spark/commit/98fa4243dc6041290bdde51e1e899a8be7576470).
 * This patch **fails Scala style tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88312097
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29518/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88311999
  
  [Test build #29518 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29518/consoleFull)
 for   PR 3098 at commit 
[`98fa424`](https://github.com/apache/spark/commit/98fa4243dc6041290bdde51e1e899a8be7576470).
 * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88346990
  
I reran the map computation on MovieLens with varying ranks:

Example run:
./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class 
org.apache.spark.examples.mllib.MovieLensALS --jars 
~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar 
--total-executor-cores 4 --executor-memory 4g --driver-memory 1g 
./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 
--metrics map ~/datasets/ml-1m/ratings.dat  

rank = default

Got 1000209 ratings from 6040 users on 3706 movies. 

Training: 800187, test: 200022.
Test users 6035 MAP 0.03499984595868497 


rank = 25

Got 1000209 ratings from 6040 users on 3706 movies. 

Training: 799385, test: 200824.
Test users 6034 MAP 0.042580954047373255


rank = 50

Got 1000209 ratings from 6040 users on 3706 movies. 

Training: 800289, test: 199920.
Test users 6036 MAP 0.048958415806933275

rank = 100

Got 1000209 ratings from 6040 users on 3706 movies. 

Training: 801148, test: 199061.
Test users 6038 MAP 0.05503487765882986

The numbers are consistent with my runs before.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88346961
  
  [Test build #29521 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29521/consoleFull)
 for   PR 3098 at commit 
[`3a0c4eb`](https://github.com/apache/spark/commit/3a0c4eb7f81ee0845f4945d395f6652c965f941b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88311843
  
  [Test build #29517 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29517/consoleFull)
 for   PR 3098 at commit 
[`ee99571`](https://github.com/apache/spark/commit/ee9957144bc2d145c91fc4a4b894ccd2ee6bc2b9).
 * This patch **fails Scala style tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88311846
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29517/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88311599
  
  [Test build #29517 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29517/consoleFull)
 for   PR 3098 at commit 
[`ee99571`](https://github.com/apache/spark/commit/ee9957144bc2d145c91fc4a4b894ccd2ee6bc2b9).
 * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88347022
  
@mengxr could you please do another passI might have missed the JavaRDD 
compatibility issue but fixed rest of your comments...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88361341
  
  [Test build #29521 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29521/consoleFull)
 for   PR 3098 at commit 
[`3a0c4eb`](https://github.com/apache/spark/commit/3a0c4eb7f81ee0845f4945d395f6652c965f941b).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class FeatureTopK(feature: Vector, topK: Int)`
  * `class UDFRegistration(object):`

 * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88361354
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29521/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27533769
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less 
than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the 
given productID and a
+   *  score in the rating field. Each represents one recommended user, 
and they are sorted
+   *  by score, decreasing. The first returned is the one predicted to be 
most strongly
+   *  recommended to the product. The score is an opaque value that 
indicates how strongly
+   *  recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
--- End diff --

I am bit confused...recommendProducts is also a public member but that's 
not in companion object...recommendProductsForUsers is also very similar API 
right ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27525485
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -74,6 +75,9 @@ object MovieLensALS {
   opt[Unit](implicitPrefs)
 .text(use implicit preference)
 .action((_, c) = c.copy(implicitPrefs = true))
+  opt[Unit](validateRecommendation)
--- End diff --

cleaned up --validateRecommendation to --metrics


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27528071
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
--- End diff --

followed the indentation from current code


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27528120
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+train: RDD[Rating], test: RDD[Rating], n: Int) = {
--- End diff --

added


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27528991
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey.map {
+  case (userId, products) =
+val sortedProducts = products.toArray.sorted(ord.reverse)
+(userId, sortedProducts.map { _._1 })
+}
+
+val trainUserLabels = train.map {
+  x = (x.user, x.product)
+}.groupByKey.map {
+  case (userId, products) = (userId, products.toArray)
+}
+
+val rankings = 
model.recommendProductsForUsers(n).join(trainUserLabels).map {
+  case (userId, (pred, train)) = {
+val predictedProducts = pred.map { _.product }
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27528959
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey.map {
+  case (userId, products) =
+val sortedProducts = products.toArray.sorted(ord.reverse)
+(userId, sortedProducts.map { _._1 })
+}
+
+val trainUserLabels = train.map {
+  x = (x.user, x.product)
+}.groupByKey.map {
+  case (userId, products) = (userId, products.toArray)
--- End diff --

merged


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27525568
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
--- End diff --

fixed...can be fit in one line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27528198
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey.map {
--- End diff --

fixed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27528238
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey.map {
+  case (userId, products) =
+val sortedProducts = products.toArray.sorted(ord.reverse)
+(userId, sortedProducts.map { _._1 })
--- End diff --

fixed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27529347
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -35,33 +41,33 @@ import org.apache.spark.rdd.RDD
  *and the features computed for this product.
  */
 class MatrixFactorizationModel private[mllib] (
-val rank: Int,
-val userFeatures: RDD[(Int, Array[Double])],
-val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
+  val rank: Int,
+  val userFeatures: RDD[(Int, Array[Double])],
+  val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
   /** Predict the rating of one user for one product. */
   def predict(user: Int, product: Int): Double = {
-val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
-val productVector = new 
DoubleMatrix(productFeatures.lookup(product).head)
-userVector.dot(productVector)
+val userVector = Vectors.dense(userFeatures.lookup(user).head)
--- End diff --

I cleaned netlib.ddot to BLAS.dot...they will be same for these cases


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27529308
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -35,33 +41,33 @@ import org.apache.spark.rdd.RDD
  *and the features computed for this product.
  */
 class MatrixFactorizationModel private[mllib] (
-val rank: Int,
-val userFeatures: RDD[(Int, Array[Double])],
-val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
+  val rank: Int,
--- End diff --

after merge this is fixed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27529218
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -17,14 +17,20 @@
 
 package org.apache.spark.mllib.recommendation
 
-import java.lang.{Integer = JavaInteger}
-
-import org.jblas.DoubleMatrix
+import java.lang.{ Integer = JavaInteger }
 
 import org.apache.spark.SparkContext._
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
 import org.apache.spark.rdd.RDD
 
+import org.apache.spark.util.collection.Utils
+import org.apache.spark.util.BoundedPriorityQueue
+
+import scala.Ordering
--- End diff --

By organizing imports you mean same package imports will move to one right ?

Old: 

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.BLAS

New:

import org.apache.spark.mllib.linalg.{Vectors, Vector, BLAS}



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27529231
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -17,14 +17,20 @@
 
 package org.apache.spark.mllib.recommendation
 
-import java.lang.{Integer = JavaInteger}
-
-import org.jblas.DoubleMatrix
+import java.lang.{ Integer = JavaInteger }
 
 import org.apache.spark.SparkContext._
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
--- End diff --

cleaned


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27529681
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less 
than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the 
given productID and a
+   *  score in the rating field. Each represents one recommended user, 
and they are sorted
+   *  by score, decreasing. The first returned is the one predicted to be 
most strongly
+   *  recommended to the product. The score is an opaque value that 
indicates how strongly
+   *  recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
+   */
+  def recommendProductsForUsers(
+userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+val userFeaturesTopK = userFeatures.join(userTopK).map {
+  case (userId, (userFeature, topK)) =
+(userId, FeatureTopK(Vectors.dense(userFeature), topK))
+}
+val productVectors = productFeatures.map {
+  x = (x._1, Vectors.dense(x._2))
+}.collect
+
+userFeaturesTopK.map {
+  case (userId, userFeatureTopK) = {
+val predictions = productVectors.map {
+  case (productId, productVector) =
+Rating(userId, productId,
+  BLAS.dot(userFeatureTopK.feature, productVector))
--- End diff --

I will bring in lot of level 3 BLAS in the next PR...I am writing the dgemv 
and dgemm versions for several of these APIs...For now I will add a TODO


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r27535273
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less 
than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the 
given productID and a
+   *  score in the rating field. Each represents one recommended user, 
and they are sorted
+   *  by score, decreasing. The first returned is the one predicted to be 
most strongly
+   *  recommended to the product. The score is an opaque value that 
indicates how strongly
+   *  recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
--- End diff --

documented the public batch prediction APIs


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88291470
  
@mengxr I also added 2 test-cases for batch predict APIs. These features 
are useful if users are interested in computing MAP measures...Let me know if I 
move the function computeRankingMetrics and computeRMSE to the companion class 
of ml.recommendation.ALS ? Currently both of them are in examples...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-31 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-88292172
  
If we move computeRankingMetrics and computeRMSE to a better place, I can 
guard it through tests...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-03-16 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r26545110
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -18,14 +18,14 @@
 package org.apache.spark.examples.mllib
 
 import scala.collection.mutable
-
 import org.apache.log4j.{Level, Logger}
 import scopt.OptionParser
-
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.SparkContext._
 import org.apache.spark.mllib.recommendation.{ALS, 
MatrixFactorizationModel, Rating}
 import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.evaluation.RankingMetrics
+import org.jblas.DoubleMatrix
--- End diff --

Removed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955630
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -74,6 +75,9 @@ object MovieLensALS {
   opt[Unit](implicitPrefs)
 .text(use implicit preference)
 .action((_, c) = c.copy(implicitPrefs = true))
+  opt[Unit](validateRecommendation)
--- End diff --

We can switch the evaluation metric between RMSE and MAP via a `--metrics` 
option. This is more specific than `validateRecommendation`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955624
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -18,14 +18,14 @@
 package org.apache.spark.examples.mllib
 
 import scala.collection.mutable
-
 import org.apache.log4j.{Level, Logger}
 import scopt.OptionParser
-
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.SparkContext._
 import org.apache.spark.mllib.recommendation.{ALS, 
MatrixFactorizationModel, Rating}
 import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.evaluation.RankingMetrics
+import org.jblas.DoubleMatrix
--- End diff --

organize imports 
(https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports).

Do not use JBLAS because we plan to remove it in 1.4. Use 
breeze/netlib-java instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955687
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less 
than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the 
given productID and a
+   *  score in the rating field. Each represents one recommended user, 
and they are sorted
+   *  by score, decreasing. The first returned is the one predicted to be 
most strongly
+   *  recommended to the product. The score is an opaque value that 
indicates how strongly
+   *  recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
+   */
+  def recommendProductsForUsers(
+userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+val userFeaturesTopK = userFeatures.join(userTopK).map {
+  case (userId, (userFeature, topK)) =
+(userId, FeatureTopK(Vectors.dense(userFeature), topK))
+}
+val productVectors = productFeatures.map {
+  x = (x._1, Vectors.dense(x._2))
+}.collect
+
+userFeaturesTopK.map {
+  case (userId, userFeatureTopK) = {
+val predictions = productVectors.map {
+  case (productId, productVector) =
+Rating(userId, productId,
+  BLAS.dot(userFeatureTopK.feature, productVector))
--- End diff --

Leave to TODO here for Level 3 BLAS.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955679
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less 
than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the 
given productID and a
+   *  score in the rating field. Each represents one recommended user, 
and they are sorted
+   *  by score, decreasing. The first returned is the one predicted to be 
most strongly
+   *  recommended to the product. The score is an opaque value that 
indicates how strongly
+   *  recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
--- End diff --

Please check whether the return type is Java-friendly. You can generate the 
API doc and check the Java one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955664
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -17,14 +17,20 @@
 
 package org.apache.spark.mllib.recommendation
 
-import java.lang.{Integer = JavaInteger}
-
-import org.jblas.DoubleMatrix
+import java.lang.{ Integer = JavaInteger }
 
 import org.apache.spark.SparkContext._
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
 import org.apache.spark.rdd.RDD
 
+import org.apache.spark.util.collection.Utils
+import org.apache.spark.util.BoundedPriorityQueue
+
+import scala.Ordering
--- End diff --

organize imports


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955654
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey.map {
--- End diff --

`.groupByKey()`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955648
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
--- End diff --

Fix style. Use `if .. else ..` without `{ .. }` only if it can fit into a 
single line. 
(https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-CurlyBraces)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955659
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey.map {
+  case (userId, products) =
+val sortedProducts = products.toArray.sorted(ord.reverse)
+(userId, sortedProducts.map { _._1 })
+}
+
+val trainUserLabels = train.map {
+  x = (x.user, x.product)
+}.groupByKey.map {
+  case (userId, products) = (userId, products.toArray)
--- End diff --

merge `case .. =` to the line above


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955670
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -17,14 +17,20 @@
 
 package org.apache.spark.mllib.recommendation
 
-import java.lang.{Integer = JavaInteger}
-
-import org.jblas.DoubleMatrix
+import java.lang.{ Integer = JavaInteger }
 
 import org.apache.spark.SparkContext._
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
--- End diff --

Do not add space.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955680
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less 
than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the 
given productID and a
+   *  score in the rating field. Each represents one recommended user, 
and they are sorted
+   *  by score, decreasing. The first returned is the one predicted to be 
most strongly
+   *  recommended to the product. The score is an opaque value that 
indicates how strongly
+   *  recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
--- End diff --

These become public members. Please move them to the companion object and 
mark them package private.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955662
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey.map {
+  case (userId, products) =
+val sortedProducts = products.toArray.sorted(ord.reverse)
+(userId, sortedProducts.map { _._1 })
+}
+
+val trainUserLabels = train.map {
+  x = (x.user, x.product)
+}.groupByKey.map {
+  case (userId, products) = (userId, products.toArray)
+}
+
+val rankings = 
model.recommendProductsForUsers(n).join(trainUserLabels).map {
+  case (userId, (pred, train)) = {
+val predictedProducts = pred.map { _.product }
--- End diff --

`pred.map(_.product)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955657
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+val ord = Ordering.by[(Int, Double), Double](x = x._2)
+
+val testUserLabels = test.map {
+  x = (x.user, (x.product, x.rating))
+}.groupByKey.map {
+  case (userId, products) =
+val sortedProducts = products.toArray.sorted(ord.reverse)
+(userId, sortedProducts.map { _._1 })
--- End diff --

`.map(_._1)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955652
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+train: RDD[Rating], test: RDD[Rating], n: Int) = {
--- End diff --

add return type


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955678
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less 
than this.
--- End diff --

This block doesn't associate with any method. Please move these docs to the 
corresponding methods' JavaDoc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955650
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
 
 println(sTest RMSE = $rmse.)
 
+if (params.validateRecommendation) {
+  val (map, users) = computeRankingMetrics(model,
+training, test, numMovies.toInt)
+  println(sTest users $users MAP $map)
+}
+
 sc.stop()
   }
 
   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], 
implicitPrefs: Boolean) = {
-
-def mapPredictedRating(r: Double) = if (implicitPrefs) 
math.max(math.min(r, 1.0), 0.0) else r
-
 val predictions: RDD[Rating] = model.predict(data.map(x = (x.user, 
x.product)))
-val predictionsAndRatings = predictions.map{ x =
-  ((x.user, x.product), mapPredictedRating(x.rating))
+val predictionsAndRatings = predictions.map { x =
+  ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
 }.join(data.map(x = ((x.user, x.product), x.rating))).values
 math.sqrt(predictionsAndRatings.map(x = (x._1 - x._2) * (x._1 - 
x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product 
Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
--- End diff --

fix indentation: 
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955690
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less 
than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the 
given productID and a
+   *  score in the rating field. Each represents one recommended user, 
and they are sorted
+   *  by score, decreasing. The first returned is the one predicted to be 
most strongly
+   *  recommended to the product. The score is an opaque value that 
indicates how strongly
+   *  recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
+   */
+  def recommendProductsForUsers(
+userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+val userFeaturesTopK = userFeatures.join(userTopK).map {
+  case (userId, (userFeature, topK)) =
+(userId, FeatureTopK(Vectors.dense(userFeature), topK))
+}
+val productVectors = productFeatures.map {
+  x = (x._1, Vectors.dense(x._2))
+}.collect
+
+userFeaturesTopK.map {
+  case (userId, userFeatureTopK) = {
+val predictions = productVectors.map {
+  case (productId, productVector) =
+Rating(userId, productId,
+  BLAS.dot(userFeatureTopK.feature, productVector))
+}
+(userId, Utils.takeOrdered(predictions.iterator,
+  userFeatureTopK.topK)(ord.reverse).toArray)
+  }
+}
+  }
+  
+  /**
+   * Recommend topK users for products in productTopK RDD
+   */
+  def recommendUsersForProducts(
+productTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+val blocks = userFeatures.partitions.size / 2
+
+val productVectors = productFeatures.join(productTopK).map {
+  case (productId, (productFeature, topK)) =
+(productId, FeatureTopK(Vectors.dense(productFeature), topK))
+}.collect()
+
+userFeatures.mapPartitions { items =
+  val predictions = productVectors.map {
+x = (x._1, new 
BoundedPriorityQueue[Rating](x._2.topK)(ord.reverse))
+  }.toMap
+  while (items.hasNext) {
+val (userId, userFeature) = items.next
+val userVector = Vectors.dense(userFeature)
+for (i - 0 until productVectors.length) {
+  val (productId, productFeatureTopK) = productVectors(i)
+  val predicted = Rating(userId, productId,
+BLAS.dot(userVector, productFeatureTopK.feature))
+  predictions(productId) ++= Iterator.single(predicted)
+}
+  }
+  predictions.iterator
+}.reduceByKey({ (queue1, queue2) =
+  queue1 ++= queue2
+  queue1
+}, blocks).map {
+  case (productId, predictions) =
+(productId, predictions.toArray)
+}
+  }
+
   private def recommend(
-  recommendToFeatures: Array[Double],
-  recommendableFeatures: RDD[(Int, Array[Double])],
-  num: Int): Array[(Int, Double)] = {
-val recommendToVector = new DoubleMatrix(recommendToFeatures)
-val scored = recommendableFeatures.map { case (id,features) =
-  (id, recommendToVector.dot(new DoubleMatrix(features)))
+recommendToFeatures: Array[Double],
--- End diff --

The previous indentation is correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To 

[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955684
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less 
than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the 
given productID and a
+   *  score in the rating field. Each represents one recommended user, 
and they are sorted
+   *  by score, decreasing. The first returned is the one predicted to be 
most strongly
+   *  recommended to the product. The score is an opaque value that 
indicates how strongly
+   *  recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = userFeatures.map { x = (x._1, num) }
+recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+val topK = productFeatures.map { x = (x._1, num) }
+recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x = x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
--- End diff --

Need to doc the input type.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955626
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -39,14 +39,15 @@ import org.apache.spark.rdd.RDD
 object MovieLensALS {
 
   case class Params(
-  input: String = null,
-  kryo: Boolean = false,
-  numIterations: Int = 20,
-  lambda: Double = 1.0,
-  rank: Int = 10,
-  numUserBlocks: Int = -1,
-  numProductBlocks: Int = -1,
-  implicitPrefs: Boolean = false) extends AbstractParams[Params]
+input: String = null,
--- End diff --

Please fix the indentation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955674
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -35,33 +41,33 @@ import org.apache.spark.rdd.RDD
  *and the features computed for this product.
  */
 class MatrixFactorizationModel private[mllib] (
-val rank: Int,
-val userFeatures: RDD[(Int, Array[Double])],
-val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
+  val rank: Int,
+  val userFeatures: RDD[(Int, Array[Double])],
+  val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
   /** Predict the rating of one user for one product. */
   def predict(user: Int, product: Int): Double = {
-val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
-val productVector = new 
DoubleMatrix(productFeatures.lookup(product).head)
-userVector.dot(productVector)
+val userVector = Vectors.dense(userFeatures.lookup(user).head)
--- End diff --

Thanks for switching to BLAS.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955675
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
 recommend(productFeatures.lookup(product).head, userFeatures, num)
   .map(t = Rating(t._1, product, t._2))
 
+  /**
+   * Recommends topK users/products.
--- End diff --

This is JavaDoc style. For normal multiline comments, please use `/*` 
instead of `/**`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2015-02-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r24955671
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -35,33 +41,33 @@ import org.apache.spark.rdd.RDD
  *and the features computed for this product.
  */
 class MatrixFactorizationModel private[mllib] (
-val rank: Int,
-val userFeatures: RDD[(Int, Array[Double])],
-val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
+  val rank: Int,
--- End diff --

Do not change indentation. The previous 4-space indentation is correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-27 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-64804819
  
@mengxr can we please review and merge this ? It will greatly help in 
reporting results from PLSA and other variants of matrix factorization 
results...should I update the API to use level 3 BLAS? I have used BLAS.dot in 
this code...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-21 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-63979463
  
More details about the API added and experiments are on the JIRA


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-63722805
  
  [Test build #23632 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23632/consoleFull)
 for   PR 3098 at commit 
[`7163a5c`](https://github.com/apache/spark/commit/7163a5c21b394d8bd89694a9f08aa1b446c71956).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-19 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r20616949
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -39,14 +39,15 @@ import org.apache.spark.rdd.RDD
 object MovieLensALS {
 
   case class Params(
-  input: String = null,
-  kryo: Boolean = false,
-  numIterations: Int = 20,
-  lambda: Double = 1.0,
-  rank: Int = 10,
-  numUserBlocks: Int = -1,
-  numProductBlocks: Int = -1,
-  implicitPrefs: Boolean = false) extends AbstractParams[Params]
+input: String = null,
--- End diff --

4-space indentation is fixed now...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-19 Thread debasish83
Github user debasish83 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3098#discussion_r20617279
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -18,14 +18,14 @@
 package org.apache.spark.examples.mllib
 
 import scala.collection.mutable
-
-import org.apache.log4j.{Level, Logger}
+import org.apache.log4j.{ Level, Logger }
--- End diff --

Fixed the codiing syle...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-63736377
  
  [Test build #23642 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23642/consoleFull)
 for   PR 3098 at commit 
[`3f97c49`](https://github.com/apache/spark/commit/3f97c499004aa58dfa1b51b8d2cbd6e5776f5fb1).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-63738984
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23632/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-63738974
  
**[Test build #23632 timed 
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23632/consoleFull)**
 for PR 3098 at commit 
[`7163a5c`](https://github.com/apache/spark/commit/7163a5c21b394d8bd89694a9f08aa1b446c71956)
 after a configured wait of `120m`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-63745215
  
  [Test build #23642 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23642/consoleFull)
 for   PR 3098 at commit 
[`3f97c49`](https://github.com/apache/spark/commit/3f97c499004aa58dfa1b51b8d2cbd6e5776f5fb1).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-63745222
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23642/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-62670490
  
  [Test build #23246 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23246/consoleFull)
 for   PR 3098 at commit 
[`d144f57`](https://github.com/apache/spark/commit/d144f57a58c9424365f1242f90961386c016641e).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-11 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-62670769
  
@mengxr added recommendAll API to MatrixFactorizationModel and right now 
the catesian based topK finding is also in the code for validation...

Example run:

./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --jars 
/Users/v606014/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar
 --total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class 
org.apache.spark.examples.mllib.MovieLensALS 
./examples/target/spark-examples_2.10-1.2.0-SNAPSHOT.jar --kryo --lambda 0.065 
--validateRecommendation 1.0 hdfs://localhost:8020/sandbox/movielens/

Got 1000209 ratings from 6040 users on 3706 movies.
Training: 800670, test: 199539.
Test RMSE = 0.8485243993052966.
Using recommendAll API
k 20 prec@k 0.04393839019542896
k 40 prec@k 0.039640609473335565
k 60 prec@k 0.03763387435133046
k 80 prec@k 0.033409738324
k 100 prec@k 0.0318681682676383
k 120 prec@k 0.0318289720658054
k 140 prec@k 0.030209861354280044
k 160 prec@k 0.028638415038092092
k 180 prec@k 0.02780078024364211
k 200 prec@k 0.027149718449817808
Test userMapAPI = 0.02964195032393224

Using Cartesian
k 20 prec@k 0.05635144087446176
k 40 prec@k 0.052252401457436246
k 60 prec@k 0.04856188583416142
k 80 prec@k 0.0453461411063266
k 100 prec@k 0.04296621397813845
k 120 prec@k 0.040878602186154356
k 140 prec@k 0.03914612217858326
k 160 prec@k 0.03766768797615105
k 180 prec@k 0.03648559125538258
k 200 prec@k 0.03540990394170256
Test userMap = 0.038507998497677914

Results with predictAll and cartesian should match but right now they are 
not same...debugging it further...

From the JIRA reference https://issues.apache.org/jira/browse/SPARK-3066, 
implemented 2 ideas:
1) collect one side (either user or product) and broadcast it as a matrix
3) use Utils.takeOrdered to find top-k

The third idea 2) use level-3 BLAS to compute inner products will be 
refactored once dense distributed matrix multiplication comes online...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-11 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-62670877
  
More performance tests undergoing for internal datasets where the cartesian 
code is really slow (due to groupByKey)...For MovieLens there is no substantial 
difference...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-62675915
  
  [Test build #23246 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23246/consoleFull)
 for   PR 3098 at commit 
[`d144f57`](https://github.com/apache/spark/commit/d144f57a58c9424365f1242f90961386c016641e).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...

2014-11-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3098#issuecomment-62675921
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23246/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org