[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-98002797 @debasish83 Do you mind me sending you an update? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-98021307 @mengxr please go ahead...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-96403986 Was very busy the last few weeks... will update it in the next few days...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-95009893 @debasish83 Do you have time to update the PR?
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-90740329 @debasish83 In RankingMetrics, the label set doesn't have an ordering. It is either hit or miss at each position.
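To make the hit-or-miss semantics concrete, here is a minimal plain-Scala sketch (no Spark; `ApSketch` and its helper are hypothetical names, not the RankingMetrics API). The label set is consulted only for membership, so no ordering among the labels is ever needed; normalizing by the label-set size is one common convention:

```scala
object ApSketch {
  // Average precision for one user: at each position i of the predicted list,
  // score a "hit" if the item is in the (unordered) label set, and accumulate
  // precision-at-that-position. Labels are a Set: membership only, no order.
  def averagePrecision(predicted: Seq[Int], labels: Set[Int]): Double =
    if (labels.isEmpty) 0.0
    else {
      var hits = 0
      var sumPrec = 0.0
      predicted.zipWithIndex.foreach { case (item, i) =>
        if (labels.contains(item)) {
          hits += 1
          sumPrec += hits.toDouble / (i + 1) // precision at position i + 1
        }
      }
      sumPrec / labels.size // normalize by the label-set size (one convention)
    }
}
```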
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89728639 @debasish83 do you mean RMSE? It is well-defined but not very useful. MAP is the useful metric. I think that only a rank-dependent metric makes sense.
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89729377 I meant MAP... What's the MAP on the Netflix dataset you have seen before, and with what lambda? I am running MAP experiments with various factorization formulations, including log-likelihood loss with normalization constraints... Also, how do you define MAP for implicit feedback (a binary dataset, where a click is 1 and no click is 0)? In the label set every rating is 1.0, so there is no ranking defined as such...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89729632 I don't remember from the last time I tried the Netflix dataset, but in general MAP is low, like 0.05-0.1, even when recommendations are good according to RMSE. MAP only requires a ranking, so there is no difference in how you define it when the input is ratings or clicks (explicit vs. implicit). The input data set does not need a ranking, just a notion of relevant or not. A recommendation is relevant / correct if the user actually interacted with it.
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89729777 Agreed with the implicit MAP calculation. For the Netflix dataset, I got 0.014... Maybe I need to use a better regularization... Was that 0.05-0.1 number from using lambda = 0.065?
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89729865 That might be a fine score. I remember MAP being quite low in general, and that 0.1 would be very good. I don't recall the number for Netflix, so don't take that number as a guide.
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27769592 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -167,23 +169,66 @@ object MovieLensALS {
       .setProductBlocks(params.numProductBlocks)
       .run(training)

-    val rmse = computeRmse(model, test, params.implicitPrefs)
-
-    println(s"Test RMSE = $rmse.")
+    params.metrics match {
+      case "rmse" =>
+        val rmse = computeRmse(model, test, params.implicitPrefs)
+        println(s"Test RMSE = $rmse")
+      case "map" =>
+        val (map, users) = computeRankingMetrics(model, training, test, numMovies.toInt)
+        println(s"Test users $users MAP $map")
+      case _ => println(s"Metrics not defined, options are rmse/map")
+    }

     sc.stop()
   }

   /** Compute RMSE (Root Mean Squared Error). */
-  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean)
-    : Double = {
-
-    def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
-
+  def computeRmse(
+      model: MatrixFactorizationModel,
+      data: RDD[Rating],
+      implicitPrefs: Boolean): Double = {
     val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
-    val predictionsAndRatings = predictions.map{ x =>
-      ((x.user, x.product), mapPredictedRating(x.rating))
+    val predictionsAndRatings = predictions.map { x =>
+      ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
     }.join(data.map(x => ((x.user, x.product), x.rating))).values
     math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
+  }
+
+  /** Compute MAP (Mean Average Precision) statistics for top N product recommendation */
+  def computeRankingMetrics(
+      model: MatrixFactorizationModel,
+      train: RDD[Rating],
+      test: RDD[Rating],
+      n: Int): (Double, Long) = {
+    val ord = Ordering.by[(Int, Double), Double](x => x._2)
+
+    val testUserLabels =
+      test.map {
--- End diff -- I will update with topByKey. Is there a better place to move this function? Maybe inside the ALS object, for example? That way I can add a test case to guard it?
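For reference, the behaviour of a per-key top-K (what `topByKey` provides on pair RDDs in MLlib) can be sketched on plain Scala collections — a stand-in under assumed semantics, not Spark's implementation:

```scala
object TopByKeySketch {
  // Plain-Scala stand-in for a per-key top-K: for each key, keep the `num`
  // largest values under the given ordering, returned in descending order.
  def topByKey[K, V](pairs: Seq[(K, V)], num: Int)(implicit ord: Ordering[V]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).sorted(ord.reverse).take(num)
    }
}
```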
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89697236 @srowen For the Netflix dataset, what's the MAP you have seen before? I started experiments on the Netflix dataset... Lambda is 0.065 for Netflix as well, right? For MovieLens, 0.065 works well...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89706247 @coderxiang @mengxr If I have a dataset with implicit feedback (click or 0), then MAP is not that well defined, right? Since in the label set everything is 1.0, there is no ordering defined. Should we add a rank-independent metric for implicit datasets?
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707377 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -138,14 +141,122 @@ class MatrixFactorizationModel(
   }

   private def recommend(
-      recommendToFeatures: Array[Double],
-      recommendableFeatures: RDD[(Int, Array[Double])],
-      num: Int): Array[(Int, Double)] = {
-    val scored = recommendableFeatures.map { case (id, features) =>
-      (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+      recommendToFeatures: Array[Double],
+      recommendableFeatures: RDD[(Int, Array[Double])],
+      num: Int): Array[(Int, Double)] = {
+    val recommendToVector = Vectors.dense(recommendToFeatures)
+    val scored = recommendableFeatures.map {
+      case (id, features) =>
+        (id, BLAS.dot(recommendToVector, Vectors.dense(features)))
     }
     scored.top(num)(Ordering.by(_._2))
   }
+
+  /**
+   * Recommends topK products for all users.
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a userId and an array of
+   *         Rating objects which contain the same userId, the recommended productId and a score
+   *         in the rating field. Semantics of the score are the same as in the recommendProducts
+   *         API.
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
+    recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommends topK users for all products.
+   *
+   * @param num how many users to return for every product.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a productId and an array
+   *         of Rating objects which contain the recommended userId, the same productId and a
+   *         score in the rating field. Semantics of the score are the same as in the
+   *         recommendUsers API.
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = productFeatures.map { x => (x._1, num) }
+    recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x => x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommends topK products for users in the userTopK RDD.
+   *
+   * @param userTopK how many products to return for every user in the userTopK RDD.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a userId and an array of
+   *         Rating objects which contain the same userId, the recommended productId and a score
+   *         in the rating field. Semantics of the score are the same as in the recommendProducts
+   *         API.
+   */
+  def recommendProductsForUsers(
+      userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+    val userFeaturesTopK = userFeatures.join(userTopK).map {
+      case (userId, (userFeature, topK)) =>
+        (userId, FeatureTopK(Vectors.dense(userFeature), topK))
+    }
+
+    // TODO: Do a mini-batch on productFeatures.collect if the dimension of rows is big
+    val productVectors = productFeatures.map {
+      x => (x._1, Vectors.dense(x._2))
+    }.collect
--- End diff -- `collect` -> `collect()` (We use `()` if this is an action.)
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707375 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
+  val ord = Ordering.by[Rating, Double](x => x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
--- End diff -- This is not necessary if we use a global `num`.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707380 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
+    // TODO: Use BLAS dgemm
+    userFeaturesTopK.map {
+      case (userId, userFeatureTopK) => {
+        val predictions = productVectors.map {
+          case (productId, productVector) =>
+            Rating(userId, productId,
+              BLAS.dot(userFeatureTopK.feature, productVector))
+        }
+        (userId, Utils.takeOrdered(predictions.iterator,
+          userFeatureTopK.topK)(ord.reverse).toArray)
--- End diff -- Use the `topByKey` method to simplify the code.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707373 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
--- End diff -- Is it common to use different `num`s? I think it should be sufficient to set `num` globally.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707386 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
+  /**
+   * Recommends topK users for all products in the productTopK RDD.
+   *
+   * @param productTopK how many users to return for every product in the productTopK RDD.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a productId and an array
+   *         of Rating objects which contain the recommended userId, the same productId and a
+   *         score in the rating field. Semantics of the score are the same as in the
+   *         recommendUsers API.
+   */
+  def recommendUsersForProducts(
+      productTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+    val blocks = userFeatures.partitions.size / 2
+
+    // TODO: Do a mini-batch on productFeatures.collect if the dimension of rows is big
+    val productVectors = productFeatures.join(productTopK).map {
+      case (productId, (productFeature, topK)) =>
+        (productId, FeatureTopK(Vectors.dense(productFeature), topK))
+    }.collect()
+
+    // TODO: Use BLAS dgemm
+    userFeatures.mapPartitions { items =>
+      val predictions =
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707370 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
+    val recommendToVector = Vectors.dense(recommendToFeatures)
+    val scored = recommendableFeatures.map {
+      case (id, features) =>
+        (id, BLAS.dot(recommendToVector, Vectors.dense(features)))
--- End diff -- And same here. Let's keep this code untouched; it is irrelevant to this PR.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707366 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -70,26 +71,28 @@ class MatrixFactorizationModel(
   /** Predict the rating of one user for one product. */
   def predict(user: Int, product: Int): Double = {
-    val userVector = userFeatures.lookup(user).head
-    val productVector = productFeatures.lookup(product).head
-    blas.ddot(userVector.length, userVector, 1, productVector, 1)
+    val userVector = Vectors.dense(userFeatures.lookup(user).head)
+    val productVector = Vectors.dense(productFeatures.lookup(product).head)
+    BLAS.dot(userVector, productVector)
--- End diff -- This change is not necessary. It adds more wrappers (and temp objects) and doesn't save much code.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707360

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -167,23 +169,66 @@ object MovieLensALS {
         .setProductBlocks(params.numProductBlocks)
         .run(training)

-    val rmse = computeRmse(model, test, params.implicitPrefs)
-
-    println(s"Test RMSE = $rmse.")
+    params.metrics match {
+      case "rmse" =>
+        val rmse = computeRmse(model, test, params.implicitPrefs)
+        println(s"Test RMSE = $rmse")
+      case "map" =>
+        val (map, users) = computeRankingMetrics(model, training, test, numMovies.toInt)
+        println(s"Test users $users MAP $map")
+      case _ => println(s"Metrics not defined, options are rmse/map")
+    }
     sc.stop()
   }

   /** Compute RMSE (Root Mean Squared Error). */
-  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean)
-    : Double = {
-
-    def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
-
+  def computeRmse(
+      model: MatrixFactorizationModel,
+      data: RDD[Rating],
+      implicitPrefs: Boolean): Double = {
     val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
-    val predictionsAndRatings = predictions.map{ x =>
-      ((x.user, x.product), mapPredictedRating(x.rating))
+    val predictionsAndRatings = predictions.map { x =>
+      ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
     }.join(data.map(x => ((x.user, x.product), x.rating))).values
     math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
+  }
+
+  /** Compute MAP (Mean Average Precision) statistics for top N product recommendation. */
+  def computeRankingMetrics(
+      model: MatrixFactorizationModel,
+      train: RDD[Rating],
+      test: RDD[Rating],
+      n: Int): (Double, Long) = {
+    val ord = Ordering.by[(Int, Double), Double](x => x._2)
+
+    val testUserLabels = test.map {
+      x => (x.user, (x.product, x.rating))
+    }.groupByKey().map {
+      case (userId, products) =>
+        val sortedProducts = products.toArray.sorted(ord.reverse)
+        (userId, sortedProducts.map(_._1))
+    }
+
+    val trainUserLabels = train.map {
+      x => (x.user, x.product)
+    }.groupByKey().map { case (userId, products) => (userId, products.toArray) }

--- End diff --

`.groupByKey().mapValues(_.toArray)`

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
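A minimal local sketch of the reviewer's suggestion, using plain Scala collections rather than Spark (no RDDs here): mapping only the values is equivalent to pattern-matching on the whole pair and rebuilding it, and on a Spark pair RDD `mapValues` has the additional benefit of preserving the partitioner.

```scala
object MapValuesSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical grouped data: userId -> product ids.
    val grouped: Map[Int, Seq[Int]] = Map(1 -> Seq(10, 20), 2 -> Seq(30))

    // Verbose form from the diff: rebuild each (key, value) pair by hand.
    val viaMap = grouped.map { case (userId, products) => (userId, products.toArray) }

    // Suggested form: touch only the values.
    val viaMapValues = grouped.mapValues(_.toArray).toMap

    assert(viaMap.keySet == viaMapValues.keySet)
    assert(viaMap(1).sameElements(viaMapValues(1)))
    assert(viaMap(2).sameElements(viaMapValues(2)))
  }
}
```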
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707363

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+    val rankings = model.recommendProductsForUsers(n).join(trainUserLabels).map {
+      case (userId, (pred, train)) =>
+        val predictedProducts = pred.map(_.product)
+        val trainSet = train.toSet
+        (userId, predictedProducts.filterNot { x => trainSet.contains(x) })
+    }.join(testUserLabels).map {

--- End diff --

safer to use `rightOuterJoin`
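A local sketch (hypothetical data, plain collections emulating join semantics) of why `rightOuterJoin` is the safer choice here: an inner join silently drops test users for whom the model produced no recommendations, which inflates MAP, while a right outer join keeps every test user so they count as zero-precision entries.

```scala
object OuterJoinSketch {
  def main(args: Array[String]): Unit = {
    val recommendations = Map(1 -> Array(101, 102))            // user 2 got no recommendations
    val testLabels      = Map(1 -> Array(101), 2 -> Array(103))

    // Inner-join semantics: only users present on both sides survive.
    val innerUsers = testLabels.keySet.filter(recommendations.contains)

    // rightOuterJoin semantics: every test user survives; the missing left side is None.
    val rightOuter = testLabels.map { case (u, labels) =>
      (u, (recommendations.get(u), labels))
    }

    assert(innerUsers == Set(1))                // user 2 vanished from the metric
    assert(rightOuter.keySet == Set(1, 2))      // user 2 still present
    assert(rightOuter(2)._1.isEmpty)            // contributes an empty prediction list
  }
}
```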
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707367

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---

   def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] = {
-    val users = userFeatures.join(usersProducts).map{
+    val users = userFeatures.join(usersProducts).map {
       case (user, (uFeatures, product)) => (product, (user, uFeatures))
     }
     users.join(productFeatures).map { case (product, ((user, uFeatures), pFeatures)) =>
-      Rating(user, product, blas.ddot(uFeatures.length, uFeatures, 1, pFeatures, 1))
+      val userVector = Vectors.dense(uFeatures)
+      val productVector = Vectors.dense(pFeatures)
+      Rating(user, product, BLAS.dot(userVector, productVector))

--- End diff --

Same here,
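For context, the quantity being computed in both variants above is a plain inner product of latent-factor vectors; a sketch with hypothetical factor arrays (plain Scala, no BLAS) shows what the predicted rating for a (user, product) pair amounts to:

```scala
object DotSketch {
  // Inner product of two equal-length dense vectors, as blas.ddot / BLAS.dot compute it.
  def dot(u: Array[Double], p: Array[Double]): Double =
    u.zip(p).map { case (a, b) => a * b }.sum

  def main(args: Array[String]): Unit = {
    val uFeatures = Array(0.5, 1.0, -0.25)  // hypothetical user factors
    val pFeatures = Array(2.0, 0.5, 4.0)    // hypothetical product factors
    // 0.5*2.0 + 1.0*0.5 + (-0.25)*4.0 = 0.5
    assert(math.abs(dot(uFeatures, pFeatures) - 0.5) < 1e-12)
  }
}
```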
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707364

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+    }.join(testUserLabels).map {
+      case (user, (pred, lab)) => (pred, lab)
+    }
+
+    val metrics = new RankingMetrics(rankings)
+    (metrics.meanAveragePrecision, testUserLabels.count)

--- End diff --

If we use `testUserLabels` twice, we should cache it.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707362

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+    val rankings = model.recommendProductsForUsers(n).join(trainUserLabels).map {
+      case (userId, (pred, train)) =>
+        val predictedProducts = pred.map(_.product)
+        val trainSet = train.toSet
+        (userId, predictedProducts.filterNot { x => trainSet.contains(x) })

--- End diff --

`pred.map(_.product).diff(train)`
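The reviewer's one-liner relies on Scala's collection `diff`, which removes the elements of its argument (as a multiset) from the receiver; with distinct product ids it is equivalent to the `toSet`/`filterNot` dance in the diff. A small sketch with hypothetical ids:

```scala
object DiffSketch {
  def main(args: Array[String]): Unit = {
    val predicted = Array(5, 3, 8, 1)   // recommended product ids
    val train = Array(3, 1)             // ids already seen in training

    val viaFilterNot = { val s = train.toSet; predicted.filterNot(s.contains) }
    val viaDiff = predicted.diff(train)

    assert(viaDiff.sameElements(viaFilterNot))
    assert(viaDiff.sameElements(Array(5, 8)))
  }
}
```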
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27707356

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+    val testUserLabels = test.map {
+      x => (x.user, (x.product, x.rating))
+    }.groupByKey().map {
+      case (userId, products) =>
+        val sortedProducts = products.toArray.sorted(ord.reverse)
+        (userId, sortedProducts.map(_._1))
+    }

--- End diff --

Please use the topByKey implementation to compute top items for users: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala
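A sketch of what a `topByKey`-style helper gives you over groupByKey-then-sort: per key, only the k highest-rated items are kept. This local version (plain collections, full sort for clarity) mirrors the semantics; the MLlib implementation avoids materializing the whole group by using a bounded priority queue per key.

```scala
object TopByKeySketch {
  // For each user, keep the ids of the k highest-rated products, best first.
  def topByKey(pairs: Seq[(Int, (Int, Double))], k: Int): Map[Int, Array[Int]] =
    pairs.groupBy(_._1).map { case (user, xs) =>
      (user, xs.map(_._2).sortBy(-_._2).take(k).map(_._1).toArray)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical (userId, (productId, rating)) triples.
    val ratings = Seq((1, (101, 4.0)), (1, (102, 5.0)), (1, (103, 3.0)), (2, (101, 2.0)))
    val top2 = topByKey(ratings, 2)
    assert(top2(1).sameElements(Array(102, 101)))
    assert(top2(2).sameElements(Array(101)))
  }
}
```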
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89083630 @debasish83 The code style still needs fixes. On the high level, I think having a variable `num` is not necessary. Let's do a global `num` in this PR and see whether users ask for a variable `num`. @coderxiang added `topByKey` recently, so it would be nice if we can reuse that implementation in a couple of places.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27712646

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---

   private def recommend(
-      recommendToFeatures: Array[Double],
-      recommendableFeatures: RDD[(Int, Array[Double])],
-      num: Int): Array[(Int, Double)] = {
-    val scored = recommendableFeatures.map { case (id, features) =>
-      (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+      recommendToFeatures: Array[Double],
+      recommendableFeatures: RDD[(Int, Array[Double])],
+      num: Int): Array[(Int, Double)] = {
+    val recommendToVector = Vectors.dense(recommendToFeatures)
+    val scored = recommendableFeatures.map {
+      case (id, features) =>
+        (id, BLAS.dot(recommendToVector, Vectors.dense(features)))
     }
     scored.top(num)(Ordering.by(_._2))
   }
+
+  /**
+   * Recommends topK products for all users
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a userId and an array of
+   *         Rating objects which contain the same userId, the recommended productId and a score
+   *         in the rating field. Semantics of the score are the same as in the recommendProducts
+   *         API.
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }

--- End diff --

For cross validation we use a variable num internally, but for the final recommendation a global num is fine... I thought having a topK RDD satisfies both use-cases...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27624736

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---

+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
+    recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = productFeatures.map { x => (x._1, num) }
+    recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x => x.rating)

--- End diff --

I mean `ord`.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27624801

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

   case class Params(
-      input: String = null,
-      kryo: Boolean = false,
-      numIterations: Int = 20,
-      lambda: Double = 1.0,
-      rank: Int = 10,
-      numUserBlocks: Int = -1,
-      numProductBlocks: Int = -1,
-      implicitPrefs: Boolean = false) extends AbstractParams[Params]
+    input: String = null,

--- End diff --

I think it went back to 2-space indentation. The old code has the correct indentation. Please do not change it.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88312094 [Test build #29518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29518/consoleFull) for PR 3098 at commit [`98fa424`](https://github.com/apache/spark/commit/98fa4243dc6041290bdde51e1e899a8be7576470). * This patch **fails Scala style tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88312097 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29518/ Test FAILed.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88311999 [Test build #29518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29518/consoleFull) for PR 3098 at commit [`98fa424`](https://github.com/apache/spark/commit/98fa4243dc6041290bdde51e1e899a8be7576470). * This patch **does not merge cleanly**.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88346990 I reran the MAP computation on MovieLens with varying ranks.

Example run:

./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics map ~/datasets/ml-1m/ratings.dat

rank = default: Got 1000209 ratings from 6040 users on 3706 movies. Training: 800187, test: 200022. Test users 6035, MAP 0.03499984595868497

rank = 25: Got 1000209 ratings from 6040 users on 3706 movies. Training: 799385, test: 200824. Test users 6034, MAP 0.042580954047373255

rank = 50: Got 1000209 ratings from 6040 users on 3706 movies. Training: 800289, test: 199920. Test users 6036, MAP 0.048958415806933275

rank = 100: Got 1000209 ratings from 6040 users on 3706 movies. Training: 801148, test: 199061. Test users 6038, MAP 0.05503487765882986

The numbers are consistent with my earlier runs.
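To make the reported numbers concrete, here is average precision on toy data (hypothetical ids; this mirrors the definition behind `RankingMetrics.meanAveragePrecision`, which averages per-user AP, where AP accumulates precision-at-i at each position holding a relevant item and divides by the number of relevant items):

```scala
object MapSketch {
  // Average precision of one ranked prediction list against a relevant-item set.
  def averagePrecision(pred: Array[Int], labels: Set[Int]): Double = {
    var hits = 0
    var score = 0.0
    pred.zipWithIndex.foreach { case (p, i) =>
      if (labels.contains(p)) { hits += 1; score += hits.toDouble / (i + 1) }
    }
    if (labels.isEmpty) 0.0 else score / labels.size
  }

  def main(args: Array[String]): Unit = {
    // Relevant items found at ranks 1 and 3, out of 2 relevant items:
    // AP = (1/1 + 2/3) / 2 = 5/6
    val ap = averagePrecision(Array(7, 4, 9), Set(7, 9))
    assert(math.abs(ap - 5.0 / 6.0) < 1e-12)
  }
}
```

MAP is then the mean of this quantity over all test users, which is why dropping users without recommendations (the `rightOuterJoin` point above) changes the metric.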
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88346961 [Test build #29521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29521/consoleFull) for PR 3098 at commit [`3a0c4eb`](https://github.com/apache/spark/commit/3a0c4eb7f81ee0845f4945d395f6652c965f941b).
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88311843 [Test build #29517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29517/consoleFull) for PR 3098 at commit [`ee99571`](https://github.com/apache/spark/commit/ee9957144bc2d145c91fc4a4b894ccd2ee6bc2b9). * This patch **fails Scala style tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88311846 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29517/ Test FAILed.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88311599 [Test build #29517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29517/consoleFull) for PR 3098 at commit [`ee99571`](https://github.com/apache/spark/commit/ee9957144bc2d145c91fc4a4b894ccd2ee6bc2b9). * This patch **does not merge cleanly**.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88347022 @mengxr could you please do another pass? I might have missed the JavaRDD compatibility issue but fixed the rest of your comments...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88361341 [Test build #29521 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29521/consoleFull) for PR 3098 at commit [`3a0c4eb`](https://github.com/apache/spark/commit/3a0c4eb7f81ee0845f4945d395f6652c965f941b). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class FeatureTopK(feature: Vector, topK: Int)` * `class UDFRegistration(object):` * This patch does not change any dependencies.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88361354 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29521/ Test PASSed.
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27533769

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
     recommend(productFeatures.lookup(product).head, userFeatures, num)
       .map(t => Rating(t._1, product, t._2))

+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the given productID and a
+   *         score in the rating field. Each represents one recommended user, and they are sorted
+   *         by score, decreasing. The first returned is the one predicted to be most strongly
+   *         recommended to the product. The score is an opaque value that indicates how strongly
+   *         recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
+    recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = productFeatures.map { x => (x._1, num) }
+    recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x => x.rating)
--- End diff --

I am a bit confused... recommendProducts is also a public member, but that's not in the companion object... recommendProductsForUsers is a very similar API, right?
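The `Ordering` flagged in the diff above drives the top-K selection inside the batch recommendation APIs. A minimal plain-Scala sketch of that selection (no Spark; the `topK` helper and the local `Rating` case class are illustrative stand-ins, not part of the PR):

```scala
// Mirrors org.apache.spark.mllib.recommendation.Rating for a standalone example.
case class Rating(user: Int, product: Int, rating: Double)

// Same ordering as `val ord = Ordering.by[Rating, Double](x => x.rating)` in the diff.
val ord = Ordering.by[Rating, Double](_.rating)

// Hypothetical helper: keep the `num` highest-scored ratings, best first.
def topK(predictions: Seq[Rating], num: Int): Array[Rating] =
  predictions.sorted(ord.reverse).take(num).toArray
```

In the PR itself the per-user candidate list is built from the factor model and the same reversed ordering picks the strongest recommendations.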
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27525485

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -74,6 +75,9 @@ object MovieLensALS {
       opt[Unit]("implicitPrefs")
         .text("use implicit preference")
         .action((_, c) => c.copy(implicitPrefs = true))
+      opt[Unit]("validateRecommendation")
--- End diff --

Cleaned up: renamed --validateRecommendation to --metrics
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528071

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {
     println(s"Test RMSE = $rmse.")

+    if (params.validateRecommendation) {
+      val (map, users) = computeRankingMetrics(model,
+        training, test, numMovies.toInt)
+      println(s"Test users $users MAP $map")
+    }
+
     sc.stop()
   }

   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean) = {
-
-    def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
-
     val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
-    val predictionsAndRatings = predictions.map{ x =>
-      ((x.user, x.product), mapPredictedRating(x.rating))
+    val predictionsAndRatings = predictions.map { x =>
+      ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
     }.join(data.map(x => ((x.user, x.product), x.rating))).values
     math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+    else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
--- End diff --

Followed the indentation from the current code.
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528120

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+      train: RDD[Rating], test: RDD[Rating], n: Int) = {
--- End diff --

Added.
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528991

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+      train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+    val ord = Ordering.by[(Int, Double), Double](x => x._2)
+
+    val testUserLabels = test.map {
+      x => (x.user, (x.product, x.rating))
+    }.groupByKey.map {
+      case (userId, products) =>
+        val sortedProducts = products.toArray.sorted(ord.reverse)
+        (userId, sortedProducts.map { _._1 })
+    }
+
+    val trainUserLabels = train.map {
+      x => (x.user, x.product)
+    }.groupByKey.map {
+      case (userId, products) => (userId, products.toArray)
+    }
+
+    val rankings = model.recommendProductsForUsers(n).join(trainUserLabels).map {
+      case (userId, (pred, train)) => {
+        val predictedProducts = pred.map { _.product }
--- End diff --

Done.
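For reference, the MAP figure that `computeRankingMetrics` reports is a mean over per-user average precisions. A hedged plain-Scala sketch of one user's average precision, following the convention used by mllib's `RankingMetrics` of dividing by the label-set size (function and parameter names here are illustrative):

```scala
// Average precision of a predicted product ranking against the set of
// relevant (held-out) products for one user.
def averagePrecision(predicted: Array[Int], relevant: Set[Int]): Double =
  if (relevant.isEmpty) 0.0
  else {
    var hits = 0
    var precisionSum = 0.0
    predicted.zipWithIndex.foreach { case (p, rank) =>
      if (relevant.contains(p)) {
        hits += 1
        precisionSum += hits.toDouble / (rank + 1) // precision at this cut-off
      }
    }
    precisionSum / relevant.size
  }
```

Averaging this quantity over all test users gives the `MAP` value printed by the example.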
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528959

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+    val trainUserLabels = train.map {
+      x => (x.user, x.product)
+    }.groupByKey.map {
+      case (userId, products) => (userId, products.toArray)
--- End diff --

Merged.
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27525568

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+    else r
+  }
--- End diff --

Fixed... it can fit in one line.
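The helper under discussion is small enough to show in full: with implicit preferences the prediction is clamped into [0, 1] before it is compared against the 0/1 confidence targets, while explicit ratings pass through unchanged. Reproduced from the diff as a standalone function:

```scala
// From the diff above: clamp implicit-feedback predictions into [0, 1],
// leave explicit ratings untouched.
def mapPredictedRating(r: Double, implicitPrefs: Boolean): Double =
  if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
```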
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528198

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+    val testUserLabels = test.map {
+      x => (x.user, (x.product, x.rating))
+    }.groupByKey.map {
--- End diff --

Fixed.
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528238

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+      case (userId, products) =>
+        val sortedProducts = products.toArray.sorted(ord.reverse)
+        (userId, sortedProducts.map { _._1 })
--- End diff --

Fixed.
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529347

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
   /** Predict the rating of one user for one product. */
   def predict(user: Int, product: Int): Double = {
-    val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
-    val productVector = new DoubleMatrix(productFeatures.lookup(product).head)
-    userVector.dot(productVector)
+    val userVector = Vectors.dense(userFeatures.lookup(user).head)
--- End diff --

I cleaned netlib.ddot to BLAS.dot... they will be the same for these cases.
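Either way, the jblas `DoubleMatrix.dot` and `BLAS.dot` calls compute the same dense dot product over the rank-sized factor arrays. A plain-Scala stand-in for what the swap computes (mllib's `BLAS.dot` is `private[mllib]`, so this helper is illustrative only):

```scala
// Dense dot product over user and product feature arrays of equal rank.
def dot(u: Array[Double], p: Array[Double]): Double = {
  require(u.length == p.length, "feature vectors must share the rank")
  var sum = 0.0
  var i = 0
  while (i < u.length) {
    sum += u(i) * p(i)
    i += 1
  }
  sum
}
```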
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529308

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
 class MatrixFactorizationModel private[mllib] (
-    val rank: Int,
+  val rank: Int,
--- End diff --

After the merge this is fixed.
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529218

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
-import java.lang.{Integer => JavaInteger}
-
-import org.jblas.DoubleMatrix
+import java.lang.{ Integer => JavaInteger }

 import org.apache.spark.SparkContext._
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
 import org.apache.spark.rdd.RDD
+import org.apache.spark.util.collection.Utils
+import org.apache.spark.util.BoundedPriorityQueue
+
+import scala.Ordering
--- End diff --

By organizing imports, you mean imports from the same package move onto one line, right?

Old:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.BLAS

New:
import org.apache.spark.mllib.linalg.{Vectors, Vector, BLAS}
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529231

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
--- End diff --

Cleaned.
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529681

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
+  val ord = Ordering.by[Rating, Double](x => x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
+   */
+  def recommendProductsForUsers(
+      userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+    val userFeaturesTopK = userFeatures.join(userTopK).map {
+      case (userId, (userFeature, topK)) =>
+        (userId, FeatureTopK(Vectors.dense(userFeature), topK))
+    }
+    val productVectors = productFeatures.map {
+      x => (x._1, Vectors.dense(x._2))
+    }.collect
+
+    userFeaturesTopK.map {
+      case (userId, userFeatureTopK) => {
+        val predictions = productVectors.map {
+          case (productId, productVector) =>
+            Rating(userId, productId,
+              BLAS.dot(userFeatureTopK.feature, productVector))
--- End diff --

I will bring in a lot of level 3 BLAS in the next PR... I am writing the dgemv and dgemm versions for several of these APIs... For now I will add a TODO.
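The planned dgemv-shaped replacement would score one user's feature vector against a whole block of product features in one pass instead of a per-product `BLAS.dot` loop. A hedged plain-Scala sketch of that shape (names are illustrative and not from the PR; a real version would delegate the inner loops to netlib-java):

```scala
// Score one user feature vector against every row of a product-feature block;
// equivalent to y = A * x with A the (numProducts x rank) product matrix.
def scoreBlock(userFeature: Array[Double],
               productBlock: Array[Array[Double]]): Array[Double] =
  productBlock.map { productFeature =>
    var sum = 0.0
    var i = 0
    while (i < productFeature.length) {
      sum += userFeature(i) * productFeature(i)
      i += 1
    }
    sum
  }
```

Batching several users at once would turn the same computation into a dgemm call over two dense blocks.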
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27535273

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
+  /**
+   * Recommend topK products for users in userTopK RDD
+   */
--- End diff --

Documented the public batch prediction APIs.
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88291470 @mengxr I also added 2 test cases for the batch predict APIs. These features are useful if users are interested in computing MAP measures... Let me know if I should move the functions computeRankingMetrics and computeRMSE to the companion object of ml.recommendation.ALS? Currently both of them are in examples...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88292172 If we move computeRankingMetrics and computeRMSE to a better place, I can guard them with tests...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r26545110

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+import org.apache.spark.mllib.evaluation.RankingMetrics
+import org.jblas.DoubleMatrix
--- End diff --

Removed.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955630

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -74,6 +75,9 @@ object MovieLensALS {
       opt[Unit]("implicitPrefs")
         .text("use implicit preference")
         .action((_, c) => c.copy(implicitPrefs = true))
+      opt[Unit]("validateRecommendation")
--- End diff --

We can switch the evaluation metric between RMSE and MAP via a `--metrics` option. This is more specific than `validateRecommendation`.
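One way to realize the suggestion is to parse the metric name into a small ADT and dispatch on it. A sketch under stated assumptions: the `"rmse"`/`"map"` values and the `Metric` type are invented for illustration, and the real example would carry the string in through a scopt `opt[String]("metrics")` rather than this standalone parser:

```scala
// Hypothetical metric selector for a --metrics command-line option.
sealed trait Metric
case object Rmse extends Metric
case object MeanAP extends Metric

// Parse the value a hypothetical --metrics flag would carry.
def parseMetric(s: String): Metric = s.toLowerCase match {
  case "rmse" => Rmse
  case "map"  => MeanAP
  case other  => throw new IllegalArgumentException(s"unknown metric: $other")
}
```

The example would then call computeRmse or computeRankingMetrics depending on the parsed value.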
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955624

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+import org.apache.spark.mllib.evaluation.RankingMetrics
+import org.jblas.DoubleMatrix

--- End diff --

Organize imports (https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports). Do not use jblas, because we plan to remove it in 1.4. Use breeze/netlib-java instead.
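The suggested replacement can be sketched with breeze, which Spark already depends on. A minimal sketch, assuming only that the jblas usage being replaced is a dot product of two factor vectors:

```scala
// Sketch: replacing an org.jblas.DoubleMatrix dot product with breeze.
// jblas version: new DoubleMatrix(userFeature).dot(new DoubleMatrix(productFeature))
import breeze.linalg.DenseVector

val userFeature = Array(0.1, 0.2, 0.3)
val productFeature = Array(1.0, 2.0, 3.0)

// breeze version: wrap the raw arrays and take the dot product
val score = DenseVector(userFeature) dot DenseVector(productFeature)
println(score)  // approximately 1.4
```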
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955687

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---

@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (

     recommend(productFeatures.lookup(product).head, userFeatures, num)
       .map(t => Rating(t._1, product, t._2))

+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the given productID and a
+   *         score in the rating field. Each represents one recommended user, and they are sorted
+   *         by score, decreasing. The first returned is the one predicted to be most strongly
+   *         recommended to the product. The score is an opaque value that indicates how strongly
+   *         recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
+    recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = productFeatures.map { x => (x._1, num) }
+    recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x => x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
+   */
+  def recommendProductsForUsers(
+      userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+    val userFeaturesTopK = userFeatures.join(userTopK).map {
+      case (userId, (userFeature, topK)) =>
+        (userId, FeatureTopK(Vectors.dense(userFeature), topK))
+    }
+    val productVectors = productFeatures.map {
+      x => (x._1, Vectors.dense(x._2))
+    }.collect
+
+    userFeaturesTopK.map {
+      case (userId, userFeatureTopK) => {
+        val predictions = productVectors.map {
+          case (productId, productVector) =>
+            Rating(userId, productId,
+              BLAS.dot(userFeatureTopK.feature, productVector))

--- End diff --

Leave a TODO here for Level 3 BLAS.
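The Level 3 BLAS suggestion means scoring a whole block of users against a block of products with one matrix-matrix multiply (gemm) instead of per-pair dot products. A minimal breeze sketch of the idea; the shapes and factor values are illustrative, not taken from the PR:

```scala
// Sketch of the Level 3 BLAS idea: one gemm call scores every (user, product)
// pair in a block. breeze matrices are column-major.
import breeze.linalg.DenseMatrix

val rank = 2
// 3 users x rank: rows are user factor vectors u0=(1,0), u1=(0,1), u2=(1,1)
val userBlock = new DenseMatrix(3, rank, Array(1.0, 0.0, 1.0, 0.0, 1.0, 1.0))
// rank x 2 products: columns are product factor vectors p0=(1,2), p1=(3,4)
val productBlock = new DenseMatrix(rank, 2, Array(1.0, 2.0, 3.0, 4.0))

// scores(i, j) = dot(user i, product j), computed by a single multiply
val scores = userBlock * productBlock
println(scores(2, 1))  // u2 . p1 = 1*3 + 1*4 = 7.0
```

Top-k selection per user would then run over each row of `scores`, which is where the bounded-priority-queue logic elsewhere in this diff comes in.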
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955679

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---

+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {

--- End diff --

Please check whether the return type is Java-friendly. You can generate the API doc and check the Java one.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955664

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---

@@ -17,14 +17,20 @@ package org.apache.spark.mllib.recommendation

-import java.lang.{Integer => JavaInteger}
-
-import org.jblas.DoubleMatrix
+import java.lang.{ Integer => JavaInteger }

 import org.apache.spark.SparkContext._
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
 import org.apache.spark.rdd.RDD
+import org.apache.spark.util.collection.Utils
+import org.apache.spark.util.BoundedPriorityQueue
+
+import scala.Ordering

--- End diff --

Organize imports.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955654

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

@@ -171,18 +175,62 @@ object MovieLensALS {

     println(s"Test RMSE = $rmse.")

+    if (params.validateRecommendation) {
+      val (map, users) = computeRankingMetrics(model,
+        training, test, numMovies.toInt)
+      println(s"Test users $users MAP $map")
+    }
+
     sc.stop()
   }

   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean) = {
-
-    def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
-
     val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
-    val predictionsAndRatings = predictions.map{ x =>
-      ((x.user, x.product), mapPredictedRating(x.rating))
+    val predictionsAndRatings = predictions.map { x =>
+      ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
     }.join(data.map(x => ((x.user, x.product), x.rating))).values
     math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+    else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+    train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+    val ord = Ordering.by[(Int, Double), Double](x => x._2)
+
+    val testUserLabels = test.map {
+      x => (x.user, (x.product, x.rating))
+    }.groupByKey.map {

--- End diff --

`.groupByKey()`
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955648

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+    else r
+  }

--- End diff --

Fix style. Use `if .. else ..` without `{ .. }` only if it can fit into a single line. (https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-CurlyBraces)
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955659

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+    val trainUserLabels = train.map {
+      x => (x.user, x.product)
+    }.groupByKey.map {
+      case (userId, products) => (userId, products.toArray)
+    }

--- End diff --

Merge `case .. =>` to the line above.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955670

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---

-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }

--- End diff --

Do not add space.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955680

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---

+  val ord = Ordering.by[Rating, Double](x => x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)

--- End diff --

These become public members. Please move them to the companion object and mark them package private.
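A minimal sketch of the suggested restructuring. In Spark the members would be marked `private[recommendation]`; plain visibility is used here only so the sketch stands alone, and `Rating` is stubbed rather than imported:

```scala
// Sketch: helpers moved off the model class into its companion object so they
// do not become public members of MatrixFactorizationModel.
case class Rating(user: Int, product: Int, rating: Double)  // stub for illustration

object MatrixFactorizationModel {
  // real code would scope these as `private[recommendation]`
  val ord: Ordering[Rating] = Ordering.by[Rating, Double](_.rating)
  case class FeatureTopK(feature: Array[Double], topK: Int)
}

val rs = Seq(Rating(1, 10, 0.5), Rating(1, 11, 0.9))
val top = rs.max(MatrixFactorizationModel.ord)
println(top.product)  // 11, the highest-rated product
```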
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955662

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+    val rankings = model.recommendProductsForUsers(n).join(trainUserLabels).map {
+      case (userId, (pred, train)) => {
+        val predictedProducts = pred.map { _.product }

--- End diff --

`pred.map(_.product)`
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955657

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+      case (userId, products) =>
+        val sortedProducts = products.toArray.sorted(ord.reverse)
+        (userId, sortedProducts.map { _._1 })

--- End diff --

`.map(_._1)`
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955652

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+  def computeRankingMetrics(model: MatrixFactorizationModel,
+    train: RDD[Rating], test: RDD[Rating], n: Int) = {

--- End diff --

Add a return type.
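For reference, the MAP statistic that `computeRankingMetrics` reports can be sketched in plain Scala for a single user; MAP is the mean of this average precision over all evaluated users. This is a standalone illustration, normalized by the relevant-set size, not the PR's Spark code:

```scala
// Sketch: average precision for one user's ranked recommendation list.
// `predicted` is the ranked list; `relevant` is the user's held-out test items.
def averagePrecision(predicted: Seq[Int], relevant: Set[Int]): Double = {
  if (relevant.isEmpty) return 0.0
  var hits = 0
  var precisionSum = 0.0
  predicted.zipWithIndex.foreach { case (p, i) =>
    if (relevant.contains(p)) {
      hits += 1
      precisionSum += hits.toDouble / (i + 1)  // precision at rank i+1
    }
  }
  precisionSum / relevant.size
}

// hits at ranks 1 and 3: (1/1 + 2/3) / 2 ≈ 0.833
println(averagePrecision(Seq(1, 2, 3, 4), Set(1, 3)))
```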
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955678

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---

+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less than this.

--- End diff --

This block doesn't associate with any method. Please move these docs to the corresponding methods' JavaDoc.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955650

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---

+  def computeRankingMetrics(model: MatrixFactorizationModel,
+    train: RDD[Rating], test: RDD[Rating], n: Int) = {

--- End diff --

Fix indentation: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955690

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---

+  /**
+   * Recommend topK users for products in productTopK RDD
+   */
+  def recommendUsersForProducts(
+      productTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+    val blocks = userFeatures.partitions.size / 2
+
+    val productVectors = productFeatures.join(productTopK).map {
+      case (productId, (productFeature, topK)) =>
+        (productId, FeatureTopK(Vectors.dense(productFeature), topK))
+    }.collect()
+
+    userFeatures.mapPartitions { items =>
+      val predictions = productVectors.map {
+        x => (x._1, new BoundedPriorityQueue[Rating](x._2.topK)(ord.reverse))
+      }.toMap
+      while (items.hasNext) {
+        val (userId, userFeature) = items.next
+        val userVector = Vectors.dense(userFeature)
+        for (i <- 0 until productVectors.length) {
+          val (productId, productFeatureTopK) = productVectors(i)
+          val predicted = Rating(userId, productId,
+            BLAS.dot(userVector, productFeatureTopK.feature))
+          predictions(productId) ++= Iterator.single(predicted)
+        }
+      }
+      predictions.iterator
+    }.reduceByKey({ (queue1, queue2) =>
+      queue1 ++= queue2
+      queue1
+    }, blocks).map {
+      case (productId, predictions) =>
+        (productId, predictions.toArray)
+    }
+  }
+
   private def recommend(
-      recommendToFeatures: Array[Double],
-      recommendableFeatures: RDD[(Int, Array[Double])],
-      num: Int): Array[(Int, Double)] = {
-    val recommendToVector = new DoubleMatrix(recommendToFeatures)
-    val scored = recommendableFeatures.map { case (id, features) =>
-      (id, recommendToVector.dot(new DoubleMatrix(features)))
+    recommendToFeatures: Array[Double],

--- End diff --

The previous indentation is correct.
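The bounded-priority-queue pattern in this hunk can be sketched in plain Scala. `BoundedPriorityQueue` is a Spark-internal utility; a stdlib `mutable.PriorityQueue` with manual bounding stands in for it here:

```scala
// Sketch: keep only the k highest-scoring (id, score) pairs while streaming
// through candidates, mimicking Spark's internal BoundedPriorityQueue.
import scala.collection.mutable

def topK(scores: Iterator[(Int, Double)], k: Int): Array[(Int, Double)] = {
  // min-heap on score: the head is always the weakest of the current top k
  val heap = mutable.PriorityQueue.empty[(Int, Double)](
    Ordering.by[(Int, Double), Double](_._2).reverse)
  scores.foreach { s =>
    if (heap.size < k) heap.enqueue(s)
    else if (s._2 > heap.head._2) { heap.dequeue(); heap.enqueue(s) }
  }
  heap.dequeueAll.reverse.toArray  // highest score first
}

val best = topK(Iterator((1, 0.3), (2, 0.9), (3, 0.5), (4, 0.1)), 2)
println(best.mkString(", "))  // (2,0.9), (3,0.5)
```

This keeps memory bounded at k entries per queue, which is why the hunk merges per-partition queues with `reduceByKey` rather than collecting all predictions.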
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955684 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] ( recommend(productFeatures.lookup(product).head, userFeatures, num) .map(t = Rating(t._1, product, t._2)) + /** + * Recommends topK users/products. + * + * @param num how many users to return. The number returned may be less than this. + * @return [Array[Rating]] objects, each of which contains a userID, the given productID and a + * score in the rating field. Each represents one recommended user, and they are sorted + * by score, decreasing. The first returned is the one predicted to be most strongly + * recommended to the product. The score is an opaque value that indicates how strongly + * recommended the user is. + */ + + /** + * Recommend topK products for all users + */ + def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = { +val topK = userFeatures.map { x = (x._1, num) } +recommendProductsForUsers(topK) + } + + /** + * Recommend topK users for all products + */ + def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = { +val topK = productFeatures.map { x = (x._1, num) } +recommendUsersForProducts(topK) + } + + val ord = Ordering.by[Rating, Double](x = x.rating) + case class FeatureTopK(feature: Vector, topK: Int) + + /** + * Recommend topK products for users in userTopK RDD --- End diff -- Need to doc the input type. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
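The recommendProductsForUsers/recommendUsersForProducts API in the diff above boils down to scoring a user's factor vector against every product factor and keeping the `num` highest-scoring products. A minimal plain-Scala sketch of that per-user step (names are illustrative and the Spark distribution is elided; this is not the actual MLlib internals):

```scala
// Score one user's factor vector against every product factor vector and
// keep the num highest-scoring products, sorted by score descending.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def recommendTopK(
    userFactor: Array[Double],
    productFactors: Map[Int, Array[Double]],
    num: Int): Seq[(Int, Double)] =
  productFactors.toSeq
    .map { case (id, f) => (id, dot(userFactor, f)) }
    .sortBy(-_._2) // sort by predicted score, descending
    .take(num)
```

In the distributed version this selection runs per user partition, which is why the PR discussion centers on broadcasting one side and using a bounded top-k selection rather than a full sort.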
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955626 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala --- @@ -39,14 +39,15 @@ import org.apache.spark.rdd.RDD object MovieLensALS { case class Params( - input: String = null, - kryo: Boolean = false, - numIterations: Int = 20, - lambda: Double = 1.0, - rank: Int = 10, - numUserBlocks: Int = -1, - numProductBlocks: Int = -1, - implicitPrefs: Boolean = false) extends AbstractParams[Params] +input: String = null, --- End diff -- Please fix the indentation.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955674 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -35,33 +41,33 @@ import org.apache.spark.rdd.RDD *and the features computed for this product. */ class MatrixFactorizationModel private[mllib] ( -val rank: Int, -val userFeatures: RDD[(Int, Array[Double])], -val productFeatures: RDD[(Int, Array[Double])]) extends Serializable { + val rank: Int, + val userFeatures: RDD[(Int, Array[Double])], + val productFeatures: RDD[(Int, Array[Double])]) extends Serializable { /** Predict the rating of one user for one product. */ def predict(user: Int, product: Int): Double = { -val userVector = new DoubleMatrix(userFeatures.lookup(user).head) -val productVector = new DoubleMatrix(productFeatures.lookup(product).head) -userVector.dot(productVector) +val userVector = Vectors.dense(userFeatures.lookup(user).head) --- End diff -- Thanks for switching to BLAS.
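The diff above swaps jblas `DoubleMatrix` for MLlib's `Vectors` plus a BLAS dot call, but the underlying math is unchanged: the predicted rating is just the inner product of the two factor vectors. A hedged plain-Scala equivalent of what that dot call computes:

```scala
// The predicted rating for (user, product) is the inner product of the
// user and product factor vectors; MLlib delegates this to a BLAS ddot.
def predictRating(userFactor: Array[Double], productFactor: Array[Double]): Double =
  userFactor.zip(productFactor).map { case (u, p) => u * p }.sum
```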
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955675 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] ( recommend(productFeatures.lookup(product).head, userFeatures, num) .map(t => Rating(t._1, product, t._2)) + /** + * Recommends topK users/products. --- End diff -- This is JavaDoc style. For normal multiline comments, please use `/*` instead of `/**`.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r24955671 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -35,33 +41,33 @@ import org.apache.spark.rdd.RDD *and the features computed for this product. */ class MatrixFactorizationModel private[mllib] ( -val rank: Int, -val userFeatures: RDD[(Int, Array[Double])], -val productFeatures: RDD[(Int, Array[Double])]) extends Serializable { + val rank: Int, --- End diff -- Do not change indentation. The previous 4-space indentation is correct.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-64804819 @mengxr can we please review and merge this? It will greatly help in reporting results from PLSA and other matrix factorization variants...should I update the API to use level-3 BLAS? I have used BLAS.dot in this code...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-63979463 More details about the API added and experiments are on the JIRA
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-63722805 [Test build #23632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23632/consoleFull) for PR 3098 at commit [`7163a5c`](https://github.com/apache/spark/commit/7163a5c21b394d8bd89694a9f08aa1b446c71956). * This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r20616949 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala --- @@ -39,14 +39,15 @@ import org.apache.spark.rdd.RDD object MovieLensALS { case class Params( - input: String = null, - kryo: Boolean = false, - numIterations: Int = 20, - lambda: Double = 1.0, - rank: Int = 10, - numUserBlocks: Int = -1, - numProductBlocks: Int = -1, - implicitPrefs: Boolean = false) extends AbstractParams[Params] +input: String = null, --- End diff -- 4-space indentation is fixed now...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r20617279 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala --- @@ -18,14 +18,14 @@ package org.apache.spark.examples.mllib import scala.collection.mutable - -import org.apache.log4j.{Level, Logger} +import org.apache.log4j.{ Level, Logger } --- End diff -- Fixed the coding style...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-63736377 [Test build #23642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23642/consoleFull) for PR 3098 at commit [`3f97c49`](https://github.com/apache/spark/commit/3f97c499004aa58dfa1b51b8d2cbd6e5776f5fb1). * This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-63738984 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23632/ Test FAILed.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-63738974 **[Test build #23632 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23632/consoleFull)** for PR 3098 at commit [`7163a5c`](https://github.com/apache/spark/commit/7163a5c21b394d8bd89694a9f08aa1b446c71956) after a configured wait of `120m`.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-63745215 [Test build #23642 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23642/consoleFull) for PR 3098 at commit [`3f97c49`](https://github.com/apache/spark/commit/3f97c499004aa58dfa1b51b8d2cbd6e5776f5fb1). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-63745222 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23642/ Test PASSed.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-62670490 [Test build #23246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23246/consoleFull) for PR 3098 at commit [`d144f57`](https://github.com/apache/spark/commit/d144f57a58c9424365f1242f90961386c016641e). * This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-62670769 @mengxr added recommendAll API to MatrixFactorizationModel and right now the cartesian-based topK finding is also in the code for validation... Example run: ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --jars /Users/v606014/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.2.0-SNAPSHOT.jar --kryo --lambda 0.065 --validateRecommendation 1.0 hdfs://localhost:8020/sandbox/movielens/ Got 1000209 ratings from 6040 users on 3706 movies. Training: 800670, test: 199539. Test RMSE = 0.8485243993052966.
Using recommendAll API
k 20 prec@k 0.04393839019542896
k 40 prec@k 0.039640609473335565
k 60 prec@k 0.03763387435133046
k 80 prec@k 0.033409738324
k 100 prec@k 0.0318681682676383
k 120 prec@k 0.0318289720658054
k 140 prec@k 0.030209861354280044
k 160 prec@k 0.028638415038092092
k 180 prec@k 0.02780078024364211
k 200 prec@k 0.027149718449817808
Test userMapAPI = 0.02964195032393224
Using Cartesian
k 20 prec@k 0.05635144087446176
k 40 prec@k 0.052252401457436246
k 60 prec@k 0.04856188583416142
k 80 prec@k 0.0453461411063266
k 100 prec@k 0.04296621397813845
k 120 prec@k 0.040878602186154356
k 140 prec@k 0.03914612217858326
k 160 prec@k 0.03766768797615105
k 180 prec@k 0.03648559125538258
k 200 prec@k 0.03540990394170256
Test userMap = 0.038507998497677914
Results with predictAll and cartesian should match but right now they are not the same...debugging it further... 
From the JIRA reference https://issues.apache.org/jira/browse/SPARK-3066, implemented two of the ideas: (1) collect one side (either user or product) and broadcast it as a matrix, and (3) use Utils.takeOrdered to find the top-k. The remaining idea, (2) use level-3 BLAS to compute the inner products, will be refactored once dense distributed matrix multiplication comes online...
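The prec@k numbers reported above can be read as: of the top-k products recommended to a user, what fraction appear in that user's held-out test set. A hedged plain-Scala sketch assuming that common definition (MLlib's RankingMetrics may differ in how it handles fewer than k recommendations):

```scala
// precision@k: fraction of the top-k recommended item ids that appear in
// the user's relevant (held-out test) set. Divides by k, not by the number
// of items actually recommended.
def precisionAtK(recommended: Seq[Int], relevant: Set[Int], k: Int): Double = {
  val topK = recommended.take(k)
  if (k == 0) 0.0 else topK.count(relevant.contains).toDouble / k
}
```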
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-62670877 More performance tests are in progress on internal datasets where the cartesian code is really slow (due to groupByKey)...For MovieLens there is no substantial difference...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-62675915 [Test build #23246 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23246/consoleFull) for PR 3098 at commit [`d144f57`](https://github.com/apache/spark/commit/d144f57a58c9424365f1242f90961386c016641e). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-62675921 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23246/ Test PASSed.