Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?
The assumption of the implicit feedback model is that unobserved ratings are more likely to be negative, so you may want to add some negatives for evaluation. Otherwise, the input ratings are all 1 and the test ratings are all 1 as well. The baseline predictor, which uses the average rating (that is, 1), could then easily give you an RMSE of 0.0.

-Xiangrui

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
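To make the point about negatives concrete, here is a minimal sketch in plain Scala (no Spark) of adding sampled negatives to an evaluation set. The `Obs` case class is a hypothetical stand-in for MLlib's `Rating`, and `sampleNegatives`, `baselineRmse`, and the toy data are illustrative names and values, not part of any Spark API. The sampling treats every unobserved (user, product) pair as a candidate negative, which is only an assumption of the implicit model, not a known fact about the user.

```scala
import scala.util.Random

// Hypothetical stand-in for MLlib's Rating, so the sketch runs without Spark.
final case class Obs(user: Int, product: Int, rating: Double)

object NegativeSampling {
  // RMSE of a constant predictor that always outputs the mean rating of `data`.
  def baselineRmse(data: Seq[Obs]): Double = {
    val mean = data.map(_.rating).sum / data.size
    math.sqrt(data.map(o => (o.rating - mean) * (o.rating - mean)).sum / data.size)
  }

  // For each user, draw up to `perUser` products the user has no observed
  // interaction with and label them 0.0 (assumed, not known, negatives).
  def sampleNegatives(positives: Seq[Obs], allProducts: Seq[Int],
                      perUser: Int, rng: Random): Seq[Obs] =
    positives.groupBy(_.user).toSeq.sortBy(_._1).flatMap { case (user, seen) =>
      val bought = seen.map(_.product).toSet
      val candidates = allProducts.filterNot(bought)
      rng.shuffle(candidates).take(perUser).map(p => Obs(user, p, 0.0))
    }

  def main(args: Array[String]): Unit = {
    // Toy purchase data: every observed rating is 1.0, as in the question.
    val positives = Seq(Obs(1, 10, 1.0), Obs(1, 11, 1.0), Obs(2, 10, 1.0), Obs(2, 12, 1.0))
    val products = Seq(10, 11, 12, 13, 14)
    val negatives = sampleNegatives(positives, products, perUser = 2, new Random(42))
    println(s"baseline RMSE, positives only: ${baselineRmse(positives)}")
    println(s"baseline RMSE, with negatives: ${baselineRmse(positives ++ negatives)}")
  }
}
```

With only positives, the mean predictor is exact and its RMSE is 0.0; once negatives are mixed in, the baseline is no longer trivially perfect. If the relative improvement is computed as `(baselineRmse - testRmse) / baselineRmse * 100`, which is an assumption about the script rather than something shown in the thread, a baseline of 0.0 turns that into a division by zero, consistent with the reported `-Infinity%`.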
MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?
I am trying to use Spark MLlib ALS with implicit feedback for collaborative filtering. The input data has only two fields, `userId` and `productId`. I have **no product ratings**, just information on which products users have bought. To train ALS I use:

    def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel

(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)

This API requires a `Rating` object:

    Rating(user: Int, product: Int, rating: Double)

On the other hand, the documentation for `trainImplicit` says: *Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by users to some products, in the form of (userID, productID, **preference**) pairs.*

When I set the rating / preference to `1` as in:

    val ratings = sc.textFile(new File(dir, file).toString).map { line =>
      val fields = line.split(",")
      // format: (randomNumber, Rating(userId, productId, rating))
      (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
    }

    val training = ratings.filter(x => x._1 < 60)
      .values
      .repartition(numPartitions)
      .cache()
    val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
      .values
      .repartition(numPartitions)
      .cache()
    val test = ratings.filter(x => x._1 >= 80).values.cache()

And then train ALS:

    val model = ALS.trainImplicit(ratings, rank, numIter)

I get RMSE 0.9, which is a big error when preferences take the value 0 or 1:

    val validationRmse = computeRmse(model, validation, numValidation)

    /** Compute RMSE (Root Mean Squared Error). */
    def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
      val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
      val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
        .join(data.map(x => ((x.user, x.product), x.rating)))
        .values
      math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
    }

So my question is: to what value should I set `rating` in:

    Rating(user: Int, product: Int, rating: Double)

for implicit training (in the `ALS.trainImplicit` method)?

**Update**

With:

    val alpha = 40
    val lambda = 0.01

I get:

    Got 1895593 ratings from 17471 users on 462685 products.
    Training: 1136079, validation: 380495, test: 379019
    RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
    RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
    RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
    RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
    The best model was trained with rank = 12, and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
    baselineRmse: 0.0 testRmse: 0.7302343904091481
    The best model improves the baseline by -Infinity%.

Which is still a big error, I guess. I also get a strange baseline improvement, where the baseline model is simply the mean (1).
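As background on the `alpha` value in the update above, here is a minimal sketch of how a rating can act as a confidence weight rather than a target value. It assumes the implicit-feedback formulation of Hu, Koren, and Volinsky, in which an observation of strength r has binary preference 1 (if r > 0) and confidence 1 + alpha * r. The object and function names are hypothetical and the regularization term is left out, so this illustrates the idea rather than MLlib's actual code.

```scala
object ImplicitWeights {
  // Confidence attached to an observation of strength r: c = 1 + alpha * r.
  def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * r

  // Binary preference the factorization tries to reproduce: 1 if observed, else 0.
  def preference(r: Double): Double = if (r > 0) 1.0 else 0.0

  // Contribution of one (user, product) cell to the confidence-weighted
  // squared loss; regularization is omitted in this sketch.
  def weightedSquaredError(r: Double, predicted: Double, alpha: Double): Double = {
    val diff = preference(r) - predicted
    confidence(r, alpha) * diff * diff
  }

  def main(args: Array[String]): Unit = {
    val alpha = 40.0
    // The same prediction (0.5) is penalized very differently depending on
    // whether the cell was observed (r = 1) or unobserved (r = 0).
    println(s"observed cell:   ${weightedSquaredError(1.0, 0.5, alpha)}")
    println(s"unobserved cell: ${weightedSquaredError(0.0, 0.5, alpha)}")
  }
}
```

Under these assumptions, the loss is summed over all cells, observed or not, with very different weights, so an unweighted RMSE over held-out purchases alone measures something other than what the model was fitted to.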
Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?
On Fri, Jan 16, 2015 at 9:58 AM, Zork Sail zorks...@gmail.com wrote:

> And then train ALS:
>
>     val model = ALS.trainImplicit(ratings, rank, numIter)
>
> I get RMSE 0.9, which is a big error in case of preferences taking 0 or 1 value:

This is likely the problem. RMSE is not an appropriate evaluation metric when you have trained a model on implicit data. The factorization is not minimizing the same squared error loss that RMSE evaluates. Use a metric like AUC instead.

The rating value can be 1 if you have no information about the interaction other than that it exists. It should be thought of as a weight: a value of 10 means it is 10 times more important to predict that interaction than one with weight 1.
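As a rough illustration of the AUC suggestion, the sketch below computes AUC directly from two lists of model scores in plain Scala. It assumes you already have predicted scores for held-out purchases (positives) and for some sampled unobserved pairs (negatives); `AucSketch`, `auc`, and the example scores are hypothetical and not taken from the thread.

```scala
object AucSketch {
  // AUC as the fraction of (positive, negative) score pairs in which the
  // positive is ranked above the negative; ties count as half.
  def auc(positiveScores: Seq[Double], negativeScores: Seq[Double]): Double = {
    require(positiveScores.nonEmpty && negativeScores.nonEmpty, "need both classes")
    val pairScores = for {
      p <- positiveScores
      n <- negativeScores
    } yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    pairScores.sum / (positiveScores.size.toLong * negativeScores.size)
  }

  def main(args: Array[String]): Unit = {
    // Made-up scores for two held-out purchases and three sampled non-purchases.
    val positives = Seq(0.9, 0.6)
    val negatives = Seq(0.7, 0.2, 0.1)
    println(s"AUC = ${auc(positives, negatives)}")
  }
}
```

This pairwise form is quadratic in the number of scores, so it only suits small samples. On a real dataset, Spark's `BinaryClassificationMetrics` in `org.apache.spark.mllib.evaluation` offers `areaUnderROC()` over an RDD of (score, label) pairs, which is likely the more practical route.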