Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-20 Thread Xiangrui Meng
The assumption of the implicit feedback model is that the unobserved
ratings are more likely to be negative, so you may want to add some
negatives for evaluation. Otherwise the input ratings are all 1 and
the test ratings are all 1 as well; the baseline predictor, which uses
the average rating (that is, 1), can easily give you an RMSE of 0.0.
-Xiangrui
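The negative sampling suggested above can be sketched in plain Scala (no Spark; the helper name `addNegatives` and the per-user sample count are illustrative, not from the thread): build an evaluation set that mixes observed interactions (rating 1.0) with sampled unobserved pairs treated as negatives (rating 0.0).

```scala
// Sketch under assumptions: names are illustrative. Mix observed pairs
// (rating 1.0) with sampled unobserved pairs (rating 0.0) for evaluation.
case class Rating(user: Int, product: Int, rating: Double)

def addNegatives(observed: Seq[(Int, Int)], products: Seq[Int],
                 perUser: Int, seed: Long = 42L): Seq[Rating] = {
  val rng = new scala.util.Random(seed)
  val seen = observed.toSet
  val positives = observed.map { case (u, p) => Rating(u, p, 1.0) }
  // For each user, sample up to perUser products they have NOT interacted with.
  val negatives = observed.map(_._1).distinct.flatMap { u =>
    rng.shuffle(products)
      .filter(p => !seen((u, p)))
      .take(perUser)
      .map(p => Rating(u, p, 0.0))
  }
  positives ++ negatives
}
```

With a mixed evaluation set like this, the trivial all-ones baseline no longer scores perfectly.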

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-16 Thread Zork Sail
I am trying to use Spark MLlib ALS with implicit feedback for collaborative
filtering. The input data has only two fields, `userId` and `productId`. I
have **no product ratings**, just information on which products users have
bought, that's all. So to train ALS I use:

def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int):
MatrixFactorizationModel

(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)

This API requires a `Rating` object:

Rating(user: Int, product: Int, rating: Double)

On the other hand, the documentation for `trainImplicit` says: *Train a matrix
factorization model given an RDD of 'implicit preferences' ratings given by
users to some products, in the form of (userID, productID, **preference**)
pairs.*

When I set the rating / preference to `1`, as in:

val ratings = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  // format: (randomNumber, Rating(userId, productId, rating))
  (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}

val training = ratings.filter(x => x._1 < 60)
  .values
  .repartition(numPartitions)
  .cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
  .values
  .repartition(numPartitions)
  .cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()


And then train ALS:

val model = ALS.trainImplicit(ratings, rank, numIter)

I get an RMSE of 0.9, which is a large error given that preferences take only
the values 0 or 1:

val validationRmse = computeRmse(model, validation, numValidation)

/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}

So my question is: to what value should I set `rating` in:

Rating(user: Int, product: Int, rating: Double)

for implicit training (in the `ALS.trainImplicit` method)?

**Update**

With:

  val alpha = 40
  val lambda = 0.01

I get:

Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
The best model was trained with rank = 12 and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.

This is still a large error, I guess. I also get a strange baseline
improvement figure, where the baseline model is simply the mean (1).
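The "-Infinity%" line in the output above follows directly from the usual improvement formula, which divides by the baseline RMSE; a plain-Scala sketch (the helper name is illustrative):

```scala
// Sketch: the "improves the baseline by X%" figure is typically computed as
// (baselineRmse - testRmse) / baselineRmse * 100. With a baseline RMSE of
// exactly 0.0 (every test rating equals the mean of 1), Double division by
// zero yields -Infinity, which explains the odd output line.
def improvementPct(baselineRmse: Double, testRmse: Double): Double =
  (baselineRmse - testRmse) / baselineRmse * 100
```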



Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-16 Thread Sean Owen
On Fri, Jan 16, 2015 at 9:58 AM, Zork Sail zorks...@gmail.com wrote:
> And then train ALS:
>
>  val model = ALS.trainImplicit(ratings, rank, numIter)
>
> I get an RMSE of 0.9, which is a large error given that preferences take
> only the values 0 or 1:

This is likely the problem. RMSE is not an appropriate evaluation
metric when you have trained a model on implicit data: the
factorization is not minimizing the same squared-error loss that RMSE
evaluates. Use a ranking metric such as AUC instead.

The rating value can be 1 if you have no information at all about the
interaction other than that it exists. It should be thought of as a
weight: a rating of 10 means it is 10 times more important to predict
that interaction than one with weight 1.
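The AUC suggestion above can be sketched in plain Scala as a pairwise comparison between model scores for held-out positives and sampled negatives (the helper name and inputs are illustrative, not from the thread):

```scala
// Sketch: AUC as the probability that a randomly chosen held-out positive
// scores higher than a randomly chosen sampled negative (ties count half).
def auc(posScores: Seq[Double], negScores: Seq[Double]): Double = {
  val pairs = for (p <- posScores; n <- negScores)
    yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  pairs.sum / pairs.size
}
```

An AUC of 0.5 means the model ranks no better than chance; 1.0 means every positive outranks every negative.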

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org