Hi, I am building a recommender system on learning-content data. My data is a user-item matrix of views, similar to the following:
    NS
    353  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    354  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    355  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    356  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    357  0 0 0 0 0

Each row corresponds to a user id, and there is one column for every video in the system. The value in a video column is 1 if the user has watched or clicked on that video, and 0 otherwise. This is an implicit feedback dataset.

I am now looking at the spark.mllib package, and its example says the data should be of the form [(userid), (product), (ratings)]. My dataframe is a user-video matrix in which each column is a different video and the value in that column is the rating (views, in this case). I believe this is how the data is represented in the original paper and elsewhere in the collaborative filtering literature.

I am not sure whether this format is supported by spark.mllib, or whether I have to convert it to the format used in their example. Any idea how to do that conversion from my dataset?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Data-Format-for-Running-Collaborative-Filtering-in-Spark-MLlib-tp27832.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
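
For illustration, here is a minimal sketch of the conversion being asked about: turning a wide user-by-video 0/1 matrix into (user, video, rating) triples. It is plain Python rather than Spark code, and the function name `wide_to_triples`, the video ids and the toy values are all hypothetical. It assumes that cells with 0 can be dropped, on the understanding that implicit-feedback ALS treats unobserved pairs as having no preference.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def wide_to_triples(
    user_ids: Sequence[int],
    video_ids: Sequence[int],
    matrix: Iterable[Sequence[int]],
) -> Iterator[Tuple[int, int, float]]:
    """Yield one (user, video, rating) triple per non-zero cell.

    `matrix` holds one row per user, in the same order as `user_ids`,
    and one column per video, in the same order as `video_ids`.
    """
    for user_id, row in zip(user_ids, matrix):
        for video_id, value in zip(video_ids, row):
            # Assumption: zeros mean "not viewed" and are simply omitted.
            if value != 0:
                yield (user_id, video_id, float(value))


# Hypothetical toy data: 3 users, 4 videos.
user_ids: List[int] = [353, 354, 355]
video_ids: List[int] = [10, 11, 12, 13]
matrix: List[List[int]] = [
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]

triples = list(wide_to_triples(user_ids, video_ids, matrix))
for triple in triples:
    print(triple)
```

Each triple has the (user, product, rating) shape that the spark.mllib example expects. In PySpark, such triples could be wrapped as `Rating` objects and passed to `ALS.trainImplicit` from `pyspark.mllib.recommendation`, or loaded into a DataFrame for `pyspark.ml.recommendation.ALS` with `implicitPrefs=True`. For a matrix too large for one machine, the same reshaping would have to be done inside Spark rather than in a local loop.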