Data Format for Running Collaborative Filtering in Spark MLlib

Baktaawar Mon, 03 Oct 2016 11:08:58 -0700

Hi 

I am working on building a recommender system on learning-content data. My
data is a user-item matrix of views, similar to the one below:


NS                                                                              
                                                                                
        
353     0       0       0       0       0       0       0       0       0       
0       ...     0       0       0       0       0       0       0       0       
0       0
354     0       0       0       0       0       0       0       0       0       
0       ...     0       0       0       0       0       0       0       0       
0       0
355     0       0       0       0       0       0       0       0       0       
0       ...     0       0       0       0       0       0       0       0       
0       0
356     0       0       0       0       0       0       0       0       0       
0       ...     0       0       0       0       0       0       0       0       
0       0
357     0       0       0       0       0

Each row corresponds to a user id and each column to one of the videos in
the system. The value in each video column is 1 if the user has
watched/clicked on the video, and 0 otherwise.

This is an implicit feedback dataset. 

Now, I am looking at the spark.mllib package, and the example there suggests
that the data should be of the form [(userid), (product), (rating)]. My
dataframe is user-by-video: each column is a different video, and the value
in each column is the rating (views in this case).

I guess this matrix form is how the data is represented in the original
paper and elsewhere in the collaborative filtering literature. I am not sure
whether this format is supported by spark.mllib or whether I have to convert
it to the one used in their example. Any idea how to do that from my dataset?
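
To make the conversion in question concrete, here is a minimal sketch in plain
Python of turning a wide user-by-video matrix into (user, video, rating)
triples. The function name `wide_to_triples` and the toy ids and values are
hypothetical and are not taken from the dataset above. It assumes the matrix is
small enough to iterate in memory and that a 0 simply means "no view observed",
so only nonzero cells are kept.

```python
def wide_to_triples(user_ids, video_ids, matrix):
    """Convert a dense user-by-video matrix into (user, video, rating) triples.

    Only nonzero cells are emitted: with implicit feedback, a 0 means that
    no interaction was observed, so it is left out of the output.
    """
    triples = []
    for user, row in zip(user_ids, matrix):
        for video, value in zip(video_ids, row):
            if value != 0:
                triples.append((user, video, float(value)))
    return triples


# Hypothetical toy data: 3 users, 4 videos (ids are made up for illustration).
user_ids = [353, 354, 355]
video_ids = [10, 11, 12, 13]
matrix = [
    [0, 1, 0, 0],  # user 353 viewed video 11
    [0, 0, 0, 0],  # user 354 viewed nothing
    [1, 0, 0, 1],  # user 355 viewed videos 10 and 13
]

triples = wide_to_triples(user_ids, video_ids, matrix)
print(triples)
```

Triples of this shape match the [(userid), (product), (rating)] form mentioned
above. In PySpark they could then be parallelized into an RDD and passed to
`ALS.trainImplicit` from `pyspark.mllib.recommendation`, which is the
implicit-feedback variant and expects integer user and product ids. For a large
matrix the same reshaping would more likely be done with a dataframe operation
than with a Python loop.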



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Data-Format-for-Running-Collaborative-Filtering-in-Spark-MLlib-tp27832.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
