Thanks a lot Evan. Your help is really appreciated.
BR,
Aslan

On Sun, Dec 1, 2013 at 3:00 AM, Evan R. Sparks <[email protected]> wrote:

> The MLI repo doesn't yet have support for collaborative filtering, though
> we've got a private branch we're working on cleaning up that will add it
> shortly. To use MLI, you need to build it with sbt/sbt assembly, and then
> make sure all workers have access to it by passing the filename of the jar
> to SparkContext when you create it.
>
> For now, your best bet is to use the MLlib implementation of ALS that's in
> Spark today.
>
> If you have an input file where each line is of the format
> "user,song,rating", you could load up your data for appropriate input like
> this:
>
>     val ratings = sc.textFile(ratingsFile).map { line =>
>       val fields = line.split(',')
>       Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
>     }.cache()
>
> Then, you can train with something like:
>
>     val model = ALS.train(ratings, 10, 100, 0.1)
>
> You can also take a look at the ALS code in MLlib - there's a command-line
> tool that will do the same thing and save your model to a couple of files.
>
> This will train a MatrixFactorizationModel of rank 10 in 100 iterations
> with a regularization parameter of 0.1.
>
> As for how I came up with those values:
>
> Rank is a measure of model complexity - the model is estimated as
> essentially #rank parameters per user and #rank parameters per song. The
> larger the rank, the more complex the model (but also the more expensive
> it is to train, and the greater the chance that you're overfitting to the
> input data). Reasonable values are anywhere from 5 to 50.
>
> Iterations is the number of passes of the ALS algorithm to run -
> eventually the model will converge to some roughly fixed point, and
> additional iterations won't change it much. Reasonable values are anywhere
> from 10 to 1000, depending on the complexity of the data and the rank of
> the model. There are checks for early termination you can do when training
> these things, but that's not currently implemented in Spark.
>
> Regularization is a tool that encourages model sparsity. Higher
> regularization encourages model parameters that are near zero to stay
> small. This is one way to combat overfitting, and often yields models that
> work better on an out-of-sample basis.
>
> - Evan
>
> On Sat, Nov 30, 2013 at 12:04 PM, Aslan Bekirov <[email protected]> wrote:
>
>> Hi Evan,
>>
>> Thank you very much for your quick response.
>>
>> I am using ALS to create a model; here is my method:
>>
>>     def doCollab() {
>>       val sc = new SparkContext("local[2]", "Log Query")
>>       val mc = new MLContext(sc)
>>       var pairs = mc.load("user_song_pairs", 1 to 2)
>>       val ratings = mc.load("user_ratings", 1)
>>
>>       val als = new ALS()
>>       als.setBlocks(-1)
>>       als.setIterations(15)
>>       als.setRank(10)
>>
>>       val model = als.run(ratings)
>>     }
>>
>> But here, first of all, MLContext could not be resolved. Am I creating
>> the context wrongly?
>>
>> Secondly, ALS has parameters like:
>>
>> - *rank* is the number of latent factors in our model.
>> - *iterations* is the number of iterations to run.
>> - *lambda* specifies the regularization parameter in ALS.
>>
>> But I could not find example values for these parameters. Can you give a
>> bit more explanation of them, along with some example values?
>>
>> BR,
>> Aslan
>>
>> On Fri, Nov 29, 2013 at 9:03 PM, Evan Sparks <[email protected]> wrote:
>>
>>> Hi Aslan,
>>>
>>> You'll need to link against the spark-mllib artifact. The method we
>>> currently have for collaborative filtering is ALS.
>>>
>>> Documentation is available here -
>>> http://spark.incubator.apache.org/docs/latest/mllib-guide.html
>>>
>>> We're working on a more complete ALS tutorial, and will link to it from
>>> that page when it's ready.
>>>
>>> - Evan
>>>
>>> On Nov 29, 2013, at 10:33 AM, Aslan Bekirov <[email protected]> wrote:
>>> >
>>> > Hi All,
>>> >
>>> > I am trying to do collaborative filtering with MLbase. I am using
>>> > Spark 0.8.0.
>>> >
>>> > I have some basic questions.
>>> >
>>> > 1) I am using Maven and added this dependency to my pom:
>>> >
>>> >     <dependency>
>>> >       <groupId>org.apache.spark</groupId>
>>> >       <artifactId>spark-core_2.9.3</artifactId>
>>> >       <version>0.8.0-incubating</version>
>>> >     </dependency>
>>> >
>>> > I could not see any MLbase-related classes in the downloaded jar, which
>>> > is why I could not import the mli libraries. Am I missing something? Do
>>> > I have to add a further dependency for mli?
>>> >
>>> > 2) Does a Java API exist for MLBase?
>>> >
>>> > Thanks in advance,
>>> >
>>> > BR,
>>> > Aslan
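
For readers who hit Aslan's first question: the MLlib classes are not in
spark-core, so a separate dependency is needed, as Evan says. A sketch of
the pom entry, assuming the Scala 2.9.3 build that Spark 0.8.0-incubating
shipped with:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.9.3</artifactId>
      <version>0.8.0-incubating</version>
    </dependency>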

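Putting Evan's two snippets together, a minimal end-to-end sketch might look
like the following. The ratings.csv path and the user/song ids are
placeholders, and rank = 10, iterations = 100, lambda = 0.1 are the example
values from Evan's message:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object SongRecommender {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "Song Recommender")

        // Each input line is expected to be "user,song,rating"
        // (ratings.csv is a placeholder path).
        val ratings = sc.textFile("ratings.csv").map { line =>
          val fields = line.split(',')
          Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
        }.cache()

        // Train a MatrixFactorizationModel: rank 10, 100 iterations,
        // regularization parameter 0.1.
        val model = ALS.train(ratings, 10, 100, 0.1)

        // Predict how user 1 would rate song 42 (placeholder ids).
        println("Predicted rating: " + model.predict(1, 42))

        sc.stop()
      }
    }

As Evan notes, the model eventually converges to a roughly fixed point, so
far fewer iterations (e.g., 10-20) are often enough in practice.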