FYI, adding to Pat's reply below: Slope-One has long been deprecated.

On Mon, Apr 6, 2015 at 5:00 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Sorry, we are trying to get a release out.
>
> You can look at a custom similarity measure. Look at where SIMILARITY_COSINE leads you and customize that, maybe? There are in-memory and MapReduce versions, and I'm not sure which you are using. That is code I haven't looked at for a long time, so I can't get you much closer.
>
> On Apr 3, 2015, at 10:52 AM, PierLorenzo Bianchini <piell...@yahoo.com.INVALID> wrote:
>
> Hi again,
>
> Seeing the answers to this question and to the other one I had posted ("adjusted cosine similarity for item-based recommender?"), I think I should clarify a bit what I'm trying to achieve and why I (believe I should) do things the way I'm doing them.
>
> I'm taking a class called "Learning from User-Generated Data". Our first assignment deals with analysing the results of various types of recommenders. I'll go as far as saying "old-school" recommenders, given the content of your answers.
>
> We have been introduced to:
>
> * Memory-based:
>   - user-based
>   - item-based (*with* adjusted cosine similarity!)
>   - slope-one
>   - graph-based transitivity
> * Model-based:
>   - preprocessed item/user-based (this is unclear to me, but I haven't reached this part of the assignment yet, so I'll search for information before asking questions; I also found an article that listed slope-one among the model-based methods, so I guess I'll need to do more research on this)
>   - matrix-factorization-based (I saw that SVD is available in Mahout; my project partner is looking into that right now)
>
> We have a *static* training dataset (800,000 <user,movie,preference> triples) and another static dataset for which we have to extract the predicted preferences (200,000 <user,movie> tuples) and write them back to a file (i.e. recompose the <user,movie,preference> triples). Note that this will never go into a production environment, as it is merely a university requirement.
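At this scale, loading the 800,000 training triples into memory is cheap. A minimal sketch in plain Java, assuming the comma-separated layout described above; the nested-map structure here is purely illustrative, not Mahout's DataModel:

```java
import java.util.HashMap;
import java.util.Map;

public class LoadTriples {
    // Parses "userId,movieId,preference" lines into user -> (movie -> preference).
    // In practice the lines would come from a BufferedReader over the CSV file.
    static Map<Long, Map<Long, Double>> load(String csv) {
        Map<Long, Map<Long, Double>> prefs = new HashMap<>();
        for (String line : csv.split("\n")) {
            if (line.isEmpty()) continue;
            String[] f = line.split(",");
            prefs.computeIfAbsent(Long.parseLong(f[0]), u -> new HashMap<>())
                 .put(Long.parseLong(f[1]), Double.parseDouble(f[2]));
        }
        return prefs;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> prefs = load("1,10,4.5\n1,20,2.0\n2,10,3.0\n");
        System.out.println(prefs.get(1L).get(10L)); // prints 4.5
    }
}
```

800,000 boxed entries like this fit comfortably in far less than 8 GB of RAM, so memory is not the constraint here.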
> For the same reason, I would prefer not to mix things up too much, and I'd rather learn step by step (i.e. focus on Mahout for now, before I dig deeper and check the search-based approach, which uses DB + Mahout + Solr + Spark... maybe a bit too much to handle at once with the deadline we were given).
>
> So, if I may get back to my original questions (again, I'm sorry for being stubborn, but I'm under specific constraints; I'll really try to understand the search-based approach when I have more time) ;)
>
> 1. I'm guessing that to implement an adjusted cosine similarity I should extend AbstractSimilarity (or maybe even AbstractRecommender?). Is this right?
> 2. I still can't believe that it takes more than a few minutes at most to go through my 200,000 lines and find the already-calculated preference. What am I doing wrong? :/ Should I store my whole data model in a file (how?) and then read through the file? I don't see how that could be faster than just reading the exact value I'm searching for...
>
> Thanks again for your answers! Regards,
>
> Pier Lorenzo
>
> --------------------------------------------
> On Fri, 4/3/15, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> Subject: Re: fast performance way of writing preferences to file?
> To: "user@mahout.apache.org" <user@mahout.apache.org>
> Date: Friday, April 3, 2015, 5:52 PM
>
> Are you sure that the problem is writing the results? It seems to me that the real problem is the use of a user-based recommender.
>
> For such a small data set, for instance, a search-based recommender will be able to make recommendations in less than a millisecond, with multiple recommendations possible in parallel. This should allow you to do 200,000 recommendations in a few minutes on a single machine.
>
> With such a small dataset, indicator-based methods may not be the best option. To improve that, try using something larger such as the Million Song Dataset.
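On question 1: the adjusted-cosine computation itself is small and can be prototyped outside Mahout before wiring it into a similarity class. A minimal sketch, assuming a toy nested-map rating layout rather than Mahout's DataModel: each user's mean rating over all of their rated items is subtracted before taking the cosine over the users who co-rated both items.

```java
import java.util.HashMap;
import java.util.Map;

public class AdjustedCosine {
    // ratings: user -> (item -> rating); a toy layout for illustration only.
    static double adjustedCosine(Map<Long, Map<Long, Double>> ratings, long itemA, long itemB) {
        double dot = 0, normA = 0, normB = 0;
        for (Map<Long, Double> prefs : ratings.values()) {
            Double ra = prefs.get(itemA), rb = prefs.get(itemB);
            if (ra == null || rb == null) continue; // only co-rating users count
            // subtract this user's mean rating over ALL items they rated
            double mean = prefs.values().stream()
                               .mapToDouble(Double::doubleValue).average().orElse(0);
            double da = ra - mean, db = rb - mean;
            dot += da * db;
            normA += da * da;
            normB += db * db;
        }
        if (normA == 0 || normB == 0) return Double.NaN; // undefined: no variance
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> r = new HashMap<>();
        r.put(1L, Map.of(10L, 5.0, 20L, 1.0, 30L, 3.0));
        r.put(2L, Map.of(10L, 4.0, 20L, 2.0));
        System.out.println(adjustedCosine(r, 10L, 20L)); // prints -1.0
    }
}
```

In the toy data, both users rate item 10 above their personal mean and item 20 below it, so the similarity comes out at exactly -1.0.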
> See http://labrosa.ee.columbia.edu/millionsong/
>
> Also, using and estimating ratings is not a particularly good thing to be doing if you want to build a real recommender.
>
> On Fri, Apr 3, 2015 at 3:26 AM, PierLorenzo Bianchini <piell...@yahoo.com.invalid> wrote:
>
> > Hello everyone,
> > I'm new to Mahout, to recommender systems, and to the mailing list.
> >
> > I'm trying to find a (fast) way to write preferences back to a file. I tried a few methods, but I'm sure there must be a better approach. Here's the deal (you can find the same post on Stack Overflow [1]).
> >
> > I have a training dataset of 800,000 records from 6,000 users rating 3,900 movies. These are stored in a comma-separated file like: userId,movieId,preference. I have another dataset (200,000 records) in the format: userId,movieId. My goal is to use the first dataset as a training set, in order to determine the missing preferences of the second set.
> >
> > So far, I have managed to load the training dataset and I generated user-based recommendations. This is pretty smooth and doesn't take too much time. But I'm struggling when it comes to writing back the recommendations.
> >
> > The first method I tried is:
> >
> > * read a line from the file and get the userId,movieId tuple
> > * retrieve the calculated preference with estimatePreference(userId, movieId)
> > * append the preference to the line and save it in a new file
> >
> > This works, but it's incredibly slow (I added a counter to print every 10,000th iteration: after a couple of minutes it had only printed once. I have 8 GB of RAM with an i7 core... how long can it take to process 200,000 lines?!)
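For what it's worth, the loop itself (read a tuple, estimate, append) should take seconds with buffered I/O; the minutes are almost certainly spent inside estimatePreference, which for a user-based recommender recomputes similarities on every call. A sketch of the streaming part alone, with a hypothetical Estimator interface standing in for recommender.estimatePreference:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

public class WriteEstimates {
    // Stand-in for Mahout's recommender.estimatePreference(userId, movieId).
    interface Estimator { double estimate(long userId, long movieId); }

    // Turns a "userId,movieId" line into "userId,movieId,preference".
    static String appendEstimate(String line, Estimator est) {
        int comma = line.indexOf(',');
        long user = Long.parseLong(line.substring(0, comma).trim());
        long movie = Long.parseLong(line.substring(comma + 1).trim());
        return line + "," + est.estimate(user, movie);
    }

    public static void main(String[] args) throws IOException {
        Estimator dummy = (u, m) -> u + m * 0.1; // dummy estimator for the demo
        BufferedReader in = new BufferedReader(new StringReader("1,10\n2,20\n"));
        StringWriter sink = new StringWriter(); // a FileWriter in real use
        try (BufferedWriter out = new BufferedWriter(sink)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(appendEstimate(line, dummy));
                out.newLine(); // buffered: flushed in large chunks, not per line
            }
        }
        System.out.print(sink); // two lines: 1,10,2.0 and 2,20,4.0
    }
}
```

If this loop with a trivial estimator is fast but the real run is slow, that confirms the cost is in the estimation step, not the file writing.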
> > My second choice was:
> >
> > * create a new FileDataModel with the second dataset
> > * do something like this: newDataModel.setPreference(userId, movieId, recommender.estimatePreference(userId, movieId));
> >
> > Here I get several problems:
> >
> > * at runtime: java.lang.UnsupportedOperationException (as I found out in [2], FileDataModel actually can't be updated; I don't understand why the setPreference method exists in the first place...)
> > * the API of FileDataModel#setPreference states "This method should also be considered relatively slow."
> >
> > I read around that a solution would be to use delta files, but I couldn't find out what that actually means. Any suggestion on how I could speed up my writing-the-preferences process?
> >
> > Thank you!
> >
> > Pier Lorenzo
> >
> > [1] http://stackoverflow.com/questions/29423824/mahout-fast-performance-how-to-write-preferences-to-file
> > [2] http://comments.gmane.org/gmane.comp.apache.mahout.user/11330
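Rather than pushing estimates back into a FileDataModel, a simpler route is to cache the expensive per-user work and stream the results to a plain output file; Mahout also ships wrappers such as CachingRecommender and CachingUserSimilarity for this. The caching idea in isolation is just memoization (the names below are illustrative, not Mahout API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class MemoDemo {
    // Wraps an expensive per-key computation so it runs at most once per key.
    static <K, V> Function<K, V> memoize(Function<K, V> f) {
        Map<K, V> cache = new HashMap<>();
        return k -> cache.computeIfAbsent(k, f);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Stand-in for the costly per-user step of a user-based recommender
        // (e.g. building the user's neighborhood); the 0.5 factor is arbitrary.
        Function<Long, Double> cached = memoize(user -> {
            calls.incrementAndGet();
            return user * 0.5;
        });
        for (long u : new long[]{1, 2, 3, 1, 2, 3}) cached.apply(u);
        System.out.println(calls.get()); // prints 3: once per distinct user
    }
}
```

Six queries over three distinct users trigger the costly step only three times; sorting or grouping the 200,000 query pairs by user gives such a cache the best possible hit rate.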