That sounds quite slow. You're definitely computing item-item similarity? If users are rows, then this job is computing user-user similarity.
An item-based recommender isn't necessary per se, just item similarity. The ItemBasedRecommender has a convenience method to just find the top N most similar items. If your scale is such that working in memory is feasible, that is by far the best answer.

Sean

On Tue, Sep 18, 2012 at 4:14 PM, yamo93 <[email protected]> wrote:
> Hi Sean,
>
> My need is to compute document similarity (30,000 docs) and, more
> precisely, to find the n most similar docs.
> As written above, I use RowSimilarityJob but it takes 2h+ to compute.
>
> Seb suggests using an item-item recommender with input data (term,
> document, tf-idf).
>
> Rgds,
> Y.
>
> On 09/18/2012 04:21 PM, Sean Owen wrote:
>
>> If you are computing user-user similarity, the number of items is not
>> nearly as important as the number of users. If you have 1M users, then
>> computing about 500 billion user-user similarities is going to take a
>> long time no matter what.
>>
>> CSV is the input for both the Hadoop-based and non-Hadoop-based
>> implementations. The Hadoop-based implementation converts to vectors;
>> you can inject vectors directly there if you want. But you need CSV for
>> the non-Hadoop code.
>>
>> There are a number of tuning params in the Hadoop implementation (and
>> similar but different hooks in the non-Hadoop implementation) that let
>> you prune data at several stages. This is the most important thing for
>> speed. Yes, removing stop-words falls in that category.
>>
>> Tuning the JVM helps, but only marginally. More Hadoop nodes help,
>> linearly.
>>
>> On Tue, Sep 18, 2012 at 1:49 PM, yamo93 <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have 30,000 items and the computation takes more than 2h on a
>>> pseudo-cluster, which is too long in my case.
>>>
>>> I can think of some ways to reduce the execution time of
>>> RowSimilarityJob, and I wonder if some of you have implemented them,
>>> and how, or explored other ways:
>>> 1. tune the JVM
>>> 2. develop an in-memory implementation (i.e. without Hadoop)
>>> 3. reduce the size of the matrix (by removing items which have no
>>> words in common, for example)
>>> 4. run on a real Hadoop cluster with several nodes (does anyone have
>>> an idea of the number of nodes needed to make it worthwhile?)
>>>
>>> Thanks for your help,
>>> Yann.
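For reference, a minimal sketch of the in-memory approach Sean describes, using Mahout's Taste API over a (term, document, tf-idf) CSV. This is one possible way to do it, not the only one: the file name and document ID are placeholders, the cosine measure is just one ItemSimilarity choice, and FileDataModel assumes term and document IDs have already been mapped to numeric longs.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class MostSimilarDocs {
      public static void main(String[] args) throws Exception {
        // (termID, docID, tf-idf) triples, treated as (user, item, preference)
        DataModel model = new FileDataModel(new File("term_doc_tfidf.csv"));
        // cosine over the tf-idf values; other ItemSimilarity impls work too
        ItemSimilarity similarity = new UncenteredCosineSimilarity(model);
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);
        // the convenience method: top 10 documents most similar to doc 42
        List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);
        for (RecommendedItem doc : similar) {
          System.out.println(doc.getItemID() + "\t" + doc.getValue());
        }
      }
    }

With 30,000 documents the full item-item similarity fits comfortably in memory, which is why this tends to be much faster than the Hadoop job on a pseudo-cluster.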

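The pruning hooks Sean mentions in the quoted reply correspond to options on RowSimilarityJob itself. A rough sketch of driving it from Java follows; the paths and numbers are placeholders, and the package name and exact option set vary by Mahout version, so check the version you run against.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

    public class RunRowSimilarity {
      public static void main(String[] args) throws Exception {
        // input: SequenceFile<IntWritable,VectorWritable> of tf-idf document vectors
        ToolRunner.run(new Configuration(), new RowSimilarityJob(), new String[] {
            "--input", "/mahout/tfidf-vectors",           // placeholder path
            "--output", "/mahout/doc-doc-similarity",     // placeholder path
            "--numberOfColumns", "50000",                 // vocabulary size (placeholder)
            "--similarityClassname", "SIMILARITY_COSINE",
            "--maxSimilaritiesPerRow", "10",              // keep only the top N per doc
            "--excludeSelfSimilarity", "true"
        });
      }
    }

Capping --maxSimilaritiesPerRow, and pruning the vocabulary (stop-words, very rare terms) before vectorizing, are the kinds of data-reduction steps that make the biggest difference to runtime.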