If you are computing user-user similarity, the number of items matters much less than the number of users. With 1M users there are roughly 500 billion user-user pairs (1,000,000 x 999,999 / 2 ≈ 5 x 10^11), and computing that many similarities is going to take a long time no matter what you do.
CSV is the input for both the Hadoop-based and non-Hadoop-based implementations. The Hadoop-based implementation converts it to vectors; you can also inject vectors directly there if you want. The non-Hadoop code, however, needs CSV.

There are a number of tuning parameters in the Hadoop implementation (and similar, but different, hooks in the non-Hadoop implementation) that let you prune data at several stages. That is the most important thing for speed, and yes, removing stop-words falls into that category. Tuning the JVM helps, but only marginally. Adding Hadoop nodes helps roughly linearly. A rough sketch of the pruning options is below the quoted message.

On Tue, Sep 18, 2012 at 1:49 PM, yamo93 <[email protected]> wrote:
> Hi,
>
> I have 30.000 items and the computation takes more than 2h on a
> pseudo-cluster, which is too long in my case.
>
> I think of some ways to reduce the execution time of RowSimilarityJob and
> I wonder if some of you have implemented them and how, or explored other
> ways.
> 1. tune the JVM
> 2. developing an in memory implementation (i.e. without hadoop)
> 3. reduce the size of the matrix (by removing those which have no words in
> common, for example)
> 4. run on real hadoop cluster with several nodes (does anyone have an idea
> of the number of nodes to make it interesting)
>
> Thanks for your help,
> Yann.
>
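For illustration only, here is a minimal, untested sketch of driving RowSimilarityJob with aggressive pruning. The option names and the class's package have moved around between Mahout releases, so check `mahout rowsimilarity --help` (or the source of your release) before relying on any of them; the input/output paths and the numbers are just placeholders.

// Sketch only: option names and the package of RowSimilarityJob differ
// across Mahout releases; verify against your version before using.
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

public class PrunedRowSimilarity {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "/tmp/item-vectors",        // placeholder: SequenceFile of row vectors
        "--output", "/tmp/row-similarities",   // placeholder output path
        "--numberOfColumns", "30000",          // dimensionality of the input vectors
        "--similarityClassname", "SIMILARITY_COSINE",
        "--maxSimilaritiesPerRow", "50",       // keep only the top 50 similar rows per row
        "--threshold", "0.1",                  // drop low similarities early
        "--excludeSelfSimilarity", "true"
    });
  }
}

The same options can be passed on the command line through the `mahout rowsimilarity` driver script; the programmatic form above is just easier to show in one place.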
