Hi Sean,
I need to compute document similarity across 30,000 docs and, more
precisely, to find the n most similar docs.
As written above, I use RowSimilarityJob, but it takes 2h+ to compute.
Seb suggested using an item-item recommender with (term, document,
tf-idf) triples as input.
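To make that suggestion concrete, here is a rough, untested sketch using Mahout's non-Hadoop Taste API, with terms standing in as "users" and documents as "items" and the tf-idf weight as the preference value; the file name, IDs and the choice of cosine similarity below are my own assumptions:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MostSimilarDocs {
  public static void main(String[] args) throws Exception {
    // CSV rows of termID,docID,tfidf: terms play the role of "users",
    // documents the role of "items", the tf-idf weight is the preference.
    // Term and doc IDs must already be mapped to longs.
    DataModel model = new FileDataModel(new File("term_doc_tfidf.csv"));

    ItemSimilarity similarity = new UncenteredCosineSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    long docID = 42L;   // hypothetical document ID
    int n = 10;         // how many similar docs to fetch
    List<RecommendedItem> similarDocs = recommender.mostSimilarItems(docID, n);
    for (RecommendedItem doc : similarDocs) {
      System.out.println(doc.getItemID() + "\t" + doc.getValue());
    }
  }
}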
Rgds,
Y.
On 09/18/2012 04:21 PM, Sean Owen wrote:
If you are computing user-user similarity, the number of items is not
nearly as important as the number of users. If you have 1M users, then
computing about 500 billion user-user similarities is going to take a long
time no matter what.
CSV is the input for both the Hadoop-based and non-Hadoop-based
implementations. The Hadoop-based implementation converts the CSV to
vectors, and there you can also inject vectors directly if you want. The
non-Hadoop code, however, needs CSV.
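For the "inject vectors directly" route, here is a minimal sketch of writing the SequenceFile of IntWritable row IDs to VectorWritable rows that RowSimilarityJob reads, one sparse tf-idf vector per document; the path, dimensions and weights are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteDocVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("doc-vectors/part-r-00000");   // placeholder path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, VectorWritable.class);
    try {
      int numTerms = 50000;                 // vocabulary size (made up)
      int docId = 0;                        // one vector per document
      Vector row = new RandomAccessSparseVector(numTerms);
      row.setQuick(123, 0.42);              // term index -> tf-idf weight
      row.setQuick(4567, 0.17);
      writer.append(new IntWritable(docId), new VectorWritable(row));
    } finally {
      writer.close();
    }
  }
}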
There are a number of tuning params in the Hadoop implementation (and
similar but different hooks in the non-Hadoop implementation) that let you
prune data at several stages. This is the most important thing for speed.
Yes, removing stop-words falls in that category.
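To illustrate the kind of pruning options meant here, this is a sketch of driving RowSimilarityJob programmatically with a cap on similarities per row and a similarity threshold; the exact option names and sensible values depend on your Mahout version and data:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

public class RunRowSimilarity {
  public static void main(String[] args) throws Exception {
    // Keep only the top 50 similarities per row and drop pairs below a
    // similarity threshold; paths and numbers are placeholders.
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "doc-vectors",
        "--output", "doc-similarities",
        "--numberOfColumns", "50000",          // vocabulary size (made up)
        "--similarityClassname", "SIMILARITY_COSINE",
        "--maxSimilaritiesPerRow", "50",
        "--excludeSelfSimilarity", "true",
        "--threshold", "0.1"
    });
  }
}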
Tuning the JVM helps, but only marginally. Adding Hadoop nodes helps roughly linearly.
On Tue, Sep 18, 2012 at 1:49 PM, yamo93 <[email protected]> wrote:
Hi,
I have 30,000 items and the computation takes more than 2h on a
pseudo-distributed cluster, which is too long in my case.
I can think of some ways to reduce the execution time of RowSimilarityJob,
and I wonder whether any of you have implemented them (and how), or explored
other approaches:
1. tune the JVM
2. develop an in-memory implementation (i.e. without Hadoop)
3. reduce the size of the matrix (for example, by removing pairs that have no
words in common)
4. run on a real Hadoop cluster with several nodes (does anyone have an idea
of how many nodes it takes to make this worthwhile?)
Thanks for your help,
Yann.