If you are computing user-user similarity, the number of items matters much less than the number of users. With 1M users there are roughly 500 billion user-user pairs (1,000,000 x 999,999 / 2 ≈ 5 x 10^11), and computing that many similarities is going to take a long time no matter what you do.
CSV is the input for both the Hadoop-based and non-Hadoop-based implementations. The Hadoop-based implementation converts it to vectors; you can also inject vectors directly there if you want. The non-Hadoop code, however, needs CSV.

There are a number of tuning parameters in the Hadoop implementation (and similar, but different, hooks in the non-Hadoop implementation) that let you prune data at several stages. That is the most important thing for speed, and yes, removing stop-words falls into that category. Tuning the JVM helps, but only marginally. Adding Hadoop nodes helps roughly linearly. A rough sketch of the pruning options is below the quoted message.

On Tue, Sep 18, 2012 at 1:49 PM, yamo93 <[email protected]> wrote:
> Hi,
>
> I have 30.000 items and the computation takes more than 2h on a
> pseudo-cluster, which is too long in my case.
>
> I think of some ways to reduce the execution time of RowSimilarityJob and
> I wonder if some of you have implemented them and how, or explored other
> ways.
> 1. tune the JVM
> 2. developing an in memory implementation (i.e. without hadoop)
> 3. reduce the size of the matrix (by removing those which have no words in
> common, for example)
> 4. run on real hadoop cluster with several nodes (does anyone have an idea
> of the number of nodes to make it interesting)
>
> Thanks for your help,
> Yann.
>
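For illustration only, here is a minimal, untested sketch of driving RowSimilarityJob with aggressive pruning. The option names and the class's package have moved around between Mahout releases, so check `mahout rowsimilarity --help` (or the source of your release) before relying on any of them; the input/output paths and the numbers are just placeholders.

// Sketch only: option names and the package of RowSimilarityJob differ
// across Mahout releases; verify against your version before using.
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

public class PrunedRowSimilarity {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "/tmp/item-vectors",        // placeholder: SequenceFile of row vectors
        "--output", "/tmp/row-similarities",   // placeholder output path
        "--numberOfColumns", "30000",          // dimensionality of the input vectors
        "--similarityClassname", "SIMILARITY_COSINE",
        "--maxSimilaritiesPerRow", "50",       // keep only the top 50 similar rows per row
        "--threshold", "0.1",                  // drop low similarities early
        "--excludeSelfSimilarity", "true"
    });
  }
}

The same options can be passed on the command line through the `mahout rowsimilarity` driver script; the programmatic form above is just easier to show in one place.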
