Yann, I have spent a great deal of time working with RowSimilarityJob recently and can offer you some quick tips.
1. The density of the matrix is the biggest factor: use sparse vectors if you can. It will reduce the running time.
2. Set a larger number of reducers to decrease the processing time per node. I have had job failures when a single node cannot merge the results.
3. If you are using the output from seq2sparse, the tfidf vectors can be a significantly less dense input, depending on the parameters you used to run seq2sparse.

Using these suggestions, we got a job that had taken many hours down to well under an hour on a large cluster.

Anna

On Tue, Sep 18, 2012 at 8:49 AM, yamo93 <[email protected]> wrote:
> Hi,
>
> I have 30.000 items and the computation takes more than 2h on a
> pseudo-cluster, which is too long in my case.
>
> I am thinking of some ways to reduce the execution time of
> RowSimilarityJob, and I wonder if some of you have implemented them and
> how, or have explored other ways:
> 1. tune the JVM
> 2. develop an in-memory implementation (i.e. without Hadoop)
> 3. reduce the size of the matrix (by removing rows which have no words in
> common, for example)
> 4. run on a real Hadoop cluster with several nodes (does anyone have an
> idea of how many nodes it takes to make this worthwhile?)
>
> Thanks for your help,
> Yann.
>
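P.S. To make tips 2 and 3 concrete, here is a rough sketch of the kind of invocation I mean. The flag names match the Mahout 0.x RowSimilarityJob CLI as I remember it, and all paths and numbers are placeholders, so check them against your version before running:

```shell
# Sketch only -- paths, reducer count, and column count are placeholders.
# -Dmapred.reduce.tasks raises the reducer count (tip 2), and the input
# points at the tfidf vectors produced by seq2sparse (tip 3).
mahout rowsimilarity \
  -Dmapred.reduce.tasks=20 \
  --input  /path/to/seq2sparse-output/tfidf-vectors \
  --output /path/to/rowsimilarity-output \
  --numberOfColumns 50000 \
  --similarityClassname SIMILARITY_COSINE \
  --maxSimilaritiesPerRow 100
```

Lowering --maxSimilaritiesPerRow also cuts the amount of data the reducers have to merge, which helped with the single-node merge failures I mentioned.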
