Yann, I have spent a great deal of time working with RowSimilarityJob recently and can offer you some quick tips.
1. The density of the matrix is the biggest factor: use sparse vectors if you can. It will reduce the running time.
2. Set a larger number of reducers to decrease the processing time per node. I have had job failures when a single node cannot merge the results.
3. If you are using the output from seq2sparse, the tfidf vectors can be a significantly less dense input, depending on the parameters you used to run seq2sparse.

Using these suggestions, we got a job that had taken many hours down to well under an hour on a large cluster.

Anna

On Tue, Sep 18, 2012 at 8:49 AM, yamo93 <[email protected]> wrote:
> Hi,
>
> I have 30.000 items and the computation takes more than 2h on a
> pseudo-cluster, which is too long in my case.
>
> I am thinking of some ways to reduce the execution time of
> RowSimilarityJob, and I wonder if some of you have implemented them and
> how, or have explored other ways:
> 1. tune the JVM
> 2. develop an in-memory implementation (i.e. without Hadoop)
> 3. reduce the size of the matrix (by removing rows which have no words in
> common, for example)
> 4. run on a real Hadoop cluster with several nodes (does anyone have an
> idea of how many nodes it takes to make this worthwhile?)
>
> Thanks for your help,
> Yann.
>
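P.S. To make tips 2 and 3 concrete, here is a rough sketch of the kind of invocation I mean. The flag names match the Mahout 0.x RowSimilarityJob CLI as I remember it, and all paths and numbers are placeholders, so check them against your version before running:

```shell
# Sketch only -- paths, reducer count, and column count are placeholders.
# -Dmapred.reduce.tasks raises the reducer count (tip 2), and the input
# points at the tfidf vectors produced by seq2sparse (tip 3).
mahout rowsimilarity \
  -Dmapred.reduce.tasks=20 \
  --input  /path/to/seq2sparse-output/tfidf-vectors \
  --output /path/to/rowsimilarity-output \
  --numberOfColumns 50000 \
  --similarityClassname SIMILARITY_COSINE \
  --maxSimilaritiesPerRow 100
```

Lowering --maxSimilaritiesPerRow also cuts the amount of data the reducers have to merge, which helped with the single-node merge failures I mentioned.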
