Hello everyone,

I have been using the Mahout library to implement iteratively reweighted
least squares (IRLS) for some analysis of biological microarrays. In my
setup, I have to run IRLS about 54,000 times on different probesets, and
each job involves multiplying large sparse matrices and solving linear
systems via conjugate gradient. I am using Mahout because my matrices
are sparse but large (500,000 x 30,000).

The problem I have now is that each IRLS job takes about 48 hours to
complete on a small cluster of 75 nodes with 16 cores each. I can run
about 200 jobs concurrently, but that does not help much considering I
have 54,000 jobs to process.

Using the Mahout library, the matrix multiplication is greatly sped up
(less than 5 minutes per product). However, solving a linear system with
conjugate gradient is time-consuming and accounts for the bulk of the
computation. Since IRLS requires several iterations, that cost is
multiplied by however many iterations I run.
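For reference, the per-iteration structure each job performs is roughly the following. This is a minimal sketch in Python/SciPy rather than Mahout, just to show where the conjugate-gradient solve sits inside the loop; the function name `irls`, the L1-style residual weighting, and all parameters are illustrative assumptions, not my actual code:

```python
import numpy as np
from scipy.sparse.linalg import cg

def irls(X, y, n_iter=10, eps=1e-6):
    """Sketch of iteratively reweighted least squares.

    Each iteration solves the weighted normal equations
        (X^T W X) beta = X^T W y
    with conjugate gradient, where W is a diagonal weight matrix
    recomputed from the current residuals.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ beta                       # residuals (sparse @ dense -> dense)
        w = 1.0 / np.maximum(np.abs(r), eps)   # illustrative L1-style weights
        Xw = X.multiply(w[:, None])            # W X without materializing W
        A = X.T @ Xw                           # p x p system matrix (sparse product)
        b = X.T @ (w * y)
        beta, info = cg(A, b, x0=beta)         # the CG solve dominating my runtime
    return beta
```

So at every iteration there are two sparse matrix products plus one CG solve, which is why the CG step dominates once it is the slow part.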

Given these issues, I hope someone can help me find a solution. My main
concerns are:

1) Is Mahout the right approach for this computation? (I tried running
this in R, where the simple step of multiplying the matrices would take
days, when it fit in memory at all.)
2) When running IRLS, even with 200 jobs (or map tasks) active, the CPU
usage on each node barely goes above 5%. How can I make better use of
the CPUs?

Thank you for your help
Vincent
