With LSMR, you can pass in an abstract multiply function. This avoids the fill-in you get from forming X^T W X explicitly.
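To make that concrete, here is a minimal sketch of the idea. The NormalEquationsOperator name and its apply() method are made up for illustration, and it assumes Mahout's Matrix.times(Vector), Matrix.transpose(), and the element-wise Vector.times(Vector); check those signatures against your Mahout version. The solver calls apply() wherever it needs a product with X^T W X, so the 30,000 x 30,000 matrix is never materialized:

    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.Vector;

    // Applies (X^T W X) v as three sparse products instead of
    // materializing the roughly 20%-dense 30,000 x 30,000 product.
    public class NormalEquationsOperator {
      private final Matrix x;        // design matrix, ~1.4M non-zeros
      private final Matrix xt;       // X^T, cached once up front
      private final Vector weights;  // diagonal of W, ~0.5M non-zeros

      public NormalEquationsOperator(Matrix x, Vector weights) {
        this.x = x;
        this.xt = x.transpose();
        this.weights = weights;
      }

      public Vector apply(Vector v) {
        Vector xv = x.times(v);          // X v, length 500,000
        Vector wxv = xv.times(weights);  // W (X v), element-wise scale by diag(W)
        return xt.times(wxv);            // X^T (W (X v)), length 30,000
      }
    }

Each apply() only ever touches the ~1.9M stored non-zeros, which is where the numbers below come from.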
Thus, you would only need to store 1.4M + 0.5M = 1.9 million non-zeros. At about 12 bytes each, that is roughly 23MB (i.e. tiny). This trades memory for a bit of CPU time. Your other option is 20% x 30,000^2 = 20% x 900M elements = 180M non-zero elements, or about 2.2GB per matrix. Option 1 would let you have lots of slots on each machine. Option 2 would limit you to 10 or fewer live slots, thus limiting your ability to saturate the CPUs.

On Mon, Jun 20, 2011 at 11:25 PM, Vincent Xue <[email protected]> wrote:
> Hi Danny, thanks for your suggestion. I will take a look.
>
> To answer your questions, Ted:
>
> Each slave node has 32GB of memory. I will look into the LSMR
> implementation and hopefully it will prove to be faster than the
> conjugate gradient solver.
>
> For the matrix multiplication part of the algorithm, each matrix
> multiplied is either a design matrix "X" (1.4M non-zero elements) or a
> diagonal weight matrix "W" (0.5M non-zero elements). The input matrix
> to the conjugate gradient solver is the product X^T * W * X, resulting
> in a symmetric matrix that is about 20% non-zero.
>
> Vincent
>
> On Mon, Jun 20, 2011 at 9:21 PM, Ted Dunning <[email protected]> wrote:
> > Sounds like you should invert your loops.
> >
> > These sparse matrices are probably very reasonable for solving each
> > one on a single machine in memory. As such, take a look at the LSMR
> > implementation, which is a good implementation of a conjugate
> > gradient-like algorithm that plays nice with sparse data. It could
> > probably even be used in a re-weighting scheme where the matrix being
> > solved varies in each iteration (at a serious cost in
> > understandability for the algorithm).
> >
> > What I would suggest is that you define your mapper so that each
> > mapper solves one of your LS problems. You have 50,000 of them, so
> > that will provide plenty of parallelism. LSMR works out of core if
> > you have to do that, but it really sounds like you should be able to
> > handle several in memory at the same time.
> >
> > Can you say more about the memory on each node and how many non-zero
> > elements are in your matrices?
> >
> > On Mon, Jun 20, 2011 at 12:25 PM, Vincent Xue <[email protected]> wrote:
> >
> >> Hello everyone,
> >>
> >> I have been using the Mahout library to implement Iteratively
> >> Reweighted Least Squares (IRLS) for some analysis on biological
> >> microarrays. In my setup, I have to perform IRLS about 54,000 times
> >> on different probesets, and each job involves multiplying large
> >> sparse matrices and solving linear systems via conjugate gradient.
> >> I am using Mahout because my matrices are sparse but large
> >> (500,000 x 30,000).
> >>
> >> The problem I have now is that each IRLS job takes about 48 hours to
> >> complete, running on a small cluster of 75 nodes with 16 cores each.
> >> I can run about 200 jobs at the same time, but this is not very
> >> helpful considering I have 54,000 jobs to process.
> >>
> >> Using the Mahout library, the matrix multiplication is greatly sped
> >> up (less than 5 minutes for each product). However, solving a linear
> >> system using conjugate gradient is a time-consuming process and is
> >> the bulk of the computation. Considering that IRLS calls for several
> >> iterations, this problem is magnified by however many iterations I
> >> run it for.
> >>
> >> Considering these issues, I hope someone can help me find a
> >> solution. Below are several concerns:
> >>
> >> 1) Is using Mahout for this computation the correct approach? (I
> >> have tried running this in R, and the simple step of multiplying
> >> matrices would take days, if it could fit in memory.)
> >> 2) When running the IRLS, even with 200 jobs (or maps) running, the
> >> CPU usage on each node barely goes above 5%. How can I use more CPU?
> >>
> >> Thank you for your help.
> >> Vincent
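To make the loop inversion concrete, the job would look roughly like the sketch below. This is an illustration, not the actual job: the IrlsMapper name, the packing of each problem into a single MatrixWritable as [X | y], and the single LSMR pass standing in for the full IRLS re-weighting loop are all assumptions, and LSMR.solve()'s exact signature should be checked against your Mahout version.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.MatrixWritable;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;
    import org.apache.mahout.math.solver.LSMR;

    // One complete least-squares problem per map() call: 54,000 problems
    // become 54,000 independent map tasks, each solved in local memory.
    public class IrlsMapper extends Mapper<Text, MatrixWritable, Text, VectorWritable> {

      @Override
      protected void map(Text probesetId, MatrixWritable value, Context context)
          throws IOException, InterruptedException {
        Matrix augmented = value.get();  // assumed packed as [X | y]
        int n = augmented.columnSize() - 1;
        Matrix x = augmented.viewPart(0, augmented.rowSize(), 0, n);
        Vector y = augmented.viewColumn(n);

        // Stand-in for the full IRLS loop: re-weight and re-solve here.
        Vector beta = new LSMR().solve(x, y);

        context.write(probesetId, new VectorWritable(beta));
      }
    }

Since each problem is only tens of megabytes, a 16-core node can run many of these mappers at once, which is what should fix the 5% CPU utilization you are seeing.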
