With LSMR, you can pass in an abstract multiply function.  That lets you
keep X and W in their sparse forms and avoid the fill-in that comes from
forming X^T * W * X explicitly.
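
Roughly, the operator looks like this (a minimal sketch only: the class and
method names are hypothetical, and the Mahout Vector/Matrix calls are written
from memory, so check them against your version).  LSMR only ever needs the
products A*v and A^T*u for A = W^(1/2) * X, and both can be computed one
sparse factor at a time:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.Functions;

/**
 * Hypothetical matrix-free operator for the weighted LS problem
 *   min_b || W^(1/2) (X b - y) ||
 * Only X (~1.4M non-zeros) and diag(W) (~0.5M) are stored; the much
 * denser product X^T W X is never formed, so there is no fill-in.
 */
public class WeightedDesignOperator {
  private final Matrix x;      // 500,000 x 30,000 sparse design matrix
  private final Vector sqrtW;  // element-wise sqrt of diag(W)

  public WeightedDesignOperator(Matrix x, Vector wDiag) {
    this.x = x;
    this.sqrtW = wDiag.clone().assign(Functions.SQRT);
  }

  /** A v = W^(1/2) (X v): one sparse mat-vec plus a diagonal scaling. */
  public Vector times(Vector v) {
    Vector xv = new DenseVector(x.numRows());
    for (int i = 0; i < x.numRows(); i++) {
      xv.setQuick(i, x.viewRow(i).dot(v));
    }
    return xv.times(sqrtW);  // element-wise product with the sqrt weights
  }

  /** A^T u = X^T (W^(1/2) u), again without forming any product matrix. */
  public Vector timesTranspose(Vector u) {
    Vector wu = u.times(sqrtW);
    Vector result = new DenseVector(x.numCols());
    for (int i = 0; i < x.numRows(); i++) {
      // accumulate wu_i * (row i of X) into the result
      result.assign(x.viewRow(i).times(wu.getQuick(i)), Functions.PLUS);
    }
    return result;
  }
}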

Thus, you would only need to store 1.4 + 0.5 = 1.9 million non-zeros.  Each
one takes about 12 bytes, so this is about 23MB (i.e. tiny).  This trades
memory for a bit of CPU time.

Your other option is 20% x 30,000 ^ 2 = 20% x 900M elements = 180M non-zero
elements = about 2.2GB per matrix.

Option 1 would let you have lots of slots on each machine.  Option 2 would
limit you to 10 or fewer live slots (32GB divided by ~2.2GB per matrix, less
JVM overhead), thus limiting your ability to saturate the CPU.
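
For concreteness, here is that arithmetic as a tiny runnable sketch (the 12
bytes per non-zero and the 32GB per node are the figures from this thread;
the rest is rounding):

/** Back-of-the-envelope memory arithmetic from this thread. */
public class MemoryEstimate {
  public static void main(String[] args) {
    double bytesPerNonZero = 12.0;  // rough per-element cost in a sparse matrix
    double gb = 1e9;

    // Option 1: keep only X (1.4M non-zeros) and diag(W) (0.5M non-zeros).
    double option1Gb = (1.4e6 + 0.5e6) * bytesPerNonZero / gb;

    // Option 2: materialize X^T W X, a 30,000 x 30,000 matrix, ~20% non-zero.
    double option2Gb = 0.20 * 30000.0 * 30000.0 * bytesPerNonZero / gb;

    // Raw slot count on a 32GB node; real capacity is lower ("10 or fewer")
    // once JVM overhead and each job's other data are counted.
    System.out.printf("option 1: %.3f GB%n", option1Gb);                  // ~0.023 GB
    System.out.printf("option 2: %.2f GB%n", option2Gb);                  // ~2.16 GB
    System.out.printf("raw slots on 32GB: %d%n", (int) (32 / option2Gb)); // 14
  }
}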

On Mon, Jun 20, 2011 at 11:25 PM, Vincent Xue <[email protected]> wrote:

> Hi Danny, thanks for your suggestion. I will take a look.
>
> To answer your questions Ted,
>
> Each slave-node has 32GB of memory. I will look into the LSMR
> implementation and hopefully it will prove to be faster than the
> conjugate gradient solver.
>
> For the matrix multiplication part of the algorithm, each matrix
> multiplied is either a design matrix "X" (1.4M non-zero elements) or a
> diagonal weight matrix "W" (0.5M non-zero elements). The input matrix
> to the conjugate gradient solver is the product X^T * W * X, which is
> a symmetric matrix with about 20% non-zero elements.
>
> Vincent
>
> On Mon, Jun 20, 2011 at 9:21 PM, Ted Dunning <[email protected]>
> wrote:
> > Sounds like you should invert your loops.
> >
> > These sparse matrices are probably very reasonable for solving each one
> > on a single machine in memory. As such, take a look at the LSMR
> > implementation, which is a good implementation of a conjugate
> > gradient-like algorithm that plays nice with sparse data. It could
> > probably even be used in a re-weighting scheme where the matrix being
> > solved varies in each iteration (at a serious cost in understanding for
> > the algorithm).
> >
> > What I would suggest is that you define your mapper so that each mapper
> > solves one of your LS problems. You have 50,000 of them and so that
> > will provide plenty of parallelism. LSMR works out of core if you have
> > to do that, but it really sounds like you should be able to handle
> > several in memory at the same time.
> >
> > Can you say more about your memory on each node and how many non-zero
> > elements are in your matrices?
> >
> > On Mon, Jun 20, 2011 at 12:25 PM, Vincent Xue <[email protected]> wrote:
> >
> >> Hello everyone,
> >>
> >> I have been using the Mahout library to implement 'Iteratively
> >> Reweighted Least Squares' (IRLS) for some analysis on biological
> >> microarrays. In my setup, I have to perform IRLS about 54,000 times
> >> on different probesets, and each job involves multiplying large
> >> sparse matrices and solving linear systems via conjugate gradient. I
> >> am using Mahout because my matrices are sparse but large (500,000 x
> >> 30,000).
> >>
> >> The problem I have now is that each IRLS job takes about 48 hours to
> >> complete, running on a small cluster of 75 nodes with 16 cores each.
> >> I can run about 200 jobs at the same time, but this is not very
> >> helpful considering I have 54,000 jobs to process.
> >>
> >> Using the Mahout library, the matrix multiplication is greatly sped
> >> up (less than 5 minutes for each product). However, solving a linear
> >> system using conjugate gradient is a time-consuming process and is
> >> the bulk of the computation. Since IRLS calls for several iterations,
> >> this cost is multiplied by however many iterations I run.
> >>
> >> Given these issues, I hope someone can help me find a solution. My
> >> main concerns are:
> >>
> >> 1) Is using Mahout for this computation the correct approach? (I
> >> have tried running this in R, and the simple step of multiplying the
> >> matrices would take days, if they could fit in memory at all.)
> >> 2) When running the IRLS, even with 200 jobs (or maps) running, the
> >> CPU usage on each node barely goes above 5%. How can I use more CPU?
> >>
> >> Thank you for your help
> >> Vincent
> >>
> >
>
