Hi Jake, Thank you for the detailed explanation; seems like a very clever way to distribute the matrix multiplication process.
I tried running the transpose job on my T matrix using the following options: bin/mahout transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix --numRows 6076937 --numCols 20444 I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3 reducers per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed within a few minutes but then all of my 24 reducers failed near the 70% mark with error "Task attempt_201103201840_0004_r_000023_0 failed to report status for 601 seconds. Killing!" The data node servers stayed at a healthy load avg below 4 and never paged ... So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes instead of the default 10 minutes and it failed again. The reduce code for the TransposeJob looks straight-forward, so I'm going to have to dig in deeper to figure out what's causing the problem. Cheers, Tim On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <jake.man...@gmail.com> wrote: > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabd...@gmail.com>wrote: > >> Regarding Jake's comment: " ... you need to run the RowIdJob on these >> tfidf-vectors first ..." >> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD >> eigenvector matrix E is p x q (p=87, q=20444). > > > Ok, so to help you understand what's going on here, I'm going to go into > a little of the inner details of what's going on here. > > You are right, you have a matrix T, with 6,076,937 rows, and each row has > 20,444 columns (most of which are zero, and it's represented sparsely, but > still, they live in a vector space of dimension 20,444). Similarly, you've > made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors) and > each of these rows has exactly 20,444 columns (and most likely, they'll > all be nonzero, because eigenvectors have no reason to be sparse). > > In particular, T and E are represented as *lists of rows*, each row is a > vector of dimension 20,444. T has six million of these rows, and E has > only 87. > > >> So to multiply these two >> matrices, I need to transpose E so that the number of columns in T equals >> the number of rows in E (i.e. E^T is q x p) the result of the matrixmult >> would give me an m x p matrix (m=6076937, p=87). >> > > You're exactly right that you want to multiply T by E^T, because you can't > compute T * E. > > The way it turns out in practice, computing the matrix product of two > matrices as a map-reduce job is efficiently done as a map-side join on > two row-based matrices with the same number of rows, and the columns > are the ones which are different. In particular, if you take a matrix X > which > is represented as a set of numRowsX rows, each of which has numColsX, > and another matrix with numRowsY == numRowsX, each of which has > numColsY (!= numColsX), then by summing the outer-products of each > of the numRowsX pairs of vectors, you get a matrix of with numRowsZ == > numColsX, and numColsZ == numColsY (if you instead take the reverse > outer product of the vector pairs, you can end up with the transpose of > this > final result, with numRowsZ == numColsY, and numColsZ == numColsX). > > Unfortunately, you have a pair of matrices which have different numbers > of rows, and the same number of columns, but you want a pair of matrices > with the same number of rows and (possibly) different numbers of columns. > > >> So I tried to run matrixmult with: > > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444 --numColsB >> 87 \ >> --inputPathA >> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \ >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244 > > >> (--inputPathA points to the output of the rowid job) >> >> This results in: >> Exception in thread "main" org.apache.mahout.math.CardinalityException: >> > > >> In the code, I see the test that row counts must be identical for the two >> >> input matrices. Thus, it seems like the job requires me to transpose this >> large matrix, just to re-transpose it back to it's original form during >> the >> multiplication? Or have I missed something crucial again? >> > > You actually need to transpose the input matrix (T), and then re-run with > T^t > and E^t (the latter you apparently already have created). > > We should really rename the "matrixmultiply" job to be called > "transposematrixmultiply", because that's what it really does. > > -jake >