On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabd...@gmail.com> wrote:

> Regarding Jake's comment: " ... you need to run the RowIdJob on these
> tfidf-vectors first ..."
>
> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD
> eigenvector matrix E is p x q (p=87, q=20444).


Ok, so to help you understand what's going on here, I'm going to go into
a little of the inner details of how this works.

You are right, you have a matrix T, with 6,076,937 rows, and each row has
20,444 columns (most of which are zero, and it's represented sparsely, but
still, they live in a vector space of dimension 20,444).  Similarly, you've
made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors) and
each of these rows has exactly 20,444 columns (and most likely, they'll
all be nonzero, because eigenvectors have no reason to be sparse).

In particular, T and E are represented as *lists of rows*, each row is a
vector of dimension 20,444.  T has six million of these rows, and E has
only 87.


> So to multiply these two
> matrices, I need to transpose E so that the number of columns in T equals
> the number of rows in E (i.e. E^T is q x p) the result of the matrixmult
> would give me an m x p matrix (m=6076937, p=87).
>

You're exactly right that you want to multiply T by E^T, because you can't
compute T * E.

In practice, the matrix product of two matrices is computed efficiently
as a map-reduce job by doing a map-side join on two row-based matrices
with the same number of rows; only their column counts may differ.  In
particular, take a matrix X represented as a set of numRowsX rows, each
of which has numColsX entries, and another matrix Y with numRowsY ==
numRowsX rows, each of which has numColsY (possibly != numColsX) entries.
Summing the outer products of each of the numRowsX pairs of rows gives
a matrix Z = X^T * Y with numRowsZ == numColsX and numColsZ == numColsY
(if you instead take the reverse outer product of the row pairs, you end
up with the transpose of this result, with numRowsZ == numColsY and
numColsZ == numColsX).
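
Here is a small sketch of that outer-product sum in plain Python. It is
only an illustration of the arithmetic, not Mahout's actual map-reduce
code, and the function name transpose_times is made up for this example:

```python
def transpose_times(x_rows, y_rows):
    """Return X^T * Y as a list of rows, given X and Y as lists of rows.

    Row i of X is paired with row i of Y (the "join"), and the outer
    products of all the pairs are summed into the result.
    """
    if len(x_rows) != len(y_rows):
        # Same kind of check that rejects inputs with mismatched row counts.
        raise ValueError("row counts differ: %d vs %d"
                         % (len(x_rows), len(y_rows)))
    num_cols_x = len(x_rows[0])
    num_cols_y = len(y_rows[0])
    # Result has numColsX rows and numColsY columns.
    z = [[0] * num_cols_y for _ in range(num_cols_x)]
    for x_row, y_row in zip(x_rows, y_rows):
        for i, xi in enumerate(x_row):
            if xi == 0:
                continue  # zero entries of a sparse row contribute nothing
            for j, yj in enumerate(y_row):
                z[i][j] += xi * yj
    return z


X = [[1, 2, 0],
     [0, 1, 3]]   # 2 rows, 3 columns
Y = [[4, 5],
     [6, 7]]      # 2 rows, 2 columns
Z = transpose_times(X, Y)  # 3 rows, 2 columns
print(Z)
```

Note that the result has as many rows as X has columns, which is why
the two inputs have to agree on their row count rather than on the
usual "columns of the first == rows of the second".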

Unfortunately, you have a pair of matrices which have different numbers
of rows, and the same number of columns, but you want a pair of matrices
with the same number of rows and (possibly) different numbers of columns.


> So I tried to run matrixmult with:

> matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444 --numColsB 87 \
> --inputPathA /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244


> (--inputPathA points to the output of the rowid job)
>
> This results in:
> Exception in thread "main" org.apache.mahout.math.CardinalityException:
>


> In the code, I see the test that row counts must be identical for the two
> input matrices. Thus, it seems like the job requires me to transpose this
> large matrix, just to re-transpose it back to its original form during the
> multiplication? Or have I missed something crucial again?
>

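You actually need to transpose the input matrix (T), and then re-run with
T^T and E^T (the latter you apparently already have created).  Both of
those have 20,444 rows, so the row-count check passes, and because the
job computes A^T * B, the output is (T^T)^T * E^T = T * E^T, the
6,076,937 x 87 matrix you're after.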
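
To see on a toy example why handing the job the two transposes gives
the projection you want, here is a sketch in plain Python. The helper
names and the tiny matrices are invented for illustration, and none of
this is Mahout's API:

```python
def transpose(rows):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*rows)]


def transpose_times(a_rows, b_rows):
    """Compute A^T * B by summing outer products of paired rows."""
    if len(a_rows) != len(b_rows):
        raise ValueError("row counts differ")
    z = [[0] * len(b_rows[0]) for _ in range(len(a_rows[0]))]
    for a_row, b_row in zip(a_rows, b_rows):
        for i, ai in enumerate(a_row):
            for j, bj in enumerate(b_row):
                z[i][j] += ai * bj
    return z


# Toy stand-ins: T is 4 documents x 3 terms, E is 2 eigenvectors x 3 terms.
T = [[1, 0, 2],
     [0, 3, 0],
     [4, 0, 0],
     [0, 5, 6]]
E = [[1, 0, 1],
     [0, 1, 1]]

# T and E have different row counts (4 vs 2), so they cannot be paired up
# directly.  T^T and E^T both have 3 rows, and (T^T)^T * E^T == T * E^T.
projected = transpose_times(transpose(T), transpose(E))

# Check against the ordinary definition of T * E^T: each entry is the dot
# product of a row of T with a row of E.
expected = [[sum(t * e for t, e in zip(t_row, e_row)) for e_row in E]
            for t_row in T]
print(projected == expected)
```

The result has one row per row of T and one column per eigenvector,
which is the m x p shape described above.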

We should really rename the "matrixmultiply" job to be called
"transposematrixmultiply", because that's what it really does.

  -jake
