Hi! I got exactly the same error when running alternating least squares. You need to setup the task timeout and expiry interval timeouts. Exact details are found in my blog: http://bickson.blogspot.com/2011/03/tunning-hadoop-configuration-for-high.html
Best, - Danny Bickson On Sun, Mar 20, 2011 at 8:27 PM, Timothy Potter <thelabd...@gmail.com>wrote: > Hi Jake, > > Thank you for the detailed explanation; seems like a very clever way to > distribute the matrix multiplication process. > > I tried running the transpose job on my T matrix using the following > options: > > bin/mahout transpose -i > /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix > --numRows 6076937 --numCols 20444 > > I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3 > reducers > per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed > within a few minutes but then all of my 24 reducers failed near the 70% > mark > with error "Task attempt_201103201840_0004_r_000023_0 failed to report > status for 601 seconds. Killing!" The data node servers stayed at a healthy > load avg below 4 and never paged ... > > So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes > instead of the default 10 minutes and it failed again. The reduce code for > the TransposeJob looks straight-forward, so I'm going to have to dig in > deeper to figure out what's causing the problem. > > Cheers, > Tim > > On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <jake.man...@gmail.com> > wrote: > > > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabd...@gmail.com > >wrote: > > > >> Regarding Jake's comment: " ... you need to run the RowIdJob on these > >> tfidf-vectors first ..." > >> > >> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD > >> eigenvector matrix E is p x q (p=87, q=20444). > > > > > > Ok, so to help you understand what's going on here, I'm going to go into > > a little of the inner details of what's going on here. > > > > You are right, you have a matrix T, with 6,076,937 rows, and each row has > > 20,444 columns (most of which are zero, and it's represented sparsely, > but > > still, they live in a vector space of dimension 20,444). Similarly, > you've > > made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors) and > > each of these rows has exactly 20,444 columns (and most likely, they'll > > all be nonzero, because eigenvectors have no reason to be sparse). > > > > In particular, T and E are represented as *lists of rows*, each row is a > > vector of dimension 20,444. T has six million of these rows, and E has > > only 87. > > > > > >> So to multiply these two > >> matrices, I need to transpose E so that the number of columns in T > equals > >> the number of rows in E (i.e. E^T is q x p) the result of the matrixmult > >> would give me an m x p matrix (m=6076937, p=87). > >> > > > > You're exactly right that you want to multiply T by E^T, because you > can't > > compute T * E. > > > > The way it turns out in practice, computing the matrix product of two > > matrices as a map-reduce job is efficiently done as a map-side join on > > two row-based matrices with the same number of rows, and the columns > > are the ones which are different. In particular, if you take a matrix X > > which > > is represented as a set of numRowsX rows, each of which has numColsX, > > and another matrix with numRowsY == numRowsX, each of which has > > numColsY (!= numColsX), then by summing the outer-products of each > > of the numRowsX pairs of vectors, you get a matrix of with numRowsZ == > > numColsX, and numColsZ == numColsY (if you instead take the reverse > > outer product of the vector pairs, you can end up with the transpose of > > this > > final result, with numRowsZ == numColsY, and numColsZ == numColsX). > > > > Unfortunately, you have a pair of matrices which have different numbers > > of rows, and the same number of columns, but you want a pair of matrices > > with the same number of rows and (possibly) different numbers of columns. > > > > > >> So I tried to run matrixmult with: > > > > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444 > --numColsB > >> 87 \ > >> --inputPathA > >> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \ > >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244 > > > > > >> (--inputPathA points to the output of the rowid job) > >> > >> This results in: > >> Exception in thread "main" org.apache.mahout.math.CardinalityException: > >> > > > > > >> In the code, I see the test that row counts must be identical for the > two > >> > >> input matrices. Thus, it seems like the job requires me to transpose > this > >> large matrix, just to re-transpose it back to it's original form during > >> the > >> multiplication? Or have I missed something crucial again? > >> > > > > You actually need to transpose the input matrix (T), and then re-run with > > T^t > > and E^t (the latter you apparently already have created). > > > > We should really rename the "matrixmultiply" job to be called > > "transposematrixmultiply", because that's what it really does. > > > > -jake > > >