This sort could probably be done with a radix sort to make it truly linear as well.
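For concreteness, a linear-time sort here might look like the sketch below: an LSD radix sort over the (non-negative) int indices of a sparse vector, carrying the parallel values array along, so four byte-wide counting passes replace the comparison sort. This is only an illustration, not code from Mahout:

    // Sort parallel (index, value) arrays by index in O(4 * N) using a
    // stable least-significant-byte-first radix sort. Assumes indices >= 0.
    static void radixSortByIndex(int[] idx, double[] val) {
      int n = idx.length;
      int[] idxTmp = new int[n];
      double[] valTmp = new double[n];
      for (int shift = 0; shift < 32; shift += 8) {
        int[] count = new int[257];
        for (int i = 0; i < n; i++) {
          count[((idx[i] >>> shift) & 0xFF) + 1]++;   // histogram of bytes
        }
        for (int b = 0; b < 256; b++) {
          count[b + 1] += count[b];                   // prefix sums give bucket starts
        }
        for (int i = 0; i < n; i++) {
          int b = (idx[i] >>> shift) & 0xFF;
          idxTmp[count[b]] = idx[i];                  // stable scatter into buckets
          valTmp[count[b]] = val[i];
          count[b]++;
        }
        System.arraycopy(idxTmp, 0, idx, 0, n);
        System.arraycopy(valTmp, 0, val, 0, n);
      }
    }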
I suspect previously unsuspected n^2 behavior. That behavior now officially
loses its unsuspected nature (whether it exists or not).

On Mon, Mar 28, 2011 at 10:27 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> Hmm... a sort should be fine (N log N is the new linear!). What I'm
> wondering is whether it's not something more insidious: like incrementally
> building up the SequentialAccessSparseVector one element at a time (or
> some such N^2 operation - N^2 is not the new anything, it's still N^2),
> which isn't noticed when it's only called rarely (at the end of the
> reduce, once per vector) and N is usually 100's to 1000's, not millions.
>
> On Tue, Mar 29, 2011 at 5:14 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> The issue here is almost certainly that the sort implied by this
>> conversion is only reasonable with a small number of non-zero elements.
>> Special-casing a relatively dense case is probably worthwhile.
>>
>> On Mon, Mar 28, 2011 at 10:04 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>
>>> Hi Jake,
>>>
>>> I've tracked down the bottleneck in the TransposeJob's reducer to the
>>> line where Mahout creates a new SequentialAccessSparseVector from a
>>> RandomAccessSparseVector after the while loop completes:
>>>
>>>   SequentialAccessSparseVector outVector =
>>>       new SequentialAccessSparseVector(tmp);
>>>
>>> For high-frequency terms (some of which occur over ~1M times in my
>>> data), the code to create a SequentialAccessSparseVector from a
>>> RandomAccessSparseVector bogs down completely. Of course, I should
>>> probably get rid of the terms that occur this frequently, and it seems
>>> I could keep increasing mapred.task.timeout, but as you said, the
>>> "transpose job is totally trivial", so there may be a bigger issue here
>>> that is worth digging into further. I think it has something to do with
>>> OrderedIntDoubleMapping, but I haven't had enough time to devote to
>>> this issue to know for sure ... I'll keep at it though.
>>>
>>> Cheers,
>>> Tim
>>>
>>> On Mon, Mar 21, 2011 at 6:34 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>>>
>>>> On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>>>
>>>>> Hi Jake,
>>>>>
>>>>> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
>>>>> reducers per node and mapred.child.java.opts=-Xmx4096m. The map
>>>>> tasks completed within a few minutes, but then all 24 of my reducers
>>>>> failed near the 70% mark with the error "Task
>>>>> attempt_201103201840_0004_r_000023_0 failed to report status for 601
>>>>> seconds. Killing!" The data node servers stayed at a healthy load
>>>>> average below 4 and never paged ...
>>>>
>>>> Weird. The transpose job is totally trivial, but it may be that the
>>>> vectors you are producing are abnormally large. In fact, if this came
>>>> from text, this is probably the case: very common words (which aren't
>>>> stop words) occur in most of your documents, and so those rows are
>>>> very, very large. Might they be larger than your memory? No: 6 million
>>>> entries, each holding an int and a float when represented sparsely
>>>> (sadly, in this case), is still less than 100MB, so that's not it.
>>>>
>>>> BTW: you need to make sure you use the same number of reducers as when
>>>> you made your eigenvector matrix, as that's a requirement for a
>>>> map-side join.
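To make the two build patterns in this exchange concrete, here is a minimal sketch against the org.apache.mahout.math classes named above (the wrapping class BuildPatterns is hypothetical, and the costs in the comments state the suspicion under discussion, not measurements):

    import java.util.Random;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class BuildPatterns {
      public static void main(String[] args) {
        int cardinality = 6076937;  // dimension from Tim's matrix T
        int nonZeros = 1000000;     // ~1M occurrences of a common term
        Random rand = new Random(42);

        // Suspected anti-pattern: a SequentialAccessSparseVector keeps its
        // parallel index/value arrays sorted, so each out-of-order set()
        // can shift O(N) entries, O(N^2) over N inserts. (With N this
        // large, don't actually run this loop.)
        Vector slow = new SequentialAccessSparseVector(cardinality);
        for (int i = 0; i < nonZeros; i++) {
          slow.set(rand.nextInt(cardinality), 1.0);
        }

        // What the TransposeJob reducer actually does: accumulate into a
        // hashed RandomAccessSparseVector (cheap per set()), then convert
        // once at the end. The conversion implies a sort of the indices,
        // which is the very step Tim's profiling points at.
        Vector tmp = new RandomAccessSparseVector(cardinality);
        for (int i = 0; i < nonZeros; i++) {
          tmp.set(rand.nextInt(cardinality), 1.0);
        }
        Vector out = new SequentialAccessSparseVector(tmp);
        System.out.println(out.getNumNondefaultElements());
      }
    }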
That same-number-of-reducers requirement should probably be documented!

>>>>> So I increased the mapred.task.timeout Hadoop parameter to 20 minutes
>>>>> instead of the default 10 minutes, and it failed again. The reduce
>>>>> code for the TransposeJob looks straightforward, so I'm going to have
>>>>> to dig in deeper to figure out what's causing the problem.
>>>>
>>>> Yeah, if you can see where it's timing out, that would be great. Not
>>>> sure why that task should ever take very long.
>>>>
>>>>   -jake
>>>>
>>>>> Cheers,
>>>>> Tim
>>>>>
>>>>> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>>>>>
>>>>>> On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabd...@gmail.com> wrote:
>>>>>>
>>>>>>> Regarding Jake's comment: " ... you need to run the RowIdJob on
>>>>>>> these tfidf-vectors first ..."
>>>>>>>
>>>>>>> I did this and now have an m x n matrix T (m=6076937, n=20444). My
>>>>>>> SVD eigenvector matrix E is p x q (p=87, q=20444).
>>>>>>
>>>>>> Ok, so to help you understand what's going on, I'm going to go into
>>>>>> a little of the inner detail here.
>>>>>>
>>>>>> You are right: you have a matrix T with 6,076,937 rows, and each row
>>>>>> has 20,444 columns (most of which are zero, and it's represented
>>>>>> sparsely, but still, the rows live in a vector space of dimension
>>>>>> 20,444). Similarly, you've made an eigenvector matrix E, which has
>>>>>> 87 rows (i.e. 87 eigenvectors), and each of these rows has exactly
>>>>>> 20,444 columns (most likely all nonzero, because eigenvectors have
>>>>>> no reason to be sparse).
>>>>>>
>>>>>> In particular, T and E are represented as *lists of rows*, where
>>>>>> each row is a vector of dimension 20,444. T has six million of these
>>>>>> rows, and E has only 87.
>>>>>>
>>>>>>> So to multiply these two matrices, I need to transpose E so that
>>>>>>> the number of columns in T equals the number of rows in E (i.e. E^T
>>>>>>> is q x p); the result of the matrixmult would give me an m x p
>>>>>>> matrix (m=6076937, p=87).
>>>>>>
>>>>>> You're exactly right that you want to multiply T by E^T, because you
>>>>>> can't compute T * E.
>>>>>>
>>>>>> The way it turns out in practice, computing the matrix product of
>>>>>> two matrices as a map-reduce job is efficiently done as a map-side
>>>>>> join on two row-based matrices with the same number of rows, where
>>>>>> it is the column counts that differ. In particular, if you take a
>>>>>> matrix X represented as a set of numRowsX rows, each of which has
>>>>>> numColsX columns, and another matrix Y with numRowsY == numRowsX
>>>>>> rows, each of which has numColsY columns (!= numColsX), then by
>>>>>> summing the outer products of each of the numRowsX pairs of rows,
>>>>>> you get a matrix Z with numRowsZ == numColsX and numColsZ ==
>>>>>> numColsY. (If you instead take the reverse outer product of each
>>>>>> pair, you end up with the transpose of this result, with numRowsZ ==
>>>>>> numColsY and numColsZ == numColsX.)
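Spelled out in code, the summing of outer products Jake describes computes Z = X^T * Y. A minimal dense sketch with plain Java arrays (the distributed job emits per-row outer-product contributions from mappers and sums them across reducers; this shows only the arithmetic):

    // Sketch: Z = X^T * Y as a sum over the shared rows r of the outer
    // product of X's row r with Y's row r. Plain dense arrays for clarity.
    static double[][] transposeTimesY(double[][] x, double[][] y) {
      int numRows = x.length;        // numRowsX == numRowsY (the join key)
      int numColsX = x[0].length;
      int numColsY = y[0].length;
      double[][] z = new double[numColsX][numColsY];
      for (int r = 0; r < numRows; r++) {
        for (int i = 0; i < numColsX; i++) {
          if (x[r][i] == 0.0) {
            continue;                // skip the sparse matrix's zeros
          }
          for (int j = 0; j < numColsY; j++) {
            z[i][j] += x[r][i] * y[r][j];   // outer-product contribution
          }
        }
      }
      return z;                      // numColsX rows by numColsY columns
    }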
>>>>>> Unfortunately, you have a pair of matrices which have different
>>>>>> numbers of rows and the same number of columns, but you want a pair
>>>>>> of matrices with the same number of rows and (possibly) different
>>>>>> numbers of columns.
>>>>>>
>>>>>>> So I tried to run matrixmult with:
>>>>>>>
>>>>>>>   matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444 --numColsB 87 \
>>>>>>>     --inputPathA /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
>>>>>>>     --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
>>>>>>>
>>>>>>> (--inputPathA points to the output of the rowid job)
>>>>>>>
>>>>>>> This results in:
>>>>>>>
>>>>>>>   Exception in thread "main" org.apache.mahout.math.CardinalityException:
>>>>>>>
>>>>>>> In the code, I see the test that row counts must be identical for
>>>>>>> the two input matrices. Thus, it seems like the job requires me to
>>>>>>> transpose this large matrix, just to re-transpose it back to its
>>>>>>> original form during the multiplication? Or have I missed something
>>>>>>> crucial again?
>>>>>>
>>>>>> You actually need to transpose the input matrix (T), and then re-run
>>>>>> with T^t and E^t (the latter you apparently already have created).
>>>>>>
>>>>>> We should really rename the "matrixmultiply" job to be called
>>>>>> "transposematrixmultiply", because that's what it really does.
>>>>>>
>>>>>>   -jake
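To spell Jake's fix out end to end, the corrected invocations would look roughly like the following. The matrixmult flags are the ones from Tim's command above; the transpose driver's flag names are an assumption from the 0.4-era drivers, and the transpose output path is left as a placeholder, so treat this as a sketch:

    # 1) Transpose T (6076937 x 20444) to get T^t (20444 x 6076937).
    #    Flag names assumed; verify with bin/mahout transpose --help.
    bin/mahout transpose \
      --input /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
      --numRows 6076937 --numCols 20444

    # 2) Re-run the multiply with T^t and E^t. Both inputs now have 20444
    #    rows, and the job's implicit transpose computes (T^t)^t * E^t =
    #    T * E^t, the desired 6076937 x 87 result.
    bin/mahout matrixmult \
      --numRowsA 20444 --numColsA 6076937 \
      --numRowsB 20444 --numColsB 87 \
      --inputPathA <path-to-T-transpose-output> \
      --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244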