Suspicion confirmed:

  public SequentialAccessSparseVector(Vector other) {
    this(other.size(), other.getNumNondefaultElements());
    Iterator<Element> it = other.iterateNonZero();
    Element e;
    while (it.hasNext() && (e = it.next()) != null) {
      set(e.index(), e.get());
    }
  }

We iterate over the other vector (which is in random/hashed order), adding each element to the sequential access vector (which always tries to stay in sequential order). So this may actually be *worse* than O(n^2), but I'd prefer not to find out how much worse; we should just fix it.

Should be fairly straightforward: make an array of structs (essentially) holding the index and the double, of size other.getNumNondefaultElements() (what a horrible method name), fill it up in one iteration over the other vector, sort it in place, then make your new OrderedIntDoubleMapping out of the indexes and values (unless someone has a cleverer idea for sorting a pair of arrays at the same time, shuffling one based on the ordering criterion of the other).

  -jake
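A minimal sketch of the approach described above, under some assumptions: IndexValue and fillSorted are made-up names (the "array of structs"), while iterateNonZero() and getNumNondefaultElements() are the actual Vector methods used in the constructor. Exactly how the sorted arrays get handed to OrderedIntDoubleMapping is left open, since its constructors may differ:

  import java.util.Arrays;
  import java.util.Comparator;
  import java.util.Iterator;

  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.Vector.Element;

  final class SortedFill {

    // The "struct": one (index, value) pair from the source vector.
    private static final class IndexValue {
      final int index;
      final double value;
      IndexValue(int index, double value) {
        this.index = index;
        this.value = value;
      }
    }

    // Gather the nonzero entries in one pass over the hashed vector,
    // sort once by index, then unzip into parallel arrays that can
    // back the new OrderedIntDoubleMapping.
    static void fillSorted(Vector other, int[] indices, double[] values) {
      IndexValue[] entries = new IndexValue[other.getNumNondefaultElements()];
      Iterator<Element> it = other.iterateNonZero();
      int n = 0;
      while (it.hasNext()) {
        Element e = it.next();
        entries[n++] = new IndexValue(e.index(), e.get());
      }
      // A single O(n log n) sort replaces n incremental ordered
      // insertions, each of which can shift O(n) elements.
      Arrays.sort(entries, 0, n, new Comparator<IndexValue>() {
        public int compare(IndexValue a, IndexValue b) {
          return a.index < b.index ? -1 : (a.index > b.index ? 1 : 0);
        }
      });
      for (int i = 0; i < n; i++) {
        indices[i] = entries[i].index;
        values[i] = entries[i].value;
      }
    }
  }

The constructor would allocate the two arrays at size getNumNondefaultElements(), call fillSorted, and build the mapping from them, turning the conversion into one pass plus one sort.
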
On Tue, Mar 29, 2011 at 5:38 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> This sort could probably be done with a radix sort to make it truly linear as well.
>
> I suspect previously unsuspected n^2 behavior. That behavior now officially loses its unsuspected nature (whether it exists or not).
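Since vector indices are non-negative ints, Ted's radix-sort idea could look like the following sketch (not existing Mahout code): a byte-at-a-time LSD counting sort that carries the values array through every pass, which also sidesteps the question of sorting a pair of arrays together, since each scatter permutes both at once.

  final class RadixPairSort {

    // Sorts indices[0..n) ascending in O(n) total, moving values[] in
    // step: four stable counting-sort passes, one per byte of the
    // (non-negative) index.
    static void sortByIndex(int[] indices, double[] values, int n) {
      int[] idxBuf = new int[n];
      double[] valBuf = new double[n];
      for (int shift = 0; shift < 32; shift += 8) {
        int[] counts = new int[257];
        // Histogram of the current byte.
        for (int i = 0; i < n; i++) {
          counts[((indices[i] >>> shift) & 0xFF) + 1]++;
        }
        // Prefix sums: counts[b] becomes the start offset of bucket b.
        for (int b = 0; b < 256; b++) {
          counts[b + 1] += counts[b];
        }
        // Stable scatter of both arrays by the current byte.
        for (int i = 0; i < n; i++) {
          int dest = counts[(indices[i] >>> shift) & 0xFF]++;
          idxBuf[dest] = indices[i];
          valBuf[dest] = values[i];
        }
        System.arraycopy(idxBuf, 0, indices, 0, n);
        System.arraycopy(valBuf, 0, values, 0, n);
      }
    }
  }

Whether the linear-time passes actually beat Arrays.sort() at typical vector sizes would need measuring; the asymptotics only matter for the million-entry rows that triggered this thread.
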
>
> On Mon, Mar 28, 2011 at 10:27 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>
>> Hmm... a sort should be fine (N log N is the new linear!). What I'm wondering is whether it's something more insidious: like incrementally building up the SequentialAccessSparseVector one element at a time (or some such N^2 operation - N^2 is not the new anything, it's still N^2), which isn't noticed when it's only called rarely (at the end of the reduce, once per vector) and N is usually in the 100's to 1000's, not millions.
>>
>> On Tue, Mar 29, 2011 at 5:14 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>
>>> The issue here is almost certainly that the sort implied by this conversion is only reasonable with a small number of non-zero elements. Special-casing a relatively dense case is probably worthwhile.
>>>
>>> On Mon, Mar 28, 2011 at 10:04 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>>
>>> > Hi Jake,
>>> >
>>> > I've tracked down the bottleneck in the TransposeJob's reducer to the line where Mahout creates a new SequentialAccessSparseVector from a RandomAccessSparseVector after the while loop completes:
>>> >
>>> >   SequentialAccessSparseVector outVector = new SequentialAccessSparseVector(tmp);
>>> >
>>> > For high-frequency terms (some of which occur over ~1M times in my data), the code to create a SequentialAccessSparseVector from a RandomAccessSparseVector bogs down completely. Of course, I should probably get rid of the terms that occur this frequently, and it seems I could continue to increase mapred.task.timeout, but as you said, the "transpose job is totally trivial", so there may be a bigger issue here that is worth digging into further. I think it has something to do with OrderedIntDoubleMapping, but I haven't had enough time to devote to this issue to know for sure ... I'll keep at it though.
>>> >
>>> > Cheers,
>>> > Tim
>>> >
>>> > On Mon, Mar 21, 2011 at 6:34 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>>> >
>>> > > On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>> > >
>>> > >> Hi Jake,
>>> > >>
>>> > >> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3 reducers per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed within a few minutes, but then all of my 24 reducers failed near the 70% mark with the error "Task attempt_201103201840_0004_r_000023_0 failed to report status for 601 seconds. Killing!" The data node servers stayed at a healthy load avg below 4 and never paged ...
>>> > >
>>> > > Weird. The transpose job is totally trivial, but it may be that the vectors you are producing are abnormally large. In fact, if this came from text, this is probably the case: very common words (which aren't stop words) occur in most of your documents, and so their vectors are very, very large. Might they be larger than your memory? 6 million entries, each holding an int and a double when represented sparsely (sadly, in this case), come to less than 100MB, so that's not it.
>>> > >
>>> > > BTW: you need to make sure you use the same number of reducers as when you made your eigenvector matrix, as that's a requirement for a map-side join. That should probably be documented!
>>> > >
>>> > >> So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes instead of the default 10 minutes, and it failed again. The reduce code for the TransposeJob looks straightforward, so I'm going to have to dig in deeper to figure out what's causing the problem.
>>> > >
>>> > > Yeah, if you can see where it's timing out, that would be great. Not sure why that task should ever take very long.
>>> > >
>>> > > -jake
>>> > >
>>> > >> Cheers,
>>> > >> Tim
>>> > >>
>>> > >> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>>> > >>
>>> > >> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabd...@gmail.com> wrote:
>>> > >> >
>>> > >> >> Regarding Jake's comment: " ... you need to run the RowIdJob on these tfidf-vectors first ..."
>>> > >> >>
>>> > >> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD eigenvector matrix E is p x q (p=87, q=20444).
>>> > >> >
>>> > >> > Ok, so to help you understand what's going on here, I'm going to go into a little of the inner detail.
>>> > >> >
>>> > >> > You are right: you have a matrix T with 6,076,937 rows, and each row has 20,444 columns (most of which are zero, and it's represented sparsely, but still, they live in a vector space of dimension 20,444). Similarly, you've made an eigenvector matrix, which has 87 rows (i.e. 87 eigenvectors), and each of these rows has exactly 20,444 columns (and most likely, they'll all be nonzero, because eigenvectors have no reason to be sparse).
>>> > >> >
>>> > >> > In particular, T and E are represented as *lists of rows*; each row is a vector of dimension 20,444. T has six million of these rows, and E has only 87.
>>> > >> >
>>> > >> >> So to multiply these two matrices, I need to transpose E so that the number of columns in T equals the number of rows in E (i.e. E^T is q x p); the result of the matrixmult would give me an m x p matrix (m=6076937, p=87).
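As a quick dimension check of the multiplication Timothy is describing, using his numbers:

  T * E^T : (6076937 x 20444) * (20444 x 87) = (6076937 x 87)

The inner dimension, 20,444, only lines up once E is transposed.
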
>>> > >> >> >>> > >> > >>> > >> > You're exactly right that you want to multiply T by E^T, because >>> you >>> > >> can't >>> > >> > compute T * E. >>> > >> > >>> > >> > The way it turns out in practice, computing the matrix product of >>> two >>> > >> > matrices as a map-reduce job is efficiently done as a map-side >>> join on >>> > >> > two row-based matrices with the same number of rows, and the >>> columns >>> > >> > are the ones which are different. In particular, if you take a >>> matrix >>> > X >>> > >> > which >>> > >> > is represented as a set of numRowsX rows, each of which has >>> numColsX, >>> > >> > and another matrix with numRowsY == numRowsX, each of which has >>> > >> > numColsY (!= numColsX), then by summing the outer-products of each >>> > >> > of the numRowsX pairs of vectors, you get a matrix of with >>> numRowsZ == >>> > >> > numColsX, and numColsZ == numColsY (if you instead take the >>> reverse >>> > >> > outer product of the vector pairs, you can end up with the >>> transpose >>> > of >>> > >> > this >>> > >> > final result, with numRowsZ == numColsY, and numColsZ == >>> numColsX). >>> > >> > >>> > >> > Unfortunately, you have a pair of matrices which have different >>> > numbers >>> > >> > of rows, and the same number of columns, but you want a pair of >>> > matrices >>> > >> > with the same number of rows and (possibly) different numbers of >>> > >> columns. >>> > >> > >>> > >> > >>> > >> >> So I tried to run matrixmult with: >>> > >> > >>> > >> > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444 >>> > >> --numColsB >>> > >> >> 87 \ >>> > >> >> --inputPathA >>> > >> >> >>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix >>> > \ >>> > >> >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244 >>> > >> > >>> > >> > >>> > >> >> (--inputPathA points to the output of the rowid job) >>> > >> >> >>> > >> >> This results in: >>> > >> >> Exception in thread "main" >>> > org.apache.mahout.math.CardinalityException: >>> > >> >> >>> > >> > >>> > >> > >>> > >> >> In the code, I see the test that row counts must be identical for >>> the >>> > >> two >>> > >> >> >>> > >> >> input matrices. Thus, it seems like the job requires me to >>> transpose >>> > >> this >>> > >> >> large matrix, just to re-transpose it back to it's original form >>> > during >>> > >> >> the >>> > >> >> multiplication? Or have I missed something crucial again? >>> > >> >> >>> > >> > >>> > >> > You actually need to transpose the input matrix (T), and then >>> re-run >>> > >> with >>> > >> > T^t >>> > >> > and E^t (the latter you apparently already have created). >>> > >> > >>> > >> > We should really rename the "matrixmultiply" job to be called >>> > >> > "transposematrixmultiply", because that's what it really does. >>> > >> > >>> > >> > -jake >>> > >> > >>> > >> >>> > > >>> > > >>> > >>> >> >> >