This sort could probably be done with a radix sort to make it truly linear as
well.
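
For concreteness, a sketch of that idea (hypothetical, untested): an LSD radix
sort over the nonzero int indices runs in O(N), using four counting passes,
instead of the O(N log N) comparison sort:

  // Hypothetical LSD radix sort of non-negative int indices, a byte at a time.
  static void radixSort(int[] a) {
    int[] buf = new int[a.length];
    for (int shift = 0; shift < 32; shift += 8) {
      int[] count = new int[257];                      // counting sort per byte
      for (int x : a) count[((x >>> shift) & 0xFF) + 1]++;
      for (int i = 1; i < 257; i++) count[i] += count[i - 1];
      for (int x : a) buf[count[(x >>> shift) & 0xFF]++] = x;  // stable scatter
      System.arraycopy(buf, 0, a, 0, a.length);
    }
  }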

I suspect previously unsuspected N^2 behavior.  That behavior now officially
loses its unsuspected nature (whether it exists or not).

On Mon, Mar 28, 2011 at 10:27 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> Hmm... a sort should be fine (N log N is the new linear!); what I'm wondering
> is whether it's something more insidious: like incrementally building up the
> SequentialAccessSparseVector one element at a time (or some such N^2
> operation - N^2 is not the new anything, it's still N^2), which isn't noticed
> when it's only called rarely (at the end of the reduce, once per vector) and
> N is usually in the 100s to 1000s, not millions.
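>
> To illustrate the suspicion, a minimal sketch (hypothetical, not Mahout's
> actual code) of why that pattern goes quadratic: inserting into sorted
> parallel arrays shifts O(N) entries per out-of-order insert, so N inserts
> cost O(N^2), versus collecting everything and sorting once at O(N log N):
>
>   // Hypothetical: N out-of-order inserts into a sorted index/value pair.
>   int[] indices = new int[0];
>   double[] values = new double[0];
>
>   void set(int index, double value) {
>     int pos = java.util.Arrays.binarySearch(indices, index);
>     if (pos >= 0) { values[pos] = value; return; }  // overwrite existing
>     pos = -(pos + 1);
>     int[] ni = new int[indices.length + 1];       // grow-and-shift copies
>     double[] nv = new double[values.length + 1];  // O(N) per insert
>     System.arraycopy(indices, 0, ni, 0, pos);
>     System.arraycopy(values, 0, nv, 0, pos);
>     System.arraycopy(indices, pos, ni, pos + 1, indices.length - pos);
>     System.arraycopy(values, pos, nv, pos + 1, values.length - pos);
>     ni[pos] = index;
>     nv[pos] = value;
>     indices = ni;
>     values = nv;
>   }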
>
> On Tue, Mar 29, 2011 at 5:14 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> The issue here is almost certainly that the sort implied by this conversion
>> is only reasonable with a small number of non-zero elements.  Special-casing
>> a relatively dense case is probably worthwhile.
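>>
>> Something like this could do the special-casing (a rough sketch; the
>> cutoff is made up and the API details are from memory):
>>
>>   Vector convert(RandomAccessSparseVector v) {
>>     int nnz = v.getNumNondefaultElements();
>>     SequentialAccessSparseVector out =
>>         new SequentialAccessSparseVector(v.size(), nnz);
>>     if (nnz > v.size() / 8) {                 // "relatively dense" case:
>>       for (int i = 0; i < v.size(); i++) {    // scan in index order, so
>>         double d = v.getQuick(i);             // every set is an append
>>         if (d != 0.0) out.setQuick(i, d);
>>       }
>>     } else {                                  // sparse case: sort once
>>       int[] indices = new int[nnz];
>>       int n = 0;
>>       java.util.Iterator<Vector.Element> it = v.iterateNonZero();
>>       while (it.hasNext()) { indices[n++] = it.next().index(); }
>>       java.util.Arrays.sort(indices);         // one O(nnz log nnz) sort
>>       for (int idx : indices) { out.setQuick(idx, v.getQuick(idx)); }
>>     }
>>     return out;
>>   }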
>>
>>
>> On Mon, Mar 28, 2011 at 10:04 PM, Timothy Potter <thelabd...@gmail.com>
>> wrote:
>>
>> > Hi Jake,
>> >
>> > I've tracked down the bottleneck in the TransposeJob's reducer to the line
>> > where Mahout creates a new SequentialAccessSparseVector from a
>> > RandomAccessSparseVector after the while loop completes:
>> >
>> >     SequentialAccessSparseVector outVector =
>> >         new SequentialAccessSparseVector(tmp);
>> >
>> > For high-frequency terms (some of which occur over ~1M times in my data),
>> > the code to create a SequentialAccessSparseVector from a
>> > RandomAccessSparseVector bogs down completely. Of course, I should
>> > probably get rid of terms that occur this frequently, and it seems I could
>> > keep increasing mapred.task.timeout, but as you said, the "transpose job
>> > is totally trivial", so there may be a bigger issue here that is worth
>> > digging into further. I think it has something to do with
>> > OrderedIntDoubleMapping, but I haven't had enough time to devote to this
>> > issue to know for sure ... I'll keep at it though.
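>> >
>> > A minimal stand-alone check of just that conversion would be something
>> > like this (sizes made up to roughly match the worst terms):
>> >
>> >   import java.util.Random;
>> >   import org.apache.mahout.math.RandomAccessSparseVector;
>> >   import org.apache.mahout.math.SequentialAccessSparseVector;
>> >
>> >   public class ConversionCheck {
>> >     public static void main(String[] args) {
>> >       int cardinality = 6076937;  // rows in the matrix being transposed
>> >       int nnz = 1000000;          // ~1M occurrences for a hot term
>> >       Random rand = new Random(42);
>> >       RandomAccessSparseVector tmp =
>> >           new RandomAccessSparseVector(cardinality, nnz);
>> >       for (int i = 0; i < nnz; i++) {
>> >         tmp.setQuick(rand.nextInt(cardinality), 1.0);
>> >       }
>> >       long start = System.currentTimeMillis();
>> >       SequentialAccessSparseVector outVector =
>> >           new SequentialAccessSparseVector(tmp);  // the suspect line
>> >       System.out.println("conversion took "
>> >           + (System.currentTimeMillis() - start) + " ms, nnz = "
>> >           + outVector.getNumNondefaultElements());
>> >     }
>> >   }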
>> >
>> > Cheers,
>> > Tim
>> >
>> > On Mon, Mar 21, 2011 at 6:34 PM, Jake Mannix <jake.man...@gmail.com>
>> > wrote:
>> >
>> > >
>> > >
>> > > On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <thelabd...@gmail.com>
>> > > wrote:
>> > >
>> > >> Hi Jake,
>> > >>
>> > >>
>> > >> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
>> > >> reducers per node and mapred.child.java.opts=-Xmx4096m. The map tasks
>> > >> completed within a few minutes but then all of my 24 reducers failed
>> > >> near the 70% mark with error "Task
>> > >> attempt_201103201840_0004_r_000023_0 failed to report status for 601
>> > >> seconds. Killing!" The data node servers stayed at a healthy load avg
>> > >> below 4 and never paged ...
>> > >>
>> > >
>> > > Weird.   The transpose job is totally trivial, but it may be that the
>> > > vectors you are producing are abnormally large.  In fact, if this came
>> > > from text, this is probably the case: very common words (which aren't
>> > > stop words) occur in most of your documents, so their rows in the
>> > > transpose are very, very large.  Might they be larger than your memory?
>> > > A vector of 6 million entries, represented sparsely (sadly, in this
>> > > case) as an int index plus a double value, is 12 bytes per entry, or
>> > > about 72MB - still less than 100MB, so that's not it.
>> > >
>> > > BTW: you need to make sure you use the same number of reducers as when
>> > > you made your eigenvector matrix, as that's a requirement for a
>> > > map-side join.   That should probably be documented!
>> > >
>> > >
>> > >> So I increased the "mapred.task.timeout" Hadoop parameter to 20
>> > >> minutes instead of the default 10 minutes, and it failed again. The
>> > >> reduce code for the TransposeJob looks straightforward, so I'm going
>> > >> to have to dig in deeper to figure out what's causing the problem.
>> > >>
>> > >
>> > > Yeah, if you can see where it's timing out, that would be great.  Not
>> > > sure why that task should ever take very long.
>> > >
>> > >   -jake
>> > >
>> > >
>> > >
>> > >> Cheers,
>> > >> Tim
>> > >>
>> > >> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <jake.man...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter
>> > >> > <thelabd...@gmail.com> wrote:
>> > >> >
>> > >> >> Regarding Jake's comment: " ... you need to run the RowIdJob on
>> > >> >> these tfidf-vectors first ..."
>> > >> >>
>> > >> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My
>> > >> >> SVD eigenvector matrix E is p x q (p=87, q=20444).
>> > >> >
>> > >> >
>> > >> > Ok, to help you understand, I'm going to go into a little of the
>> > >> > inner detail of what's going on here.
>> > >> >
>> > >> > You are right: you have a matrix T with 6,076,937 rows, and each
>> > >> > row has 20,444 columns (most of which are zero, and it's represented
>> > >> > sparsely, but still, they live in a vector space of dimension
>> > >> > 20,444).  Similarly, you've made an eigenvector matrix, which has 87
>> > >> > rows (i.e. 87 eigenvectors), and each of these rows has exactly
>> > >> > 20,444 columns (most likely they'll all be nonzero, because
>> > >> > eigenvectors have no reason to be sparse).
>> > >> >
>> > >> > In particular, T and E are represented as *lists of rows*; each row
>> > >> > is a vector of dimension 20,444.  T has six million of these rows,
>> > >> > and E has only 87.
>> > >> >
>> > >> >
>> > >> >> So to multiply these two matrices, I need to transpose E so that
>> > >> >> the number of columns in T equals the number of rows in E (i.e.
>> > >> >> E^T is q x p); the result of the matrixmult would be an m x p
>> > >> >> matrix (m=6076937, p=87).
>> > >> >>
>> > >> >
>> > >> > You're exactly right that you want to multiply T by E^T, because
>> > >> > you can't compute T * E.
>> > >> >
>> > >> > The way it turns out in practice, computing the matrix product of
>> > >> > two matrices as a map-reduce job is efficiently done as a map-side
>> > >> > join on two row-based matrices with the same number of rows; it's
>> > >> > the columns that differ.  In particular, if you take a matrix X
>> > >> > which is represented as a set of numRowsX rows, each of which has
>> > >> > numColsX columns, and another matrix Y with numRowsY == numRowsX,
>> > >> > each row of which has numColsY columns (!= numColsX), then by
>> > >> > summing the outer products of each of the numRowsX pairs of
>> > >> > vectors, you get a matrix Z (= X^T * Y) with numRowsZ == numColsX
>> > >> > and numColsZ == numColsY (if you instead take the reverse outer
>> > >> > product of the vector pairs, you end up with the transpose of this
>> > >> > final result, with numRowsZ == numColsY and numColsZ == numColsX).
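>> > >> >
>> > >> > In plain, non-distributed Java, that sum-of-outer-products
>> > >> > computation looks like this (a dense in-memory sketch of the idea,
>> > >> > not the actual Mahout job code):
>> > >> >
>> > >> >   // Z = X^T * Y, accumulated as the sum over i of outer(X[i], Y[i]).
>> > >> >   static double[][] transposeTimes(double[][] x, double[][] y) {
>> > >> >     int numRows = x.length;        // requires numRowsX == numRowsY
>> > >> >     int numColsX = x[0].length;
>> > >> >     int numColsY = y[0].length;
>> > >> >     double[][] z = new double[numColsX][numColsY];
>> > >> >     for (int i = 0; i < numRows; i++) {      // one outer product
>> > >> >       for (int a = 0; a < numColsX; a++) {   // per pair of rows,
>> > >> >         for (int b = 0; b < numColsY; b++) { // summed into Z
>> > >> >           z[a][b] += x[i][a] * y[i][b];
>> > >> >         }
>> > >> >       }
>> > >> >     }
>> > >> >     return z;
>> > >> >   }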
>> > >> >
>> > >> > Unfortunately, you have a pair of matrices which have different
>> > >> > numbers of rows and the same number of columns, but you want a pair
>> > >> > of matrices with the same number of rows and (possibly) different
>> > >> > numbers of columns.
>> > >> >
>> > >> >
>> > >> >> So I tried to run matrixmult with:
>> > >> >>
>> > >> >> matrixmult --numRowsA 6076937 --numColsA 20444 \
>> > >> >>   --numRowsB 20444 --numColsB 87 \
>> > >> >>   --inputPathA /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
>> > >> >>   --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
>> > >> >
>> > >> >
>> > >> >> (--inputPathA points to the output of the rowid job)
>> > >> >>
>> > >> >> This results in:
>> > >> >> Exception in thread "main" org.apache.mahout.math.CardinalityException:
>> > >> >>
>> > >> >
>> > >> >
>> > >> >> In the code, I see the test that row counts must be identical for
>> > >> >> the two input matrices. Thus, it seems like the job requires me to
>> > >> >> transpose this large matrix, just to re-transpose it back to its
>> > >> >> original form during the multiplication? Or have I missed something
>> > >> >> crucial again?
>> > >> >>
>> > >> >
>> > >> > You actually need to transpose the input matrix (T), and then
>> > >> > re-run with T^t and E^t (the latter you apparently already have
>> > >> > created).
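>> > >> >
>> > >> > Roughly something like this (flag names from memory, so check each
>> > >> > driver's --help for the exact options):
>> > >> >
>> > >> >   bin/mahout transpose \
>> > >> >     --input /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
>> > >> >     --numRows 6076937 --numCols 20444
>> > >> >
>> > >> >   bin/mahout matrixmult --numRowsA 20444 --numColsA 6076937 \
>> > >> >     --numRowsB 20444 --numColsB 87 \
>> > >> >     --inputPathA <output of the transpose above> \
>> > >> >     --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244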
>> > >> >
>> > >> > We should really rename the "matrixmultiply" job to be called
>> > >> > "transposematrixmultiply", because that's what it really does.
>> > >> >
>> > >> >   -jake
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>
