Re: Generating a Document Similarity Matrix

Kris Jack Fri, 18 Jun 2010 09:47:13 -0700

Thanks Ted,

I got that working.  Unfortunately, the matrix multiplication job is taking
far longer than I hoped.  With just over 10 million documents, 10 mappers
and 10 reducers, I can't get it to complete the job in under 48 hours.

Perhaps you have an idea for speeding it up?  I have already been quite
ruthless with making the vectors sparse.  I did not include terms that
appeared in over 1% of the corpus and only kept terms that appeared at least
50 times.  Is it normal for the matrix multiplication MapReduce job to
take this long with this quantity of data and these resources, or do you
think that my system is not configured properly?
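
For reference, the term pruning described above can be sketched roughly as
follows.  This is a minimal in-memory Python illustration, not the Mahout
code that was actually run.  The function name `prune_terms` and its
parameters are hypothetical, and it assumes that "appeared at least 50
times" means total occurrences across the corpus rather than a document
count.

```python
from collections import Counter


def prune_terms(docs, max_doc_fraction=0.01, min_total_count=50):
    """Return the set of terms to keep.

    docs: list of token lists, one per document.
    A term is dropped if it occurs in more than max_doc_fraction of the
    documents, or if it occurs fewer than min_total_count times overall
    (assumption: the count is over the whole corpus).
    """
    doc_freq = Counter()     # number of documents containing each term
    total_count = Counter()  # number of occurrences of each term
    for tokens in docs:
        total_count.update(tokens)
        doc_freq.update(set(tokens))

    max_docs = max_doc_fraction * len(docs)
    return {
        term
        for term, count in total_count.items()
        if count >= min_total_count and doc_freq[term] <= max_docs
    }
```

On a real corpus the two counts would come from a counting job rather than
from in-memory counters, but the filtering rule is the same.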

Thanks,
Kris

2010/6/15 Ted Dunning <[email protected]>

> Thresholds are generally dangerous.  It is usually preferable to specify the
> sparseness you want (1%, 0.2%, whatever), sort the results in descending
> score order using Hadoop's built-in capabilities and just drop the rest.
>
> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:
>
> >  I was wondering if there was an
> > interesting way to do this with the current mahout code such as requesting
> > that the Vector accumulator returns only elements that have values greater
> > than a given threshold, sorting the vector by value rather than key, or
> > something else?
> >
>
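
The suggestion quoted above (pick a target sparseness, sort by score in
descending order, drop the rest) might look roughly like this for a single
row of the similarity matrix.  This is a minimal in-memory Python sketch,
not Mahout or Hadoop code.  The function name `keep_top` is hypothetical,
and it assumes the target sparseness is applied per row as a fraction of
the number of columns.

```python
import heapq
import math


def keep_top(row, n_cols, sparseness):
    """Keep only the highest-scoring entries of one sparse row.

    row: dict mapping column index -> similarity score.
    n_cols: total number of columns in the matrix.
    sparseness: target fraction of columns to retain (e.g. 0.01 for 1%).
    At least one entry is kept for a non-empty row (an assumption of
    this sketch).
    """
    if not row:
        return {}
    k = max(1, math.ceil(sparseness * n_cols))
    # nlargest sorts by score in descending order and truncates to k entries
    top = heapq.nlargest(k, row.items(), key=lambda item: item[1])
    return dict(top)
```

Unlike a fixed score threshold, this bounds the number of entries kept per
row no matter how the scores are distributed.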
