Thanks Ted, I got that working. Unfortunately, the matrix multiplication job is taking far longer than I hoped. With just over 10 million documents, 10 mappers and 10 reducers, I can't get it to complete the job in under 48 hours.
Perhaps you have an idea for speeding it up? I have already been quite ruthless in making the vectors sparse: I excluded terms that appear in more than 1% of the corpus and kept only terms that appear at least 50 times.

Is it normal for the matrix multiplication MapReduce job to take this long with this quantity of data and these resources, or do you think my system is not configured properly?

Thanks,
Kris

2010/6/15 Ted Dunning <[email protected]>

> Thresholds are generally dangerous. It is usually preferable to specify the
> sparseness you want (1%, 0.2%, whatever), sort the results in descending
> score order using Hadoop's built-in capabilities and just drop the rest.
>
> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:
>
> > I was wondering if there was an
> > interesting way to do this with the current mahout code such as requesting
> > that the Vector accumulator returns only elements that have values greater
> > than a given threshold, sorting the vector by value rather than key, or
> > something else?
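
For illustration only, the suggestion quoted above (keep a fixed fraction of the highest-scoring entries instead of applying a threshold) might look roughly like the following in plain Python. This is not Mahout or Hadoop code: the function name `keep_top_fraction` and the dict used as a stand-in for a sparse row are assumptions made for the sketch, and here the fraction is applied per row, which is only one way to read "sparseness".

```python
import heapq
import math


def keep_top_fraction(vector, fraction):
    """Keep only the highest-scoring entries of a sparse vector.

    vector:   dict mapping index -> score (hypothetical stand-in for a sparse row)
    fraction: share of the row's entries to keep, e.g. 0.01 for 1%
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not vector:
        return {}
    # Keep at least one entry so that short rows are not emptied completely.
    k = max(1, math.ceil(len(vector) * fraction))
    # nlargest selects the k best entries by score without sorting the whole row.
    top = heapq.nlargest(k, vector.items(), key=lambda item: item[1])
    return dict(top)


# Example row with five non-zero entries; keeping 50% retains the top three.
row = {3: 0.9, 7: 0.1, 12: 0.5, 40: 0.05, 41: 0.7}
pruned = keep_top_fraction(row, 0.5)
```

Unlike a fixed threshold, the number of entries kept per row is bounded regardless of how the scores are distributed, so the size of the result stays predictable.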
