Hi Kris,

maybe you want to give the patch from https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not tested it with larger data yet, but I would be happy to get some feedback on it, and maybe it helps with your use case.
-sebastian

On 18.06.2010 18:46, Kris Jack wrote:
> Thanks Ted,
>
> I got that working. Unfortunately, the matrix multiplication job is taking
> far longer than I hoped. With just over 10 million documents, 10 mappers
> and 10 reducers, I can't get it to complete the job in under 48 hours.
>
> Perhaps you have an idea for speeding it up? I have already been quite
> ruthless with making the vectors sparse. I did not include terms that
> appeared in over 1% of the corpus and only kept terms that appeared at least
> 50 times. Is it normal that the matrix multiplication map reduce task
> should take so long to process with this quantity of data and resources
> available or do you think that my system is not configured properly?
>
> Thanks,
> Kris
>
> 2010/6/15 Ted Dunning <[email protected]>
>
>> Thresholds are generally dangerous. It is usually preferable to specify the
>> sparseness you want (1%, 0.2%, whatever), sort the results in descending
>> score order using Hadoop's builtin capabilities and just drop the rest.
>>
>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:
>>
>>> I was wondering if there was an
>>> interesting way to do this with the current mahout code such as requesting
>>> that the Vector accumulator returns only elements that have values greater
>>> than a given threshold, sorting the vector by value rather than key, or
>>> something else?
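
For reference, Ted's suggestion quoted above (pick a target sparseness, sort by descending score, drop the rest) could be sketched roughly as follows. This is only a minimal illustration in plain Java, not Mahout code: the class `TopFractionSparsifier` and the method `keepTopFraction` are hypothetical names, the row is assumed to be a simple index-to-score map, the fraction is assumed to be relative to the row's non-zero entries, and the sort is done in memory rather than with Hadoop's built-in sorting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopFractionSparsifier {

    // Hypothetical helper (not part of Mahout): keeps only the highest-scoring
    // fraction of the entries of one sparse row, given as index -> score.
    static Map<Integer, Double> keepTopFraction(Map<Integer, Double> row, double fraction) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        // Number of entries to keep, rounded up so small rows keep at least one.
        int keep = (int) Math.ceil(row.size() * fraction);

        // Sort entries by score, highest first (in memory; an assumption of this sketch).
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(row.entrySet());
        entries.sort(Map.Entry.<Integer, Double>comparingByValue().reversed());

        // Copy the first `keep` entries, preserving descending score order.
        Map<Integer, Double> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, Double> e : entries.subList(0, keep)) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> row = new LinkedHashMap<>();
        row.put(3, 0.9);
        row.put(7, 0.1);
        row.put(12, 0.5);
        row.put(40, 0.05);
        // Keep the top half of the entries by score.
        System.out.println(keepTopFraction(row, 0.5));
    }
}
```

Unlike a fixed threshold, this keeps a predictable number of entries per row regardless of how the scores happen to be scaled.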
