Thanks Sebastian, I'll give it a try!
2010/6/18 Sebastian Schelter <[email protected]>

> Hi Kris,
>
> maybe you want to give the patch from
> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
> tested it with larger data yet, but I would be happy to get some
> feedback on it, and maybe it helps you with your use case.
>
> -sebastian
>
> On 18.06.2010 18:46, Kris Jack wrote:
> > Thanks Ted,
> >
> > I got that working. Unfortunately, the matrix multiplication job is taking
> > far longer than I hoped. With just over 10 million documents, 10 mappers
> > and 10 reducers, I can't get it to complete the job in under 48 hours.
> >
> > Perhaps you have an idea for speeding it up? I have already been quite
> > ruthless with making the vectors sparse. I did not include terms that
> > appeared in over 1% of the corpus and only kept terms that appeared at least
> > 50 times. Is it normal that the matrix multiplication map reduce task
> > should take so long to process with this quantity of data and resources
> > available or do you think that my system is not configured properly?
> >
> > Thanks,
> > Kris
> >
> > 2010/6/15 Ted Dunning <[email protected]>
> >
> >> Thresholds are generally dangerous. It is usually preferable to specify the
> >> sparseness you want (1%, 0.2%, whatever), sort the results in descending
> >> score order using Hadoop's built-in capabilities and just drop the rest.
> >>
> >> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:
> >>
> >>> I was wondering if there was an
> >>> interesting way to do this with the current mahout code such as requesting
> >>> that the Vector accumulator returns only elements that have values greater
> >>> than a given threshold, sorting the vector by value rather than key, or
> >>> something else?

--
Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
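
For reference, Ted's suggestion quoted above (pick the sparseness you want, sort by descending score, drop the rest) might look roughly like the sketch below. This is plain in-memory Python, not Mahout or Hadoop code; the function name keep_top_fraction, the dict-based vector, and the sample scores are all made up for illustration. In the actual job the sort would presumably be done by Hadoop rather than in memory.

```python
import heapq
import math


def keep_top_fraction(vector, fraction):
    """Keep only the highest-scoring `fraction` of a sparse vector's entries.

    `vector` is a hypothetical stand-in for a sparse vector: a dict that
    maps an element index to its score.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not vector:
        return {}
    # Keep at least one entry so that very short vectors are not emptied.
    k = max(1, math.ceil(len(vector) * fraction))
    # nlargest returns the k best entries in descending score order.
    top = heapq.nlargest(k, vector.items(), key=lambda item: item[1])
    return dict(top)


# Illustrative scores only, not data from the thread.
scores = {3: 0.91, 17: 0.05, 42: 0.40, 256: 0.77}
print(keep_top_fraction(scores, 0.5))
```

Unlike a fixed threshold, this keeps the output size predictable: every vector retains the same share of its entries regardless of how its scores are distributed.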
