Re: Generating a Document Similarity Matrix

Sebastian Schelter Fri, 18 Jun 2010 09:51:49 -0700

Hi Kris,

maybe you want to give the patch from
https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
tested it with larger data yet, but I would be happy to get some
feedback on it, and maybe it helps you with your use case.


-sebastian

On 18.06.2010 18:46, Kris Jack wrote:
> Thanks Ted,
>
> I got that working.  Unfortunately, the matrix multiplication job is taking
> far longer than I hoped.  With just over 10 million documents, 10 mappers
> and 10 reducers, I can't get it to complete the job in under 48 hours.
>
> Perhaps you have an idea for speeding it up?  I have already been quite
> ruthless with making the vectors sparse.  I did not include terms that
> appeared in over 1% of the corpus and only kept terms that appeared at least
> 50 times.  Is it normal that the matrix multiplication map reduce task
> should take so long to process with this quantity of data and resources
> available or do you think that my system is not configured properly?
>
> Thanks,
> Kris
>
>
>
> 2010/6/15 Ted Dunning <[email protected]>
>
>   
>> Thresholds are generally dangerous.  It is usually preferable to specify the
>> sparseness you want (1%, 0.2%, whatever), sort the results in descending
>> score order using Hadoop's built-in capabilities and just drop the rest.
>>
>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:
>>
>>> I was wondering if there was an
>>> interesting way to do this with the current mahout code such as requesting
>>> that the Vector accumulator returns only elements that have values greater
>>> than a given threshold, sorting the vector by value rather than key, or
>>> something else?
>
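
As a rough illustration of Ted's suggestion quoted above (choose a target
sparseness, sort by score in descending order, drop the rest), here is a
minimal in-memory sketch in plain Python. It is not Mahout or Hadoop code:
the function name keep_top_fraction, the dict-based row representation, and
the choice to apply the fraction per row are all assumptions made for
illustration. In a real job the sorting would be done by Hadoop rather than
in memory.

```python
import heapq
import math


def keep_top_fraction(row, fraction):
    """Keep the highest-scoring `fraction` of the entries in one sparse row.

    `row` maps a column index to a similarity score (hypothetical layout).
    At least one entry is kept for any non-empty row.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not row:
        return {}
    # Number of entries to retain, derived from the requested sparseness.
    k = max(1, math.ceil(len(row) * fraction))
    # nlargest returns the k best (index, score) pairs in descending score order.
    top = heapq.nlargest(k, row.items(), key=lambda item: item[1])
    return dict(top)


# Example row with four non-zero similarities; keep the top half.
row = {0: 0.91, 3: 0.12, 7: 0.45, 12: 0.67}
print(keep_top_fraction(row, 0.5))
```

Unlike a fixed score threshold, this keeps the same share of entries per row
regardless of how the scores are distributed, so the size of the output stays
predictable.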
