Thinking out loud here: It would be great to have a DocumentSimilarityJob
that is supplied a collection of documents, applies the necessary
preprocessing (tokenization, vectorization, etc.), and computes document
similarities.

Could be a nice starter task to add something like this.
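
As a rough illustration of that flow (tokenize, vectorize, compare), here is
a minimal in-memory Python sketch. It is not a Mahout job and not distributed.
All function names are made up for this example, and TF-IDF weighting with
cosine similarity is an assumption, since no particular vectorization or
similarity measure is named above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} TF-IDF vector."""
    token_lists = [tokenize(doc) for doc in docs]
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_similarities(docs):
    """Return {(i, j): similarity} for every pair of documents with i < j."""
    vectors = tfidf_vectors(docs)
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    }
```

Calling document_similarities(list_of_strings) returns one score per document
pair. The all-pairs loop is quadratic in the number of documents, which is
exactly the part a MapReduce job would have to distribute for a large corpus.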

On 03.04.2013 17:09, Suneel Marthi wrote:
> Akshay,
> 
> If you are trying to determine document similarity using MapReduce, Mahout's 
> RowSimilarity may be useful here.
> 
> Have a look at the following thread:
> 
> http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results
> 
> 
> I tried this on a corpus of 2 million web sites and got good results.
> 
> Let us know if this works for you.
> 
> 
> 
> ________________________________
>  From: akshay bhatt <[email protected]>
> To: [email protected] 
> Sent: Wednesday, April 3, 2013 5:36 AM
> Subject: Integrating Mahout with existing nlp libraries 
>  
> I tried searching for it here and there, but could not find any good solution,
> so I thought of asking NLP experts. I am developing a text similarity
> application for which I need to match thousands and thousands of documents (of
> around 1000 words each) with each other. For the NLP part, my best bet is NLTK
> (given its capabilities and the algorithm friendliness of Python). But now that
> part-of-speech tagging by itself is taking so much time, I believe NLTK may
> not be the best fit. Java or C won't hurt me, so any solution will work for me.
> Please note, I have already started migrating from MySQL to HBase in order to
> work with more freedom on such a large amount of data. But the question
> remains: how to run the algorithms? Mahout may be a choice, but that too is for
> machine learning, not dedicated to NLP (it may be good for speech recognition).
> What other options are available? In short, I need high-performance NLP (a
> step down from high-performance machine learning). (I am inclined a bit
> towards Mahout, seeing future usage.)
> 
> (Already asked at
> http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives)
> 
