Akshay,

If you are trying to determine document similarity using MapReduce, Mahout's RowSimilarityJob may be useful here.
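To give a rough idea of what a row-similarity computation produces, here is a minimal single-machine sketch in plain Python. It is not Mahout code and not how the MapReduce job is implemented. It assumes each document is one "row" of term counts and uses cosine similarity as the measure. The function names (`term_vector`, `cosine`, `pairwise_similarity`) and the whitespace tokenizer are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Count terms after lower-casing and splitting on whitespace (naive tokenizer, assumed)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def pairwise_similarity(docs):
    """Return {(id_i, id_j): similarity} for every unordered pair of documents."""
    vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(sorted(vectors), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy corpus; real input would be the tokenized documents.
    corpus = {
        "d1": "mahout runs on hadoop",
        "d2": "mahout runs on hadoop clusters",
        "d3": "nltk is a python library",
    }
    for pair, score in pairwise_similarity(corpus).items():
        print(pair, round(score, 3))
```

This all-pairs loop is quadratic in the number of documents, which is the reason for moving the computation to MapReduce once the corpus gets large.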
Have a look at the following thread:
http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results

I tried this on a corpus of 2 million web sites and got good results. Let us know if this works for you.

________________________________
From: akshay bhatt <[email protected]>
To: [email protected]
Sent: Wednesday, April 3, 2013 5:36 AM
Subject: Integrating Mahout with existing nlp libraries

I tried searching for this here and there but could not find a good solution, so I thought of asking the NLP experts.

I am developing a text similarity application in which I need to match many thousands of documents (around 1000 words each) against each other. For the NLP part, my best bet was NLTK (given its capabilities and the algorithm-friendliness of Python). But now that part-of-speech tagging alone is taking so much time, I believe NLTK may not be the best fit. Java or C won't hurt me, so any solution will work for me.

Please note that I have already started migrating from MySQL to HBase in order to work more freely with such a large amount of data. But the question remains: how do I run the algorithms? Mahout may be a choice, but it is meant for machine learning and is not dedicated to NLP (it may be good for speech recognition). What other options are available?

In short, I need high-performance NLP (a step down from high-performance machine learning). I am inclined a bit towards Mahout, with future usage in mind.

(Already asked at http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives)
