Re: Integrating Mahout with existing nlp libraries

Suneel Marthi Wed, 03 Apr 2013 08:10:27 -0700

Akshay,

If you are trying to determine document similarity using MapReduce, Mahout's 
RowSimilarityJob may be useful here.
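
To make the idea concrete, here is a minimal in-memory sketch of what a row
similarity computation does: each document becomes one row (a term-frequency
vector) and every pair of rows is scored with cosine similarity. This is only
an illustration in plain Python, not Mahout's implementation; the function
names and the sample documents are assumptions made up for this example, and
at the scale described below the same computation would be run as a
distributed job rather than in a single process.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Build a sparse bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(docs):
    """Score every unordered pair of documents (rows) against each other."""
    vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(sorted(vectors), 2)
    }


# Hypothetical sample corpus, for illustration only.
docs = {
    "d1": "mahout runs on hadoop",
    "d2": "mahout runs on hadoop",
    "d3": "nltk tags parts of speech",
}
scores = pairwise_similarity(docs)
```

The all-pairs loop is quadratic in the number of documents, which is the
reason to hand this step to a MapReduce job once the corpus grows.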


Have a look at the following thread:

http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results


I tried this on a corpus of 2 million web sites and got good results.

Let us know if this works for you.



________________________________
 From: akshay bhatt <[email protected]>
To: [email protected] 
Sent: Wednesday, April 3, 2013 5:36 AM
Subject: Integrating Mahout with existing nlp libraries 
 
I searched around but could not find a good solution, so I thought of asking
the NLP experts. I am developing a text-similarity application in which I need
to match thousands and thousands of documents (of around 1000 words each)
against each other. For the NLP part, my best bet was NLTK (given its
capabilities and the algorithm-friendliness of Python). But now that
part-of-speech tagging by itself is taking so much time, I believe NLTK may not
be the best fit. Java or C won't hurt me, so any solution will work for me.
Please note that I have already started migrating from MySQL to HBase in order
to work more freely with such a large amount of data. But the question remains:
how do I run the algorithms? Mahout may be a choice, but it is meant for machine
learning and is not dedicated to NLP (it may be good for speech recognition).
What other options are available? In short, I need high-performance NLP (a step
down from high-performance machine learning). (I am inclined a bit towards
Mahout, with future usage in mind.)

(Already asked at
http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives)
