+1

I would like to take the lead on making this happen.
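
To make the idea concrete, here is a rough single-machine sketch of what the DocumentSimilarityJob proposed in the quoted message would compute: tokenize each document, turn it into a TF-IDF vector, and take the cosine similarity of every pair. It is plain Python and not Mahout code. The function names (tokenize, tfidf_vectors, cosine, document_similarities), the regex tokenizer, and the smoothed IDF weighting are all assumptions made for illustration; an actual job would run these steps as MapReduce passes over the corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption); any standard weighting would do here.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_similarities(docs):
    """Map each pair (i, j) with i < j to the cosine similarity of docs i and j."""
    vectors = tfidf_vectors(docs)
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
    }
```

Calling document_similarities on a list of strings returns a dict keyed by index pairs. The all-pairs loop is quadratic in the number of documents, which is the part a distributed job would need to replace for a large corpus.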
________________________________
From: Sebastian Schelter <[email protected]>
To: [email protected]
Sent: Wednesday, April 3, 2013 11:28 AM
Subject: Re: Integrating Mahout with existing nlp libraries

Thinking out loud here: It would be great to have a DocumentSimilarityJob
that is supplied a collection of documents and then applies the necessary
preprocessing (tokenization, vectorization, etc.) and computes document
similarities.

Could be a nice starter task to add something like this.

On 03.04.2013 17:09, Suneel Marthi wrote:
> Akshay,
>
> If you are trying to determine document similarity using MapReduce,
> Mahout's RowSimilarity may be useful here.
>
> Have a look at the following thread:
>
> http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results
>
> I tried this on a corpus of 2 million web sites and had good results.
>
> Let us know if this works for you.
>
> ________________________________
> From: akshay bhatt <[email protected]>
> To: [email protected]
> Sent: Wednesday, April 3, 2013 5:36 AM
> Subject: Integrating Mahout with existing nlp libraries
>
> I tried searching for it here and there but could not find any good
> solution, so I thought of asking NLP experts. I am developing a text
> similarity application for which I need to match thousands and thousands
> of documents (of around 1000 words each) with each other. For the NLP
> part, my best bet is NLTK (given its capabilities and the
> algorithm-friendliness of Python). But now that part-of-speech tagging by
> itself is taking so much time, I believe NLTK may not be the best fit.
> Java or C won't hurt me, so any solution will work for me.
>
> Please note, I have already started migrating from MySQL to HBase in
> order to work more freely with such a large amount of data. But the
> question remains: how to run the algorithms? Mahout may be a choice, but
> that too is for machine learning, not dedicated to NLP (it may be good
> for speech recognition). What other options are available?
>
> In short, I need high-performance NLP (a step down from high-performance
> machine learning). (I am inclined a bit towards Mahout, seeing future
> usage.)
>
> (already asked at -
> http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives)
