This is typically what Behemoth (https://github.com/DigitalPebble/behemoth) can be used for. It has a Mahout module that generates vectors in the same format as SparseVectorsFromSequenceFiles. Assuming the document similarity job itself can run on the same input as the clustering, you'd be able to use it in combination with the other Behemoth modules, e.g. import the documents, parse them with Tika, tokenize, do some NLP with GATE or UIMA, find the similarities with Mahout, send the results to SOLR, etc.
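To make the tokenize / vectorize / similarity steps concrete, here is a minimal single-machine sketch in plain Python. It does not use Behemoth's or Mahout's APIs; the function names (`tokenize`, `vectorize`, `cosine`, `pairwise_similarities`) are made up for illustration, and raw term counts with cosine similarity are an assumed choice of weighting and measure, not necessarily what a Mahout job would apply.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text):
    """Build a sparse term-frequency vector (term -> count)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def pairwise_similarities(docs):
    """Compare every pair of documents; docs maps doc id -> raw text."""
    vectors = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(sorted(vectors), 2)
    }


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    sample = {
        "a": "Mahout clusters documents",
        "b": "Mahout clusters documents with MapReduce",
        "c": "Completely unrelated text",
    }
    for pair, score in pairwise_similarities(sample).items():
        print(pair, round(score, 3))
```

The all-pairs loop is quadratic in the number of documents, which is the part a distributed job is meant to take over once the corpus grows.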
Julien

On 3 April 2013 16:28, Sebastian Schelter <[email protected]> wrote:

> Thinking loud here: It would be great to have a DocumentSimilarityJob
> that is supplied a collection of documents and then applies necessary
> preprocessing (tokenization, vectorization, etc) and computes document
> similarities.
>
> Could be a nice starter task to add something like this.
>
> On 03.04.2013 17:09, Suneel Marthi wrote:
> > Akshay,
> >
> > If you are trying to determine document similarity using MapReduce,
> > Mahout's RowSimiliarity may be useful here.
> >
> > Have a look at the following thread:-
> >
> > http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results
> >
> > I had tried this on a corpus of 2 million web sites and had good results.
> >
> > Let us know if this works for u.
> >
> > ________________________________
> > From: akshay bhatt <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, April 3, 2013 5:36 AM
> > Subject: Integrating Mahout with existing nlp libraries
> >
> > I tried searching for it here and there, but could not find any good
> > solution, so though of asking nlp experts. I am developing an text
> > similarity finding application for which I need to match thousands and
> > thousands of documents (of around 1000 words each) with each other. For
> > nlp part, my best bet is NLTK (seeing its capabilities and algorithm
> > friendlyness of python.But now when parts of speech tagging in itself
> > taking so much of time, I believe, nltk may not be best suitable. Java
> > or C won't hurt me, hence any solution will work for me. Please note, I
> > have already started migrating from mysql to hbase in order to work with
> > more freedom on such large number of data. But still question exists,
> > how to perform algos. Mahout may be a choice, but that too is for
> > machine learning, not dedicated for nlp (may be good for speech
> > recognition). What else are available options. In gist, I need high
> > performance nlp, (a step down from high performance machine learning).
> > (I am inclined a bit towards Mahout, seeing future usage).
> >
> > (already asked at -
> > http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives)

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
