Hi Joyce,

Mahout uses clustering algorithms to extract top terms or topics from document sets. It offers basically three types of algorithm for keyword extraction:

1) Collocation extraction: https://cwiki.apache.org/confluence/display/MAHOUT/Collocations
2) Clustering algorithms: it supports clustering algorithms such as k-means, fuzzy k-means, canopy, etc.
3) Latent Dirichlet Allocation: https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation

Mahout uses simple unsupervised (clustering) algorithms for keyword extraction, whereas I think OpenCalais uses supervised and deep semantic approaches. I think you are looking for a supervised (classification) algorithm for keyphrase extraction. I suggest looking at KEA (http://www.nzdl.org/Kea/download.html) and maui-indexer (http://code.google.com/p/maui-indexer/).

Thanks,
Vineet Yadav

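To make the collocation idea from point 1 a bit more concrete, here is a minimal sketch in plain Java. It is not Mahout code and does not use the Mahout API: the class name BigramCounter and its methods are made up for illustration, and it only counts raw frequencies of adjacent word pairs, which is a much cruder ranking than a statistical collocation score. It also runs in memory on a single string rather than on a document collection.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
public class BigramCounter {

    // Count adjacent word pairs (bigrams), ignoring case and punctuation.
    // Pairs that span a sentence boundary are counted too, to keep the sketch short.
    public static Map<String, Integer> countBigrams(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        Map<String, Integer> counts = new HashMap<>();
        String previous = null;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can produce an empty leading token
            }
            if (previous != null) {
                counts.merge(previous + " " + token, 1, Integer::sum);
            }
            previous = token;
        }
        return counts;
    }

    // Return up to n bigrams, most frequent first; ties are broken alphabetically.
    public static List<Map.Entry<String, Integer>> topBigrams(String text, int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(countBigrams(text).entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        String text = "New York is big. New York is busy.";
        for (Map.Entry<String, Integer> e : topBigrams(text, 3)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Frequent pairs such as "new york" rise to the top, which is the intuition behind collocation extraction. Common function-word pairs also score highly with raw counts, which is why real collocation tools rank candidates with a statistical measure instead.
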
On Thu, Feb 3, 2011 at 6:51 PM, Joyce Babu <[email protected]> wrote:
> Hi,
>
> I am new to Java and Machine Learning concept. I was searching for a method
> to extract keywords (like names of people, organization, places etc) from
> new stories sorted by relevance. I found several web services like
> OpenCalais that provide similar service, but they don't detect most of my
> terms. I have a list of approved keywords, and only need to detect from that
> list.
>
> I found out about Machine Learning and got interested in the concept. I
> read somewhere that the classification feature of mahout can be used for
> detecting keywords by classifying terms as keywords and non-keywords. I have
> been trying to learn mahout for the past 30 hours, but haven't reached
> anywhere. It is not useful to waste time trying to learn, if mahout is not
> the tool to solve my problem.
>
> Can someone provide details on using mahout for term extraction? Is it
> possible to do this with little to medium knowledge in Java? Is it an
> overkill to use mahout for this? Should I go for an NLP solution?
>
> Thanks,
> Joyce
