Hello Jeff,

2012/5/14 Jeff Eastman <[email protected]>:
> Look at ClusterIterator.iterate(). This will do clustering in memory without
> any Hadoop. ClusterIterator.iterateSeq will do clustering in a single
> process from/to Hadoop sequence files but without map/reduce.
> ClusterIterator.iterateMR uses full Hadoop to do clustering for the same
> algorithms (k-means, fuzzy-k, Dirichlet), all configured using
> ClusteringPolicy instances.

Thanks for the response. It's exactly what I need.

From what I can figure out (please correct me if I'm wrong), the scenario in my case will look like this:

- vectorize my documents and run ClusterIterator.iterate*() to get back a ClusterClassifier.
- call ClusterClassifier.classify(newDocumentVector) to get a list of probabilities as to which cluster my new document belongs to.

However, there are some issues that I can't get my head around. How do I make the new vector use the dictionary from my model, so that its terms are at the same positions and the classifier can correctly compute distances between the new vector and the model?

Another way to put it: doing online clustering with text documents will result in vectors that contain elements/terms that do not exist in the model. Doesn't this mean I will get an IndexOutOfBoundsException or some other exception when I try to classify()? Does Mahout offer some support for updating the model?

Thanks,

>
> On 5/14/12 8:34 AM, Ioan Eugen Stan wrote:
>>
>> Hi,
>>
>> Does Mahout offer online clustering out of the box using sequential
>> clustering (no MapReduce)? I'm looking over the code (trunk) and I
>> found ClusterClassifier but I can't figure out how that works. Any
>> examples or more docs on this topic?
>>
>> Thanks,
>
>

--
Ioan Eugen Stan
http://ieugen.blogspot.com/
*** http://bucharest-jug.github.com/ ***
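
To make the dictionary question concrete, here is a minimal sketch in plain Java of what "terms on the same positions" means. It is not Mahout code and uses no Mahout classes: FixedDictionaryVectorizer, buildDictionary and vectorize are hypothetical names made up for illustration, and dropping unknown terms is only one possible policy, assumed here so that the new vector keeps the same length as the vectors the model was built from.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: encodes tokens against a fixed dictionary.
final class FixedDictionaryVectorizer {

    // Builds a term -> position dictionary from the training documents only.
    static Map<String, Integer> buildDictionary(List<List<String>> trainingDocs) {
        Map<String, Integer> dictionary = new HashMap<>();
        for (List<String> doc : trainingDocs) {
            for (String term : doc) {
                // A new term gets the next free position; known terms keep theirs.
                dictionary.putIfAbsent(term, dictionary.size());
            }
        }
        return dictionary;
    }

    // Encodes a document as raw term counts using the positions from the dictionary.
    static double[] vectorize(List<String> tokens, Map<String, Integer> dictionary) {
        double[] vector = new double[dictionary.size()];
        for (String term : tokens) {
            Integer index = dictionary.get(term);
            if (index != null) {
                vector[index] += 1.0;
            }
            // Assumption: terms missing from the dictionary are skipped,
            // so the vector length never exceeds the model's cardinality.
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = buildDictionary(List.of(
                List.of("hadoop", "cluster"),
                List.of("cluster", "vector")));
        // "mahout" was never seen during training, so it is dropped.
        double[] v = vectorize(List.of("cluster", "cluster", "mahout"), dictionary);
        System.out.println(Arrays.toString(v)); // prints [0.0, 2.0, 0.0]
    }
}
```

With this policy a new document can never index past the end of the model's vectors, but terms that only show up after training are lost, which is why the question about updating the model still stands.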
