This is a question regarding the new KNN library that Ted Dunning and Dan Filimon are working on (as I understand it, it'll be in Mahout 0.8), so I hope this is the appropriate list for it rather than mahout-dev.

First off, it's great. I was looking for a streaming k-means library (or planning to write my own) to integrate with Storm and have, as with all things Mahout, been really impressed. Naturally taking the appropriate I'm-using-this-new-code-at-my-peril attitude, I had a few questions.

Right now I'm running streamingKMeans with the Twitter streaming API. When I iterate through each cluster using the FastProjectionSearch, I'm occasionally hitting a concurrent modification exception because (of course) I'm trying to perform the search while vectors are being added in a different thread. Do you have any plans to make the code more concurrency friendly, or is it more sensible to pause and wait for the FastProjectionSearch to finish before adding more vectors? Or am I totally missing something? As I understand it, there are performance implications to using concurrent collections. Is that why you're steering clear of them thus far?

Because I am clustering text, I have run into the issue Dan talked about here: https://github.com/dfilimon/knn/issues/1. I have found that clusters aren't too stable with a large(r) number of dimensions. I'm happy to play around with the math a little bit, but I'd love to hear if you've made any progress or have other suggestions.

What I'm trying to do is (roughly) cluster tweets relating to a topic so that I can look for patterns in the conversations. It would be preferable to keep many of the clusters consistent, so that I can monitor how they ebb and flow. If one cluster (for example) contains tweets about Obama with words like "socialist", "communist", "Kenya" and "Fox News", it would be preferable to keep that cluster relatively stable so that I could watch how that conversation changes. I realize this may require a number of cheats beyond traditional k-means, but I'd love to hear your suggestions.

Is there any way I can help?

Thank you so much!

Brandon Root
