This is a question regarding the new KNN library that Ted Dunning and Dan
Filimon are working on (as I understand it, it will be in Mahout 0.8), so I
hope this is the appropriate list for this question rather than mahout-dev.

First off, it's great. I was looking for a streaming k-means library (or
planning to write my own) to integrate with Storm and have -- as with all
things Mahout -- been really impressed. Naturally taking the appropriate
I'm-using-this-new-code-at-my-peril attitude, I have a few questions.

Right now I'm running streamingKMeans against the Twitter streaming API. When
I iterate through each cluster using the FastProjectionSearch, I
occasionally hit a ConcurrentModificationException because (of course) I'm
trying to perform the search while vectors are being added in a different
thread. Do you have any plans to make the code more concurrency friendly, or
is it more sensible to pause and wait for the FastProjectionSearch to finish
before adding more vectors? Or am I totally missing something? As I
understand it, there are performance implications to using concurrent
collections; is that why you're steering clear of them thus far?
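
To make the pause-and-wait option concrete, here is a minimal sketch of what
I mean. GuardedIndex is a hypothetical wrapper of my own, not part of the knn
library, and it guards a plain list rather than a real searcher; the only
point is that adds take an exclusive lock while scans share a read lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

// Hypothetical wrapper (an assumption for illustration, not a knn/Mahout class).
// It protects a non-thread-safe collection with a read/write lock.
final class GuardedIndex<T> {
    private final List<T> items = new ArrayList<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // The thread adding vectors takes the exclusive write lock.
    void add(T item) {
        lock.writeLock().lock();
        try {
            items.add(item);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Scans share the read lock: several may run at once, but none can
    // overlap with add(), so the iterator never sees a modification.
    void forEach(Consumer<? super T> visitor) {
        lock.readLock().lock();
        try {
            for (T item : items) {
                visitor.accept(item);
            }
        } finally {
            lock.readLock().unlock();
        }
    }

    int size() {
        lock.readLock().lock();
        try {
            return items.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The obvious cost is that a long scan blocks every add until it finishes,
which is exactly the trade-off I am asking about.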

Because I am clustering text, I have run into the issue Dan talked about
here https://github.com/dfilimon/knn/issues/1, and have found that clusters
aren't too stable with a large(r) number of dimensions. I'm happy to play
around with the math a little bit, but I'd love to hear if you've made any
progress or have other suggestions. What I'm trying to do is (roughly)
cluster tweets relating to a topic so that I can look for patterns in the
conversations. It would be preferable to keep many of the clusters
consistent, so that I can monitor how they ebb and flow. If one cluster (for
example) contains tweets about Obama with words like "socialist",
"communist", "Kenya" and "Fox News", it would be preferable to keep that
cluster relatively stable so that I could watch how that conversation
changes. I realize this may require a number of cheats beyond traditional
k-means, but I'd love to hear your suggestions. Is there any way I can help?
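
As an example of the kind of cheat I mean, here is a minimal sketch that
labels each centroid in a new snapshot with the most similar centroid from
the previous snapshot, so a conversation can be followed even if the
clustering itself shifts. CentroidMatcher, its method names and the
minSimilarity threshold are all my own assumptions, not anything the knn
library provides, and centroids are plain double[] arrays of equal length.

```java
// Hypothetical helper (an assumption for illustration, not a knn/Mahout class).
final class CentroidMatcher {

    // Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // For each current centroid, returns the index of the most similar
    // previous centroid, or -1 if none reaches minSimilarity (a new topic).
    // The matching is greedy and not one-to-one: two current centroids
    // may map to the same previous centroid.
    static int[] matchToPrevious(double[][] previous, double[][] current,
                                 double minSimilarity) {
        int[] match = new int[current.length];
        for (int c = 0; c < current.length; c++) {
            int best = -1;
            double bestSim = minSimilarity;
            for (int p = 0; p < previous.length; p++) {
                double sim = cosine(current[c], previous[p]);
                if (sim >= bestSim) {
                    bestSim = sim;
                    best = p;
                }
            }
            match[c] = best;
        }
        return match;
    }
}
```

This only relabels clusters after the fact; it does nothing to make the
clustering itself more stable, which is the part I would most like advice on.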

Thank you so much!

Brandon Root
