Hi there! Glad to see someone's using it! :D

On Fri, Dec 14, 2012 at 12:29 AM, Brandon Root <[email protected]> wrote:
> This is a question regarding the new KNN library that Ted Dunning and Dan
> Filimon are working on (as I understand it'll be in Mahout 0.8) so I hope
> this is the appropriate list for this question instead of mahout-dev.
>
> First off, it's great. I was looking for a streaming kmeans library (or
> writing my own) to integrate with storm and have -- as with all things with
> Mahout -- been really impressed. Naturally taking the appropriate
> I'm-using-this-new-code-at-my-peril attitude, I had a few questions.

Awww... :D

> Right now I'm running streamingKMeans with the twitter streaming api. When
> I iterate through each cluster using the FastProjectionSearch, I'm
> occasionally hitting a concurrent modification exception because (of
> course) i'm trying to perform the search while vectors are added in a
> different thread. Do you have any plans to make the code
> more concurrency friendly, or is it more sensible to pause and wait for the
> FastProjectionSearch to finish before adding more vectors. Or am I totally
> missing something? As I understand it, there are performance implications to
> using concurrent collections, is that why you're steering clear thus far?

There used to be a threaded version of streaming k-means, but I have
since discarded it. You can find it in older revisions of my repo, but
I think Ted's repo still has it.
The reason we're not using concurrent collections is that we want to
keep the serial case fast, and we'd like the code to work best in the
map-reduce scenario, where streaming k-means runs in a mapper in its
own process using just one thread.

The FastProjectionSearch that's in there right now, which fixes an
issue with an ArrayIndexOutOfBoundsException, is actually slower than
Ted would have liked since it uses collections rather than pairs of
arrays cleverly indexed into one another. From what I discussed with
Ted when replacing it, we'll eventually want a better-performing
version of that searcher, but there's still work to be done before
that.

I wouldn't completely exclude using concurrent collections (in the
sense of providing a searcher that does that), but for now, there are
other things that need to get done. :)
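To answer the practical question: until a concurrency-friendly searcher exists, the simplest safe approach is the one you suggested, serializing adds and searches with a lock. Here's a minimal sketch in Java; `GuardedSearcher` and its methods are hypothetical stand-ins for illustration, not part of the actual Mahout API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical wrapper that serializes adds and reads around a shared
// collection, so iterating can never race with a concurrent add and throw
// a ConcurrentModificationException. Names are illustrative, not Mahout API.
class GuardedSearcher {
    private final List<double[]> centroids = new ArrayList<>();
    private final ReentrantLock lock = new ReentrantLock();

    void add(double[] v) {
        lock.lock();
        try {
            centroids.add(v);
        } finally {
            lock.unlock();
        }
    }

    // Copy under the lock, then iterate the snapshot outside it; the
    // (possibly slow) search never blocks the thread adding vectors.
    List<double[]> snapshot() {
        lock.lock();
        try {
            return new ArrayList<>(centroids);
        } finally {
            lock.unlock();
        }
    }
}
```

Taking a snapshot keeps the lock hold time short; the trade-off is that the search sees slightly stale centroids, which is usually fine for monitoring.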

> Because I am clustering text, I have run into the issue Dan talked about
> here https://github.com/dfilimon/knn/issues/1, and have found that clusters
> aren't too stable with a large(er) number of dimensions. I'm happy to play
> around with the math a little bit, but I'd love to hear if you've made any
> progress or have other suggestions. What I'm trying to do is (roughly)
> cluster tweets relating to a topic so that I can look for patterns in the
> conversations. It would be preferable to keep many of the clusters
> consistent, so that I can monitor how they ebb or flow. If one cluster (for
> example) contains tweets about Obama with words like "socialist"
> "communist" "Kenya" and "Fox News" it would be preferable to keep that
> cluster relatively stable so that I could watch how that conversation
> changes. I realize this may require a number of cheats beyond traditional
> Kmeans, but I'd love to hear your suggestions. Is there any way I can help?

So, it turns out that I had a misconception about the paper. What I
was doing was this: knowing there are k clusters, I requested k
clusters from streaming k-means, thinking I could skip the ball
k-means step and still get decent results.

This isn't actually true. What you want to do is ask for K ~ k log n
clusters (theta of that), and then run a ball k-means step on the
resulting sketch. In the code, when we re-estimate the number of
clusters, we actually do:

estimatedNumClusters = max(estimatedNumClusters,
clusterLogFactor * log(numProcessedDatapoints))

From what I can tell, though, this doesn't increase fast enough to get
a good sketch, since we're only receiving points sequentially, so that
log will be small initially.
I'm also unsure why we use that particular estimate. The paper just
uses k log n, but we don't have the real k (although the initial
numClusters might be it). I want to try:

estimatedNumClusters = estimatedNumClusters * clusterLogFactor *
log(numProcessedDatapoints)

This, in my mind, is closer to what the paper says, although that
clusterLogFactor (which is 10) might be a bit too high.
Long story short, it's kind of ambiguous what numClusters actually
means when you call streaming k-means initially, and that needs some
clarification.
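For concreteness, here are the two re-estimation rules side by side. This is a sketch mirroring the formulas above, not the actual Mahout code, and the names are only illustrative:

```java
// Sketch of the two cluster-count re-estimation rules discussed above.
// Variable names mirror the email; this is not the actual Mahout code.
class ClusterCountEstimate {
    // Current rule: grows only like log(n), independent of the running estimate.
    static double currentRule(double estimated, double clusterLogFactor, long n) {
        return Math.max(estimated, clusterLogFactor * Math.log(n));
    }

    // Proposed rule: multiplies the running estimate, closer to the paper's
    // K ~ k log n, though clusterLogFactor = 10 may be too aggressive.
    static double proposedRule(double estimated, double clusterLogFactor, long n) {
        return estimated * clusterLogFactor * Math.log(n);
    }
}
```

Note how the current rule is a no-op once the running estimate exceeds clusterLogFactor * log(n), which is why the sketch stops growing early on.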

I see Ted gave you great advice. He told me to try those techniques
too and (eventually) see how clustering 20 newsgroups performs.

I've put this project a bit on hold due to school work... I should be
back doing more stuff next week.
But, please, try the code out and feel free to open issues! :D

Also, a demo of the Twitter conversations would be really neat!

> Thank you so much!
>
> Brandon Root
