Random low-dimensional projections tend to look like normal distributions. This is the central limit theorem at work: each projected coordinate is a weighted sum of many input coordinates. I think it is hard to diagnose anything from this.
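A minimal sketch of this effect, using synthetic data rather than the 20 newsgroups vectors. The Poisson term counts, the sizes, and the helper `excess_kurtosis` are all assumptions made up for illustration. The raw counts are far from normal, but each coordinate of a random 3D projection has excess kurtosis near zero, as a normal distribution would.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for sparse term-count vectors (an assumption, not the
# 20 newsgroups data): 1000 documents, 2000 terms, mostly zeros.
n_docs, n_terms = 1000, 2000
X = rng.poisson(0.05, size=(n_docs, n_terms)).astype(float)

# Random 3D projection: each output coordinate is a weighted sum of all terms.
R = rng.standard_normal((n_terms, 3)) / np.sqrt(n_terms)
projected = X @ R

def excess_kurtosis(v):
    """Fourth standardized moment minus 3; close to 0 for a normal distribution."""
    c = v - v.mean()
    return (c ** 4).mean() / (c ** 2).mean() ** 2 - 3.0

raw_kurtosis = excess_kurtosis(X.ravel())
projected_kurtosis = [excess_kurtosis(projected[:, j]) for j in range(3)]

print("raw term counts:", raw_kurtosis)
print("projected coordinates:", projected_kurtosis)
```

The raw counts should show a large excess kurtosis, while the projected coordinates should sit close to zero, which is why a scatter plot of such a projection tends to look like a Gaussian blob whatever the vectorization was.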
On the other hand, projections against the principal components tend to show more structure.

On Thu, Dec 27, 2012 at 11:53 AM, Dan Filimon <[email protected]> wrote:

> Hi!
>
> I'm finally getting back to work on Streaming KMeans! :)
> The last thing I did was experiment with different ways of vectorizing
> the 20 newsgroups data set and I wanted to project them in 3D and
> check out what I get.
>
> The result is pretty odd, but I get it regardless of the method I use
> to generate vectors.
> It looks like someone splashed a 2D normal distribution on a sphere.
>
> Here's an image from Ted's algorithm [2] and one from mine [3] using
> log term-frequency scoring.
> Ted's uses vectors of size 9000 with hashing (using
> StaticWordValueEncoder) while mine uses vectors of size ~90000 with a
> manual approach.
>
> I think the vectorization actually went okay for both algorithms, but
> maybe the projection is off?
>
> The shape is odd. What am I doing wrong? :/
>
> [1] https://gist.github.com/4391252
> [2] http://swarm.cs.pub.ro/~dfilimon/skm-mahout/ted-projected.png
> [3] http://swarm.cs.pub.ro/~dfilimon/skm-mahout/log-projected.png
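As a rough sketch of the principal-component point made in the reply, the following compares a random 2D projection with a projection onto the top two principal components. The data is synthetic (three Gaussian clusters in 1000 dimensions), and the cluster sizes, the 0.3 center scale, and the helper `between_cluster_fraction` are assumptions chosen for illustration; none of this is the Mahout code or the 20 newsgroups vectors. The known labels are used only to score the projections, not to compute them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic clustered data (an assumption standing in for document vectors):
# 3 clusters of 200 points in 1000 dimensions, unit noise around each center.
n_clusters, n_per_cluster, n_dims = 3, 200, 1000
centers = 0.3 * rng.standard_normal((n_clusters, n_dims))
labels = np.repeat(np.arange(n_clusters), n_per_cluster)
X = centers[labels] + rng.standard_normal((labels.size, n_dims))
Xc = X - X.mean(axis=0)

# Random 2D projection.
R = rng.standard_normal((n_dims, 2)) / np.sqrt(n_dims)
random_proj = Xc @ R

# Projection onto the top 2 principal components, via SVD of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_proj = Xc @ Vt[:2].T

def between_cluster_fraction(Y, labels):
    """Fraction of the total variance in Y that is explained by the cluster means."""
    Yc = Y - Y.mean(axis=0)
    total = (Yc ** 2).sum()
    between = sum(
        (labels == k).sum() * (Yc[labels == k].mean(axis=0) ** 2).sum()
        for k in np.unique(labels)
    )
    return between / total

random_score = between_cluster_fraction(random_proj, labels)
pca_score = between_cluster_fraction(pca_proj, labels)
print("random projection:", random_score)
print("principal components:", pca_score)
```

Under these assumptions the principal-component score should come out much closer to 1 than the random one, meaning the clusters separate in the PCA view while they mostly overlap in the random view.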
