Ted Dunning wrote:
On Mon, Nov 15, 2010 at 1:54 AM, Lance Norskog <[email protected]> wrote:
I have some questions about KMeans and clustering. I'm generating matrices
from recommendation data models.
What does "generating matrices" mean?
The matrix output is a pair of matrices. Each is a separate set of vectors, one
for each item and one for each user.
For this project I create a set of canopies with the CanopyClusterer
from the item matrix. Then, I run KMeans using the Canopy cluster set.
This approach is suggested in Mahout In Action, Section 9.1.5.
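
As a rough illustration of the canopy-then-KMeans pipeline described above, here is a minimal Python sketch. It is not Mahout's CanopyClusterer or KMeans code, only a simplified version of the general technique. The function names (canopy_centers, kmeans), the thresholds t1 and t2, the Euclidean distance measure, and the toy points are all assumptions made for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def canopy_centers(points, t1, t2):
    """Simplified canopy pass (assumes t1 > t2 >= 0).

    Take the first remaining point as a seed, use the centroid of the
    remaining points within t1 of the seed as the canopy center, then
    drop every point within t2 of the seed. Repeat until none are left.
    """
    remaining = [list(p) for p in points]
    centers = []
    while remaining:
        seed = remaining[0]
        members = [p for p in remaining if distance(seed, p) <= t1]
        centers.append(mean(members))
        remaining = [p for p in remaining if distance(seed, p) > t2]
    return centers


def kmeans(points, initial_centers, iterations=10):
    """Plain k-means (Lloyd's algorithm) seeded with the given centers."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: distance(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Hypothetical toy "item vectors": two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
seeds = canopy_centers(points, t1=3.0, t2=2.0)
final_centers = kmeans(points, seeds)
```

The canopy pass only chooses the number of clusters and their starting positions; the k-means loop then refines those positions against whichever vectors it is given.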
To decide whether the generated matrices have interesting data, I'm
generating and charting KMeans clusters. Next, I'm mapping all of the
vectors in the matrix to a nearest "corner" and then clustering those
corners.
This mapping sounds like assignment to a randomly generated cluster.
Why does clustering those corners give you any different results before or
after mapping vectors to the corners? Does the mapping change the corner?
Ah! I'm not using KMeans on random clusters; I'm using it on the canopy
output. I make the canopies from the training set. I then run KMeans on
the test set using the canopies from the training set. You mentioned
recently that this should come out very different. I also ran a random
"item vector" matrix using the same canopies, and its clusters look as
wrong as the KMeans output from the test set.
Now, to the corner concept. I quantize the training set vectors. The
output is just corners, and there may be several items at the same
corner. I then ran KMeans on the quantized vectors, again using the
canopies from the training set. In other words, I just made a
lower-information version of the training set and clustered it according
to the more precise canopies. This is what made the really crazy
heart-shaped spiral.
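
To make the "corner" quantization concrete, here is a minimal Python sketch. It assumes every coordinate lies in a known range (0 to 1 here) and that a "corner" means snapping each coordinate to whichever end of that range is nearer; the message does not spell out the actual quantization rule, so the functions nearest_corner and quantize and the sample vectors are hypothetical.

```python
from collections import Counter


def nearest_corner(vector, low=0.0, high=1.0):
    """Snap each coordinate to the nearer end of [low, high]."""
    mid = (low + high) / 2.0
    return tuple(high if x >= mid else low for x in vector)


def quantize(vectors, low=0.0, high=1.0):
    """Map every vector to its nearest hypercube corner."""
    return [nearest_corner(v, low, high) for v in vectors]


# Hypothetical item vectors with coordinates in [0, 1].
items = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.6]]
corners = quantize(items)

# Several items can land on the same corner, so count the occupancy.
occupancy = Counter(corners)
```

Under this rule, d-dimensional vectors collapse to at most 2^d distinct points, which is why the quantized set carries less information than the original training vectors.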
Oh well, thanks for your time.
Lance