Canopy is a bit fussy about its T1 and T2 parameters: if you set T2 too small, 
you will get one cluster for each input vector; too large, and you will get only 
one cluster for all vectors. T1 is less sensitive and only affects how many 
points near each cluster are included in its centroid calculation. My guess is 
you are in the first situation, with T2 too small, and that with the larger 
dataset you are creating more clusters than will fit in memory.
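
If you re-run Canopy, it may be worth widening the thresholds. Something along 
these lines (the values here are only placeholders you would need to tune 
against your data and distance measure; by convention T1 is the looser, larger 
threshold):

mahout canopy -i input/ -o output-canopy/ -t1 4.0 -t2 3.0 -ow

A larger T2 means fewer, bigger canopies, so the job has less state to hold in 
memory.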

How many clusters did you get from your small dataset? If the small set is a 
subset of the large set, you could always run Canopy over the small set to get 
your k-means initial cluster centers, then run the k-means iterations over the 
full dataset afterwards. You can also skip the Canopy step entirely when using 
k-means: include a -k parameter and k-means will sample that many initial 
cluster centers from your data and then run its iterations, as in the sketch 
below.
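
Something like this (the -k value and the initial-clusters path are just 
placeholders; when -k is given, the sampled centers are written to the -c path 
rather than read from it):

mahout kmeans -i input/ -o output-kmeans/ -c output-kmeans/initial-clusters -k 20 -x 10 -cl -ow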

Glad to hear MeanShift is working for you. It has similar scaling limitations 
to Canopy. I've been pleasantly surprised by its performance on problems I 
thought were out of scope for it. Don't know why it works on your larger 
dataset when Canopy fails though.

-----Original Message-----
From: Jure Jeseničnik [mailto:[email protected]] 
Sent: Wednesday, November 17, 2010 3:54 AM
To: [email protected]
Subject: Canopy memory consumption

Hi Guys.

What I'm trying to do is basic news clustering that will group news items about 
the same topic into clusters. I have the data in a database, so I took the 
following approach:

1. Wrote a small program that puts the data from the db into a Lucene index.

2. Created vectors from the index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2

3. Ran canopy to get the initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4. Ran kmeans to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow

5. Ran clusterdump to view the results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt

When I run this with about 1,000 records (8,000 distinct terms), the results 
are just perfect. I get exactly the clusters I want. The problems start when I 
try the same steps with a bit more data.

With 6,000 records (28,000 terms), or even half of that, the process fails at 
the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE 
variable on my local machine is set to 1024. I even tried running it on our 
development Hadoop cluster with approximately the same amount of memory, but it 
failed with the same error.

I realize that software needs a certain amount of memory to work properly, but 
I find it hard to believe that 1 GB is not enough to process a 3.1 MB file, 
which is the size of the vectors file produced by the second step. We're hoping 
to use this solution on hundreds of thousands of records, and I can't help but 
wonder what sort of hardware we'll need to process them if such memory 
consumption is normal.

Am I missing something here? Are there any other settings I should be taking 
into consideration?

And one more thing: I tried the MeanShift implementation and it seems to work 
fine with that much data.

Thanks.

Jure
