Hi Guys.

What I'm trying to do is basic news clustering that groups news articles about
the same topic into clusters. I have the data in a database, so I took the
following approach:

1. Wrote a small program that puts the data from the db into a Lucene index
(roughly what it does is sketched below the steps).

2. Created vectors from the index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link 
-n 2

3. Ran canopy to get the initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4. Ran k-means to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl 
-ow

5. Ran clusterdump to view the results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p 
output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt
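
For reference, the indexing program in step 1 boils down to something like the
following (Lucene 3.x API; the class and variable names are just illustrative,
but the field names match the -f text and -i link options above, and the text
field is analyzed with term vectors stored, which I understand lucene.vector
needs):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class NewsIndexer {

    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("newsindex")), config);

        // for each row read from the db: link (unique id) and article text
        addArticle(writer, "http://example.com/news/1", "article body ...");

        writer.close();
    }

    static void addArticle(IndexWriter writer, String link, String text)
            throws Exception {
        Document doc = new Document();
        // "link" is stored and not analyzed -- used as the document id (-i link)
        doc.add(new Field("link", link,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        // "text" is analyzed and stores term vectors (-f text);
        // lucene.vector reads the term vectors from this field
        doc.add(new Field("text", text,
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(doc);
    }
}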

When I run this with about 1000 records (8000 distinct terms), the results are
just perfect; I get exactly the clusters I want. The problems start when I try
the same steps with a bit more data.

With 6000 records (28000 terms), or even half of that, the process fails at the
canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE
variable on my local machine is set to 1024. I even tried running it on our
development Hadoop cluster with approximately the same amount of memory, but it
failed with the same error.
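
If it helps, this is how the heap is currently being set for the local run, and
what I understand would be the equivalent knob for the Hadoop jobs (the 2048m
value below is just an example of what I could bump it to, not something I've
verified fixes anything):

# local run: bin/mahout picks up MAHOUT_HEAPSIZE to size the JVM (-Xmx)
export MAHOUT_HEAPSIZE=1024
# on the cluster I assume the mapper/reducer heap would be raised like this
mahout canopy -Dmapred.child.java.opts=-Xmx2048m \
  -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow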

I realize that software needs a certain amount of memory to work properly, but
I find it hard to believe that 1 GB is not enough to process a 3.1 MB file,
which is the size of the vectors file produced by the second step. We're hoping
to use this solution on hundreds of thousands of records, and I can't help but
wonder what sort of hardware we'll need to process them if such memory
consumption is normal.

Am I missing something here? Are there any other settings that I should be
taking into consideration?

And one more thing: I tried the mean shift implementation and it seems to work
fine with that much data.

Thanks.

Jure
