Hi guys. What I'm trying to do is basic news clustering that groups news items about the same topic into clusters. I have the data in a database, so I took the following approach:
1. Wrote a small program that puts the data from the DB into a Lucene index (a simplified sketch of this program is at the end of this mail).
2. Created vectors from the index with the following command:
   mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2
3. Ran canopy to get the initial clusters:
   mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow
4. Ran k-means to perform the final clustering:
   mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow
5. Ran clusterdump to view the results:
   mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt

When I run this with about 1,000 records (8,000 distinct terms), the results are just perfect: I get exactly the clusters I want. The problems start when I try the same steps with a bit more data. With 6,000 records (28,000 terms), or even half of that, the process fails at the canopy step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE variable on my local machine is set to 1024. I even tried running it on our development Hadoop cluster with approximately the same amount of memory, but it failed with the same error.

I realize that software needs a certain amount of memory to work properly, but I find it hard to believe that 1 GB is not enough to process a 3.1 MB file, which is the size of the vector file produced by step 2. We're hoping to use this solution on hundreds of thousands of records, and I can't help but wonder what sort of hardware we'll need to process them if this kind of memory consumption is normal. Am I missing something here? Are there any other settings I should be taking into consideration?

One more thing: I tried the meanshift implementation and it seems to work fine with that much data.

Thanks.
Jure
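
P.S. In case it helps, step 1 boils down to roughly the following. This is a minimal sketch, not the actual program: it assumes a Lucene 3.x-style API and a hypothetical JDBC URL and "news" table with "link" and "body" columns; the real code reads our actual schema.

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class NewsIndexer {
  public static void main(String[] args) throws Exception {
    // "newsindex" is the index directory later passed to lucene.vector via -d.
    FSDirectory dir = FSDirectory.open(new File("newsindex"));
    IndexWriterConfig config =
        new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter writer = new IndexWriter(dir, config);

    // Hypothetical connection details and query; placeholders only.
    Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/newsdb", "user", "pass");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT link, body FROM news");

    while (rs.next()) {
      Document doc = new Document();
      // "link" is the stored id field referenced by -i link.
      doc.add(new Field("link", rs.getString("link"),
          Field.Store.YES, Field.Index.NOT_ANALYZED));
      // "text" is the field vectorized by -f text; it is indexed with
      // term vectors enabled, which lucene.vector needs, and not stored.
      doc.add(new Field("text", rs.getString("body"),
          Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
      writer.addDocument(doc);
    }

    rs.close();
    stmt.close();
    conn.close();
    writer.close();
    dir.close();
  }
}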
