A stack trace would have helped in pinpointing the exact error.
However, a large number of clusters can cause heap-space problems, especially if the vector dimension is also high. One option is to reduce the number of initial clusters; in my opinion, the best way to choose the initial clusters is Canopy Clustering (https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering), as sketched below.
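As a starting point, here is a minimal sketch of a canopy run, assuming the Mahout 0.7-era CLI; the paths and the T1/T2 thresholds are placeholders you would have to tune for your data:

------------------------------------------------
# T1 must be greater than T2; both values here are placeholders.
mahout canopy \
  -i ...tfidf-vectors/ \
  -o /tmp/canopy-centroids/ \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -t1 0.5 \
  -t2 0.3 \
  -ow
------------------------------------------------

You can then point kmeans at the canopy output via --clusters (e.g. /tmp/canopy-centroids/clusters-0-final; the exact directory name depends on the Mahout version) and drop --numClusters entirely, so the number of clusters is determined by the canopies rather than by random seeding.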
The other option is to reduce the dimension of the vectors, for example by pruning the dictionary with seq2sparse's frequency options (see the second sketch below).

PS: you are also providing numClusters twice:

--numClusters 1000 \
--numClusters 5 \
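For the dimension reduction, a minimal sketch of regenerating the vectors with seq2sparse pruning options; the input path and thresholds are placeholders, but dropping terms that appear in too many (or too few) documents should shrink the 200 000-term dictionary considerably:

------------------------------------------------
# Placeholder thresholds: drop terms seen fewer than 5 times in the corpus,
# in fewer than 2 documents, or in more than 50% of all documents.
mahout seq2sparse \
  -i .../sequence-files/ \
  -o .../tfidf-vectors-pruned/ \
  --minSupport 5 \
  --minDF 2 \
  --maxDFPercent 50
------------------------------------------------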
On 08-08-2012 10:42, Abramov Pavel wrote:

Hello,

I am trying to run the KMeans example on 15 000 000 documents (seq2sparse output). There are 1 000 clusters, a 200 000-term dictionary, and documents of 3-10 terms (titles). seq2sparse produces 200 files of 80 MB each.

My job failed with a Java heap space error. The 1st iteration passes while the 2nd iteration fails. In the map phase of buildClusters I see a lot of warnings, but it passes. The reduce phase of buildClusters fails with "Java heap space". I cannot increase reducer/mapper memory in Hadoop; my cluster is tuned well. How can I avoid this situation?

My cluster has 300 mappers and 220 reducers running on 40 servers, each with 8 cores and 12 GB RAM.

Thanks in advance! Here are the KMeans parameters:

------------------------------------------------
mahout kmeans -Dmapred.reduce.tasks=200 \
  -i ...tfidf-vectors/ \
  -o /tmp/clustering_results_kmeans/ \
  --clusters /tmp/clusters/ \
  --numClusters 1000 \
  --numClusters 5 \
  --overwrite \
  --clustering
------------------------------------------------

Pavel
