Hello, 

I am trying to run the KMeans example on 15,000,000 documents (seq2sparse output).
There are 1,000 clusters, a 200,000-term dictionary, and documents of 3-10 terms
(titles). seq2sparse produces 200 files of 80 MB each.
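For scale, a quick sanity check on those numbers (the per-document figure is my own rough inference from the totals above, not something I measured):

```python
# Rough sizes implied by the numbers above.
files, file_mb = 200, 80
docs = 15_000_000

corpus_mb = files * file_mb            # total seq2sparse output: 16,000 MB
corpus_gb = corpus_mb / 1024           # ~15.6 GB of tf-idf vectors
per_doc_kb = corpus_mb * 1024 / docs   # ~1.1 KB per document on average

print(f"{corpus_gb:.1f} GB total, ~{per_doc_kb:.1f} KB per document")
```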

My job fails with a Java heap space error. The 1st iteration passes, but the 2nd
iteration fails. During the map phase of buildClusters I see a lot of warnings,
but it completes; the reduce phase of buildClusters then fails with "Java heap
space".

I cannot increase the mapper/reducer memory in Hadoop; the cluster is already well tuned.

How can I avoid this situation? My cluster runs 300 mappers and 220 reducers on
40 servers (8 cores and 12 GB RAM each).
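For what it's worth, my rough guess at where the heap goes: if each cluster centroid densifies to a vector of 8-byte doubles over the full dictionary after the first iteration (an assumption on my part, not something I have verified in the Mahout source), the centroids alone would not fit in a default-sized task heap:

```python
# Hypothetical centroid memory estimate, assuming dense double vectors.
clusters = 1_000
terms = 200_000
bytes_per_double = 8

centroid_bytes = clusters * terms * bytes_per_double  # 1,600,000,000 bytes
centroid_gib = centroid_bytes / 2**30                 # ~1.5 GiB for centroids alone

print(f"dense centroids ~= {centroid_gib:.1f} GiB")
```

If that estimate is in the right ballpark, it would explain why the 1st iteration (sparse initial centroids) passes while the 2nd fails.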

Thanks in advance!

Here are the KMeans parameters:

------------------------------------------------
mahout kmeans -Dmapred.reduce.tasks=200 \
-i ...tfidf-vectors/  \
-o /tmp/clustering_results_kmeans/ \
--clusters /tmp/clusters/ \
--numClusters 1000 \
--maxIter 5 \
--overwrite \
--clustering
------------------------------------------------

Pavel
