Re: KMeans job fails during 2nd iteration. Java Heap space

Jeff Eastman Wed, 08 Aug 2012 05:16:08 -0700

Consider that each cluster retains 4 vectors in memory in each mapper and reducer, and that these vectors tend to become more dense (through addition of multiple sparse components) as iterations proceed. With 1000 clusters and 200k terms in your dictionary this will cause the heap space to be consumed rapidly, as you have noted. Sometimes you can work around this problem by increasing your heap size on a per-job basis or reducing the number of mappers and reducers allowed on each node. Also be sure you are not launching reducers until all of your mapper tasks have completed.
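
To put a rough number on this, here is a back-of-the-envelope sketch of the worst case. It assumes every retained vector has become fully dense and that each element is an 8-byte double; object overhead is ignored, and real vectors will usually stay sparser than this, so treat the result as an upper bound rather than a measurement.

```shell
# Worst-case estimate of cluster state held by one mapper or reducer.
# Assumptions: all vectors fully dense, 8 bytes per element, no object overhead.
clusters=1000
vectors_per_cluster=4
terms=200000
bytes_per_double=8

total_bytes=$((clusters * vectors_per_cluster * terms * bytes_per_double))
total_mb=$((total_bytes / 1024 / 1024))
echo "worst-case cluster state per task: ${total_mb} MB"
```

Under these assumptions the figure is about 6 GB per task, which is why densification over a few iterations can exhaust a heap that survived the first pass.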

In order to provide more help to you, it would be useful to understand more about how your cluster is "well tuned": how many mappers and reducers you are launching in parallel, the heap space limits set for tasks on each node, etc.
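
As a starting point, the figures in the original message quoted below (300 mappers and 220 reducers across 40 servers with 12 GB of RAM each) give a ceiling on what each task could get. This sketch assumes the task slots are spread evenly across the servers and that all RAM is available to tasks, neither of which is confirmed in the thread; the operating system and Hadoop daemons will take their share in practice.

```shell
# Per-task memory ceiling from the figures in the original message.
# Assumptions: slots evenly distributed, all RAM available to tasks.
mappers=300
reducers=220
servers=40
ram_mb_per_server=$((12 * 1024))

slots_per_server=$(( (mappers + reducers) / servers ))
heap_ceiling_mb=$((ram_mb_per_server / slots_per_server))
echo "task slots per server: ${slots_per_server}"
echo "upper bound on heap per task: ${heap_ceiling_mb} MB"
```

With 13 slots per server the ceiling comes out below 1 GB per task, so reducing the number of concurrent slots is the only way to give each task more heap on this hardware.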

For a quick test, try reducing k to 500 or 100 to see how many clusters you can reasonably compute with your dataset on your cluster. Canopy is also a good way to get a feel for a good initial k, though it can be hard to arrive at good T-values in some text clustering cases. You can also try hierarchical clustering with reduced k to stay under your memory limits.
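
To pick a first value of k to try, the same reasoning can be inverted. This is only a sketch: the 1024 MB heap budget is a hypothetical placeholder, and the calculation assumes 4 fully dense vectors of 8-byte doubles per cluster with nothing else on the heap, so it gives a pessimistic bound and larger k may work while the vectors remain sparse.

```shell
# Largest k whose worst-case (fully dense) cluster state fits a heap budget.
# heap_mb is a hypothetical value; substitute the limit set on your nodes.
heap_mb=1024
terms=200000
vectors_per_cluster=4
bytes_per_double=8

bytes_per_cluster=$((vectors_per_cluster * terms * bytes_per_double))
max_k=$((heap_mb * 1024 * 1024 / bytes_per_cluster))
echo "worst-case k that fits in ${heap_mb} MB: ${max_k}"
```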



On 8/8/12 5:40 AM, Paritosh Ranjan wrote:

A stack trace of the error would have helped in finding the exact cause.
However, the number of clusters can create heap space problems (if the vector dimension is also high). Either try to reduce the number of initial clusters (in my opinion, the best way to estimate the initial clusters is Canopy Clustering: https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)
or try to reduce the dimension of the vectors.

PS: you are also providing numClusters twice:

--numClusters 1000 \
--numClusters 5 \

On 08-08-2012 10:42, Abramov Pavel wrote:
Hello,
I am trying to run the KMeans example on 15 000 000 documents (seq2sparse output). There are 1 000 clusters, a 200 000-term dictionary, and documents of 3-10 terms (titles). seq2sparse produces 200 files of 80 MB each.
My job failed with a Java heap space error. The 1st iteration passes while the 2nd iteration fails. In the map phase of buildClusters I see a lot of warnings, but it passes. The reduce phase of buildClusters fails with "Java Heap space".
I cannot increase reducer/mapper memory in Hadoop. My cluster is tuned well.
How can I avoid this situation? My cluster has 300 mappers and 220 reducers running on 40 servers (8 cores, 12 GB RAM each).
Thanks in advance!

Here are the KMeans parameters:

------------------------------------------------
mahout kmeans -Dmapred.reduce.tasks=200 \
-i ...tfidf-vectors/  \
-o /tmp/clustering_results_kmeans/ \
--clusters /tmp/clusters/ \
--numClusters 1000 \
--numClusters 5 \
--overwrite \
--clustering
------------------------------------------------

Pavel
