Amir,
This has been reported before by several others, and it has been my experience too. The OOM happens during the canopy generation phase of Canopy clustering because that phase runs with only a single reducer.

If you are using Mahout 0.8 (or trunk), I suggest you look at the new Streaming KMeans clustering, which is quicker and more efficient than the traditional Canopy -> KMeans pipeline. See the following link for how to run Streaming KMeans:

http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means

On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied <[email protected]> wrote:

Hi,

I've been trying to run Mahout (with Hadoop) on our data for quite some time now. Everything is fine on relatively small data sets, but when I try to do K-Means clustering with the aid of Canopy on about 300000 documents, I can't even get past the canopy generation because of OOM. We're going to cluster similar news, so T1 and T2 are set to 0.84 and 0.6 (those values lead to the desired results on sample data).

I tried setting both "mapred.map.child.java.opts" and "mapred.reduce.child.java.opts" to "-Xmx4096M". I also exported HADOOP_HEAPSIZE as 4000, and I'm still having issues.

I'm running all of this in Hadoop's single-node, pseudo-distributed mode on a machine with 16GB of RAM.

Searching the Internet for solutions, I found this [1]. One of the bullet points states that:

"In all of the algorithms, all clusters are retained in memory by the mappers and reducers"

So my question is: does Mahout on Hadoop only help in distributing CPU-bound operations? What should one do if they have a large dataset and only a handful of low-RAM commodity nodes?

I'm obviously a newbie, thanks for bearing with me.

[1] http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%[email protected]%3E

Cheers,
Amir
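
P.S. To make the memory behaviour of canopy generation a bit more concrete, here is a minimal Python sketch of the general canopy technique with the T1/T2 thresholds from this thread. It is not Mahout's code: the names `cosine_distance` and `generate_canopies` are made up for illustration, and cosine distance is an assumption, since the thread does not say which distance measure was used. The point it tries to show is that every canopy center found so far has to stay in memory while the remaining points are scanned, so when few documents fall within T2 of an existing center, the number of retained canopies can grow towards the number of documents.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def generate_canopies(points, t1, t2, distance=cosine_distance):
    """Single-pass canopy generation (illustrative sketch, not Mahout's API).

    A point within t1 of a canopy center is counted as a member of that canopy.
    A point within t2 of any center is "strongly bound" and does not start
    a new canopy. Every other point becomes a new canopy center.
    """
    canopies = []  # all centers found so far are held in memory
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = distance(c["center"], p)
            if d < t1:
                c["members"] += 1
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append({"center": list(p), "members": 1})
    return canopies


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
    for canopy in generate_canopies(docs, t1=0.84, t2=0.6):
        print(canopy["center"], canopy["members"])
```

With two tight groups of vectors the sketch keeps only two centers, but if every document is far from all the others, each one becomes its own canopy and the list of centers is as large as the input.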
