Re: Avoiding OOM for large datasets

Suneel Marthi Wed, 04 Dec 2013 10:28:21 -0800

Amir,


This has been reported before by several others (and has been my experience 
too). The OOM happens during the canopy generation phase of Canopy clustering 
because that phase runs with only a single reducer.
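
To illustrate why that phase is memory-bound, here is a minimal sketch of
single-pass canopy generation. It is not Mahout's actual code: the function
names are hypothetical, and cosine distance is an assumption, since the
distance measure in use is not stated in this thread. The point is that
every canopy center found so far has to stay in memory, so the list grows
with the number of dissimilar documents.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two dense vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def generate_canopies(points, t1, t2, distance=cosine_distance):
    """Hypothetical single-pass canopy generation; requires t1 > t2.

    Returns a list of [center, count] pairs. The whole list is held in
    memory for the entire pass, which is where the heap pressure comes from.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    canopies = []
    for p in points:
        strongly_bound = False
        for canopy in canopies:
            d = distance(canopy[0], p)
            if d < t1:
                canopy[1] += 1          # point is loosely covered by this canopy
            if d < t2:
                strongly_bound = True   # too close to start a canopy of its own
        if not strongly_bound:
            canopies.append([p, 1])     # point becomes a new canopy center
    return canopies


if __name__ == "__main__":
    docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
    print(len(generate_canopies(docs, t1=0.84, t2=0.6)))
```

With T1 = 0.84 and T2 = 0.6, any document farther than 0.6 from every
existing center starts a new canopy, so a varied corpus can produce a very
large number of canopies, all in the heap of that one reducer.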

If you are using Mahout 0.8 (or trunk), I suggest that you look at the new 
Streaming KMeans clustering, which is quicker and more efficient than the 
traditional Canopy -> KMeans. 

See the following link for how to run Streaming KMeans.

http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means

On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied 
<[email protected]> wrote:
 
Hi,

I've been trying to run Mahout (with Hadoop) on our data for quite some time
now. Everything is fine on relatively small data sets, but when I try to do
K-Means clustering with the aid of Canopy on about 300,000 documents, I can't
even get past the canopy generation because of OOM. We're going to cluster
similar news, so T1 and T2 are set to 0.84 and 0.6 (those values lead to
the desired results on sample data).

I tried setting both "mapred.map.child.java.opts" and
"mapred.reduce.child.java.opts" to "-Xmx4096M". I also
exported HADOOP_HEAPSIZE as 4000, but I am still having issues.

I'm running all of this in Hadoop's single-node, pseudo-distributed mode on
a machine with 16GB of RAM.

Searching the Internet for solutions, I found this [1]. One of the bullet
points states that:

    "In all of the algorithms, all clusters are retained in memory by the
    mappers and reducers"

So my question is: does Mahout on Hadoop only help in distributing CPU-bound
operations? What should one do if they have a large dataset and only
a handful of low-RAM commodity nodes?

I'm obviously a newbie, thanks for bearing with me.

[1]
http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%[email protected]%3E

Cheers,

Amir
