Re: Avoiding OOM for large datasets

Amir Mohammad Saied Wed, 11 Dec 2013 04:04:28 -0800

Hi,

I first tried Streaming K-Means with about 5,000 news stories, and it worked
just fine. Then I ran it over 300,000 news stories and gave it 10GB of
RAM. After more than 43 hours, it was still in the last merge pass when I
eventually decided to stop it.


I set K to 200000 and KM to 2522308 (it's for detecting similar/related news
stories). With these values, is it expected to take this long?
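
For reference, the KM value above appears to match the commonly suggested
rule of thumb KM = K * ln(N), with N being the number of documents. That
reading is an assumption on my part, and the helper name estimate_km below is
purely illustrative, not part of Mahout. A minimal sketch of the arithmetic:

```python
import math

def estimate_km(k, n):
    """Rule-of-thumb sketch size (assumed here): k * ln(n), rounded."""
    return round(k * math.log(n))

k = 200000   # requested number of final clusters (from this message)
n = 300000   # approximate number of news stories (from this message)

km = estimate_km(k, n)
print(km)      # sketch size implied by the rule of thumb
print(km > n)  # whether the sketch size exceeds the number of documents
```

With these inputs the estimated sketch size is larger than the number of
documents, so the intermediate sketch may not end up much smaller than the
input itself.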

Cheers,

amir


On Thu, Dec 5, 2013 at 3:38 PM, Amir Mohammad Saied <[email protected]> wrote:

> Suneel,
>
> Thanks!
>
> I tried Streaming K-Means, and now I've two naive questions:
>
> 1) If I understand correctly, to use the results of streaming k-means I
> need to iterate over all of my vectors again and assign each one to the
> cluster whose centroid is closest to it, right?
>
> 2) In clustering news, the number of clusters isn't known beforehand. We
> used to use canopy as a fast approximate clustering technique, but as I
> understand it, streaming k-means requires "K" in advance. How can I avoid
> guessing K?
>
> Regards,
>
> Amir
>
>
>
> On Wed, Dec 4, 2013 at 6:27 PM, Suneel Marthi <[email protected]> wrote:
>
>> Amir,
>>
>>
>> This has been reported before by several others (and has been my
>> experience too). The OOM happens during the canopy generation phase of
>> Canopy clustering because it runs with only a single reducer.
>>
>> If you are using Mahout 0.8 (or trunk), I suggest you look at the new
>> Streaming KMeans clustering, which is quicker and more efficient than the
>> traditional Canopy -> KMeans.
>>
>> See the following link for how to run Streaming KMeans.
>>
>>
>> http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means
>>
>> On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied <
>> [email protected]> wrote:
>>
>> Hi,
>>
>> I've been trying to run Mahout (with Hadoop) on our data for quite some
>> time now. Everything is fine on relatively small data sets, but when I try
>> to do K-Means clustering with the aid of Canopy on about 300,000 documents,
>> I can't even get past the canopy generation because of OOM. We're going to
>> cluster similar news, so T1 and T2 are set to 0.84 and 0.6 (those values
>> lead to the desired results on sample data).
>>
>> I tried setting both "mapred.map.child.java.opts" and
>> "mapred.reduce.child.java.opts" to "-Xmx4096M". I also exported
>> HADOOP_HEAPSIZE set to 4000, and I am still having issues.
>>
>> I'm running all of this in Hadoop's single-node, pseudo-distributed mode
>> on a machine with 16GB of RAM.
>>
>> Searching the Internet for solutions, I found this [1]. One of the bullet
>> points states:
>>
>>     "In all of the algorithms, all clusters are retained in memory by the
>>     mappers and reducers"
>>
>> So my question is: does Mahout on Hadoop only help in distributing
>> CPU-bound operations? What should one do with a large dataset and only
>> a handful of low-RAM commodity nodes?
>>
>> I'm obviously a newbie, thanks for bearing with me.
>>
>> [1]
>>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%[email protected]%3E
>>
>> Cheers,
>>
>> Amir
>>
>
>
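
Question 1 in the quoted message of Dec 5 describes a second pass that assigns
every vector to the cluster with the closest centroid. A minimal sketch of
that pass follows. It assumes dense vectors stored as plain Python lists and
squared Euclidean distance; the thread does not say which distance measure was
used, and the function names are illustrative, not Mahout's API.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_to_nearest(vectors, centroids):
    """Return, for each vector, the index of its closest centroid."""
    assignments = []
    for v in vectors:
        # Linear scan over all centroids; the cost grows with
        # len(vectors) * len(centroids).
        best = min(range(len(centroids)),
                   key=lambda i: squared_distance(v, centroids[i]))
        assignments.append(best)
    return assignments

# Toy data, made up for illustration only.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[1.0, 0.5], [9.0, 11.0], [4.0, 4.0]]
print(assign_to_nearest(vectors, centroids))  # [0, 1, 0]
```

The linear scan is the simplest possible choice. With a very large number of
centroids, this pass alone can become expensive.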
