Hi,

I first tried Streaming K-means with about 5000 news stories, and it worked just fine. Then I tried it on 300,000 news stories and gave it 10GB of RAM. After more than 43 hours it was still in the last merge pass, and I eventually decided to stop it.
I set K to 200000 and KM to 2522308 (it's for detecting similar/related news stories). Using these values, is it expected to take so long?

Cheers,

amir

On Thu, Dec 5, 2013 at 3:38 PM, Amir Mohammad Saied <[email protected]> wrote:

> Suneel,
>
> Thanks!
>
> I tried Streaming K-Means, and now I've two naive questions:
>
> 1) If I understand correctly to use the results of streaming k-means I
> need to iterate over all of my vectors again and assign them to the cluster
> with the closest centroid to the vector, right?
>
> 2) In clustering news, the number of clusters isn't known beforehand. We
> used to use canopy as a fast approximate clustering technique, but as I
> understand streaming k-means requires "K" in advance. How can I avoid
> guessing K?
>
> Regards,
>
> Amir
>
> On Wed, Dec 4, 2013 at 6:27 PM, Suneel Marthi <[email protected]> wrote:
>
>> Amir,
>>
>> This has been reported before by several others (and has been my
>> experience too). The OOM happens during Canopy Generation phase of Canopy
>> clustering because it only runs with a single reducer.
>>
>> If you are using Mahout 0.8 (or trunk), suggest that u look at the new
>> Streaming Kmeans clustering which is a quicker and more efficient than the
>> traditional Canopy -> KMeans.
>>
>> See the following link for how to run Streaming KMeans.
>>
>> http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means
>>
>> On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied
>> <[email protected]> wrote:
>>
>> Hi,
>>
>> I've been trying to run Mahout (with Hadoop) on our data for quite sometime
>> now. Everything is fine on relatively small data sets, but when I try to do
>> K-Means clustering with the aid of Canopy on like 300000 documents, I can't
>> even get past the canopy generation because of OOM. We're going to cluster
>> similar news so T1, and T2 are set to 0.84, and 0.6 (those values lead to
>> desired results on sample data).
>>
>> I tried setting both "mapred.map.child.java.opts", and
>> "mapred.reduce.child.java.opts" to "-Xmx4096M", I also
>> exported HADOOP_HEAPSIZE to 4000, and still having issues.
>>
>> I'm running all of this in Hadoop's single node, pseudo-distributed mode on
>> a machine with 16GB of RAM.
>>
>> Searching Internet for solutions I found this[1]. One of the bullet points
>> states that:
>>
>> "In all of the algorithms, all clusters are retained in memory by the
>> mappers and reducers"
>>
>> So my question is, does Mahout on Hadoop only help in distributing CPU
>> bound operations? What one should do if they have a large dataset, and only
>> a handful of low-RAM commodity nodes?
>>
>> I'm obviously a newbie, thanks for bearing with me.
>>
>> [1]
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%[email protected]%3E
>>
>> Cheers,
>>
>> Amir
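
A note on the numbers at the top of this message and on question 1 in the quoted thread. The KM value of 2522308 matches K * ln(N) for K = 200000 and N = 300000, which is assumed here to be how it was chosen. The sketch below is plain Python, not Mahout code. It checks that arithmetic and shows the assignment pass described in question 1, where each vector goes to the cluster with the closest centroid. The function names and the squared Euclidean distance are illustrative assumptions; a cosine distance may suit news vectors better.

```python
import math


def estimated_map_clusters(k, n):
    """Return K * ln(N), rounded (assumed derivation of the KM value)."""
    return round(k * math.log(n))


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_nearest(vectors, centroids):
    """For each vector, return the index of its closest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda j: squared_euclidean(v, centroids[j]))
        for v in vectors
    ]


# K and N are the values from the message; the toy vectors are made up.
km = estimated_map_clusters(200_000, 300_000)
labels = assign_to_nearest(
    [(0.0, 0.1), (0.9, 1.0), (0.2, 0.0)],
    [(0.0, 0.0), (1.0, 1.0)],
)
```

A brute-force pass like this compares every vector against every centroid, so with 300,000 stories and K = 200000 it would take about 6 * 10^10 distance computations.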
