I tried it again with K=1000 and KM=12610, and it finished after about 16 hours. I'm running the MapReduce version on a single-node, pseudo-distributed Hadoop setup.
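(Side note: both KM values I've used so far look consistent with km ~ k * ln(n) for N = 300,000 documents. I'm assuming that is the intended relationship between the two parameters; the little check below is just that arithmetic, not Mahout code.)

    // Assumed relationship (not taken from the Mahout source): km ~ k * ln(n).
    // This only checks that the KM values used in this thread are consistent with it.
    public class KmCheck {
        public static void main(String[] args) {
            long n = 300000;                                // number of documents
            for (long k : new long[] {200000, 1000}) {
                long km = Math.round(k * Math.log(n));      // natural log
                System.out.println("k=" + k + " -> km ~ " + km);
            }
            // Prints km ~ 2522308 for k=200000 and km ~ 12612 for k=1000,
            // which lines up with the KM values of 2522308 and 12610 used here.
        }
    }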
How can I calculate a reasonable K for my clustering needs?

On Wed, Dec 11, 2013 at 1:34 PM, Ted Dunning <[email protected]> wrote:

> This is not right. The sequential version would have finished long before this for any reasonable value of k.
>
> I do note, however, that you have set k = 200,000 where you only have 300,000 documents. Depending on which value you set (I don't have the code handy), this may actually be increased inside the streaming k-means when it computes the number of sketch centroids, by a factor of roughly 2 log N \approx 2 * 18. This gives far more clusters than you have data points, which is silly.
>
> Try again with a more reasonable value of k, such as 1000.
>
>
> On Wed, Dec 11, 2013 at 7:02 AM, Amir Mohammad Saied <[email protected]> wrote:
>
> > Hi,
> >
> > I first tried Streaming K-means with about 5,000 news stories, and it worked just fine. Then I tried it over 300,000 news stories and gave it 10GB of RAM. After more than 43 hours, it was still in the last merge pass when I eventually decided to stop it.
> >
> > I set K to 200000 and KM to 2522308 (it's for detecting similar/related news stories). Using these values, is it expected to take so long?
> >
> > Cheers,
> >
> > amir
> >
> >
> > On Thu, Dec 5, 2013 at 3:38 PM, Amir Mohammad Saied <[email protected]> wrote:
> >
> > > Suneel,
> > >
> > > Thanks!
> > >
> > > I tried Streaming K-Means, and now I have two naive questions:
> > >
> > > 1) If I understand correctly, to use the results of streaming k-means I need to iterate over all of my vectors again and assign each one to the cluster whose centroid is closest to that vector, right?
> > >
> > > 2) In clustering news, the number of clusters isn't known beforehand. We used to use Canopy as a fast approximate clustering technique, but as I understand it, streaming k-means requires "K" in advance. How can I avoid guessing K?
> > >
> > > Regards,
> > >
> > > Amir
> > >
> > >
> > > On Wed, Dec 4, 2013 at 6:27 PM, Suneel Marthi <[email protected]> wrote:
> > >
> > >> Amir,
> > >>
> > >> This has been reported before by several others (and has been my experience too). The OOM happens during the Canopy Generation phase of Canopy clustering because it only runs with a single reducer.
> > >>
> > >> If you are using Mahout 0.8 (or trunk), I suggest that you look at the new Streaming KMeans clustering, which is quicker and more efficient than the traditional Canopy -> KMeans.
> > >>
> > >> See the following link for how to run Streaming KMeans:
> > >>
> > >> http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means
> > >>
> > >>
> > >> On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied <[email protected]> wrote:
> > >>
> > >> Hi,
> > >>
> > >> I've been trying to run Mahout (with Hadoop) on our data for quite some time now. Everything is fine on relatively small data sets, but when I try to do K-Means clustering with the aid of Canopy on about 300,000 documents, I can't even get past the canopy generation because of OOM. We're going to cluster similar news, so T1 and T2 are set to 0.84 and 0.6 (those values lead to the desired results on sample data).
> > >>
> > >> I tried setting both "mapred.map.child.java.opts" and "mapred.reduce.child.java.opts" to "-Xmx4096M", and I also exported HADOOP_HEAPSIZE as 4000, but I'm still having issues.
> > >>
> > >> I'm running all of this in Hadoop's single-node, pseudo-distributed mode on a machine with 16GB of RAM.
> > >>
> > >> Searching the Internet for solutions, I found this[1]. One of the bullet points states that:
> > >>
> > >> "In all of the algorithms, all clusters are retained in memory by the mappers and reducers"
> > >>
> > >> So my question is, does Mahout on Hadoop only help in distributing CPU-bound operations? What should one do if they have a large dataset and only a handful of low-RAM commodity nodes?
> > >>
> > >> I'm obviously a newbie, thanks for bearing with me.
> > >>
> > >> [1] http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%[email protected]%3E
> > >>
> > >> Cheers,
> > >>
> > >> Amir
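P.S. Regarding question 1 in the quoted thread above: as described there, the final step is just a pass over all vectors that assigns each one to its nearest centroid. Below is a minimal plain-Java sketch of that pass; the double[] vectors and squared Euclidean distance are stand-ins for illustration, not the Mahout types or the distance measure actually used here.

    import java.util.List;

    // Sketch of the post-clustering assignment pass from question 1:
    // every document vector gets the index of the closest centroid
    // produced by streaming k-means.
    public class NearestCentroidAssignment {

        static double squaredDistance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return sum;
        }

        // Returns, for each vector, the index of its closest centroid.
        static int[] assign(List<double[]> vectors, List<double[]> centroids) {
            int[] assignment = new int[vectors.size()];
            for (int v = 0; v < vectors.size(); v++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.size(); c++) {
                    double dist = squaredDistance(vectors.get(v), centroids.get(c));
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                assignment[v] = best;
            }
            return assignment;
        }
    }

Since each vector is handled independently, this pass parallelizes trivially (a single map-only job, or a plain loop for a dataset of this size).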
