I tried it again with K=1000 and KM=12610, and it finished after about 16 hours. I'm running the MapReduce version on a single-node, pseudo-distributed Hadoop setup.
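(Side note: both KM values I've used so far look consistent with km ~ k * ln(n) for N = 300,000 documents. I'm assuming that is the intended relationship between the two parameters; the little check below is just that arithmetic, not Mahout code.)

    // Assumed relationship (not taken from the Mahout source): km ~ k * ln(n).
    // This only checks that the KM values used in this thread are consistent with it.
    public class KmCheck {
        public static void main(String[] args) {
            long n = 300000;                                // number of documents
            for (long k : new long[] {200000, 1000}) {
                long km = Math.round(k * Math.log(n));      // natural log
                System.out.println("k=" + k + " -> km ~ " + km);
            }
            // Prints km ~ 2522308 for k=200000 and km ~ 12612 for k=1000,
            // which lines up with the KM values of 2522308 and 12610 used here.
        }
    }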
How can I calculate a reasonable K for my clustering needs?

On Wed, Dec 11, 2013 at 1:34 PM, Ted Dunning <[email protected]> wrote:

> This is not right. The sequential version would have finished long before this for any reasonable value of k.
>
> I do note, however, that you have set k = 200,000 where you only have 300,000 documents. Depending on which value you set (I don't have the code handy), this may actually be increased inside the streaming k-means when it computes the number of sketch centroids, by a factor of roughly 2 log N \approx 2 * 18. This gives far more clusters than you have data points, which is silly.
>
> Try again with a more reasonable value of k, such as 1000.
>
>
> On Wed, Dec 11, 2013 at 7:02 AM, Amir Mohammad Saied <[email protected]> wrote:
>
> > Hi,
> >
> > I first tried Streaming K-means with about 5,000 news stories, and it worked just fine. Then I tried it over 300,000 news stories and gave it 10GB of RAM. After more than 43 hours, it was still in the last merge pass when I eventually decided to stop it.
> >
> > I set K to 200000 and KM to 2522308 (it's for detecting similar/related news stories). Using these values, is it expected to take so long?
> >
> > Cheers,
> >
> > amir
> >
> >
> > On Thu, Dec 5, 2013 at 3:38 PM, Amir Mohammad Saied <[email protected]> wrote:
> >
> > > Suneel,
> > >
> > > Thanks!
> > >
> > > I tried Streaming K-Means, and now I have two naive questions:
> > >
> > > 1) If I understand correctly, to use the results of streaming k-means I need to iterate over all of my vectors again and assign each one to the cluster whose centroid is closest to that vector, right?
> > >
> > > 2) In clustering news, the number of clusters isn't known beforehand. We used to use Canopy as a fast approximate clustering technique, but as I understand it, streaming k-means requires "K" in advance. How can I avoid guessing K?
> > >
> > > Regards,
> > >
> > > Amir
> > >
> > >
> > > On Wed, Dec 4, 2013 at 6:27 PM, Suneel Marthi <[email protected]> wrote:
> > >
> > >> Amir,
> > >>
> > >> This has been reported before by several others (and has been my experience too). The OOM happens during the Canopy Generation phase of Canopy clustering because it only runs with a single reducer.
> > >>
> > >> If you are using Mahout 0.8 (or trunk), I suggest that you look at the new Streaming KMeans clustering, which is quicker and more efficient than the traditional Canopy -> KMeans.
> > >>
> > >> See the following link for how to run Streaming KMeans:
> > >>
> > >> http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means
> > >>
> > >>
> > >> On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied <[email protected]> wrote:
> > >>
> > >> Hi,
> > >>
> > >> I've been trying to run Mahout (with Hadoop) on our data for quite some time now. Everything is fine on relatively small data sets, but when I try to do K-Means clustering with the aid of Canopy on about 300,000 documents, I can't even get past the canopy generation because of OOM. We're going to cluster similar news, so T1 and T2 are set to 0.84 and 0.6 (those values lead to the desired results on sample data).
> > >>
> > >> I tried setting both "mapred.map.child.java.opts" and "mapred.reduce.child.java.opts" to "-Xmx4096M", and I also exported HADOOP_HEAPSIZE as 4000, but I'm still having issues.
> > >>
> > >> I'm running all of this in Hadoop's single-node, pseudo-distributed mode on a machine with 16GB of RAM.
> > >>
> > >> Searching the Internet for solutions, I found this[1]. One of the bullet points states that:
> > >>
> > >> "In all of the algorithms, all clusters are retained in memory by the mappers and reducers"
> > >>
> > >> So my question is, does Mahout on Hadoop only help in distributing CPU-bound operations? What should one do if they have a large dataset and only a handful of low-RAM commodity nodes?
> > >>
> > >> I'm obviously a newbie, thanks for bearing with me.
> > >>
> > >> [1] http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%[email protected]%3E
> > >>
> > >> Cheers,
> > >>
> > >> Amir
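P.S. Regarding question 1 in the quoted thread above: as described there, the final step is just a pass over all vectors that assigns each one to its nearest centroid. Below is a minimal plain-Java sketch of that pass; the double[] vectors and squared Euclidean distance are stand-ins for illustration, not the Mahout types or the distance measure actually used here.

    import java.util.List;

    // Sketch of the post-clustering assignment pass from question 1:
    // every document vector gets the index of the closest centroid
    // produced by streaming k-means.
    public class NearestCentroidAssignment {

        static double squaredDistance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return sum;
        }

        // Returns, for each vector, the index of its closest centroid.
        static int[] assign(List<double[]> vectors, List<double[]> centroids) {
            int[] assignment = new int[vectors.size()];
            for (int v = 0; v < vectors.size(); v++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.size(); c++) {
                    double dist = squaredDistance(vectors.get(v), centroids.get(c));
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                assignment[v] = best;
            }
            return assignment;
        }
    }

Since each vector is handled independently, this pass parallelizes trivially (a single map-only job, or a plain loop for a dataset of this size).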
