If k-means is trying to maintain too many clusters, then it will use way
more memory and run much more slowly.

That alone could be the genesis of the problem.
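
As a very rough back-of-envelope (assuming the job ends up holding one dense
double-precision centroid per cluster, and using the 6000-record / 28000-term
figures quoted below):

    # ~6000 near-singleton clusters x 28000 terms x 8 bytes per double
    echo $((6000 * 28000 * 8))    # 1344000000 bytes, i.e. roughly 1.3 GB

That is already past a 1 GB heap before any per-point bookkeeping is counted.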

2010/11/18 Jeff Eastman <[email protected]>

> 900 clusters from 1000 vectors seems unusual. I'd be looking for a
> clustering that produced maybe 5-10% of that. Looking over your parameters,
> I notice your T1 value is less than T2. This violates the T1 > T2
> expectation for both Canopy and Mean Shift, which is, apparently, not
> enforced. It probably should be, and this might be the source of your
> problems, but I'm not sure how it could cause a premature OOME.
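>
> For example, simply swapping the two values from your canopy command below
> would at least satisfy T1 > T2 (the thresholds themselves will still need
> tuning for your data):
>
>   mahout canopy -i input/ -o output-canopy/ -t1 1.4 -t2 1 -ow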
>
> In terms of using Mean Shift, I'd say the proof of the pudding is in the
> eating. If it gives you reasonable results and can handle your data, then
> it's all good. Canopy/k-means is more of a mainstream approach and *should*
> scale better. I'd be interested in seeing a stack trace of where Canopy is
> bombing on you. A gig of memory should be more than enough to run your
> 3.1 MB file using the sequential (-xm sequential) execution method, never
> mind using mapreduce!
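>
> Something along these lines (your original command below with the
> thresholds swapped so T1 > T2 and with -xm sequential appended) should
> reproduce the problem in a single JVM and give a stack trace that's easy
> to capture:
>
>   mahout canopy -i input/ -o output-canopy/ -t1 1.4 -t2 1 -ow -xm sequential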
>
> Any chance you could share your input vectors file?
>
> -----Original Message-----
> From: Jure Jeseničnik [mailto:[email protected]]
> Sent: Wednesday, November 17, 2010 11:02 PM
> To: [email protected]
> Subject: RE: Canopy memory consumption
>
> Hi Jeff
>
> Thank you for your answer. On a smaller scale I got around 10% fewer
> clusters than records (900 clusters from 1000 records). This corresponds
> with the actual data that I fed to Canopy; I even checked the results
> manually and it was almost exactly what I wanted. A bit more fiddling with
> T1 and T2 and it would have been exactly right.
> When I run MeanShift with the same T1 and T2, it is able to process the
> 6000 records with ease. In the cases where I was able to get Canopy+k-means
> through, the results seemed pretty similar to those that MeanShift gave me.
>
> Could MeanShift be the path I'm looking for, or is there a possibility of
> running into problems later?
>
> Regards,
>
> Jure
>
>
> -----Original Message-----
> From: Jeff Eastman [mailto:[email protected]]
> Sent: Thursday, November 18, 2010 1:02 AM
> To: [email protected]
> Subject: RE: Canopy memory consumption
>
> Canopy is a bit fussy about its T1 and T2 parameters: If you set T2 too
> small, you will get one cluster for each input vector; too large and you
> will get only one cluster for all vectors. T1 is less sensitive and will
> only impact how many points near each cluster are included in its centroid
> calculation.  My guess is you are in the first situation with T2 too small
> and, with the larger dataset, are creating more clusters than will fit into
> your memory.
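>
> A quick way to check which regime you are in is to point your clusterdump
> command (without the -p option) at the canopy output and count the cluster
> entries it prints:
>
>   mahout clusterdump -s output-canopy/clusters-0 -d dict.txt -dt text -b 100 -n 10 > canopies.txt
>
> If that count is close to the number of input vectors, T2 is too small.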
>
> How many clusters did you get from your small dataset? If the small set is
> a subset of the large set, you could always run Canopy over the small set
> to get your initial k-means cluster centers, then run the k-means
> iterations over the full dataset afterwards. You can also skip the Canopy
> step entirely when using k-means: include a -k parameter and k-means will
> sample that many initial cluster centers from your data and then run its
> iterations.
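>
> For example (the -k value here is just a first guess at the 5-10% figure I
> mentioned, and, if I recall the CLI correctly, when -k is given the -c
> directory is simply where the sampled initial centers get written, so any
> scratch path will do):
>
>   mahout kmeans -i input/ -o output-kmeans/ -c initial-clusters/ -k 300 -x 10 -cl -ow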
>
> Glad to hear MeanShift is working for you. It has similar scaling
> limitations to Canopy. I've been pleasantly surprised by its performance on
> problems I thought were out of scope for it. Don't know why it works on your
> larger dataset when Canopy fails though.
>
> -----Original Message-----
> From: Jure Jeseničnik [mailto:[email protected]]
> Sent: Wednesday, November 17, 2010 3:54 AM
> To: [email protected]
> Subject: Canopy memory consumption
>
> Hi Guys.
>
> What I'm trying to do is basic news clustering that will group news items
> about the same topic into clusters. I have the data in a database, so I
> took the following approach:
>
> 1. Wrote a small program that puts the data from the db into a Lucene
> index.
>
> 2. Created vectors from the index with the following command:
> mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i
> link -n 2
>
> 3. Ran Canopy to get the initial clusters:
> mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow
>
> 4. Ran k-means to perform the final clustering:
> mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10
> -cl -ow
>
> 5. Ran clusterdump to view the results:
> mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p
> output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt
>
> When I run this with about 1000 records (8000 distinct terms), the results
> are just perfect. I get exactly the clusters I want. The problems start
> when I try the same steps with a bit more data.
>
> With 6000 records (28000 terms), or even half of that, the process fails
> at the Canopy step with a Java heap space OutOfMemoryError. The
> MAHOUT_HEAPSIZE variable on my local machine is set to 1024. I even tried
> running it on our development Hadoop cluster with approximately the same
> amount of memory, but it failed with the same error.
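>
> (In case it matters, I'm setting that the usual way before invoking the
> mahout script, i.e. something like
>
>   export MAHOUT_HEAPSIZE=1024
>
> and the launcher appears to pick it up.)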
>
> I realize that software needs a certain amount of memory to work properly,
> but I find it hard to believe that 1 GB is not enough to process a 3.1 MB
> file, which is the size of the vectors file produced by the second step.
> We're hoping to use this solution on hundreds of thousands of records, and
> I can't help but wonder what sort of hardware we'll need in order to
> process them if such memory consumption is normal.
>
> Am I missing something here? Are there any other settings that I should be
> taking into consideration?
>
> And one more thing: I tried the MeanShift implementation and it seems to be
> working fine with that much data.
>
> Thanks.
>
> Jure
>
>
