If k-means is trying to maintain too many clusters, then it will use way more memory and run much more slowly. That alone could be the genesis of the problem.
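To put rough numbers on that (the dense-centroid assumption here is mine, not something measured in this thread): the small run produced ~900 clusters from 1000 records, so at the same rate the 6000-record set would yield roughly 5,400 clusters over 28,000 terms. If each cluster center is accumulated as a dense double vector, that is:

    28,000 terms x 8 bytes per double    ~= 224 KB per dense center
    5,400 centers x 224 KB               ~= 1.2 GB for the centers alone

which would already blow a 1 GB heap before counting any working copies.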
2010/11/18 Jeff Eastman <[email protected]>

> 900 clusters from 1000 vectors seems unusual. I'd be looking for a
> clustering that produced maybe 5-10% of that. Looking over your parameters,
> I notice your T1 value is less than T2. This violates the T1 > T2
> expectation for both Canopy and Mean Shift, which is, apparently, not
> enforced. It probably should be, and this might be the source of your
> problems, but I'm not sure how this could cause a premature OOME.
>
> In terms of using Mean Shift, I'd say the proof of the pudding is in the
> eating. If it gives you reasonable results and can handle your data, then
> it's all good. Canopy/k-means is more of a mainstream approach and *should*
> scale better. I'd be interested in seeing a stack trace of where Canopy is
> bombing on you. A gig of memory should be more than enough to run your
> 3.1 MB file using the sequential (-xm sequential) execution method, never
> mind using MapReduce!
>
> Any chance you could share your input vectors file?
>
> -----Original Message-----
> From: Jure Jeseničnik [mailto:[email protected]]
> Sent: Wednesday, November 17, 2010 11:02 PM
> To: [email protected]
> Subject: RE: Canopy memory consumption
>
> Hi Jeff,
>
> Thank you for your answer. On a smaller scale I got around 10% fewer
> clusters than there are records (900 clusters from 1000 records). This
> corresponds with the actual data that I fed to Canopy, and I even checked
> the results manually; it was almost exactly what I wanted. A bit more
> fiddling with T1 and T2 and it would have been it.
> When I run Mean Shift with the same T1 and T2, it is able to process
> 6000 clusters with ease. In the cases where I was able to get
> Canopy + k-means through, the results seemed pretty similar to those that
> Mean Shift gave me.
>
> Could Mean Shift be the path I'm looking for, or is there a possibility
> of running into problems later?
>
> Regards,
>
> Jure
>
>
> -----Original Message-----
> From: Jeff Eastman [mailto:[email protected]]
> Sent: Thursday, November 18, 2010 1:02 AM
> To: [email protected]
> Subject: RE: Canopy memory consumption
>
> Canopy is a bit fussy about its T1 and T2 parameters: if you set T2 too
> small, you will get one cluster for each input vector; too large and you
> will get only one cluster for all vectors. T1 is less sensitive and will
> only impact how many points near each cluster are included in its centroid
> calculation. My guess is you are in the first situation, with T2 too small,
> and, with the larger dataset, are creating more clusters than will fit into
> your memory.
>
> How many clusters did you get from your small dataset? If the small set is
> a subset of the large set, you could always run Canopy over the small set
> to get your k-means initial cluster centers, then run the k-means
> iterations over the full dataset afterwards. You can also skip the Canopy
> step entirely when using k-means: include a -k parameter and k-means will
> sample that many initial cluster centers from your data and then run its
> iterations.
>
> Glad to hear Mean Shift is working for you. It has similar scaling
> limitations to Canopy. I've been pleasantly surprised by its performance on
> problems I thought were out of scope for it. I don't know why it works on
> your larger dataset when Canopy fails, though.
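As a concrete illustration of the -k route Jeff describes above, a minimal sketch built from the commands later in this thread; the seed count of 50 is purely an illustrative guess at the number of topics, and using output-seeds/ as the -c directory for the sampled centers is an assumption, not something taken from the thread:

    # skip Canopy: sample 50 random initial centers into output-seeds/, then iterate
    mahout kmeans -i input/ -c output-seeds/ -k 50 -o output-kmeans/ -x 10 -cl -ow

With -k given, the contents of the -c directory are (as I understand the driver) generated by random sampling rather than read from a previous Canopy run, so the Canopy step and its memory cost drop out entirely.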
>
> -----Original Message-----
> From: Jure Jeseničnik [mailto:[email protected]]
> Sent: Wednesday, November 17, 2010 3:54 AM
> To: [email protected]
> Subject: Canopy memory consumption
>
> Hi guys,
>
> What I'm trying to do is basic news clustering that will group news items
> about the same topic into clusters. I have the data in a database, so I
> took the following approach:
>
> 1. Wrote a small program that puts the data from the DB into a Lucene
> index.
>
> 2. Created vectors from the index with the following command:
> mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2
>
> 3. Ran Canopy to get the initial clusters:
> mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow
>
> 4. Ran k-means to perform the final clustering:
> mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow
>
> 5. Ran clusterdump to view the results:
> mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt
>
> When I run this with approximately 1000 records (8000 distinct terms), the
> results are just perfect. I get exactly the clusters I want. The problems
> start when I try the same steps with a bit more data.
>
> With 6000 records (28000 terms), or even half of that, the process fails
> at the Canopy step with a Java heap space OutOfMemoryError. The
> MAHOUT_HEAPSIZE variable value on my local machine is 1024. I even tried
> running it on our development Hadoop cluster with approximately the same
> amount of memory, but it failed with the same error.
>
> I realize that software needs a certain amount of memory to work properly,
> but I find it hard to believe that 1 GB is not enough for processing a
> 3.1 MB file, which is the size of the vectors file produced by the second
> step. We're hoping to use this solution on hundreds of thousands of
> records, and I can't help but wonder what sort of hardware we'll need in
> order to process them if such memory consumption is normal.
>
> Am I missing something here? Are there any other settings I should be
> taking into consideration?
>
> And one more thing: I tried the Mean Shift implementation, and it seems to
> be working fine with that much data.
>
> Thanks,
>
> Jure
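Two concrete tweaks to the commands above, following Jeff's replies; this is a sketch to try rather than a verified fix, and 2048 is an arbitrary example value:

    # give the driver JVM more headroom (MAHOUT_HEAPSIZE appears to be in MB,
    # matching the 1024 mentioned above)
    export MAHOUT_HEAPSIZE=2048

    # Canopy expects T1 > T2; the run above had them reversed (-t1 1 -t2 1.4).
    # -xm sequential runs locally without MapReduce, as Jeff suggests.
    mahout canopy -i input/ -o output-canopy/ -t1 1.4 -t2 1 -ow -xm sequential

Even with the thresholds swapped, the values themselves still need tuning: per Jeff's note, a T2 that is too small for the data will still yield one canopy per input vector.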

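Finally, a sketch of the other route Jeff suggests: seed k-means from a Canopy run over the small, well-behaved subset, then iterate over the full data. The input-small/ and output-canopy-small/ paths are invented here for illustration:

    # Canopy over the ~1000-record subset only
    mahout canopy -i input-small/ -o output-canopy-small/ -t1 1.4 -t2 1 -ow -xm sequential

    # k-means over the full input, seeded with the subset's canopies
    mahout kmeans -i input/ -o output-kmeans/ -c output-canopy-small/clusters-0 -x 10 -cl -ow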