VL- means the cluster has converged, which is good. CL- means I have clusters which have not converged, i.e. I need to run more iterations or adjust my convergence threshold.
I don't use the commandline kmeans, rather I use the kmeansDriver api. I have runClustering set as true. Is this counting discrepancy just due to the fact I have not converged for some of my clusters -- so even though they are observed by a cluster they are not assigned to that cluster?

-----Original Message-----
From: Yuji NISHIDA@U-Tokyo [mailto:[email protected]]
Sent: Monday, September 10, 2012 12:46 PM
To: [email protected]
Subject: Re: mahout clusterdump output

Thank you for your kind explanation.
I added -cl option when conducting kmeans, so it seems no longer a problem.
But I also want to make sure that my clusterDump result shows "VL-", not "CL-".
Do you think this is correct output?

Best regards.

2012/9/11 Jeff Eastman <[email protected]>:
> I think the discrepancy between the number (n=) of vectors reported by the
> cluster and the number of points actually clustered by the -cl option is
> normal.
>
> In the final iteration, points are assigned to (observed by) (classified as)
> each cluster based upon the distance measure and the cluster center computed
> from the previous iteration. The (n=) value records the number of points
> "observed by" the cluster in that iteration.
> After the final iteration, a new cluster center is calculated for each
> cluster. This moves the center by some amount, less than the convergence
> threshold, but it moves.
> During the subsequent classification (-cl) step, these new centers are used
> to classify the points for output. This will inevitably cause some points to
> be assigned to (observed by) (classified as) a different cluster and so the
> output clusteredPoints will reflect this final assignment.
>
> In small, contrived examples, the clustering will likely be more stable
> between the final iteration and the output of clustered points.
>
> On 9/10/12 9:06 AM, Whitmore, Mattie wrote:
>
> Hi,
>
> I too am having this problem. I have a very small dimension space (3), and
> a lot of vectors (hundreds of millions). Therefore I can't print all to
> disk (I receive an OOM error). However, I can print 30 sample points
> easily, and doing so showed results similar to you (I "named" my vectors to
> be the number of vectors clusterDumper printed in the cluster):
>
> VL-50{n=0 c=[...] r=[]}
> Weight : [props - optional]: Point:
> 1.0: 1 = [...]
> 1.0: 2 = [...]
> ...
> 1.0: 10 = [...]
>
> --> note also radius is blank, whereas the points do have spread in all
> dimensions, this happened ONLY with converged clusters.
>
> CL-51{n=4 c=[...] r=[...]}
> Weight : [props - optional]: Point:
> 1.0: 1 = [...]
> 1.0: 2 = [...]
> ...
> 1.0: 6 = [...]
>
> As far as I understand the algorithm, problems which arise due to
> dimensionality are convergence problems. Basically, distance between points
> is "longer" as dimension increases (volume increases dramatically as
> dimension increases).
>
> This shouldn't affect clusterDumper, as clusterDumper simply reports on
> sequence files from a completed job. This is why the discrepancy is not
> making a lot of sense to me. Having more vectors within each cluster makes
> sense -- when I sum the printed n values, I receive a number magnitudes
> smaller than the number of vectors I clustered.
>
> I used Mahout v0.7, Hadoop 0.20.2-cdh3u3
>
>
> -----Original Message-----
> From: Yuji NISHIDA@U-Tokyo [mailto:[email protected]]
> Sent: Sunday, September 09, 2012 4:46 AM
> To: [email protected]
> Subject: Re: mahout clusterdump output
>
> Hi all
>
> I still want to confirm that this is not a problem.
> Especially the n value, I just hope it is not problematic...
>
> I discussed this in my lab, one of our members noted that the dimension of
> feature vectors and the number of vectors I used were very different.
> I have used 100 dimensions of vector and 600,000 vectors.
>
> Do you think it may cause some problems if I use both small dimensions and
> large number of vectors simultaneously and we need to make sure that there
> is relation between them (especially in number)?
> Or do you think 100 is too small for the dimension?
>
> I will appreciate very much that someone follows my question.
>
> Regards.
>
> 2012/8/4 Yuji NISHIDA@U-Tokyo <[email protected]>:
>
> Dear all
>
> I am working on mahout to use canopy and kmeans and got a problem
> about clusterdump output.
> Each vector has simple number incremented from 1 as its name.
>
> When I used 5,000 vectors, I got a correct output. It looks like:
>
> VL-0{n=64,c=[...], r[...]}
> 1.0: 1= [...]
> 1.0: 3= [...]
> 1.0: 4= [...]
> ...
> 1.0: 396= [...] # The number of vectors is exactly same as n(64).
> VL-1{n=5,c=[...], r[...]}
> 1.0: 2= [...]
> 1.0: 12= [...]
> ...
> 1.0: 4221= [...]
> VL-2{n=121,c=[...], r[...]}
> ...
>
> Each number of vectors in VL is exactly same as its n value.
>
> When I used 600,000 vectors, the output looks wrong like:
>
> VL-0{n=14,c=[...], r[...]}
> 1.0: 66636= [...]
> 1.0: 122570= [...]
> ...
> 1.0: 522794= [...] # The number of vectors is 31.
> VL-8{n=0,c=[...], r[...]}
> 1.0: 393539= [...]
> 1.0: 398877= [...]
> ...
> 1.0: 513448= [...] # The number of vectors is 5.
> VL-16{n=2,c=[...], r[...]}
> ...
>
> It looks VL-1 to VL-7 and VL-9 to VL-15 are not used but I confirmed
> them existing in the output.
> It seems using VL in order as 0,8,16,...,11552, 1,9,17,...,11553,
> 2,10,18... and so on.
>
> Can I believe this result or should I doubt this is caused by some bugs?
>
> Hadoop : 0.20.204
> Mahout : rev. 1351561, 1366995, 1367871
>
> Best regards.
>
> --
> nishidy@u-tokyo
>

--
nishidy@u-tokyo
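
For anyone reading this thread later: the effect Jeff describes (the n= value is counted against the previous iteration's centers, while the -cl output is classified against the recomputed centers) can be reproduced with a toy k-means step. The sketch below is plain Python, not Mahout code. The function names, the 1-D points and the starting centers are all made up for illustration, and the center shift is exaggerated so that a single point visibly changes cluster. In a real converged run the shift would be below the convergence threshold, but the mechanism is the same.

```python
def nearest(point, centers):
    """Index of the center closest to a 1-D point."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))


def observe(points, centers):
    """Assign each point to its nearest center; return per-cluster member lists."""
    members = [[] for _ in centers]
    for p in points:
        members[nearest(p, centers)].append(p)
    return members


def recompute(members, old_centers):
    """Mean of each cluster's members; an empty cluster keeps its old center."""
    return [sum(m) / len(m) if m else c for m, c in zip(members, old_centers)]


# Hypothetical data: six 1-D points and the centers left by the previous iteration.
points = [0.0, 1.0, 2.0, 4.2, 9.0, 10.0]
prev_centers = [0.0, 6.0]

# Final iteration: points are observed using the previous centers.
last_iteration = observe(points, prev_centers)
n_reported = [len(m) for m in last_iteration]  # analogous to the n= value

# The centers are then recomputed, so they move slightly.
final_centers = recompute(last_iteration, prev_centers)

# Classification step (analogous to -cl): points are assigned using the new centers.
n_classified = [len(m) for m in observe(points, final_centers)]

print("n reported by clusters:         ", n_reported)
print("points listed after classifying:", n_classified)
```

Here the point 4.2 is observed by the second cluster during the final iteration but is classified into the first cluster afterwards, so the per-cluster n and the number of listed points disagree even though the total number of points is unchanged.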
