RE: mahout clusterdump output

Whitmore, Mattie Mon, 10 Sep 2012 11:03:48 -0700

VL- means you have converged, which is good.  CL- means I have clusters which 
have not converged -- i.e. I need to run more iterations, or adjust my threshold.
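
To check how many clusters in a dump are VL- versus CL-, and what the printed n values sum to, something like the following could be used. This is only a sketch: the regular expression is inferred from the header lines quoted in this thread (for example "VL-50{n=0 c=[...] r=[]}"), and the names HEADER and summarize are made up for illustration; they are not part of Mahout.

```python
import re

# Matches cluster header lines such as "VL-50{n=0 c=[...] r=[]}" or
# "CL-51{n=4 c=[...] r=[...]}". The pattern is an assumption based on the
# dumps quoted in this thread, not on a documented clusterdump format.
HEADER = re.compile(r"^\s*(VL|CL)-(\d+)\{n=(\d+)")

def summarize(lines):
    """Count VL- and CL- cluster headers and sum their n values."""
    summary = {"VL": 0, "CL": 0, "n_total": 0}
    for line in lines:
        match = HEADER.match(line)
        if match:
            prefix, _cluster_id, n = match.groups()
            summary[prefix] += 1
            summary["n_total"] += int(n)
    return summary

sample = [
    "VL-50{n=0 c=[...] r=[]}",
    "        1.0:    1 = [...]",   # point line, ignored by the pattern
    "CL-51{n=4 c=[...] r=[...]}",
    "VL-0{n=64,c=[...], r[...]}",
]
print(summarize(sample))  # {'VL': 2, 'CL': 1, 'n_total': 68}
```

Comparing n_total with the number of input vectors gives a quick view of how large the discrepancy is.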


I don't use the command-line kmeans; rather, I use the KMeansDriver API.  I have 
runClustering set to true.  Is this counting discrepancy just due to the fact 
that some of my clusters have not converged -- so that even though points are 
observed by a cluster, they are not assigned to that cluster?
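
For what it's worth, the effect described in the quoted explanation below (n= is counted against the previous centers, while the clustered points are classified against the recomputed centers) can be reproduced with a toy 1-D sketch. This is not Mahout code; the function names and data are made up, and the center shift is deliberately exaggerated so the change is visible with six points.

```python
def nearest(point, centers):
    """Index of the center closest to a 1-D point."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))

def final_iteration(points, centers):
    """Observe points with the previous centers, then recompute each center."""
    members = [[] for _ in centers]
    for p in points:
        members[nearest(p, centers)].append(p)
    n = [len(m) for m in members]  # what the cluster would report as n=
    new_centers = [sum(m) / len(m) if m else c
                   for m, c in zip(members, centers)]
    return n, new_centers

def classify(points, centers):
    """Count the points assigned to each center (the classification step)."""
    counts = [0] * len(centers)
    for p in points:
        counts[nearest(p, centers)] += 1
    return counts

points = [0.0, 1.0, 4.0, 6.0, 9.0, 20.0]
previous_centers = [2.0, 8.0]

n, new_centers = final_iteration(points, previous_centers)
clustered = classify(points, new_centers)

print("n per cluster:        ", n)          # [3, 3]
print("clustered per cluster:", clustered)  # [4, 2]
```

Here the point 6.0 is observed by the second cluster in the final iteration but classified into the first once the centers move. Whether this accounts for a gap of several orders of magnitude is a separate question.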

-----Original Message-----
From: Yuji NISHIDA@U-Tokyo [mailto:[email protected]] 
Sent: Monday, September 10, 2012 12:46 PM
To: [email protected]
Subject: Re: mahout clusterdump output

Thank you for your kind explanation.
I added the -cl option when running kmeans, so it no longer seems to be a problem.

But I also want to make sure that my clusterDump result shows "VL-", not "CL-".
Do you think this is correct output?

Best regards.

2012/9/11 Jeff Eastman <[email protected]>:
> I think the discrepancy between the number (n=) of vectors reported by the
> cluster and the number of points actually clustered by the -cl option is
> normal.
>
> In the final iteration, points are assigned to (observed by) (classified as)
> each cluster based upon the distance measure and the cluster center computed
> from the previous iteration. The (n=) value records the number of points
> "observed by" the cluster in that iteration.
> After the final iteration, a new cluster center is calculated for each
> cluster. This moves the center by some amount, less than the convergence
> threshold, but it moves.
> During the subsequent classification (-cl) step, these new centers are used
> to classify the points for output. This will inevitably cause some points to
> be assigned to (observed by) (classified as) a different cluster and so the
> output clusteredPoints will reflect this final assignment.
>
> In small, contrived examples, the clustering will likely be more stable
> between the final iteration and the output of clustered points.
>
>
>
> On 9/10/12 9:06 AM, Whitmore, Mattie wrote:
>
> Hi,
>
> I too am having this problem.  I have a very small dimension space (3), and
> a lot of vectors (hundreds of millions).  Therefore I can't print all to
> disk (I receive an OOM error).  However, I can print 30 sample points
> easily, and doing so showed results similar to yours (I "named" my vectors to
> be the number of vectors clusterDumper printed in the cluster):
>
> VL-50{n=0 c=[...] r=[]}
>         Weight : [props - optional]:  Point:
>         1.0:    1 = [...]
>         1.0:    2 = [...]
>               ...
>         1.0:   10 = [...]
>
> --> note also radius is blank, whereas the points do have spread in all
> dimensions, this happened ONLY with converged clusters.
>
> CL-51{n=4 c=[...] r=[...]}
>         Weight : [props - optional]:  Point:
>         1.0:    1 = [...]
>         1.0:    2 = [...]
>               ...
>         1.0:    6 = [...]
>
> As far as I understand the algorithm, problems which arise due to
> dimensionality are convergence problems.  Basically, distance between points
> is "longer" as dimension increases (volume increases dramatically as
> dimension increases).
>
> This shouldn't affect clusterDumper, as clusterDumper simply reports on
> sequence files from a completed job.  This is why the discrepancy is not
> making a lot of sense to me.  Having more vectors within each cluster makes
> sense -- when I sum the printed n values, I get a number orders of magnitude
> smaller than the number of vectors I clustered.
>
> I used Mahout v0.7, Hadoop 0.20.2-cdh3u3
>
>
> -----Original Message-----
> From: Yuji NISHIDA@U-Tokyo [mailto:[email protected]]
> Sent: Sunday, September 09, 2012 4:46 AM
> To: [email protected]
> Subject: Re: mahout clusterdump output
>
> Hi all
>
> I still want to confirm that this is not a problem.
> Especially the n value, I just hope it is not problematic...
>
> I discussed this in my lab, one of our members noted that the dimension of
> feature vectors and the number of vectors I used were very different.
> I have used 100 dimensions of vector and 600,000 vectors.
>
> Do you think it may cause problems if I use a small number of dimensions
> together with a large number of vectors? Do we need to make sure that there
> is some relation between the two (especially in number)?
> Or do you think 100 is too small for the dimension?
>
> I would very much appreciate it if someone could follow up on my question.
>
> Regards.
>
> 2012/8/4 Yuji NISHIDA@U-Tokyo <[email protected]>:
>
> Dear all
>
> I am using Mahout's canopy and kmeans and ran into a problem
> with the clusterdump output.
> Each vector is named with a simple number incremented from 1.
>
> When I used 5,000 vectors, I got a correct output. It looks like:
>
> VL-0{n=64,c=[...], r[...]}
>     1.0: 1= [...]
>     1.0: 3= [...]
>     1.0: 4= [...]
>      ...
>     1.0: 396= [...]    # The number of vectors is exactly same as n(64).
> VL-1{n=5,c=[...], r[...]}
>     1.0: 2= [...]
>     1.0: 12= [...]
>     ...
>     1.0: 4221= [...]
> VL-2{n=121,c=[...], r[...]}
> ...
>
> The number of vectors in each VL is exactly the same as its n value.
>
> When I used 600,000 vectors, the output looks wrong:
>
> VL-0{n=14,c=[...], r[...]}
>     1.0: 66636= [...]
>     1.0: 122570= [...]
>     ...
>     1.0: 522794= [...]    # The number of vectors is 31.
> VL-8{n=0,c=[...], r[...]}
>     1.0: 393539= [...]
>     1.0: 398877= [...]
>     ...
>     1.0: 513448= [...]    # The number of vectors is 5.
> VL-16{n=2,c=[...], r[...]}
> ...
>
> It looks as if VL-1 to VL-7 and VL-9 to VL-15 are not used, but I confirmed
> that they exist in the output.
> The VLs seem to appear in the order 0,8,16,...,11552, 1,9,17,...,11553,
> 2,10,18... and so on.
>
> Can I trust this result, or should I suspect that it is caused by a bug?
>
> Hadoop : 0.20.204
> Mahout : rev. 1351561, 1366995, 1367871
>
> Best regards.
>
> --
> nishidy@u-tokyo
>
>
>



-- 
nishidy@u-tokyo
