Dear all,
I am using Mahout to run canopy and k-means clustering, and I have a
problem with the clusterdump output.
Each vector is named with a simple number, incremented from 1.
When I used 5,000 vectors, I got correct output. It looks like this:
VL-0{n=64,c=[...], r[...]}
1.0: 1= [...]
1.0: 3= [...]
1.0: 4= [...]
...
1.0: 396= [...] # The number of listed vectors is exactly the same as n (64).
VL-1{n=5,c=[...], r[...]}
1.0: 2= [...]
1.0: 12= [...]
...
1.0: 4221= [...]
VL-2{n=121,c=[...], r[...]}
...
In each VL, the number of listed vectors is exactly the same as its n value.
When I used 600,000 vectors, the output looks wrong, like this:
VL-0{n=14,c=[...], r[...]}
1.0: 66636= [...]
1.0: 122570= [...]
...
1.0: 522794= [...] # The number of listed vectors is 31, although n is 14.
VL-8{n=0,c=[...], r[...]}
1.0: 393539= [...]
1.0: 398877= [...]
...
1.0: 513448= [...] # The number of listed vectors is 5, although n is 0.
VL-16{n=2,c=[...], r[...]}
...
It looks as if VL-1 to VL-7 and VL-9 to VL-15 are not used, but I
confirmed that they exist in the output.
The clusters seem to be written in the order 0, 8, 16, ..., 11552, then
1, 9, 17, ..., 11553, then 2, 10, 18, ... and so on.
Can I trust this result, or should I suspect that it is caused by a bug?
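To make the mismatch easier to reproduce, the comparison between n and the number of listed vectors could be automated. The following is only a minimal sketch, assuming the plain-text clusterdump layout shown above (a header line starting with VL- or CL- that contains n=, followed by one "weight: name= [...]" line per vector). The names count_mismatches and sample are my own, and the sample data is made up for illustration; it is not real output.

```python
import re

# Assumed header layout: "VL-<id>{n=<count>,..." (or "CL-<id>{n=<count>,...").
CLUSTER_RE = re.compile(r"^\s*(?:VL|CL)-(\d+)\{n=(\d+)")
# Assumed point layout: "<weight>: <name>= [...]", e.g. "1.0: 396= [...]".
POINT_RE = re.compile(r"^\s*\d+(?:\.\d+)?\s*:\s*\S")


def count_mismatches(lines):
    """Return (cluster_id, n_reported, points_listed) for every cluster
    whose n value differs from the number of point lines that follow it."""
    clusters = []  # each entry: [cluster_id, n_reported, points_listed]
    for line in lines:
        header = CLUSTER_RE.match(line)
        if header:
            clusters.append([int(header.group(1)), int(header.group(2)), 0])
        elif clusters and POINT_RE.match(line):
            # A point line belongs to the most recent cluster header.
            clusters[-1][2] += 1
    return [tuple(c) for c in clusters if c[1] != c[2]]


# Made-up sample in the same shape as the output quoted above.
sample = """
VL-0{n=2,c=[1.0], r=[0.5]}
1.0: 1= [1.0]
1.0: 3= [1.5]
VL-8{n=0,c=[9.0], r=[]}
1.0: 7= [9.0]
"""

for cluster_id, n_reported, listed in count_mismatches(sample.splitlines()):
    print(f"VL-{cluster_id}: n={n_reported}, but {listed} vectors listed")
```

For a real dump, the lines of the clusterdump text file would be passed in instead of sample.splitlines(). An empty result would mean that every cluster lists exactly n vectors, as in the 5,000-vector run.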
Hadoop : 0.20.204
Mahout : rev. 1351561, 1366995, 1367871
Best regards.
--
nishidy@u-tokyo