Hi,
We have been using K-Means to cluster a fairly large dataset (just under 10
million 128-dimensional vectors of floating-point values - about 9.2GB in
space-delimited file format). We’re using Hadoop 2.2.0 and Mahout 0.9. The
dataset is first converted from the simple space-delimited format into
RandomAccessSparseVector format for K-Means using the
org.apache.mahout.clustering.conversion.InputDriver utility.
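For completeness, that conversion step is invoked along these lines (the input
path below is just a placeholder, and we are quoting the flags from memory, so
they may be slightly off):
./mahout org.apache.mahout.clustering.conversion.InputDriver -i
/lookandlearn/raw_text_vectors -o /lookandlearn/vectors_all -v
org.apache.mahout.math.RandomAccessSparseVector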
We’re not using Canopy clustering to determine the initial clusters, as we want
a specific number of clusters (100,000), so we let K-Means create the 100,000
initial random centroids:
./mahout kmeans -i /lookandlearn/vectors_all -c /data/initial_centres -o
/data/clusters_output -k 100000 -x 20 -ow -xm mapreduce
It all runs fine and we then extract the computed centroids using the
clusterdump utility:
./mahout clusterdump -i /data/clusters_output/clusters-1-final/ -o
./clusters.txt -of TEXT
The clusters.txt output file contains the expected 100,000 lines (one cluster
per line); however, there seem to be some idiosyncrasies in the output format…
If we add up the value of n for each cluster, which should be the number of
data points assigned to that cluster, we get a total of 39,160,754. We expected
this sum to equal the number of input points (9,769,004), since each input
point should belong to exactly one cluster, so we are not sure why it is nearly
four times larger.
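For reference, this is roughly how we compute that total: we simply scan each
line of clusters.txt for the n=<count> token and add the counts up (a quick
throwaway sketch, so the class and file names are arbitrary):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SumClusterCounts {
  public static void main(String[] args) throws Exception {
    // Each cluster line in the clusterdump TEXT output is expected to
    // contain an "n=<count>" token giving the number of points observed
    // by that cluster.
    Pattern nToken = Pattern.compile("n=(\\d+)");
    long totalPoints = 0;
    long clusterLines = 0;
    try (BufferedReader in = new BufferedReader(new FileReader("clusters.txt"))) {
      String line;
      while ((line = in.readLine()) != null) {
        Matcher m = nToken.matcher(line);
        if (m.find()) {
          totalPoints += Long.parseLong(m.group(1));
          clusterLines++;
        }
      }
    }
    System.out.println("clusters=" + clusterLines + " sum(n)=" + totalPoints);
  }
}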
We also notice that the cluster centroid and radius vectors appear in a couple
of different output formats. The majority use a simple comma-separated array
format, e.g.
c=[0.008, 0.006, 0.009, 0.014, 0.006, 0.003, 0.007, 0.005, 0.032, 0.004, 0.001,
0.003, 0.002, 0.002, 0.007, 0.017, 0.011, 0.002, 0.001, 0.014, 0.032, 0.015,
0.001, 0.002, 0.025, 0.007, 0.001, 0.007, 0.031, 0.004, 0.000, 0.005, 0.006,
0.003, 0.005, 0.029, 0.023, 0.001, 0.000, 0.005, 0.032, 0.007, 0.001, 0.009,
0.014, 0.002, 0.000, 0.004, 0.011, 0.001, 0.002, 0.010, 0.032, 0.017, 0.000,
0.002, 0.013, 0.019, 0.008, 0.009, 0.017, 0.005, 0.001, 0.003, 0.007, 0.005,
0.002, 0.014, 0.021, 0.002, 0.001, 0.005, 0.032, 0.006, 0.005, 0.014, 0.016,
0.003, 0.001, 0.004, 0.006, 0.000, 0.001, 0.005, 0.031, 0.026, 0.001, 0.002,
0.009, 0.002, 0.003, 0.004, 0.006, 0.015, 0.004, 0.006, 0.006, 0.002, 0.002,
0.006, 0.003, 0.001, 0.003, 0.009, 0.004, 0.002, 0.005, 0.018, 0.012, 0.001,
0.000, 0.002, 0.001, 0.000, 0.007, 0.016, 0.021, 0.006, 0.001, 0.000, 0.006,
0.003, 0.013, 0.012, 0.003, 0.002, 0.000, 0.001]
But there is also a significant number of clusters where the format appears to
be a sparse representation, with each value prefixed by its dimension index,
e.g.
c=[0:0.056, 1:0.006, 2:0.000, 3:0.000, 4:0.000, 5:0.000, 6:0.000, 7:0.004,
8:0.057, 9:0.002, 10:0.000, 11:0.000, 12:0.000, 13:0.000, 14:0.000, 15:0.005,
16:0.056, 17:0.004, 18:0.000, 19:0.000, 20:0.000, 23:0.002, 24:0.024, 25:0.009,
26:0.013, 27:0.005, 28:0.001, 29:0.001, 30:0.000, 31:0.000, 32:0.057, 33:0.006,
34:0.000, 35:0.000, 36:0.000, 37:0.000, 38:0.000, 39:0.002, 40:0.057, 41:0.007,
42:0.000, 43:0.000, 44:0.000, 45:0.000, 46:0.000, 47:0.004, 48:0.057, 49:0.008,
50:0.000, 51:0.000, 52:0.000, 55:0.001, 56:0.050, 57:0.007, 58:0.000, 59:0.000,
60:0.000, 61:0.000, 62:0.000, 63:0.001, 64:0.057, 65:0.003, 66:0.000, 67:0.000,
68:0.000, 69:0.000, 70:0.000, 71:0.006, 72:0.057, 73:0.004, 74:0.000, 75:0.000,
76:0.000, 77:0.000, 78:0.000, 79:0.009, 80:0.057, 81:0.003, 82:0.000, 83:0.000,
84:0.000, 87:0.006, 88:0.047, 89:0.004, 90:0.000, 91:0.000, 92:0.000, 93:0.000,
94:0.000, 95:0.006, 96:0.056, 97:0.005, 98:0.000, 99:0.000, 100:0.000,
101:0.000, 102:0.000, 103:0.003, 104:0.057, 105:0.003, 106:0.000, 107:0.000,
108:0.000, 109:0.000, 110:0.000, 111:0.006, 112:0.056, 113:0.000, 114:0.000,
115:0.000, 116:0.000, 117:0.000, 118:0.000, 119:0.008, 120:0.038, 121:0.001,
122:0.000, 123:0.000, 124:0.000, 125:0.000, 126:0.000, 127:0.006]
In this case, should values missing from the sparse format be interpreted as
0.0, e.g. the value for dimension 21 in the example above? We are also unsure
why zero values are still included in this output format (e.g. dimensions 2, 3
and 4 above), and it seems awkward that the clusterdump output mixes different
vector formats, as this makes the output more complex to parse.

We also find that if we set the clusterdump output format to CSV instead of
TEXT ("-of CSV") no output file is produced at all.
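In case it is useful to see what we mean, this is roughly how we are
interpreting the two centroid formats at the moment. It is only a sketch (the
class name and the 128-dimension assumption are ours), and it assumes that any
index missing from the sparse form should default to 0.0:

import java.util.Arrays;

public class CentroidParser {
  private static final int DIM = 128;

  // Parses the text between the brackets of a "c=[...]" (or "r=[...]") token.
  // Handles both the dense form ("0.008, 0.006, ...") and the sparse form
  // ("0:0.056, 1:0.006, ..."); indices absent from the sparse form are left
  // at the default of 0.0.
  static double[] parseVector(String body) {
    double[] v = new double[DIM];
    String[] parts = body.split(",");
    for (int i = 0; i < parts.length; i++) {
      String p = parts[i].trim();
      if (p.isEmpty()) {
        continue;
      }
      int colon = p.indexOf(':');
      if (colon >= 0) {
        // Sparse entry: "<index>:<value>"
        int idx = Integer.parseInt(p.substring(0, colon).trim());
        v[idx] = Double.parseDouble(p.substring(colon + 1).trim());
      } else {
        // Dense entry: the position in the list is the dimension index.
        v[i] = Double.parseDouble(p);
      }
    }
    return v;
  }

  public static void main(String[] args) {
    double[] dense = parseVector("0.008, 0.006, 0.009");
    double[] sparse = parseVector("0:0.056, 1:0.006, 23:0.002");
    System.out.println(Arrays.toString(Arrays.copyOf(dense, 4)));
    System.out.println(sparse[23] + " " + sparse[21]); // prints 0.002 0.0
  }
}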
Any information or feedback on the above would be greatly appreciated.
Regards,
Oisin.