Running kmeans on doc vectors turned into a DistributedRowMatrix works fine (no
surprise).
But when I run SSVD on the above input and then create U * Sigma as a
DistributedRowMatrix (IntWritable, VectorWritable), I get clusters in
clusters-xx-final, but the vectors in clusteredPoints have no IDs. Therefore the
clustered points cannot be tied back to the clusters that contain them or to
the original input documents.
To my eye the two input matrices look the same except for the weights, but A is
a sparse matrix and U is a dense matrix; I am not sure whether this matters.
Also, running rowsimilarity on the two matrices produces correct results with
vector IDs in the output, so is there something special about kmeans?
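On the sparse versus dense question: as far as the distance math goes, the storage format should not matter. The sketch below, again in plain Python rather than Mahout code, computes the same squared Euclidean distance from an {index: value} mapping (the format DRM A is dumped in) and from a dense list (the format the U * Sigma clusteredPoints are dumped in). The helper names and the toy vectors are hypothetical, and this says nothing about whether the kmeans driver treats the two vector types differently when it writes IDs.

```python
def to_dense(sparse, size):
    """Expand an {index: value} mapping into a dense list of length `size`."""
    dense = [0.0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense


def sparse_squared_distance(a, b):
    """Squared Euclidean distance between two {index: value} mappings."""
    indices = set(a) | set(b)
    return sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in indices)


def dense_squared_distance(a, b):
    """Squared Euclidean distance between two equal-length dense lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical toy vectors with cardinality 5.
p = {1: 0.5, 4: 0.25}
q = {0: 1.0, 4: 0.75}
size = 5

d_sparse = sparse_squared_distance(p, q)
d_dense = dense_squared_distance(to_dense(p, size), to_dense(q, size))
```

Both representations give the same distance, so any difference in the output between the two runs would have to come from how the vectors are wrapped or written, not from sparsity itself.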
===================================================================
Below are seqdumper snippets for clusteredPoints created from A and from U * Sigma.
clusteredPoints from kmeans on raw doc vectors turned into a DRM (DRM A):
Input Path: b/clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 810: Value: 1.0: [2:0.047, 4:0.044, 8:0.049, 9:0.041, 15:0.048, 23:0.042,
26:0.041, 38:0.047, 44:0.041, 50:0.041, 57:0.045, 58:0.046, 62:0.047, 87:0.062,
101:0.046, 106:0.048, 108:0.110, 113:0.047, 120:0.049, 135:0.045,
A bit from DRM A
Input Path: /Users/pat/Projects/big-data/b/doc-matrix/matrix
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.math.VectorWritable
Key: 0: Value: {2127:1.0}
Key: 1: Value:
{1:0.04140155813392109,23:0.04761729906397759,33:0.04140155813392109,35:0.03874202735546817,50:0.03318442428909763,69:0.04140155813392109,90:0.03993791262049265,100:0.04140155813392109,105:0.04140155813392109,119:0.03993791262049265,124:0.04140155813392109,133:0.036082496577015254,138:0.04140155813392109,143:
clusteredPoints from kmeans on the SSVD of raw doc vectors; the input to kmeans =
U * Sigma (DRM U)
Input Path: b/clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 810: Value: 1.0: [0.047, 0.032, -0.062, -0.132, -0.006, -0.076, 0.024,
0.001, -0.040, -0.031, -0.051, 0.058, 0.006, -0.002, 0.038, 0.040, 0.065,
-0.038, 0.013, -0.004]
Key: 810: Value: 1.0: [0.208, -0.074, -0.076, -0.039, 0.036, -0.066, 0.037,
-0.016, 0.008, -0.024,
A bit from DRM U (actually U * Sigma)
Input Path: /Users/pat/Projects/big-data/b/ssvd/U/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.math.VectorWritable
Key: 0: Value:
{0:-0.05851791792014975,1:0.0806831653032894,2:-0.04529094469362176,3:0.07412534594545293,4:-0.0014950001103841534,5:0.00858150208231669,6:0.08167911600523817,7:-0.044944387969145426,8:0.10480124786699137,9:-0.012858223284407562,10:-0.178659257217503,11:0.004960726322870974,12:-0.009355080152537257,13:-0.08287756217734399,14:-0.06421245242503033,15:0.034723492354354006,16:-0.04544718418425494,17:-0.03280318371313618,18:0.014036530324351837,19:-0.011233038447454465}