Running kmeans on doc vectors turned into a DistributedRowMatrix works fine (no 
surprise).

But when I run SSVD on the above input and then create U * Sigma as a 
DistributedRowMatrix (IntWritable, VectorWritable), I get clusters in 
clusters-xx-final, but in clusteredPoints the vectors have no IDs. Therefore the 
clustered points cannot be tied back to the clusters that contain them, or to 
the original input documents.
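
For reference, here is a minimal sketch of what I mean by U * Sigma: each 
column j of U is multiplied by the j-th singular value, and the row keys are 
carried through unchanged. It is plain Python on toy data, not Mahout code; 
the names u_rows, sigma and scale_rows_by_sigma are made up for illustration, 
and the numbers are not from my real matrices.

```python
# Hypothetical toy data: row key -> dense row of U. The keys stand in for
# the IntWritable keys of the DistributedRowMatrix.
u_rows = {
    0: [0.5, -0.5],
    1: [0.25, 0.75],
}
sigma = [4.0, 2.0]  # assumed singular values, one per column of U


def scale_rows_by_sigma(rows, singular_values):
    """Return U * Sigma: column j of every row multiplied by sigma[j]."""
    scaled = {}
    for key, row in rows.items():
        if len(row) != len(singular_values):
            raise ValueError(
                "row %r has %d columns, expected %d"
                % (key, len(row), len(singular_values))
            )
        # The row key is reused as-is, so row identity survives the scaling.
        scaled[key] = [value * s for value, s in zip(row, singular_values)]
    return scaled


u_sigma = scale_rows_by_sigma(u_rows, sigma)
for key in sorted(u_sigma):
    print(key, u_sigma[key])
```

The point is only that the scaling touches the values and not the keys, so 
the keys of U * Sigma should still line up with the keys of A.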

To my eye the two input matrices look the same except for the weights, but A is 
a sparse matrix and U is a dense matrix; I'm not sure if this matters. Also, 
running rowsimilarity on the two matrices produces correct results with 
vector IDs in the output, so is there something special about kmeans?

===================================================================

Below are seqdumper snippets for clusteredPoints created from A and U * Sigma

clusteredPoints from kmeans on raw doc vectors turned into a DRM  (DRM A) 

Input Path: b/clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 810: Value: 1.0: [2:0.047, 4:0.044, 8:0.049, 9:0.041, 15:0.048, 23:0.042, 
26:0.041, 38:0.047, 44:0.041, 50:0.041, 57:0.045, 58:0.046, 62:0.047, 87:0.062, 
101:0.046, 106:0.048, 108:0.110, 113:0.047, 120:0.049, 135:0.045,

A bit from DRM A

Input Path: /Users/pat/Projects/big-data/b/doc-matrix/matrix
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.math.VectorWritable
Key: 0: Value: {2127:1.0}
Key: 1: Value: 
{1:0.04140155813392109,23:0.04761729906397759,33:0.04140155813392109,35:0.03874202735546817,50:0.03318442428909763,69:0.04140155813392109,90:0.03993791262049265,100:0.04140155813392109,105:0.04140155813392109,119:0.03993791262049265,124:0.04140155813392109,133:0.036082496577015254,138:0.04140155813392109,143:

clusteredPoints from kmeans on SSVD of raw doc vectors; the input to kmeans = 
U * Sigma (DRM U)

Input Path: b/clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 810: Value: 1.0: [0.047, 0.032, -0.062, -0.132, -0.006, -0.076, 0.024, 
0.001, -0.040, -0.031, -0.051, 0.058, 0.006, -0.002, 0.038, 0.040, 0.065, 
-0.038, 0.013, -0.004]
Key: 810: Value: 1.0: [0.208, -0.074, -0.076, -0.039, 0.036, -0.066, 0.037, 
-0.016, 0.008, -0.024,

A bit from DRM U (actually U * Sigma)

Input Path: /Users/pat/Projects/big-data/b/ssvd/U/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.math.VectorWritable
Key: 0: Value: 
{0:-0.05851791792014975,1:0.0806831653032894,2:-0.04529094469362176,3:0.07412534594545293,4:-0.0014950001103841534,5:0.00858150208231669,6:0.08167911600523817,7:-0.044944387969145426,8:0.10480124786699137,9:-0.012858223284407562,10:-0.178659257217503,11:0.004960726322870974,12:-0.009355080152537257,13:-0.08287756217734399,14:-0.06421245242503033,15:0.034723492354354006,16:-0.04544718418425494,17:-0.03280318371313618,18:0.014036530324351837,19:-0.011233038447454465}
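
One workaround I can think of, sketched below in plain Python on toy data 
(not Mahout code): look each clustered vector up by its values in the input 
matrix to recover the row key. This assumes the vectors in clusteredPoints are 
exact copies of the input rows, and that the IntWritable key in clusteredPoints 
is a cluster ID (the repeated Key: 810 above suggests that, but I have not 
confirmed it). Duplicate rows would match more than one key. The function 
names and data are hypothetical.

```python
def index_rows_by_value(rows):
    """Map each row's values (as a tuple) to the list of row keys holding it."""
    index = {}
    for key, row in rows.items():
        index.setdefault(tuple(row), []).append(key)
    return index


def tie_back(clustered_points, rows):
    """Return (cluster_id, matching_row_keys) for each clustered point.

    Assumes clustered vectors are exact copies of input rows; a point with
    no exact match gets an empty list.
    """
    index = index_rows_by_value(rows)
    return [
        (cluster_id, index.get(tuple(vector), []))
        for cluster_id, vector in clustered_points
    ]


# Hypothetical toy data, not taken from the dumps above.
input_rows = {0: [0.1, -0.2], 1: [0.3, 0.4], 2: [0.1, -0.2]}
points = [(810, [0.3, 0.4]), (810, [0.1, -0.2]), (17, [9.0, 9.0])]

matches = tie_back(points, input_rows)
for cluster_id, row_keys in matches:
    print(cluster_id, row_keys)
```

This is clearly a hack, since it relies on exact value equality rather than on 
the IDs being carried through kmeans, which is what I would actually expect.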
