Hi everybody,
I am trying to run k-means clustering on my own data.
I modified NewsKMeansExample from the Mahout book to read some of my own
documents.
I can see that the following have been created correctly:
tokenized-documents/part-m-00000
df-count/part-r-00000
tf-vectors/part-r-00000
The numbers in these files match the input exactly.
The dictionary and frequency files are also OK.
However, the tfidf-vectors seem to have an empty vector for each document.
Reading them gives (e.g., for document id2):
id2 => {"class":"org.apache.mahout.math.SequentialAccessSparseVector","vector":"{\"values\":{\"indices\":[],\"values\":[],\"numMappings\":0},\"size\":4968,\"lengthSquared\":-1.0}"}
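One thing that may be worth checking, since the tf-vectors are populated but the tf-idf vectors are not, is whether document-frequency thresholds in the tf-idf step are pruning every term. The sketch below is not Mahout's code: it uses the textbook weight tf * log(N / df), and the names tfidf_vectors, min_df and max_df_percent are hypothetical, chosen only to illustrate how such thresholds could empty every vector on a small corpus.

```python
import math


def tfidf_vectors(tf_vectors, min_df=1, max_df_percent=100):
    """Weight term-frequency vectors by idf, skipping terms outside the df bounds.

    tf_vectors maps a document id to a {term: term_frequency} dict.
    min_df and max_df_percent are hypothetical thresholds for illustration.
    """
    num_docs = len(tf_vectors)

    # Document frequency: the number of documents that contain each term.
    df = {}
    for vec in tf_vectors.values():
        for term in vec:
            df[term] = df.get(term, 0) + 1

    # Largest document frequency a term may have and still be kept.
    max_df = max_df_percent / 100.0 * num_docs

    result = {}
    for doc_id, vec in tf_vectors.items():
        weighted = {}
        for term, tf in vec.items():
            if df[term] < min_df or df[term] > max_df:
                continue  # term pruned by the document-frequency thresholds
            weighted[term] = tf * math.log(num_docs / df[term])
        result[doc_id] = weighted
    return result


# Made-up toy corpus: three documents, each term occurs in at most two of them.
toy_tf = {
    "id1": {"apple": 2, "pear": 1},
    "id2": {"pear": 3, "plum": 1},
    "id3": {"plum": 1, "apple": 1},
}

# A min_df larger than any term's document frequency removes every term.
pruned = tfidf_vectors(toy_tf, min_df=5)

# With permissive thresholds the vectors keep their terms.
kept = tfidf_vectors(toy_tf, min_df=1, max_df_percent=100)
```

In this toy run every vector in pruned is empty while kept retains all terms, which resembles the symptom above. If the thresholds carried over from the original example were tuned for a much larger corpus, that could be one explanation, although this is only a guess.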
Clustering results in the following:
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
.....
Any help on how to get meaningful tf-idf vectors would be much
appreciated :)
Thanks!