tfidf vectors are generated without data

Royi Ronen Sun, 21 Aug 2011 13:34:51 -0700

Hi everybody,

I am trying to run k-means clustering on my own data.

I modified NewsKMeansExample from the Mahout book to read some of my
documents.

I can see that the following have been created correctly:

tokenized-documents/part-m-00000
df-count/part-r-00000
tf-vectors/part-r-00000

The numbers match the input exactly.
The dictionary and frequency files are also OK.

However, the tfidf-vectors seem to have an empty vector for each document.
Reading them gives (e.g., for document id2):

id2 =>
{"class":"org.apache.mahout.math.SequentialAccessSparseVector","vector":"{\"values\":{\"indices\":[],\"values\":[],\"numMappings\":0},\"size\":4968,\"lengthSquared\":-1.0}"}
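
To check whether every vector in the dump is empty, and not only id2, one option is to look at the numMappings field on each dumped line. The following is only a rough sketch: VectorDumpCheck and its methods are hypothetical names and not part of Mahout, and it assumes each dumped line has the JSON-like form shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Mahout: inspects one line of the
// textual vector dump and reports how many non-zero entries it claims.
class VectorDumpCheck {
    // Matches the numMappings field, with or without backslash-escaped quotes.
    private static final Pattern NUM_MAPPINGS =
        Pattern.compile("\\\\?\"numMappings\\\\?\"\\s*:\\s*(\\d+)");

    // Returns the numMappings value, or -1 if the field is not present.
    static int numMappings(String dumpLine) {
        Matcher m = NUM_MAPPINGS.matcher(dumpLine);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    // True when the dumped vector reports zero stored entries.
    static boolean isEmpty(String dumpLine) {
        return numMappings(dumpLine) == 0;
    }
}
```

Calling VectorDumpCheck.isEmpty on each line of the tfidf-vectors dump and on each line of the tf-vectors dump would show whether the entries disappear between those two steps.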

Clustering results in the following:

0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
.....

Any help on how to get meaningful tf-idf vectors would be much
appreciated :)

Thanks!
