Hi List,

I am trying to use Mahout to cluster text. The problem is that after
running SparseVectorsFromSequenceFiles, the dimension of the tf-idf
vectors is too high (about 50K), and it increases as the number of
documents increases. I think there are two ways to handle that. One is
to use dimension reduction. The other is to use a better tokenizer,
which I think is the better option.
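
To illustrate what I mean by a better tokenizer, here is a toy sketch in
plain Python. It is not Mahout code, and the corpus, the stopword list,
and the function names are all made up for the example. It only shows
that the tf-idf dimension is the number of distinct terms, and that a
stricter tokenizer yields fewer of them.

```python
import re

# Hypothetical toy corpus, not my real data.
DOCS = [
    "The cats are running in the garden",
    "A cat ran through the Garden!",
    "Dogs and cats: running, running, running.",
]

# Tiny illustrative stopword list (an assumption, not Mahout's or Lucene's).
STOPWORDS = {"the", "a", "are", "in", "and", "through"}


def naive_tokens(text):
    # Split on whitespace only, so "Garden!" and "garden" count as different terms.
    return text.split()


def stricter_tokens(text):
    # Lowercase, keep alphabetic runs only, drop stopwords and very short tokens.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def vocabulary_size(docs, tokenizer):
    # The tf-idf dimension equals the number of distinct terms in the corpus.
    vocab = set()
    for doc in docs:
        vocab.update(tokenizer(doc))
    return len(vocab)


naive_dim = vocabulary_size(DOCS, naive_tokens)
strict_dim = vocabulary_size(DOCS, stricter_tokens)
print("naive dimension:", naive_dim)
print("stricter dimension:", strict_dim)
```

The stricter tokenizer merges case and punctuation variants and drops
common words, so the vocabulary grows more slowly as documents are added.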

My questions are

1) How can I change the default tokenizer, or where can I find a different one?
2) Is there a third option for reducing the number of dimensions?

Thanks a lot.

-- 
Regards,
Jiaan
