Hi List, I am trying to use Mahout to cluster text. The problem is that after running SparseVectorsFromSequenceFiles, the dimension of the tf-idf vectors is too high (about 50K), and it grows as the number of documents increases. I think there are two ways to handle this. One is to use dimension reduction. The other is to use a better tokenizer, which I expect to be the better option.
My questions are:
1) How can I change the default tokenizer, or where can I find a different one?
2) Is there a third option for dealing with the number of dimensions?

Thanks a lot.

--
Regards,
Jiaan
