tokenizer for text

Jiaan Zeng Fri, 18 May 2012 07:15:49 -0700

Hi List,

I am trying to use Mahout to cluster text. The problem is that after
running SparseVectorsFromSequenceFiles, the dimension of the tf-idf
vectors is too high (about 50K), and it increases as the number of
documents increases. I think there are two ways to handle that. One
is to use dimension reduction. The other is to use a better
tokenizer, which should be the better option.
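
To illustrate why I expect the tokenizer to matter, here is a minimal
sketch in plain Python. It is not Mahout code: the function names, the
stop-word list, and the sample documents are all made up for
illustration. It only shows that the number of distinct terms (which is
the vector dimension) depends on how the text is tokenized.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}


def naive_tokenize(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def normalizing_tokenize(text):
    """Lowercase, keep alphabetic runs, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]


def vocabulary(docs, tokenize):
    """Distinct terms across all documents; its size is the vector dimension."""
    vocab = set()
    for doc in docs:
        vocab.update(tokenize(doc))
    return vocab


# Made-up sample documents.
docs = [
    "The cluster of documents is large.",
    "Clusters, documents and the Cluster!",
    "A document in the cluster.",
]

naive_dim = len(vocabulary(docs, naive_tokenize))
normalized_dim = len(vocabulary(docs, normalizing_tokenize))

print("naive tokenizer dimension:      ", naive_dim)
print("normalizing tokenizer dimension:", normalized_dim)
```

With the naive tokenizer, "Cluster!", "cluster." and "cluster" count as
three separate dimensions, so every new document tends to add new
terms. The normalizing version merges them, but "cluster" and
"clusters" still stay apart unless stemming is added as well.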


My questions are:

1) How can I change the default tokenizer, or where can I find a new one?
2) Is there a third option for dealing with the number of dimensions?

Thanks a lot.

-- 
Regards,
Jiaan
