What do you have in mind for a different tokenizer? Are you doing stopword filtering? If so, look at the stopword list and see whether there are other noise words you want to add. If you are using Lucene to filter stopwords, its default stopword list is pretty small (around 30 words). Stemming is another method often used to reduce your feature space. You could also look at lemmatization instead of stemming. It won't reduce the feature space as much, but it could help normalize different terms that share the same lemma.
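To give a rough feel for how much these two steps can shrink the vocabulary, here is a minimal sketch in plain Python. It does not use Lucene or Mahout: the stopword set, the extra noise words, the toy documents, and the `crude_stem` helper are all made up for illustration, and `crude_stem` is a naive suffix stripper, not a real stemmer. In an actual pipeline you would rely on Lucene filters such as StopFilter and PorterStemFilter instead.

```python
import re

# Hypothetical stopword list: a small default set plus corpus-specific noise words.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "it", "for"}
EXTRA_NOISE = {"thanks", "regards"}


def tokenize(text):
    """Lowercase the text and split it on runs of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def crude_stem(token):
    """Toy suffix stripper for illustration only (NOT the Porter algorithm)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vocabulary(docs, stopwords=frozenset(), stem=False):
    """Collect the distinct terms (the feature space) across all documents."""
    vocab = set()
    for doc in docs:
        for tok in tokenize(doc):
            if tok in stopwords:
                continue
            vocab.add(crude_stem(tok) if stem else tok)
    return vocab


docs = [
    "The cluster is clustering the documents",
    "Clusters of documents and a document",
    "Thanks for the tokens and the token",
]

raw = vocabulary(docs)
filtered = vocabulary(docs, STOPWORDS | EXTRA_NOISE)
stemmed = vocabulary(docs, STOPWORDS | EXTRA_NOISE, stem=True)

# Vocabulary size: raw, after stopword filtering, after filtering plus stemming.
print(len(raw), len(filtered), len(stemmed))  # prints: 14 7 3
```

On a real corpus the reduction will be far less dramatic than on these three toy sentences, but the direction is the same: extending the stopword list removes noise terms, and stemming collapses inflected forms onto one dimension.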
You can put together your own Lucene analyzer with whatever Lucene filter pipeline you want and plug it into SparseVectorsFromSequenceFiles to replace the stock tokenizer.

On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng <[email protected]> wrote:
> Hi List,
>
> I am trying to use Mahout to do cluster on text. The problem is after
> running the procedure SparseVectorsFromSequenceFiles, the dimension of
> tf-idf vector is too high (about 50K) and it increases as the number
> of document increases. I think there are two ways to handle that. One
> is to use dimension reduction. The other one is to used a better
> tokenizer which should be the better option.
>
> My questions are
>
> 1) how can I change the default tokenizer? or where can I find a new one?
> 2) Is there a third option for me to deal with the number of dimension?
>
> Thanks a lot.
>
> --
> Regards,
> Jiaan
>

--
Thanks,
John C
