In addition, you could try increasing the word occurrence thresholds with the -s and -md options.
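
To make the effect of those thresholds concrete, here is a rough sketch in plain Python (not Mahout code). It assumes -s acts as a minimum total count of a term across the corpus and -md as a minimum number of documents containing the term; that is my reading of the options, so check it against your Mahout version. The function name prune_vocabulary and the toy documents are made up for illustration.

```python
from collections import Counter


def prune_vocabulary(docs, min_support=2, min_df=1):
    """Return the sorted terms that survive both thresholds.

    min_support: assumed meaning of -s, minimum total occurrences in the corpus.
    min_df:      assumed meaning of -md, minimum number of documents containing the term.
    """
    total = Counter()  # occurrences of each term over the whole corpus
    df = Counter()     # number of documents each term appears in
    for tokens in docs:
        total.update(tokens)
        df.update(set(tokens))  # count each term once per document
    return sorted(t for t in total if total[t] >= min_support and df[t] >= min_df)


# Hypothetical, already-tokenized toy corpus.
docs = [
    "mahout clusters text documents".split(),
    "text clustering with mahout".split(),
    "rare typo mahuot mahuot".split(),
]

print(len(prune_vocabulary(docs, min_support=1, min_df=1)))  # no pruning
print(prune_vocabulary(docs, min_support=2, min_df=1))       # drops one-off terms
print(prune_vocabulary(docs, min_support=2, min_df=2))       # also drops single-document terms
```

Raising either threshold shrinks the dictionary, and a term repeated inside a single document (like the typo above) only disappears once the document-frequency threshold goes up as well.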

On Fri, May 18, 2012 at 9:41 AM, John Conwell wrote:

> What do you have in mind as far as a different tokenizer? Are you doing
> stopword filtering? Maybe look at the stopword list and see if there are
> other noise words you wish to add. If you are using Lucene to filter
> stopwords, its stopword list is pretty small (20 or so words). Stemming is
> another method often used to reduce your feature space. You could look
> at lemmatization instead of stemming. It won't reduce the feature space as
> much, but could help in normalizing different terms with the same lemma.
>
> You can put together your own Lucene analyzer, with whatever Lucene filter
> pipeline you want, and pass it to SparseVectorsFromSequenceFiles to replace
> the stock tokenizer.
>
> On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng wrote:
>
>> Hi List,
>>
>> I am trying to use Mahout to cluster text. The problem is that after
>> running SparseVectorsFromSequenceFiles, the dimension of the tf-idf
>> vectors is too high (about 50K), and it increases as the number of
>> documents increases. I think there are two ways to handle that. One
>> is to use dimension reduction. The other is to use a better
>> tokenizer, which should be the better option.
>>
>> My questions are:
>>
>> 1) How can I change the default tokenizer? Or where can I find a new one?
>> 2) Is there a third option for dealing with the number of dimensions?
>>
>> Thanks a lot.
>>
>> --
>> Regards,
>> Jiaan
>
> --
> Thanks,
> John C
