Thanks for the quick reply. I don't think stop word filtering or stemming will help much. Also, the point of using tf-idf vectors is to deal with words that occur very frequently, so stop word filtering and stemming seem to run counter to the purpose of tf-idf. The real problem is that the text is very noisy: it is OCR output, so it contains a lot of OCR errors. Is there a tokenizer with a noise filter that I can plug in? Or where can I find a noise filter to deal with that?
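To make the question concrete, here is a minimal sketch of the kind of noise filter I have in mind. It is plain Python, not Mahout or Lucene code, and everything in it is an assumption for illustration: the function names (looks_like_ocr_noise, filter_tokens), the thresholds, and the heuristics themselves (token length, share of alphabetic characters, long runs of one character, no vowels). The same checks would presumably have to be ported into a custom Lucene filter to be usable in the analyzer pipeline.

```python
import re

# Hypothetical heuristics; thresholds are guesses and would need tuning on real OCR output.
_REPEAT = re.compile(r"(.)\1{3,}")  # the same character four or more times in a row
_VOWELS = set("aeiouy")


def looks_like_ocr_noise(token, min_len=3, max_len=25, min_alpha_ratio=0.8):
    """Return True if the token looks like OCR garbage rather than a word."""
    t = token.lower()
    # Very short or very long tokens are rarely useful features.
    if not (min_len <= len(t) <= max_len):
        return True
    # Tokens that are mostly digits or punctuation are likely misreads.
    alpha = sum(ch.isalpha() for ch in t)
    if alpha / len(t) < min_alpha_ratio:
        return True
    # Long runs of one character ("iiiiii") are typical scanning artifacts.
    if _REPEAT.search(t):
        return True
    # English words almost always contain a vowel.
    if not any(ch in _VOWELS for ch in t):
        return True
    return False


def filter_tokens(tokens):
    """Keep only the tokens that pass the noise heuristics."""
    return [t for t in tokens if not looks_like_ocr_noise(t)]


if __name__ == "__main__":
    sample = ["clustering", "c1u5t3r", "iiiiii", "tf", "xzqrtk", "document"]
    print(filter_tokens(sample))
```

These rules assume English text and will also drop legitimate short tokens and abbreviations, so they are a starting point rather than a finished filter.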

On Fri, May 18, 2012 at 10:56 AM, Baoqiang Cao <[email protected]> wrote:
> In addition. You could try to increase the word occurance thresholds
> in -s and -md options.
>
> On Fri, May 18, 2012 at 9:41 AM, John Conwell <[email protected]> wrote:
>> What do you have in mind as far as a different tokenizer? Are you doing
>> stopword filtering? Maybe look at the stopword list and see if there are
>> other noise words you wish to add. If you are using Lucene to filter
>> stopwords, its stopword list if pretty small(20 or so words). Stemming is
>> another method often used to reduce your feature space. You could look
>> at lemmatization instead of stemming. It wont reduce the feature space as
>> much, but could help in normalizing different terms with the same lemme.
>>
>> You can put together your own lucene analyzer with whatever lucene filter
>> pipeline you want into SparseVectorsFromSequenceFiles in order to replace
>> the stock tokenizer.
>>
>> On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng <[email protected]> wrote:
>>
>>> Hi List,
>>>
>>> I am trying to use Mahout to do cluster on text. The problem is after
>>> running the procedure SparseVectorsFromSequenceFiles, the dimension of
>>> tf-idf vector is too high (about 50K) and it increases as the number
>>> of document increases. I think there are two ways to handle that. One
>>> is to use dimension reduction. The other one is to used a better
>>> tokenizer which should be the better option.
>>>
>>> My questions are
>>>
>>> 1) how can I change the default tokenizer? or where can I find a new one?
>>> 2) Is there a third option for me to deal with the number of dimension?
>>>
>>> Thanks a lot.
>>>
>>> --
>>> Regards,
>>> Jiaan
>>>
>>
>> --
>>
>> Thanks,
>> John C

--
Regards,
Jiaan
