Very helpful info! Thanks a lot.

On Fri, May 18, 2012 at 11:37 AM, John Conwell wrote:
> Noise in OCR often manifests itself as a whole bunch of singletons in
> the corpus: meaningless terms like "lsdjfdslkfj". So the minFrequency
> flag can help in filtering out these terms.
>
> Stopwords should be handled by tfidf. For example, the word "the"
> probably has a high frequency in every document in the corpus, so
> it'll have a low tfidf score. But trimming out stopwords is still a
> good way to reduce your feature space, even if it's just to reduce the
> size of your dataset and speed up processing. This can be very helpful
> when you have a very large corpus.
>
> Lemmatization and stemming can actually enhance the tfidf score of
> influential terms. For example, say a document used the following
> list of terms, each term twice: "jog, jogging, jogged, jogs, jogger".
> Here are 5 terms that will each be treated as distinct values in your
> vector space, each with a frequency of 2. The document seems to have a
> lot to do with the act of jogging, but since each term's tfidf score
> is based only on its own frequency of 2, these terms won't strongly
> influence the similarity function when clustering.
> Stemming/lemmatization will normalize these 5 terms down to one term,
> "jog", with a frequency of 10, which will have a higher tfidf score
> than any of the individual terms (as long as the corpus of documents
> isn't all about running). This does two things: it dramatically
> reduces your feature space, and it can increase the influence of key
> terms in a document, which will give you stronger clustering results
> around these key terms.
>
> On Fri, May 18, 2012 at 8:09 AM, Jiaan Zeng wrote:
>
>> Thanks for the quick reply.
>>
>> Stop word filtering or stemming may not help much, I think. Also, the
>> point of using a tf-idf vector is to deal with high-frequency words,
>> so stop word filtering or stemming seems to run counter to the
>> intent of tf-idf. The problem is that the text has lots of noise (it
>> is OCR text, so it has lots of OCR errors). Is there a tokenizer with
>> a noise filter that I can plug in? Or where can I find a noise filter
>> to deal with that?
>>
>> On Fri, May 18, 2012 at 10:56 AM, Baoqiang Cao wrote:
>> > In addition, you could try to increase the word occurrence
>> > thresholds in the -s and -md options.
>> >
>> > On Fri, May 18, 2012 at 9:41 AM, John Conwell wrote:
>> >> What do you have in mind as far as a different tokenizer? Are you
>> >> doing stopword filtering? Maybe look at the stopword list and see
>> >> if there are other noise words you wish to add. If you are using
>> >> Lucene to filter stopwords, its stopword list is pretty small
>> >> (around 30 words). Stemming is another method often used to
>> >> reduce your feature space. You could look at lemmatization
>> >> instead of stemming. It won't reduce the feature space as much,
>> >> but could help in normalizing different terms with the same
>> >> lemma.
>> >>
>> >> You can put together your own Lucene analyzer with whatever
>> >> Lucene filter pipeline you want and plug it into
>> >> SparseVectorsFromSequenceFiles in order to replace the stock
>> >> tokenizer.
>> >>
>> >> On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng wrote:
>> >>
>> >>> Hi List,
>> >>>
>> >>> I am trying to use Mahout to cluster text. The problem is that
>> >>> after running SparseVectorsFromSequenceFiles, the dimension of
>> >>> the tf-idf vectors is too high (about 50K), and it increases as
>> >>> the number of documents increases. I think there are two ways
>> >>> to handle that. One is to use dimension reduction. The other is
>> >>> to use a better tokenizer, which should be the better option.
>> >>>
>> >>> My questions are:
>> >>>
>> >>> 1) How can I change the default tokenizer? Or where can I find
>> >>>    a new one?
>> >>> 2) Is there a third option for dealing with the number of
>> >>>    dimensions?
>> >>>
>> >>> Thanks a lot.
>> >>>
>> >>> --
>> >>> Regards,
>> >>> Jiaan
>> >>
>> >> --
>> >> Thanks,
>> >> John C
>>
>> --
>> Regards,
>> Jiaan
>
> --
> Thanks,
> John C
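
To convince myself of the "jog" example quoted above, here is a rough
sketch in plain Python. It is not Mahout or Lucene code. The toy_stem
function is a made-up suffix stripper (a real run would use a proper
stemmer such as Porter), the three tiny documents are invented, and
tf-idf is taken as tf * log(N / df), which is only one of several common
weightings.

```python
import math
from collections import Counter

def toy_stem(term):
    """Hypothetical suffix stripper for illustration only (not Porter)."""
    for suffix in ("ging", "ged", "ger", "s"):
        # Only strip when at least 3 characters of stem remain.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def tfidf(docs):
    """Return one {term: tf * log(N / df)} dict per document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]

# Invented corpus: one document about jogging, two about other things.
jog_doc = ["jog", "jogging", "jogged", "jogs", "jogger"] * 2 + ["park", "morning"]
docs = [
    jog_doc,
    ["tax", "budget", "report", "morning"],
    ["recipe", "oven", "bread", "park"],
]

raw_scores = tfidf(docs)
stemmed_docs = [[toy_stem(t) for t in doc] for doc in docs]
stemmed_scores = tfidf(stemmed_docs)

raw_vocab = {t for doc in docs for t in doc}
stemmed_vocab = {t for doc in stemmed_docs for t in doc}

print("vocabulary size:", len(raw_vocab), "->", len(stemmed_vocab))
print("raw 'jogging' score:", raw_scores[0]["jogging"])
print("stemmed 'jog' score:", stemmed_scores[0]["jog"])
```

In this toy corpus the five variants collapse into a single "jog"
dimension whose weight is five times that of any one variant, which
matches the argument in the quoted message. How much this helps on real
OCR text is something I would still have to measure.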
--
Regards,
Jiaan
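
P.S. A rough sketch of the minimum-frequency idea for OCR noise, again
in plain Python and not Mahout's implementation. The function name
filter_rare_terms and the min_count parameter are made up for
illustration and only stand in for the kind of threshold discussed in
the thread; the sample tokens are invented.

```python
from collections import Counter

def filter_rare_terms(docs, min_count=2):
    """Drop terms whose total count across the corpus is below min_count."""
    totals = Counter(term for doc in docs for term in doc)
    return [[term for term in doc if totals[term] >= min_count] for doc in docs]

# Invented tokenized documents with a few OCR-style garbage tokens.
ocr_docs = [
    ["the", "river", "lsdjfdslkfj", "bank"],
    ["river", "bank", "qzxv", "the"],
    ["the", "bank", "loan", "rnoney"],
]

filtered_docs = filter_rare_terms(ocr_docs, min_count=2)

vocab_before = {t for doc in ocr_docs for t in doc}
vocab_after = {t for doc in filtered_docs for t in doc}

print("vocabulary before:", sorted(vocab_before))
print("vocabulary after: ", sorted(vocab_after))
```

Note that the legitimate but rare word "loan" is dropped along with the
garbage singletons, so the threshold trades some recall for a smaller
feature space.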
