Noise in OCR text often shows up as a large number of meaningless singleton
terms in the corpus, such as "lsdjfdslkfj".  The minFrequency flag can help
filter these terms out.
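
As a rough sketch of the idea (plain Python, not Mahout's actual
implementation; filter_rare_terms and its min_frequency parameter are made-up
names for illustration), dropping terms that occur fewer than a minimum
number of times across the corpus could look like this:

```python
from collections import Counter

def filter_rare_terms(docs, min_frequency=2):
    """Drop terms whose total count across the corpus is below min_frequency."""
    # Count every term occurrence over all documents.
    counts = Counter(term for doc in docs for term in doc)
    # Keep only the terms that meet the threshold, preserving document order.
    return [[t for t in doc if counts[t] >= min_frequency] for doc in docs]

# Two tiny tokenized "documents", each with one OCR-style garbage token.
docs = [
    ["jog", "park", "lsdjfdslkfj"],
    ["jog", "park", "qzxv"],
]
filtered = filter_rare_terms(docs, min_frequency=2)
print(filtered)
```

The garbage tokens appear only once each, so they fall below the threshold
and disappear, while "jog" and "park" survive.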

Stopwords should be handled by tfidf.  For example, the word "the" probably
has a high frequency in every document in the corpus, so it'll have a low
tfidf score.  But trimming out stopwords is still a good way to reduce your
feature space, even if it's just to reduce the size of your dataset and
speed up processing.  This can be very helpful when you have a very large
corpus.
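
To make that concrete, here is a small sketch using the textbook formula
tf * log(N / df).  That formula is an assumption on my part for illustration;
Lucene and Mahout use their own weighting variants, so the exact numbers will
differ, but the effect on a word that appears in every document is the same.

```python
import math

def idf(term, docs):
    # Textbook inverse document frequency: log(N / df).
    # Assumes the term occurs in at least one document.
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    # Raw term count in the document times the corpus-wide idf.
    return doc.count(term) * idf(term, docs)

docs = [
    ["the", "the", "the", "runner", "jogs"],
    ["the", "the", "cat", "sleeps"],
    ["the", "dog", "barks"],
]

# "the" is in every document, so log(3 / 3) = 0 and its score vanishes.
the_score = tfidf("the", docs[0], docs)
# "runner" is in one document out of three, so it keeps a positive score.
runner_score = tfidf("runner", docs[0], docs)

# Removing stopwords up front still shrinks the vocabulary (feature space).
STOPWORDS = {"the"}
vocab = {t for doc in docs for t in doc}
trimmed_vocab = vocab - STOPWORDS
print(the_score, runner_score, len(vocab), len(trimmed_vocab))
```

So the weighting already neutralizes "the", and the stopword list just saves
you from carrying the extra dimension around.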

Lemmatization and stemming can actually enhance the tfidf score of
influential terms.  For example, say a document used the following list of
terms, each term twice: "jog, jogging, jogged, jogs, jogger".  These are 5
terms that will each be treated as distinct values in your vector space,
each with a frequency of 2.  The document seems to have a lot to do with
the act of jogging, but since each term's tfidf score is based on its own
frequency of 2, these terms won't strongly influence the similarity
function when clustering.  Stemming/lemmatization will normalize these 5
terms down to one term, "jog", with a frequency of 10, which will have a
higher tfidf score than any of the individual terms (as long as the corpus
of documents isn't all about running).  This does two things: it
dramatically reduces your feature space, and it can increase the influence
of key terms in a document, which will give you stronger clustering results
around these key terms.
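
Here is the jogging example as a quick sketch.  The toy_stem function below
is a hypothetical, deliberately crude suffix stripper that only exists to
make this example work; a real pipeline would use a proper stemmer (Porter,
for instance) or a lemmatizer.

```python
from collections import Counter

def toy_stem(word):
    """Crude stemmer for illustration only: strip one suffix, undouble."""
    for suffix in ("ing", "ed", "er", "s"):
        # Only strip if at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "jogg" -> "jog".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

# The document from the example: five variants, each used twice.
doc = ["jog", "jogging", "jogged", "jogs", "jogger"] * 2

raw_counts = Counter(doc)
stemmed_counts = Counter(toy_stem(t) for t in doc)

print(raw_counts)      # five distinct terms, each with a count of 2
print(stemmed_counts)  # a single term "jog" with a count of 10
```

Five dimensions with a term frequency of 2 each become one dimension with a
term frequency of 10, which is the effect described above.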


On Fri, May 18, 2012 at 8:09 AM, Jiaan Zeng <[email protected]> wrote:

> Thanks for the quick reply.
>
> Stop word filtering or stemming may not help much, I think. Also, the
> point of using tf-idf vectors is to deal with words that have a high
> occurrence frequency, so stop word filtering or stemming seems to run
> counter to the intent of tf-idf. The problem is that the text is very
> noisy (it is OCR text, so it has lots of OCR errors). Is there a
> tokenizer with a noise filter that I can plug in? Or where can I find a
> noise filter to deal with that?
>
> On Fri, May 18, 2012 at 10:56 AM, Baoqiang Cao <[email protected]>
> wrote:
> > In addition, you could try increasing the word occurrence thresholds
> > in the -s and -md options.
> >
> > On Fri, May 18, 2012 at 9:41 AM, John Conwell <[email protected]> wrote:
> >> What do you have in mind as far as a different tokenizer?  Are you doing
> >> stopword filtering?  Maybe look at the stopword list and see if there are
> >> other noise words you wish to add.  If you are using Lucene to filter
> >> stopwords, its stopword list is pretty small (20 or so words).  Stemming is
> >> another method often used to reduce your feature space.  You could look
> >> at lemmatization instead of stemming.  It won't reduce the feature space as
> >> much, but could help in normalizing different terms with the same lemma.
> >>
> >> You can put together your own Lucene analyzer with whatever Lucene filter
> >> pipeline you want and plug it into SparseVectorsFromSequenceFiles in order
> >> to replace the stock tokenizer.
> >>
> >>
> >>
> >> On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng <[email protected]> wrote:
> >>
> >>> Hi List,
> >>>
> >>> I am trying to use Mahout to cluster text. The problem is that after
> >>> running SparseVectorsFromSequenceFiles, the dimension of the tf-idf
> >>> vectors is too high (about 50K), and it increases as the number of
> >>> documents increases. I think there are two ways to handle that. One
> >>> is to use dimension reduction. The other is to use a better
> >>> tokenizer, which should be the better option.
> >>>
> >>> My questions are
> >>>
> >>> 1) How can I change the default tokenizer? Or where can I find a new one?
> >>> 2) Is there a third option for dealing with the number of dimensions?
> >>>
> >>> Thanks a lot.
> >>>
> >>> --
> >>> Regards,
> >>> Jiaan
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> Thanks,
> >> John C
>
>
>
> --
> Regards,
> Jiaan
>



-- 

Thanks,
John C
