Re: tokenizer for text

Jiaan Zeng Fri, 18 May 2012 09:38:21 -0700

Very helpful info! Thanks a lot.

On Fri, May 18, 2012 at 11:37 AM, John Conwell <[email protected]> wrote:
> Noise in OCR often shows up as a large number of meaningless singleton
> terms in the corpus, like "lsdjfdslkfj".  So the minFrequency flag
> can help filter out these terms.
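
To make the minimum-frequency idea above concrete, here is a rough Python
sketch. It is not Mahout code: the function name drop_rare_terms, the
threshold default, and the sample documents are all made up for illustration.
It counts each term across the whole corpus and drops anything that occurs
fewer times than the threshold.

```python
from collections import Counter

def drop_rare_terms(docs, min_frequency=2):
    """Remove terms whose total count across the corpus is below min_frequency."""
    counts = Counter(term for doc in docs for term in doc)
    return [[t for t in doc if counts[t] >= min_frequency] for doc in docs]

# Hypothetical tokenized documents; two tokens are OCR garbage.
docs = [
    ["the", "runner", "jogs", "lsdjfdslkfj"],
    ["the", "runner", "rests", "qzxvkp"],
]
filtered = drop_rare_terms(docs)
print(filtered)
```

Note that in a corpus this tiny the legitimate singletons ("jogs", "rests")
are dropped along with the garbage, so the threshold probably needs tuning
against the real corpus size.
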
>
> Stopwords should be handled by tfidf.  For example, the word "the" probably
> has a high frequency in every document in the corpus, so it'll have a low
> tfidf score.  But trimming out stopwords is still a good way to reduce your
> feature space, even if it's just to reduce the size of your dataset and
> speed up processing.  This can be very helpful when you have a very large
> corpus.
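
A small sketch of why a word like "the" ends up with a low score, assuming
the plain textbook definition idf = log(N / df). Real implementations
typically use a smoothed variant, so the exact numbers will differ. The
helper name idf and the three toy documents are hypothetical.

```python
import math

def idf(term, docs):
    """Textbook inverse document frequency, log(N / df).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

# Hypothetical documents, each reduced to its set of distinct terms.
docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "ran"},
    {"the", "bird", "flew"},
]
idf_the = idf("the", docs)  # "the" is in every document: log(3 / 3)
idf_cat = idf("cat", docs)  # "cat" is in one document: log(3 / 1)
print(idf_the, idf_cat)
```

The weight for "the" collapses, but the term still occupies a dimension in
every vector, which is why removing it up front still shrinks the feature
space.
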
>
> Lemmatization and stemming can actually enhance the tfidf score
> of influential terms.  For example, say a document used the following list
> of terms, each term twice: "jog, jogging, jogged, jogs, jogger".  Here are
> 5 terms that will each be treated as distinct values in your vector space,
> each with a frequency of 2.  The document seems to have a lot to do with
> the act of jogging, but since each term's tfidf score is computed from its
> own frequency of 2, these terms won't strongly influence the similarity
> function when clustering.  Stemming/lemmatization will normalize these 5
> terms down to one term, "jog", with a frequency of 10, which will have a
> higher tfidf score than any of the individual terms (as long as the corpus
> of documents isn't all about running).  This does two things: it dramatically
> reduces your feature space, and it can increase the influence of key terms in
> a document, which will give you stronger clustering results around these
> key terms.
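
The jogging example above can be worked through in a few lines of Python.
This is only a sketch under stated assumptions: TOY_STEMS is a hypothetical
lookup table standing in for a real stemmer or lemmatizer, the weight is raw
term count times log(N / df) rather than whatever formula Mahout applies,
and the other three documents are invented.

```python
import math
from collections import Counter

# Hypothetical lookup standing in for a real stemmer or lemmatizer.
TOY_STEMS = {"jogging": "jog", "jogged": "jog", "jogs": "jog", "jogger": "jog"}

def stem(term):
    return TOY_STEMS.get(term, term)

def tfidf(doc, corpus):
    """Raw term count times log(N / df) for every term in doc (doc must be in corpus)."""
    n = len(corpus)
    tf = Counter(doc)
    return {
        t: c * math.log(n / sum(1 for d in corpus if t in d))
        for t, c in tf.items()
    }

# Each of the five variants appears twice, as in the example.
jog_doc = ["jog", "jogging", "jogged", "jogs", "jogger"] * 2
corpus = [jog_doc, ["stock", "market", "report"], ["pasta", "recipe"], ["train", "schedule"]]

raw = tfidf(jog_doc, corpus)  # five dimensions, each weighted 2 * log(4)

stemmed_corpus = [[stem(t) for t in d] for d in corpus]
stemmed = tfidf(stemmed_corpus[0], stemmed_corpus)  # one dimension, weighted 10 * log(4)
print(raw, stemmed)
```

Five weak dimensions become one dimension with five times the weight, which
is the effect described in the quoted paragraph.
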
>
>
> On Fri, May 18, 2012 at 8:09 AM, Jiaan Zeng <[email protected]> wrote:
>
>> Thanks for the quick reply.
>>
>> I don't think stop word filtering or stemming will help much. The
>> point of using tf-idf vectors is to deal with words that occur very
>> frequently, so stop word filtering or stemming seems to run counter to
>> the intention of tf-idf. The problem is that the text has a lot of noise
>> (it is OCR text, so it has lots of OCR errors). Is there a tokenizer with
>> a noise filter that I can plug in? Or where can I find a noise filter to
>> deal with that?
>>
>> On Fri, May 18, 2012 at 10:56 AM, Baoqiang Cao <[email protected]> wrote:
>> > In addition, you could try increasing the word occurrence thresholds
>> > with the -s and -md options.
>> >
>> > On Fri, May 18, 2012 at 9:41 AM, John Conwell <[email protected]> wrote:
>> >> What do you have in mind as far as a different tokenizer?  Are you doing
>> >> stopword filtering?  Maybe look at the stopword list and see if there are
>> >> other noise words you wish to add.  If you are using Lucene to filter
>> >> stopwords, its stopword list is pretty small (20 or so words).  Stemming is
>> >> another method often used to reduce your feature space.  You could look
>> >> at lemmatization instead of stemming.  It won't reduce the feature space as
>> >> much, but could help in normalizing different terms with the same lemma.
>> >>
>> >> You can put together your own Lucene analyzer with whatever Lucene filter
>> >> pipeline you want and plug it into SparseVectorsFromSequenceFiles in order
>> >> to replace the stock tokenizer.
>> >>
>> >>
>> >>
>> >> On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng <[email protected]> wrote:
>> >>
>> >>> Hi List,
>> >>>
>> >>> I am trying to use Mahout to cluster text. The problem is that after
>> >>> running SparseVectorsFromSequenceFiles, the dimension of the
>> >>> tf-idf vectors is too high (about 50K), and it increases as the number
>> >>> of documents increases. I think there are two ways to handle that. One
>> >>> is to use dimension reduction. The other is to use a better
>> >>> tokenizer, which should be the better option.
>> >>>
>> >>> My questions are
>> >>>
>> >>> 1) How can I change the default tokenizer? Or where can I find a new one?
>> >>> 2) Is there a third option for dealing with the number of dimensions?
>> >>>
>> >>> Thanks a lot.
>> >>>
>> >>> --
>> >>> Regards,
>> >>> Jiaan
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >>
>> >> Thanks,
>> >> John C
>>
>>
>>
>> --
>> Regards,
>> Jiaan
>>
>
>
>
> --
>
> Thanks,
> John C




-- 
Regards,
Jiaan
