Re: tokenizer for text

Jiaan Zeng Fri, 18 May 2012 08:09:40 -0700

Thanks for the quick reply.

I don't think stop word filtering or stemming will help much. The point
of using tf-idf vectors is already to deal with words that occur very
frequently, so stop word filtering or stemming seems to run counter to
the intent of tf-idf. The real problem is that the text is very noisy
(it is OCR text, so it contains many OCR errors). Is there a tokenizer
with a noise filter that I can plug in? Or where can I find a noise
filter to deal with that?
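
To make the idea of a noise filter concrete, here is a minimal sketch of a
token-level heuristic for OCR garbage, written in plain Java with no Lucene or
Mahout dependency. The class name OcrNoiseFilter, the thresholds, and the
rules themselves (token length, share of letters, presence of a vowel, long
runs of one repeated character) are all assumptions chosen for illustration,
not something Mahout or Lucene ships. Logic like this could be wrapped in a
custom Lucene filter inside your own analyzer, but that wiring is not shown
here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: heuristic check for tokens that look like OCR noise.
// All thresholds are illustrative assumptions and should be tuned on real data.
final class OcrNoiseFilter {
    static final int MIN_LEN = 3;   // assumed: very short tokens carry little signal
    static final int MAX_LEN = 25;  // assumed: very long tokens are usually merged garbage
    static final int MAX_RUN = 4;   // assumed: 4+ identical characters in a row is noise

    static boolean isLikelyNoise(String token) {
        int len = token.length();
        if (len < MIN_LEN || len > MAX_LEN) {
            return true;
        }
        int letters = 0;
        int vowels = 0;
        int run = 1;
        int longestRun = 1;
        for (int i = 0; i < len; i++) {
            char c = Character.toLowerCase(token.charAt(i));
            if (Character.isLetter(c)) {
                letters++;
                if ("aeiouy".indexOf(c) >= 0) {
                    vowels++;
                }
            }
            // Track the longest run of one repeated character.
            if (i > 0 && c == Character.toLowerCase(token.charAt(i - 1))) {
                run++;
                longestRun = Math.max(longestRun, run);
            } else {
                run = 1;
            }
        }
        if (letters * 2 < len) {
            return true;  // fewer than half of the characters are letters
        }
        if (vowels == 0) {
            return true;  // English-centric rule: no vowel at all
        }
        return longestRun >= MAX_RUN;
    }

    // Keep only the tokens that do not look like noise.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!isLikelyNoise(t)) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

Note that the vowel rule is English-centric and also drops legitimate
vowel-less abbreviations such as "html", so a whitelist may be needed. Rare
OCR errors that survive these rules can still be pruned by a minimum document
frequency threshold, since a misrecognized word rarely repeats across many
documents.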


On Fri, May 18, 2012 at 10:56 AM, Baoqiang Cao <[email protected]> wrote:
> In addition, you could try increasing the word occurrence thresholds
> with the -s and -md options.
>
> On Fri, May 18, 2012 at 9:41 AM, John Conwell <[email protected]> wrote:
>> What do you have in mind as far as a different tokenizer?  Are you doing
>> stopword filtering?  Maybe look at the stopword list and see if there are
>> other noise words you wish to add.  If you are using Lucene to filter
>> stopwords, its stopword list is pretty small (20 or so words).  Stemming is
>> another method often used to reduce your feature space.  You could look
>> at lemmatization instead of stemming.  It won't reduce the feature space as
>> much, but could help in normalizing different terms with the same lemma.
>>
>> You can put together your own Lucene analyzer, with whatever Lucene filter
>> pipeline you want, and plug it into SparseVectorsFromSequenceFiles in order
>> to replace the stock tokenizer.
>>
>>
>>
>> On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng <[email protected]> wrote:
>>
>>> Hi List,
>>>
>>> I am trying to use Mahout to cluster text. The problem is that after
>>> running SparseVectorsFromSequenceFiles, the dimension of the tf-idf
>>> vectors is too high (about 50K), and it increases as the number of
>>> documents increases. I think there are two ways to handle that. One
>>> is to use dimension reduction. The other is to use a better
>>> tokenizer, which should be the better option.
>>>
>>> My questions are
>>>
>>> 1) How can I change the default tokenizer? Or where can I find a new one?
>>> 2) Is there a third option for dealing with the number of dimensions?
>>>
>>> Thanks a lot.
>>>
>>> --
>>> Regards,
>>> Jiaan
>>>
>>
>>
>>
>> --
>>
>> Thanks,
>> John C



-- 
Regards,
Jiaan
