Re: tokenizer for text

Baoqiang Cao Fri, 18 May 2012 07:57:01 -0700

In addition, you could try increasing the word occurrence thresholds
with the -s (minimum support) and -md (minimum document frequency)
options.
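
To illustrate what raising those thresholds does to the vocabulary, here
is a minimal sketch in plain Python. It is not Mahout's implementation;
the function name prune_vocabulary and the toy documents are made up for
this example, and it assumes that "support" means a term's total count
across the corpus and "document frequency" means the number of documents
containing it.

```python
from collections import Counter


def prune_vocabulary(docs, min_support=2, min_df=1):
    """Return the sorted terms that pass both thresholds.

    docs is a list of token lists. A term is kept if it occurs at least
    min_support times in total and in at least min_df documents.
    (Hypothetical helper, for illustration only.)
    """
    term_count = Counter()  # total occurrences across the corpus
    doc_freq = Counter()    # number of documents containing the term
    for tokens in docs:
        term_count.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(
        t for t in term_count
        if term_count[t] >= min_support and doc_freq[t] >= min_df
    )


# Toy corpus: three already-tokenized documents.
docs = [
    ["mahout", "cluster", "text", "text"],
    ["mahout", "tokenizer", "text"],
    ["lucene", "analyzer", "mahout"],
]

print(len(prune_vocabulary(docs, min_support=1, min_df=1)))  # no pruning
print(prune_vocabulary(docs, min_support=2, min_df=2))       # rare terms dropped
```

With thresholds of 1 every term survives, so the vocabulary grows with
each new document. Raising them drops the terms seen only once, which is
usually where most of the dimensions come from.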


On Fri, May 18, 2012 at 9:41 AM, John Conwell <[email protected]> wrote:
> What do you have in mind as far as a different tokenizer?  Are you doing
> stopword filtering?  Maybe look at the stopword list and see if there are
> other noise words you wish to add.  If you are using Lucene to filter
> stopwords, its stopword list is pretty small (20 or so words).  Stemming is
> another method often used to reduce your feature space.  You could look
> at lemmatization instead of stemming.  It won't reduce the feature space as
> much, but could help in normalizing different terms with the same lemma.
>
> You can put together your own Lucene analyzer with whatever Lucene filter
> pipeline you want and pass it to SparseVectorsFromSequenceFiles in order
> to replace the stock tokenizer.
>
>
>
> On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng <[email protected]> wrote:
>
>> Hi List,
>>
>> I am trying to use Mahout to do clustering on text. The problem is that
>> after running SparseVectorsFromSequenceFiles, the dimension of the
>> tf-idf vectors is too high (about 50K) and it increases as the number
>> of documents increases. I think there are two ways to handle that. One
>> is to use dimension reduction. The other is to use a better
>> tokenizer, which should be the better option.
>>
>> My questions are
>>
>> 1) How can I change the default tokenizer? Or where can I find a new one?
>> 2) Is there a third option for dealing with the number of dimensions?
>>
>> Thanks a lot.
>>
>> --
>> Regards,
>> Jiaan
>>
>
>
>
> --
>
> Thanks,
> John C
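
The stopword suggestion quoted above can be sketched in a few lines of
plain Python. This is only an illustration of the idea, not a Lucene
analyzer: the analyze function and both word lists are hypothetical, and
the base list is not Lucene's actual default set.

```python
import re

# Hypothetical lists for illustration only; not Lucene's defaults.
BASE_STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}
EXTRA_NOISE_WORDS = {"regards", "thanks", "wrote"}


def analyze(text, stopwords):
    """Lowercase the text, split it into alphabetic tokens, and drop
    any token found in the stopword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]


text = "Thanks, the tokenizer is in the analyzer. Regards"

# Base list only: mail-specific noise words still get through.
print(analyze(text, BASE_STOPWORDS))

# Base list plus corpus-specific noise words.
print(analyze(text, BASE_STOPWORDS | EXTRA_NOISE_WORDS))
```

The second call keeps only "tokenizer" and "analyzer". Extending the
list with noise words specific to the corpus removes dimensions that a
small generic list would leave in.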
