What do you have in mind for a different tokenizer?  Are you doing
stopword filtering?  If so, look at the stopword list and see if there are
other noise words you want to add.  If you are using Lucene to filter
stopwords, its stopword list is pretty small (20 or so words).  Stemming is
another method often used to reduce the feature space.  You could also look
at lemmatization instead of stemming.  It won't reduce the feature space as
much, but it could help normalize different terms that share a lemma.
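
To give a rough feel for how much those two steps can shrink the vocabulary,
here is a small plain-Java sketch.  It does not use Lucene or Mahout at all:
the class name, the stopword list, and the crudeStem suffix-stripping rule are
all made up for illustration, and crudeStem is only a stand-in for a real
stemmer such as Porter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: counts distinct terms with and without filtering.
class VocabularySketch {
    // Hypothetical noise-word list; extend it with words that are noise in your corpus.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "the", "a", "of", "and", "to", "is", "are", "in", "on"));

    // Crude suffix stripping, an assumed stand-in for a real stemmer.
    static String crudeStem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s") && !term.endsWith("ss")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Builds the set of distinct terms (the feature space) for a list of documents.
    static Set<String> vocabulary(List<String> docs, boolean filterStopwords, boolean stem) {
        Set<String> vocab = new TreeSet<>();
        for (String doc : docs) {
            // Lowercase, then split on anything that is not a letter a-z.
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                if (filterStopwords && STOPWORDS.contains(token)) {
                    continue;
                }
                vocab.add(stem ? crudeStem(token) : token);
            }
        }
        return vocab;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "The clusters are growing in the vector space",
                "Clustering a cluster of vectors and spaces");
        System.out.println("raw terms:            " + vocabulary(docs, false, false).size());
        System.out.println("stopwords removed:    " + vocabulary(docs, true, false).size());
        System.out.println("stopwords + stemming: " + vocabulary(docs, true, true).size());
    }
}
```

On a real corpus the reduction will depend heavily on the text and on the
stemmer you pick, so treat this as a way to see the effect and not as a
measurement.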

You can put together your own Lucene analyzer with whatever Lucene filter
pipeline you want and pass it to SparseVectorsFromSequenceFiles to replace
the stock tokenizer.



On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng <[email protected]> wrote:

> Hi List,
>
> I am trying to use Mahout to do clustering on text. The problem is that
> after running the procedure SparseVectorsFromSequenceFiles, the dimension
> of the tf-idf vector is too high (about 50K) and it increases as the number
> of documents increases. I think there are two ways to handle that. One
> is to use dimension reduction. The other one is to use a better
> tokenizer, which should be the better option.
>
> My questions are
>
> 1) How can I change the default tokenizer? Or where can I find a new one?
> 2) Is there a third option for me to deal with the number of dimensions?
>
> Thanks a lot.
>
> --
> Regards,
> Jiaan
>



-- 

Thanks,
John C
