The text value encoder has a special set of methods so that you can add text
that it tokenizes for you.  That is generally the easiest method.

You can tokenize it yourself and use the addToVector method if you like.
Sometimes that is preferable because you may have a non-Lucene tokenizer or
you may want to avoid double tokenization (or a hundred other reasons).
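
A minimal sketch of the tokenize-it-yourself route, written as a self-contained
stand-in rather than against Mahout itself.  The class HashedTextSketch, the
FNV-1a hash, the regex tokenizer and the plain double[] "vector" are all
assumptions made for illustration; only the shape of
addToVector(token, weight, data) mirrors the call discussed in this thread, and
Mahout's real encoders differ in detail.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical stand-in for a hashed text encoder; NOT Mahout's TextValueEncoder.
final class HashedTextSketch {

    // Map a token to a slot in [0, cardinality) using a 32-bit FNV-1a hash.
    static int slot(String token, int cardinality) {
        int h = 0x811c9dc5;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return Math.floorMod(h, cardinality);
    }

    // Same shape as addToVector(originalForm, weight, data):
    // add one token's weight to the fixed-size vector in place.
    static void addToVector(String token, double weight, double[] data) {
        data[slot(token, data.length)] += weight;
    }

    // Tokenize ourselves (lowercase, split on anything that is not a letter
    // or digit), then add each token exactly once, so nothing is tokenized twice.
    static double[] encode(String text, int cardinality) {
        double[] data = new double[cardinality];
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                addToVector(token, 1.0, data);
            }
        }
        return data;
    }

    public static void main(String[] args) {
        // The output length is always 16, whatever words the document contains.
        double[] v = encode("The quick brown fox jumps over the lazy dog", 16);
        System.out.println(Arrays.toString(v));
    }
}
```

The point of the sketch is that the vector length is set up front and no
dictionary is consulted, so separate runs over different documents produce
vectors that line up with each other.  Different tokens can land in the same
slot; that is the price of the fixed cardinality.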

On Fri, May 27, 2011 at 8:49 AM, David Saile <[email protected]> wrote:

> I really appreciate your help Ted!
>
> As I am new to Mahout, could you please point me in the right direction?
>
> From looking at the code I get the impression that I would need to use the
> TextValueEncoder class and repeatedly call
> addToVector(String originalForm, double weight, Vector data)
> for each word in a given document. Is this correct?
>
>
> On 27.05.2011 at 17:26, Ted Dunning wrote:
>
> > You have to write or adapt some code.  This is the big current down-side of
> > the hashing encoders.
> >
> > On Fri, May 27, 2011 at 2:38 AM, David Saile <[email protected]> wrote:
> >
> >>> The other option is to use the hashing encoders.  They inherently produce
> >>> output of fixed cardinality.  The down-side with that is that the meaning
> >>> of lots of distance measures is hard to understand in the hashed frameworks.
> >>> Distances that are invariant under linear transformations work perfectly.
> >>> Some others like Manhattan distance work pretty well.  Others can be
> >>> totally confused.
> >>
> >> This sounds like an option that eliminates the need for a global dictionary
> >> (in regards to multiple vectorizer runs).
> >> How can I specify the use of hashing encoders for vectorization?
>
>
