I really appreciate your help, Ted!

As I am new to Mahout, could you please point me in the right direction?

From looking at the code, I get the impression that I would need to use the 
TextValueEncoder class and repeatedly call 
addToVector(String originalForm, double weight, Vector data)
for each word in a given document. Is this correct?
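
To make sure I understand the idea, here is a minimal sketch of what I think
the hashed encoding does for each word. It is plain Java and does not use
Mahout at all: the class HashedEncodingSketch, the helpers slot() and
encode(), and the cardinality of 16 are all made up for illustration, and the
real encoders may well hash and weight words differently.

```java
import java.util.Arrays;

// Illustrative sketch only; none of these names come from Mahout.
class HashedEncodingSketch {

    // Hypothetical helper: map a word to a slot in [0, cardinality).
    // floorMod keeps the index non-negative even for negative hash codes.
    static int slot(String word, int cardinality) {
        return Math.floorMod(word.hashCode(), cardinality);
    }

    // Add a weight of 1.0 to the slot of every whitespace-separated word.
    // No dictionary is consulted, so the output size never depends on the
    // vocabulary seen so far.
    static double[] encode(String document, int cardinality) {
        double[] vector = new double[cardinality];
        for (String word : document.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                vector[slot(word, cardinality)] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        // Two independent "runs" produce vectors of the same cardinality.
        System.out.println(Arrays.toString(encode("the quick brown fox", 16)));
        System.out.println(Arrays.toString(encode("the lazy dog", 16)));
    }
}
```

If this is roughly right, two separate vectorizer runs would put the same word
into the same slot without sharing any state, at the cost of occasional
collisions between different words.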

 
On 27.05.2011 at 17:26, Ted Dunning wrote:

> You have to write or adapt some code.  This is the big current down-side of
> the hashing encoders.
> 
> On Fri, May 27, 2011 at 2:38 AM, David Saile <[email protected]> wrote:
> 
>>> The other option is to use the hashing encoders.  They inherently produce
>>> output of fixed cardinality.  The down-side with that is that the meaning of
>>> lots of distance measures is hard to understand in the hashed frameworks.
>>> Distances that are invariant under linear transformations work perfectly.
>>> Some others like Manhattan distance work pretty well.  Others can be
>>> totally confused.
>> 
>> This sounds like an option that eliminates the need for a global dictionary
>> (with regard to multiple vectorizer runs).
>> How can I specify the use of hashing encoders for vectorization?
