I really appreciate your help, Ted! As I am new to Mahout, could you please point me in the right direction?
From looking at the code, I get the impression that I would need to use the TextValueEncoder class and repeatedly call addToVector(String originalForm, double weight, Vector data) for each word in a given document. Is this correct?

On 27.05.2011 at 17:26, Ted Dunning wrote:

> You have to write or adapt some code. This is the big current down-side of
> the hashing encoders.
>
> On Fri, May 27, 2011 at 2:38 AM, David Saile <[email protected]> wrote:
>
>>> The other option is to use the hashing encoders. They inherently produce
>>> output of fixed cardinality. The down-side with that is that the meaning of
>>> lots of distance measures is hard to understand in the hashed frameworks.
>>> Distances that are invariant under linear transformations work perfectly.
>>> Some others like Manhattan distance work pretty well. Others can be
>>> totally confused.
>>
>> This sounds like an option that eliminates the need for a global dictionary
>> (in regards to multiple vectorizer runs).
>> How can I specify the use of hashing encoders for vectorization?
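
For illustration, the per-word encoding loop asked about above can be sketched in plain Java. This is only a rough sketch of the general hashing idea and does not use Mahout's classes: the class name HashedEncoderSketch, the use of String.hashCode(), the whitespace/punctuation tokenization, and the array-backed vector are all assumptions made for the example, and Mahout's own encoders may behave differently.

```java
import java.util.Locale;

// Illustrative only: this is NOT Mahout's TextValueEncoder. It shows the general
// hashing idea, where each word is hashed to an index in a fixed-size vector,
// so no global dictionary is needed.
final class HashedEncoderSketch {

    // Hypothetical helper: adds `weight` at the hashed position of `word`.
    static void addToVector(String word, double weight, double[] data) {
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(word.hashCode(), data.length);
        data[index] += weight;
    }

    // Encodes a document by calling addToVector once per word.
    // The output length is always `cardinality`, whatever the vocabulary is.
    static double[] encode(String document, int cardinality) {
        double[] data = new double[cardinality];
        for (String word : document.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                addToVector(word, 1.0, data);
            }
        }
        return data;
    }

    public static void main(String[] args) {
        double[] vector = encode("The cat and the hat", 16);
        System.out.println("cardinality = " + vector.length);
    }
}
```

Because the index depends only on the word and the chosen cardinality, separate runs over different document sets produce comparable vectors. Distinct words can collide in the same position, which is one reason some distance measures become harder to interpret.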
