On 26.05.2011 at 20:05, Ted Dunning wrote:

> On Thu, May 26, 2011 at 10:35 AM, David Saile <[email protected]> wrote:
>
>> I assume, this exception occurs because the new vectors have a different
>> cardinality than the previously computed clusters.
>>
>
> Correct
>
>
>> Is there some way to assign a fixed cardinality to all vectors? Or is there
>> any other solution for this?
>>
>
> I think that there is a way to use a fixed dictionary.

I guess what you are referring to (and what I had overlooked) is that I need to reuse the dictionaries from previous runs to ensure that words have consistent IDs.
Can someone point me to how I can pass an existing dictionary to the DictionaryVectorizer? In the mahout-0.4 release I am using, DictionaryVectorizer.createTermFrequencyVectors(…) does not take any dictionary-path argument.

> If we don't already have it, there should be a provision for adding an
> extra slot for unknown words to fit into.

I could not find this functionality, but I guess implementing it should not be too hard.

>
> The other option is to use the hashing encoders. They inherently produce
> output of fixed cardinality. The down-side with that is that the meaning of
> lots of distance measures is hard to understand in the hashed frameworks.
> Distances that are invariant under linear transformations work perfectly.
> Some others like Manhattan distance work pretty well. Others can be
> totally confused.

This sounds like an option that eliminates the need for a global dictionary (with regard to multiple vectorizer runs).
How can I specify the use of hashing encoders for vectorization?

Thanks for your help!
David
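
To make the two options discussed above concrete, here is a minimal sketch in plain Java. It does not use Mahout's API: the class FixedCardinalitySketch and the methods encodeWithDictionary and encodeHashed are hypothetical names chosen for illustration. It assumes the dictionary from an earlier run maps words to dense IDs 0..n-1, and it uses String.hashCode() as a stand-in for whatever hash function a real hashing encoder would use.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these names are Mahout classes.
final class FixedCardinalitySketch {

    // Option 1: reuse a dictionary built in an earlier run.
    // Assumes the dictionary IDs are dense (0 .. size-1). One extra slot at
    // index dictionary.size() collects every word the dictionary does not know.
    static double[] encodeWithDictionary(List<String> tokens, Map<String, Integer> dictionary) {
        int unknownSlot = dictionary.size();
        double[] vector = new double[dictionary.size() + 1];
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            vector[index == null ? unknownSlot : index] += 1.0;
        }
        return vector;
    }

    // Option 2: hashing. The index is derived from the token itself, so no
    // dictionary is needed and the cardinality is fixed up front.
    // Different words may collide in the same slot.
    static double[] encodeHashed(List<String> tokens, int cardinality) {
        double[] vector = new double[cardinality];
        for (String token : tokens) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(token.hashCode(), cardinality);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Dictionary as it might have been saved from a previous run (made-up contents).
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("mahout", 0);
        dictionary.put("cluster", 1);

        // "kmeans" is not in the dictionary, so it is counted in the unknown slot.
        double[] first = encodeWithDictionary(Arrays.asList("mahout", "cluster", "mahout"), dictionary);
        double[] later = encodeWithDictionary(Arrays.asList("cluster", "kmeans"), dictionary);
        System.out.println(Arrays.toString(first));
        System.out.println(Arrays.toString(later));

        // Hashed vectors have the requested length regardless of vocabulary.
        double[] hashed = encodeHashed(Arrays.asList("cluster", "kmeans"), 16);
        System.out.println(hashed.length);
    }
}
```

With either approach, vectors produced in separate runs have the same cardinality. The dictionary variant keeps each slot interpretable as one word, while the hashed variant trades that interpretability for not having to store or share a dictionary.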
