On 26.05.2011 at 20:05, Ted Dunning wrote:

> On Thu, May 26, 2011 at 10:35 AM, David Saile <[email protected]> wrote:
> 
>> I assume, this exception occurs because the new vectors have a different
>> cardinality than the previously computed clusters.
>> 
> 
> Correct
> 
> 
>> Is there some way to assign a fixed cardinality to all vectors? Or is there
>> any other solution for this?
>> 
> 
> I think that there is a way to use a fixed dictionary.  

I guess what you are referring to (and what I had overlooked) is that I need 
to reuse the dictionaries from previous runs to ensure that words keep 
consistent IDs.

Can someone point me to how I can pass an existing dictionary to the 
DictionaryVectorizer? 
In the mahout-0.4 release I am using, 
DictionaryVectorizer.createTermFrequencyVectors(…) does not take any 
dictionary-path argument.


> If we don't already have it, there should be a provision for adding an
> extra slot for unknown words to fit into.

I could not find this functionality, but I guess implementing it should not 
be too hard.
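
To make sure I understand the idea, here is a minimal sketch of a fixed 
dictionary with one extra slot for unknown words. This is plain Java and does 
not use any Mahout classes; FixedDictionaryEncoder and its methods are names I 
made up for illustration, and I am assuming simple term-frequency weighting.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: maps tokens to fixed indices and
// sends every out-of-dictionary token to one shared "unknown" slot.
class FixedDictionaryEncoder {
    private final Map<String, Integer> dictionary = new HashMap<>();
    private final int unknownIndex;

    FixedDictionaryEncoder(List<String> terms) {
        for (String term : terms) {
            // A term listed twice keeps the ID it was given first.
            dictionary.putIfAbsent(term, dictionary.size());
        }
        // The extra slot sits directly after the last known term.
        unknownIndex = dictionary.size();
    }

    // Always dictionary size + 1, no matter which documents are encoded.
    int cardinality() {
        return unknownIndex + 1;
    }

    // Term-frequency vector of fixed length.
    double[] encode(List<String> tokens) {
        double[] vector = new double[cardinality()];
        for (String token : tokens) {
            vector[dictionary.getOrDefault(token, unknownIndex)] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        FixedDictionaryEncoder encoder =
            new FixedDictionaryEncoder(Arrays.asList("apple", "banana"));
        // "cherry" is not in the dictionary, so it is counted in the last slot.
        System.out.println(Arrays.toString(
            encoder.encode(Arrays.asList("apple", "cherry", "apple"))));
        // prints [2.0, 0.0, 1.0]
    }
}
```

With this, vectors from later runs would have the same cardinality as the 
clusters computed earlier, as long as the same term list is reused.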

> 
> The other option is to use the hashing encoders.  They inherently produce
> output of fixed cardinality.  The down-side with that is that the meaning of
> lots of distance measures is hard to understand in the hashed frameworks.
> Distances that are invariant under linear transformations work perfectly.
> Some others like Manhattan distance work pretty well.  Others can be
> totally confused.

This sounds like an option that eliminates the need for a global dictionary 
across multiple vectorizer runs.
How can I specify the use of hashing encoders for vectorization?
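
For reference, my understanding of the hashing approach is roughly the 
following. This is only a sketch in plain Java, not Mahout's encoder API: 
HashedVectorizer is a made-up name, and I am assuming a single hash probe per 
token based on String.hashCode(), which may differ from what the real encoders 
do.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the hashing trick, not Mahout code: each token
// is hashed straight to an index, so no dictionary is needed and every vector
// has the same cardinality.
class HashedVectorizer {
    private final int cardinality;

    HashedVectorizer(int cardinality) {
        if (cardinality <= 0) {
            throw new IllegalArgumentException("cardinality must be positive");
        }
        this.cardinality = cardinality;
    }

    // String.hashCode() is specified by the Java API, so the same token gets
    // the same index in every run. floorMod keeps the result non-negative.
    int indexOf(String token) {
        return Math.floorMod(token.hashCode(), cardinality);
    }

    // Term-frequency vector; different tokens may collide in one slot.
    double[] encode(List<String> tokens) {
        double[] vector = new double[cardinality];
        for (String token : tokens) {
            vector[indexOf(token)] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        HashedVectorizer vectorizer = new HashedVectorizer(16);
        double[] first = vectorizer.encode(Arrays.asList("mahout", "kmeans"));
        double[] second = vectorizer.encode(Arrays.asList("entirely", "new", "words"));
        // Both vectors have length 16, regardless of vocabulary.
        System.out.println(first.length + " " + second.length);
    }
}
```

If I read this correctly, collisions between tokens are the reason some 
distance measures become hard to interpret on hashed vectors.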


Thanks for your help!

David
