Hashed feature vectors are an excellent choice for the unknown vocabulary problem.
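
As a minimal sketch of what that looks like with Mahout's encoders (assuming the Mahout 0.7-era org.apache.mahout.vectorizer.encoders API; the cardinality and probe count below are illustrative choices, not recommendations):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    // Encode a document into a fixed-size hashed vector. No dictionary is
    // ever built, so the vocabulary never has to be known in advance.
    public class HashedEncodingSketch {
      public static void main(String[] args) {
        int cardinality = 10000; // fixed up front, independent of vocabulary size
        StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
        encoder.setProbes(2); // hash each word into 2 slots to soften collisions

        Vector doc = new RandomAccessSparseVector(cardinality);
        for (String word : "new words hash fine even if never seen before".split(" ")) {
          encoder.addToVector(word, doc);
        }
        System.out.println(doc);
      }
    }

Because the vector size is fixed before any document is seen, two processes can encode different documents at the same time with no shared state.
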
One problem you will have is that the static weighting won't, by default, weight rare words more highly than common words. One way to deal with this is to build a dictionary on a small subset of documents and assume that all missing words are relatively rare and can be given a default weight (see the sketch after the quoted thread below). You definitely need some weighting according to prevalence.

On Wed, Apr 24, 2013 at 12:32 PM, Johannes Schulte <[email protected]> wrote:

> Hi Martin,
>
> I guess you should be fine with the StaticWordValueEncoder, following e.g.
> this discussion on this list; it is about clustering but matches some of
> your questions:
>
> http://mahout.markmail.org/search/?q=hashing%20clustering#query:hashing%20clustering+page:1+mid:eitskeb7pkk3pupr+state:results
>
> What I am missing is a clear big picture on the latter of the two most
> mentioned benefits of hashing:
> - dictionary / vectorization (with collisions and "small error on average")
> - dimensionality reduction
>
> For the dimensionality reduction part, there is the minHash clustering,
> home-brewed k-means with feature vector encoders, plus the new streaming
> k-means stuff with its own random projection and hash implementations. Any
> guidance: cool!
>
> For you, Martin, I think you will have to roll your own incremental vector
> generation, but it should be really straightforward.
>
> Cheers,
>
> Johannes
>
>
> On Wed, Apr 24, 2013 at 6:32 PM, Martin Bayly <[email protected]> wrote:
>
> > We have a system that needs to be able to incrementally calculate
> > document-document text similarity metrics as new documents are seen. I'm
> > trying to understand if using feature hashing with, for example, a
> > StaticWordValueEncoder is appropriate for this kind of use case. Our
> > text documents can contain web content, so the size of the feature
> > vector is really only bounded by the number of words in the language.
> >
> > Currently our implementation uses a simple vector-based bag-of-words
> > model to create 'one cell per word' feature vectors for each document,
> > and then we use cosine similarity to determine document-to-document
> > similarity. We are not using Mahout.
> >
> > The issue with this approach is that the one-cell-per-word feature
> > vectors require the use of a singleton dictionary object to turn words
> > into vector indexes, so we can only index one document at a time.
> >
> > I've been reading through the Mahout archives and the Mahout in Action
> > book looking to see if Mahout has any answers to help with incremental,
> > parallelized vector generation, but it seems like the Mahout seq2sparse
> > processes have the same 'batch' issue. I've seen various posts referring
> > to using feature hashing as a way around this, and the classifiers in
> > part 3 of Mahout in Action explain how to use feature hashing to encode
> > text-like features.
> >
> > I'm just too green to know whether it's appropriate for our use case,
> > particularly whether the multiple probes recommended when using feature
> > hashing with text, and the likelihood of feature collisions, will
> > significantly compromise our cosine similarity calculations.
> >
> > Thanks for any insights,
> > Martin
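
A minimal sketch of the partial-dictionary idea from the reply at the top, assuming StaticWordValueEncoder's setDictionary and setMissingValueWeight hooks; the weights below are made-up illustrative values, not estimates from real data:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class WeightedEncodingSketch {
      public static void main(String[] args) {
        // Weights estimated from a small subset of documents (IDF-like values).
        Map<String, Double> weights = new HashMap<String, Double>();
        weights.put("the", 0.01);    // very common word -> low weight
        weights.put("hashing", 3.2); // rarer word -> higher weight

        StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
        encoder.setDictionary(weights);
        // Words absent from the small dictionary are assumed to be
        // relatively rare and get this default weight.
        encoder.setMissingValueWeight(1.0);

        Vector doc = new RandomAccessSparseVector(10000);
        for (String word : "the mahout hashing trick".split(" ")) {
          encoder.addToVector(word, doc);
        }
        System.out.println(doc);
      }
    }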

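And for the cosine-similarity question in Martin's message, a sketch of comparing two independently encoded documents; because there is no shared dictionary, each document can be vectorized as it arrives, and the cosine is computed directly on the hashed vectors (tokenization and sizes are illustrative):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedCosineSketch {
      private static final int CARDINALITY = 10000;

      // Each document is encoded independently: no shared mutable dictionary,
      // so documents can be vectorized incrementally or in parallel.
      static Vector encode(String text) {
        StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
        encoder.setProbes(2);
        Vector v = new RandomAccessSparseVector(CARDINALITY);
        for (String word : text.toLowerCase().split("\\s+")) {
          encoder.addToVector(word, v);
        }
        return v;
      }

      public static void main(String[] args) {
        Vector a = encode("feature hashing for text similarity");
        Vector b = encode("text similarity with hashed features");
        double cosine = a.dot(b) / (a.norm(2) * b.norm(2));
        System.out.println("cosine similarity = " + cosine);
      }
    }

Collisions add a small amount of noise to the dot product, but with a large enough cardinality and a couple of probes the effect on the cosine is typically small relative to the signal from genuinely shared words.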