Hi Ted,

On Apr 13, 2013, at 8:46pm, Ted Dunning wrote:
> On Sat, Apr 13, 2013 at 7:05 AM, Ken Krugler <[email protected]> wrote:
>>
>> On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:
>>
>>> The first thing to try is feature hashing to reduce your feature
>>> vector size.
>>
>> Unfortunately LibLinear takes feature indices directly (assumes they're
>> sequential ints from 0..n-1), so I don't think feature hashing will help
>> here.
>>
>
> I am sure that it would. The feature indices that you give to liblinear
> don't have to be your original indices.
>
> The simplest level of feature hashing would be to take the original
> feature indices and use multiple hashing to get 1, 2 or more new feature
> index values for each original index. Then take these modulo the new
> feature vector size (which can be much smaller than your original).

Thanks for clarifying - I was stuck on using the hash trick to get rid of the terms-to-index map, versus creating a denser matrix.

Though I haven't yet found a good write-up on the value of generating more than one hash - it seems like multiple hash values would increase the odds of collisions.

For a not-so-sparse matrix and a single hash function, I got a 6% drop in accuracy. I'll have to try with a more realistic/sparser data set.

-- Ken

> There will be some collisions, but the result here is a linear
> transformation of the original space and if you use multiple indexes for
> each original feature, you will lose very little, if anything. The SVM
> will almost always be able to learn around the effects of collisions.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
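P.S. For anyone following along, here's a rough sketch of the multiple-hashing scheme Ted describes - re-indexing each original feature into a smaller space via several hash functions, then taking each hash modulo the new dimensionality. The function names, the choice of seeded MD5 as the hash, and the 1/num_hashes value split are my own assumptions for illustration, not from LibLinear or any particular library:

```python
import hashlib

def hashed_indices(original_index, num_hashes=2, new_dim=1 << 18):
    # Map one original feature index to num_hashes indices in [0, new_dim).
    # A seeded cryptographic hash stands in for a family of hash functions;
    # a faster non-cryptographic hash (e.g. MurmurHash) would do in practice.
    indices = []
    for seed in range(num_hashes):
        digest = hashlib.md5(f"{seed}:{original_index}".encode()).hexdigest()
        indices.append(int(digest, 16) % new_dim)
    return indices

def hash_features(feature_vector, num_hashes=2, new_dim=1 << 18):
    # Re-index a sparse {original_index: value} vector into the smaller
    # hashed space. Each original feature spreads value/num_hashes across
    # its hashed indices, so the mapping is a fixed linear transformation
    # of the original space, and a collision on one index only partially
    # entangles two features.
    hashed = {}
    for idx, value in feature_vector.items():
        for new_idx in hashed_indices(idx, num_hashes, new_dim):
            hashed[new_idx] = hashed.get(new_idx, 0.0) + value / num_hashes
    return hashed
```

The resulting {index: value} dict can then be written out in LibLinear's sparse "index:value" format with indices drawn from the much smaller hashed range, with no terms-to-index map needed.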
