Hi Ted,

On Apr 13, 2013, at 8:46pm, Ted Dunning wrote:

> On Sat, Apr 13, 2013 at 7:05 AM, Ken Krugler <[email protected]> wrote:
>
>> On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:
>>
>>> The first thing to try is feature hashing to reduce your feature vector size.
>>
>> Unfortunately LibLinear takes feature indices directly (assumes they're sequential ints from 0..n-1), so I don't think feature hashing will help here.
>
> I am sure that it would. The feature indices that you give to liblinear don't have to be your original indices.
>
> The simplest level of feature hashing would be to take the original feature indices and use multiple hashing to get 1, 2 or more new feature index values for each original index. Then take these modulo the new feature vector size (which can be much smaller than your original).

I finally got to run this on a full set of training data, and it worked really well - even with a single hash function.

Without hashing, I got 81% accuracy on a held-out dataset equal to 10% of all documents. Hashing to 20% of the original size gave me 80% accuracy. Hashing to 10% gave me 79.6% accuracy - so essentially no change.

Which means my 850MB model is now 81MB.

Thanks for the help!

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
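[For the archive: the multiple-hashing scheme Ted describes above can be sketched roughly as below. This is a minimal illustration, not the code used for the results in this thread; the function name and the use of md5 as the hash are my own choices, and it assumes LibLinear's sparse index:value input with 1-based indices.]

```python
import hashlib

def hash_features(features, new_dim, num_hashes=1):
    """Map original (index, value) features into a smaller index space.

    Each original feature index is hashed num_hashes times (seeded md5,
    so the mapping is deterministic across runs); each hash, taken modulo
    new_dim, yields a new feature index. Values landing on the same new
    index (collisions) are summed.
    """
    hashed = {}
    for index, value in features:
        for seed in range(num_hashes):
            digest = hashlib.md5(f"{seed}:{index}".encode()).hexdigest()
            new_index = int(digest, 16) % new_dim
            hashed[new_index] = hashed.get(new_index, 0.0) + value
    # LibLinear expects sparse features as sorted, 1-based indices.
    return sorted((i + 1, v) for i, v in hashed.items())

# Example: indices that originally ranged into the millions are
# squeezed into 100,000 slots, shrinking the model proportionally.
vector = [(12345, 1.0), (987654, 2.0)]
print(hash_features(vector, new_dim=100_000))
```

With a single hash function each original feature maps to one slot, so the only cost is occasional collisions; as the accuracy numbers above suggest, the model often tolerates that well.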
