Glad to be able to help. Double hashing would probably allow you to preserve full accuracy at higher compression, but if you are happy, then you might as well be done.
On Wed, Apr 24, 2013 at 1:56 PM, Ken Krugler <[email protected]> wrote:

> Hi Ted,
>
> On Apr 13, 2013, at 8:46pm, Ted Dunning wrote:
>
> > On Sat, Apr 13, 2013 at 7:05 AM, Ken Krugler <[email protected]> wrote:
> >
> >> On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:
> >>
> >>> The first thing to try is feature hashing to reduce your feature vector size.
> >>
> >> Unfortunately LibLinear takes feature indices directly (assumes they're sequential ints from 0..n-1), so I don't think feature hashing will help here.
> >
> > I am sure that it would. The feature indices that you give to liblinear don't have to be your original indices.
> >
> > The simplest level of feature hashing would be to take the original feature indices and use multiple hashing to get 1, 2 or more new feature index values for each original index. Then take these modulo the new feature vector size (which can be much smaller than your original).
>
> I finally got to run this on a full set of training data, and it worked really well - even with a single hash function.
>
> Without hashing, I got 81% accuracy on a held-out dataset equal to 10% of all documents.
>
> Hashing to 20% of the original size gave me 80% accuracy.
>
> Hashing to 10% gave me 79.6% accuracy - so essentially no change.
>
> Which means my 850MB model is now 81MB.
>
> Thanks for the help!
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
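Ted's suggestion in the quoted thread - hash each original feature index one or more times, then take each hash modulo the new (smaller) vector size - could be sketched roughly like this. This is a minimal illustration, not code from the thread; the function name, the salted-MD5 hash, and the example dimensions are all my own assumptions:

```python
import hashlib

def hash_features(orig_indices, new_dim, num_hashes=1):
    """Map original feature indices into a smaller index space.

    Each original index is hashed num_hashes times (salted MD5 here,
    an arbitrary choice) and each hash is taken modulo new_dim, so one
    original feature maps to up to num_hashes slots in the compressed
    feature vector. The result is a sorted list of new indices, which
    is the sequential-int form liblinear-style tools expect.
    """
    hashed = set()
    for idx in orig_indices:
        for salt in range(num_hashes):
            digest = hashlib.md5(f"{salt}:{idx}".encode()).hexdigest()
            hashed.add(int(digest, 16) % new_dim)
    return sorted(hashed)

# Example: compress indices from a large space into 100,000 dimensions,
# using two hash functions per feature (Ted's "multiple hashing").
features = [12, 40321, 987654]
print(hash_features(features, new_dim=100_000, num_hashes=2))
```

With num_hashes=1 this is the single-hash variant Ken reports using; raising it to 2 is the double-hashing idea, which trades a few extra nonzero entries per document for fewer destructive collisions at the same compressed size.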
