Hi Ted,

On Apr 13, 2013, at 8:46pm, Ted Dunning wrote:

> On Sat, Apr 13, 2013 at 7:05 AM, Ken Krugler 
> <[email protected]> wrote:
> 
>> 
>> On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:
>> 
>>> The first thing to try is feature hashing to reduce your feature vector
>>> size.
>> 
>> Unfortunately LibLinear takes feature indices directly (assumes they're
>> sequential ints from 0..n-1), so I don't think feature hashing will help
>> here.
>> 
> 
> I am sure that it would.  The feature indices that you give to liblinear
> don't have to be your original indices.
> 
> The simplest level of feature hashing would be to take the original feature
> indices and use multiple hashing to get 1, 2 or more new feature index
> values for each original index.  Then take these modulo the new feature
> vector size (which can be much smaller than your original).

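In case it's useful to anyone else on the list, here's a rough Python sketch of the multiple-hashing approach Ted describes (the helper names and the MD5-based hash choice are my own illustration, not anything from LibLinear):

```python
import hashlib
from collections import defaultdict

def hashed_indices(orig_index, new_dim, num_hashes=1):
    """Map an original feature index to `num_hashes` new indices,
    each taken modulo the (smaller) new feature vector size.
    Hypothetical helper; the seeded-MD5 hash is just an example."""
    out = []
    for seed in range(num_hashes):
        digest = hashlib.md5(f"{seed}:{orig_index}".encode()).hexdigest()
        out.append(int(digest, 16) % new_dim)
    return out

def rehash_vector(sparse_vec, new_dim, num_hashes=1):
    """Fold a {orig_index: value} sparse vector into new_dim
    dimensions, summing any values whose hashes collide. The
    resulting indices can be fed to LibLinear as usual."""
    folded = defaultdict(float)
    for idx, val in sparse_vec.items():
        for new_idx in hashed_indices(idx, new_dim, num_hashes):
            folded[new_idx] += val
    return dict(folded)
```

The same rehashing has to be applied consistently at both training and prediction time, since the hashed indices replace the original ones entirely.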
I finally got to run this on a full set of training data, and it worked really 
well - even with a single hash function.

Without hashing, I got 81% accuracy on a held-out dataset equal to 10% of all 
documents.

Hashing to 20% of the original size gave me 80% accuracy.

Hashing to 10% gave me 79.6% accuracy - still only a slight drop from the 
un-hashed baseline.

Which means my 850MB model is now 81MB.

Thanks for the help!

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




