On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:

> The first thing to try is feature hashing to reduce your feature vector size.
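The hashing trick Ted is suggesting can be sketched roughly as follows. This is a minimal illustration, not LibLinear or Mahout API: the names `NUM_BUCKETS`, `NUM_PROBES`, `hashed_indices`, and `hash_vector` are invented for this sketch, and CRC32 stands in for whatever hash function you'd actually use.

```python
# Sketch of feature hashing with multiple probes (illustrative only).
# Each raw feature name is hashed into a small fixed-size index space,
# so the model needs NUM_BUCKETS weights per class instead of one
# weight per raw feature.
import zlib

NUM_BUCKETS = 2 ** 16   # hashed feature space (assumption: 64K buckets)
NUM_PROBES = 2          # probes per feature, to soften collision damage

def hashed_indices(feature_name):
    """Map a raw feature name to NUM_PROBES indices in [0, NUM_BUCKETS)."""
    return [zlib.crc32(("%d:%s" % (probe, feature_name)).encode()) % NUM_BUCKETS
            for probe in range(NUM_PROBES)]

def hash_vector(features):
    """Build a sparse {index: value} vector from raw (name, value) pairs."""
    vec = {}
    for name, value in features:
        for idx in hashed_indices(name):
            # Each probe carries a share of the value, so a collision on
            # one probe only corrupts part of the feature's weight.
            vec[idx] = vec.get(idx, 0.0) + value / NUM_PROBES
    return vec

v = hash_vector([("word=hadoop", 1.0), ("word=solr", 1.0)])
```

One property worth noting: the hashed indices are already sequential ints in a fixed range 0..NUM_BUCKETS-1, so n shrinks from the raw feature count to the bucket count.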
Unfortunately LibLinear takes feature indices directly (it assumes they're
sequential ints from 0..n-1), so I don't think feature hashing will help here.
If I constructed a minimal perfect hash function, then I could skip storing the
mapping from feature to index, but that's not what's taking most of the memory;
it's the n x m array of weights used by LibLinear.

> With multiple probes and possibly with random weights you might be able to
> drop the size by 10x.

More details here would be great, sometime when you're not trying to type on
an iPhone :)

-- Ken

PS - My initial naive idea was to remove any row where all of the weights were
below a threshold that I calculated from the distribution of all weights.

> Sent from my iPhone
>
> On Apr 12, 2013, at 18:30, Ken Krugler <[email protected]> wrote:
>
>> Hi all,
>>
>> We're (ab)using LibLinear (a linear SVM) as a multi-class classifier, with
>> 200+ labels and 400K features.
>>
>> This results in a model that's > 800MB, which is a bit unwieldy.
>> Unfortunately LibLinear uses a full (dense) array of weights, nothing
>> sparse, since it's a port of the C version.
>>
>> I could do feature reduction (removing rows from the matrix) with Mahout
>> prior to training the model, but I'd prefer to reduce the (in-memory)
>> n x m array of weights.
>>
>> Any suggestions for approaches to take?
>>
>> Thanks,
>>
>> -- Ken
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
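For scale: a dense weight array of 400K features x 200 classes at 8 bytes per double is 400,000 * 200 * 8 = 640MB, which lines up with the > 800MB model once there are a few extra labels and some object overhead. The pruning idea from the PS, dropping any feature row whose weights are all below a threshold taken from the overall weight distribution, might be sketched like this (the function name, the list-of-rows representation, and the quantile choice are all illustrative assumptions, not LibLinear API):

```python
# Sketch of pruning a dense n x m weight array into a sparse map,
# keeping only feature rows with at least one weight whose magnitude
# exceeds a quantile of the overall |weight| distribution.
def prune_weights(weights, quantile=0.5):
    """weights: one row per feature, each row a list of per-class weights.
    Returns {feature_index: row} for rows that survive the threshold."""
    all_abs = sorted(abs(w) for row in weights for w in row)
    threshold = all_abs[int(quantile * (len(all_abs) - 1))]
    return {i: row for i, row in enumerate(weights)
            if any(abs(w) > threshold for w in row)}

# Toy example: 4 features x 2 classes; the all-tiny row gets dropped.
dense = [[0.01, -0.02], [0.9, 0.0], [0.0, 0.003], [-0.4, 0.05]]
sparse = prune_weights(dense, quantile=0.5)
```

At prediction time the classifier would then look up rows in the sparse map and treat missing features as zero-weight, trading a little lookup cost for the memory saved.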
