Yes. The assignment of features to locations in a fixed-size vector is done using hashing rather than a dictionary. With a reasonably large vector, collisions will on average not be too terrible. With smaller vectors, or where we are using massive vocabularies due to feature interactions, or simply to worry less, we can use multiple hashing to assign a single feature to multiple locations. We can prove that the resulting hashed representation retains all the information we want, and that learning a linear classifier on the hashed representation should work pretty much as well as on a full representation, using the same tricks the random projection guys use because, well, the hashed representation *is* a random linear projection.
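To make the multi-probe idea concrete, here is a minimal, hypothetical sketch of hashed encoding with several probes per feature. The class and method names are illustrative only, not Mahout's actual encoder code; the point is just that each feature's weight is spread over a few hashed locations of a fixed-size vector, so collisions hurt less.

```java
// Minimal sketch of multi-probe hashed feature encoding (illustrative,
// not Mahout's real implementation).
public class HashedEncoder {
    private final int size;    // fixed vector dimension
    private final int probes;  // number of hashed locations per feature

    public HashedEncoder(int size, int probes) {
        this.size = size;
        this.probes = probes;
    }

    // Add feature "name" with the given weight into vector v,
    // splitting the weight across several hashed locations.
    public void addFeature(double[] v, String name, double weight) {
        for (int probe = 0; probe < probes; probe++) {
            // Salt the hash with the probe number to get distinct locations.
            int h = (name + "#" + probe).hashCode();
            v[Math.floorMod(h, size)] += weight / probes;
        }
    }

    public static void main(String[] args) {
        HashedEncoder enc = new HashedEncoder(1000, 2);
        double[] v = new double[1000];
        enc.addFeature(v, "word=hello", 1.0);
        enc.addFeature(v, "word=world", 1.0);
        double sum = 0;
        for (double x : v) sum += x;
        System.out.println(sum);  // prints 2.0: total weight is preserved
    }
}
```

Note that even if two probes of the same feature collide, the total weight placed in the vector is unchanged; only the per-location scaling shifts, which is the "scaling surprises" caveat mentioned below.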
And yes, this definitely can be considered. The encoders in question have a basic interface style where you create an encoder for each variable using the variable name. There are a few different kinds of encoders for different kinds of data. Then at encoding time, you create the feature vector and add the different features one at a time. The simplest way to encode a feature value takes a string value and a weight for that feature. The encoder will parse the string value and multiply by the weight. Sometimes you have access to features in non-string form, which means that you can cheat and encode things faster.

There is also a form of encoder optimized for text that uses a Lucene analyzer to parse the text you give it. That kind of encoder also allows you to add bits of text one at a time before finalizing the encoding. This is helpful because you often want to use log(term frequency), so you have to count the number of times each word in the text occurs before encoding the feature.

There are lots of examples in the MiA chapters on classification. The guts of the examples are embodied in the TrainNewsGroups example program. The book-specific examples mostly add information about how to deploy a classifier as a web service using Thrift.

On Thu, Jan 20, 2011 at 7:45 PM, Jeff Eastman <[email protected]> wrote:

> Ted, can you add a little more about hashed vectorization? Would that come
> from hashing the terms to determine their dictionary indices? I can see how
> hashing terms into a large dimensional space could result in constant-sized
> vectors, though there might be problems if multiple terms hashed to the same
> index. Is this something we could consider for seq2sparse to make online
> clustering processes work more smoothly?
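A rough sketch of the encoder interface style described above (one encoder per variable, features added one at a time, text weighted by log of term frequency). The class and method names here are hypothetical placeholders, not Mahout's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-variable encoder interface style.
public class EncoderSketch {
    // One encoder per variable, created with the variable name so the same
    // feature value in different variables hashes to different locations.
    static class WordValueEncoder {
        private final String variableName;
        private final int size;

        WordValueEncoder(String variableName, int size) {
            this.variableName = variableName;
            this.size = size;
        }

        // Simplest form: a string value plus a weight for that feature.
        void addToVector(String value, double weight, double[] vector) {
            int h = (variableName + ":" + value).hashCode();
            vector[Math.floorMod(h, size)] += weight;
        }
    }

    public static void main(String[] args) {
        double[] features = new double[1000];
        WordValueEncoder bodyEncoder = new WordValueEncoder("body", 1000);

        // Count term frequencies first so each word can be weighted by
        // log(1 + tf) before it is encoded.
        Map<String, Integer> tf = new HashMap<>();
        for (String word : "the cat sat on the mat".split(" ")) {
            tf.merge(word, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            bodyEncoder.addToVector(e.getKey(), Math.log1p(e.getValue()), features);
        }
    }
}
```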
>
> On 1/20/11 12:08 PM, Ted Dunning wrote:
>
>> On Thu, Jan 20, 2011 at 10:56 AM, Jeff Eastman <[email protected]> wrote:
>>
>>> Hi Veronica,
>>>
>>> I've only tried incremental clustering as a thought-experiment but the
>>> kind of problem you are attacking has many areas of applicability. The
>>> problem you are seeing is that new articles bring new terms with them and
>>> this will produce different-cardinality vectors as new articles are
>>> added. You can trick the Vector implementation by creating all the
>>> vectors with maxInt cardinality but the current Mahout text vectorization
>>> (seq2sparse) does not handle the growth in the dictionary which
>>> incremental additions would entail. If we could prime seq2sparse with the
>>> dictionary from the last addition we might be able to support incremental
>>> vectorization with minimal changes.
>>
>> Jeff, using hashed vectorization would solve this as well because the
>> document vectors will always have constant size. Commonly used distances
>> should work unchanged with a hashed representation although you might have
>> a few scaling surprises with multiple probes.
>>
>>> I don't completely agree with MIA 11.3.1's "use canopy clustering"
>>> phrase; I think it is a bit misleading. Each of the clustering algorithms
>>> (including canopy) has two phases: cluster generation and vector
>>> classification using those clusters. I think the best choice for a
>>> maximum likelihood classifier would actually be KMeansDriver.clusterData()
>>> and not the CanopyDriver version (which requires t1 and t2 values to
>>> initialize the clusterer but these are never used for classification).
>>>
>>> To really implement the case study it would seem to me to require a
>>> single-threshold classification to avoid assigning new articles to
>>> existing clusters which were too dissimilar to really fit.
>>> Then these leftovers could be used to generate new clusters which could
>>> then be added to the list.
>>>
>>> Perhaps one of the authors can add some clarification on this too?
>>>
>>> Jeff
>>>
>>> On 1/20/11 8:24 AM, Veronica Joh wrote:
>>>
>>>> Hi
>>>> I have a large number of articles clustered by kmeans.
>>>> For the new articles that come in, it says I need to "use canopy
>>>> clustering to assign it to the cluster whose centroid is closest based
>>>> on a very small distance threshold" according to the Mahout in Action
>>>> book. I'm not sure how to add new article canopies to the existing
>>>> clusters.
>>>>
>>>> So I'm saving batch articles in a list of Cluster like this:
>>>>
>>>> List<Cluster> clusters = new ArrayList<Cluster>();
>>>>
>>>> For the new article canopies, I'm trying the following to measure the
>>>> distance, but I get an error like this: "Required cardinality 11981 but
>>>> got 77372"
>>>>
>>>> Text key = new Text();
>>>> Canopy value = new Canopy();
>>>> DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>> while (reader.next(key, value)) {
>>>>   for (int i = 0; i < clusters.size(); i++) {
>>>>     double d = measure.distance(clusters.get(i).getCenter(),
>>>>         value.getCenter());
>>>>   }
>>>> }
>>>>
>>>> Is this how to compare cluster centroids with new canopies? Or did I
>>>> misunderstand something?
>>>> Can you please help me so I can get the online news clustering working?
>>>> Thank you very much!
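For what it's worth, here is a hedged toy illustration (plain Java, not Mahout code) of why hashed vectorization sidesteps the "Required cardinality 11981 but got 77372" error above: every document, old or new, is hashed into a vector of the same fixed dimension, so distance computations are always well-defined regardless of vocabulary growth.

```java
// Toy illustration: fixed-size hashed vectors give every document the same
// cardinality, so Euclidean distance never sees a dimension mismatch.
public class HashedDistance {
    // Hash each token of the text into a fixed-size term-frequency vector.
    static double[] encode(String text, int size) {
        double[] v = new double[size];
        for (String word : text.toLowerCase().split("\\s+")) {
            v[Math.floorMod(word.hashCode(), size)] += 1.0;
        }
        return v;
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        int size = 1 << 12;  // fixed dimension shared by all documents
        double[] oldDoc = encode("markets fell sharply on Monday", size);
        double[] newDoc = encode("entirely new vocabulary arrives later", size);
        // Both vectors have length 4096, so the distance is always defined.
        System.out.println(euclidean(oldDoc, newDoc));
    }
}
```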
