There are two dictionary-like systems in Mahout. Neither is quite right. The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary. It doesn't do the frequency counting you want.
The more complex one is in DictionaryVectorizer. Unfortunately, it is a mass of static functions that depend on statically named files rather than being a real API.

There is a third choice as well in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder. It does on-line IDF weighting and can be used underneath a text encoder to get on-line TF-IDF weighting of the sort you desire. You can preset counts using the getDictionary accessor.

A fourth choice is to simply use a static word encoder with hashed vectors and do the IDF weighting as a vector element-wise multiplication. That way you only need to keep around a vector of weights and no dictionary. That should be much cheaper in memory.

On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Hi,
>
> I would like to try LSI processing of results produced by seq2sparse.
>
> What's more, I need to be able to fold-in a bunch of new documents
> afterwards.
>
> Is there any support for fold-in indexing in Mahout?
>
> if not, is there a quick way for me to gain the understanding of seq2sparse
> output?
> In particular, if i wanted to add fold-in indexing, i need to be able to
> produce TF or TF-IDF of the new document on the fly using pre-existing
> dictionary and word counts. What's the api for this dictionary?
>
> Thank you.
> -Dmitriy
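To make the fourth option concrete, here is a rough sketch of hashed term-frequency vectors with IDF applied as an element-wise multiply. It is plain Java and deliberately does not use Mahout's encoder or vector classes. The class name HashedTfIdf, its methods, the use of String.hashCode for bucketing, and the smoothed log((N + 1) / (df + 1)) weight are all assumptions made for illustration, not anything Mahout prescribes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: hashed TF vectors plus a weight vector.
final class HashedTfIdf {
    private final int cardinality;   // number of hash buckets (vector length)
    private final double[] docFreq;  // per-bucket document frequency
    private int numDocs;             // number of documents observed so far

    HashedTfIdf(int cardinality) {
        this.cardinality = cardinality;
        this.docFreq = new double[cardinality];
    }

    // Map a term to a bucket; colliding terms simply share a slot.
    int bucket(String term) {
        return Math.floorMod(term.hashCode(), cardinality);
    }

    // Static encoding: raw term counts per bucket, no dictionary needed.
    double[] termFrequencies(List<String> tokens) {
        double[] tf = new double[cardinality];
        for (String t : tokens) {
            tf[bucket(t)] += 1.0;
        }
        return tf;
    }

    // Update document frequencies, counting each bucket once per document.
    void observe(List<String> tokens) {
        boolean[] seen = new boolean[cardinality];
        for (String t : tokens) {
            int b = bucket(t);
            if (!seen[b]) {
                seen[b] = true;
                docFreq[b] += 1.0;
            }
        }
        numDocs++;
    }

    // Assumed smoothing: log((N + 1) / (df + 1)); other IDF variants work too.
    double[] idfWeights() {
        double[] w = new double[cardinality];
        for (int i = 0; i < cardinality; i++) {
            w[i] = Math.log((numDocs + 1.0) / (docFreq[i] + 1.0));
        }
        return w;
    }

    // TF-IDF as an element-wise product of a TF vector and the weight vector.
    static double[] multiply(double[] tf, double[] weights) {
        double[] out = Arrays.copyOf(tf, tf.length);
        for (int i = 0; i < out.length; i++) {
            out[i] *= weights[i];
        }
        return out;
    }
}
```

For fold-in, you would call observe on the original corpus once, keep the array returned by idfWeights, and then encode each new document with termFrequencies followed by multiply. Only the weight vector has to be kept around.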
