The fourth choice is what I would recommend in general, unless you need very easy reverse-engineering of your vectors.
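In code, that fourth choice amounts to something like the following. This is a toy sketch in plain Java, not the actual Mahout encoder API: the single-probe `slot()` hash here merely stands in for the multi-probe Murmur hashing that the real encoders use, and the vector cardinality is shrunk for readability.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "fourth choice": hash words into a fixed-size TF vector,
// then apply IDF as an element-wise multiplication with a weight vector.
// No dictionary is needed at encoding time -- only the weight vector.
public class HashedTfIdfSketch {

    static final int CARDINALITY = 16;  // toy size; real use would be much larger

    // Map a word to a slot with a simple hash; Mahout's encoders use
    // multiple probes of a stronger hash, which this only approximates.
    static int slot(String word) {
        return Math.floorMod(word.hashCode(), CARDINALITY);
    }

    // Encode a document's term frequencies into a hashed vector.
    static double[] encode(String[] tokens) {
        double[] tf = new double[CARDINALITY];
        for (String t : tokens) {
            tf[slot(t)] += 1.0;
        }
        return tf;
    }

    // Build the per-slot IDF weight vector from document frequencies.
    static double[] idfWeights(Map<String, Integer> docFreq, int numDocs) {
        double[] w = new double[CARDINALITY];
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            w[slot(e.getKey())] = Math.log((double) numDocs / e.getValue());
        }
        return w;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("the", 100);   // appears in every document
        docFreq.put("mahout", 3);  // rare, hence heavily weighted
        double[] weights = idfWeights(docFreq, 100);

        double[] tf = encode(new String[] {"the", "mahout", "the"});

        // TF-IDF = element-wise product of the hashed TF vector and weights.
        double[] tfidf = new double[CARDINALITY];
        for (int i = 0; i < CARDINALITY; i++) {
            tfidf[i] = tf[i] * weights[i];
        }

        // "the" has IDF log(100/100) = 0, so its slot contributes nothing;
        // "mahout" gets weight log(100/3).
        System.out.println(tfidf[slot("the")]);     // prints 0.0
        System.out.println(tfidf[slot("mahout")]);
    }
}
```

The point is that only the `weights` array has to be kept around between documents; the word-to-slot mapping is recomputed by hashing, so no dictionary lives in memory.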
On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <[email protected]> wrote:
>
> There are two dictionary-like systems in Mahout. Neither is quite right.
>
> The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary. It
> doesn't do the frequency counting you want.
>
> The more complex one is in DictionaryVectorizer. Unfortunately, it is a
> mass of static functions that depend on statically named files rather than
> being a real API.
>
> There is a third choice as well,
> in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder. It does
> on-line IDF weighting and can be used underneath a text encoder to get
> on-line TF-IDF weighting of the sort you desire. You can preset counts
> using the getDictionary accessor.
>
> A fourth choice is to simply use a static word encoder with hashed vectors
> and do the IDF weighting as a vector element-wise multiplication. That way
> you only need to keep around a vector of weights and no dictionary. That
> should be much cheaper in memory.
>
>
> On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Hi,
>>
>> I would like to try LSI processing of results produced by seq2sparse.
>>
>> What's more, I need to be able to fold in a bunch of new documents
>> afterwards.
>>
>> Is there any support for fold-in indexing in Mahout?
>>
>> If not, is there a quick way for me to gain an understanding of the
>> seq2sparse output?
>> In particular, if I wanted to add fold-in indexing, I would need to be
>> able to produce TF or TF-IDF vectors for a new document on the fly using
>> the pre-existing dictionary and word counts. What's the API for this
>> dictionary?
>>
>> Thank you.
>> -Dmitriy
