Thank you, Ted.

On Thu, Dec 30, 2010 at 1:05 PM, Ted Dunning <[email protected]> wrote:
> The fourth choice is what I would recommend in general unless you need very
> easy reverse-engineering of your vectors.
>
> On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <[email protected]> wrote:
>
> > There are two dictionary-like systems in Mahout. Neither is quite right.
> >
> > The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary. It
> > doesn't do the frequency counting you want.
> >
> > The more complex one is in DictionaryVectorizer. Unfortunately, it is a
> > mass of static functions that depend on statically named files rather than
> > being a real API.
> >
> > There is a third choice as well in
> > org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder. It does
> > on-line IDF weighting and can be used underneath a text encoder to get
> > on-line TF-IDF weighting of the sort you desire. You can preset counts
> > using the getDictionary accessor.
> >
> > A fourth choice is to simply use a static word encoder with hashed vectors
> > and do the IDF weighting as a vector element-wise multiplication. That way
> > you only need to keep around a vector of weights and no dictionary. That
> > should be much cheaper in memory.
> >
> > On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I would like to try LSI processing of results produced by seq2sparse.
> >>
> >> What's more, I need to be able to fold-in a bunch of new documents
> >> afterwards.
> >>
> >> Is there any support for fold-in indexing in Mahout?
> >>
> >> if not, is there a quick way for me to gain the understanding of
> >> seq2sparse output?
> >> In particular, if i wanted to add fold-in indexing, i need to be able to
> >> produce TF or TF-IDF of the new document on the fly using pre-existing
> >> dictionary and word counts. What's the api for this dictionary?
> >>
> >> Thank you.
> >> -Dmitriy
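For reference, the fourth choice quoted above (hashed term vectors plus an element-wise multiply by a vector of IDF weights) could look roughly like the sketch below. It is plain Java and deliberately does not call any Mahout class; the class name HashedTfIdf, the dimension of 1024, the use of String.hashCode for hashing, and the smoothed formula log((N + 1) / (df + 1)) are all my own assumptions for illustration, not what Mahout's encoders do internally.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; none of these names come from Mahout.
class HashedTfIdf {
    // Assumed size of the hashed feature space.
    static final int DIM = 1 << 10;

    // Hash a term to a slot, so no dictionary has to be kept.
    static int slot(String term) {
        return Math.floorMod(term.hashCode(), DIM);
    }

    // Hashed term-frequency vector for one tokenized document.
    static double[] tf(List<String> tokens) {
        double[] v = new double[DIM];
        for (String t : tokens) {
            v[slot(t)] += 1.0;
        }
        return v;
    }

    // One IDF weight per slot, computed once from a reference corpus.
    // Assumed smoothing: log((N + 1) / (df + 1)).
    static double[] idfWeights(List<List<String>> corpus) {
        double[] df = new double[DIM];
        for (List<String> doc : corpus) {
            Set<Integer> seen = new HashSet<>();
            for (String t : doc) {
                int s = slot(t);
                if (seen.add(s)) {
                    df[s] += 1.0; // count each slot once per document
                }
            }
        }
        double[] w = new double[DIM];
        double n = corpus.size();
        for (int i = 0; i < DIM; i++) {
            w[i] = Math.log((n + 1.0) / (df[i] + 1.0));
        }
        return w;
    }

    // Element-wise product: TF-IDF for a document, old or newly folded in.
    static double[] times(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] * b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("mahout", "vector"),
                List.of("mahout", "hash"));
        double[] weights = idfWeights(corpus);
        // A new document needs only the stored weight vector.
        double[] tfidf = times(tf(List.of("vector", "vector", "mahout")), weights);
        System.out.println(tfidf[slot("vector")]);
    }
}
```

The only state that has to be kept for fold-in is the weights array, which is the memory saving described in the quoted message. The price is that colliding terms share a slot and the vectors are harder to map back to words.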
