Also, if I have a bunch of new documents to fold in, it looks like I'd need to run a matrix multiplication job between the new document vectors and V, with both matrices represented row-wise. So DistributedRowMatrix should help me, shouldn't it? Do I need to transpose the first matrix first?
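To make the question concrete, here is a minimal in-memory sketch of the fold-in product in plain Python. It is not Mahout code. It assumes each new document is a sparse row (term index to weight) and that V is stored one row per term with k components; the function name `fold_in` and the optional division by the singular values are illustrative assumptions. When V is held row-wise in memory, each document can be projected by summing the V rows of its non-zero terms, so no explicit transpose is needed here. Whether DistributedRowMatrix requires one depends on how its multiplication job pairs up rows, which this sketch does not settle.

```python
def fold_in(doc_rows, v_rows, singular_values=None):
    """Project sparse document rows into the k-dimensional LSI space.

    doc_rows: list of dicts {term_index: weight}, one dict per new document.
    v_rows: list of lists; v_rows[j] holds the k components of term j (row j of V).
    singular_values: optional list of k values; if given, each projected
        component is divided by the matching singular value (an assumption
        about the desired scaling, not something the thread specifies).
    """
    k = len(v_rows[0])
    projected = []
    for doc in doc_rows:
        acc = [0.0] * k
        # Accumulate weight * (row of V) over the document's non-zero terms.
        for term, weight in doc.items():
            row = v_rows[term]
            for c in range(k):
                acc[c] += weight * row[c]
        if singular_values is not None:
            acc = [a / s for a, s in zip(acc, singular_values)]
        projected.append(acc)
    return projected


if __name__ == "__main__":
    V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 terms, k = 2
    new_docs = [{0: 1.0, 2: 2.0}]              # one document with terms 0 and 2
    print(fold_in(new_docs, V))
```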
Thank you once again, your help is really invaluable.

-Dmitriy

On Thu, Dec 30, 2010 at 1:38 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Thank you, Ted.
>
> On Thu, Dec 30, 2010 at 1:05 PM, Ted Dunning <[email protected]> wrote:
>
>> The fourth choice is what I would recommend in general unless you need
>> very easy reverse-engineering of your vectors.
>>
>> On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <[email protected]> wrote:
>>
>> > There are two dictionary-like systems in Mahout. Neither is quite right.
>> >
>> > The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.
>> > It doesn't do the frequency counting you want.
>> >
>> > The more complex one is in DictionaryVectorizer. Unfortunately, it is a
>> > mass of static functions that depend on statically named files rather
>> > than being a real API.
>> >
>> > There is a third choice as well
>> > in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder. It
>> > does on-line IDF weighting and can be used underneath a text encoder to
>> > get on-line TF-IDF weighting of the sort you desire. You can preset
>> > counts using the getDictionary accessor.
>> >
>> > A fourth choice is to simply use a static word encoder with hashed
>> > vectors and do the IDF weighting as a vector element-wise
>> > multiplication. That way you only need to keep around a vector of
>> > weights and no dictionary. That should be much cheaper in memory.
>> >
>> > On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I would like to try LSI processing of results produced by seq2sparse.
>> >>
>> >> What's more, I need to be able to fold-in a bunch of new documents
>> >> afterwards.
>> >>
>> >> Is there any support for fold-in indexing in Mahout?
>> >>
>> >> if not, is there a quick way for me to gain the understanding of
>> >> seq2sparse output?
>> >>
>> >> In particular, if i wanted to add fold-in indexing, i need to be able
>> >> to produce TF or TF-IDF of the new document on the fly using
>> >> pre-existing dictionary and word counts. What's the api for this
>> >> dictionary?
>> >>
>> >> Thank you.
>> >> -Dmitriy
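
The fourth choice in the quoted reply (hashed vectors plus IDF applied as an element-wise multiplication) can be sketched roughly as follows. This is plain Python, not the Mahout encoders. The function names, the use of CRC32 as the hash, and the smoothed IDF formula are all assumptions made for illustration. The point it shows is that only a fixed-size weight vector has to be kept around, with no dictionary.

```python
import math
import zlib


def slot(token, size):
    """Map a token to a vector index with a deterministic hash (CRC32, an assumption)."""
    return zlib.crc32(token.encode("utf-8")) % size


def hashed_tf(tokens, size):
    """Term-frequency vector of fixed length `size`; colliding tokens share a slot."""
    vec = [0.0] * size
    for tok in tokens:
        vec[slot(tok, size)] += 1.0
    return vec


def idf_weights(corpus, size):
    """Per-slot IDF weights from a tokenized corpus (smoothed formula, an assumption)."""
    df = [0] * size
    for tokens in corpus:
        # Count each slot at most once per document.
        for s in {slot(t, size) for t in tokens}:
            df[s] += 1
    n = len(corpus)
    return [math.log((n + 1) / (d + 1)) + 1.0 for d in df]


def tf_idf(tokens, weights):
    """TF-IDF as an element-wise product of the hashed TF vector and the weights."""
    tf = hashed_tf(tokens, len(weights))
    return [t * w for t, w in zip(tf, weights)]


if __name__ == "__main__":
    corpus = [["lsi", "fold", "in"], ["lsi", "svd"]]
    weights = idf_weights(corpus, 32)
    print(tf_idf(["lsi", "lsi", "svd"], weights))
```

A new document only needs `hashed_tf` and the stored `weights`, so producing its TF-IDF vector on the fly does not touch the original corpus.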
