Thank you, Ted.

On Thu, Dec 30, 2010 at 1:05 PM, Ted Dunning <[email protected]> wrote:

> The fourth choice is what I would recommend in general unless you need very
> easy reverse-engineering of your vectors.
>
> On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <[email protected]>
> wrote:
>
> >
> > There are two dictionary-like systems in Mahout.  Neither is quite right.
> >
> > The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.  It
> > doesn't do the frequency counting you want.
> >
> > The more complex one is in DictionaryVectorizer.  Unfortunately, it is a
> > mass of static functions that depend on statically named files rather than
> > being a real API.
> >
> > There is a third choice as well, in
> > org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder.  It does
> > on-line IDF weighting and can be used underneath a text encoder to get
> > on-line TF-IDF weighting of the sort you desire.  You can preset counts
> > using the getDictionary accessor.
> >
> > A fourth choice is to simply use a static word encoder with hashed vectors
> > and do the IDF weighting as a vector element-wise multiplication.  That way
> > you only need to keep around a vector of weights and no dictionary.  That
> > should be much cheaper in memory.
> >
> >
> > On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I would like to try LSI processing of results produced by seq2sparse.
> >>
> >> What's more, I need to be able to fold-in a bunch of new documents
> >> afterwards.
> >>
> >> Is there any support for fold-in indexing in Mahout?
> >>
> >> If not, is there a quick way for me to gain an understanding of the
> >> seq2sparse output?
> >> In particular, if I wanted to add fold-in indexing, I would need to be able
> >> to produce TF or TF-IDF of the new document on the fly using the
> >> pre-existing dictionary and word counts. What's the API for this dictionary?
> >>
> >> Thank you.
> >> -Dmitriy
> >>
> >
> >
>
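
For reference, the fourth choice quoted above (hashed vectors, with the IDF
weighting applied as an element-wise multiplication) can be sketched in plain
Java roughly as follows. This is only an illustration under stated
assumptions: it does not use Mahout's encoder classes, the class and method
names (HashedTfIdfSketch, index, termFrequencies, idfWeights, times) are made
up for the example, String.hashCode stands in for a proper hash function, and
the smoothed IDF formula is one common variant, not necessarily the one
Mahout computes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Mahout code: hashed term vectors plus a vector of
// IDF weights, so that no word dictionary has to be kept in memory.
final class HashedTfIdfSketch {

    // Map a word to a slot in [0, size). String.hashCode is an assumption
    // made for brevity; a real encoder would use a stronger hash.
    static int index(String word, int size) {
        return Math.floorMod(word.hashCode(), size);
    }

    // Raw term counts for one document, accumulated into hashed slots.
    // Words that collide simply share a slot.
    static double[] termFrequencies(List<String> tokens, int size) {
        double[] tf = new double[size];
        for (String token : tokens) {
            tf[index(token, size)] += 1.0;
        }
        return tf;
    }

    // One IDF weight per slot, computed once from the training corpus.
    // Assumed formula: log((N + 1) / (df + 1)) + 1, a smoothed variant.
    static double[] idfWeights(List<List<String>> corpus, int size) {
        double[] df = new double[size];
        for (List<String> doc : corpus) {
            boolean[] seen = new boolean[size];
            for (String token : doc) {
                int i = index(token, size);
                if (!seen[i]) {          // count each slot once per document
                    seen[i] = true;
                    df[i] += 1.0;
                }
            }
        }
        double[] idf = new double[size];
        double n = corpus.size();
        for (int i = 0; i < size; i++) {
            idf[i] = Math.log((n + 1.0) / (df[i] + 1.0)) + 1.0;
        }
        return idf;
    }

    // Element-wise product: this is the whole TF-IDF weighting step.
    static double[] times(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector sizes differ");
        }
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] * b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int size = 1024;
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("apple", "banana"),
            Arrays.asList("apple", "cherry"),
            Arrays.asList("apple"));
        double[] idf = idfWeights(corpus, size);   // keep only this vector

        // Folding in a new document needs just the hash and the weights.
        double[] tf = termFrequencies(Arrays.asList("apple", "apple"), size);
        double[] tfidf = times(tf, idf);
        System.out.println(tfidf[index("apple", size)]);
    }
}
```

The point of the design is that a new document can be weighted using only
the hash function and the stored weight vector, which is what makes it
cheaper in memory than a dictionary. The cost is that hashed slots cannot be
mapped back to words easily, which is the reverse-engineering caveat
mentioned in the quoted reply.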
