Re: seq2sparse and lsi fold-in

Ted Dunning Thu, 30 Dec 2010 13:05:09 -0800

There are two dictionary-like systems in Mahout.  Neither is quite right.

The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.  It
doesn't do the frequency counting you want.

The more complex one is in DictionaryVectorizer.  Unfortunately, it is a
mass of static functions that depend on statically named files rather than
being a real API.

There is a third choice as well
in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder.  It does
on-line IDF weighting and can be used underneath a text encoder to get
on-line TF-IDF weighting of the sort you desire.  You can preset counts
using the getDictionary accessor.
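
To make the on-line idea concrete, here is a minimal sketch in plain Java. It is
not Mahout code and does not reproduce AdaptiveWordValueEncoder's actual
weighting formula. The class name OnlineIdf, its methods, and the smoothed
log((N + 1) / (df + 1)) weight are all assumptions made for illustration. It
only shows the shape of the approach: counts are updated as each document
arrives, and they can be preset from an earlier pass.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the Mahout API): on-line TF-IDF where document
// frequencies are updated as each new document is encoded.
class OnlineIdf {
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;

    // Preset counts, e.g. from a previously processed corpus.
    void presetNumDocs(int n) { numDocs = n; }
    void preset(String term, int df) { docFreq.put(term, df); }

    // Count this document, then return term -> tf * idf using the updated counts.
    Map<String, Double> encode(String[] tokens) {
        numDocs++;
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // Each distinct term in the document raises its document frequency by one.
            int df = docFreq.merge(e.getKey(), 1, Integer::sum);
            // Assumed smoothed IDF; Mahout's own formula may differ.
            double idf = Math.log((numDocs + 1.0) / (df + 1.0));
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }
}
```

With presets taken from the original corpus, a folded-in document gets weights
that are consistent with the existing counts, at the cost of keeping the whole
term-to-count map in memory.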

A fourth choice is to simply use a static word encoder with hashed vectors
and do the IDF weighting as a vector element-wise multiplication.  That way
you only need to keep around a vector of weights and no dictionary.  That
should be much cheaper in memory.
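
A rough sketch of this fourth option, again in plain Java rather than against
the Mahout classes. The class HashedTfIdf, its method names, the use of
String.hashCode for slot selection, and the smoothed IDF formula are all
assumptions for illustration. The point is only that the stored state is one
weight vector of fixed size, with no term dictionary.

```java
// Hypothetical sketch (not the Mahout API): hashed term-frequency vectors with
// IDF applied as an element-wise multiplication by a fixed weight vector.
class HashedTfIdf {
    // Map a token to a slot in [0, numFeatures) using its hash code.
    static int slot(String token, int numFeatures) {
        return Math.floorMod(token.hashCode(), numFeatures);
    }

    // Stateless encoding: count tokens per slot. Colliding tokens share a slot.
    static double[] encodeTf(String[] tokens, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String t : tokens) {
            tf[slot(t, numFeatures)] += 1.0;
        }
        return tf;
    }

    // One pass over the corpus yields per-slot IDF weights (assumed smoothing).
    static double[] idfWeights(String[][] corpus, int numFeatures) {
        double[] df = new double[numFeatures];
        for (String[] doc : corpus) {
            boolean[] seen = new boolean[numFeatures];
            for (String t : doc) {
                int s = slot(t, numFeatures);
                if (!seen[s]) {
                    seen[s] = true;
                    df[s] += 1.0;
                }
            }
        }
        double[] idf = new double[numFeatures];
        for (int i = 0; i < numFeatures; i++) {
            idf[i] = Math.log((corpus.length + 1.0) / (df[i] + 1.0));
        }
        return idf;
    }

    // TF-IDF is the element-wise product of the TF vector and the weight vector.
    static double[] times(double[] tf, double[] idf) {
        double[] out = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            out[i] = tf[i] * idf[i];
        }
        return out;
    }
}
```

To fold in a new document, you would compute encodeTf on its tokens and
multiply by the saved weight vector. Nothing but that vector and the value of
numFeatures has to be kept from the original run.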

On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Hi,
>
> I would like to try LSI processing of results produced by seq2sparse.
>
> What's more, I need to be able to fold-in a bunch of new documents
> afterwards.
>
> Is there any support for fold-in indexing in Mahout?
>
> If not, is there a quick way for me to gain an understanding of seq2sparse
> output?
> In particular, if I wanted to add fold-in indexing, I would need to be able to
> produce the TF or TF-IDF of a new document on the fly using the pre-existing
> dictionary and word counts. What's the API for this dictionary?
>
> Thank you.
> -Dmitriy
>
