The fourth choice is what I would recommend in general, unless you need very easy reverse-engineering of your vectors.
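In code, that fourth choice amounts to something like the following. This is a toy sketch in plain Java, not the actual Mahout encoder API: the single-probe `slot()` hash here merely stands in for the multi-probe Murmur hashing that the real encoders use, and the vector cardinality is shrunk for readability.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "fourth choice": hash words into a fixed-size TF vector,
// then apply IDF as an element-wise multiplication with a weight vector.
// No dictionary is needed at encoding time -- only the weight vector.
public class HashedTfIdfSketch {

    static final int CARDINALITY = 16;  // toy size; real use would be much larger

    // Map a word to a slot with a simple hash; Mahout's encoders use
    // multiple probes of a stronger hash, which this only approximates.
    static int slot(String word) {
        return Math.floorMod(word.hashCode(), CARDINALITY);
    }

    // Encode a document's term frequencies into a hashed vector.
    static double[] encode(String[] tokens) {
        double[] tf = new double[CARDINALITY];
        for (String t : tokens) {
            tf[slot(t)] += 1.0;
        }
        return tf;
    }

    // Build the per-slot IDF weight vector from document frequencies.
    static double[] idfWeights(Map<String, Integer> docFreq, int numDocs) {
        double[] w = new double[CARDINALITY];
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            w[slot(e.getKey())] = Math.log((double) numDocs / e.getValue());
        }
        return w;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("the", 100);   // appears in every document
        docFreq.put("mahout", 3);  // rare, hence heavily weighted
        double[] weights = idfWeights(docFreq, 100);

        double[] tf = encode(new String[] {"the", "mahout", "the"});

        // TF-IDF = element-wise product of the hashed TF vector and weights.
        double[] tfidf = new double[CARDINALITY];
        for (int i = 0; i < CARDINALITY; i++) {
            tfidf[i] = tf[i] * weights[i];
        }

        // "the" has IDF log(100/100) = 0, so its slot contributes nothing;
        // "mahout" gets weight log(100/3).
        System.out.println(tfidf[slot("the")]);     // prints 0.0
        System.out.println(tfidf[slot("mahout")]);
    }
}
```

The point is that only the `weights` array has to be kept around between documents; the word-to-slot mapping is recomputed by hashing, so no dictionary lives in memory.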
On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <[email protected]> wrote:
>
> There are two dictionary-like systems in Mahout. Neither is quite right.
>
> The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary. It
> doesn't do the frequency counting you want.
>
> The more complex one is in DictionaryVectorizer. Unfortunately, it is a
> mass of static functions that depend on statically named files rather than
> being a real API.
>
> There is a third choice as well,
> in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder. It does
> on-line IDF weighting and can be used underneath a text encoder to get
> on-line TF-IDF weighting of the sort you desire. You can preset counts
> using the getDictionary accessor.
>
> A fourth choice is to simply use a static word encoder with hashed vectors
> and do the IDF weighting as a vector element-wise multiplication. That way
> you only need to keep around a vector of weights and no dictionary. That
> should be much cheaper in memory.
>
>
> On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Hi,
>>
>> I would like to try LSI processing of results produced by seq2sparse.
>>
>> What's more, I need to be able to fold in a bunch of new documents
>> afterwards.
>>
>> Is there any support for fold-in indexing in Mahout?
>>
>> If not, is there a quick way for me to gain an understanding of the
>> seq2sparse output?
>> In particular, if I wanted to add fold-in indexing, I would need to be
>> able to produce TF or TF-IDF vectors for a new document on the fly using
>> the pre-existing dictionary and word counts. What's the API for this
>> dictionary?
>>
>> Thank you.
>> -Dmitriy
