Re: seq2sparse and lsi fold-in

Dmitriy Lyubimov Thu, 30 Dec 2010 15:58:19 -0800

Also, if I have a bunch of new documents to fold in, it looks like I'd need
to run a matrix multiplication job between the new document vectors and V, with
both matrices represented row-wise. So DistributedRowMatrix should help me,
shouldn't it? Do I need to transpose the first matrix first?

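To make the multiplication concrete, here is a minimal sketch of folding in a single document row with plain arrays. It does not use DistributedRowMatrix or any other Mahout class; the class and method names are made up for illustration. It assumes the decomposition is of a documents-by-terms matrix, so V has one row per term, and it assumes the usual fold-in convention of dividing by the singular values, which is not something stated in this thread (pass null to skip that step).

```java
/** Hypothetical illustration only; not part of Mahout. */
final class FoldInSketch {

    /**
     * Projects one new document row into the k-dimensional LSI space.
     *
     * @param doc   term weights of the new document, length = number of terms
     * @param v     right singular vectors, one row per term, k columns
     * @param sigma k singular values, or null to return doc * V unscaled
     * @return doc * V, with column j divided by sigma[j] when sigma is given
     */
    static double[] foldIn(double[] doc, double[][] v, double[] sigma) {
        if (v.length == 0 || doc.length != v.length) {
            throw new IllegalArgumentException("doc length must equal the number of rows in V");
        }
        int k = v[0].length;
        double[] projected = new double[k];
        for (int term = 0; term < doc.length; term++) {
            double weight = doc[term];
            if (weight == 0.0) {
                continue; // document vectors are mostly zeros, so skip them
            }
            for (int j = 0; j < k; j++) {
                projected[j] += weight * v[term][j];
            }
        }
        if (sigma != null) {
            for (int j = 0; j < k; j++) {
                projected[j] /= sigma[j];
            }
        }
        return projected;
    }
}
```

For a batch of new documents the same loop runs once per row, which is the product of the new-document matrix with V.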

Thank you once again, your help is really invaluable.

-Dmitriy

On Thu, Dec 30, 2010 at 1:38 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Thank you, Ted.
>
>
> On Thu, Dec 30, 2010 at 1:05 PM, Ted Dunning <[email protected]> wrote:
>
>> The fourth choice is what I would recommend in general unless you need very
>> easy reverse-engineering of your vectors.
>>
>> On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <[email protected]> wrote:
>>
>> >
>> > There are two dictionary-like systems in Mahout.  Neither is quite right.
>> >
>> > The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.  It
>> > doesn't do the frequency counting you want.
>> >
>> > The more complex one is in DictionaryVectorizer.  Unfortunately, it is a
>> > mass of static functions that depend on statically named files rather than
>> > being a real API.
>> >
>> > There is a third choice as well in
>> > org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder.  It does
>> > on-line IDF weighting and can be used underneath a text encoder to get
>> > on-line TF-IDF weighting of the sort you desire.  You can preset counts
>> > using the getDictionary accessor.
>> >
>> > A fourth choice is to simply use a static word encoder with hashed vectors
>> > and do the IDF weighting as a vector element-wise multiplication.  That way
>> > you only need to keep around a vector of weights and no dictionary.  That
>> > should be much cheaper in memory.
>> >
>> >
>> > On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I would like to try LSI processing of results produced by seq2sparse.
>> >>
>> >> What's more, I need to be able to fold-in a bunch of new documents
>> >> afterwards.
>> >>
>> >> Is there any support for fold-in indexing in Mahout?
>> >>
>> >> If not, is there a quick way for me to gain an understanding of the
>> >> seq2sparse output?
>> >> In particular, if I wanted to add fold-in indexing, I would need to be able
>> >> to produce the TF or TF-IDF of a new document on the fly using a pre-existing
>> >> dictionary and word counts. What's the API for this dictionary?
>> >>
>> >> Thank you.
>> >> -Dmitriy
>> >>
>> >
>> >
>>
>
>

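The fourth choice quoted above (hashed vectors, with IDF applied as an element-wise multiplication) could be sketched roughly as follows. This is not Mahout's static word encoder: the class name, the method names and the use of String.hashCode to pick a slot are all assumptions made for illustration, and the IDF weight vector is assumed to have been computed beforehand in the same hashed space.

```java
/** Hypothetical illustration only; not part of Mahout. */
final class HashedTfIdfSketch {

    /**
     * Builds a term-frequency vector by hashing each token into one of
     * numFeatures slots. No dictionary is kept; colliding tokens share a slot.
     */
    static double[] encode(String[] tokens, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String token : tokens) {
            // floorMod keeps the slot non-negative for negative hash codes
            int slot = Math.floorMod(token.hashCode(), numFeatures);
            tf[slot] += 1.0;
        }
        return tf;
    }

    /** Applies IDF weighting as an element-wise product of tf and idf. */
    static double[] weight(double[] tf, double[] idf) {
        if (tf.length != idf.length) {
            throw new IllegalArgumentException("tf and idf must have the same length");
        }
        double[] tfIdf = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            tfIdf[i] = tf[i] * idf[i];
        }
        return tfIdf;
    }
}
```

The only state carried between documents is the idf array, which is the memory saving the fourth choice refers to. The cost is that a slot cannot easily be mapped back to a word.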