The only way I know of to build the model incrementally is to 'fold in'
new observations.

However, folding in (which, as Ted explained elsewhere, is just a
multiplication of the new vector with the trained matrices) is only a
projection into the already-trained space of factors, not a repetition
of the training itself, which is what an actual SVD would be.
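To make the distinction concrete, here is a minimal numpy sketch of what
fold-in amounts to. The matrix and vectors are toy values, not from any
real corpus; the point is only that the new document is projected with the
fixed U and singular values, and nothing is retrained:

```python
import numpy as np

# Toy term-document matrix (4 terms x 3 documents); values are made up.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [1., 0., 2.]])

k = 2  # rank of the truncated SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

# Folding in: project a new document's term-count vector into the
# existing factor space. Note Uk and sk are left untouched.
d_new = np.array([1., 0., 1., 1.])
d_hat = (Uk.T @ d_new) / sk   # equivalent to diag(s_k)^-1 * U_k^T * d

# d_hat now lives in the same k-dimensional space as the trained
# document vectors (the columns of Vt[:k]) and can be compared to them.
print(d_hat.shape)  # (2,)
```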

This can be useful for more than just incremental updates. For example,
here is roughly what I do: I am working on a particular industry domain,
so I look for a large corpus of documents pertinent to that domain.
That is what I run LSA on.

Then I actually throw away the entire document matrix and build a 100%
fold-in document index based only on the term matrix and the singular values.

The reason for this is that the actual corpus of documents you are
working with and need proximity comparisons over is often not big
enough to give a true picture of the specific domain you want to fit
the documents into. So the generic workflow I settled on in my case is:
take a really big corpus relevant to your business, fit a term
dictionary to it using LSA, and then use that dictionary to fold in
new documents, under the assumption that the big corpus is more
representative of, and much bigger than, the documents you actually
want to retrieve and compare.

That way you have one fairly infrequent training run and a fairly
simple fold-in procedure.
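The whole workflow above can be sketched as follows. This is a toy
illustration under stated assumptions: the background-corpus matrix, the
rank, and the document vectors are all invented, and in practice the SVD
would come from Mahout over a large corpus rather than numpy:

```python
import numpy as np

# One-off "training": SVD of a (toy) background-corpus matrix
# (4 terms x 4 documents). In practice this corpus is the big,
# domain-relevant one; these numbers are purely illustrative.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 2., 1., 0.],
              [1., 0., 2., 1.]])
k = 2
U, s, _ = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]   # keep only the term matrix + singular values

def fold_in(d):
    """Project a raw term-count vector into the trained LSA space."""
    return (Uk.T @ d) / sk

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Build the working index entirely by fold-in; the original document
# matrix from training is discarded and no retraining happens.
working_docs = [np.array([1., 0., 1., 1.]),
                np.array([0., 2., 0., 1.])]
index = [fold_in(d) for d in working_docs]

# Proximity comparison of a folded-in query against the index.
query = fold_in(np.array([1., 1., 0., 0.]))
scores = [cosine(query, doc) for doc in index]
```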

On Thu, Nov 17, 2011 at 1:47 PM, Grant Ingersoll <[email protected]> wrote:
> I've never implemented LSI.  Is there a way to incrementally build the model 
> (by simply indexing documents) or is it something that one only runs after 
> the fact once one has built up the much bigger matrix?  If it's the former, I 
> bet it wouldn't be that hard to just implement the appropriate new codecs and 
> similarity, assuming Lucene trunk.  If it's the latter, then Ted's comment 
> about pushing back into Lucene gets a bit hairier.  Still, I wonder if the 
> Codecs/Similarity could help here, too.
>
> What's a typical workflow look like for building all of this?
>
> On Nov 13, 2011, at 3:58 PM, Ted Dunning wrote:
>
>> Essentially not.
>>
>> And I would worry about how to push the LSI vectors back into lucene in a
>> coherent and usable way.
>>
>> On Sun, Nov 13, 2011 at 10:47 AM, Sebastian Schelter <[email protected]> wrote:
>>
>>> Is there some documentation/tutorial available on how to build a LSI
>>> pipeline with mahout and lucene?
>>>
>>> --sebastian
>>>
>
>
>
