On May 4, 2011, at 2:31 PM, Jake Mannix wrote:

> On Wed, May 4, 2011 at 10:46 AM, Ted Dunning <[email protected]> wrote:
>
>> Pipelining is good for abstraction and really bad for performance (in the
>> map-reduce world).
>>
>> My thought is that we could have a multipurpose tool. Input would be a
>> lucene index and the program would read term vectors or original text as
>> available. Output would be either a sequence file full of text or a
>> sequence file full of vectors.
>
> Ok, sure, then this is modifying the lucene.vectors code, not the
> seq2sparse code, right?
Easiest is to dump to text and then use seq2sparse, which already has all of the functionality for tokenizing, etc. As Jake said, it's about 5 lines of code plus boilerplate; I think I even have some lying around somewhere.

If we go the route Ted suggests here, we should probably refactor both lucene.vectors and seq2sparse to share the piece that does the analysis. After all, it's entirely feasible that one would want to post-process what comes out of the term vectors too (for instance, if the text wasn't stemmed at index time, or if you wanted more aggressive stopword removal).

-Grant
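For what it's worth, the "5 lines plus boilerplate" dump might look roughly like this — a hypothetical sketch only, assuming a Lucene 3.x index and Hadoop on the classpath; the field name "body" and the paths are placeholders, not actual Mahout code:

```java
// Sketch: dump the stored text field of every live doc in a Lucene index
// into a SequenceFile<Text, Text>, the input format seq2sparse consumes.
IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexDir)));
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path(outputDir, "chunk-0"), Text.class, Text.class);
try {
  for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i)) {
      continue;                         // skip deleted docs
    }
    Document doc = reader.document(i);
    String text = doc.get("body");      // assumes "body" was stored
    if (text != null) {
      writer.append(new Text(String.valueOf(i)), new Text(text));
    }
  }
} finally {
  writer.close();
  reader.close();
}
```

The point of dumping to text first is that the resulting SequenceFile is exactly what seq2sparse expects, so tokenization, stemming, n-gram generation, and weighting all stay in that one tool rather than being duplicated in the index-reading code.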
