On May 4, 2011, at 2:31 PM, Jake Mannix wrote:

> On Wed, May 4, 2011 at 10:46 AM, Ted Dunning <[email protected]> wrote:
>
>> Pipelining is good for abstraction and really bad for performance (in the
>> map-reduce world).
>>
>> My thought is that we could have a multipurpose tool. Input would be a
>> lucene index and the program would read term vectors or original text as
>> available. Output would be either a sequence file full of text or a
>> sequence file full of vectors.
>
> Ok, sure, then this is modifying the lucene.vectors code, not the
> seq2sparse code, right?
Easiest is to dump to text and then use seq2sparse, which already has all of the functionality for tokenizing, etc. As Jake said, it's about 5 lines of code plus boilerplate; I think I even have some lying around somewhere.

If we go the route Ted suggests here, we should probably refactor both lucene.vectors and seq2sparse to share the piece that does the analysis. After all, it's entirely feasible that one would want to post-process what comes out of the term vectors too (for instance, if the text wasn't stemmed at index time, or if you wanted more aggressive stopword removal).

-Grant
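For what it's worth, the "5 lines plus boilerplate" dump might look roughly like this — a hypothetical sketch only, assuming a Lucene 3.x index and Hadoop on the classpath; the field name "body" and the paths are placeholders, not actual Mahout code:

```java
// Sketch: dump the stored text field of every live doc in a Lucene index
// into a SequenceFile<Text, Text>, the input format seq2sparse consumes.
IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexDir)));
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path(outputDir, "chunk-0"), Text.class, Text.class);
try {
  for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i)) {
      continue;                         // skip deleted docs
    }
    Document doc = reader.document(i);
    String text = doc.get("body");      // assumes "body" was stored
    if (text != null) {
      writer.append(new Text(String.valueOf(i)), new Text(text));
    }
  }
} finally {
  writer.close();
  reader.close();
}
```

The point of dumping to text first is that the resulting SequenceFile is exactly what seq2sparse expects, so tokenization, stemming, n-gram generation, and weighting all stay in that one tool rather than being duplicated in the index-reading code.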
