Pipelining is good for abstraction and really bad for performance (in the map-reduce world).
My thought is that we could have a multipurpose tool. Input would be a Lucene index, and the program would read term vectors or original text, whichever is available. Output would be either a sequence file full of text or a sequence file full of vectors. This would allow pipelining where that is interesting, but would also allow the common case of generating vectors to proceed in one step.

On Wed, May 4, 2011 at 10:41 AM, Jake Mannix <[email protected]> wrote:

> On Wed, May 4, 2011 at 10:33 AM, Ted Dunning <[email protected]>
> wrote:
>
> > It might be that the right thing is to just tweak the current seq2sparse
> > process.
> >
> > Jake,
> >
> > is that what you were thinking?
> >
>
> Well seq2sparse is really for grabbing sequence files, and lucene.vector
> grabs lucene indexes... I was just imagining another script that takes
> lucene indexes and produces text files (or sequence files of text), so
> you can just pipeline it.
>
> I haven't thought about it too carefully, however.
>
