Thanks, Jake! I also need certain files that are generated in the seq2sparse process (tf), so lucene.vector might not be the best choice. I'll take a look at dumping stored fields, then.
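
For reference, a minimal sketch of the "dump stored fields to text" step, under stated assumptions. The class name StoredFieldDumper and its dump method are hypothetical and not part of Mahout or Lucene. Only the file-writing half is real code here; reading the stored fields out of the index is outlined in a comment against the Lucene 3.x API and is untested.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: writes one UTF-8 text file per document into a directory. */
class StoredFieldDumper {

    /** Writes each (docId, text) entry to outDir/<docId>.txt and returns the number of files written. */
    static int dump(Map<String, String> docs, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int written = 0;
        for (Map.Entry<String, String> e : docs.entrySet()) {
            String text = e.getValue();
            if (text == null) {
                continue; // the field was not stored for this document
            }
            // Keep file names safe: replace anything outside [A-Za-z0-9._-] with '_'.
            String name = e.getKey().replaceAll("[^A-Za-z0-9._-]", "_") + ".txt";
            Files.write(outDir.resolve(name), text.getBytes(StandardCharsets.UTF_8));
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // In a real tool the map would be filled from the index, roughly like this
        // (Lucene 3.x calls, shown only as an outline; idField and textField are assumed names):
        //   IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexPath)));
        //   for (int i = 0; i < reader.maxDoc(); i++) {
        //       if (reader.isDeleted(i)) continue;
        //       Document d = reader.document(i);
        //       docs.put(d.get(idField), d.get(textField));
        //   }
        //   reader.close();
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc-1", "first stored body");
        docs.put("doc/2", "second stored body");
        Path out = Files.createTempDirectory("stored-fields");
        System.out.println(dump(docs, out) + " files written to " + out);
    }
}
```

The output directory could then be handed to seqdirectory, and the resulting sequence files to seq2sparse, which also produces the tf output I need. Holding every document in a map only suits small indexes; a real tool would write each file as it reads the document.
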
Thanks

2011/5/4 Jake Mannix <[email protected]>

> On Wed, May 4, 2011 at 8:53 AM, Julian Limon <[email protected]> wrote:
>
> > This sounds really interesting. Is there a way to dump certain fields from a
> > Lucene index to text files?
> >
> > If so, I could use Lucene to do the parsing, and then seqdirectory and
> > seq2sparse to generate Mahout vectors out of these files.
>
> You need to either have the fields Store.YES, or TermVector.YES for this
> to work. If you have the latter, then you don't need them in text files, you
> can use the usual lucene.vector script to produce mahout vectors.
>
> To dump stored fields, we don't currently have a script to do that, but it
> should be another 5 lines of code to write one (ok, 25 lines, including
> boilerplate, damn java). File a ticket, there are lots of people around here
> who could write that code.
>
> -jake
>
> > Thanks,
> >
> > Julian
> >
> > 2011/5/3 Jake Mannix <[email protected]>
> >
> > > On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <[email protected]> wrote:
> > >
> > > > > Although technically, we could add the capability to take a Store.YES
> > > > > field and re-tokenize and build vectors from this as well.
> > > >
> > > > True, or we could just dump stored fields out to text and use the
> > > > existing text converter
> > >
> > > That would probably be the right way to do that, actually.
> > >
> > > -jake
