It might be that the right thing is to just tweak the current seq2sparse process.
Jake, is that what you were thinking?

On Wed, May 4, 2011 at 10:22 AM, Julian Limon <[email protected]> wrote:

> Thanks, Jake!
>
> I also need certain files that are generated in the seq2sparse process
> (tf), so lucene.vector might not be the best choice. I'll take a look at
> dumping stored fields, then.
>
> Thanks
>
> 2011/5/4 Jake Mannix <[email protected]>
>
> > On Wed, May 4, 2011 at 8:53 AM, Julian Limon <[email protected]> wrote:
> >
> > > This sounds really interesting. Is there a way to dump certain fields
> > > from a Lucene index to text files?
> > >
> > > If so, I could use Lucene to do the parsing, and then seqdirectory and
> > > seq2sparse to generate Mahout vectors out of these files.
> >
> > You need to either have the fields Store.YES, or TermVector.YES for this
> > to work. If you have the latter, then you don't need them in text files,
> > you can use the usual lucene.vector script to produce mahout vectors.
> >
> > To dump stored fields, we don't currently have a script to do that, but
> > it should be another 5 lines of code to write one (ok, 25 lines,
> > including boilerplate, damn java). File a ticket, there are lots of
> > people around here who could write that code.
> >
> > -jake
> >
> > > Thanks,
> > >
> > > Julian
> > >
> > > 2011/5/3 Jake Mannix <[email protected]>
> > >
> > > > On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <[email protected]> wrote:
> > > >
> > > > > > Although technically, we could add the capability to take a
> > > > > > Store.YES field and re-tokenize and build vectors from this
> > > > > > as well.
> > > > >
> > > > > True, or we could just dump stored fields out to text and use the
> > > > > existing text converter
> > > >
> > > > That would probably be the right way to do that, actually.
> > > >
> > > > -jake
