Re: LDA from Lucene Indexes

Jake Mannix Wed, 04 May 2011 10:09:36 -0700

On Wed, May 4, 2011 at 8:53 AM, Julian Limon <[email protected]> wrote:

> This sounds really interesting. Is there a way to dump certain fields from
> a Lucene index to text files?
>
> If so, I could use Lucene to do the parsing, and then seqdirectory and
> seq2sparse to generate Mahout vectors out of these files.
>

For this to work, the fields need to be indexed with either Store.YES or
TermVector.YES.  If you have the latter, you don't need text files at all:
you can use the usual lucene.vector script to produce Mahout vectors.

We don't currently have a script to dump stored fields, but it should be
another 5 lines of code to write one (ok, 25 lines, including boilerplate,
damn java).  File a ticket; there are lots of people around here who could
write that code.
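
For illustration only, here is a rough sketch of the writing half of such a
dumper: one UTF-8 text file per document, in a directory that seqdirectory
could then be pointed at.  Everything named here (StoredFieldDumper, dump,
the map of document id to field text) is hypothetical and not part of Mahout
or Lucene.  Reading the stored field out of the index is deliberately not
shown, so the map just stands in for whatever the IndexReader would return.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout or Lucene.
final class StoredFieldDumper {

    // Writes one UTF-8 text file per document into outDir and returns the
    // number of files written. Keys are document ids, values are the stored
    // field text; a real dumper would read these from the index instead.
    static int dump(Map<String, String> docs, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int written = 0;
        for (Map.Entry<String, String> e : docs.entrySet()) {
            String text = e.getValue();
            if (text == null || text.isEmpty()) {
                continue; // field missing or empty for this document
            }
            // Keep file names safe: anything outside [A-Za-z0-9._-] becomes '_'.
            // Note that two different ids can map to the same name this way.
            String name = e.getKey().replaceAll("[^A-Za-z0-9._-]", "_") + ".txt";
            Files.write(outDir.resolve(name), text.getBytes(StandardCharsets.UTF_8));
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in practice these would come from the Lucene index.
        Map<String, String> docs = new LinkedHashMap<String, String>();
        docs.put("doc/1", "first stored body");
        docs.put("doc2", "second stored body");
        Path out = Files.createTempDirectory("lucene-dump");
        System.out.println(dump(docs, out) + " files written to " + out);
    }
}
```

The output directory would then be the input to seqdirectory, followed by
seq2sparse, as Julian describes above.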

  -jake

> Thanks,
>
> Julian
>
> 2011/5/3 Jake Mannix <[email protected]>
>
> > On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <[email protected]> wrote:
> >
> > >
> > > > Although technically, we could add the capability to take a Store.YES
> > > > field and re-tokenize and build vectors from this as well.
> > >
> > > True, or we could just dump stored fields out to text and use the
> > > existing text converter
> >
> >
> > That would probably be the right way to do that, actually.
> >
> >  -jake
> >
>
