Should be doable, but likely slow. Relative to the other processing you'll be
doing, though, probably not a big deal.
In fact, I've thought about adding such a piece of code, so if you are looking
to contribute it, it would be welcome.
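Roughly, the extraction you describe below would look something like this with
the 3.x APIs. This is an untested sketch, not a drop-in tool: the class name,
output file names, and tab-separated layout are just placeholders.

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

// Sketch: dump (term -> termid) and (docid, termid, tf) rows to two text files.
public class VectorExtractor {
  public static void main(String[] args) throws IOException {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    PrintWriter terms = new PrintWriter("terms");
    PrintWriter docs = new PrintWriter("docs");
    try {
      TermEnum termEnum = reader.terms();
      TermDocs termDocs = reader.termDocs();
      int termId = 0;
      while (termEnum.next()) {
        Term term = termEnum.term();
        termId++;
        terms.println(term.field() + ":" + term.text() + "\t" + termId);
        termDocs.seek(termEnum);        // walk this term's postings
        while (termDocs.next()) {       // deleted docs are skipped for you
          docs.println(termDocs.doc() + "\t" + termId + "\t" + termDocs.freq());
        }
      }
    } finally {
      docs.close();
      terms.close();
      reader.close();
    }
  }
}

As you say below, you'd want to swap the internal docid for a stored unique key.
Fetching it per posting (reader.document(...).get("id")) would be painfully slow
at 100M docs; loading the key field up front (e.g. via FieldCache) should be
much faster.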
On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote:
> I have a relatively large existing Lucene index which does not store vectors.
> Size is approx. 100 million documents (about 1.5 TB in size).
>
> I am thinking of using some lower-level Lucene API code to extract the vectors,
> by enumerating the terms and, for each term, its postings (term docs).
>
> Something like the following pseudocode logic:
>
> termid = 0
> for each term in terms:
>     termid++
>     writeToFile("terms", term, termid)
>     for each docid in termdocs(term):
>         tf = getTF(docid, term)
>         writeToFile("docs", docid, termid, tf)
>
>
> So I will end up with two files:
>
> terms - contains the mapping from term text to term ID
> docs - contains one (docid, termid, TF) row per posting
>
> I can then build vectors by sorting the docs file by docid and then gathering
> the terms for each doc into another vector file that I can use with Mahout.
>
> Probably I should not use the internal docid, but instead some unique identifier
> field, since internal docids are not stable across segment merges.
>
> Also, I assume at some point this could be a map-reduce job in Hadoop.
>
> I'm just asking for a sanity check, or if there are any better ideas out there.
>
> Thanks
> Bob
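
For your last step, gathering the sorted docs file into per-doc vectors,
something like this would get you a SequenceFile of VectorWritable keyed by
doc id, which is what the Mahout jobs generally consume. Again an untested
sketch; VectorFileBuilder is a placeholder and the tab-separated layout
follows the extraction sketch above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sketch: turn a docid-sorted "docs" file into <Text docid, VectorWritable> pairs.
public class VectorFileBuilder {
  public static void build(String docsFile, Path out, int numTerms) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    BufferedReader in = new BufferedReader(new FileReader(docsFile));
    try {
      String line;
      String currentDoc = null;
      Vector vector = null;
      while ((line = in.readLine()) != null) {
        String[] cols = line.split("\t");        // docid, termid, tf
        if (!cols[0].equals(currentDoc)) {
          if (vector != null) {                  // flush the previous doc
            writer.append(new Text(currentDoc), new VectorWritable(vector));
          }
          currentDoc = cols[0];
          // +1 because termids start at 1 in the extraction sketch
          vector = new RandomAccessSparseVector(numTerms + 1);
        }
        vector.set(Integer.parseInt(cols[1]), Double.parseDouble(cols[2]));
      }
      if (vector != null) {                      // flush the final doc
        writer.append(new Text(currentDoc), new VectorWritable(vector));
      }
    } finally {
      in.close();
      writer.close();
    }
  }
}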
--------------------------
Grant Ingersoll
http://www.lucidimagination.com