I have a relatively large existing Lucene index that does not store term vectors.
It contains roughly 100 million documents and is about 1.5 TB on disk.
I am thinking of using some lower-level Lucene API code to extract the vectors by
enumerating the terms and, for each term, its postings (term docs).
Something like the following pseudocode logic:
termid = 0
for each term in terms:
    termid++
    writeToFile("terms", term, termid)
    for each docid in termdocs(term):
        tf = getTF(docid, term)
        writeToFile("docs", docid, termid, tf)
So I will end up with two files:
terms - contains the mapping from term text to term ID
docs - contains (docid, termid, TF) triples
I can then build the vectors by sorting the docs file by docid and gathering the
terms for each doc into another vector file that I can feed to Mahout.
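Once the docs file is sorted by docid (say, with an external sort into a file I
am calling "docs.sorted" here), the gathering pass is just a streaming merge;
the tab-separated layout matches what the sketch above writes:

import java.io.*;

public class GatherVectors {
    // Reads sorted lines of the form: docid \t termid \t tf
    // and writes one sparse vector per document: docid \t termid:tf termid:tf ...
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader("docs.sorted"));
             PrintWriter out = new PrintWriter("vectors")) {
            String line;
            String currentDoc = null;
            StringBuilder vec = new StringBuilder();
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");
                if (!f[0].equals(currentDoc)) {
                    // New document: flush the vector accumulated so far.
                    if (currentDoc != null) out.println(currentDoc + "\t" + vec);
                    currentDoc = f[0];
                    vec.setLength(0);
                }
                if (vec.length() > 0) vec.append(' ');
                vec.append(f[1]).append(':').append(f[2]);
            }
            if (currentDoc != null) out.println(currentDoc + "\t" + vec);
        }
    }
}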
I should probably not use the internal docid, but rather some unique identifier
field, since internal docids are not stable across index changes such as merges.
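If I go that route, the inner postings loop of the sketch above could resolve
the docid against a stored key field instead; "id" is just an assumed field
name here:

// Inside the postings loop: look up a stored unique-key field instead of
// writing the transient internal docid. Needs org.apache.lucene.document.Document.
// Note: one stored-field fetch per posting is expensive; dumping (docid, id)
// pairs once per document and joining later would be much cheaper.
Document d = ctx.reader().document(doc);
docsOut.println(d.get("id") + "\t" + termId + "\t" + pe.freq());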
Also, I assume that at some point this could become a MapReduce job in Hadoop.
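The extraction itself needs the index, but the sort-and-gather step maps
naturally onto MapReduce with the docid as the shuffle key. A rough skeleton,
assuming the docs file layout above (class names are mine):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (docid, "termid:tf") for each line of the docs file.
public class DocsMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split("\t");
        ctx.write(new Text(f[0]), new Text(f[1] + ":" + f[2]));
    }
}

// Reducer: concatenate all termid:tf pairs for a document into one vector line.
class VectorReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text docid, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        StringBuilder vec = new StringBuilder();
        for (Text v : values) {
            if (vec.length() > 0) vec.append(' ');
            vec.append(v);
        }
        ctx.write(docid, new Text(vec.toString()));
    }
}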
I'm just asking for a sanity check, or whether there are any better ideas out there.
Thanks
Bob