I have a relatively large existing Lucene index that does not store term vectors.
It contains approximately 100 million documents and is about 1.5 TB on disk.

I am thinking of using some lower-level Lucene API code to extract vectors by 
enumerating the terms and term-docs collections.

Something like the following pseudocode:

termid = 0
for each term in terms:
        termid++
        writeToFile("terms", term, termid)
        for each doc in termdocs(term):
                tf = getTF(doc, term)
                writeToFile("docs", doc.docid, termid, tf)
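A minimal self-contained sketch of that loop in Python, using a plain dict as a hypothetical stand-in for the postings that Lucene's TermEnum/TermDocs would actually supply (file names and the tab-separated layout are assumptions):

```python
import io

# Hypothetical in-memory postings: term -> list of (docid, tf) pairs.
# In the real job these would come from enumerating Lucene's TermEnum/TermDocs.
postings = {
    "apache": [(1, 3), (4, 1)],
    "lucene": [(1, 2), (2, 5)],
}

terms_out = io.StringIO()  # stand-in for the "terms" file
docs_out = io.StringIO()   # stand-in for the "docs" file

termid = 0
for term in sorted(postings):          # enumerate terms in term order
    termid += 1
    terms_out.write("%s\t%d\n" % (term, termid))
    for docid, tf in postings[term]:   # enumerate this term's postings
        docs_out.write("%d\t%d\t%d\n" % (docid, termid, tf))
```

Note that the docs file comes out in term-major order, which is why the later sort by docid is needed.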


So I will end up with two files:

terms - contains mapping from term text to term ID
docs - contains mapping from docid to termid and TF

I can then build the vectors by sorting the docs file by docid and gathering the 
terms for each doc into another vector file that I can use with Mahout.
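The sort-and-gather step can be sketched the same way: sort the (docid, termid, tf) records by docid, then each group of records sharing a docid becomes one sparse vector (the record layout here is an assumption carried over from the sketch above, not a Mahout format):

```python
from itertools import groupby

# Hypothetical "docs" records as (docid, termid, tf) tuples,
# in the term-major order the extraction loop produced them.
docs = [(1, 1, 3), (4, 1, 1), (1, 2, 2), (2, 2, 5)]

# Sort by docid, then gather each doc's (termid, tf) pairs.
docs.sort(key=lambda rec: rec[0])
vectors = {}
for docid, group in groupby(docs, key=lambda rec: rec[0]):
    vectors[docid] = [(termid, tf) for _, termid, tf in group]

# vectors now maps docid -> sparse vector of (termid, tf) pairs
```

At 100 million documents the sort would of course be external (e.g. Unix sort on the docs file), but the gather logic is the same.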

Probably I should not use the internal Lucene docid, since it is not stable 
(segment merges and deletes can renumber documents), but instead some unique 
identifier field.

Also, I assume that at some point this could become a map-reduce job in Hadoop.
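The gather step maps naturally onto map-reduce; a plain-Python simulation of the shape such a job would take (no Hadoop here, just the map-emit / shuffle / reduce structure, with the record layout assumed from the sketches above):

```python
from collections import defaultdict

# One line of the "docs" file per input record: "docid\ttermid\ttf"
lines = ["1\t1\t3", "4\t1\t1", "1\t2\t2", "2\t2\t5"]

# Map: emit (docid, (termid, tf)) keyed by docid.
emitted = []
for line in lines:
    docid, termid, tf = (int(x) for x in line.split("\t"))
    emitted.append((docid, (termid, tf)))

# Shuffle: group values by key (Hadoop does this between map and reduce).
shuffled = defaultdict(list)
for key, value in emitted:
    shuffled[key].append(value)

# Reduce: each key's values form one sparse vector.
vectors = {docid: pairs for docid, pairs in shuffled.items()}
```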

I'm just asking for a sanity check, or whether there are any better ideas out there.

Thanks
Bob
