I have a relatively large existing Lucene index that does not store term vectors.
It contains roughly 100 million documents and is about 1.5 TB on disk.
I am thinking of using some lower-level Lucene API code to extract the vectors by
enumerating the terms and, for each term, its postings (term docs).
Something like the following pseudocode logic:
termid = 0
for each term in terms:
    termid++
    writeToFile("terms", term, termid)
    for each docid in termdocs(term):
        tf = getTF(docid, term)
        writeToFile("docs", docid, termid, tf)
So I will end up with two files:
terms - contains the mapping from term text to term ID
docs - contains (docid, termid, TF) triples
I can then build the vectors by sorting the docs file by docid and gathering the
terms for each doc into another vector file that I can feed to Mahout.
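Once the docs file is sorted by docid (say, with an external sort into a file I
am calling "docs.sorted" here), the gathering pass is just a streaming merge;
the tab-separated layout matches what the sketch above writes:

import java.io.*;

public class GatherVectors {
    // Reads sorted lines of the form: docid \t termid \t tf
    // and writes one sparse vector per document: docid \t termid:tf termid:tf ...
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader("docs.sorted"));
             PrintWriter out = new PrintWriter("vectors")) {
            String line;
            String currentDoc = null;
            StringBuilder vec = new StringBuilder();
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");
                if (!f[0].equals(currentDoc)) {
                    // New document: flush the vector accumulated so far.
                    if (currentDoc != null) out.println(currentDoc + "\t" + vec);
                    currentDoc = f[0];
                    vec.setLength(0);
                }
                if (vec.length() > 0) vec.append(' ');
                vec.append(f[1]).append(':').append(f[2]);
            }
            if (currentDoc != null) out.println(currentDoc + "\t" + vec);
        }
    }
}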
I should probably not use the internal docid, but rather some unique identifier
field, since internal docids are not stable across index changes such as merges.
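If I go that route, the inner postings loop of the sketch above could resolve
the docid against a stored key field instead; "id" is just an assumed field
name here:

// Inside the postings loop: look up a stored unique-key field instead of
// writing the transient internal docid. Needs org.apache.lucene.document.Document.
// Note: one stored-field fetch per posting is expensive; dumping (docid, id)
// pairs once per document and joining later would be much cheaper.
Document d = ctx.reader().document(doc);
docsOut.println(d.get("id") + "\t" + termId + "\t" + pe.freq());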
Also, I assume that at some point this could become a MapReduce job in Hadoop.
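The extraction itself needs the index, but the sort-and-gather step maps
naturally onto MapReduce with the docid as the shuffle key. A rough skeleton,
assuming the docs file layout above (class names are mine):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (docid, "termid:tf") for each line of the docs file.
public class DocsMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split("\t");
        ctx.write(new Text(f[0]), new Text(f[1] + ":" + f[2]));
    }
}

// Reducer: concatenate all termid:tf pairs for a document into one vector line.
class VectorReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text docid, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        StringBuilder vec = new StringBuilder();
        for (Text v : values) {
            if (vec.length() > 0) vec.append(' ');
            vec.append(v);
        }
        ctx.write(docid, new Text(vec.toString()));
    }
}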
I'm just asking for a sanity check, or whether there are any better ideas out there.
Thanks
Bob