Should be doable, but likely slow. Relative to the other processing you'll be 
doing, though, probably not a big deal.

In fact, I've thought about adding just such a piece of code, so if you are 
looking to contribute, it would be welcome.

On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote:

> I have a relatively large existing Lucene index which does not store term 
> vectors.  It holds approximately 100 million documents (about 1.5 TB on disk).
> 
> I am thinking of using some lower-level Lucene API code to extract the 
> vectors by walking the term and term-docs enumerations.
> 
> Something like the following pseudocode logic (a concrete Java sketch is 
> below):
> 
> termid = 0
> for each term in terms:
>     termid++
>     writeToFile("terms", term, termid)
>     for each doc in termdocs(term):
>         tf = getTF(doc, term)
>         writeToFile("docs", doc, termid, tf)
> 
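> In concrete terms, I imagine something like the sketch below against the 
> Lucene 3.x TermEnum/TermDocs API (untested; the class name and the 
> tab-separated output layout are just placeholders):
> 
>     import java.io.IOException;
>     import java.io.PrintWriter;
> 
>     import org.apache.lucene.index.IndexReader;
>     import org.apache.lucene.index.Term;
>     import org.apache.lucene.index.TermDocs;
>     import org.apache.lucene.index.TermEnum;
> 
>     public class IndexDumper {
>         // Untested sketch: dump (term -> termid) and (docid, termid, tf)
>         // records from an existing index, as in the pseudocode above.
>         public static void dump(IndexReader reader, PrintWriter termsOut,
>                                 PrintWriter docsOut) throws IOException {
>             TermEnum terms = reader.terms();       // every term in every field
>             TermDocs termDocs = reader.termDocs(); // one instance, reused per term
>             int termId = 0;
>             while (terms.next()) {
>                 Term term = terms.term();
>                 termId++;
>                 termsOut.println(term.field() + ":" + term.text() + "\t" + termId);
>                 termDocs.seek(terms);              // position on the current term
>                 while (termDocs.next()) {          // next() skips deleted docs
>                     docsOut.println(termDocs.doc() + "\t" + termId + "\t"
>                             + termDocs.freq());
>                 }
>             }
>             termDocs.close();
>             terms.close();
>         }
>     }
> 
> Reusing one TermDocs via seek(TermEnum) avoids allocating a fresh enumeration 
> per term, which should matter with this many terms.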
> 
> So I will end up with two files:
> 
> terms - maps term text to term ID
> docs - maps docid to termid and TF
> 
> I can then build vectors by sorting the docs file by docid and then gathering 
> the terms for each doc into another vector file that I can use with Mahout (a 
> rough gather sketch is below).
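> 
> Assuming the docs file is tab-separated and already sorted by docid, the 
> gather step could write Mahout's SequenceFile-of-VectorWritable format 
> roughly like this (untested; the class name and the numTerms parameter, the 
> final termid, are placeholders):
> 
>     import java.io.BufferedReader;
>     import java.io.FileReader;
> 
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Text;
>     import org.apache.mahout.math.RandomAccessSparseVector;
>     import org.apache.mahout.math.VectorWritable;
> 
>     public class VectorGatherer {
>         // Untested sketch: turn sorted (docid, termid, tf) lines into one
>         // sparse vector per document, keyed by docid.
>         public static void gather(String docsFile, Path out, int numTerms)
>                 throws Exception {
>             Configuration conf = new Configuration();
>             SequenceFile.Writer writer = SequenceFile.createWriter(
>                     FileSystem.get(conf), conf, out, Text.class, VectorWritable.class);
>             BufferedReader in = new BufferedReader(new FileReader(docsFile));
>             String line;
>             String currentDoc = null;
>             RandomAccessSparseVector vec = null;
>             while ((line = in.readLine()) != null) {
>                 String[] cols = line.split("\t");  // docid, termid, tf
>                 if (!cols[0].equals(currentDoc)) {
>                     if (vec != null) {             // emit the finished document
>                         writer.append(new Text(currentDoc), new VectorWritable(vec));
>                     }
>                     currentDoc = cols[0];
>                     vec = new RandomAccessSparseVector(numTerms + 1); // termids start at 1
>                 }
>                 vec.setQuick(Integer.parseInt(cols[1]), Double.parseDouble(cols[2]));
>             }
>             if (vec != null) {                     // flush the last document
>                 writer.append(new Text(currentDoc), new VectorWritable(vec));
>             }
>             in.close();
>             writer.close();
>         }
>     }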
> 
> Probably I should not use the internal docid, but instead some unique 
> identifier field.
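> 
> If I use a stored unique key field instead, say one named "id" (a placeholder 
> for whatever my index actually stores), the lookup per internal docid would 
> roughly be:
> 
>     String key = reader.document(docId).get("id");  // "id" is a placeholder field name
> 
> presumably done once per docid up front rather than once per posting, since 
> document() loads all stored fields.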
> 
> Also, I assume that at some point this could become a map-reduce job in Hadoop.
> 
> I'm just asking for a sanity check, or for any better ideas out there.
> 
> Thanks
> Bob

--------------------------
Grant Ingersoll
http://www.lucidimagination.com
