Re: creating vectors from lucene index which does NOT store vectors

Robert Stewart Sat, 05 Nov 2011 08:21:00 -0700

Can you point me to the code in trunk which implements the "lucene.vector" command?


Bob


On Nov 4, 2011, at 2:05 PM, Grant Ingersoll wrote:

> Should be doable, but likely slow. Relative to the other things you are 
> likely doing, probably not a big deal.
> 
> In fact, I've thought about adding such a piece of code, so if you are 
> looking to contrib, it would be welcome.
> 
> On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote:
> 
>> I have a relatively large existing Lucene index which does not store 
>> vectors.  Size is approx. 100 million documents (about 1.5 TB in size).
>> 
>> I am thinking of using some lower level Lucene API code to extract vectors, 
>> by enumerating terms and term docs collections.
>> 
>> Something like the following pseudocode logic:
>> 
>> termid=0
>> for each term in terms:
>>      termid++
>>      writeToFile("terms",term,termid)
>>      for each docid in termdocs(term):
>>              tf=getTF(docid,term)
>>              writeToFile("docs",docid, termid, tf)
>> 
>> 
>> So I will end up with two files:
>> 
>> terms - contains mapping from term text to term ID
>> docs - contains mapping from docid to termid and TF
>> 
>> I can then build vectors by sorting docs file by docid and then gathering 
>> terms for each doc into another vector file that I can use with mahout.
>> 
>> I should probably not use the internal docid, but instead some unique 
>> identifier field.
>> 
>> Also, I assume at some point this could be a map-reduce job in hadoop.
>> 
>> I'm just asking for a sanity check, or if there are any better ideas out there.
>> 
>> Thanks
>> Bob
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> 
> 
> 
>
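
For illustration, here is a minimal Python sketch of the two steps in the quoted pseudocode: enumerate terms and their postings into (docid, termid, tf) triples, then sort by docid and gather each document's terms into a sparse vector. It does not use any Lucene or Mahout API. The in-memory `postings` dict is a hypothetical stand-in for enumerating terms and term docs from the index, the function names `invert_postings` and `build_vectors` are made up for this example, and the docids are assumed to come from a unique identifier field rather than Lucene's internal docid.

```python
from itertools import groupby
from operator import itemgetter


def invert_postings(postings):
    """Walk term -> {docid: tf} postings and emit the two 'files'.

    Returns (terms, triples), where terms maps term text to term ID
    and triples is a list of (docid, termid, tf) rows.
    """
    terms = {}
    triples = []
    # Term IDs start at 1, as in the pseudocode (termid++ before use).
    for termid, term in enumerate(sorted(postings), start=1):
        terms[term] = termid
        for docid, tf in postings[term].items():
            triples.append((docid, termid, tf))
    return terms, triples


def build_vectors(triples):
    """Sort the triples by docid and gather each doc's terms.

    Returns docid -> {termid: tf}, i.e. one sparse vector per document.
    """
    vectors = {}
    for docid, rows in groupby(sorted(triples), key=itemgetter(0)):
        vectors[docid] = {termid: tf for _, termid, tf in rows}
    return vectors


# Hypothetical toy postings; a real run would read these from the index.
postings = {
    "lucene": {"doc-a": 2, "doc-b": 1},
    "mahout": {"doc-b": 3},
    "vector": {"doc-a": 1, "doc-c": 4},
}

terms, triples = invert_postings(postings)
vectors = build_vectors(triples)
print(terms)
print(vectors)
```

The in-memory `sorted()` call is only for illustration. At the scale mentioned in the thread (about 100 million documents), the sort by docid would need to be an external on-disk sort or the shuffle phase of a map-reduce job, as the original message suggests.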
