Can you point me to the code in trunk that implements the "lucene.vector" command?
Bob
On Nov 4, 2011, at 2:05 PM, Grant Ingersoll wrote:
> Should be doable, but likely slow. Relative to the other things you are
> likely doing, probably not a big deal.
>
> In fact, I've thought about adding such a piece of code, so if you are
> looking to contribute, it would be welcome.
>
> On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote:
>
>> I have a relatively large existing Lucene index which does not store
>> term vectors. It holds approx. 100 million documents (about 1.5 TB).
>>
>> I am thinking of using some lower level Lucene API code to extract vectors,
>> by enumerating terms and term docs collections.
>>
>> Something like the following pseudocode logic:
>>
>> termid=0
>> for each term in terms:
>>     termid++
>>     writeToFile("terms", term, termid)
>>     for each docid in termdocs(term):
>>         tf=getTF(docid, term)
>>         writeToFile("docs", docid, termid, tf)
>>
>>
>> So I will end up with two files:
>>
>> terms - contains mapping from term text to term ID
>> docs - contains mapping from docid to termid and TF
>>
>> I can then build vectors by sorting the docs file by docid and then
>> gathering the terms for each doc into another vector file that I can use
>> with Mahout.
>>
>> Probably I should not use the internal docid, but instead some unique
>> identifier field.
>>
>> Also, I assume at some point this could be a MapReduce job in Hadoop.
>>
>> I'm just asking for a sanity check, or if there are any better ideas out there.
>>
>> Thanks
>> Bob
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
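
For reference, the two-pass idea in Bob's pseudocode (enumerate terms and
postings, then sort by document and gather) can be sketched roughly as below.
This is only an illustration, not Mahout or Lucene code: TOY_INDEX is a
hypothetical in-memory stand-in for the real index, extract() and
build_vectors() are made-up names, and documents are keyed by a unique
identifier rather than the internal docid, as suggested in the thread.

```python
from collections import defaultdict

# Hypothetical stand-in for a Lucene index:
# term text -> {unique document key -> term frequency}.
# A real run would enumerate the index's terms and postings instead.
TOY_INDEX = {
    "apache": {"doc-b": 1, "doc-a": 2},
    "lucene": {"doc-a": 1},
    "mahout": {"doc-b": 3, "doc-c": 1},
}


def extract(index):
    """Walk terms in sorted order, assigning term IDs and emitting postings."""
    terms = []  # rows of the "terms" file: (term, termid)
    docs = []   # rows of the "docs" file: (doc_key, termid, tf)
    for termid, term in enumerate(sorted(index), start=1):
        terms.append((term, termid))
        for doc_key, tf in index[term].items():
            docs.append((doc_key, termid, tf))
    return terms, docs


def build_vectors(docs):
    """Sort postings by document key, then gather (termid, tf) per document."""
    vectors = defaultdict(dict)
    for doc_key, termid, tf in sorted(docs):
        vectors[doc_key][termid] = tf
    return dict(vectors)


terms, docs = extract(TOY_INDEX)
vectors = build_vectors(docs)
```

The sort here happens in memory, which would not hold up at 100 million
documents; at that scale the "docs" rows would presumably go through an
external sort or a MapReduce shuffle keyed on the document identifier, which
is the same grouping step.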