Here's the path to follow for finding these kinds of things:
MahoutDriver reads driver.classes.props to figure out which commands are
available on the command line.
driver.classes.props maps lucene.vector to
org.apache.mahout.utils.vectors.lucene.Driver
In that class, the dumpVectors method does most of the heavy lifting.
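
To make that lookup concrete, here is a rough sketch (in Python rather than Mahout's Java) of resolving a short command name to its class. It assumes each line of the props file has the form "class = shortName : description"; that layout, the sample text, and the parse_driver_props helper are illustrative assumptions, not Mahout's actual parsing code.

```python
def parse_driver_props(text):
    """Map short command names to fully qualified class names.

    Assumes each entry looks like:  <class> = <shortName> : <description>
    (an assumed layout, used here only for illustration).
    """
    commands = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything without a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        class_name, _, rest = line.partition("=")
        # The short name is whatever precedes the optional ": description".
        short_name = rest.split(":", 1)[0].strip()
        commands[short_name] = class_name.strip()
    return commands


# Hypothetical excerpt; the description text is made up for the example.
SAMPLE_PROPS = """
# short names for driver classes
org.apache.mahout.utils.vectors.lucene.Driver = lucene.vector : Generate vectors from a Lucene index
"""

commands = parse_driver_props(SAMPLE_PROPS)
print(commands["lucene.vector"])
```

The same trick works for any other command: find its short name in the props file, then open the class it points to.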
On Nov 5, 2011, at 11:20 AM, Robert Stewart wrote:
> Can you point me to the code in trunk which implements "lucene.vector"
> command?
>
> Bob
>
>
> On Nov 4, 2011, at 2:05 PM, Grant Ingersoll wrote:
>
>> Should be doable, but likely slow. Relative to the other things you are
>> likely doing, probably not a big deal.
>>
>> In fact, I've thought about adding such a piece of code, so if you are
>> looking to contrib, it would be welcome.
>>
>> On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote:
>>
>>> I have a relatively large existing Lucene index which does not store
>>> vectors. Size is approx. 100 million documents (about 1.5 TB in size).
>>>
>>> I am thinking of using some lower level Lucene API code to extract vectors,
>>> by enumerating terms and term docs collections.
>>>
>>> Something like the following pseudocode logic:
>>>
>>> termid = 0
>>> for each term in terms:
>>>     termid++
>>>     writeToFile("terms", term, termid)
>>>     for each doc in termdocs(term):
>>>         tf = getTF(doc, term)
>>>         writeToFile("docs", docid, termid, tf)
>>>
>>>
>>> So I will end up with two files:
>>>
>>> terms - contains mapping from term text to term ID
>>> docs - contains mapping from docid to termid and TF
>>>
>>> I can then build vectors by sorting docs file by docid and then gathering
>>> terms for each doc into another vector file that I can use with mahout.
>>>
>>> Probably I should not use internal docid, but instead some unique
>>> identifier field.
>>>
>>> Also, I assume at some point this could be a map-reduce job in hadoop.
>>>
>>> I'm just asking for sanity check, or if there are any better ideas out
>>> there.
>>>
>>> Thanks
>>> Bob
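
The two-file scheme described above can be sketched in a few lines of Python. This is only a small in-memory illustration under stated assumptions: the inverted index is stood in for by a plain dict of term -> {doc key -> term frequency}, the doc keys are a unique identifier field rather than internal Lucene docids, and extract_postings and build_vectors are hypothetical names. It does not use the Lucene API.

```python
from itertools import groupby
from operator import itemgetter


def extract_postings(inverted_index):
    """Enumerate terms, assign term IDs, and emit one row per (doc, term) pair."""
    terms = []  # rows of the "terms" file: (term_text, term_id)
    docs = []   # rows of the "docs" file: (doc_key, term_id, tf)
    for term_id, term in enumerate(sorted(inverted_index), start=1):
        terms.append((term, term_id))
        for doc_key, tf in inverted_index[term].items():
            docs.append((doc_key, term_id, tf))
    return terms, docs


def build_vectors(docs):
    """Sort rows by doc key, then gather each doc's terms into a sparse vector."""
    vectors = {}
    for doc_key, rows in groupby(sorted(docs), key=itemgetter(0)):
        vectors[doc_key] = {term_id: tf for _, term_id, tf in rows}
    return vectors


# Toy stand-in for the index: term -> {doc key -> term frequency}.
index = {
    "lucene": {"doc-a": 2, "doc-b": 1},
    "mahout": {"doc-b": 3},
    "vector": {"doc-a": 1, "doc-b": 1},
}

terms, docs = extract_postings(index)
vectors = build_vectors(docs)
print(vectors)
```

At 100 million documents the docs rows would not fit in memory, so the sort-and-group step would need an external sort or a map-reduce job keyed on the doc identifier; the grouping logic stays the same.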
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>
--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com