Thanks Ted,

One thing I don't get.  Why would "earlier term ids be much, much more common 
than later ones"?  AFAIK, terms are sorted lexicographically, so earlier ones 
are just AAA... instead of ZZZ..., and I don't understand how that relates to 
frequency.  Maybe I misunderstand what you mean by term ids?



On Nov 4, 2011, at 2:33 PM, Ted Dunning wrote:

> It looks like a fine solution.  It should be map-reducible as well if you
> can build good splits on the term space.  That isn't quite as simple as it
> looks, since you probably want each mapper to read a consecutive range of
> term ids, and earlier term ids will be much, much more common than later
> ones.  It should be pretty easy to use term frequency to balance this out.
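
(Regardless of the skew question above, I think I follow the balancing idea.
A rough, untested sketch of what I'd try, picking the split boundary terms by
cumulative docFreq rather than by term count; the class and method names are
just made up, and it assumes the Lucene 3.x TermEnum API:)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TermSplits {
    // Pick (numSplits - 1) boundary terms so each consecutive term range covers
    // roughly the same total docFreq instead of the same number of terms.
    public static List<Term> boundaries(IndexReader reader, int numSplits) throws IOException {
        // First pass: total docFreq over all terms.
        long total = 0;
        TermEnum te = reader.terms();
        while (te.next()) {
            total += te.docFreq();
        }
        te.close();

        // Second pass: emit a boundary each time the running docFreq passes total/numSplits.
        List<Term> bounds = new ArrayList<Term>();
        long perSplit = total / numSplits;
        long acc = 0;
        te = reader.terms();
        while (te.next()) {
            acc += te.docFreq();
            if (acc >= perSplit && bounds.size() < numSplits - 1) {
                bounds.add(te.term());
                acc = 0;
            }
        }
        te.close();
        return bounds;
    }
}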
> 
> Another thought is that you should accumulate whatever information you get
> about documents and just burp out the contents of your accumulators
> occasionally.  In the map-reduce framework, this would be a combiner, but
> you can obviously do this in your single-machine version as well.  The
> virtue of this is that you will decrease the output size, which should be a
> key determinant of run-time.
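
(That makes sense.  For the single-machine version I'm picturing something
like this, all names made up: buffer per-doc postings in memory and dump the
whole buffer whenever it gets big.)

import java.io.IOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

public class BurpingAccumulator {
    private final Map<Integer, StringBuilder> buffer = new HashMap<Integer, StringBuilder>();
    private final Writer docsOut;
    private final int maxDocsInMemory;

    public BurpingAccumulator(Writer docsOut, int maxDocsInMemory) {
        this.docsOut = docsOut;
        this.maxDocsInMemory = maxDocsInMemory;
    }

    // Accumulate one (docid, termid, tf) posting; burp when the buffer is full.
    public void add(int docid, int termid, int tf) throws IOException {
        StringBuilder sb = buffer.get(docid);
        if (sb == null) {
            sb = new StringBuilder();
            buffer.put(docid, sb);
        }
        sb.append(' ').append(termid).append(':').append(tf);
        if (buffer.size() >= maxDocsInMemory) {
            burp();
        }
    }

    // Write out everything accumulated so far and start over.
    public void burp() throws IOException {
        for (Map.Entry<Integer, StringBuilder> e : buffer.entrySet()) {
            docsOut.write(e.getKey() + e.getValue().toString() + "\n");
        }
        buffer.clear();
    }
}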
> 
> A final thought is that using the Mahout collections to get a compact
> int -> int-list map might help you keep more temporary accumulations in
> memory before burping.  That will decrease the output size, which makes
> everything better.
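
(I hadn't looked at the Mahout collections before.  If I understand the
suggestion, the buffer above could hold its postings in something like this
instead of boxed Java collections, so the keys and list contents stay
primitive ints; rough sketch only:)

import org.apache.mahout.math.list.IntArrayList;
import org.apache.mahout.math.map.OpenIntObjectHashMap;

public class CompactPostingsBuffer {
    // docid -> list of termids, with no Integer boxing of keys or list elements.
    // (A parallel IntArrayList per doc could hold the TFs.)
    private final OpenIntObjectHashMap<IntArrayList> postings =
            new OpenIntObjectHashMap<IntArrayList>();

    public void add(int docid, int termid) {
        IntArrayList termids = postings.get(docid);
        if (termids == null) {
            termids = new IntArrayList();
            postings.put(docid, termids);
        }
        termids.add(termid);
    }

    public int docsInMemory() {
        return postings.size();
    }
}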
> 
> 
> On Fri, Nov 4, 2011 at 11:06 AM, Robert Stewart <[email protected]> wrote:
> 
>> Ok that was what I thought.  I'll give it a shot.
>> 
>> 
>> On Nov 4, 2011, at 2:05 PM, Grant Ingersoll wrote:
>> 
>>> Should be doable, but likely slow. Relative to the other things you are
>>> likely doing, probably not a big deal.
>>> 
>>> In fact, I've thought about adding such a piece of code, so if you are
>>> looking to contribute, it would be welcome.
>>> 
>>> On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote:
>>> 
>>>> I have a relatively large existing Lucene index which does not store
>>>> vectors.  Size is approx. 100 million documents (about 1.5 TB in size).
>>>> 
>>>> I am thinking of using some lower-level Lucene API code to extract
>>>> vectors by enumerating the terms and term-docs collections.
>>>> 
>>>> Something like the following pseudocode logic:
>>>> 
>>>> termid = 0
>>>> for each term in terms:
>>>>     termid++
>>>>     writeToFile("terms", term, termid)
>>>>     for each docid in termdocs(term):
>>>>         tf = getTF(docid, term)
>>>>         writeToFile("docs", docid, termid, tf)
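
(For reference, against the Lucene 3.x TermEnum/TermDocs API that pseudocode
comes out roughly like this; untested, and the file names and output format
are just placeholders:)

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class DumpPostings {
    public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        PrintWriter termsOut = new PrintWriter("terms");
        PrintWriter docsOut = new PrintWriter("docs");

        int termid = 0;
        TermEnum terms = reader.terms();
        TermDocs termDocs = reader.termDocs();
        while (terms.next()) {
            Term term = terms.term();
            termid++;
            termsOut.println(term.field() + ":" + term.text() + "\t" + termid);
            termDocs.seek(term);
            while (termDocs.next()) {
                // termDocs.doc() is the internal docid, termDocs.freq() the TF
                docsOut.println(termDocs.doc() + "\t" + termid + "\t" + termDocs.freq());
            }
        }
        termDocs.close();
        terms.close();
        termsOut.close();
        docsOut.close();
        reader.close();
    }
}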
>>>> 
>>>> 
>>>> So I will end up with two files:
>>>> 
>>>> terms - contains mapping from term text to term ID
>>>> docs - contains mapping from docid to termid and TF
>>>> 
>>>> I can then build vectors by sorting the docs file by docid and then
>>>> gathering the terms for each doc into another vector file that I can
>>>> use with Mahout.
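
(The last step I'm picturing, once the docs file is sorted by docid, is to
roll each doc's postings into a sparse vector and write a SequenceFile of
VectorWritable that the Mahout jobs can read.  Rough sketch only; the Posting
holder and method signatures are just illustrative:)

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.VectorWritable;

public class VectorSink {
    public static class Posting {
        final int termid;
        final int tf;
        public Posting(int termid, int tf) { this.termid = termid; this.tf = tf; }
    }

    public static SequenceFile.Writer open(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        return new SequenceFile.Writer(fs, conf, new Path(path), Text.class, VectorWritable.class);
    }

    // One call per document, after its postings are gathered from the sorted docs file.
    // Dimension is numTerms + 1 because the termids above start at 1.
    public static void writeDoc(SequenceFile.Writer writer, String docKey,
                                List<Posting> postings, int numTerms) throws IOException {
        RandomAccessSparseVector vec = new RandomAccessSparseVector(numTerms + 1);
        for (Posting p : postings) {
            vec.setQuick(p.termid, p.tf);
        }
        writer.append(new Text(docKey), new VectorWritable(vec));
    }
}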
>>>> 
>>>> Probably I should not use the internal docid, but instead some unique
>>>> identifier field.
>>>> 
>>>> Also, I assume at some point this could be a map-reduce job in Hadoop.
>>>> 
>>>> I'm just asking for a sanity check, or if there are any better ideas out
>>>> there.
>>>> 
>>>> Thanks
>>>> Bob
>>> 
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
