Thanks Ted. One thing I don't get: why would "earlier term ids be much, much more common than later ones"? AFAIK, terms are sorted lexicographically, so earlier ones just start with AAA... instead of ZZZ..., and I don't see how that relates to frequency. Probably I misunderstand what you mean by term ids?
On Nov 4, 2011, at 2:33 PM, Ted Dunning wrote:

> It looks like a fine solution. It should be map-reducible as well if you
> can build good splits on term space. That isn't quite as simple as it
> looks, since you probably want each mapper to read a consecutive sequence
> of term ids, and earlier term ids will be much, much more common than
> later ones. It should be pretty easy to use term frequency to balance
> this out.
>
> Another thought is that you should accumulate whatever information you
> get about documents and just burp out the contents of your accumulators
> occasionally. In the map-reduce framework, this would be a combiner, but
> you can obviously do this in your single-machine version as well. The
> virtue of this is that you will decrease output size, which should be a
> key determinant of run-time.
>
> A final thought is that using the Mahout collections to get a compact
> int -> intlist map might help you keep more temporary accumulations in
> memory, which will allow you to accumulate more data before burping.
> That will decrease the output size, which makes everything better.
>
> On Fri, Nov 4, 2011 at 11:06 AM, Robert Stewart <[email protected]> wrote:
>
>> Ok, that was what I thought. I'll give it a shot.
>>
>> On Nov 4, 2011, at 2:05 PM, Grant Ingersoll wrote:
>>
>>> Should be doable, but likely slow. Relative to the other things you
>>> are likely doing, probably not a big deal.
>>>
>>> In fact, I've thought about adding such a piece of code, so if you
>>> are looking to contribute it, that would be welcome.
>>>
>>> On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote:
>>>
>>>> I have a relatively large existing Lucene index which does not store
>>>> term vectors. It is approx. 100 million documents (about 1.5 TB in
>>>> size).
>>>>
>>>> I am thinking of using some lower-level Lucene API code to extract
>>>> vectors, by enumerating the term and term-docs collections.
>>>>
>>>> Something like the following pseudocode logic:
>>>>
>>>> termid = 0
>>>> for each term in terms:
>>>>     termid++
>>>>     writeToFile("terms", term, termid)
>>>>     for each doc in termdocs(term):
>>>>         tf = getTF(doc, term)
>>>>         writeToFile("docs", docid, termid, tf)
>>>>
>>>> So I will end up with two files:
>>>>
>>>> terms - contains the mapping from term text to term ID
>>>> docs  - contains the mapping from doc ID to term ID and TF
>>>>
>>>> I can then build vectors by sorting the docs file by doc ID and then
>>>> gathering the terms for each doc into another vector file that I can
>>>> use with Mahout.
>>>>
>>>> Probably I should not use the internal doc ID, but instead some
>>>> unique identifier field.
>>>>
>>>> Also, I assume at some point this could be a map-reduce job in Hadoop.
>>>>
>>>> I'm just asking for a sanity check, or whether there are any better
>>>> ideas out there.
>>>>
>>>> Thanks,
>>>> Bob
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
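
[Editor's note: a minimal sketch of the extraction loop Bob describes, assuming a Lucene 3.x index; the index path argument, output file names, and tab-separated format are placeholders for illustration.]

    import java.io.File;
    import java.io.PrintWriter;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class DumpTermVectors {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        PrintWriter termsOut = new PrintWriter("terms"); // term text -> term id
        PrintWriter docsOut = new PrintWriter("docs");   // doc id, term id, tf

        TermEnum terms = reader.terms();
        TermDocs termDocs = reader.termDocs();
        int termId = 0;
        while (terms.next()) {
          Term term = terms.term();
          termId++;
          termsOut.println(term.field() + "\t" + term.text() + "\t" + termId);

          // Walk the postings for this term and emit (docId, termId, tf) triples.
          termDocs.seek(terms);
          while (termDocs.next()) {
            // Note: termDocs.doc() is the internal Lucene doc id, which is not
            // stable across merges; a stored unique-id field may be preferable.
            docsOut.println(termDocs.doc() + "\t" + termId + "\t" + termDocs.freq());
          }
        }
        termDocs.close();
        terms.close();
        termsOut.close();
        docsOut.close();
        reader.close();
      }
    }

Sorting the resulting docs file by its first column then yields one group of (termId, tf) pairs per document, as in the original plan.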

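[Editor's note: a rough sketch of the in-memory accumulation Ted suggests, using the primitive-keyed Mahout math collections; the flush threshold and the actual write-out step are placeholders.]

    import org.apache.mahout.math.function.IntObjectProcedure;
    import org.apache.mahout.math.list.IntArrayList;
    import org.apache.mahout.math.map.OpenIntObjectHashMap;

    public class DocTermAccumulator {
      // docId -> flat list of (termId, tf) pairs, kept as primitive ints
      private final OpenIntObjectHashMap<IntArrayList> buffer =
          new OpenIntObjectHashMap<IntArrayList>();
      private int entries = 0;
      private static final int FLUSH_THRESHOLD = 10 * 1000 * 1000; // placeholder

      public void add(int docId, int termId, int tf) {
        IntArrayList postings = buffer.get(docId);
        if (postings == null) {
          postings = new IntArrayList();
          buffer.put(docId, postings);
        }
        postings.add(termId);
        postings.add(tf);
        if (++entries >= FLUSH_THRESHOLD) {
          flush();
        }
      }

      // "Burp" the accumulated partial vectors out and start over.
      public void flush() {
        buffer.forEachPair(new IntObjectProcedure<IntArrayList>() {
          public boolean apply(int docId, IntArrayList postings) {
            // write docId and its (termId, tf) pairs to the docs output here
            return true; // keep iterating
          }
        });
        buffer.clear();
        entries = 0;
      }
    }

Because keys and list elements stay as primitive ints rather than boxed Integers, the buffer holds many more entries per GB of heap, which is the point of Ted's suggestion: fewer, larger burps mean less intermediate output to sort later.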