It looks like a fine solution.  It should be map-reducible as well if you
can build good splits on term space.  That isn't quite as simple as it
looks, since you probably want each mapper to read a consecutive range of
term IDs, and earlier term IDs will be much, much more common than later
ones.  It should be pretty easy to use term frequency to balance this out,
along the lines of the sketch below.
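
For instance, here is a rough sketch of picking split boundaries by
cumulative frequency.  The docFreq array (indexed by term ID) and numSplits
are placeholders for whatever counts you gather during extraction:

// Choose term-ID boundaries so each split carries roughly the same number
// of postings rather than the same number of terms.
static int[] splitBoundaries(int[] docFreq, int numSplits) {
  long total = 0;
  for (int f : docFreq) {
    total += f;
  }
  int[] boundaries = new int[numSplits];   // exclusive upper term ID of each split
  long perSplit = total / numSplits;
  long running = 0;
  int split = 0;
  for (int termId = 0; termId < docFreq.length && split < numSplits - 1; termId++) {
    running += docFreq[termId];
    if (running >= perSplit * (split + 1)) {
      boundaries[split++] = termId + 1;
    }
  }
  for (; split < numSplits; split++) {
    boundaries[split] = docFreq.length;    // any remaining splits end at the last term
  }
  return boundaries;
}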

Another thought is that you should accumulate whatever information you get
about documents and just burp out the contents of your accumulators
occasionally.  In the map-reduce framework, this would be a combiner, but
you can obviously do the same thing in your single-machine version.  The
virtue of this is that you will decrease output size, which should be a key
determinant of run-time.
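
In the Hadoop version that is just a matter of wiring a combiner into the
job.  Something like this, where TermDocsMapper and DocCombiner are
hypothetical stand-ins for whatever mapper and reducer you end up writing:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf, "lucene-vector-extraction");
job.setMapperClass(TermDocsMapper.class);    // emits (docid, "termid:tf") pairs
job.setCombinerClass(DocCombiner.class);     // pre-aggregates per doc on the map side
job.setReducerClass(DocCombiner.class);      // final per-doc aggregation
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);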

A final thought is that using the Mahout collections to get a compact
int -> int-list map might help you keep more of the temporary accumulation
in memory, which lets you buffer more data before each burp.  That will
decrease the output size, which makes everything better.
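
A minimal sketch of that kind of accumulator, assuming the
OpenIntObjectHashMap and IntArrayList classes from mahout-collections (the
flush threshold and the writeDoc method are placeholders):

import org.apache.mahout.math.list.IntArrayList;
import org.apache.mahout.math.map.OpenIntObjectHashMap;

// Compact docid -> (termid, tf, termid, tf, ...) accumulator built on
// primitive-keyed collections, flushed before it threatens the heap.
public class DocAccumulator {
  private final OpenIntObjectHashMap<IntArrayList> postings =
      new OpenIntObjectHashMap<IntArrayList>();
  private int buffered = 0;

  public void add(int docId, int termId, int tf) {
    IntArrayList list = postings.get(docId);
    if (list == null) {
      list = new IntArrayList();
      postings.put(docId, list);
    }
    list.add(termId);
    list.add(tf);
    if (++buffered >= 10000000) {      // placeholder threshold; tune to your heap
      flush();
    }
  }

  public void flush() {
    IntArrayList docIds = postings.keys();
    for (int i = 0; i < docIds.size(); i++) {
      int docId = docIds.get(i);
      writeDoc(docId, postings.get(docId));
    }
    postings.clear();
    buffered = 0;
  }

  private void writeDoc(int docId, IntArrayList termsAndFreqs) {
    // placeholder: append one record per doc to the "docs" output
  }
}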


On Fri, Nov 4, 2011 at 11:06 AM, Robert Stewart <[email protected]> wrote:

> Ok that was what I thought.  I'll give it a shot.
>
>
> On Nov 4, 2011, at 2:05 PM, Grant Ingersoll wrote:
>
> > Should be doable, but likely slow. Relative to the other things you are
> likely doing, probably not a big deal.
> >
> > In fact, I've thought about adding such a piece of code, so if you are
> looking to contribute it, it would be welcome.
> >
> > On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote:
> >
> >> I have a relatively large existing Lucene index which does not store
> vectors.  It holds approx. 100 million documents (about 1.5 TB).
> >>
> >> I am thinking of using some lower level Lucene API code to extract
> vectors, by enumerating terms and term docs collections.
> >>
> >> Something like the following pseudocode logic:
> >>
> >> termid = 0
> >> for each term in terms:
> >>     termid++
> >>     writeToFile("terms", term, termid)
> >>     for each docid in termdocs(term):
> >>         tf = getTF(docid, term)
> >>         writeToFile("docs", docid, termid, tf)
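
A rough fleshing-out of that loop, assuming the Lucene 3.x
TermEnum/TermDocs API (org.apache.lucene.index); the writeTerm and
writePosting calls are placeholders for your file output:

IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
TermEnum terms = reader.terms();
int termId = 0;
while (terms.next()) {
  Term term = terms.term();
  termId++;
  writeTerm(term.field(), term.text(), termId);           // append to the "terms" file
  TermDocs termDocs = reader.termDocs(term);
  while (termDocs.next()) {
    writePosting(termDocs.doc(), termId, termDocs.freq()); // append to the "docs" file
  }
  termDocs.close();
}
terms.close();
reader.close();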
> >>
> >>
> >> So I will end up with two files:
> >>
> >> terms - contains mapping from term text to term ID
> >> docs - contains mapping from docid to termid and TF
> >>
> >> I can then build vectors by sorting the docs file by docid and then
> gathering the terms for each doc into another vector file that I can use
> with Mahout.
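
For that last gather step, here is a very rough sketch of writing Mahout
vectors out of the sorted docs file.  It assumes tab-separated
"docid  termid  tf" lines already sorted by docid, a numTerms value taken
from the terms file, and a SequenceFile of IntWritable/VectorWritable,
which the Mahout jobs can generally read; treat the paths and names as
placeholders:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path("vectors/part-00000"), IntWritable.class, VectorWritable.class);

BufferedReader in = new BufferedReader(new FileReader("docs.sorted"));
int currentDoc = -1;
RandomAccessSparseVector vector = null;
String line;
while ((line = in.readLine()) != null) {
  String[] parts = line.split("\t");              // docid, termid, tf
  int docId = Integer.parseInt(parts[0]);
  if (docId != currentDoc) {                      // input is sorted, so a new docid
    if (vector != null) {                         // means the previous doc is complete
      writer.append(new IntWritable(currentDoc), new VectorWritable(vector));
    }
    vector = new RandomAccessSparseVector(numTerms);
    currentDoc = docId;
  }
  vector.set(Integer.parseInt(parts[1]), Double.parseDouble(parts[2]));
}
if (vector != null) {
  writer.append(new IntWritable(currentDoc), new VectorWritable(vector));
}
writer.close();
in.close();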
> >>
> >> Probably I should not use internal docid, but instead some unique
> identifier field.
> >>
> >> Also, I assume at some point this could be a map-reduce job in Hadoop.
> >>
> >> I'm just asking for sanity check, or if there are any better ideas out
> there.
> >>
> >> Thanks
> >> Bob
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
