Hmmm... I think I may be out of date.  Or not.  Grant may be able to
resolve the question.

If term ids are assigned in order of appearance, then the first ids
assigned will tend to be common terms.

But I think you are right that the current Lucene index structure uses
lexicographic order for the terms table.  Each term links into the file
holding the term postings (i.e. the term frequency or .frq file).
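
A quick way to check which ordering you actually have (this is the 3.x API
from memory, so treat it as a sketch rather than gospel): walk the terms
dictionary and print the first few entries with their document frequencies.

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    // Terms come back sorted by field and then term text, so ids assigned
    // while enumerating are lexicographic, not frequency ordered.
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
    TermEnum terms = reader.terms();
    int shown = 0;
    while (terms.next() && shown++ < 20) {
      Term t = terms.term();
      System.out.println(t.field() + ":" + t.text() + " df=" + terms.docFreq());
    }
    terms.close();
    reader.close();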

On Fri, Nov 4, 2011 at 11:49 AM, Robert Stewart <[email protected]> wrote:

> Thanks Ted,
>
> One thing I don't get.  Why would "earlier term ids be much, much more
> common than later ones"?  AFAIK, terms are sorted lexicographically, so
> earlier ones are just AAA... rather than ZZZ..., so I don't understand how
> that relates to frequency.  Probably I misunderstand what you mean by term
> ids?
>
>
>
> On Nov 4, 2011, at 2:33 PM, Ted Dunning wrote:
>
> > It looks like a fine solution.  It should be map-reducible as well if you
> > can build good splits on term space.  That isn't quite as simple as it
> > looks, since you probably want each mapper to read a consecutive sequence
> > of term ids, and earlier term ids will be much, much more common than
> > later ones.  It should be pretty easy to use term frequency to balance
> > this out.
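> >
> > Roughly what I have in mind for the splits (just an illustrative sketch;
> > the docFreq array would come from the terms dictionary):
> >
> >     // Pick split boundaries over term ids so every split sees roughly the
> >     // same number of postings, using per-term document frequency as weight.
> >     static int[] splitBoundaries(int[] docFreq, int numSplits) {
> >       long total = 0;
> >       for (int df : docFreq) total += df;
> >       int[] boundaries = new int[numSplits];  // exclusive upper term id per split
> >       long acc = 0;
> >       int split = 0;
> >       for (int termId = 0; termId < docFreq.length && split < numSplits - 1; termId++) {
> >         acc += docFreq[termId];
> >         if (acc >= (total / numSplits) * (split + 1)) {
> >           boundaries[split++] = termId + 1;
> >         }
> >       }
> >       for (; split < numSplits; split++) {
> >         boundaries[split] = docFreq.length;  // the last split takes the tail
> >       }
> >       return boundaries;
> >     }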
> >
> > Another thought is that you should accumulate whatever information you
> > get about documents and just burp out the contents of your accumulators
> > occasionally.  In the map-reduce framework, this would be a combiner, but
> > you can obviously do this in your single machine version as well.  The
> > virtue of this is that you will decrease output size, which should be a
> > key determinant of run-time.
> >
> > A final thought is that using the Mahout collections to get a compact
> > int -> intlist map might help you keep more temporary accumulations in
> > memory before burping.  That will decrease the output size, which makes
> > everything better.
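> >
> > For example, something along these lines (class names from
> > mahout-collections as I remember them, so double check; the flush
> > threshold is just a guess):
> >
> >     import org.apache.mahout.math.list.IntArrayList;
> >     import org.apache.mahout.math.map.OpenIntObjectHashMap;
> >
> >     // Accumulates doc ids per term id in a primitive-keyed map and burps
> >     // (flushes) whenever it gets big; this plays the same role a combiner
> >     // plays in map-reduce.
> >     class PostingsAccumulator {
> >       private final OpenIntObjectHashMap<IntArrayList> acc =
> >           new OpenIntObjectHashMap<IntArrayList>();
> >       private long entries = 0;
> >
> >       void add(int termId, int docId) {
> >         IntArrayList docs = acc.get(termId);
> >         if (docs == null) {
> >           docs = new IntArrayList();
> >           acc.put(termId, docs);
> >         }
> >         docs.add(docId);
> >         if (++entries >= 10000000) {  // tune to your heap
> >           flush();
> >           acc.clear();
> >           entries = 0;
> >         }
> >       }
> >
> >       private void flush() {
> >         // write the partial postings out however you like; sorted output
> >         // makes the later merge/sort step cheaper
> >       }
> >     }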
> >
> >
> > On Fri, Nov 4, 2011 at 11:06 AM, Robert Stewart <[email protected]> wrote:
> >
> >> Ok that was what I thought.  I'll give it a shot.
> >>
> >>
> >> On Nov 4, 2011, at 2:05 PM, Grant Ingersoll wrote:
> >>
> >>> Should be doable, but likely slow.  Relative to the other things you
> >>> are likely doing, probably not a big deal.
> >>>
> >>> In fact, I've thought about adding such a piece of code, so if you are
> >>> looking to contribute, it would be welcome.
> >>>
> >>> On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote:
> >>>
> >>>> I have a relatively large existing Lucene index which does not store
> >>>> term vectors.  It is approx. 100 million documents (about 1.5 TB in size).
> >>>>
> >>>> I am thinking of using some lower level Lucene API code to extract
> >>>> vectors, by enumerating the terms and term docs collections.
> >>>>
> >>>> Something like the following pseudocode logic:
> >>>>
> >>>> termid = 0
> >>>> for each term in terms:
> >>>>     termid++
> >>>>     writeToFile("terms", term, termid)
> >>>>     for each doc in termdocs(term):
> >>>>         tf = getTF(doc, term)
> >>>>         writeToFile("docs", doc, termid, tf)
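> >>>>
> >>>> In concrete Lucene 3.x terms I imagine it looks roughly like this
> >>>> (untested; uses the IndexReader/TermEnum/TermDocs classes, and termsOut
> >>>> and docsOut are just PrintWriters for the two files):
> >>>>
> >>>>     IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
> >>>>     TermEnum termEnum = reader.terms();
> >>>>     TermDocs termDocs = reader.termDocs();
> >>>>     int termId = 0;
> >>>>     while (termEnum.next()) {
> >>>>       Term term = termEnum.term();
> >>>>       termId++;
> >>>>       termsOut.println(term.field() + ":" + term.text() + "\t" + termId);
> >>>>       termDocs.seek(termEnum);  // position the postings on the current term
> >>>>       while (termDocs.next()) {
> >>>>         docsOut.println(termDocs.doc() + "\t" + termId + "\t" + termDocs.freq());
> >>>>       }
> >>>>     }
> >>>>     termDocs.close();
> >>>>     termEnum.close();
> >>>>     reader.close();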
> >>>>
> >>>>
> >>>> So I will end up with two files:
> >>>>
> >>>> terms - contains mapping from term text to term ID
> >>>> docs - contains mapping from docid to termid and TF
> >>>>
> >>>> I can then build vectors by sorting the docs file by docid and then
> >>>> gathering the terms for each doc into another vector file that I can
> >>>> use with Mahout.
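> >>>>
> >>>> For that last step I'm picturing something like the following (untested;
> >>>> "docs.sorted" and numTerms are placeholders, and the classes are the
> >>>> usual Hadoop io/fs ones plus Mahout's math vectors):
> >>>>
> >>>>     // Read the docid-sorted docs file (docid \t termid \t tf per line)
> >>>>     // and write one sparse vector per doc as SequenceFile<Text, VectorWritable>.
> >>>>     Configuration conf = new Configuration();
> >>>>     FileSystem fs = FileSystem.get(conf);
> >>>>     SequenceFile.Writer writer = SequenceFile.createWriter(
> >>>>         fs, conf, new Path("vectors/part-00000"), Text.class, VectorWritable.class);
> >>>>     BufferedReader in = new BufferedReader(new FileReader("docs.sorted"));
> >>>>     String currentDoc = null;
> >>>>     Vector vec = null;
> >>>>     String line;
> >>>>     while ((line = in.readLine()) != null) {
> >>>>       String[] cols = line.split("\t");
> >>>>       if (!cols[0].equals(currentDoc)) {
> >>>>         if (vec != null) {
> >>>>           writer.append(new Text(currentDoc), new VectorWritable(vec));
> >>>>         }
> >>>>         currentDoc = cols[0];
> >>>>         vec = new RandomAccessSparseVector(numTerms);  // numTerms = max term id + 1
> >>>>       }
> >>>>       vec.setQuick(Integer.parseInt(cols[1]), Double.parseDouble(cols[2]));
> >>>>     }
> >>>>     if (vec != null) {
> >>>>       writer.append(new Text(currentDoc), new VectorWritable(vec));
> >>>>     }
> >>>>     in.close();
> >>>>     writer.close();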
> >>>>
> >>>> Probably I should not use the internal docid, but instead some unique
> >>>> identifier field.
> >>>>
> >>>> Also, I assume at some point this could be a map-reduce job in Hadoop.
> >>>>
> >>>> I'm just asking for a sanity check, or if there are any better ideas
> >>>> out there.
> >>>>
> >>>> Thanks
> >>>> Bob
> >>>
> >>> --------------------------
> >>> Grant Ingersoll
> >>> http://www.lucidimagination.com
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
>
>
