Rather than attempt to answer your questions directly, I'll mention how other projects have dealt with the very-common-word issue. Nutch, for example, maintains a list of high-frequency terms and concatenates each with the word that follows it, forming less-frequent aggregate terms. The original terms are also indexed, but during phrase querying the common terms are again concatenated, which makes such queries a lot faster.

I may not have explained it entirely accurately, but that's the gist. Have a look at Nutch's Analyzer for more details.
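As a rough illustration of that concatenation idea, here is a minimal sketch (the common-term list, the underscore separator, and the "gram when the first token is common" rule are my assumptions for the example; Nutch's actual analyzer differs in its details):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonGrams {
    // Hypothetical high-frequency list for illustration; Nutch loads
    // its own list, and its gram-forming rules are more involved.
    private static final Set<String> COMMON =
            new HashSet<>(Arrays.asList("the", "a", "of", "in"));

    // Emit each original token, plus a concatenated "gram" whenever a
    // common term is immediately followed by another token. The gram
    // is a much rarer term, so a phrase query can match against it
    // instead of scanning the huge postings of the common word alone.
    public static List<String> analyze(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            out.add(tokens[i]);
            if (i + 1 < tokens.length && COMMON.contains(tokens[i])) {
                out.add(tokens[i] + "_" + tokens[i + 1]);
            }
        }
        return out;
    }
}
```

For the example query below, `analyze(new String[]{"the", "new", "economics"})` yields `[the, the_new, new, economics]` — the query side can then use `the_new` and avoid reading the full position data for `the`.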

        Erik


On Nov 18, 2008, at 4:00 PM, Burton-West, Tom wrote:

Hello,

We are working with a very large index and with large documents (300+
page books). It appears that the bottleneck on our system is the disk
IO involved in reading position information from the .prx file for
commonly occurring terms.

An example slow query is  "the new economics".

To process the above phrase query for the word "the", does the entire
part of the .prx file for the word "the" need to be read in to memory or
only the fragments of the entries for the word "the" that contain
specific doc ids?

In reading the Lucene index file formats document
(http://lucene.apache.org/java/2_4_0/fileformats.html), it's not clear
whether the .tis file stores a pointer into the .prx file for a term
(in which case the entire list of doc ids and positions for that term
would need to be read into memory), or a pointer to the term **and doc
id** in the .prx file (in which case only the positions for a given
doc id would need to be read), or whether the .frq file somehow
records where to find a doc id's entry in the .prx file.


The documentation for the .tis file says that it stores ProxDelta,
which is keyed by term (rather than by term/doc id). On the other
hand, the documentation for the .prx file states that Positions
entries are "ordered by increasing document number (the document
number is implicit from the .frq file)".


Tom
