Rather than attempt to answer your questions directly, I'll mention how other projects have dealt with the very-common-word issue. Nutch, for example, maintains a list of high-frequency terms and concatenates each with the word that follows it, forming less-frequent aggregate terms. The original terms are also indexed, but during phrase querying the common terms are again concatenated, which makes such queries a lot faster.

I may not have explained it entirely accurately, but that's the gist. Have a look at Nutch's Analyzer for more details.
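As a rough illustration of that concatenation idea, here is a minimal sketch (the common-term list, the underscore separator, and the "gram when the first token is common" rule are my assumptions for the example; Nutch's actual analyzer differs in its details):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonGrams {
    // Hypothetical high-frequency list for illustration; Nutch loads
    // its own list, and its gram-forming rules are more involved.
    private static final Set<String> COMMON =
            new HashSet<>(Arrays.asList("the", "a", "of", "in"));

    // Emit each original token, plus a concatenated "gram" whenever a
    // common term is immediately followed by another token. The gram
    // is a much rarer term, so a phrase query can match against it
    // instead of scanning the huge postings of the common word alone.
    public static List<String> analyze(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            out.add(tokens[i]);
            if (i + 1 < tokens.length && COMMON.contains(tokens[i])) {
                out.add(tokens[i] + "_" + tokens[i + 1]);
            }
        }
        return out;
    }
}
```

For the example query below, `analyze(new String[]{"the", "new", "economics"})` yields `[the, the_new, new, economics]` — the query side can then use `the_new` and avoid reading the full position data for `the`.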

        Erik


On Nov 18, 2008, at 4:00 PM, Burton-West, Tom wrote:

Hello,

We are working with a very large index and with large documents (300+
page books). It appears that the bottleneck on our system is the disk
IO involved in reading position information from the .prx file for
commonly occurring terms.

An example slow query is  "the new economics".

To process the above phrase query for the word "the", does the entire
part of the .prx file for the word "the" need to be read in to memory or
only the fragments of the entries for the word "the" that contain
specific doc ids?

In reading the Lucene index file formats document
(http://lucene.apache.org/java/2_4_0/fileformats.html), it's not clear
whether the .tis file stores a pointer into the .prx file for a term
(in which case the entire list of doc ids and positions for that term
would need to be read into memory), or a pointer to the term **and doc
id** in the .prx file (in which case only the positions for a given
doc id would need to be read), or whether the .frq file somehow
records where to find a doc id's entry in the .prx file.


The documentation for the .tis file says that it stores ProxDelta,
which is keyed by term (rather than by term/doc id). On the other
hand, the documentation for the .prx file states that Positions
entries are "ordered by increasing document number (the document
number is implicit from the .frq file)".


Tom
