On 3-Jul-08, at 3:04 PM, Chris Harris wrote:

Now I gather that phrase queries are inherently slower than non-phrase
queries, but 1-3 orders of magnitude difference seems noteworthy.

This is on Solr r654965, which I don't think is *too* far behind the
trunk version. 1200Mb RAM allocated to Solr. 8M documents. Lots of
compressed, stored fields. Most docs are probably like 50Kb, but some
of them might be 10Mb, 100Mb. The index as a whole is 106GB.
maxFieldLength=10000. The index was recently optimized. (It has only
one segment right now.)

I'm thinking that even supposing I've indexed everything in a horrible
inefficient manner, and even supposing my machine is woefully
underpowered, that wouldn't really explain why the phrase queries
would be *that* much slower, would it? Any ideas?

It is simply due to caching effects. Probably the term count info is in the OS cache, but the positions aren't. You are seeing disk vs. non-disk access differences, which is what accounts for the multiple orders of magnitude difference.

The important variable here isn't total index size, but size of .prx (positions) versus .frq (term counts), as compared with the total _free/cached_ memory available on the system (not allocated to the JVM).
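
As a rough way to check that comparison, here is a minimal Python sketch that totals the on-disk size of the .frq and .prx files in an index directory and compares them against a free-memory figure you supply. The function names and the free_bytes parameter are made up for illustration; the sketch assumes a non-compound index (with the compound file format everything is packed into .cfs files and there are no separate .frq/.prx files to measure), and you would read the actual free/cached figure from your OS tools rather than from this script.

```python
import os


def size_by_extension(index_dir, extensions=(".frq", ".prx")):
    """Sum the on-disk bytes of files in index_dir, grouped by extension."""
    totals = {ext: 0 for ext in extensions}
    for name in os.listdir(index_dir):
        ext = os.path.splitext(name)[1]
        if ext in totals:
            totals[ext] += os.path.getsize(os.path.join(index_dir, name))
    return totals


def fits_in_cache(index_dir, free_bytes):
    """Compare .frq and .prx sizes with memory available for the OS cache.

    free_bytes is an assumption supplied by the caller: memory NOT
    allocated to the JVM that the OS could use for file caching.
    """
    sizes = size_by_extension(index_dir)
    return {
        "frq_bytes": sizes[".frq"],
        "prx_bytes": sizes[".prx"],
        "frq_fits": sizes[".frq"] <= free_bytes,
        "frq_and_prx_fit": sizes[".frq"] + sizes[".prx"] <= free_bytes,
    }
```

If frq_fits comes back true but frq_and_prx_fit comes back false, that matches the situation described above: non-phrase queries can be served from cache while phrase queries have to go to disk for positions.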

Indexing with termPositions wouldn't help, would it? (Now I'm not
using termPositions or termVectors.) Or what if I used an alternative
query parser, so phrase queries could be implemented in terms of the
SpanNearQuery class rather than the PhraseQuery class?

No way to speed this up other than indexing less, buying more memory, or distributing across more machines.

-Mike
