On 3-Jul-08, at 3:04 PM, Chris Harris wrote:
Now I gather that phrase queries are inherently slower than non-phrase queries, but 1-3 orders of magnitude difference seems noteworthy. This is on Solr r654965, which I don't think is *too* far behind the trunk version. 1200MB RAM allocated to Solr. 8M documents. Lots of compressed, stored fields. Most docs are probably around 50KB, but some of them might be 10MB or 100MB. The index as a whole is 106GB. maxFieldLength=10000. The index was recently optimized. (It has only one segment right now.) I'm thinking that even supposing I've indexed everything in a horribly inefficient manner, and even supposing my machine is woefully underpowered, that wouldn't really explain why the phrase queries would be *that* much slower, would it? Any ideas?
It is simply due to caching effects. Probably the term count info is in the OS cache, but the positions aren't. You are seeing the difference between disk and non-disk access, which is what accounts for the multiple orders of magnitude difference.
The important variable here isn't total index size, but size of .prx (positions) versus .frq (term counts), as compared with the total _free/cached_ memory available on the system (not allocated to the JVM).
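As a rough way to check this, the comparison can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the cache figure you pass in is whatever free/cached memory your OS reports (from `free` or similar), not the JVM heap.

```python
import os
from collections import defaultdict


def index_file_sizes(index_dir):
    """Sum on-disk bytes per file extension in an index directory."""
    sizes = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if os.path.isfile(path):
            ext = os.path.splitext(name)[1]  # e.g. ".frq" or ".prx"
            sizes[ext] += os.path.getsize(path)
    return dict(sizes)


def fits_in_cache(sizes, cache_bytes, extensions):
    """True if the files with the given extensions could all sit in cache_bytes."""
    needed = sum(sizes.get(ext, 0) for ext in extensions)
    return needed <= cache_bytes


# Example use (the path and the 2 GB figure are placeholders):
#   sizes = index_file_sizes("/path/to/solr/data/index")
#   fits_in_cache(sizes, 2 * 1024**3, [".frq"])          # non-phrase queries
#   fits_in_cache(sizes, 2 * 1024**3, [".frq", ".prx"])  # phrase queries
```

If the first check passes and the second fails, that matches the situation described above: non-phrase queries are served from cache while phrase queries go to disk for positions.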
Indexing with termPositions wouldn't help, would it? (Currently I'm not using termPositions or termVectors.) Or what if I used an alternative query parser, so phrase queries could be implemented in terms of the SpanNearQuery class rather than the PhraseQuery class?
There is no way to speed this up other than indexing less, buying more memory, or distributing the index across more machines.
-Mike