Just for fun, I ran my CPU profiler against Solr while running a little app to hammer Solr with phrase queries. As Mike and Yonik could have predicted, I found that Solr was spending basically all its time in lucene.index.MultiSegmentReader$MultiTermPositions code (called from lucene.search.PhrasePositions code), which is to say that practically all it was doing was phrase position I/O.
Mike, your idea of indexing bigrams is also interesting. Do you know if any text search platforms do this behind the scenes as their default way of handling phrase queries? See, I have a copy of dtSearch, and when I use it to index the same data I've been discussing here, on the same machine (and end with roughly the same size index, measured in GB), it performs notably faster than Solr on phrase queries. I haven't done any serious tests, but it might be an order of magnitude difference or more. Now dtSearch may have a speed advantage over Solr because it's written in C and it's optimized for Windows in particular, but I'm wondering if the real explanation wouldn't be a difference in algorithms. (Hopefully the explanation is not me being stupid and thinking I've indexed the same thing in dtSearch when in fact I've configured them with some wildly different settings.) Is this plausible?

On Thu, Jul 3, 2008 at 5:30 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
>
> On 3-Jul-08, at 5:13 PM, Chris Harris wrote:
>
>>> That's pretty much impossible (way too small). Double check those
>>> numbers.
>>
>> I don't know where I got the above numbers. Sorry. Here are the real
>> numbers:
>>
>> .tis file: 730MB
>> .frq files: 10.1 GB
>> .prx file: 43.2 GB
>>
>> Now keeping all *that* in RAM, that sounds like a challenge.
>
> It doesn't have to be *all* in RAM... the OS will figure out what parts are
> needed.
>
> One alternative you might consider is using a flash hard drive. Another is
> to index bigrams as terms, and do phrase queries using the conjunction of
> the bigrams of a phrase. This should make phrase queries only a few times
> slower than term queries, and probably inflate your .frq to "only" 25GB
> (.prx could be ignored).
>
> Some other tricks, like stop word removal, also speed up phrase queries.
>
> -Mike
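
To make the bigram suggestion quoted above concrete, here is a minimal Python sketch. It is not Lucene or Solr code, and the names (tokenize, build_bigram_index, phrase_candidates) and the tiny in-memory index are made up purely for illustration. It indexes each pair of adjacent words as a single term with no position data, then answers a phrase query by intersecting the document sets of the phrase's bigrams.

```python
from collections import defaultdict


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; a real analyzer does far more).
    return text.lower().split()


def bigrams(tokens):
    # Adjacent token pairs: ["a", "b", "c"] -> [("a", "b"), ("b", "c")]
    return list(zip(tokens, tokens[1:]))


def build_bigram_index(docs):
    # Map each bigram to the set of doc ids containing it.
    # No positions are stored, which is why .prx-style data is not needed.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(tokenize(text)):
            index[gram].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    # Conjunction (set intersection) of the postings of every bigram in the phrase.
    grams = bigrams(tokenize(phrase))
    if not grams:
        raise ValueError("phrase must contain at least two terms")
    postings = [index.get(gram, set()) for gram in grams]
    return set.intersection(*postings)


docs = {
    1: "the quick brown fox",
    2: "quick brown dogs chase the brown fox",
    3: "a slow brown fox",
}
index = build_bigram_index(docs)
print(sorted(phrase_candidates(index, "quick brown fox")))
```

Note that for phrases of three or more words the conjunction is only a candidate filter: document 2 above contains both "quick brown" and "brown fox" but never the full phrase, so it matches anyway. Whether that is acceptable, or whether candidates need a second verification pass, depends on the application.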