Just for fun, I ran my CPU profiler against Solr while running a
little app to hammer Solr with phrase queries. As Mike and Yonik could
have predicted, I found that Solr was spending basically all its time
in lucene.index.MultiSegmentReader$MultiTermPositions code (called
from lucene.search.PhrasePositions code), which is to say practically
all it was doing was phrase position I/O.

Mike, your idea of indexing bigrams is also interesting. Do you know
if any text search platforms do this behind the scenes as their
default way of handling phrase queries? See, I have a copy of
dtSearch, and when I use it to index the same data I've been
discussing here, on the same machine (and end up with an index of
roughly the same size in GB), it performs notably faster than Solr on
phrase queries. I haven't done any serious tests, but it might be an
order of magnitude difference or more. Now dtSearch may have a speed
advantage over Solr because it's written in C and it's optimized for
Windows in particular, but I'm wondering if the real explanation
might instead be a difference in algorithms. (Hopefully the explanation is
not me being stupid and thinking I've indexed the same thing in
dtSearch when in fact I've configured them with some wildly different
settings.) Is this plausible?

On Thu, Jul 3, 2008 at 5:30 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
>
> On 3-Jul-08, at 5:13 PM, Chris Harris wrote:
>
>>> That's pretty much impossible (way too small).  Double check those
>>> numbers.
>>
>> I don't know where I got the above numbers. Sorry. Here are the real
>> numbers:
>>
>> .tis file: 730MB
>> .frq files: 10.1 GB
>> .prx file: 43.2 GB
>>
>> Now keeping all *that* in RAM, that sounds like a challenge.
>
> It doesn't have to be *all* in RAM... the OS will figure out what parts are
> needed.
>
> One alternative you might consider is using a flash hard drive.  Another is
> to index bigrams as terms, and do phrase queries using the conjunction of
> the bigrams of a phrase.  This should make phrase queries only a few times
> slower than term queries, and probably inflate your .frq to "only" 25GB
> (.prx could be ignored).
>
> Some other tricks, like stop word removal, also speed up phrase queries.
>
> -Mike
>
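
To make the bigram suggestion above concrete, here is a minimal sketch in
Python of how I understand it. This is not Solr or Lucene code, and every
name in it (bigrams, build_bigram_index, phrase_candidates) is made up for
illustration. It assumes plain whitespace tokenization and stores only
document ids per bigram, with no positions, which is roughly the ".prx
could be ignored" part. One caveat: for phrases of three or more words,
the conjunction of bigrams yields candidates that can include false
positives, since the bigrams may occur in different places in a document.

```python
from collections import defaultdict


def bigrams(tokens):
    """Return adjacent word pairs: ['a', 'b', 'c'] -> [('a', 'b'), ('b', 'c')]."""
    return list(zip(tokens, tokens[1:]))


def build_bigram_index(docs):
    """Map each bigram to the set of doc ids containing it (no positions)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for a real analyzer.
        for bg in bigrams(text.lower().split()):
            index[bg].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """Intersect the postings of every bigram in the phrase.

    For phrases of three or more words the result is a superset of the
    true phrase matches, because nothing checks that the bigrams overlap.
    """
    grams = bigrams(phrase.lower().split())
    if not grams:
        raise ValueError("need at least two words for a bigram phrase query")
    postings = [index.get(bg, set()) for bg in grams]
    return set.intersection(*postings)


# Toy data, invented for this example.
docs = {
    1: "the quick brown fox",
    2: "quick brown dogs and a brown fox",
    3: "a slow brown fox",
}
index = build_bigram_index(docs)

print(phrase_candidates(index, "brown fox"))
print(phrase_candidates(index, "quick brown fox"))
```

In the toy data, document 2 contains both "quick brown" and "brown fox"
but never the full phrase "quick brown fox", so it comes back as a
candidate anyway. Whether that kind of false positive is acceptable, or
needs a verification pass against positions, would depend on the
application.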
