I'm pretty sure Solr/Lucene have no such "optimization" already, but
it's also not clear to me that one would result in much of a
performance benefit. Just because of the way Lucene evaluates boolean
queries, it's not obvious to me that the second version of your query
would be noticeably faster than the first.
Maybe it would help in cases with many, many clauses, rather than the
few clauses in your example. Before embarking on writing the
'optimization', you'd definitely want to performance-test it to verify
there are any gains -- you can do that just by sending the different
versions of your real-world queries to Solr and comparing the response
times, working out the hypothetically 'optimized' version yourself by
hand if need be, right?
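A minimal sketch of that hand-comparison approach, in Python: build a select URL for each filter variant and time the request. The host, port, and parameter choices here are assumptions for illustration -- adjust them for your own Solr setup.

```python
# Sketch: time the original vs. hand-"optimized" filter against a live
# Solr instance. Endpoint and params are assumptions for illustration.
import time
import urllib.parse
import urllib.request

SOLR_SELECT = "http://localhost:8983/solr/select"  # assumed endpoint

original = '("cat" AND "dog") OR ("cat" AND "horse")'
optimized = '"cat" AND ("dog" OR "horse")'

def select_url(fq):
    """Build a select URL passing the filter as an fq parameter."""
    params = {"q": "*:*", "fq": fq, "rows": 0, "wt": "json"}
    return SOLR_SELECT + "?" + urllib.parse.urlencode(params)

def time_query(fq):
    """Issue the query and return wall-clock seconds (needs a running Solr)."""
    start = time.time()
    urllib.request.urlopen(select_url(fq)).read()
    return time.time() - start

# With Solr running, compare time_query(original) vs. time_query(optimized);
# Solr's own QTime in the JSON response is an even better number to compare.
```

Averaging several runs of each variant (and checking Solr's reported QTime rather than raw wall-clock time) gives a fairer comparison, since caches warm up after the first request.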
On 7/27/2011 5:05 PM, Scott Smith wrote:
We have a Solr application which ends up creating queries with very complicated
filters (literally hundreds and sometimes thousands of terms -- typically a large
number of terms OR'ed together, where each of these terms might have half a
dozen keywords ANDed/ORed together). In looking at the filters, I realized
that there are often a lot of common sub-filters.
A simple example of what I mean is:
("cat" AND "dog") OR ("cat" AND "horse")
This could clearly be simplified by saying:
"cat" AND ("dog" OR "horse")
It turns out that finding and combining common sub-filters isn't trivial for our
application. So, before I start a project to attempt this kind of
"optimization", my question is whether the decrease in query times is likely to be
significant enough to justify the development effort it would take to optimize the
filters. Certainly, if I thought I might get a 20%+ decrease in query time, I'd say it's
probably a good project. If it's just a few percentage points of improvement, then I'm
less excited about doing it.
Does Solr already go through some kind of optimization which effectively
combines common sub-filters and possibly duplicated terms? Does anyone have
any thoughts on this subject?
Thanks
Scott