On 8-Nov-07, at 8:59 AM, Chris Hostetter wrote:

Let's back up a second...

the theory is that while it's frequently handy to cache fq's independent of the main query (because they are probably used over and over) in some
cases it may be advantageous to use an FQ directly in the body of the
main query to get better skipTo behavior. -- the fundamental issue is
orthogonal to whether or not a DocSet for the FQ is cached, the question
is how should that FQ be used when computing the final DocList.

So what if instead of letting the client say "this argument is an fq which should be used to generate a BitSet and cached as a filter, this argument is an fq.nocache which should be added to the main query" we instead make
SolrIndexSearcher smart enough to say "i've been asked to filter the
main query using some DocSets, the intersection of those DocSets is small
enough, that instead of filtering the query on it, i'm going to add a
query that only matches docs in it to the main query to improve skipTo
behavior." ... so now clients don't have to know, they just pass in a
bunch of fq params. we still cache a DocSet for each one, but when it
comes time to do the search, we get the skipTo benefit anytime the
intersection of all fqs is really small (whether the individual fqs are
small enough individually or not)

I agree that this would be awesome if it can be pulled off.

that should just be a simple change to getDocListNC right?

Let's think about this: To effectively do what you suggest, the query handling needs to

1. determine whether a given (set of) filter(s) would be effective in a skipTo context
2. embed the filter in the query as a scorer

I see difficulties with both, but perhaps they are not insurmountable.

First, how to determine whether the filter-embedding would be effective? We have at our disposal the size of the filter intersection, assuming the filters are cached. The most important criterion here is probably the relative size difference of the result set with and without the filter applied, which isn't really available before running the query. It can be estimated by assuming the filter and query are independent, but that definitely isn't always true. And if the filter isn't (or shouldn't be) cached? Then you have to compute it separately just to make this decision, and avoiding that computation is part of the goal.
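To make the independence estimate concrete: if the query and the filter match documents independently, the fraction of query hits that survive the filter is roughly the filter's density in the index, so the decision reduces to comparing the filter-intersection size against maxDoc. The following is only a sketch under that assumption. The class name, method names, and the threshold parameter are hypothetical and do not exist in Solr or Lucene.

```java
// Hypothetical helper for illustration only; not part of Solr or Lucene.
final class FilterSelectivity {

    /**
     * Estimated number of hits after filtering, ASSUMING the query and the
     * filter match documents independently (often not true in practice).
     */
    static double estimateFilteredHits(long queryHits, long filterSize, long maxDoc) {
        if (maxDoc <= 0) {
            return 0.0;
        }
        return (double) queryHits * filterSize / maxDoc;
    }

    /**
     * Under independence the filter keeps about filterSize / maxDoc of any
     * query's results, so embed only when that fraction is small enough.
     * maxFraction is an assumed tuning knob, not an existing Solr setting.
     */
    static boolean shouldEmbed(long filterSize, long maxDoc, double maxFraction) {
        if (maxDoc <= 0) {
            return false;
        }
        return (double) filterSize / maxDoc <= maxFraction;
    }
}
```

For example, with a filter intersection of 1,000 docs in a 1,000,000-doc index, a query with 50,000 hits would be estimated at about 50 filtered hits, and shouldEmbed(1000, 1000000, 0.01) would say to embed. A correlated filter and query can make this estimate badly wrong in either direction.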

Second, embedding the filter itself. This is much more nettlesome within SolrIndexSearcher than within one of the request handlers. One problem is the use of BooleanScorer--I suppose we could detect that by walking the query tree looking for it. Another is the embedding location: if filters are embedded in SIS, then the only reasonable option is to wrap everything in another top-level BooleanScorer with the original and filter query as required clauses (perhaps the filter would be inserted as prohibited if the inverse bitset was sufficiently sparse). This means that the next()'s that happen to occur on the original query will pull in lots of extra scoring that might not be needed: bq's, bf's, pf's, whatever else is layered on the scoring (in my case, there are 1-2 layers of multiplicative boosts as well). It is nicer to insert the filters directly into the "matching" part of the query.

Actually, never mind: ReqOptSumScorer does not pull ahead its optional scorers until .score() is called, so the effects should be largely the same.
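The skipTo benefit of treating the filter as a required clause can be illustrated with a toy conjunction over two sorted doc-id lists: each side jumps forward to the other's current doc instead of stepping through every match. This is a simplified model and not Lucene's actual scorer code. The class and method names are made up, and it assumes strictly increasing, non-negative doc ids below Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of a two-clause conjunction; not Lucene's real implementation.
final class LeapfrogSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Sorted doc-id list with a skipTo-style advance(target). */
    static final class DocIdList {
        private final int[] docs; // assumed strictly increasing
        private int pos = 0;

        DocIdList(int[] docs) {
            this.docs = docs;
        }

        /** Moves to the first doc >= target at or after the current position. */
        int advance(int target) {
            int idx = Arrays.binarySearch(docs, pos, docs.length, target);
            if (idx < 0) {
                idx = -idx - 1; // insertion point: first element > target
            }
            pos = idx;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Intersects two lists by letting each side skip to the other's doc. */
    static List<Integer> intersect(int[] queryDocs, int[] filterDocs) {
        DocIdList q = new DocIdList(queryDocs);
        DocIdList f = new DocIdList(filterDocs);
        List<Integer> hits = new ArrayList<>();
        int qDoc = q.advance(0);
        int fDoc = f.advance(0);
        while (qDoc != NO_MORE_DOCS && fDoc != NO_MORE_DOCS) {
            if (qDoc == fDoc) {
                hits.add(qDoc);
                qDoc = q.advance(qDoc + 1);
                fDoc = f.advance(fDoc + 1);
            } else if (qDoc < fDoc) {
                qDoc = q.advance(fDoc); // query side skips ahead to the filter's doc
            } else {
                fDoc = f.advance(qDoc); // filter side skips ahead to the query's doc
            }
        }
        return hits;
    }
}
```

When the filter list is much sparser than the query list, most of the query's matches are skipped over rather than visited, which is the behavior a post-hoc DocSet filter cannot provide.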

ISTM then that the main challenge is in determining when the filter intersection should be embedded. Also, controlling filter caching is still difficult with this approach, but perhaps that's less important.

Thanks for the feedback,
-Mike
