On 8-Nov-07, at 8:59 AM, Chris Hostetter wrote:
Let's back up a second...
the theory is that while it's frequently handy to cache fq's independent
of the main query (because they are probably used over and over), in some
cases it may be advantageous to use an FQ directly in the body of the
main query to get better skipTo behavior -- the fundamental issue is
orthogonal to whether or not a DocSet for the FQ is cached; the question
is how that FQ should be used when computing the final DocList.
So what if instead of letting the client say "this argument is an fq which
should be used to generate a BitSet and cached as a filter, this argument
is an fq.nocache which should be added to the main query" we instead make
SolrIndexSearcher smart enough to say "i've been asked to filter the
main query using some DocSets, the intersection of those DocSets is small
enough that, instead of filtering the query on it, i'm going to add a
query that only matches docs in it to the main query to improve skipTo
behavior." ... so now clients don't have to know, they just pass in a
bunch of fq params. we still cache a DocSet for each one, but when it
comes time to do the search, we get the skipTo benefit anytime the
intersection of all fqs is really small (whether the individual fqs are
small enough individually or not).
I agree that this would be awesome if it can be pulled off.
that should just be a simple change to getDocListNC right?
Let's think about this: To effectively do what you suggest, the query
handling needs to
1. determine whether a given (set of) filter(s) would be effective in
a skipTo context
2. embed the filter in the query as a scorer
I see difficulties with both, but perhaps they are not insurmountable.
First, how to determine whether the filter-embedding would be
effective? We have at our disposal the size of the filter
intersection, assuming the filters are cached. The most important
criterion here is probably the relative size of the result set with
and without the filter applied, which isn't really available. It can
be estimated by assuming the filter and query are independent, but
this definitely isn't always true. And if the filter isn't, or
shouldn't be, cached? Then you have to compute it separately just for
this estimate (avoiding that computation is part of the goal).
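
To make the independence estimate concrete, here is a minimal sketch in
Java. It is not Solr code: the class FilterEmbedHeuristic, its method
names, and the maxSelectivity threshold are all hypothetical, invented
for illustration. Under independence the filtered result set is about
|Q| * |F| / N, so the fraction of query hits that survive is just
|F| / N, and that ratio is the only thing the decision can look at.

```java
/** Hypothetical helper for illustration only; nothing here is part of Solr. */
final class FilterEmbedHeuristic {
    private FilterEmbedHeuristic() {}

    /** Fraction of the index matched by the (cached) filter intersection: |F| / N. */
    static double selectivity(long filterSize, long maxDoc) {
        if (maxDoc <= 0) {
            throw new IllegalArgumentException("maxDoc must be positive");
        }
        if (filterSize < 0 || filterSize > maxDoc) {
            throw new IllegalArgumentException("filterSize must be in [0, maxDoc]");
        }
        return (double) filterSize / maxDoc;
    }

    /** Estimated hits after filtering, ASSUMING query and filter are independent. */
    static double estimateFilteredHits(long queryHits, long filterSize, long maxDoc) {
        return queryHits * selectivity(filterSize, maxDoc);
    }

    /** Embed the filter in the query only when it is sparse enough (assumed threshold). */
    static boolean shouldEmbed(long filterSize, long maxDoc, double maxSelectivity) {
        return selectivity(filterSize, maxDoc) <= maxSelectivity;
    }
}
```

For example, shouldEmbed(1000, 1000000, 0.01) says yes, while a filter
matching half the index says no. The estimate is only as good as the
independence assumption: a filter that is strongly correlated with the
query would be misjudged in either direction.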
Second, embedding the filter itself. This is much more nettlesome
within SolrIndexSearcher than within one of the request handlers.
One problem is the use of BooleanScorer; I suppose we could detect
that by walking the query tree looking for it. Another is the
embedding location: if filters are embedded in SIS, then the only
reasonable option is to wrap everything in another top-level
BooleanScorer with the original and filter query as required clauses
(perhaps the filter would be inserted as prohibited if the inverse
bitset was sufficiently sparse). This means that the next()'s that
happen to occur on the original query will pull in lots of extra
scoring that might not be needed: bq's, bf's, pf's, whatever else is
layered on the scoring (in my case, there are 1-2 layers of
multiplicative boosts as well). It would be nicer to insert the filters
directly into the "matching" part of the query.
Actually, never mind: ReqOptSumScorer does not advance its optional
scorers until .score() is called, so the effects should be largely
the same.
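
As a toy illustration of why a sparse required filter helps, the
following Java sketch intersects two sorted doc-id lists by leapfrogging
with skipTo. It is only loosely modeled on a scorer: LeapfrogSketch,
DocIdCursor, and the skips counter are hypothetical names, not Lucene
or Solr classes. The point is that the main query is only advanced once
per filter candidate, and scoring work would only be needed for docs
that land in the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; these are not Lucene or Solr classes. */
final class LeapfrogSketch {
    private LeapfrogSketch() {}

    /** Cursor over a sorted array of unique doc ids, with a skipTo operation. */
    static final class DocIdCursor {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int[] docs; // sorted ascending, no duplicates
        private int pos = -1;     // -1 means "not positioned yet"
        int skips = 0;            // how many times skipTo was called

        DocIdCursor(int[] sortedDocs) {
            this.docs = sortedDocs;
        }

        /** Current doc id, -1 before the first skipTo, NO_MORE_DOCS when exhausted. */
        int doc() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        /** Move to the first doc >= target, never moving backwards. */
        int skipTo(int target) {
            skips++;
            int from = Math.max(pos, 0);
            if (from >= docs.length) {
                pos = docs.length;
                return NO_MORE_DOCS;
            }
            int idx = Arrays.binarySearch(docs, from, docs.length, target);
            pos = idx >= 0 ? idx : -idx - 1; // insertion point if not found
            return doc();
        }
    }

    /** Docs matched by both cursors; the filter drives, the query follows. */
    static List<Integer> intersect(DocIdCursor query, DocIdCursor filter) {
        List<Integer> hits = new ArrayList<>();
        int f = filter.skipTo(0);
        while (f != DocIdCursor.NO_MORE_DOCS) {
            int q = query.skipTo(f);
            if (q == DocIdCursor.NO_MORE_DOCS) break;
            if (q == f) {
                hits.add(q); // only here would the expensive score() be needed
                f = filter.skipTo(f + 1);
            } else {
                f = filter.skipTo(q); // q > f: let the filter catch up
            }
        }
        return hits;
    }
}
```

With a query matching every doc in a 1000-doc index and a filter
matching only three docs, the query cursor is advanced three times
rather than a thousand, which is the skipTo benefit under discussion.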
ISTM then that the main challenge is in determining when the filter
intersection should be embedded. Also, controlling filter caching is
still difficult with this implementation, but perhaps that's less
important.
Thanks for the feedback,
-Mike