On 2019/04/17 10:55 AM, Poor Yorick wrote:
I've used the following two test queries in a version of sqlite built against a
recent checkout of trunk, and also another recent version of sqlite.  a.ref is
indexed.  The subselect query is faster than the join query -- two orders of
magnitude faster on a larger dataset.  Is sqlite missing some easy optimisation
opportunity here?


select a.rowid
from a join b on a.rowid = b.rowid
where a.ref = $x


select a.rowid
from a,b
where a.ref = $x and a.rowid in (select rowid from b)


These queries have vastly different meanings and as such must have vastly different execution plans. I won't go into that because it's obvious, so I will just mention the assumptions.

For the query engine to optimize this according to your suggestion (i.e. for it to really do the second thing when you've asked for the first thing) it has to assume the following:

- 1. You do not need any of the fields from b in any part of the query (which IS the case here and can be deduced, but is an extremely small likelihood for JOINs in general queries), - 2. there are no special Collations on either of the columns (again, this can be deduced here), and - 3. it must assume that every b.rowid appears ONLY once in table b (I know this can be assumed for rowid's, but is something that cannot be easily known for any other field, where it could be true with or without a unique constraint).

And, these are only the obvious outsider-viewpoint assumptions, I have no insight in what other checks might be needed internally from the QP. This means that this optimization carries a cost deficit in the general case. i.e. Wasting CPU cycles "looking" for this optimization opportunity in the general case, considering the high unlikelihood of finding it viable, will slow down many more queries than it will help.

That said, this is the very kind of thing where we as engineering programmers or database queryers could optimize by asking our questions right, mind you - I am not advocating thinking for the Query Planner, but do state your question optimally. If you really want to know how many rabbits have been eaten by foxes, do not ask for a list of all rabbit names alphabetically and exclude every one that was NOT eaten by a fox, simply ask for the count of rabbits where eaten by fox = true.

In your above example you really wish to know all the a's which have an entry in b. The first query asks to join and list all b's found for every a (which works mathematically in this case by virtue of rowid uniqueness, but isn't the real question and forces a lot of "join" algorithm checking on the QP), the correct question is the second query: Show every a which can also be found in b. It releases the QP of a lot of responsibility and let's it follow a plan that is much faster.


Hope that makes sense :)

Ryan


_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to