On Thu, Dec 19, 2013 at 12:34 PM, Clemens Ladisch <[email protected]>wrote:

> Dominique Devienne wrote:
> > In the two queries below, there's a 5x performance difference.
>
> > select max(id) from t100m limit 1;
> > select id from t100m order by id desc limit 1;
>
> > But logically, it seems to me that the limit 1 on the order by is
> logically
> > equivalent to a min or max depending on ascending or descending ordering,
> > and is "optimize-able" by using an appropriate transformation.
>
> SQLite's query optimizer does not implement this transformation, so the
> MAX can throw away any smaller records, while the ORDER BY does not know
> about the LIMIT and keeps all sorted records in a temporary table.
>

The selling point of SQL is to declaratively tell the engine what you need,
and let it choose the optimal implementation. So saying that ORDER BY
doesn't know about LIMIT as a matter of fact seems completely wrong to me.
The intent is clearly stated in the query, and the optimizer sees *both*
the ORDER BY and the LIMIT and thus has the opportunity to implement it in
a more efficient way that it does now. Obviously a query optimizer is a
very tricky beast to implement. I'm merely reporting this
"non-optimization", in part to make this optimization use-case known, in
the hope that it might be supported in the future, and in part in case I
misses another way to achieve the same results.

This query is efficient only if there is an index that allows the
> database to look up the largest value without touching any other
> records.
>

That's something else. There are no indices in my example queries, yet the
logically equivalent query is 5x faster.

With an index, in my contrived examples, both queries are instantaneous
(10m rows instead of 100m, otherwise I run out of memory). But typically
our order by columns are not indexed, and even worse some queries are
ordered using the result of a custom SQL function.

> This use case is not as bogus as it seems. We have tree UI components,
> > where the child nodes of a tree node is determined by a query, with some
> of
> > these "child queries" having order by clauses. The UI needs to know if
> the
> > parent node is expendable or not, and for that we currently add the
> limit 1
> > clause to the "child query", but the limit 1 is applied *after* the
> > ordering (or after the union all, see below), which forces getting all
> the
> > children rows and columns, a major performance issue.
>
> This sounds as if you're working on the wrong abstraction level.
> I would redesign the code so that tree nodes can return another query.
>

That's side-stepping the point I'm raising. Plus it opens the door for
inconsistencies between the two queries (i.e. false positives or worse
false negatives), and from experience it leads to more bugs. Using a single
query has drawbacks performance-wise, but at least the results are always
consistent. Our current mitigation is more along the lines of having our
code have more knowledge about the query structure by breaking it down into
clauses and such, to avoid query re-parsing.

Hmmm, I think I might have stumbled on what I need, namely the EXISTS
condition, which seems to exhibit the short-circuiting semantic I was
looking for!

sqlite> select count(*) from t100m;
100000000
CPU Time: user 0.171601 sys 0.000000
sqlite> select max(id) from t100m limit 1;
111111110
CPU Time: user 15.132097 sys 0.000000
sqlite> select id from t100m order by id desc limit 1;
111111110
CPU Time: user 78.343702 sys 0.000000
sqlite>
sqlite> select 1 where exists (select id from t100m order by id desc);
1
CPU Time: user 0.000000 sys 0.000000
sqlite> select 1 where exists (select id from t100m where id < 500*1000
order by id desc);
CPU Time: user 12.823282 sys 0.000000
sqlite> select 1 where exists (select id from t100m where id = 13 order by
id desc);
CPU Time: user 12.729682 sys 0.000000
sqlite> select 1 where exists (select id from t100m where id > 111111110
order by id desc);
CPU Time: user 13.525287 sys 0.000000


> Alternatively, if you can afford to confuse your users, allow all nodes
> to be expandable, and remove the [+] only when an actual attempt to
> expand returns no records.
>

That's been ruled out as unacceptable.


> > Also note that quite a few of our queries are "union" or "union all"
> > queries, joining rows from several different tables, so in this specific
> > use case, any query returning a row would be enough to stop executing the
> > other queries and without the need to union all the inner result sets.
>
> Why "inner"?  Are you using a subqueries?  In that case, read
> <http://www.sqlite.org/optoverview.html#flattening>.
>

Yes we are. The queries return "base models", i.e. primary "data" result
sets, which can be further ordered, filtered, and/or aggregated depending
on the state of the UI component. We try to avoid inner queries, but
sometimes we must use them.

Thanks for your input Clemens. It challenged me into finding my own
solution I think :)

Thanks, --DD
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to