On Thu, Dec 19, 2013 at 12:34 PM, Clemens Ladisch <[email protected]>wrote:
> Dominique Devienne wrote: > > In the two queries below, there's a 5x performance difference. > > > select max(id) from t100m limit 1; > > select id from t100m order by id desc limit 1; > > > But logically, it seems to me that the limit 1 on the order by is > logically > > equivalent to a min or max depending on ascending or descending ordering, > > and is "optimize-able" by using an appropriate transformation. > > SQLite's query optimizer does not implement this transformation, so the > MAX can throw away any smaller records, while the ORDER BY does not know > about the LIMIT and keeps all sorted records in a temporary table. > The selling point of SQL is to declaratively tell the engine what you need, and let it choose the optimal implementation. So saying that ORDER BY doesn't know about LIMIT as a matter of fact seems completely wrong to me. The intent is clearly stated in the query, and the optimizer sees *both* the ORDER BY and the LIMIT and thus has the opportunity to implement it in a more efficient way that it does now. Obviously a query optimizer is a very tricky beast to implement. I'm merely reporting this "non-optimization", in part to make this optimization use-case known, in the hope that it might be supported in the future, and in part in case I misses another way to achieve the same results. This query is efficient only if there is an index that allows the > database to look up the largest value without touching any other > records. > That's something else. There are no indices in my example queries, yet the logically equivalent query is 5x faster. With an index, in my contrived examples, both queries are instantaneous (10m rows instead of 100m, otherwise I run out of memory). But typically our order by columns are not indexed, and even worse some queries are ordered using the result of a custom SQL function. > This use case is not as bogus as it seems. We have tree UI components, > > where the child nodes of a tree node is determined by a query, with some > of > > these "child queries" having order by clauses. The UI needs to know if > the > > parent node is expendable or not, and for that we currently add the > limit 1 > > clause to the "child query", but the limit 1 is applied *after* the > > ordering (or after the union all, see below), which forces getting all > the > > children rows and columns, a major performance issue. > > This sounds as if you're working on the wrong abstraction level. > I would redesign the code so that tree nodes can return another query. > That's side-stepping the point I'm raising. Plus it opens the door for inconsistencies between the two queries (i.e. false positives or worse false negatives), and from experience it leads to more bugs. Using a single query has drawbacks performance-wise, but at least the results are always consistent. Our current mitigation is more along the lines of having our code have more knowledge about the query structure by breaking it down into clauses and such, to avoid query re-parsing. Hmmm, I think I might have stumbled on what I need, namely the EXISTS condition, which seems to exhibit the short-circuiting semantic I was looking for! sqlite> select count(*) from t100m; 100000000 CPU Time: user 0.171601 sys 0.000000 sqlite> select max(id) from t100m limit 1; 111111110 CPU Time: user 15.132097 sys 0.000000 sqlite> select id from t100m order by id desc limit 1; 111111110 CPU Time: user 78.343702 sys 0.000000 sqlite> sqlite> select 1 where exists (select id from t100m order by id desc); 1 CPU Time: user 0.000000 sys 0.000000 sqlite> select 1 where exists (select id from t100m where id < 500*1000 order by id desc); CPU Time: user 12.823282 sys 0.000000 sqlite> select 1 where exists (select id from t100m where id = 13 order by id desc); CPU Time: user 12.729682 sys 0.000000 sqlite> select 1 where exists (select id from t100m where id > 111111110 order by id desc); CPU Time: user 13.525287 sys 0.000000 > Alternatively, if you can afford to confuse your users, allow all nodes > to be expandable, and remove the [+] only when an actual attempt to > expand returns no records. > That's been ruled out as unacceptable. > > Also note that quite a few of our queries are "union" or "union all" > > queries, joining rows from several different tables, so in this specific > > use case, any query returning a row would be enough to stop executing the > > other queries and without the need to union all the inner result sets. > > Why "inner"? Are you using a subqueries? In that case, read > <http://www.sqlite.org/optoverview.html#flattening>. > Yes we are. The queries return "base models", i.e. primary "data" result sets, which can be further ordered, filtered, and/or aggregated depending on the state of the UI component. We try to avoid inner queries, but sometimes we must use them. Thanks for your input Clemens. It challenged me into finding my own solution I think :) Thanks, --DD _______________________________________________ sqlite-users mailing list [email protected] http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

