Re: [sqlite] range enclosing a number

Jay A. Kreibich Mon, 13 Jul 2009 21:35:45 -0700

On Mon, Jul 13, 2009 at 10:33:00PM -0400, Pavel Ivanov scratched on the wall:
> Jay, you're pretty much mistaken:
> 
> >  I'm pretty sure you don't want to do it this way.  What this does is
> >  gather every row that meets the WHERE condition and then runs a max()
> >  or min() aggregation function across all of those rows.  That means
> >  that even if the column "number" has an index on it, between these
> >  two statements you're going to end up scanning the whole table.
> 
> All database engines optimize queries which ask for min/max on indexed
> column with condition including only <, > or = on this very column.
> And SQLite is among these too:



  Not "all."  This type of optimization is actually a fairly new
  addition to SQLite (considering the product lifetime) and first
  appeared in 3.5.5, which was released in early 2008.


  And I'd still go with ORDER/LIMIT.  Here's why:


  For my tests I just used the default build under the current version
  of Mac OS X, which is a bit old (3.4).  Under that build, the
  ORDER/LIMIT is clearly faster, as this is before the optimization
  existed:

  (using the same test set you did)
  
  Full scan, 3.4: 
  -------------
  real  0m5.99s
  user  0m4.73s
  sys   0m0.84s

  Using ORDER/LIMIT, 3.4:
  -------------
  real  0m0.00s
  user  0m0.01s
  sys   0m0.00s

  Using min/max, 3.4:
  -------------
  real  0m5.97s
  user  0m2.94s
  sys   0m0.38s

  In this case, it is clear that min/max are NOT integrated into the
  optimizer, and requires half a table scan, just as I stated.

  I also have a build of the current 3.6.16 around, and in that case,
  the numbers are better:

  Using ORDER/LIMIT, 3.6.16
  -------------
  real  0m0.12s
  user  0m0.01s
  sys   0m0.03s

  Using min/max, 3.6.16
  -------------
  real  0m0.04s
  user  0m0.01s
  sys   0m0.03s

  This clearly shows that the optimization does exist, and that for
  this very basic case my assumptions were incorrect.

  With the current 3.6.16 build, using min/max seems a tad faster-- but
  only in "real" time.  In terms of user/sys times, the results shown
  here (and you're own numbers, which were 0.043/0.001/0.005 and
  0.005/0.001/0.001) were pretty typical (i.e. very very close).
  That might just be an I/O fluke.  We're getting small enough that
  to really say anything definite requires better profiling.  So
  there does appear to be a difference, but it is pretty small and
  unclear where it is coming from.

  However, I'd point out that using ORDER/LIMIT under 3.4 is the
  fastest of all.  This isn't just a quirk of one run, either.
  I ran these several times and the 3.4 ORDER/LIMIT was always fastest.
  We're still playing with number to small to really trust, but it
  seems that if the 3.6.16 ORDER/LIMIT was as fast as the one in 3.4,
  it would likely be the best choice of all.


  So you've sold me that the current version of SQLite clearly does
  have the min/max optimization and doesn't require a table scan.  It
  also appears to be slightly faster, but not by a big enough gap to
  clearly consider it a better choice on that alone.



  Personally, I'd still go with ORDER/LIMIT.  With the current version
  of SQLite the runtimes of both approaches are extremely similar,
  but the min/max approach depends on the query optimizer being able to
  take the min/max notation and basically turn it into an internal
  ORDER BY (look at the code).

  There are a lot of limits on when the optimizer can do this.  If the
  real-world query is a bit more complex than this example, or if (down
  the road) the query conditions get changed, or if the query gets more
  complex in just about anyway, the optimization is going to break with
  the min/max approach and you'll be stuck with a table scan--
  something that I think we can all agree is MUCH slower.  If that comes
  up due to a change in the SQL six months down the road, you're going
  to be spending a lot of time wondering why things just got so slow.
  The ORDER/LIMIT approach is much more resilient to changes, however,
  and should more or less behave the same no matter what you do to the
  rest of the query.

  The ORDER/LIMIT approach is also the clear winner if there is any
  possible chance at all you'll be running on older code.  "Older," in
  this case, is code that was released less than two years ago.  In
  many production environments, that's not all that long.

  There is also this...

> But of course your other point is true - if you want some other data
> from table along with min/max value, you need to make additional
> select in case of using min/max.

  ... which also negates any possible advantage of using the min/max
  method.  The ORDER/LIMIT approach lets you pull out all the data you
  might need or want in a single query.

  For the simple case on newer code, both approaches give similar
  returns.  But the min/max approach will default to a very expensive
  behavior if it is changed in just about any way, while the
  ORDER/LIMIT approach is valid for most variations on this example.


  I could also go off on a more idealistic tangent, about how I think
  the ORDER/LIMIT approach is more in the spirit of how SELECT is meant
  to be used... the min/max approach clearly depends on the query
  optimizer to do its work, while the ORDER/LIMIT works "by design."
  I'm never a fan of trusting the query optimizer.  It changes from time
  to time and sometimes things break.  The min/max approach also
  depends on things like an implied GROUP BY.  That might be standard
  SQL, but for someone less versed in SQL it looks like a query that
  shouldn't even work (and, in fact, doesn't once you add another
  column).  The ORDER/LIMIT approach is, arguable, less graceful, but
  it is also (IMHO) a lot easier to break down into logical blocks that
  a newbie can follow and understand both the design and intent-- even
  if the query is a bit strung out.  I would say that the min/max
  approach is more in the spirit of the Relational Model, but that's
  not what SQL is about and, at least where I work, the mention of
  "Relational Model" will get you nothing but blank looks from the
  majority of my co-workers.  It might be a more elegant approach,
  but six months down the road when some random team member needs to
  update it, that elegance is going to be totally lost on them. 
  I'd prefer the practical, more bullet proof solution that isn't
  dependent on the optimizer retaining the exact same behavor.
  And speaking of changes...  if the current version of SQLite can be
  made to run the ORDER/LIMIT as fast as the older ones, then
  ORDER/LIMIT could win in time as well!
  
  I can go on, but I'll stop there, since at this point we're well into
  opinions and I need to be doing other things.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Our opponent is an alien starship packed with atomic bombs.  We have
 a protractor."   "I'll go home and see if I can scrounge up a ruler
 and a piece of string."  --from Anathem by Neal Stephenson
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] range enclosing a number

Reply via email to