[sqlite] [BUG or FR] OpenEphemeral invariant optimization (was [sqlite] Best way to optimize this query?)

Tomash Brechko Thu, 03 May 2007 07:15:23 -0700

Hello,

Tito, instead of

  SELECT * FROM People where GUID in (select GUID from myGUID);

you better to have an index on People.GUID, and do

  SELECT People.* FROM myGUID CROSS JOIN People ON People.GUID = myGUID.GUID;

(assuming myGUID temporary table has much fewer records than People).

The rest of the letter explains the problem with IN in SQLite 3.3.17.

On Wed, May 02, 2007 at 13:11:46 -0500, P Kishor wrote:
> On 5/2/07, Tito Ciuro <[EMAIL PROTECTED]> wrote:
> >SELECT * FROM People where GUID in ("ABC", "RDT", "TUV");
> >
> >Is there a better way to include all these GUIDs on a single SQL
> >statement to speed things up?
> 
> Internally, I believe the IN clause gets converted to a set of OR
> matches as in GUID = 'ABC' OR GUID = 'XYZ' OR... (or is it the other
> way around?)

Unfortunately this can't be done so.  Indeed, for very small number of
alternatives this is what one should do by hand.  But imagine we have
500 values, and only last comparison returns true.  That's why (I
removed irrelevant instructions below):

sqlite> .explain
sqlite> explain SELECT * FROM People where GUID in ("ABC", "RDT", "TUV");
addr  opcode          p1          p2          p3
----  --------------  ----------  ----------  ---------------------------------
...
2     OpenRead        0           6
...
5     MemLoad         0           0
...
8     OpenEphemeral   1           0           keyinfo(1,BINARY)
9     SetNumColumns   1           1
10    String8         0           0           ABC
11    MakeRecord      1           0           b
12    IdxInsert       1           0
13    String8         0           0           RDT
14    MakeRecord      1           0           b
15    IdxInsert       1           0
16    String8         0           0           TUV
17    MakeRecord      1           0           b
18    IdxInsert       1           0
...
26    Found           1           28
...
30    Callback        1           0
31    Next            0           5
...

See, at instruction 8 a temporary table (or rather index) is created,
and values "ABC", "RDT", "TUV" are put into it, so that we can later
use logarithmic search instead of liner search as would be done with
OR chain.

The problem is that jump at instruction 31 (Next) brings us back to
instruction 5.  This means that the query plan is:

   for every row in People do
       create temp table
       populate temp table
       see if current value from People is in temp table
   next

This doesn't look right.  And the problem is not that we could compare
current record from People against IN values right away, instead of
putting them into the temporary table first (this solution would be
basically the OR chain described above).  The problem is that we
populate temporary table inside the loop, while its contents doesn't
depend on this loop.

The same problem holds for every solution using IN that were
suggested.  Namely,

  SELECT * FROM People where GUID in (select GUID from myGUID);

will too populate the temporary table on _every_ iteration over People
rows:

addr  opcode          p1          p2          p3
----  --------------  ----------  ----------  ---------------------------------
...
5     MemLoad         0           0
...
8     OpenEphemeral   2           0           keyinfo(1,BINARY)
...
11    OpenRead        1           2
...
14    Column          1           0
...
18    MakeRecord      1           0           b
19    IdxInsert       2           0
20    Next            1           14
...
29    Found           2           31
...
33    Callback        1           0
34    Next            0           5
...

11-20 is temporary table population, 5-34---loop over People.  Even
creation of index ON myGUID (GUID) won't help to avoid this temporary
table.

I'm not sure if this is a nasty bug, and previous versions worked
correctly (lazy to check), or a misfeature.

To sum it up:

1 If sub-query at nesting level N2 depends on an outer query at level
  N1 (N1 < Nany < N2), then temporal table population may be moved to
  level N1+1.  This means that for completely independent sub-query we
  may move temporary table population to the upper level.

2 For the query "... IN (SELECT col FROM table)" (note the absence of
  WHERE clause in sub-query) when there is an index

     ... INDEX ... ON table (col [, ...])

  no temporary table has to be created, we can do the lookups in this
  index.  Further, if there is a WHERE clause that uses only col, it
  may be moved to the outer query.

3 As described in 1, no every temporary table population may be moved
  to the upper level.  Hence a simple technical optimization is
  possible (which is orthogonal to the problem described in this
  letter): instead of closing a temporary table we may push its handle
  onto the stack, and, instead of re-creating temporary table on every
  OpenEphemeral in the loop (which is a costly operation: path checks,
  name generation, open() and unlink()), we may simply pop the handle,
  and do ftruncate() on it.  The stack should be freed (i.e. all
  cached handles closed) at the end of the statement (and also at
  UNION, which joins two separate statements; possibly at other places
  to prevent file handle starvation).

Dr. Hipp, what do you think?  I can't call any of that a bug, because,
as said in the docs, SQLite's primary goal is being robust and have a
manageable code base, speed comes as a consequence.  But, at least
point 3 doesn't seem to be hard to do, and for a certain queries it
will give a noticeable boost ("then submit a patch"---yeah, I know,
but I'm not that much into SQLite code :-/ (yet)).

-- 
   Tomash Brechko

-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------

[sqlite] [BUG or FR] OpenEphemeral invariant optimization (was [sqlite] Best way to optimize this query?)

Reply via email to