Hello,
Tito, instead of
SELECT * FROM People where GUID in (select GUID from myGUID);
you better to have an index on People.GUID, and do
SELECT People.* FROM myGUID CROSS JOIN People ON People.GUID = myGUID.GUID;
(assuming myGUID temporary table has much fewer records than People).
The rest of the letter explains the problem with IN in SQLite 3.3.17.
On Wed, May 02, 2007 at 13:11:46 -0500, P Kishor wrote:
> On 5/2/07, Tito Ciuro <[EMAIL PROTECTED]> wrote:
> >SELECT * FROM People where GUID in ("ABC", "RDT", "TUV");
> >
> >Is there a better way to include all these GUIDs on a single SQL
> >statement to speed things up?
>
> Internally, I believe the IN clause gets converted to a set of OR
> matches as in GUID = 'ABC' OR GUID = 'XYZ' OR... (or is it the other
> way around?)
Unfortunately this can't be done so. Indeed, for very small number of
alternatives this is what one should do by hand. But imagine we have
500 values, and only last comparison returns true. That's why (I
removed irrelevant instructions below):
sqlite> .explain
sqlite> explain SELECT * FROM People where GUID in ("ABC", "RDT", "TUV");
addr opcode p1 p2 p3
---- -------------- ---------- ---------- ---------------------------------
...
2 OpenRead 0 6
...
5 MemLoad 0 0
...
8 OpenEphemeral 1 0 keyinfo(1,BINARY)
9 SetNumColumns 1 1
10 String8 0 0 ABC
11 MakeRecord 1 0 b
12 IdxInsert 1 0
13 String8 0 0 RDT
14 MakeRecord 1 0 b
15 IdxInsert 1 0
16 String8 0 0 TUV
17 MakeRecord 1 0 b
18 IdxInsert 1 0
...
26 Found 1 28
...
30 Callback 1 0
31 Next 0 5
...
See, at instruction 8 a temporary table (or rather index) is created,
and values "ABC", "RDT", "TUV" are put into it, so that we can later
use logarithmic search instead of liner search as would be done with
OR chain.
The problem is that jump at instruction 31 (Next) brings us back to
instruction 5. This means that the query plan is:
for every row in People do
create temp table
populate temp table
see if current value from People is in temp table
next
This doesn't look right. And the problem is not that we could compare
current record from People against IN values right away, instead of
putting them into the temporary table first (this solution would be
basically the OR chain described above). The problem is that we
populate temporary table inside the loop, while its contents doesn't
depend on this loop.
The same problem holds for every solution using IN that were
suggested. Namely,
SELECT * FROM People where GUID in (select GUID from myGUID);
will too populate the temporary table on _every_ iteration over People
rows:
addr opcode p1 p2 p3
---- -------------- ---------- ---------- ---------------------------------
...
5 MemLoad 0 0
...
8 OpenEphemeral 2 0 keyinfo(1,BINARY)
...
11 OpenRead 1 2
...
14 Column 1 0
...
18 MakeRecord 1 0 b
19 IdxInsert 2 0
20 Next 1 14
...
29 Found 2 31
...
33 Callback 1 0
34 Next 0 5
...
11-20 is temporary table population, 5-34---loop over People. Even
creation of index ON myGUID (GUID) won't help to avoid this temporary
table.
I'm not sure if this is a nasty bug, and previous versions worked
correctly (lazy to check), or a misfeature.
To sum it up:
1 If sub-query at nesting level N2 depends on an outer query at level
N1 (N1 < Nany < N2), then temporal table population may be moved to
level N1+1. This means that for completely independent sub-query we
may move temporary table population to the upper level.
2 For the query "... IN (SELECT col FROM table)" (note the absence of
WHERE clause in sub-query) when there is an index
... INDEX ... ON table (col [, ...])
no temporary table has to be created, we can do the lookups in this
index. Further, if there is a WHERE clause that uses only col, it
may be moved to the outer query.
3 As described in 1, no every temporary table population may be moved
to the upper level. Hence a simple technical optimization is
possible (which is orthogonal to the problem described in this
letter): instead of closing a temporary table we may push its handle
onto the stack, and, instead of re-creating temporary table on every
OpenEphemeral in the loop (which is a costly operation: path checks,
name generation, open() and unlink()), we may simply pop the handle,
and do ftruncate() on it. The stack should be freed (i.e. all
cached handles closed) at the end of the statement (and also at
UNION, which joins two separate statements; possibly at other places
to prevent file handle starvation).
Dr. Hipp, what do you think? I can't call any of that a bug, because,
as said in the docs, SQLite's primary goal is being robust and have a
manageable code base, speed comes as a consequence. But, at least
point 3 doesn't seem to be hard to do, and for a certain queries it
will give a noticeable boost ("then submit a patch"---yeah, I know,
but I'm not that much into SQLite code :-/ (yet)).
--
Tomash Brechko
-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------