[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Nico Williams Wed, 18 Nov 2015 18:21:38 -0600

On Wed, Nov 18, 2015 at 11:52:14PM +0000, Simon Slavin wrote:
> On 18 Nov 2015, at 8:11pm, Nico Williams <nico at cryptonector.com> wrote:
> > SELECT 'added., a.* FROM
> >  (SELECT a.* FROM a EXCEPT SELECT b.* FROM b) a
> > UNION ALL
> > SELECT 'deleted', d.* FROM
> >  (SELECT b.* FROM b EXCEPT SELECT a.* FROM a) d;
> > 
> > In other words, the SQL analog of a straightforward diff(1) of two
> > [sorted] files.  I think this is the canonical way to do this in SQL,
> > though there are variants.
> 
> I don't know why anyone would want to do that in SQL.  If you need to
> diff two tables you also need to know about UPDATE, not just INSERT
> and DELETE.


I implied that all the columns were as if part of the primary key.  See
the follow-up posts.

> > Now, diff(1) (i.e., the Unix command) can do this very efficiently
> > because it just moves a "cursor" in both files looking for different
> > records (lines).
> 
> Text files have the concept of order.  SQL tables do not.  You are
> imposing an arbitrary idea of order on table rows which doesn't really
> make sense for data.

There was no need to be patronizing.  There's no need for CREATE INDEX
in pure SQL.  Over here in real life we have them because they *can*
make queries go faster (at some price, of course).  And in SQLite3 the
indices happen to support ordered traversal, which is useful for certain
optimizations, in particular the one we're discussing.  If SQLite3 had
only supported hash tables for indexing then I would not have bothered
posting this thread at all, naturally.

Why make me say any of this?  Especially when the rest of the thread was
actually quite useful.  Are you traumatized by newcomers to SQL?

> > SQLite3, on the other hand, does four table scans and further scans two
> > sub-queries' result row-sets.
> 
> Instead of the above try the following.  Suppose that the primary key
> (i.e. a significant unique value like a company name) for the above
> tables is a column 'pk', and suppose further that if this is not
> literally the primary key that there's a UNIQUE INDEX on it ...

Yes, one would think.  But actually SQLite3 doesn't treat UNIQUE INDEX
as the same as PRIMARY KEY for some of these queries (see the rest of
this thread).

> SELECT 'added., a.* FROM
>  (SELECT a.pk FROM a EXCEPT SELECT b.pk FROM b) a
> UNION ALL
> SELECT 'deleted', d.pk FROM
>  (SELECT b.pk FROM b EXCEPT SELECT a.pk FROM a) d;

After correcting the syntax and adapting as

SELECT 'added', a.* FROM
 (SELECT a.a, a.b FROM main.toy a EXCEPT SELECT b.a, b.b FROM aux.toy b) a
UNION ALL
SELECT 'deleted', d.* FROM
 (SELECT b.a, b.b FROM aux.toy b EXCEPT SELECT a.a, a.b FROM main.toy a) d;

I get:

3|0|0|SCAN TABLE toy AS a USING COVERING INDEX sqlite_autoindex_toy_1
4|0|0|SCAN TABLE toy AS b USING COVERING INDEX sqlite_autoindex_toy_1
2|0|0|COMPOUND SUBQUERIES 3 AND 4 USING TEMP B-TREE (EXCEPT)
1|0|0|SCAN SUBQUERY 2 AS a
7|0|0|SCAN TABLE toy AS b USING COVERING INDEX sqlite_autoindex_toy_1
8|0|0|SCAN TABLE toy AS a USING COVERING INDEX sqlite_autoindex_toy_1
6|0|0|COMPOUND SUBQUERIES 7 AND 8 USING TEMP B-TREE (EXCEPT)
5|0|0|SCAN SUBQUERY 6 AS d
0|0|0|COMPOUND SUBQUERIES 1 AND 5 (UNION ALL)

Which... is not as good as the query produced by sqldiff (see the rest
of the thread).

Nico
--

[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Reply via email to