[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Nico Williams Wed, 18 Nov 2015 14:11:16 -0600

Consider two tables with the same columns and primary keys, and then
consider this query:


SELECT 'added., a.* FROM
  (SELECT a.* FROM a EXCEPT SELECT b.* FROM b) a
UNION ALL
SELECT 'deleted', d.* FROM
  (SELECT b.* FROM b EXCEPT SELECT a.* FROM a) d;

In other words, the SQL analog of a straightforward diff(1) of two
[sorted] files.  I think this is the canonical way to do this in SQL,
though there are variants.

Now, diff(1) (i.e., the Unix command) can do this very efficiently
because it just moves a "cursor" in both files looking for different
records (lines).

SQLite3, on the other hand, does four table scans and further scans two
sub-queries' result row-sets.  If only the optimizer could recognize
this pattern, then it could surely do as well as diff(1).  This sort of
query strikes me as a common enough kind of query that implementing such
an optimization might be worthwhile, though the SQLite3 team will have
better knowledge of that.  I'd certainly appreciate such an
optimization.  (No, Postgres 9.4.4 doesn't seem to optimize this
substantially better either.)

A difference UDF would be one way to handle this.  Table-valued
functions are half the battle, but one still needs table-valued
arguments to UDFs to make it clean looking:

  -- Outputs two-column rows where the first column denotes addition vs
  -- deletion and the second is a row-valued column with the result
  -- columns of the corresponding row in the old or new table source:
  SELECT diff((SELECT ...), (SELECT ...));

Using "eval" and JSON one could define a diff() UDF that takes two
strings, compile the two strings as SQL statements and execute them,
stepping over the results of each just like diff(1) would:

  -- Outputs JSON arrays of addition/deletion marker and column array or
  -- object.
  SELECT diff('SELECT ...', 'SELECT ...');

and/or

  SELECT diff('old_view_or_table', 'new_view_or_table');

Perhaps SQL could grow a DIFFERENCE operator for this, though there are
syntactic/semantic issues to deal with.  Maybe:

  SELECT d.diff, o.c0, o.c1, .., o.cN
  FROM old_source o
  DIFFERENCE (/*deletion*/ '<', '>' /*addition*/)) d
  SELECT d.diff, n.c0, n.c1, .., n.cN
  FROM new_source n;

One might get a comm(1) equivalent too:

  SELECT d.diff, o.c0, o.c1, .., o.cN
  FROM old_source o
  DIFFERENCE (/*deletion*/ '<', '>' /*addition*/), '=' /*same*/) d
  SELECT d.diff, n.c0, n.c1, .., n.cN
  FROM new_source n;

Nico
--

[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Reply via email to