[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Nico Williams Wed, 18 Nov 2015 16:50:22 -0600

On Wed, Nov 18, 2015 at 03:23:33PM -0500, Richard Hipp wrote:
> If it's so common, why are you the first to request it :-)  And, yeah,
> according to the WWPD principal, if Postgres doesn't do it, we
> probably won't be in a big hurry to do it either...


I've written this sort of query many times.  I imagined I couldn't have
been the first to do so.  Maybe I am, but I doubt it.  I am suprised to
be the first to request it.  I suspect most people who need this either
accept the performance they get, or write the diff in application code.

I've wanted better performance for this kind of query for two years,
and I'm only just now getting around to asking for it.

> Note that you can get efficient performance by tweaking the query
> slightly.

There's several ways of writing this, and I've not tried all of them.

> The "sqldiff.exe" command-line utility has an (undocumented and
> unsupported) "--debug 2" option that will show you the SQL that it
> uses to do a diff between two tables.  Maybe look at that and tweak it
> for your use?

Well hey!  I'm not the first to need to diff two table sources after
all!  :)

Here's what sqldiff's query's explain query plan looks like this both,
with and without indices:

2|0|1|SCAN TABLE toy AS B
2|1|0|SEARCH TABLE toy AS A USING INTEGER PRIMARY KEY (rowid=?)
3|0|0|SCAN TABLE toy AS A
3|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 4
4|0|0|SEARCH TABLE toy AS B USING INTEGER PRIMARY KEY (rowid=?)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 (UNION ALL)
5|0|0|SCAN TABLE toy AS B
5|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 6
6|0|0|SEARCH TABLE toy AS A USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|COMPOUND SUBQUERIES 1 AND 5 (UNION ALL)

on a table created as:

CREATE TABLE toy(a,b);

So that's three table scans with a JOIN each onto the other.

That *is* faster than what I was getting, so I'll probably be using this
(I have to check what Postgres does too, since I need and use this kind
of query on Postgres as well).

It's not as fast as diff(1), but it may compare well to a

  $ diff <(sort ...) <(sort ...)

Of course, *with* sorted sources the sort is unnecessary and then
diff(1) gets faster again; with indices the diff query ought to get fast
too.

Now, if all the columns are part of the PRIMARY KEY and we use WITHOUT
ROWID

  CREATE TABLE TOY (a,b, PRIMARY KEY (a,b)) WITHOUT ROWID;

then the SQL emitted by sqldiff --debug 2 gets down to just two scans:

1|0|0|SCAN TABLE toy AS A USING COVERING INDEX sqlite_autoindex_toy_1
1|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE toy AS B USING COVERING INDEX sqlite_autoindex_toy_1 (a=? 
AND b=?)
3|0|0|SCAN TABLE toy AS B USING COVERING INDEX sqlite_autoindex_toy_1
3|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 4
4|0|0|SEARCH TABLE toy AS A USING COVERING INDEX sqlite_autoindex_toy_1 (a=? 
AND b=?)
0|0|0|COMPOUND SUBQUERIES 1 AND 3 (UNION ALL)

And that's starting to look pretty good!

In the tables I happen to care about most for this kind of query I can
make all the columns part of the primary key (if they aren't already),
so I may well be able to get diff(1)-like performance now -- I haven't
tested it yet, but I'll let you know.

The RDBMS could do better even with not all columns being in the primary
key: the RDBMS should be able to do as well as diff(1) when there's
suitable indices, and as good as diff(1) + sort(1) of inputs when there
aren't.

Thanks for the reference to sqldiff!

Nico
--

[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Reply via email to