Do I correctly understand the intention of the UPDATE is that for
each my_id in meta_table, it will store the count of all the hashes that
are associated only with my_id and no other id's?

In that case, have you tried:

UPDATE
     meta_table
SET
     distinct_hashes = (
         SELECT
             COUNT(hash) AS num
         FROM
             temp.t_distinct_hashes_by_id d
         WHERE
             --No need for any NOT NULL, it'll fail equality comparison
             1 = (SELECT COUNT(my_id) FROM  temp.t_distinct_hashes_by_id sub
WHERE d.hash = sub.hash)
             AND
             meta_table.my_id = d.my_id
     )
;

Or you could try (although I doubt it'd be any faster):

distinct_hashes = (
         SELECT
             COUNT(*) AS num
          FROM (
             -- Technically not valid SQL,
             -- but SQLite allows you to select non aggregate rows not
include in the GROUP BY,
             -- and we're safe because of HAVING COUNT(*) = 1.
             SELECT my_id
             FROM
                temp.t_distinct_hashes_by_id
             GROUP BY Hash
             HAVING COUNT(*) = 1
          ) d
          WHERE d.my_id = meta_table.my_id
     );

If my logic is correct both of these should work, and both avoid a SCAN
TABLE in your innermost loop.

Also, is it possible my_id and hash could be guaranteed NOT NULL in your
temp table? I'm not sure if that would speed up the query (although it
couldn't hurt), but it would certainly make the problem easier to think
about.

Can you share your overall goal? It looks like you're doing after like
SELECT my_id, COUNT(*) FROM (SELECT my_id FROM <Dataset> GROUP BY hash
HAVING COUNT(DISTINCT (hash, my_id)) = 1) GROUP by my_id.
Which doesn't work because SQLite doesn't like COUNT(DISTINCT <row-value>).

On Sat, 23 Jun 2018 at 23:44, Jonathan Moules <jonathan-li...@lightpear.com>
wrote:

> Hi List,
> I'm trying to find all hashes that are unique to a specific id (my_id),
> and then use a UPDATE-Join to update another table with that number.
>
> After much tweaking, I've simplified the table down to a basic temp
> table (actually created using a CREATE AS SELECT ... GROUP BY my_id, hash):
>
> CREATE TEMP TABLE t_distinct_hashes_by_id (
>      my_id            INTEGER,
>      hash            TEXT
> );
>
> And indexes in both directions because I'm still trying to optimise (and
> see what SQLite wants):
>
> CREATE UNIQUE INDEX temp.idx__1 ON t_distinct_hashes_by_id (
>      hash,
>       my_id
> );
>
> CREATE UNIQUE INDEX temp.idx__2 ON t_distinct_hashes_by_id (
>          my_id,
>          hash
> );
>
> There are only 20 values for my_id, but several hundred thousand hashes.
>
> -----
>
> I can do a SELECT query which gets what I want and runs in about 0.5
> seconds:
>
> SELECT
>     *
>          FROM
>              temp.t_distinct_hashes_by_id d
>          WHERE
>              hash NOT IN (SELECT hash FROM temp.t_distinct_hashes_by_id
> sub where 1 != sub.my_id and hash not NULL)
>              AND
>              1 = d.my_id
>
> The EXPLAIN:
> 0    0    0    SEARCH TABLE t_distinct_hashes_by_id AS l USING COVERING
> INDEX idx__2 (my_id=?)
> 0    0    0    EXECUTE LIST SUBQUERY 1
> 1    0    0    SCAN TABLE t_distinct_hashes_by_id AS sub
>
> -----
> So in theory, I'd hope that an UPDATE version using this select should
> take around 20 * 0.5 sec = 10 seconds. But it's actually taking... well
> I don't know how long, at least 10 minutes before I gave up waiting.
> This is the UPDATE:
>
> UPDATE
>      meta_table
> SET
>      distinct_hashes = (
>          SELECT
>              COUNT(hash) AS num
>          FROM
>              temp.t_distinct_hashes_by_id d
>          WHERE
>              hash NOT IN (SELECT hash FROM temp.t_distinct_hashes_by_id
> sub where meta_table.my_id != sub.my_id and hash not NULL)
>              AND
>              -- This one has to be at the bottom for some reason.
>              meta_table.my_id = d.my_id
>      )
> ;
>
> The query plan for this UPDATE includes two CORRELATED Subqueries, which
> the docs say are reassessed on every run - that seems like the problem.
> I get that it'd need to do that 20 times (once per my_id), but the
> slowdown seems considerably longer than that needs to warrant.
>
> 0    0    0    SCAN TABLE meta_table
> 0    0    0    EXECUTE CORRELATED SCALAR SUBQUERY 0
> 0    0    0    SEARCH TABLE t_distinct_hashes_by_id AS l USING COVERING
> INDEX idx__2 (my_id=?)
> 0    0    0    EXECUTE CORRELATED LIST SUBQUERY 1
> 1    0    0    SCAN TABLE t_distinct_hashes_by_id AS sub
>
> Does anyone have any thoughts on how I can speed this up (SQLite 3.15.0
> and confirmed in 3.24.0 (which uses about 3 times the disk IO / sec for
> the same query/data))?
> Thanks,
> Jonathan
>
>
>
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to