Re: [sqlite] SELECT becomes very slow when converted to UPDATE

2018-06-23 Thread Barry
Do I correctly understand the intention of the UPDATE is that for
each my_id in meta_table, it will store the count of all the hashes that
are associated only with my_id and no other id's?

In that case, have you tried:

UPDATE
 meta_table
SET
 distinct_hashes = (
 SELECT
 COUNT(hash) AS num
 FROM
 temp.t_distinct_hashes_by_id d
 WHERE
 --No need for any NOT NULL, it'll fail equality comparison
 1 = (SELECT COUNT(my_id) FROM  temp.t_distinct_hashes_by_id sub
WHERE d.hash = sub.hash)
 AND
 meta_table.my_id = d.my_id
 )
;

Or you could try (although I doubt it'd be any faster):

distinct_hashes = (
 SELECT
 COUNT(*) AS num
  FROM (
 -- Technically not valid SQL,
 -- but SQLite allows you to select non aggregate rows not
include in the GROUP BY,
 -- and we're safe because of HAVING COUNT(*) = 1.
 SELECT my_id
 FROM
temp.t_distinct_hashes_by_id
 GROUP BY Hash
 HAVING COUNT(*) = 1
  ) d
  WHERE d.my_id = meta_table.my_id
 );

If my logic is correct both of these should work, and both avoid a SCAN
TABLE in your innermost loop.

Also, is it possible my_id and hash could be guaranteed NOT NULL in your
temp table? I'm not sure if that would speed up the query (although it
couldn't hurt), but it would certainly make the problem easier to think
about.

Can you share your overall goal? It looks like you're doing after like
SELECT my_id, COUNT(*) FROM (SELECT my_id FROM  GROUP BY hash
HAVING COUNT(DISTINCT (hash, my_id)) = 1) GROUP by my_id.
Which doesn't work because SQLite doesn't like COUNT(DISTINCT ).

On Sat, 23 Jun 2018 at 23:44, Jonathan Moules 
wrote:

> Hi List,
> I'm trying to find all hashes that are unique to a specific id (my_id),
> and then use a UPDATE-Join to update another table with that number.
>
> After much tweaking, I've simplified the table down to a basic temp
> table (actually created using a CREATE AS SELECT ... GROUP BY my_id, hash):
>
> CREATE TEMP TABLE t_distinct_hashes_by_id (
>  my_idINTEGER,
>  hashTEXT
> );
>
> And indexes in both directions because I'm still trying to optimise (and
> see what SQLite wants):
>
> CREATE UNIQUE INDEX temp.idx__1 ON t_distinct_hashes_by_id (
>  hash,
>   my_id
> );
>
> CREATE UNIQUE INDEX temp.idx__2 ON t_distinct_hashes_by_id (
>  my_id,
>  hash
> );
>
> There are only 20 values for my_id, but several hundred thousand hashes.
>
> -
>
> I can do a SELECT query which gets what I want and runs in about 0.5
> seconds:
>
> SELECT
> *
>  FROM
>  temp.t_distinct_hashes_by_id d
>  WHERE
>  hash NOT IN (SELECT hash FROM temp.t_distinct_hashes_by_id
> sub where 1 != sub.my_id and hash not NULL)
>  AND
>  1 = d.my_id
>
> The EXPLAIN:
> 000SEARCH TABLE t_distinct_hashes_by_id AS l USING COVERING
> INDEX idx__2 (my_id=?)
> 000EXECUTE LIST SUBQUERY 1
> 100SCAN TABLE t_distinct_hashes_by_id AS sub
>
> -
> So in theory, I'd hope that an UPDATE version using this select should
> take around 20 * 0.5 sec = 10 seconds. But it's actually taking... well
> I don't know how long, at least 10 minutes before I gave up waiting.
> This is the UPDATE:
>
> UPDATE
>  meta_table
> SET
>  distinct_hashes = (
>  SELECT
>  COUNT(hash) AS num
>  FROM
>  temp.t_distinct_hashes_by_id d
>  WHERE
>  hash NOT IN (SELECT hash FROM temp.t_distinct_hashes_by_id
> sub where meta_table.my_id != sub.my_id and hash not NULL)
>  AND
>  -- This one has to be at the bottom for some reason.
>  meta_table.my_id = d.my_id
>  )
> ;
>
> The query plan for this UPDATE includes two CORRELATED Subqueries, which
> the docs say are reassessed on every run - that seems like the problem.
> I get that it'd need to do that 20 times (once per my_id), but the
> slowdown seems considerably longer than that needs to warrant.
>
> 000SCAN TABLE meta_table
> 000EXECUTE CORRELATED SCALAR SUBQUERY 0
> 000SEARCH TABLE t_distinct_hashes_by_id AS l USING COVERING
> INDEX idx__2 (my_id=?)
> 000EXECUTE CORRELATED LIST SUBQUERY 1
> 100SCAN TABLE t_distinct_hashes_by_id AS sub
>
> Does anyone have any thoughts on how I can speed this up (SQLite 3.15.0
> and confirmed in 3.24.0 (which uses about 3 times the disk IO / sec for
> the same query/data))?
> Thanks,
> Jonathan
>
>
>
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list

[sqlite] SELECT becomes very slow when converted to UPDATE

2018-06-23 Thread Jonathan Moules

Hi List,
I'm trying to find all hashes that are unique to a specific id (my_id), 
and then use a UPDATE-Join to update another table with that number.


After much tweaking, I've simplified the table down to a basic temp 
table (actually created using a CREATE AS SELECT ... GROUP BY my_id, hash):


CREATE TEMP TABLE t_distinct_hashes_by_id (
    my_id            INTEGER,
    hash            TEXT
);

And indexes in both directions because I'm still trying to optimise (and 
see what SQLite wants):


CREATE UNIQUE INDEX temp.idx__1 ON t_distinct_hashes_by_id (
    hash,
 my_id
);

CREATE UNIQUE INDEX temp.idx__2 ON t_distinct_hashes_by_id (
    my_id,
    hash
);

There are only 20 values for my_id, but several hundred thousand hashes.

-

I can do a SELECT query which gets what I want and runs in about 0.5 
seconds:


SELECT
   *
        FROM
            temp.t_distinct_hashes_by_id d
        WHERE
            hash NOT IN (SELECT hash FROM temp.t_distinct_hashes_by_id 
sub where 1 != sub.my_id and hash not NULL)

            AND
            1 = d.my_id

The EXPLAIN:
0    0    0    SEARCH TABLE t_distinct_hashes_by_id AS l USING COVERING 
INDEX idx__2 (my_id=?)

0    0    0    EXECUTE LIST SUBQUERY 1
1    0    0    SCAN TABLE t_distinct_hashes_by_id AS sub

-
So in theory, I'd hope that an UPDATE version using this select should 
take around 20 * 0.5 sec = 10 seconds. But it's actually taking... well 
I don't know how long, at least 10 minutes before I gave up waiting. 
This is the UPDATE:


UPDATE
    meta_table
SET
    distinct_hashes = (
        SELECT
            COUNT(hash) AS num
        FROM
            temp.t_distinct_hashes_by_id d
        WHERE
            hash NOT IN (SELECT hash FROM temp.t_distinct_hashes_by_id 
sub where meta_table.my_id != sub.my_id and hash not NULL)

            AND
            -- This one has to be at the bottom for some reason.
            meta_table.my_id = d.my_id
    )
;

The query plan for this UPDATE includes two CORRELATED Subqueries, which 
the docs say are reassessed on every run - that seems like the problem. 
I get that it'd need to do that 20 times (once per my_id), but the 
slowdown seems considerably longer than that needs to warrant.


0    0    0    SCAN TABLE meta_table
0    0    0    EXECUTE CORRELATED SCALAR SUBQUERY 0
0    0    0    SEARCH TABLE t_distinct_hashes_by_id AS l USING COVERING 
INDEX idx__2 (my_id=?)

0    0    0    EXECUTE CORRELATED LIST SUBQUERY 1
1    0    0    SCAN TABLE t_distinct_hashes_by_id AS sub

Does anyone have any thoughts on how I can speed this up (SQLite 3.15.0 
and confirmed in 3.24.0 (which uses about 3 times the disk IO / sec for 
the same query/data))?

Thanks,
Jonathan



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users