What is the purpose of urls.origin_id?

You need indexes on your foreign-key columns.  SQLite does not create them for you.
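
For the schema quoted below, the UNIQUE and PRIMARY KEY constraints already
cover urls.origin (as the left-most column of UNIQUE (origin, origin_id)) and
canon_urls.url, but the other foreign-key columns have nothing.  Something
along these lines would do it (the index names here are just placeholders):

  CREATE INDEX IF NOT EXISTS urls_url_idx          ON urls(url);
  CREATE INDEX IF NOT EXISTS canon_urls_canon_idx  ON canon_urls(canon);
  CREATE INDEX IF NOT EXISTS canon_urls_status_idx ON canon_urls(status);

Whether each of these earns its keep depends on the queries you actually run,
but urls(url) in particular may help the INSERT ... SELECT DISTINCT below.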

>-----Original Message-----
>From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-
>boun...@sqlite.org] On Behalf Of Zack Weinberg
>Sent: Thursday, 6 February, 2014 13:40
>To: sqlite-users@sqlite.org
>Subject: [sqlite] Application optimization puzzle: reading jobs from a
>table while writing results back to same table
>
>I have a database with this (partial) schema:
>
>  CREATE TABLE origins (
>    id         INTEGER PRIMARY KEY,
>    label      TEXT NOT NULL UNIQUE
>  );
>
>  CREATE TABLE url_strings (
>    id         INTEGER PRIMARY KEY,
>    url        TEXT NOT NULL UNIQUE
>  );
>
>  CREATE TABLE canon_statuses (
>    id         INTEGER PRIMARY KEY,
>    status     TEXT NOT NULL UNIQUE
>  );
>
>  CREATE TABLE urls (
>    origin     INTEGER NOT NULL REFERENCES origins(id),
>    origin_id  INTEGER NOT NULL,
>    url        INTEGER NOT NULL REFERENCES url_strings(id),
>    UNIQUE (origin, origin_id)
>  );
>
>  CREATE TABLE canon_urls (
>    url        INTEGER PRIMARY KEY REFERENCES url_strings(id),
>    canon      INTEGER REFERENCES url_strings(id),
>    status     INTEGER REFERENCES canon_statuses(id)
>  );
>
>The 'urls' and 'canon_urls' tables are separate because the same URL
>may appear many times (with different origin+origin_id pairs) in the
>former table, but only needs to appear once in the latter.  There are
>no explicit indexes, only the ones implied by the UNIQUEs and PRIMARY
>KEYs.  There are currently about 4,000,000 rows in 'url_strings',
>'urls', and 'canon_urls', and about 200 rows in 'canon_statuses'.
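
If it helps to double-check exactly which implied indexes you already have,
SQLite will list them, automatic sqlite_autoindex_* entries included:

  PRAGMA index_list('urls');
  PRAGMA index_list('url_strings');

canon_urls will not show one, since its INTEGER PRIMARY KEY is the rowid
itself.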
>
>There are several programs that make use of this database, but the one
>that's relevant right now is called "canonicalize", and what it does,
>in outline, is scan the 'urls' table for any new URLs that aren't
>already in the 'canon_urls' table, add them, then do some complicated
>processing on each entry in 'canon_urls' for which both 'canon' and
>'status' are currently NULL.  (The short version is it chases HTTP
>redirects until there aren't any more.)  Because the 'complicated
>processing' stage is slow and independent for each URL, it gets farmed
>out to a pool of worker threads, but (at present) there is only one
>thread that talks to the database.  In pseudocode + SQL, the database
>thread's logic is as follows:
>
>    INSERT INTO canon_urls
>      SELECT DISTINCT url, NULL, NULL FROM urls
>        WHERE url NOT IN (SELECT url FROM canon_urls)
>
>    work_queue = Queue()
>    result_queue = Queue()
>    for (id, url) in (SELECT u.url, v.url
>                        FROM canon_urls AS u
>                        LEFT JOIN url_strings AS v ON u.url = v.id
>                        WHERE u.canon IS NULL AND u.status IS NULL):
>      work_queue.put( (id, url) )
>
>    repeat n_workers times:
>      spawn_worker(work_queue, result_queue)
>
>    while (url_id, canon_url, new_status) = result_queue.get():
>
>      # 'canon_url' can be NULL at this point, 'new_status' can't.
>      if canon_url == NULL:
>        canon_id = NULL
>      else:
>        canon_id = (SELECT id FROM url_strings WHERE url = canon_url)
>        if canon_id == NULL:
>          INSERT INTO url_strings VALUES(NULL, canon_url)
>          canon_id = last_row_id
>
>      status_id = (SELECT id FROM canon_statuses WHERE status = new_status)
>      if not status_id:
>        INSERT INTO canon_statuses VALUES(NULL, new_status)
>        status_id = last_row_id
>
>      UPDATE canon_urls SET canon = canon_id, status = status_id
>        WHERE url = url_id
>
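
As an aside, each lookup-or-insert block above can be collapsed into the
usual two-statement SQLite idiom.  A sketch, with named parameters that are
not in your pseudocode:

  -- Let the UNIQUE constraint absorb the duplicate, then read the id back.
  INSERT OR IGNORE INTO url_strings(url) VALUES (:canon_url);
  SELECT id FROM url_strings WHERE url = :canon_url;

  -- Same pattern for the status lookup.
  INSERT OR IGNORE INTO canon_statuses(status) VALUES (:new_status);
  SELECT id FROM canon_statuses WHERE status = :new_status;
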
>This works correctly but is inefficient.  The work_queue - which, as
>you can see, is initialized to every yet-to-be-done row from
>'canon_urls' - occupies too much memory: when the job is started from
>scratch, approximately half of all available RAM on the rather small
>virtual machine where this needs to run. The obvious fix is to feed
>the workers directly from the SELECT query that currently loads up the
>work queue; then the set of jobs-to-do need not sit in RAM.  (There
>would still be a short work queue to keep all database accesses on one
>thread.)  But this would mean simultaneously reading and writing the
>same table.  If I hold a read cursor open on 'canon_urls', it is my
>understanding that the UPDATEs will fail; if I restart the read cursor
>after every UPDATE, I will be rescanning the table for each new job
>and the overall algorithm will become quadratic.  Can anyone suggest a
>better way?  Note that the overall job may be interrupted at any time,
>so writing completed rows to a scratch table and then copying them over
>at the end is impractical.
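
On the read-while-write question: one variant of "restart the read cursor"
that does not go quadratic is to re-run the pending-work query in bounded
batches.  Rows you have already updated stop matching the WHERE clause, so
each re-run sees only what is left, and if the NULL test is indexed (a
partial index, if your SQLite is 3.8.0 or newer) the re-run itself is cheap.
A sketch, with an arbitrary batch size:

  CREATE INDEX IF NOT EXISTS canon_urls_todo
      ON canon_urls(url) WHERE canon IS NULL AND status IS NULL;

  SELECT u.url, v.url
    FROM canon_urls AS u
    JOIN url_strings AS v ON v.id = u.url
   WHERE u.canon IS NULL AND u.status IS NULL
   LIMIT 100;

Each batch goes through the short work queue as before, and since completed
rows are marked in place, an interruption costs at most the batch in flight.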
>
>Also, but much less important, the initial INSERT operation takes tens
>of seconds even when it has nothing to do; if there's a more efficient
>way to write that I would like to know about it.  (I also tried
>
>   INSERT INTO canon_urls
>     SELECT DISTINCT url, NULL, NULL FROM urls
>     EXCEPT SELECT url, NULL, NULL FROM canon_urls
>
>but that is even slower.)
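
For the initial INSERT, one rewrite worth timing (a sketch only, I have not
measured it against your data) is to let the canon_urls PRIMARY KEY reject
the duplicates instead of computing DISTINCT and NOT IN over 4,000,000 rows:

  INSERT OR IGNORE INTO canon_urls (url, canon, status)
    SELECT url, NULL, NULL FROM urls;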
>
>Thanks,
>zw


