What is the purpose of urls.origin_id? You need indexes on foreign keys. They do not make themselves.
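For the schema quoted below, that would mean something along these lines (index names are mine, and the statements are untested; urls.origin is already covered by the left edge of the UNIQUE(origin, origin_id) index, and canon_urls.url by its PRIMARY KEY):

  CREATE INDEX IF NOT EXISTS urls_url_idx          ON urls(url);
  CREATE INDEX IF NOT EXISTS canon_urls_canon_idx  ON canon_urls(canon);
  CREATE INDEX IF NOT EXISTS canon_urls_status_idx ON canon_urls(status);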
>-----Original Message-----
>From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Zack Weinberg
>Sent: Thursday, 6 February, 2014 13:40
>To: sqlite-users@sqlite.org
>Subject: [sqlite] Application optimization puzzle: reading jobs from a
>table while writing results back to same table
>
>I have a database with this (partial) schema:
>
>  CREATE TABLE origins (
>    id    INTEGER PRIMARY KEY,
>    label TEXT NOT NULL UNIQUE
>  );
>
>  CREATE TABLE url_strings (
>    id  INTEGER PRIMARY KEY,
>    url TEXT NOT NULL UNIQUE
>  );
>
>  CREATE TABLE canon_statuses (
>    id     INTEGER PRIMARY KEY,
>    status TEXT NOT NULL UNIQUE
>  );
>
>  CREATE TABLE urls (
>    origin    INTEGER NOT NULL REFERENCES origins(id),
>    origin_id INTEGER NOT NULL,
>    url       INTEGER NOT NULL REFERENCES url_strings(id),
>    UNIQUE (origin, origin_id)
>  );
>
>  CREATE TABLE canon_urls (
>    url    INTEGER PRIMARY KEY REFERENCES url_strings(id),
>    canon  INTEGER REFERENCES url_strings(id),
>    status INTEGER REFERENCES canon_statuses(id)
>  );
>
>The 'urls' and 'canon_urls' tables are separate because the same URL
>may appear many times (with different origin+origin_id pairs) in the
>former table, but only needs to appear once in the latter. There are
>no explicit indexes, only the ones implied by the UNIQUEs and PRIMARY
>KEYs. There are currently about 4,000,000 rows in 'url_strings',
>'urls', and 'canon_urls', and about 200 rows in 'canon_statuses'.
>
>There are several programs that make use of this database, but the one
>that's relevant right now is called "canonicalize", and what it does,
>in outline, is scan the 'urls' table for any new URLs that aren't
>already in the 'canon_urls' table, add them, then do some complicated
>processing on each entry in 'canon_urls' for which both 'canon' and
>'status' are currently NULL. (The short version is it chases HTTP
>redirects until there aren't any more.) Because the 'complicated
>processing' stage is slow and independent for each URL, it gets farmed
>out to a pool of worker threads, but (at present) there is only one
>thread that talks to the database. In pseudocode + SQL, the database
>thread's logic is as follows:
>
>  INSERT INTO canon_urls
>    SELECT DISTINCT url, NULL, NULL FROM urls
>    WHERE url NOT IN (SELECT url FROM canon_urls)
>
>  work_queue = Queue()
>  result_queue = Queue()
>  for (id, url) in (SELECT u.url, v.url
>                    FROM canon_urls AS u
>                    LEFT JOIN url_strings AS v ON u.url = v.id
>                    WHERE u.canon IS NULL AND u.status IS NULL):
>      work_queue.put( (id, url) )
>
>  repeat n_workers times:
>      spawn_worker(work_queue, result_queue)
>
>  while (url_id, canon_url, status) = result_queue.get():
>
>      # 'canon_url' can be NULL at this point, 'status' can't.
>      if canon_url == NULL:
>          canon_id = NULL
>      else:
>          canon_id = (SELECT id FROM url_strings
>                      WHERE url = result.canon_url)
>          if canon_id == NULL:
>              INSERT INTO url_strings VALUES(NULL, result.canon_url)
>              canon_id = last_row_id
>
>      status_id = (SELECT id FROM canon_statuses
>                   WHERE status = result.status)
>      if not status_id:
>          INSERT INTO canon_statuses VALUES(NULL, result.status)
>          status_id = last_row_id
>
>      UPDATE canon_urls SET canon = canon_id, status = status_id
>        WHERE url = url_id
>
>This works correctly but is inefficient. The work_queue - which, as
>you can see, is initialized to every yet-to-be-done row from
>'canon_urls' - occupies too much memory: when the job is started from
>scratch, approximately half of all available RAM on the rather small
>virtual machine where this needs to run.
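A side note on the result loop above: url_strings.url and canon_statuses.status are both UNIQUE, so the look-up-then-maybe-insert dance can collapse into an INSERT OR IGNORE followed by a plain SELECT. A sketch, untested, with parameter names of my own choosing:

  -- the same pattern works for canon_statuses
  INSERT OR IGNORE INTO url_strings(url) VALUES (:canon_url);
  SELECT id FROM url_strings WHERE url = :canon_url;

The OR IGNORE leans on the UNIQUE constraint to do nothing when the string is already present, so the SELECT returns the id either way.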
>The obvious fix is to feed the workers directly from the SELECT query
>that currently loads up the work queue; then the set of jobs-to-do
>need not sit in RAM. (There would still be a short work queue to keep
>all database accesses on one thread.) But this would mean
>simultaneously reading and writing the same table. If I hold a read
>cursor open on 'canon_urls', it is my understanding that the UPDATEs
>will fail; if I restart the read cursor after every UPDATE, I will be
>rescanning the table for each new job and the overall algorithm will
>become quadratic. Can anyone suggest a better way? Note that the
>overall job may be interrupted at any time, so writing completed rows
>to a scratch table and then copying them over at the end is
>impractical.
>
>Also, but much less important, the initial INSERT operation takes tens
>of seconds even when it has nothing to do; if there's a more efficient
>way to write that I would like to know about it. (I also tried
>
>  INSERT INTO canon_urls
>    SELECT DISTINCT url, NULL, NULL FROM urls
>    EXCEPT SELECT url, NULL, NULL FROM canon_urls
>
>but that is even slower.)
>
>Thanks,
>zw
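On the main question: you do not need to hold one read cursor open for the whole run, and you do not need to rescan from the top either. Since canon_urls.url is its INTEGER PRIMARY KEY, you can pull the pending rows in batches keyed on the last url id you handed out. A sketch, untested; the batch size and parameter name are arbitrary:

  SELECT u.url, v.url
    FROM canon_urls AS u
    LEFT JOIN url_strings AS v ON u.url = v.id
   WHERE u.canon IS NULL AND u.status IS NULL
     AND u.url > :last_url_id
   ORDER BY u.url
   LIMIT 100;

Start :last_url_id at 0, hand the batch to the work queue, finish (reset) the statement, apply whatever UPDATEs have come back, then re-run with :last_url_id set to the largest u.url from the previous batch. Each batch is a seek in canon_urls' primary key that picks up where the previous one stopped, so no read cursor is open while you write and nothing quadratic happens; finished rows acquire a non-NULL status and drop out of the WHERE clause if the job is interrupted and restarted.

As for the initial INSERT, something like this may be worth timing against the NOT IN version, since it lets the PRIMARY KEY reject duplicates instead of paying for the DISTINCT and the subquery (again untested):

  INSERT OR IGNORE INTO canon_urls(url) SELECT url FROM urls;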