https://bugzilla.wikimedia.org/show_bug.cgi?id=57176
--- Comment #7 from Brad Jorsch <[email protected]> --- I thought I had posted an analysis of these API modules somewhere, but now I can't find it. As noted already, the query for ApiQueryExtLinksUsage looks something like this: SELECT page_id,page_namespace,page_title,el_to FROM `page`,`externallinks` WHERE (page_id=el_from) AND (el_index LIKE 'http://gov.nih.nlm.ncbi.%' ) LIMIT $OFFSET,501; ApiQueryExternalLinks has the same issue, BTW. Its queries look something like this: SELECT el_from, el_to FROM externallinks WHERE (el_from IN (1,2,3,4,5,6,7,...,5000)) AND (el_index LIKE 'http://gov.nih.nlm.ncbi.%') ORDER BY el_from LIMIT $OFFSET,501; Option B is to change the first to making queries like this for $NUM from 0 to 1023? SELECT page_id,page_namespace,page_title,el_to FROM `page`,`externallinks` WHERE (page_id=el_from) AND (el_index LIKE 'http://gov.nih.nlm.ncbi.%' ) AND el_bucket = $NUM LIMIT $OFFSET,501; It'd be doable, although we still have the offset in there so it's still ugly IMO. We'd also need another index to handle ApiQueryExternalLinks in a similar way: either (el_bucket, el_from, el_index(60)) or (el_from, el_bucket, el_index(60)) would work, I think. I note that the API doesn't really have any clue as to whether the like is likely to match many rows, so we'd either have to do some sort of pre-query to test it (e.g. "SELECT COUNT(*) FROM (SELECT 1 FROM externallinks WHERE el_index LIKE $FOO LIMIT 5001) AS tmp") or else always do the bucket method. Also, to avoid making clients do excessive numbers of queries to get back few rows from each, I'd prefer ApiQueryExtLinksUsage to do as many of the 1024 bucket queries it needs to fill up the requested limit. For example, if the 1000000 matching links were evenly spread across the buckets, ApiQueryExtLinksUsage called by someone with apihighlimits and eulimit=5000 would wind up making 6 queries: bucket 0 with limit 5001 getting about 977 rows, bucket 1 with limit 4024 getting another 977, bucket 2 with limit 3047, bucket 3 with limit 2070, bucket 4 with limit 1093, and bucket 5 with limit 116. And if the 1000000 matching links were all in bucket 999 (major hashing failure!), it'd do 1000 queries. I like Option A better, since it would let us get rid of that offset entirely and also to stop trusting the database to not arbitrarily change the row ordering when no ORDER BY is actually given. But couldn't we just use one long index field and do a range match if the query is shorter, much like Block::getRangeCond()? -- This is el_index truncated to 60 bytes el_index_60 VARBINARY(60) NOT NULL CREATE INDEX /*i*/el_index_60 ON /*_*/externallinks (el_index_60, el_id); CREATE INDEX /*i*/el_from_index_60 ON /*_*/externallinks (el_from, el_index_60, el_id); Then the queries would look something like this: -- Note '/' is '.' + 1 SELECT page_id,page_namespace,page_title,el_to,el_index_60,el_id FROM `page`,`externallinks` WHERE (page_id=el_from) AND (el_index_60 >= 'http://gov.nih.nlm.ncbi.' AND el_index_60 < 'http://gov.nih.nlm.ncbi/' AND el_index LIKE 'http://gov.nih.nlm.ncbi.%') AND (el_index_60 > 'http://gov.nih.nlm.ncbi.XXXX' OR (el_index_60 = 'http://gov.nih.nlm.ncbi.XXXX' AND el_id >= $ID)) ORDER BY el_index_60, el_id LIMIT 501; SELECT el_from,el_to,el_index_60,el_id FROM externallinks WHERE (el_from IN (1,2,3,4,5,6,7,...,5000)) AND (el_index_60 >= 'http://gov.nih.nlm.ncbi.' AND el_index_60 < 'http://gov.nih.nlm.ncbi/' AND el_index LIKE 'http://gov.nih.nlm.ncbi.%') AND (el_from > $FROM OR (el_from = $FROM AND (el_index_60 > 'http://gov.nih.nlm.ncbi.XXXX' OR (el_index_60 = 'http://gov.nih.nlm.ncbi.XXXX' AND el_id >= $ID)))) ORDER BY el_from, el_index_60, el_id LIMIT 501; -- In these, the query is over 60 characters so el_index_60 is constant SELECT page_id,page_namespace,page_title,el_to,el_id FROM `page`,`externallinks` WHERE (page_id=el_from) AND (el_index_60 = 'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/ba' AND el_index LIKE 'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/baz/%') AND (el_id >= $ID) ORDER BY el_id LIMIT 501; SELECT el_from,el_to,el_id FROM externallinks WHERE (el_from IN (1,2,3,4,5,6,7,...,5000)) AND (el_index_60 = 'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/ba' AND el_index LIKE 'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/baz/%') AND (el_from > $FROM OR (el_from = $FROM AND el_id >= $ID)) ORDER BY el_from, el_id LIMIT 501; -- You are receiving this mail because: You are the assignee for the bug. You are on the CC list for the bug. _______________________________________________ Wikibugs-l mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
