https://bugzilla.wikimedia.org/show_bug.cgi?id=57176

--- Comment #7 from Brad Jorsch <bjor...@wikimedia.org> ---
I thought I had posted an analysis of these API modules somewhere, but now I
can't find it.

As noted already, the query for ApiQueryExtLinksUsage looks something like
this:

 SELECT page_id,page_namespace,page_title,el_to FROM `page`,`externallinks`
WHERE (page_id=el_from) AND (el_index LIKE 'http://gov.nih.nlm.ncbi.%' ) LIMIT
$OFFSET,501;

ApiQueryExternalLinks has the same issue, BTW. Its queries look something like
this:

 SELECT el_from, el_to FROM externallinks WHERE (el_from IN
(1,2,3,4,5,6,7,...,5000)) AND (el_index LIKE 'http://gov.nih.nlm.ncbi.%') ORDER
BY el_from LIMIT $OFFSET,501;

Option B is to change the first to making queries like this for $NUM from 0 to
1023?

 SELECT page_id,page_namespace,page_title,el_to FROM `page`,`externallinks`
WHERE (page_id=el_from) AND (el_index LIKE 'http://gov.nih.nlm.ncbi.%' ) AND
el_bucket = $NUM LIMIT $OFFSET,501;

It'd be doable, although we still have the offset in there so it's still ugly
IMO. We'd also need another index to handle ApiQueryExternalLinks in a similar
way: either (el_bucket, el_from, el_index(60)) or (el_from, el_bucket,
el_index(60)) would work, I think.

I note that the API doesn't really have any clue as to whether the like is
likely to match many rows, so we'd either have to do some sort of pre-query to
test it (e.g. "SELECT COUNT(*) FROM (SELECT 1 FROM externallinks WHERE el_index
LIKE $FOO LIMIT 5001) AS tmp") or else always do the bucket method.

Also, to avoid making clients do excessive numbers of queries to get back few
rows from each, I'd prefer ApiQueryExtLinksUsage to do as many of the 1024
bucket queries it needs to fill up the requested limit. For example, if the
1000000 matching links were evenly spread across the buckets,
ApiQueryExtLinksUsage called by someone with apihighlimits and eulimit=5000
would wind up making 6 queries: bucket 0 with limit 5001 getting about 977
rows, bucket 1 with limit 4024 getting another 977, bucket 2 with limit 3047,
bucket 3 with limit 2070, bucket 4 with limit 1093, and bucket 5 with limit
116. And if the 1000000 matching links were all in bucket 999 (major hashing
failure!), it'd do 1000 queries.


I like Option A better, since it would let us get rid of that offset entirely
and also to stop trusting the database to not arbitrarily change the row
ordering when no ORDER BY is actually given. But couldn't we just use one long
index field and do a range match if the query is shorter, much like
Block::getRangeCond()?

-- This is el_index truncated to 60 bytes
el_index_60 VARBINARY(60) NOT NULL

CREATE INDEX /*i*/el_index_60 ON /*_*/externallinks (el_index_60, el_id);
CREATE INDEX /*i*/el_from_index_60 ON /*_*/externallinks (el_from, el_index_60,
el_id);

Then the queries would look something like this:

 -- Note '/' is '.' + 1
 SELECT page_id,page_namespace,page_title,el_to,el_index_60,el_id FROM
`page`,`externallinks` WHERE (page_id=el_from) AND (el_index_60 >=
'http://gov.nih.nlm.ncbi.' AND el_index_60 < 'http://gov.nih.nlm.ncbi/' AND
el_index LIKE 'http://gov.nih.nlm.ncbi.%') AND (el_index_60 >
'http://gov.nih.nlm.ncbi.XXXX' OR (el_index_60 = 'http://gov.nih.nlm.ncbi.XXXX'
AND el_id >= $ID)) ORDER BY el_index_60, el_id LIMIT 501;

 SELECT el_from,el_to,el_index_60,el_id FROM externallinks WHERE (el_from IN
(1,2,3,4,5,6,7,...,5000)) AND (el_index_60 >= 'http://gov.nih.nlm.ncbi.' AND
el_index_60 < 'http://gov.nih.nlm.ncbi/' AND el_index LIKE
'http://gov.nih.nlm.ncbi.%') AND (el_from > $FROM OR (el_from = $FROM AND
(el_index_60 > 'http://gov.nih.nlm.ncbi.XXXX' OR (el_index_60 =
'http://gov.nih.nlm.ncbi.XXXX' AND el_id >= $ID)))) ORDER BY el_from,
el_index_60, el_id LIMIT 501;

 -- In these, the query is over 60 characters so el_index_60 is constant
 SELECT page_id,page_namespace,page_title,el_to,el_id FROM
`page`,`externallinks` WHERE (page_id=el_from) AND (el_index_60 =
'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/ba' AND el_index
LIKE 'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/baz/%') AND
(el_id >= $ID) ORDER BY el_id LIMIT 501;

 SELECT el_from,el_to,el_id FROM externallinks WHERE (el_from IN
(1,2,3,4,5,6,7,...,5000)) AND (el_index_60 =
'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/ba' AND el_index
LIKE 'http://gov.nih.nlm.ncbi./foo/bar/xxxxxxxxxxxxxxxxxxxxxxxx/baz/%') AND
(el_from > $FROM OR (el_from = $FROM AND el_id >= $ID)) ORDER BY el_from, el_id
LIMIT 501;

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l

Reply via email to