https://bugzilla.wikimedia.org/show_bug.cgi?id=60618

--- Comment #7 from Sean Pringle <sprin...@wikimedia.org> ---
Aaron pinged me on IRC. Some observations made during that discussion follow.

The pain points are now mostly mid-sized wikis (eg, *wiktionary, metawiki) that
have:

* large numbers of page or template links (hundreds of millions)
* comparatively few pages (less than 10 million)
* data skewed toward one or two namespaces
* data skewed toward a small set of titles

enwiktionary> select tl_namespace, count(*) as links
   from templatelinks group by tl_namespace;
+--------------+-----------+
| tl_namespace | links     |
+--------------+-----------+
|            0 |      1696 |
|            1 |        56 |
|            2 |     14901 |
|            3 |       129 |
|            4 |      1462 |
|            5 |        12 |
|            8 |       103 |
|           10 |  23967908 |
|           11 |         1 |
|           14 |         5 |
|           15 |         4 |
|           90 |         3 |
|          100 |        60 |
|          101 |         2 |
|          104 |      1546 |
|          106 |         6 |
|          110 |         1 |
|          828 | 138128211 |
+--------------+-----------+

When hitting 10 or 828 and a title with millions of links MariaDB may fall back
on an index scan on `page`. These mid-sized wikis liked having STRAIGHT_JOIN
even if others didn't :-)

Could we tolerate the possibility of stale links and skip the JOIN on `page`
altogther? Might need a denormalised tl_from_namespace field. Then pull out the
page fields in a second batch WHERE page_id IN (...), or use a sub-query to do
the same thing:

SELECT 
   page_id, page_namespace, page_title, rd_from
FROM (
   SELECT tl_from, rd_from
   FROM `templatelinks`
   LEFT JOIN `redirect` 
      ON rd_from = tl_from
         AND rd_namespace = tl_namespace
         AND rd_title = tl_title
         AND (rd_interwiki = '' OR rd_interwiki IS NULL)
   WHERE tl_namespace = '828'
      AND tl_title = 'languages/data3/i'
   ORDER BY tl_from
   LIMIT 100
) tmp
JOIN page ON tl_from = page_id
ORDER BY page_id
LIMIT 51;

Increase the inner LIMIT if stale links are a problem; 500 or 1000 would be
fine.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l

Reply via email to