--- Comment #10 from Sean Pringle <> ---
Pulled from db1021 with the old wb_% indexes after a few hours of activity:

| wikidatawiki | wb_terms   | tmp1                |  905372478 |
| wikidatawiki | wb_terms   | wb_terms_entity_id  | 1284769088 |
| wikidatawiki | wb_terms   | wb_terms_text       |     365238 |
| wikidatawiki | wb_terms   | wb_terms_language   |   85815191 |
| wikidatawiki | wb_terms   | wb_terms_search_key |      80851 |

(tmp1 is roughly equivalent to the proposed terms_search composite index)

Turns out we need to keep some indexes with term_entity_id, term_text, and
term_search_key in left-most positions, plus avoid having massive indexes (22G
already, more for the new composites) except where "Index Condition Pushdown"
(ICP) is useful. Therefore I propose these indexes instead:

-- Some wb_terms queries not listed in the bug report use term_entity_id=N,
-- which has good selectivity.
CREATE INDEX /*i*/terms_entity ON /*_*/wb_terms (term_entity_id);
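A sketch of the kind of equality lookup this index serves (the entity id value is hypothetical):

```sql
-- Hypothetical lookup: term_entity_id = N is highly selective,
-- so a plain single-column index covers it well.
SELECT term_type, term_language, term_text
FROM wb_terms
WHERE term_entity_id = 1234;
```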

-- When any wb_terms query includes a search on term_text greater than
-- four or five leading characters a simple index on term_text and
-- language is often better than the proposed composite indexes. Note
-- that it's fine to have term_language second as MariaDB still uses
-- the entire key length even with LIKE '...%' on term_text.
CREATE INDEX /*i*/terms_text ON /*_*/wb_terms (term_text, term_language);
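For example, a prefix search of this shape (values hypothetical) can range-scan on term_text and still use the trailing term_language key part:

```sql
-- Hypothetical prefix search. With four or more leading characters
-- in the LIKE pattern, MariaDB range-scans term_text and still
-- applies the term_language part of the composite key.
SELECT term_entity_id
FROM wb_terms
WHERE term_text LIKE 'Berlin%'
  AND term_language = 'en';
```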

-- Same idea as above for term_search_key. Not as much traffic as
-- term_text, but enough to be useful.
CREATE INDEX /*i*/terms_search_key ON /*_*/wb_terms (term_search_key,
term_language);

-- The getMatchingIDs query is horrible when it has just one or two
-- leading characters on term_search_key. This index is slightly
-- different to the proposed term_search for better selectivity
-- while still allowing ICP for short string values.
CREATE INDEX /*i*/terms_search ON /*_*/wb_terms (term_language, term_entity_id,
term_type, term_search_key(16));
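A sketch of the problem query shape (term_type value hypothetical): only term_language is an equality here, so term_type and the term_search_key prefix are filtered via ICP inside the storage engine rather than by fetching whole rows.

```sql
-- Hypothetical getMatchingIDs-style query with a short prefix.
-- Equality on term_language narrows the range; the term_type and
-- term_search_key(16) conditions are evaluated by Index Condition
-- Pushdown before any row lookups.
SELECT DISTINCT term_entity_id
FROM wb_terms
WHERE term_language = 'en'
  AND term_type = 'label'
  AND term_search_key LIKE 'a%';
```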

The above indexes give reasonable performance on all wb_terms queries I can
find and seem suitably generic for tables.sql.

However, for wikidatawiki in production it's still not fast enough for all
fringe cases due to the sheer dataset size. So, on db1026 I've partitioned
wb_terms based on hash(term_language). This:

1) Lets the optimizer use partition pruning like a free first-stage index on
term_language (because we use equality for term_language clauses). Reduces the
number of rows read for many of our queries except those already using ICP.

2) Allows terms_text and terms_search_key to drop the second term_language
field for slightly smaller footprints with comparable performance due to (1).
More stuff fits in the buffer pool for longer, basically.

3) Speeds up writes to wb_terms on the slave because insert/update/delete only
touches the smaller indexes in the relevant partition rather than the entire
btrees (this is only a slight benefit, but I figure every bit helps).

CREATE TABLE wb_terms (
  KEY terms_entity (term_entity_id),
  KEY terms_text (term_text),
  KEY terms_search_key (term_search_key),
  KEY terms_search (term_language, term_entity_id, term_type, term_search_key(16))

Note that I left term_language in terms_search despite the partitions because
the getMatchingIDs query still benefits from ICP when term_search_key is used
with a poorly performing short prefix: LIKE 'a%'.
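Pruning can be confirmed with EXPLAIN PARTITIONS; a sketch (predicate values hypothetical) of the kind of check run on the partitioned db1026 copy:

```sql
-- Hypothetical verification that the optimizer prunes to a single
-- partition when term_language appears as an equality predicate;
-- the partitions column of the plan should list only one partition.
EXPLAIN PARTITIONS
SELECT term_entity_id
FROM wb_terms
WHERE term_language = 'en'
  AND term_text LIKE 'Berlin%';
```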

Partitions aren't suitable for tables.sql as we still say we support MySQL
5.0+, but I think it's worthwhile for this particular table on WMF production.
