Lucas_Werkmeister_WMDE added a comment.
It looks like `SELECT DISTINCT … WHERE … IN (…, …, …)` might not be the best
query for “which of these IDs are still used?” when the IDs may be used
millions of times. The “range” join type seems especially worrying, but even
when it’s not used:
[email protected](wikidatawiki)> EXPLAIN SELECT /*
Wikibase\Lib\Store\Sql\Terms\DatabaseItemTermStore::cleanTermsIfUnused */
DISTINCT wbit_term_in_lang_id AS value FROM wbt_item_terms WHERE
wbit_term_in_lang_id IN ('828227');
+------+-------------+----------------+------+----------------------------------------+----------------------------------------+---------+-------+---------+-------------------------------------------+
| id | select_type | table | type | possible_keys
| key | key_len | ref | rows
| Extra |
+------+-------------+----------------+------+----------------------------------------+----------------------------------------+---------+-------+---------+-------------------------------------------+
| 1 | SIMPLE | wbt_item_terms | ref |
wbt_item_terms_term_in_lang_id_item_id | wbt_item_terms_term_in_lang_id_item_id
| 4 | const | 1330386 | Using where; Using index; Using temporary |
+------+-------------+----------------+------+----------------------------------------+----------------------------------------+---------+-------+---------+-------------------------------------------+
1 row in set (0.00 sec)
Is it actually reading 1330386 rows, to get the same constant value from all
of them and then run it through `DISTINCT`? I sure hope not…
Since we’re really only interested in existence of any row with this (non-PK)
ID, I wondered if this can instead be written as some kind of `SELECT value,
EXISTS (SELECT wbit_term_in_lang_id …)`. This is possible when turning the list
of potential IDs into a temporary table (based on this SO answer
<https://stackoverflow.com/a/25806882/1420237>):
SELECT /*
Wikibase\Lib\Store\Sql\Terms\DatabaseItemTermStore::cleanTermsIfUnused */
`value`, EXISTS (SELECT * FROM `wbt_item_terms` WHERE `wbit_term_in_lang_id` =
`value`) FROM (SELECT '828227' AS `value` UNION SELECT '828232' AS `value`
UNION SELECT '828257' AS `value` UNION SELECT '828265' AS `value` UNION SELECT
'828269' AS `value` UNION SELECT '828272' AS `value` UNION SELECT '828276' AS
`value` UNION SELECT '828287' AS `value` UNION SELECT '828290' AS `value` UNION
SELECT '828293' AS `value` UNION SELECT '828297' AS `value` UNION SELECT
'828302' AS `value` UNION SELECT '828314' AS `value` UNION SELECT '828315' AS
`value` UNION SELECT '828319' AS `value` UNION SELECT '828333' AS `value` UNION
SELECT '828336' AS `value` UNION SELECT '828337' AS `value` UNION SELECT
'828342' AS `value` UNION SELECT '828350' AS `value` UNION SELECT '828358' AS
`value` UNION SELECT '828362' AS `value` UNION SELECT '828363' AS `value` UNION
SELECT '828368' AS `value` UNION SELECT '828374' AS `value` UNION SELECT
'828377' AS `value` UNION SELECT '828380' AS `value` UNION SELECT '828382' AS
`value` UNION SELECT '828386' AS `value` UNION SELECT '828387' AS `value` UNION
SELECT '828388' AS `value` UNION SELECT '828390' AS `value` UNION SELECT
'828401' AS `value` UNION SELECT '828404' AS `value` UNION SELECT '828405' AS
`value` UNION SELECT '828414' AS `value` UNION SELECT '828428' AS `value` UNION
SELECT '828433' AS `value` UNION SELECT '828435' AS `value` UNION SELECT
'828438' AS `value` UNION SELECT '828448' AS `value` UNION SELECT '828449' AS
`value` UNION SELECT '828450' AS `value` UNION SELECT '828451' AS `value` UNION
SELECT '870671' AS `value` UNION SELECT '870674' AS `value` UNION SELECT
'870679' AS `value`) AS `values`;
It’s ugly, but the `EXPLAIN` shows a “ref” join type instead of “range”,
doesn’t report millions of rows to be examined, and when you actually run the
query it finishes in 0.00 sec instead of 14.36 sec. (Though the explain
actually only reports 2 rows to be examined, which is definitely not enough, so
I won’t pretend to completely understand this.) Is there a less silly way to
arrive at a similar query plan that doesn’t scan millions of rows?
TASK DETAIL
https://phabricator.wikimedia.org/T234948
EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: Ladsgroup, Lucas_Werkmeister_WMDE
Cc: Lucas_Werkmeister_WMDE, Addshore, #dba, Liuxinyu970226, LarsWirzenius,
Lydia_Pintscher, alaa_wmde, Jdforrester-WMF, WMDE-leszek, Marostegui,
Ladsgroup, Aklapper, jcrespo, Hook696, Daryl-TTMG, RomaAmorRoma, 0010318400,
E.S.A-Sheild, Iflorez, darthmon_wmde, Meekrab2012, joker88john, DannyS712,
CucyNoiD, Nandana, NebulousIris, Gaboe420, Versusxo, Majesticalreaper22,
Giuliamocci, Adrian1985, Cpaulf30, Lahi, Gq86, Af420, Darkminds3113, Bsandipan,
Lordiis, Pablo-WMDE, GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6,
QZanden, LawExplorer, WSH1906, Lewizho99, Maathavan, _jensen, rosalieper,
Jonas, Wikidata-bugs, aude, Mbch331, Jay8g, Krenair
_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs