Lucas_Werkmeister_WMDE added a comment.

  It looks like `SELECT DISTINCT … WHERE … IN (…, …, …)` might not be the best 
query for “which of these IDs are still used?” when the IDs may be used 
millions of times. The “range” join type seems especially worrying, but even 
when it’s not used:
  
    [email protected](wikidatawiki)> EXPLAIN SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseItemTermStore::cleanTermsIfUnused */ DISTINCT wbit_term_in_lang_id AS value FROM wbt_item_terms WHERE wbit_term_in_lang_id IN ('828227');
    +------+-------------+----------------+------+----------------------------------------+----------------------------------------+---------+-------+---------+-------------------------------------------+
    | id   | select_type | table          | type | possible_keys                          | key                                    | key_len | ref   | rows    | Extra                                     |
    +------+-------------+----------------+------+----------------------------------------+----------------------------------------+---------+-------+---------+-------------------------------------------+
    |    1 | SIMPLE      | wbt_item_terms | ref  | wbt_item_terms_term_in_lang_id_item_id | wbt_item_terms_term_in_lang_id_item_id | 4       | const | 1330386 | Using where; Using index; Using temporary |
    +------+-------------+----------------+------+----------------------------------------+----------------------------------------+---------+-------+---------+-------------------------------------------+
    1 row in set (0.00 sec)
  
  Is it actually reading 1330386 rows, only to get the same constant value from each of them and then run it through `DISTINCT`? I sure hope not…
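  An alternative, assuming callers can tolerate one query per candidate ID, would be a plain existence probe with `LIMIT 1` — a sketch, not the current Wikibase code:
  
    -- Sketch (not the current Wikibase query): probe a single candidate ID.
    -- With LIMIT 1 the lookup on the wbt_item_terms_term_in_lang_id_item_id
    -- index can stop at the first matching row instead of visiting all
    -- ~1.3M rows that share a heavily used term_in_lang_id.
    SELECT wbit_term_in_lang_id
    FROM wbt_item_terms
    WHERE wbit_term_in_lang_id = 828227
    LIMIT 1;
  
  Any ID whose probe returns a row is still used; an empty result means the ID is safe to clean up. The downside is one round trip per ID, which is why a single-query shape would still be nicer.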
  
  Since we’re really only interested in the existence of any row with this (non-PK) 
ID, I wondered whether this could instead be written as some kind of `SELECT value, 
EXISTS (SELECT wbit_term_in_lang_id …)`. This turns out to be possible by turning 
the list of candidate IDs into a derived table (based on this SO answer 
<https://stackoverflow.com/a/25806882/1420237>):
  
    SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseItemTermStore::cleanTermsIfUnused */
        `value`,
        EXISTS (SELECT * FROM `wbt_item_terms` WHERE `wbit_term_in_lang_id` = `value`)
    FROM (
        SELECT '828227' AS `value` UNION SELECT '828232' AS `value` UNION SELECT '828257' AS `value`
        UNION SELECT '828265' AS `value` UNION SELECT '828269' AS `value` UNION SELECT '828272' AS `value`
        UNION SELECT '828276' AS `value` UNION SELECT '828287' AS `value` UNION SELECT '828290' AS `value`
        UNION SELECT '828293' AS `value` UNION SELECT '828297' AS `value` UNION SELECT '828302' AS `value`
        UNION SELECT '828314' AS `value` UNION SELECT '828315' AS `value` UNION SELECT '828319' AS `value`
        UNION SELECT '828333' AS `value` UNION SELECT '828336' AS `value` UNION SELECT '828337' AS `value`
        UNION SELECT '828342' AS `value` UNION SELECT '828350' AS `value` UNION SELECT '828358' AS `value`
        UNION SELECT '828362' AS `value` UNION SELECT '828363' AS `value` UNION SELECT '828368' AS `value`
        UNION SELECT '828374' AS `value` UNION SELECT '828377' AS `value` UNION SELECT '828380' AS `value`
        UNION SELECT '828382' AS `value` UNION SELECT '828386' AS `value` UNION SELECT '828387' AS `value`
        UNION SELECT '828388' AS `value` UNION SELECT '828390' AS `value` UNION SELECT '828401' AS `value`
        UNION SELECT '828404' AS `value` UNION SELECT '828405' AS `value` UNION SELECT '828414' AS `value`
        UNION SELECT '828428' AS `value` UNION SELECT '828433' AS `value` UNION SELECT '828435' AS `value`
        UNION SELECT '828438' AS `value` UNION SELECT '828448' AS `value` UNION SELECT '828449' AS `value`
        UNION SELECT '828450' AS `value` UNION SELECT '828451' AS `value` UNION SELECT '870671' AS `value`
        UNION SELECT '870674' AS `value` UNION SELECT '870679' AS `value`
    ) AS `values`;
  
  It’s ugly, but the `EXPLAIN` shows a “ref” join type instead of “range”, 
doesn’t report millions of rows to be examined, and when you actually run the 
query it finishes in 0.00 sec instead of 14.36 sec. (Though the `EXPLAIN` 
actually only reports 2 rows to be examined, which is definitely not enough, so 
I won’t pretend to completely understand this.) Is there a less silly way to 
arrive at a similar query plan that doesn’t scan millions of rows?
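  For what it’s worth, one slightly less silly spelling of the same idea (still a sketch, with hypothetical `ids`/`still_used` names) would use `UNION ALL` for the derived table, since the constants are already distinct and the deduplication pass of plain `UNION` buys nothing:
  
    -- Sketch: same EXISTS-per-value shape, but with UNION ALL so the
    -- server doesn't deduplicate a constant list that is already distinct.
    SELECT ids.value,
           EXISTS (
               SELECT 1
               FROM wbt_item_terms
               WHERE wbit_term_in_lang_id = ids.value
           ) AS still_used
    FROM (
                  SELECT 828227 AS value
        UNION ALL SELECT 828232
        -- … one UNION ALL SELECT per remaining candidate ID …
    ) AS ids;
  
  Whether the optimizer actually produces the same “ref” plan for this variant is an open question; it would need its own `EXPLAIN` on the real data.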

TASK DETAIL
  https://phabricator.wikimedia.org/T234948
