| Catrope added a comment. |
Thanks for the detailed analysis @Bawolff and @jcrespo . My $0.02:
Given the overwhelming number of Wikidata entries in the RC table (95% or more on some wikis, I was very surprised by that), I think that's the main problem that we should tackle. @Bawolff's split query approach is also promising, but it's kind of orthogonal as it speeds up the whole query more or less independent of the Wikidata issue, and I'm a bit concerned that the code is kind of scary and possibly fragile.
There are four main ways that I can see us tackling the Wikidata flood problem:
- Reducing wasteful Wikidata entires (unused languages, unused properties)
- Segregating Wikidata RC entries (and possibly categorization too, since that's significant on some wikis) into a separate table
- Adding rc_type or rc_source (which one? these two fields seem largely redundant with each other) to the start of some indexes, so that queries that don't need to look at Wikidata rows can filter them out efficiently
- Disable Wikidata RC on large wikis until we have a more scalable implementation of the feature
#1 is something that should be done regardless because it would make users happy (people already complain about being spammed with updates to languages they don't speak), and it sounds like it's already being worked on. I don't know how hard that is or how long it would take, but AFAIK the Wikidata team started working on it a month ago, so it might take a while yet. Every reduction in wasteful entries would only affect new entries, so it would slowly improve query performance over days or weeks after deployment, as the older generation of wasteful entries falls off the time horizons of these queries. And a very large reduction would need to be achieved to address the issue properly. Right now, Wikidata is 95% of the RC table on commons; if the number of Wikidata rows were reduced by a factor of 10, it would still be 67%, and we'd need a 21x reduction just to get below 50%. So while this seems like a good long-term plan, I worry that it wouldn't make a dent for quite some time.
#2 would require creating a new table, and would require changing the code that generates RC entries (to write them to the right table) and all the code that queries the RC table (to query the right table, or when needed to query both tables with a UNION). The latter could be done in a similar way to @Bawolff's proof of concept for the split query, by rewriting the query right before it's executed. It'd require some trickery and could be a bit fragile, but we'd get an immediate performance boost for queries that exclude Wikidata (they'd need to scan 20x fewer rows on the worst-flooded wikis) and hopefully still decent performance for queries that don't.
#3 should in theory (AIUI) have the same performance benefits as #2, but without having to change any code: if an index starting with rc_type is used, the query should behave roughly as if every type of query is in its own table. I wonder if this would work in practice though, because I don't expect that MySQL would use an index to satisfy a != condition (even if that value is 95% of the table); but maybe rc_type IN (all, other, values) would work?
@jcrespo: How well do you think an index could approximate the effect of a separate table? Would a PARTITION like we use on the revision table work, perhaps?
As for #4, that should be pretty easy to do (IIRC Wikidata RC generation is disableable, and a background script could delete old rows starting with the most recent ones) and should alleviate the issue pretty quickly too, but it would also be disruptive to people who use the feature; on the other hand, query timeouts on the watchlist are also disruptive, to a different group of people. So I think we should only do this if there isn't another way that we can get the issue under control in a reasonable amount of time.
Cc: Ankry, Ladsgroup, Lsanabria, Josve05a, Bawolff, greg, Dominicbm, Vriullop, Jmabel, Fae, IKhitron, Johan, Herzi.Pinki, jmatazzoni, Trizek-WMF, Mattflaschen-WMF, KTC, Framawiki, zhuyifei1999, Marostegui, aaron, Andrei_Stroe, Turbojet, Rsocol, Strainu, Jwh, Pyb, Darwinius, Arbnos, Jdforrester-WMF, TheDJ, gerritbot, VladXe, Kf8, Liuxinyu970226, Jay8g, TerraCodes, Iniquity, jcrespo, Reedy, Catrope, Vicpeters, Demidenko, MaxBioHazard, Aklapper, Waytogoeducation, GoranSMilovanovic, Abiyoyo, QZanden, Vali.matei, Jack_who_built_the_house, Poyekhali, Volker_E, Izno, Wong128hk, Luke081515, Wikidata-bugs, Base, aude, GWicke, El_Grafo, Gryllida, putnik, Steinsplitter, Mbch331, Krenair
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
