[Wikidata-bugs] [Maniphest] T331356: Wikidata seems to still be utilizing insecure HTTP URIs
BBlack added a comment. In T331356#8718619 <https://phabricator.wikimedia.org/T331356#8718619>, @MisterSynergy wrote: > Some remarks: > > - We should consider these canonical HTTP URIs to be //names// in the first place, which are unique worldwide and issued by the Wikidata project as the "owner" [1] of the wikidata.org domain. The purpose of these //names// is to identify things. If they're only names, that's relatively fine. However, there are user agents that end up following them as access URIs. If we could control every agent, we could require that they all upconvert to HTTPS for access, but we can't. > - Following linked data principles, it is no coincidence that these names happen to be valid URIs. These are meant to be used to look up information about the named entity. It is okay to redirect a canonical URI to another location, including of course to a secure HTTPS location. The problem with relying on redirects is that they're insecure. The initial request goes over the wire in the clear, as does the initial redirect response. Both can be hijacked, modified, censored, and surveilled before the redirect to HTTPS ever happens. An advanced agent on the wire (like a national telecom) can even persistently hijack a whole session this way, by proxying the traffic into our servers as HTTPS. We support redirects as a "better than breakage/nothing" solution, but ideally UAs shouldn't ever use insecure HTTP to begin with. This is why all of our canonical URIs (in the HTTP/HTML sense) begin with `https`, as evidenced by the `https://...` canonical link tags on all normal pageviews. > - To my understanding, HSTS can be used to secure all but the first request of a client (that supports HSTS). It can be, and we even participate in HSTS Preload for all of our canonical domains, which protects even the first request to a domain from browsers that use the preload list. 
However, there are many clients, especially bots and scripted tools, which rely on HTTP libraries or CLI tools that do not, by default, honor HSTS or load the preload list. > - Canonical HTTP URIs are still widespread in many other linked data resources, since many projects started issuing them before everything transitioned to HTTPS. Some projects have transitioned to canonical HTTPS URIs, however, with GND doing so in 2019 being a prominent example [3]. That would be the ideal end-outcome: transitioning the URIs to be HTTPS everywhere. Barring that, we could also look at where and how they're being emitted. We may have HTML page outputs which render these canonical URIs for access purposes, where it would make sense to convert them to HTTPS as part of the rendering process to cut down on the problem. TASK DETAIL https://phabricator.wikimedia.org/T331356 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: BBlack Cc: OlafJanssen, MisterSynergy, BCornwall, Bugreporter, Ennomeijers, Nikki, Volans, Aklapper, BBlack, Astuthiodit_1, KOfori, karapayneWMDE, joanna_borun, Invadibot, Devnull, maantietaja, Muchiri124, ItamarWMDE, Akuckartz, Legado_Shulgin, ReaperDawn, Nandana, Davinaclare77, Techguru.pc, Lahi, Gq86, GoranSMilovanovic, Hfbn0, QZanden, LawExplorer, Zppix, _jensen, rosalieper, Scott_WUaS, Wong128hk, Wikidata-bugs, aude, faidon, Mbch331, Jay8g, fgiunchedi ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
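As an aside on the upconversion point above: a client whose HTTP stack doesn't honor HSTS can still refuse to ever dereference `http://` by rewriting the scheme itself before any request goes on the wire. A minimal sketch (the domain set here is illustrative, not an official list):

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative set of domains known to serve everything over HTTPS;
# this is NOT an official list -- extend as appropriate.
HTTPS_ONLY_DOMAINS = {"www.wikidata.org", "wikidata.org"}

def upgrade_uri(uri: str) -> str:
    """Rewrite an insecure http:// access URI to https:// before any
    request is sent, so no plaintext request ever hits the wire."""
    parts = urlsplit(uri)
    if parts.scheme == "http" and parts.hostname in HTTPS_ONLY_DOMAINS:
        return urlunsplit(("https",) + parts[1:])
    return uri
```

This keeps the canonical `http://` //name// intact as an identifier while ensuring the //access// always happens over HTTPS, with no insecure roundtrip and no reliance on the 301.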
[Wikidata-bugs] [Maniphest] T330906: HTTP URIs do not resolve from NL and DE?
BBlack closed this task as "Resolved". BBlack added a comment. The redirects are neither //good// nor //bad//, they're instead both necessary (although that necessity is waning) and insecure. We thought we had standardized on all canonical URIs being of the secure variant ~8 years ago, and this oversight has flown under the radar since then, only to be exposed recently when we intentionally (for unrelated operational reasons) partially degraded our port 80 services. I've made a new ticket, since that seems better all around. Let's move the rest of this discussion there. TASK DETAIL https://phabricator.wikimedia.org/T330906
[Wikidata-bugs] [Maniphest] T331356: Wikidata seems to still be utilizing insecure HTTP URIs
BBlack created this task. BBlack triaged this task as "High" priority. BBlack added projects: Wikidata, Traffic. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: wdwb-tech. TASK DESCRIPTION It has come to our attention via T330906 <https://phabricator.wikimedia.org/T330906> that some part of the Wikidata software/ecosystem is emitting insecure HTTP URIs that some UAs are consuming for insecure access. We need to find a way to secure these accesses. We also need to understand a little more about the nature of the use of these URIs as identifiers and what the challenges are in changing them at some level (either rewriting them just for output purposes, or changing them in a deeper way). TASK DETAIL https://phabricator.wikimedia.org/T331356
[Wikidata-bugs] [Maniphest] T330906: HTTP URIs do not resolve from NL and DE?
BBlack reopened this task as "Open". BBlack added a comment. In T330906#8661013 <https://phabricator.wikimedia.org/T330906#8661013>, @Ennomeijers wrote: > As I already mentioned earlier, the SPARQL endpoint and the RDF serialized data all use the HTTP version as the canonical identifier. This makes sense to me and is, as far as I know, in line with other linked data best practices. But there needs to be a machine-readable way to access the data. > > Using a 301 to redirect to the HTTPS URL is the correct approach, and in fact this is already implemented and currently working again from my end. When I run the same command as mentioned in my first report I now do get a 301 reply. I hope this will keep working in this way until HTTP URIs are no longer used within WD. I will close the issue for now. Please don't close this task unless we're replacing it with a more-focused one on the uncovered issues here. For the reasons stated earlier, relying on the 301 to "fix" this is not the correct approach. We can open a separate new task if you prefer, but either way we need to get this properly addressed (by having all live links in our control use the proper canonical URIs via `https://`). TASK DETAIL https://phabricator.wikimedia.org/T330906
[Wikidata-bugs] [Maniphest] T330906: HTTP URIs do not resolve from NL and DE?
BBlack added a comment. In T330906#8657917 <https://phabricator.wikimedia.org/T330906#8657917>, @Ennomeijers wrote: > Thanks for the replies! Advising to use HTTPS over HTTP makes sense. > > But not supporting redirection from HTTP to HTTPS will in my opinion introduce a fundamental problem for using Wikidata as a source for Linked Data. When querying Wikidata through the SPARQL endpoint, the entities of the result set are all HTTP URIs. The RDF description of WD entities (accessed as described on https://www.wikidata.org/wiki/Wikidata:Data_access) contains many HTTP URIs for related entities and other resources. > > Using HTTP as the identifier for the entity is not problematic as long as the redirection from HTTP to HTTPS can deliver access to the data itself. In T330906#8660183 <https://phabricator.wikimedia.org/T330906#8660183>, @Ennomeijers wrote: > I think this touches upon a fundamental question of how to model WD information as Linked Data. As currently stated in https://www.wikidata.org/wiki/Wikidata:Data_access the //concept URI// of an entity is its **HTTP** version. We don't have plans to get rid of port 80 HTTP->HTTPS redirects anytime soon. However, we consider that traffic pretty low priority, and in this particular case we partially disabled it temporarily while dealing with an operational incident, which luckily led to us uncovering this issue! However, the canonical (i.e. "official", "should be used in all links") URIs for traffic/access to all Wikimedia project domains are HTTPS URIs, not HTTP ones. We shouldn't be publishing plain-HTTP URIs. The HTTP->HTTPS redirects are designed to help smooth over issues with legacy links we don't control out in the wild Internet, when accessed by UAs that don't respect HSTS. These redirects, by their nature, are **not secure**. 
When users access content through plain-HTTP URIs, even though we try to serve a helpful 301 redirect to HTTPS, literally anyone on the Internet path between the user and the WMF can both see and modify both the request and the response in flight. The initial, insecure request via plain HTTP can be censored, surveilled, and modified. This means individual resources can be blocked/censored/replaced by bad actors. The article names you're reading can be catalogued to build profiles on readers. The intended 301 redirect can be replaced with something completely different, such as a redirect to a different site, an alternative version of our content, or even an attack payload or banner ad injection. All of that aside, HTTP access is also going to perform worse, as you have to do the full TCP and HTTP transaction (multiple latency roundtrips) just to get the redirect response, then start over with a fresh HTTPS transaction on a fresh TCP connection (more redundant network roundtrips). Normal redirects that stay within one protocol and domainname can generally re-use the same connection, but not HTTP->HTTPS protocol upgrade redirects. For all of these reasons, for Traffic purposes, all canonical URIs for our projects are HTTPS, not HTTP. We hadn't been aware that anything Wikidata-related was publishing canonical URIs that start with `http://`, and we're collectively going to need to find a way to stop doing that. > Accessing the data associated with the concept URI should be possible both for humans (through browsers) and for applications. Can you point me to examples for machine readable processing using the HTTP Strict Transport Security implementation or is this a browser only solution? HSTS is basically a legacy transition mechanism, much like the redirects, but both stronger and less universal. It's defined in https://www.rfc-editor.org/rfc/rfc6797 . 
Its goal is to help paper over issues exactly like these - the first time you access `https://www.wikidata.org/`, you get an extra header that informs the UA that all future accesses to this whole domain should be upgraded to HTTPS without attempting plain HTTP at all. Further, there's a public "HSTS Preload" list at https://hstspreload.org/ that all modern browsers utilize, and which contains all of our domains. This avoids the problem of first access and HSTS caching, so that preloading UAs transform even the first HTTP access to HTTPS before sending anything over the network. It's not specific to browsers; it's implemented as some generic headers that are intended to be honored by any UA, but obviously many less-user-focused UAs (various HTTP library implementations for scripts, the curl CLI tool, etc) do not necessarily implement it strongly. TASK DETAIL https://phabricator.wikimedia.org/T330906
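For tooling that wants HSTS semantics without a browser, the header is simple enough to honor manually. A rough sketch of the mechanics described above, assuming an in-memory policy cache (real implementations should also persist state and handle `preload`/edge cases per RFC 6797):

```python
import time

def parse_hsts(header: str) -> dict:
    """Parse a Strict-Transport-Security header value into its directives,
    e.g. 'max-age=106384710; includeSubDomains'."""
    policy = {"max_age": 0, "include_subdomains": False}
    for directive in header.split(";"):
        directive = directive.strip().lower()
        if directive.startswith("max-age="):
            policy["max_age"] = int(directive.split("=", 1)[1].strip('"'))
        elif directive == "includesubdomains":
            policy["include_subdomains"] = True
    return policy

# In-memory HSTS cache: host -> policy expiry timestamp.
hsts_cache: dict = {}

def remember(host: str, header: str) -> None:
    """Record the policy seen on a prior HTTPS response for this host."""
    hsts_cache[host] = time.time() + parse_hsts(header)["max_age"]

def must_use_https(host: str) -> bool:
    """True if a cached, unexpired HSTS policy mandates upgrading."""
    return hsts_cache.get(host, 0) > time.time()
```

A script using this would call `remember()` on every HTTPS response carrying the header, and consult `must_use_https()` before ever issuing a plain-HTTP request.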
[Wikidata-bugs] [Maniphest] T284981: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change
BBlack added a comment. We chose S:BP for those queries on the assumption that, by its nature, it would be a cheap page to monitor. Is there a better option we should be using, or is this ticket more about fixing inefficiencies in it? TASK DETAIL https://phabricator.wikimedia.org/T284981
[Wikidata-bugs] [Maniphest] T266702: Move WDQS UI to microsites
BBlack added a comment. We can route different URI subspaces differently at the edge layer, based on URI regexes, as shown here for the split of the API namespace of the primary wiki sites: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/trafficserver/backend.yaml#263 TASK DETAIL https://phabricator.wikimedia.org/T266702
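The regex-based URI split described above amounts to an ordered first-match-wins routing table. A hypothetical sketch of the idea (the patterns and backend names here are invented for illustration; the real mapping lives in the linked puppet hieradata):

```python
import re

# Ordered (pattern, backend) pairs; first match wins. These patterns and
# backend names are made up for illustration only.
ROUTES = [
    (re.compile(r"^/w/api\.php"), "api-backend"),
    (re.compile(r"^/bigdata/ldf"), "wdqs-ldf-backend"),
]
DEFAULT_BACKEND = "default-backend"

def route(uri_path: str) -> str:
    """Pick a backend for a request path by first matching regex."""
    for pattern, backend in ROUTES:
        if pattern.search(uri_path):
            return backend
    return DEFAULT_BACKEND
```

Because matching is ordered, more specific subspaces must be listed before broader ones, which is also why such configs are usually expressed as an ordered list rather than a map.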
[Wikidata-bugs] [Maniphest] [Commented On] T237319: 502 errors on ATS/8.0.5
BBlack added a comment. I think you ran into a temporary blip in some unrelated DNS work (which has already been dealt with), not this bug (502 errors can happen for real infra failure reasons, too!). TASK DETAIL https://phabricator.wikimedia.org/T237319
[Wikidata-bugs] [Maniphest] [Commented On] T232006: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients
BBlack added a comment. We'll also need to normalize the incoming `Accept` headers up in the edge cache layer to avoid pointless Vary explosions. Ideally the normalization should exactly match the application-layer logic that chooses the output content type. Do you have a pseudo-code description (a link to real code is fine too) of how `Accept` is parsed to select content types? TASK DETAIL https://phabricator.wikimedia.org/T232006
[Wikidata-bugs] [Maniphest] [Commented On] T99531: [Task] move wikiba.se webhosting to wikimedia cluster
BBlack added a comment. As noted in T155359 <https://phabricator.wikimedia.org/T155359> - WMDE has moved the hosting of this to some other platform, including the DNS hosting (and we never had the whois entry). So this task can resolve as Decline I think (or whatever), but we should use it to track down various revert patches first before we close it up (revert the DNS repo stuff and whatever else we've got going on in various other repos supporting the wikiba.se site). TASK DETAIL https://phabricator.wikimedia.org/T99531
[Wikidata-bugs] [Maniphest] [Commented On] T99531: [Task] move wikiba.se webhosting to wikimedia cluster
BBlack added a comment. @WMDE-leszek Thanks for looking into it! I believe @CRoslof is who you want to coordinate with on our end, whose last statement on this topic back in January was: In T99531#4878798 <https://phabricator.wikimedia.org/T99531#4878798>, @CRoslof wrote: > Transferring the domain name from WMDE to the Foundation requires that WMDE complete an ownership change form. I emailed with @Abraham and the Foundation's domain name registrar about it a while back, but the paperwork was never completed. Let me know when WMDE is ready to move forward with the transfer. TASK DETAIL https://phabricator.wikimedia.org/T99531
[Wikidata-bugs] [Maniphest] [Commented On] T99531: [Task] move wikiba.se webhosting to wikimedia cluster
BBlack added a comment. Re: `wikibase.org`: adding it as a non-canonical redirect to catch confusion from those who manually type URLs is fine, but we should make sure everyone is clear on which domainname is canonical for this project (I assume `https://wikiba.se/`) and make sure that's the only one that's published, promoted, and used for links we control and such. It's an important notion that one name is canonical!

Re: the HSTS/HTTPS stuff in https://gerrit.wikimedia.org/r/c/operations/puppet/+/500711 :

- It's policy for canonical domains we support here, so there's no real debate about whether we'll end up with full-value HSTS and the HTTP->HTTPS redirect.
- But we don't need this separate patch at the apache level; we'll handle it in VCL with the same code that handles the other canonical project domains.

Re: handing off registration: I really think we should stop touching this whole project until this gets resolved, which means stalling on the above HSTS/HTTPS work and on the switch of IPs. This is a policy issue as well, which we've tried to explain politely more than once in this thread, but if it's going to end up being a blocker there's no point expending further effort on this until they figure out what direction they want to go. I think the original statement way back from @Faidon was that it was a "very strong preference" that we get ownership transfer, but in fact we'd already made the declaration that it's a policy requirement about a week before that in https://wikitech.wikimedia.org/wiki/HTTPS#For_the_Foundation's_canonical_domainnames , and honestly I really don't want to wade into the mess of having an exception to those rules during all the future improvements we have coming at the DNS and HTTPS layers. Even just applying strong HSTS amid these ownership issues seems irresponsible of us at best, as the hosting WMDE currently has the site on lacks HTTPS entirely. 
TASK DETAIL https://phabricator.wikimedia.org/T99531
[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater
BBlack added a comment. I think it would be better, from my perspective, to really understand the use-cases (which I currently don't). Why do these remote clients need "realtime" (no staleness) fetches of Q items? What I hear sounds like all clients expect everything to be perfectly synchronous, but I don't understand why they need to be. In the case that led to this ticket, it was a remote client at Orange issuing a very high rate of these uncacheable queries, which seems like a bulk data load/update process, not an "I just edited this thing and need to see my own edits reflected" sort of case. TASK DETAIL https://phabricator.wikimedia.org/T217897
[Wikidata-bugs] [Maniphest] [Commented On] T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater
BBlack added a comment. Looking at an internal version of the flavor=dump outputs of an entity, related observations. Test request from the inside: `curl -v 'https://www.wikidata.org/wiki/Special:EntityData/Q15223487.ttl?flavor=dump' --resolve www.wikidata.org:443:10.2.2.1`

- There is LM data; for this QID it currently says: `last-modified: Fri, 08 Mar 2019 06:24:59 GMT`
  - This could be used with standard HTTP conditional requests via `If-Modified-Since`. This would still cause a ping through to the applayer, but would not transfer the body if nothing changed.
  - Or alternatively, use the same data that informs the LM/IMS conditional handling to set metadata in the dump output as well, so that your queries can use this as a datestamp shared among more clients (this is basically the `use event date` idea from the summary), so that it doesn't even need an LM/IMS roundtrip and can be a true cache hit.
- The CC header is: `cache-control: public, s-maxage=3600, max-age=3600`
  - 1H seems short in general. We prefer 1d+ for the actual CC times advertised by major cacheable production endpoints, so that everything doesn't go stale too quickly during minor maintenance work on a cache or a site. Is there a reason it's set this low? (Often it's set low because of other issues, e.g. purging and this kind of update traffic not being well-engineered yet.)
  - However, assuming the 1H is staying for now, can't updaters just be OK with up to 1H of stale data and not cache-bust at all? There's no such thing as async+realtime; there's always some staleness, it's just a question of how much is tolerable for the use-case. 
TASK DETAIL https://phabricator.wikimedia.org/T217897
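The conditional-request idea above, sketched from an updater client's side, assuming it keeps the `last-modified` and `cache-control` values from its previous fetch (values below are the ones from the curl output; field names here are just illustrative):

```python
# Simulated cached state for one entity dump; in a real updater these
# values come from the previous response's headers.
cached = {
    "last_modified": "Fri, 08 Mar 2019 06:24:59 GMT",
    "fetched_at": 0.0,   # wall-clock time of the last fetch
    "max_age": 3600,     # from cache-control: s-maxage=3600
}

def conditional_headers(cache: dict, now: float) -> dict:
    """Return {} while the local copy is still fresh (no request needed
    at all); otherwise return an If-Modified-Since header so the server
    can answer 304 without re-transferring the body."""
    if now - cache["fetched_at"] < cache["max_age"]:
        return {}
    return {"If-Modified-Since": cache["last_modified"]}
```

An updater built this way tolerates up to `max_age` of staleness for free, and past that still avoids the body transfer whenever the entity hasn't changed.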
[Wikidata-bugs] [Maniphest] [Commented On] T99531: [Task] move wikiba.se webhosting to wikimedia cluster
BBlack added a comment. There are different layers of "handing off" DNS management which are being conflated, but to run through them in order:

1. "Point the A record to the right place" - We don't support this, and can't realistically. We need control of the zone data directly on our nameservers for a variety of technical reasons (e.g. setting policy controls like CAA authorizations, future ESNI-related records, etc.), and we don't use simple A records; we use a dynamic system that hands out any of a number of addresses to reach the nearest of our global edge datacenters, and these things evolve over time on a technical level. If we're going to host something in our production infrastructure and manage it correctly, we have to move to at least the next level of handoff:

2. Leaving the registration of the domain with WMDE and their registrar, but pointing the nameservers at WMF nameservers. This is what we've already done earlier in the ticket and where we're at now. Currently the domain is registered to WMDE (presumably; it's hidden in public view) via registrar "united-domains", and the nameserver values are set to the 3x WMF nameserver hosts (ns0.wikimedia.org, ns1.wikimedia.org, and ns2.wikimedia.org, at specific IP addresses for each). This allows the WMF nameservers and SRE staff to do all the basic technical things referenced above, and is the first logical step before:

3. Switching the registration to WMF's registrar/ownership. This is more on a policy/standards/legal level, and maybe @CRoslof can give more details than me on that front about legal-related things. It would be odd in the general case to be canonicalizing a WMF domain without registrar control though, as it could be swapped out from under us at any time. 
However, even on a purely technical level it matters to us as well: we have future plans to deploy more authdns servers, change their IPs, and deploy anycasted authdns as well, all of which require WMF to have tight control over the registrar settings for all the domains we host resources for, so that we can get through transition periods smoothly as those ns[012] hostnames and their IPs change. It's not scalable for those processes to involve contacting N third parties and having them all indirectly contact their registrars on our behalf, etc. TASK DETAIL https://phabricator.wikimedia.org/T99531
[Wikidata-bugs] [Maniphest] [Commented On] T99531: [Task] move wikiba.se webhosting to wikimedia cluster
BBlack added a comment. There are still a couple of things that can be done serially at present, one of which is necessary for the cert issuance later:

1. Switch the nameservers for wikiba.se to ns[012].wikimedia.org with your current registrar (United Domains). We have to have this to issue the cert at all later. The cert likely won't be issued until sometime in Jan/Feb.

2. Switch the ownership/registration of wikiba.se to the Foundation and its registrar(s). This isn't required to issue the cert on a technical level, but as a matter of general policy we'll want this done at some point before we're really hosting wikiba.se, and there's nothing blocking it after (1) is done above.

TASK DETAIL https://phabricator.wikimedia.org/T99531
[Wikidata-bugs] [Maniphest] [Updated] T99531: [Task] move wikiba.se webhosting to wikimedia cluster
BBlack added a comment. Thanks for the data and the patch! We'll dig into the DNS patch next week and get it merged in so we're serving wikiba.se from our DNS as-is (as in, pointing at your existing server IPs). Then we can do the handoff of the domain ownership/registration without causing any interruptions. That gets us over the first handoff hurdle, at which point it's on #Traffic to get wikiba.se certs added to our cache clusters using DNS-based validation (again, server IPs still pointing at the current server throughout). We're migrating our existing LE certs to a new solution this quarter ( T207050 ), and after that's done we'll look early in the next quarter at defining these new certs' slightly more-complicated case. Once those are issued and deployed, you'll have some time (if needed!) to test and refine the data our version of the site is hosting ( https://gerrit.wikimedia.org/r/plugins/gitiles/wikibase/wikiba.se/+/master ), and then we switch server IPs and we're done.
[Wikidata-bugs] [Maniphest] [Commented On] T206105: Optimize networking configuration for WDQS
BBlack added a comment. Yes, let's look at this today. I think we need better tg3 ethernet card support in interface::rps for one of our authdnses anyways, which you'll need here too.
[Wikidata-bugs] [Maniphest] [Updated] T99531: [Task] move wikiba.se webhosting to wikimedia misc-cluster
BBlack added a comment. There are plans underway at this point to support multiple LE certs on our standard cache terminators via the work in T199711, due by EOQ (end of Sept), which would make this whole thing simpler, with zero cert cost. I couldn't say for sure how fast we'll shake out all the bugs in such a system after initial deployment, but I'd hope quickly. In the interim, our best option aside from waiting would be to purchase a commercial DV wikiba.se cert and deploy it on the caches (which requires a little bit of testing; we haven't run multiple SNI certs there in a while now). Nobody's worked on this in a while on our end, mostly for lack of priority/time/focus. In either case, the first few steps are relatively trivial and would be the same:

1. Create a wikiba.se microsite in WMF infra (already done by @Dzahn I believe, sourcing from https://gerrit.wikimedia.org/r/plugins/gitiles/wikibase/wikiba.se/+/master ).
2. Create a wikiba.se template in our authdns, matching the current data (including current non-WMF server IPs). Any complications here, e.g. the MX service currently pointing to udag.de, we can mirror for now I guess. Are there any other service hostnames besides wikiba.se and www.wikiba.se pointing at 89.31.143.100?
3. Move authdns control for wikiba.se over to the WMF nameservers (a no-op for users, but allows DV on our end).
4. Issue a commercial DV cert to the caches to avoid waiting, and/or deploy an automated LE DV cert to the caches at a later date.
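The authdns template mentioned above would simply mirror the current public data. A hypothetical zone fragment, for illustration only: the 89.31.143.100 address and the udag.de MX service come from this thread, but the exact MX exchange name, TTLs, and record types for www are not specified here and are placeholders.

```
; Hypothetical wikiba.se zone fragment mirroring the current external data.
; Only 89.31.143.100 and "MX service at udag.de" are from this thread;
; mx.udag.de and the TTLs are placeholders.
wikiba.se.      3600  IN  A   89.31.143.100
www.wikiba.se.  3600  IN  A   89.31.143.100
wikiba.se.      3600  IN  MX  10 mx.udag.de.
```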
[Wikidata-bugs] [Maniphest] [Commented On] T199219: WDQS should use internal endpoint to communicate to Wikidata
BBlack added a comment. It's a complicated topic I think, on our end. There are ways to make it work today, but when I try to write down generic steps any internal service could take to talk to any other (esp MW or RB), it bogs down in complications that are probably less than ideal in various language/platform contexts. For this very particular case, the simplest way would be to do your language/platform/library's equivalent of:

curl -H 'Host: www.wikidata.org' 'https://appservers-ro.discovery.wmnet/wiki/Special:EntityData/Q2408871.ttl?nocache=1530836328152=dump'

That is, use the internal service endpoint hostname in the URI for TLS connection purposes, but then explicitly set the request Host header to www.wikidata.org for use at the HTTP level. Whether you need appservers-rw or api-ro or restbase-async (...) for a particular URL path for other cases underneath www.wikidata.org is the deep complication here.
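For a Python consumer, the same Host-header override can be sketched with only the standard library. This is a minimal sketch, not a blessed client: the hostnames are the ones from the comment above, whether they resolve depends on being inside the WMF network, and (as with the curl form) certificate verification still happens against the URL's hostname.

```python
from urllib.request import Request

# Build a request whose TCP/TLS target is the internal discovery endpoint,
# while the HTTP-level Host header names the public site.
url = "https://appservers-ro.discovery.wmnet/wiki/Special:EntityData/Q2408871.ttl"
req = Request(url, headers={"Host": "www.wikidata.org"})

# The connection would be made to the URL's host...
print(req.host)                # appservers-ro.discovery.wmnet
# ...but the request carries the public name at the HTTP level.
print(req.get_header("Host"))  # www.wikidata.org
```

Passing `req` to `urllib.request.urlopen` would then perform the actual fetch against the internal endpoint.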
[Wikidata-bugs] [Maniphest] [Commented On] T199146: "Blocked" response when trying to access constraintsrdf action from production host
BBlack added a comment. This raises some questions that are probably unrelated to the problem at hand, but might affect things indirectly:

1. Why is an internal service (wdqs) querying a public endpoint? It should probably use private internal endpoints like appservers.svc or api.svc, but there may be arguments about the desirability of [Varnish] caching. This is something we're grappling with in general in the longer term (trying to understand and/or eliminate private internal service<->service traffic routing through the public edge unnecessarily).
2. Why is it using webproxy to access it? It should be able to reach www.wikidata.org without any kind of proxy.
[Wikidata-bugs] [Maniphest] [Commented On] T99531: [Task] move wikiba.se webhosting to wikimedia misc-cluster
BBlack added a comment. It's a pain any direction we slice this, and I'm not fond of adding new canonical domains outside the known set for individual low-traffic projects. We didn't add new domains for a variety of other public-facing efforts (e.g. wdqs, ORES, maps, etc). We don't have clear standards about these things, and frankly some of the existing legacy canonical project domains currently bloating our unified certs probably wouldn't comply with any standard I'd want to put in place here going forward, either. Those other project domains are "special", though, in that they can only reasonably be handled via commercial wildcards today due to the language-subdomain issues. We're not structured in such a way that adding new domain registrations to our termination is trivial, and I'm not sure it's ever going to be completely trivial. There is always going to be some overhead associated with it, and we don't in the general case want to build up a pile of canonical domains that are mostly low-traffic and/or defunct but kept around for historical compatibility. The three paths forward to support the unique wikiba.se domainname on our termination are:

1. Add it to the unified certs we just renewed. This costs some $$ ongoing per year, and will require us to prove ownership of the domain before we integrate it (we need it in our DNS control either way). It also bloats the unified certificate size sent with every session on all projects (e.g. every TLS handshake for enwiki), which makes it really unpalatable to add new things here that could just as easily have been wikimedia.org subdomains for smaller projects "for free".
2. Add it as a separate commercial cert deployed alongside the unified. Same $$ ongoing. More maintenance burden on our end (e.g. accounting for it in OCSP Stapling and nginx server configs, etc). We've had multiple certificates deployed like this in the past (for wmfusercontent.org before it was integrated into the unified wildcard cert), but there have been several refactors during the era of one-cert-only, so I'm not sure there isn't some debt to clean up before we can successfully switch back to multiple, independent certs.
3. Add it separately as above, but using LetsEncrypt. This avoids the $$ cost, but adds additional complexities to deal with initially, as our current puppetized deployment of LE certs isn't robust enough for our primary traffic terminators, only for smaller single-host/one-off sites. It lacks dual-cert support (as in ECDSA+RSA), it lacks OCSP Stapling integration, and most importantly it doesn't know how to do renewal updates for a service with many global traffic termination points (i.e. we need to add support for centralizing the renewal process with updates out to all the global edge terminators, as well as support on all the terminators to forward challenge requests to the central renewer, etc).
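The centralized-renewal idea above, where every edge terminator forwards ACME challenge requests to one central renewer, boils down to a simple routing predicate at the edge. A minimal sketch under stated assumptions: the `central-acme.example` backend name is hypothetical, not a real WMF host, and real ACME validation (per the HTTP-01 challenge's well-known path) involves far more than this routing step.

```python
# Sketch of the edge-side routing decision for centralized ACME renewal.
# Requests for the HTTP-01 challenge path get proxied to a single central
# renewer that holds the account keys; everything else is handled locally.

ACME_PREFIX = "/.well-known/acme-challenge/"
CENTRAL_RENEWER = "central-acme.example"  # hypothetical backend name

def route(path: str) -> str:
    """Return which backend should handle this request path."""
    if path.startswith(ACME_PREFIX):
        return CENTRAL_RENEWER  # forward the challenge to the central renewer
    return "local"              # normal edge handling

print(route("/.well-known/acme-challenge/tok123"))  # central-acme.example
print(route("/wiki/Main_Page"))                     # local
```

The central renewer then pushes freshly issued certs back out to all terminators, which is the part that needs the real engineering work described above.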
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. No, we never made an incident rep on this one, and I don't think it would be fair at this time to implicate ORES as a cause. We can't really say that ORES was directly involved at all (or any of the other services investigated here). Because the cause was so unknown at the time, we stared at lots of recently-deployed things, and probably uncovered hints at minor issues in various services incidentally, but none of them may have been causative. All we know for sure is that switching Varnish's default behavior from streaming to store-and-forward of certain applayer responses (which was our normal mode over a year ago) broke things, probably because some services are violating assumptions we hold. Unfortunately proper investigation of this will stall for quite a while on our end, but we'll probably eventually come back with some analysis on that later and push for some fixups in various services so that we can move forward on that path again. The RB timeouts mentioned earlier seem a more-likely candidate for what we'll eventually uncover than ORES at this point.
[Wikidata-bugs] [Maniphest] [Changed Status] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack lowered the priority of this task from "High" to "Normal". BBlack changed the task status from "Open" to "Stalled". BBlack added a comment. The timeout changes above will offer some insulation, and as time passes we're not seeing evidence of this problem recurring with the do_stream=false patch reverted. Some related investigations on slow requests have turned up some pointers: 120-240s timeouts on requests to the REST API at /api/rest_v1/transform/wikitext/to/html, which are eerily similar to the kinds of problems we saw a while back in T150247. RB was dropping the connection from Varnish, and doing so in a way that Varnish would retry it indefinitely internally. We patched Varnish to mitigate that particular problem in the past, but something related may be surfacing here... We have a few steps to go here, but there are going to be considerable delays before we get to the end of all of this:

1. We have a preliminary patch to Varnish to limit the total response transaction time on backend requests (if the backend is dribbling response bytes often enough to evade hitting the between_bytes_timeout) at https://gerrit.wikimedia.org/r/#/c/387236/ . However, the patch is built on Varnish v5, and cache_text currently runs Varnish v4. We weren't planning to do any more Varnish v4 releases before moving all the clusters to v5 unless an emergency arose, as it complicates our process and timelines considerably, and this isn't enough of an emergency to justify it. Therefore, this part is blocked on https://phabricator.wikimedia.org/T168529 .
2. We want to log slow backend queries so that we have a better handle on these cases in general. There's ongoing work for this in https://gerrit.wikimedia.org/r/#/c/389515/ , https://gerrit.wikimedia.org/r/#/c/389516 , and more to come. One of those patches also has the v4/v5 issues above and blocks on upgrading cache_text to v5.

With those measures in place, we should be able to definitively identify (and/or work around) the problematic transactions and figure out what needs fixing at the application layer, at which point we can un-revert the do_stream=false change and move forward with our other VCL plans around exp(-size/c) admission policies on cache_text frontends as part of T144187 (but none of this ties up doing the same on cache_upload).
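The exp(-size/c) admission policy mentioned above is a size-based probabilistic cache admission scheme: small objects are almost always admitted, large ones rarely. A minimal sketch (not the actual VCL, and the function name is ours) with the random source injectable so the behavior is deterministic to check:

```python
import math
import random

def admit(size_bytes: int, c: float, rng=random.random) -> bool:
    """Admit an object into the cache frontend with probability exp(-size/c),
    biasing the cache toward many small objects over few large ones."""
    return rng() < math.exp(-size_bytes / c)

# With c = 256 KiB, a 1 KiB object is admitted ~99.6% of the time,
# while a 4 MiB object is almost never admitted (P ~ exp(-16)).
hits = sum(admit(1024, 256 * 1024) for _ in range(10_000))
print(f"small-object admission rate ~ {hits / 10_000:.3f}")
```

The constant c tunes where the admission probability falls off; raising it admits larger objects more often.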
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. In T179156#3719995, @BBlack wrote:

> We have an obvious case of normal slow chunked uploads of large files to commons to look at for examples to observe, though.

Rewinding a little: this is false, I was just getting confused by terminology. Commons "chunked" uploads through UploadWizard are not HTTP chunked transfer encoding, which is what I meant by "chunked uploads" in the rest of this conversation.

In T179156#3720290, @daniel wrote:

> "pass" means stream, right? wouldn't that also grab a backend connection from the pool, and hog it if throughput is slow?

I'm pretty sure non-piped client request bodies are not streamed to backends, looking at the code (even in the pass case), and we don't use pipe-mode in cache_text at all. Whether that allows for resource exhaustion on the frontend (plain memory consumption, or slowloris-like) is an open question, but again it's not the problem we're looking at here.

> > We've gathered those manually in specific cases so far. Aggregating them across the varnishes to somewhere central all the time will require some significant work I think.
>
> How about doing this on the app servers instead of varnish? We do track MediaWiki execution time, right? Would it be possible to also track overall execution time, from the moment php starts receiving data, before giving control to mediawiki?

That would be nice too I think. But at the end of the day, probably our Varnishes should assume that we won't necessarily have sane execution timeouts at all possible underlying applayer services (if nothing else because Bugs). So we probably still want to capture this at the Varnish level as well. Relatedly, I know hhvm has a max_execution_time parameter which we've set to 60s, so you'd *think* that would be a limit for the MediaWiki requests in question. But on the other hand, I know during the weekend I logged requests going into (through?) the MW API for flow-parsoid stuff that timed out at ~80s (when we had the super-short Varnish timeouts configured as an emergency workaround, which helped somewhat).
[Wikidata-bugs] [Maniphest] [Lowered Priority] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack lowered the priority of this task from "Unbreak Now!" to "High". BBlack added a comment. Reducing this from UBN->High, because the current best working theory is that this problem is gone so long as we keep the VCL do_stream=false change reverted. Obviously, there are still some related investigations ongoing, and I'm going to write up an Incident Report about the 503s later today as well.
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. In T179156#3719928, @daniel wrote:

> > In any case, this would consume front-edge client connections, but wouldn't trigger anything deeper into the stack
>
> That's assuming varnish always caches the entire request, and never "streams" to the backend, even for file uploads. When discussing this with @hoo he told me that this should be the case - but is it? That would make it easy to exhaust RAM on the varnish boxes, no?

Maybe? I haven't really delved deeply into this angle yet, because it seems less likely to be the cause of the current issues. We have an obvious case of normal slow chunked uploads of large files to commons to look at for examples to observe, though. Because they're POST they'd be handled as an immediate pass through the varnish layers, so I don't think this would cause what we're looking at now. GETs with request-bodies that were slowly-chunked-out might be different, I don't know yet.

> > this is definitely on the receiving end of responses from the applayer.
>
> So a slow-request-log would help?

Yes. We've gathered those manually in specific cases so far. Aggregating them across the varnishes to somewhere central all the time will require some significant work I think. Right now I'm more worried about the fact that, since varnish doesn't log a transaction at all until the transaction is complete, without overall transaction timeouts on the backend connections there are cases that would slip through all possible logging (if they stayed open virtually indefinitely and sent data often enough to evade the between_bytes_timeout check).
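To make the evasion concrete: a per-gap check like between_bytes_timeout only looks at each silent interval in isolation, so a backend that emits a byte just inside the window every time passes it forever, and only an overall transaction cap catches the stream. A minimal sketch (plain Python, not Varnish code; the function is hypothetical, with parameter names mirroring the Varnish settings discussed here):

```python
def timed_out(gaps, between_bytes_timeout, total_timeout):
    """Given the inter-byte gaps (seconds) of one backend response, report
    which timeout would fire first: 'between_bytes', 'total', or None."""
    elapsed = 0.0
    for gap in gaps:
        if gap > between_bytes_timeout:
            return "between_bytes"  # one silent gap was itself too long
        elapsed += gap
        if elapsed > total_timeout:
            return "total"          # overall transaction cap exceeded
    return None

# A well-behaved response: short gaps, short total.
print(timed_out([0.1] * 10, between_bytes_timeout=31, total_timeout=600))  # None

# A trickling backend: one byte every 30s evades a 31s between-bytes
# timeout indefinitely; only the total cap ever fires.
print(timed_out([30] * 100, between_bytes_timeout=31, total_timeout=600))  # total
```

This is exactly the gap the preliminary Varnish patch mentioned earlier is meant to close: adding the total-transaction limit alongside the existing per-gap one.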
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. Trickled-in POST on the client side would be something else. Varnish's timeout_idle, which is set to 5s on our frontends, acts as the limit for receiving all client request headers, but I'm not sure that it has such a limitation that applies to client-sent bodies. In any case, this would consume front-edge client connections, but wouldn't trigger anything deeper into the stack. We could/should double-check varnish's behavior there, but that's not what's causing this, this is definitely on the receiving end of responses from the applayer.
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. In T179156#3718772, @ema wrote:

> There's a timeout limiting the total amount of time varnish is allowed to spend on a single request, send_timeout, defaulting to 10 minutes. Unfortunately there's no counter tracking when the timer kicks in, although a debug line is logged to VSL when that happens. We can identify requests causing the "unreasonable" behavior as follows: varnishlog -q 'Debug ~ "Hit total send timeout"'

Yeah, this might help find the triggering clients. However, I don't know if the backend side of Varnish would actually abandon the backend request on send_timeout to the client.

In T179156#3718957, @Lucas_Werkmeister_WMDE wrote:

> Another thing that might be similar: for certain queries, the Wikidata Query Service can push out lots of results for up to sixty seconds (at which point the query is killed by timeout) or even longer (if the server returned results faster than they could be transferred – it seems the timeout only applies to the query itself). The simplest such query would be SELECT * WHERE { ?s ?p ?o. }; when I just tried it out (curl -d 'query=SELECT * WHERE { ?s ?p ?o. }' https://query.wikidata.org/sparql -o /dev/null), I received 1468M in 5 minutes (at which point I killed curl – I have no idea how much longer it would have continued to receive results). However, if I understand it correctly, WDQS' proxy seems to be running in do_stream mode, since I'm receiving results immediately and continuously.

That's probably not causing the problem on text-lb, since query.wikidata.org goes through cache_misc at present. But if there's no known actual-push traffic, the next-best hypothesis is behavior exactly like the above: something that's doing a legitimate request->response cycle, but trickling out the bytes of it over a very long period. This would wrap back around to why we were looking at some of these cases before, I think: could other services on text-lb be making these kinds of queries to WDQS on behalf of the client and basically proxying the same behavior through?
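On the consuming side, one guard against exactly this trickling pattern is an overall wall-clock deadline while reading a streamed response, independent of any per-read timeout. A minimal sketch (not any client we actually run), with the clock injected so the cutoff logic is checkable; the chunk iterable stands in for e.g. a streamed HTTP response body:

```python
import time

def read_with_deadline(chunks, deadline_s, clock=time.monotonic):
    """Consume an iterable of byte chunks, but abandon the stream once
    deadline_s of wall-clock time has passed overall -- even if every
    individual chunk arrives quickly enough to evade per-read timeouts."""
    start = clock()
    received = []
    for chunk in chunks:
        if clock() - start > deadline_s:
            raise TimeoutError(f"response exceeded {deadline_s}s overall")
        received.append(chunk)
    return b"".join(received)

# A fast stream completes well within the deadline.
print(read_with_deadline([b"a", b"b", b"c"], deadline_s=60.0))  # b'abc'
```

The same total-deadline idea is what the Varnish-side patch discussed earlier applies at the cache layer, where misbehaving clients and backends can't be trusted to do it themselves.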
[Wikidata-bugs] [Maniphest] [Updated] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. Now that I'm digging deeper, it seems there are one or more projects in progress built around Push-like things, in particular T113125 . I don't see any evidence that there's been live deploy of them yet, but maybe I'm missing something or other. If we have a live deploy of any kind of push-like functionality through the text cluster, it's a likely candidate for issues above in the short term. In the long term, discussions about push services need to loop in #traffic much earlier in the process. This kind of thing is a definite No through the current traffic termination architecture as it's configured today. I've even seen some tickets mention the possibility of push for anonymous users (!). The changes on our end to sanely accommodate various push technologies reliably at wiki-scale could potentially be very large and costly, and could involve carving out a separate parallel edge-facing architecture for this stuff, distinct from the edge architecture we use for simpler transactions. We don't have any kind of long-term planning at the #traffic level around supporting this in our annual plans and headcounts, either. 
It may seem like a small thing from some perspectives, but push notifications at wiki-scale are a huge sea change on our end compared to simple HTTP transactions.
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. Does Echo have any kind of push notification going on, even in light testing yet?
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. A while after the above, @hoo started focusing on a different aspect of this that we've been somewhat ignoring as more of a side-symptom: there tend to be a lot of sockets in a strange state on the "target" varnish, to various MW nodes. They look strange on both sides, in that they spend significant time in the CLOSE_WAIT state on the varnish side and FIN_WAIT_2 on the MW side. This is a consistent state between the two nodes, but it's not usually one that non-buggy application code spends much time in. In this state, the MW side has sent FIN, Varnish has seen and acknowledged it, but Varnish has not yet sent its own FIN to finish the active closing process, and MW is still waiting on it. While staring at the relevant Varnish code to figure out why or how it would delay closing in this case, it seemed like this was possible in certain cases related to connections in the VBC_STOLEN state. Instead of closing immediately, in some such cases Varnish defers killing the socket until some future eventloop event fires, which could explain the closing-delays under heavy load (and we know Varnish is backlogged in some senses when the problem is going on, because mailbox lag rises indefinitely). All of that aside, at some point while staring at related code I realized that do_stream behaviors can influence some related things as well, and we had a recent related VCL patch. The patch in question was https://gerrit.wikimedia.org/r/#/c/386616/ , which was merged around 14:13 Oct 26, about 4.5 hours before the problems were first noticed (!). I manually reverted the patch on cp1067 (the current target problem node) as a test, and all of the CLOSE_WAIT sockets disappeared shortly, never to return. I reverted the whole patch through gerrit shortly afterwards, so that's permanent now across the board. I think there's a strong chance this patch was the catalyst for the start of the problems. At the very least, it was exacerbating their impact.
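A minimal sketch of the half-close sequence described above, using plain Python sockets rather than anything Varnish-specific: when the "MW" peer sends its FIN, the "Varnish" side sees an empty read and its kernel socket sits in CLOSE_WAIT until application code finally calls close(). All names here are illustrative.

```python
# Illustrative only: the "MW" socket half-closes (sends FIN, entering
# FIN_WAIT_2); the "varnish" socket's kernel ACKs that FIN automatically,
# and the socket lingers in CLOSE_WAIT until the application closes it.
import socket

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)

mw = socket.socket()                 # plays the MediaWiki role
mw.connect(listener.getsockname())
varnish, _ = listener.accept()       # plays the Varnish role

mw.shutdown(socket.SHUT_WR)          # MW sends FIN -> MW side is in FIN_WAIT_2
data = varnish.recv(1024)            # empty read means the peer's FIN arrived
print(data)                          # b'' ; varnish side is now in CLOSE_WAIT

# Until varnish.close() runs (sending our own FIN), the peer stays stuck
# in FIN_WAIT_2 -- the deferred-close symptom described in the comment.
varnish.close()
mw.close()
listener.close()
```

The deferred close (here, the gap before `varnish.close()`) is exactly what an eventloop-deferred socket teardown would stretch out under load.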
If it turns out to be the problem, I think we still have more post-mortem investigation to do here, because the issues that raises are tricky. If it's just exacerbating, I think it's still useful to think about why it would, because that may help pin down the real problem. Operating on the assumption that it's the catalyst and diving a little deeper on that angle: The patch simply turned off do_stream behavior when the backend-most Varnish was talking to application layer services, when the applayer response did not contain a Content-Length header. Turning off do_stream makes Varnish act in a store-and-forward mode for the whole response, rather than forwarding bytes onwards to upstream clients as they arrive from the application. The benefit we were aiming for there was to have Varnish calculate the value of the missing Content-Length so that we can make more-informed cache-tuning decisions at higher layers. Minor performance tradeoffs aside, turning off do_stream shouldn't be harmful to any of our HTTP transactions under "reasonable" assumptions (more later on what "reasonable" is here). In fact, that was the default/only mode our caches operated in back when we were running Varnish 3, but streaming became the default for the text cluster when it switched to Varnish 4 just under a year ago. So this was "OK" a year ago, but clearly isn't ok for some requests today. That there was always a singular chash target within the text cluster for the problems also resonates here: there's probably only one special URI out there which breaks the "reasonable" assumption. Another oddity that we didn't delve into much before was that when we restarted the problematic varnish, it only depooled for a short period (<1 min), yet the problem would move *permanently* to its next chash target node and stay there even after the previous target node was repooled. 
This might indicate that the clients making these requests are doing so over very-long-lived connections, and even that the request->response cycle itself must be very-long-lived. It moves via re-chashing when a backend is restarted, but doesn't move on repool because everything's still connected and transacting. My best hypothesis for the "unreasonable" behavior that would break under do_stream=false is that we have some URI which is abusing HTTP chunked responses to stream an indefinite response. Sort of like websockets, but using the normal HTTP protocol primitives. The client sends a request for "give me a live stream of some events or whatever", and the server periodically sends new HTTP response chunks to the client containing new bits of the event feed. Varnish has no way to distinguish this behavior from normal chunked HTTP (where the response chunks will eventually reach a natural end in a reasonable timeframe), and in the do_stream=false store-and-forward mode, Varnish would consume this chunk st
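The chunked-streaming hypothesis above can be made concrete. In HTTP/1.1 chunked transfer coding (RFC 7230 §4.1) each chunk is length-prefixed and the message only ends with a zero-length terminator chunk; an endless event feed simply never sends that terminator, so a store-and-forward proxy buffers forever waiting for an end that never comes. A minimal encoder sketch (illustrative, not Varnish code):

```python
# Minimal HTTP/1.1 chunked-transfer framing (RFC 7230 section 4.1).
# An "infinite" event feed keeps emitting chunks and never sends the
# zero-length terminator, so do_stream=false (store-and-forward) would
# buffer the response indefinitely.
def encode_chunk(payload: bytes) -> bytes:
    # hex length, CRLF, payload, CRLF
    return b"%x\r\n%s\r\n" % (len(payload), payload)

LAST_CHUNK = b"0\r\n\r\n"  # the terminator a well-behaved response ends with

body = encode_chunk(b"event: edit\n") + encode_chunk(b"event: purge\n")
print(body)
# A finite response would end with: body + LAST_CHUNK
# An event-stream abuser just keeps appending encode_chunk(...) forever.
```

A streaming proxy can forward each chunk as it arrives; a buffering one can only release the response after seeing `LAST_CHUNK`, which here never comes.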
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. Updates from the Varnish side of things today (since I've been bad about getting commits/logs tagged onto this ticket):

18:15 - I took over looking at today's outburst on the Varnish side. The current target at the time was cp1053 (after elukey's earlier restart of cp1055 varnish-be above).

18:21 - I manually reduced the backend timeouts for api+appservers from defaults of connect/firstbyte/betweenbytes of 5/180/60 to 3/20/10. cp1053 had already been under assault for quite a while though, and this didn't seem to help much.

18:39 - restarted varnish-be on cp1053 to clear it of issues and move to a new target

18:41 - identified cp1065 as the new target (more on identifying these below!)

18:42 - Merged->deployed https://gerrit.wikimedia.org/r/#/c/387024/ to apply the shorter-timeouts workaround above to all text caches. At this point, cp1065 was showing various signs of the issue (rising connection counts + mailbox lag), but connection counts stabilized much lower than before, ~200-300 instead of rising towards ~3K, an apparent success of the timeout-reduction mitigation.

18:56 - Identified the first slow-running requests in cp1065 logs with the reduced timeouts:

18:56 < bblack> - BereqURL /w/api.php?action=""
18:56 < bblack> - BereqHeader Host: www.wikidata.org
18:56 < bblack> - Timestamp Bereq: 1509216884.761646 0.42 0.42
18:56 < bblack> - Timestamp Beresp: 1509216965.538549 80.776945 80.776903
18:56 < bblack> - Timestamp Error: 1509216965.538554 80.776950 0.05
18:56 < bblack> - Timestamp Start: 1509216970.911803 0.00 0.00

After this, identified several other slow requests. All were for the same basic flow-parsoid-utils API + www.wikidata.org.

19:39 - hoo's parsoid timeout reduction for Flow (above) hits

19:39 - restarted varnish-backend on cp1065 due to rising mailbox lag

19:41 - new target seems to be cp1067, briefly, but within a minute or two it recovers to normal state and stops exhibiting the symptoms much?
Apparently the problem-causing traffic may have temporarily died off on its own.

For future reference by another opsen who might be looking at this: one of the key metrics that identifies what we've been calling the "target cache" in eqiad (the one that will eventually have issues due to whatever bad traffic is currently mapped through it) is the connection counts to appservers.svc.eqiad.wmnet + api-appservers.svc.eqiad.wmnet on all the eqiad cache nodes. For this, I've been using:

bblack@neodymium:~$ sudo cumin A:cp-text_eqiad 'netstat -an|egrep "10\.2\.2\.(1|22)"|awk "{print \$5}"|sort|uniq -c|sort -n'

Which during the latter/worst part of cp1053's earlier target-period produced output like:

= NODE GROUP =
(1) cp1068.eqiad.wmnet
- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -
      1 10.2.2.18:8080
     15 10.2.2.17:7231
     79 10.2.2.1:80
    111 10.2.2.22:80
= NODE GROUP =
(1) cp1066.eqiad.wmnet
- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -
      1 10.2.2.18:8080
     14 10.2.2.17:7231
     92 10.2.2.1:80
    111 10.2.2.22:80
= NODE GROUP =
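The varnishlog Timestamp lines quoted above carry three numbers: an absolute epoch time, seconds elapsed since the task started, and seconds since the previous timestamp. A rough parser for pulling slow backend fetches out of such logs (field layout taken from the quoted lines; the threshold is the 20s first-byte timeout mentioned above):

```python
# Rough parser for varnishlog "Timestamp" lines as quoted above:
#   Timestamp <event>: <absolute> <since-task-start> <since-last-timestamp>
# Used here to flag backend fetches slower than a threshold.
def parse_timestamp(line: str):
    # e.g. "- Timestamp Beresp: 1509216965.538549 80.776945 80.776903"
    parts = line.replace("- ", "", 1).split()
    event = parts[1].rstrip(":")
    absolute, since_start, since_last = map(float, parts[2:5])
    return event, absolute, since_start, since_last

line = "- Timestamp Beresp: 1509216965.538549 80.776945 80.776903"
event, _, since_start, _ = parse_timestamp(line)
print(event, since_start)   # Beresp 80.776945
print(since_start > 20)     # True: this fetch would trip a 20s first-byte timeout
```

The 80.8s Beresp figure in the quoted log is exactly this kind of slow fetch, far past the reduced 3/20/10 timeouts.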
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. In T179156#3715432, @hoo wrote:

> I think I found the root cause now, seems it's actually related to the WikibaseQualityConstraints extension:

Isn't that the same extension referenced in the suspect commits mentioned above?

18:51 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 01m 04s)
18:52 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 49s)
19:12 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/WikibaseQualityConstraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 50s)
19:14 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/WikibaseQualityConstraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 00m 49s)

Or is it an unrelated problem in the same area?
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. Unless anyone objects, I'd like to start with reverting our emergency varnish max_connections changes from https://gerrit.wikimedia.org/r/#/c/386756 . Since the end of the log above, connection counts have returned to normal (~100), a tenth of the usual 1K limit, which normally isn't a problem. If we leave the 10K limit in place, it will only serve to mask (for a time) any recurrence of the issue, making it only possible to detect early by watching varnish socket counts on all the text cache machines.
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. My gut instinct remains what it was at the end of the log above: I think something in the revert of wikidatawiki to wmf.4 fixed this, and the timing alignment of the "Fix sorting of NullResults" changes + the initial ORES->wikidata fatals makes those in particular a strong candidate. I would start with undoing all of the other emergency changes first, leaving the wikidatawiki->wmf.4 bit for last.
[Wikidata-bugs] [Maniphest] [Commented On] T179156: 503 spikes and resulting API slowness starting 18:45 October 26
BBlack added a comment. Copying this in from etherpad (this is less awful than 6 hours of raw IRC+SAL logs, but still pretty verbose):

# cache servers work ongoing here, ethtool changes that require short depooled downtimes around short ethernet port outages:
17:49 bblack: ulsfo cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
17:57 bblack@neodymium: conftool action : set/pooled=no; selector: name=cp4024.ulsfo.wmnet
17:59 bblack: codfw cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
18:00 <+jouncebot> Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
18:27 bblack: esams cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
18:41 bblack: eqiad cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause

# 5xx alerts start appearing.
# initial assumption is related to ethtool work above
18:44 <+icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
18:46 <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
18:47 <+icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
18:48 <+icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]

# ...but once the MW exceptions hit, seems less-likely to be related to the ethtool work
# notices hit IRC for these wikidata sorting changes:
18:51 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 01m 04s)
18:52 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 49s)
19:12 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/WikibaseQualityConstraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 50s)
19:14 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/WikibaseQualityConstraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 00m 49s)

# Lots of discussion and digging ensues on all sides...
# bblack figures out that while the logs implicate a single eqiad text backend cache, depooling said cache moves the problem to a different cache host (repeatedly), so it doesn't seem to be a faulty cache node.
# One cache just happens to be the unlucky chash destination for more of the problematic traffic than the others at any given time.
# The problematic traffic load/patterns consume all of the 1K connection slots varnish allows to api.svc+appservers.svc, and then this causes many unrelated 503s for lack of available backend connection slots to service requests.
# The Fatals logs seem to be related to ORES fetching from Wikidata
# So, a timeout is increased there to cope with slow wikidata responses:
19:33 awight@tin: Started deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107
19:33 awight@tin: Finished deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 (duration: 00m 10s)
19:34 awight@tin: Started deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107
19:36 aaron@tin: Started restart [jobrunner/jobrunner@a20d043]: (no justification provided)
19:41 awight@tin: Finished deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 (duration: 07m 25s)

# Still doesn't fix the problem, so the next attempted fix is to disable ores+wikidata entirely:
20:02 ladsgroup@tin: Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s)
20:00 ladsgroup@tin: Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s)

# Things are still borked, try reverting some other recent Wikidata-related changes:
20:59 hoo@tin: Synchronized wmf-config/Wikibase.php: Revert "Add property for RDF mapping of external identifiers for Wikidata" (T178180) (duration: 00m 50s)
21:00 hoo: Fully revert all changes related to T178180

# Still borked. Tried reverting something else that looks dangerous in the logstash errors, but also wasn't the cause:
21:30 hoo@tin: Synchronized wmf-config/InitialiseSettings.php: Temporary disable remex html (T178632) (duration: 00m 50s)
21:32 hoo
[Wikidata-bugs] [Maniphest] [Commented On] T175588: Server overloaded .. can't save (only remove or cancel)
BBlack added a comment. Can you explain in more detail? Is the subject of this ticket what was shown as an error in your browser window? I doubt this is related to varnish and/or "mailbox lag".
[Wikidata-bugs] [Maniphest] [Updated] T175588: Server overloaded .. can't save (only remove or cancel)
BBlack removed parent tasks: T174932: Recurrent 'mailbox lag' critical alerts and 500s, T175473: Multiple 503 Errors.
[Wikidata-bugs] [Maniphest] [Updated] T99531: [Task] move wikiba.se webhosting to wikimedia misc-cluster
BBlack added a project: Traffic.
[Wikidata-bugs] [Maniphest] [Updated] T153563: Consider switching to HTTPS for Wikidata query service links
BBlack removed a parent task: T104681: HTTPS Plans (tracking / high-level info).
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. Yeah, that was the plan: for XKey to help here by consolidating that down to a single HTCP / PURGE per article touched. It's not useful for the mass-scale case (e.g. template/link references), as it doesn't scale well in that direction. But for a case like "1 article == 7 URLs for different formats/variants/derivatives" it should work great. The varnish module for it is deployed, but we haven't ever found/made the time to loop back to actually using it (defining standards for how to transmit it over the existing HTCP protocol or the new EventBus, and pushing developers to make use of it). I think last we talked we were going to move cache-purge traffic over to EventBus before tackling this (with kafka consumers on the cache nodes pulling the purges), but I'm not sure what the relative timelines on all related projects look like anymore.
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. We can get broader averages by dividing the values seen in the aggregate client status code graphs for eqiad's text cluster (the remote sites would be expected to see fewer, due to some of the bursts being more likely to be dropped by the network). This shows the past week's average at a 33.6K/s PURGE rate for text@eqiad: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=6=1=eqiad_type=text_type=1_type=2_type=3_type=4_type=5 There were 7 active servers there the past week (cp1053 has been depooled for the past couple of weeks), so that puts us at ~4800/sec raw rate of HTCP purges. The numbers from before (100, 400) were htmlCacheUpdate numbers though, before they're multiplied by 4 (desktop/mobile, action="") for actual HTCP purging, so the comparable number now would be something like ~1200/sec. It's a little blurrier than that now, though, because in the meantime we've added RB purges as well (e.g. for the mobile content sections). I think these are 3x per article for mobile-sections, mobile-sections-lead, mobile-sections-remaining, and I'm not sure exactly how it hooks into the updating pipeline.
I would suspect that, indirectly, all 3 of those are triggered by many of the same conditions as regular wiki purges, so we may be seeing a ~7x HTCP multiplier overall for title->URLs, which would divide the 4800/s down to ~685/s on the htmlCacheUpdate side as perhaps a more-comparable number to the earlier 100 and 400 numbers.
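The arithmetic above can be checked directly. The one assumption implied by the division is that the aggregate graph counts each multicast HTCP purge once per pooled server; the ~7x title->URL multiplier is, as stated, the speculative part.

```python
# Back-of-envelope check of the purge-rate math above. Assumption
# (implied by the division in the comment): the aggregate graph counts
# each multicast HTCP purge once per pooled server, so dividing by the
# server count recovers the raw purge rate.
aggregate_rate = 33600      # PURGE/s summed across text@eqiad (past week avg)
active_servers = 7          # pooled cp servers that week (cp1053 depooled)

raw_htcp_rate = aggregate_rate / active_servers
print(raw_htcp_rate)        # 4800.0 raw HTCP purges/sec

print(raw_htcp_rate / 4)    # 1200.0 -- htmlCacheUpdate-comparable at the old 4x multiplier
print(raw_htcp_rate / 7)    # ~685.7 -- the ~685/s figure, if the multiplier is really ~7x
```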
[Wikidata-bugs] [Maniphest] [Updated] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. The lack of graph data from falling off the history is a sad commentary on how long this has remained unresolved :( Some salient points from earlier within this ticket, to recap:

In T124418#1985526, @BBlack wrote:

> Continuing with some stuff I was saying in IRC the other day. At the "new normal", we're seeing something in the approximate ballpark of 400/s articles purged (which is then multiplied commonly for ?action="" and mobile, and ends up more like ~1600/s actual HTCP packets), whereas the edit rate across all projects is something like 10/s. That 400/s number used to be somewhere south of 100 before December.

In T124418#1986594, @BBlack wrote:

> Regardless, the average rate of HTCP these days is normally-flat-ish (a few scary spikes aside), and is mostly throttled by the jobqueue. The question still remains: what caused permanent, large bumps in the jobqueue htmlCacheUpdate insertion rate on ~Dec4, ~Dec11, and ~Jan20?

Re: the outstanding patch that's been seeing some bumps ( https://gerrit.wikimedia.org/r/#/c/295027/ ) - the situation has evolved since that patch was first uploaded. Our current maximum TTLs are capped at a single day in all cache layers. However, they can still add up across layers if the race to refresh content plays out just right, with the worst theoretical edge case being 4 total days (fetching from ulsfo when eqiad is the primary). Those edge cases are also bounded by the actual Cache-Control (s-)max-age, but that's currently at two weeks still, I believe, so they don't really come into play. We should probably look at moving the mediawiki-config wgSquidMaxAge (and similar) down to something around 5-7 days, so that it's more reflective of the reality of the situation on the caches.
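The additive worst case described above is just the per-layer cap multiplied by the number of cache layers a refresh can chain through; a small sketch of that reasoning, with the layer count inferred from the "4 total days" figure in the comment (treat it as an assumption):

```python
# Worst-case staleness when per-layer TTL caps add up across cache
# tiers: each layer may serve an object until its own 1-day cap
# expires, and a refresh can chain through up to 4 layers on the
# ulsfo -> eqiad path (layer count inferred from the comment).
PER_LAYER_CAP_DAYS = 1
LAYERS = 4                   # e.g. edge fe, edge be, intermediate be, core be

worst_case_days = PER_LAYER_CAP_DAYS * LAYERS
print(worst_case_days)       # 4 -- the theoretical edge case above

# The Cache-Control s-maxage only bounds this when it's smaller than
# the layered worst case; at ~two weeks it is not, so it doesn't apply:
S_MAXAGE_DAYS = 14
print(min(worst_case_days, S_MAXAGE_DAYS))   # 4
```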
We'll eventually get to a point where we've eliminated the corner-case refreshes and can definitively say that the whole of the cache infrastructure has a hard cap at one full day, but there's more work to do there in T124954 + T50835 (Surrogate-Control) first. I think even now, and especially once we reach that point, purging Varnish for mass invalidations like refreshLinks and templating stops making sense. Those would be spooled out over a fairly long asynchronous period anyway; they can simply get updated as the now-short TTLs expire, reserving immediate HTCP invalidation for actual content edits of specific articles. Those kinds of ideas may need to be a separate discussion in another ticket?
[Wikidata-bugs] [Maniphest] [Reopened] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack reopened this task as "Open". BBlack added a comment. Not resolved, as the purge graphs can attest!
[Wikidata-bugs] [Maniphest] [Commented On] T142944: Performance and caching considerations for article placeholders accesses
BBlack added a comment. I clicked Submit too soon :) Continuing: We'd expect content to be cached at minimum a day, if not significantly longer. MW currently emits 2-week cache headers (with plans to eventually bring that down closer to a day, but those plans are still further off). Cache invalidation is a hard problem, but it's not something we can just ignore, either. Perhaps this should be tied into the broader X-Key effort to sweep these up when the underlying wikidata is updated?
[Wikidata-bugs] [Maniphest] [Commented On] T142944: Performance and caching considerations for article placeholders accesses
BBlack added a comment. Nothing was ever resolved here. 30 minutes seems like an arbitrary number with no formal basis or reasoning, and is way shorter than we'd like for anything article-like.
[Wikidata-bugs] [Maniphest] [Closed] T132457: Move wdqs to an LVS service
BBlack closed this task as "Resolved". BBlack claimed this task.
[Wikidata-bugs] [Maniphest] [Updated] T132457: Move wdqs to an LVS service
BBlack added a parent task: T147844: Standardize varnish applayer backend definitions. TASK DETAIL https://phabricator.wikimedia.org/T132457
[Wikidata-bugs] [Maniphest] [Commented On] T142944: Performance and caching considerations for article placeholders accesses
BBlack added a comment. I think I'm lacking a lot of context here about these special pages and placeholders. But my bottom line thoughts are currently along these lines: How do actual, real-world, anonymous users interact with these placeholders and special pages? What value is it providing the average reader, in what way? How does the scope of the new code and new invalidation problems (especially potential purge traffic) compare to that? Because I tend to think (with what little context I have) that this sounds like a ton of churn on our end for very little real value to the user. Maybe most of the value is to logged-in editors, who don't face invalidation problems in the first place? For the most part, we can categorize the invalidation model of page content into one of two bins: either it's purged on relevant update nearly-immediately (at most, a few seconds' delay for asynchronicity and such), or it's something that sometimes goes stale for some real amount of time, where we really have to think about what happens when users read a stale page, and we need an upper bound on staleness to consider that question properly. Once you're in the latter bin of stale things, there needs to be a rational way to quantify the fallout of a stale view. Is a stale page broken itself, or does it have broken links, or simply outdated content? I tend to think that, in the examples I've seen so far, either something requires immediate invalidation, or staleness isn't a real issue within a reasonable (e.g. hours, days) timeframe. 30 minutes seems arbitrary and probably not tied to a real-world constraint on how broken a stale view is. It sounds more like a compromise because we really want immediate purging but we know the purge volume will be unreasonable.
TASK DETAIL https://phabricator.wikimedia.org/T142944
[Wikidata-bugs] [Maniphest] [Commented On] T142944: Performance and caching considerations for article placeholders accesses
BBlack added a comment. 30 minutes isn't really reasonable, and neither is spamming more purge traffic. If there's a constant risk of the page content breaking without invalidation, how is even 30 minutes acceptable? Doesn't this mean that on average they'll be broken for 15 minutes after an affecting change? TASK DETAIL https://phabricator.wikimedia.org/T142944
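The arithmetic behind that 15-minute estimate: if an affecting change lands at a uniformly random moment within a cached object's fixed TTL window, the stale copy is served for half the TTL on average (and for up to the full TTL in the worst case). A trivial sketch of the calculation:

```python
# Expected staleness under a fixed TTL: an affecting change arriving at a
# uniformly random point within the cached object's lifetime leaves the
# stale copy serving for TTL / 2 on average, and up to TTL at worst.
def expected_staleness_minutes(ttl_minutes):
    return ttl_minutes / 2.0

# With the proposed 30-minute TTL discussed above:
print(expected_staleness_minutes(30))  # 15.0
```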
[Wikidata-bugs] [Maniphest] [Changed Subscribers] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a subscriber: GWicke. BBlack added a comment. @aaron and @GWicke - both patches sound promising, thanks for digging into this topic! TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Commented On] T134989: WDQS empty response - transfer closed with 15042 bytes remaining to read
BBlack added a comment. The cache_maps cluster switched to the new varnish package today. TASK DETAIL https://phabricator.wikimedia.org/T134989
[Wikidata-bugs] [Maniphest] [Commented On] T134989: WDQS empty response - transfer closed with 15042 bytes remaining to read
BBlack added a comment. Current State: - cp3007 and cp1045 are depooled from user traffic, icinga-downtimed for several days, and have puppet disabled. Please do not re-enable puppet on these! They also have confd shut down, and are running custom configs to continue debugging this issue under varnish4. - The rest of cache_misc is reverted to varnish3, which should temporarily resolve this issue for user traffic over the weekend and into next week while we continue isolated investigation using the two nodes above. - Please do **not** resolve this ticket - this is a bandaid, and we still have a lot of digging to do. - Please **do** report any similar user-facing failures from here forward, as there shouldn't be any while the cluster is reverted to varnish3. TASK DETAIL https://phabricator.wikimedia.org/T134989
[Wikidata-bugs] [Maniphest] [Block] T133490: Wikidata Query Service REST endpoint returns truncated results
BBlack reopened blocking task T131501: Convert misc cluster to Varnish 4 as "Open". TASK DETAIL https://phabricator.wikimedia.org/T133490
[Wikidata-bugs] [Maniphest] [Updated] T134989: WDQS empty response - transfer closed with 15042 bytes remaining to read
BBlack added a blocked task: T131501: Convert misc cluster to Varnish 4. TASK DETAIL https://phabricator.wikimedia.org/T134989
[Wikidata-bugs] [Maniphest] [Commented On] T134989: WDQS empty response - transfer closed with 15042 bytes remaining to read
BBlack added a comment. I forgot one of our temporary hacks in the list above in https://phabricator.wikimedia.org/T134989#2290254: 4. https://gerrit.wikimedia.org/r/#/c/288656/ - we also enabled a critical small bit here in v4 vcl_hit. I reverted this for now during the varnish3 downgrade. We need to remember to re-apply it once we find the right bug and start cleaning everything up for the upgrade again... TASK DETAIL https://phabricator.wikimedia.org/T134989
[Wikidata-bugs] [Maniphest] [Commented On] T134989: WDQS empty response - transfer closed with 15042 bytes remaining to read
BBlack added a comment. Has anyone been able to reproduce any of the problems in the tickets merged into here, since roughly the timestamp of the above message? TASK DETAIL https://phabricator.wikimedia.org/T134989
[Wikidata-bugs] [Maniphest] [Commented On] T134989: WDQS empty response - transfer closed with 15042 bytes remaining to read
BBlack added a comment. So we currently have several experiments in play trying to figure this out: 1. We've got 2x upstream bugfixes applied to our varnishd on cache_misc: https://github.com/varnishcache/varnish-cache/commit/d828a042b3fc2c2b4f1fea83021f0d5508649e50 + https://github.com/varnishcache/varnish-cache/commit/e142a199c53dd9331001cb29678602e726a35690 2. We've temporarily removed all of our Content-Length-sensitive VCL that was on cache_misc (basically https://gerrit.wikimedia.org/r/#/c/288231/ , which at one point we partially put back, but then removed again) 3. We've switched from persistent to file storage on the misc backends, manually with puppet disabled. Puppetization to make that semi-permanent if need be: https://gerrit.wikimedia.org/r/288440 (untested) I don't think anyone has reproduced the problem since (3) went live everywhere. So we're in a new state and need proof that things are still messed up (or not!). TASK DETAIL https://phabricator.wikimedia.org/T134989
[Wikidata-bugs] [Maniphest] [Commented On] T134989: WDQS empty response - transfer closed with 15042 bytes remaining to read
BBlack added a comment. In the merged ticket above, it's browser access to status.wm.o, and the browser's getting a 304 Not Modified and complaining about it (due to missing character encoding supposedly, but it's entirely likely it's missing everything and that's just the first thing it notices). TASK DETAIL https://phabricator.wikimedia.org/T134989
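For background on the 304 path described above: a client revalidates with If-None-Match (or If-Modified-Since), and a correct 304 carries no body, so the browser must reuse its cached copy along with its stored headers, including the character encoding it complained about. A minimal sketch of the server-side decision; the function name, ETag values, and body are illustrative, not the actual implementation behind status.wm.o:

```python
# Server-side conditional GET, sketched: on an ETag match, answer 304 with
# an empty body (the client reuses its cached body and headers); otherwise
# answer 200 with a fresh body. Names and values here are hypothetical.
def revalidate(current_etag, if_none_match):
    if if_none_match is not None and if_none_match == current_etag:
        return 304, b""                    # no body: client uses its cache
    return 200, b"<html>...</html>"        # illustrative fresh body

print(revalidate('"abc123"', '"abc123"'))  # (304, b'')
```

The browser-side failure in the merged ticket is consistent with the 304 arriving when the browser's cached entry is already missing or incomplete, leaving it with neither a fresh body nor usable stored headers.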
[Wikidata-bugs] [Maniphest] [Merged] T134989: WDQS empty response - transfer closed with 15042 bytes remaining to read
BBlack added a subscriber: Kghbln. BBlack merged a task: T135121: stats.wikimedia.org down. TASK DETAIL https://phabricator.wikimedia.org/T134989
[Wikidata-bugs] [Maniphest] [Commented On] T134989: WDQS empty response - transfer closed with 15042 bytes remaining to read
BBlack added a comment. Status update: we've been debugging this off and on all day. It's some kind of bug fallout from cache_misc's upgrade to Varnish 4. It's a very complicated bug, and we don't really understand it yet. We've made some band-aid fixes to VCL for now which should keep the problem at bay while investigating further. TASK DETAIL https://phabricator.wikimedia.org/T134989
[Wikidata-bugs] [Maniphest] [Commented On] T134989: WDQS empty response - transfer closed with 15042 bytes remaining to read
BBlack added a comment. Assuming there was no transient issue (which became cached) on the wdqs end of things, this was likely transient fallout from the nginx experiments or the cache_misc varnish4 upgrade. I banned all wdqs objects from cache_misc and now your test URL works fine. Can you confirm? TASK DETAIL https://phabricator.wikimedia.org/T134989
[Wikidata-bugs] [Maniphest] [Closed] T133866: Varnish seems to sometimes mangle uncompressed API results
BBlack closed this task as "Resolved". BBlack added a comment. My test cases on cache_text work now, so this should be resolved! TASK DETAIL https://phabricator.wikimedia.org/T133866
[Wikidata-bugs] [Maniphest] [Closed] T133490: Wikidata Query Service REST endpoint returns truncated results
BBlack closed this task as "Resolved". BBlack claimed this task. BBlack added a comment. This works now. There's a significant pause at the start of the transfer from the user's perspective if it's not a cache hit, because streaming is disabled as a workaround (so it has to completely load the data into each cache layer before starting the data stream to the user), but it does function correctly. The non-streamed pause behavior will go away with https://phabricator.wikimedia.org/T131501 . TASK DETAIL https://phabricator.wikimedia.org/T133490
[Wikidata-bugs] [Maniphest] [Updated] T133490: Wikidata Query Service REST endpoint returns truncated results
BBlack added a comment. We now have some understanding of the mechanism of this bug ( https://phabricator.wikimedia.org/T133866#2275985 ). It should go away in the imminent varnish 4 upgrade of the misc cluster in https://phabricator.wikimedia.org/T131501. TASK DETAIL https://phabricator.wikimedia.org/T133490
[Wikidata-bugs] [Maniphest] [Updated] T133490: Wikidata Query Service REST endpoint returns truncated results
BBlack added a blocking task: T131501: Convert misc cluster to Varnish 4. TASK DETAIL https://phabricator.wikimedia.org/T133490
[Wikidata-bugs] [Maniphest] [Updated] T133866: Varnish seems to sometimes mangle uncompressed API results
BBlack added a comment. So, as it turns out, this is a general varnishd bug in our specific varnishd build. For purposes of this bug, our varnishd code is essentially 3.0.7 plus a bunch of ancient forward-ported 'plus' patches related to streaming, and we're missing https://github.com/varnishcache/varnish-cache/commit/72981734a141a0a52172b85bae55f8877f69ff42 (a do_gzip + do_stream content-length bug for HTTP/1.0 requests, which is eerily similar to this issue, but not quite the same) because it doesn't apply cleanly/sanely to our codebase due to conflicts with the former. What I can reliably and predictably observe and control for now is: we have a response-length-specific response corruption bug, only when both of these conditions are met: 1. do_stream is in effect for this request (for the text cluster, this means it's pass or initial miss+chfp(Created-Hit-For-Pass) traffic) 2. the response has to be gunzipped for the client (the client does not advertise gzip support, but the backend response is gzipped by the applayer, or gzipped by varnish due to do_gzip rules). In a lot of the test scenarios/requests that others and I were using previously, we weren't necessarily controlling for these variables well, which led to a lot of inconsistent results (notably, X-Wikimedia-Debug effectively turns non-pass traffic into pass-traffic when debugging, but the same might not be true if testing directly from varnish to mw1017 without X-Wikimedia-Debug). The do_gzip (and related gunzip) behaviors have been in place for a long time. What's new lately is the do_stream behaviors. These were added to the cache_text cluster in the past couple of months for the pass-traffic cases. cache_upload has had do_stream for certain requests for a very long time, but various constraints there conspire to make it accidentally-unlikely we'll observe this bug on cache_upload for legitimate traffic.
cache_misc probably suffers from this as well; the conditions under which it will or won't are trickier there, but this is almost surely related to https://phabricator.wikimedia.org/T133490 as well. So the basic game plan for this bug is: cache_text - revert the relatively-recent do_stream-enabling VCL patches. cache_misc - will resolve itself with the varnish4 upgrade, which is imminent for this cluster. cache_upload - keep ignoring what is probably a non-problem in practice there for now; it will eventually get fixed up with the varnish 4 upgrade. cache_maps - already varnish4, so it wouldn't have this issue. TASK DETAIL https://phabricator.wikimedia.org/T133866
[Wikidata-bugs] [Maniphest] [Unblock] T133866: Varnish seems to sometimes mangle uncompressed API results
BBlack closed blocking task Restricted Task as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T133866
[Wikidata-bugs] [Maniphest] [Triaged] T133866: Varnish seems to sometimes mangle uncompressed API results
BBlack triaged this task as "High" priority. TASK DETAIL https://phabricator.wikimedia.org/T133866
[Wikidata-bugs] [Maniphest] [Updated] T133866: Varnish seems to sometimes mangle uncompressed API results
BBlack added a blocking task: Restricted Task. TASK DETAIL https://phabricator.wikimedia.org/T133866
[Wikidata-bugs] [Maniphest] [Updated] T133866: Varnish seems to sometimes mangle uncompressed API results
BBlack added a comment. Thanks for merging in the probably-related tasks. I had somehow missed really noticing T123159 earlier... So probably digging into gunzip itself isn't a fruitful path. I'm going to open a separate blocker for this that's private, so we can keep merging public tickets into this... TASK DETAIL https://phabricator.wikimedia.org/T133866
[Wikidata-bugs] [Maniphest] [Commented On] T133866: Varnish seems to sometimes mangle uncompressed API results
BBlack added a comment. Did some further testing on an isolated test machine, using our current varnish3 package. - Got the 2833-byte test file from uncorrupted (--compressed) output on prod. This is the exact compressed content bytes emitted by MW/Apache for the broken test URL. - Configured a test backend (nginx) to serve static files, and to always set CE:gzip. - Placed the gzipped 2833-byte file in the test directory, fetched over curl w/ --compressed, md5sum comes out right. - When fetched through our varnishd with a default config and do_gzip turned on, varnish does decompress this file for the curl client, and there is no corruption (still the same md5sum). This rules out the possibility that this is some pure, data-sensitive varnish bug with gunzipping the content itself. However, the notable diff in this test from reality is that nginx serving a static pre-gzipped file is (a) not emitting it as TE:chunked and (b) even if it did, it probably wouldn't use the same chunk boundaries, nor is it likely to share any TE:chunked encoding bugs or varnish-bug-triggers... TASK DETAIL https://phabricator.wikimedia.org/T133866
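The static-file control experiment above can be approximated without nginx or varnish at all: compress a payload, decompress it the way the cache would for a non-AE:gzip client, and compare checksums. A self-contained sketch of that round-trip check, with a hypothetical payload standing in for the 2833-byte MW/Apache response:

```python
import gzip
import hashlib

# Miniature version of the control test: `payload` stands in for the real
# response body (hypothetical content). The backend would emit `compressed`
# with CE:gzip; the cache gunzips it for a client that did not send
# AE:gzip. A matching checksum shows plain gunzip of this data is not
# itself corrupting, mirroring the "no corruption" result above.
payload = b'{"query": {"pages": {}}}' * 100
compressed = gzip.compress(payload)
decompressed = gzip.decompress(compressed)

assert hashlib.md5(payload).hexdigest() == hashlib.md5(decompressed).hexdigest()
print("round-trip ok")
```

As the comment notes, the limitation of any such static test is that it exercises neither TE:chunked framing nor the backend's actual chunk boundaries, which is where the real bug's trigger likely lives.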
[Wikidata-bugs] [Maniphest] [Commented On] T133866: Varnish seems to sometimes mangle uncompressed API results
BBlack added a comment. Just jotting down the things I know so far from investigating this morning. I still don't have a good answer yet. Based on just the test URL, debugging it extensively at various layers: 1. The response size of that URL is in the ballpark of 32KB uncompressed, so this is not a large-objects issue. It also streams out of the backend reliably and quickly, there are no significant timing pauses. 2. Without a doubt, anytime the client doesn't use AE:gzip when talking to the public endpoint, the response is corrupt. 3. Slightly deeper, even when testing requests against a single layer of varnish internally, the response to a non-AE:gzip client request is corrupt. 4. It's definitely happening at the Varnish<->Apache/MediaWiki boundary (disagreement) or within a single varnishd process as it prepares the response. It's not a Varnish<->Varnish, Varnish<->Nginx, or Nginx<->client issue. 5. All of the gzip/chunked stuff looks basically correct in the headers at varnish/apache boundary and varnish/client boundary. We do send AE:gzip when expected, we do get CE:gzip when expected (only when asked for), the gzip-ness of MW/Apache's output always correctly follows its CE:gzip header (or lack thereof), etc. 6. Curl has no problem parsing the output of a direct fetch from Apache/MediaWiki, whether using `--compressed` to set AE:gzip or not, and the results hash the same (identical content). 7. Varnish emits the corrupt content for the non-AE:gzip client regardless of whether I tweak the test scenario to ensure that varnish is the one gzipping the content, or that Apache/Mediawiki are the ones gzipping the content. So it's not an error in gzip compression of the response by just one party or the other. The error happens when gunzipping the response for the non-AE:gzip client. 8. 
However, when I run through a similar set of fully-debugged test scenarios for https://en.wikipedia.org/wiki/Barack_Obama , which is ~1MB in uncompressed length, and similarly TE:chunked with backend gzip capabilities and do_gzip=true, and on the same cluster and VCL (and even same test machine), I don't get the corruption for a non-AE:gzip client, even though varnish is decompressing that on the fly as with the bad test URL. The obvious distinctions here between the Barack article and the failing test URL aren't much: api.svc vs appservers.svc shouldn't matter; they're both behind the same apache and hhvm configs. This leaves me guessing that there's something special about the specific output of the test URL that's causing this. There's almost certainly a varnish bug involved here, but the question is: is this a pure varnish gunzip bug that's sensitive to certain conditions which exist for the test URL but not the Barack one? Is the output of Apache/MW buggy in some way for the test URL such that it's tripping the bug (in which case it's still a varnish bug that it doesn't reject the buggy response and turn it into a 503 or similar), or is it non-buggy, but "special" in a way that trips a varnish bug? TASK DETAIL https://phabricator.wikimedia.org/T133866
[Wikidata-bugs] [Maniphest] [Commented On] T133866: Varnish seems to sometimes mangle uncompressed API results
BBlack added a comment. Do you know if some normal traffic is affected, such that we'd know a start date for a recent change in behavior? Or is it suspected that it was always this way? I've been digging through some debugging on this URL (which is an applayer chunked-response with no-cache headers), and it's definitely happening at the varnish<->MW boundary (as opposed to further up, at varnish<->varnish or nginx<->varnish), and only for non-AE:gzip requests. The length of the result is correct, but there's corruption in the trailing bytes. TASK DETAIL https://phabricator.wikimedia.org/T133866
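A quick way to localize the "length correct, trailing bytes corrupt" symptom described above is to diff a known-good body against the corrupt one byte by byte and report the first divergence. A small helper sketch; this is hypothetical, not part of the actual debugging tooling used on the task:

```python
def first_divergence(a, b):
    # Returns the offset of the first differing byte, or None if the two
    # responses are identical. For the symptom above (same length, corrupt
    # tail), the offset lands near the end of the body, not the start.
    if len(a) != len(b):
        return min(len(a), len(b))
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None

print(first_divergence(b"hello world", b"hello worlX"))  # 10
```

Knowing whether the divergence offset is stable across requests, and whether it aligns with a chunk boundary, is the kind of signal that distinguishes a framing bug from a data-sensitive gunzip bug.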
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. I really don't think it's specifically Wikidata-related either at this point. Wikidata might be a significant driver of update jobs in general, but the code changes driving the several large rate increases were probably generic to all update jobs. TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Updated] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a blocked task: T133821: Content purges are unreliable. TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Updated] T102476: RFC: Requirements for change propagation
BBlack added a blocked task: T133821: Content purges are unreliable. TASK DETAIL https://phabricator.wikimedia.org/T102476
[Wikidata-bugs] [Maniphest] [Updated] T133490: Wikidata Query Service REST endpoint returns truncated results
BBlack edited projects, added Traffic; removed Varnish. TASK DETAIL https://phabricator.wikimedia.org/T133490
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. F3845100: Screen Shot 2016-04-07 at 7.47.28 PM.png <https://phabricator.wikimedia.org/F3845100> TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Edited] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack edited the task description. TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Commented On] T121135: Banners fail to show up occassionally on Russian Wikivoyage
BBlack added a comment. In https://phabricator.wikimedia.org/T121135#1910435, @Atsirlin wrote: > @Legoktm: Frankly speaking, for a small project like Wikivoyage the cache brings no obvious benefits, but triggers many serious issues including the problem of page banners and ToC. The dirty trick of automatic cache purge worked perfectly fine in the last 3 weeks, and I believe that it could be used further even if it violates some general philosophy of the Mediawiki software. I am fine with having this cache purge feature reverted, but you have to propose another solution. At this point, having stable page banners and ToC is very important for us, while anything related to the cache is of minor relevance to the project. That JS hack, if I'm reading it correctly, effectively sends us a purge on every pageview? That's horrendous and abusive of our infrastructure, and the problem could grow if people start copying it to other wikis to work around other perceived problems. TASK DETAIL https://phabricator.wikimedia.org/T121135
[Wikidata-bugs] [Maniphest] [Updated] T127014: Empty result on a tree query
BBlack added a blocking task: T128813: cache_misc's misc_fetch_large_objects has issues. TASK DETAIL https://phabricator.wikimedia.org/T127014
[Wikidata-bugs] [Maniphest] [Updated] T127014: Empty result on a tree query
BBlack edited projects, added Traffic; removed Varnish. TASK DETAIL https://phabricator.wikimedia.org/T127014
[Wikidata-bugs] [Maniphest] [Commented On] T127014: Empty result on a tree query
BBlack added a comment. I did some live experimentation with manual edits to the VCL. It is the `between_bytes_timeout`, but the situation is complex. The timeout that's failing is on the varnish frontend fetching from the varnish backend. That timeout is fixed at 2s, but because this is all in do_stream mode, the stream delays come through directly. So, as I guessed earlier, this is all inter-related with https://phabricator.wikimedia.org/T128813. We should fix that issue first before deciding what to do about the between-bytes timeout for varnish<->varnish (where it's not per-service...). TASK DETAIL https://phabricator.wikimedia.org/T127014
[Wikidata-bugs] [Maniphest] [Updated] T127014: Empty result on a tree query
BBlack added a comment. This is probably due to backend timeouts, I would guess. The default applayer settings being applied to wdqs include `between_bytes_timeout` at only 4s, whereas `first_byte_timeout` is 185s. So if wdqs delayed all output, it would have 3 minutes or so, but once it outputs the first byte, any delay over 4s will kill it. Although I'm surprised that doesn't result in a 503. There's probably also multi-layer interaction with https://phabricator.wikimedia.org/T128813 and the do_stream and such... TASK DETAIL https://phabricator.wikimedia.org/T127014
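To make the distinction concrete, here is a minimal illustrative simulation (not Varnish code; the helper name and chunk representation are invented for this sketch) of how a first-byte timeout and a between-bytes timeout apply to different parts of a streamed response:

```python
def read_stream(chunks, first_byte_timeout, between_bytes_timeout):
    """Simulate fetching a streamed response.

    chunks: list of (delay_seconds, data) pairs, where delay_seconds is how
    long the backend pauses before emitting that chunk. The first chunk is
    bounded by first_byte_timeout; every later gap is bounded by the much
    smaller between_bytes_timeout.
    """
    received = []
    for i, (delay, data) in enumerate(chunks):
        limit = first_byte_timeout if i == 0 else between_bytes_timeout
        if delay > limit:
            raise TimeoutError(f"chunk {i}: waited {delay}s, limit {limit}s")
        received.append(data)
    return b"".join(received)

# With the values quoted above (185s / 4s): a backend that thinks for 100s
# before its first byte is fine, but a 5s pause mid-stream kills the fetch.
```

This matches the behavior described above: long up-front query time is tolerated, but any mid-stream stall over the between-bytes limit aborts the response.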
[Wikidata-bugs] [Maniphest] [Commented On] T126730: [RFC] Caching for results of wikidata Sparql queries
BBlack added a comment. In https://phabricator.wikimedia.org/T126730#2034900, @Christopher wrote: > I may be wrong, but the headers that are returned from a request to the nginx > server wdqs1002 say that varnish 1.1 is already being used there. It's varnish 3.0.6 currently (4.x is coming down the road). > And, for whatever reason, **it misses**, because repeating the same query > gives the same response time. It misses because the response is sent with `Transfer-Encoding: chunked`. If it were sent un-chunked with a Content-Length, the varnish would have a chance at caching it. However, the next thing you'd run into is that the response doesn't contain any caching-relevant headers (e.g. `Expires`, `Cache-Control`, `Age`). Lacking these, varnish would cache it with our configured default_ttl, which on the misc cluster where `query.wikidata.org` is currently hosted, is only 120 seconds. > Even though Varnish cache **should work** to proxy nginx for optimizing > delivery of static query results, it lacks several important features of an > object broker. Namely, client control of object expiration (TTL) and > retrieval of "named query results" from persistent storage. > > A WDQS service use case may in fact be to compare results from several days > ago with current results. Thus, assuming the latest results state is what > the client wants may actually not be true. I think all of this is doable. Named query results is something we talked about in the previous discussion re `GET` length restrictions. `POST`ing (and/or server-side configuring, either way!) a complex query and saving it as a named query through a separate query-setup interface, then executing the query for results with a `GET` on just the query name. I don't think we really want client control of object expiration (at least, not "varnish cache object expiration"), but what we want is the ability to parameterize named queries based on time, right? e.g. 
a named query that gives a time-series graph might have parameters for start time and duration. You might initially post the complex SPARQL template and save it as `fooquery`, then later have a client get it as `/sparql?saved_query=fooquery&start=201601011234&duration=1w`. Varnish would have the chance to cache those based on the query args as separate results, and you could limit the time resolution if you want to enhance cacheability. If it's for inclusion from a page that wants to graph that data and always show a "current" graph rather than hardcoded start/duration (and I could see use-cases for both in articles), you could support a start time of `now` with an optional resolution specifier that defaults to 1 day, like `start=now/1d`. The response to such a query would set cache-control headers that allow caching at varnish up to 24H (based on `now/1d` resolution), which means everyone executing that query gets new results about once a day and they all share a single cached result per day. The important thing here is there's no need for a client to have control over result object expiration if the query encodes everything that's relevant to expiration and the maximum cache lifetime is set small enough that other effects (e.g. data updates to existing historical data) are negligible in the big picture. > Possibly, the optimal solution would use the varnish-api-engine > (http://info.varnish-software.com/blog/introducing-varnish-api-engine) in > conjunction with a WDQS REST API (provided with a modified RESTBase?). Is > the varnish-api-engine being used anywhere in WMF? Also, delegating query > requests to an API could allow POSTs. Simply with Varnish cache, the POST > problem would remain unresolved. We're not using the Varnish API Engine, and I don't see us pursuing that anytime soon. Most of what it does can be done other ways, and more importantly it's commercial software. There seems to be some confusion as to whether `POST` is or isn't still an issue here... 
Also, a whole separate issue is that WDQS is currently mapped through our `cache_misc` cluster. That cluster is for small lightweight miscellaneous infrastructure. WDQS was probably always a poor match for that, but we put it there because at the time it was seen as being a lightweight / low-rate service that would mostly be used directly by humans to execute one-off complicated queries. The plans in this ticket sound nothing like that, and `cache_misc` probably isn't an appropriate home for a complex query service that's going to backend serious query load from wikis and the rest of the world... TASK DETAIL https://phabricator.wikimedia.org/T126730
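The time-resolution idea described above can be sketched in a few lines. This is a hypothetical illustration (the endpoint, parameter names `saved_query`/`start`/`duration`, and the resolution table are all invented for this sketch): a client-supplied start of `now` is snapped to a resolution bucket before it becomes part of the URL, so everyone asking for "current" data within the same bucket shares one cached result.

```python
from datetime import datetime, timezone

# Hypothetical resolution specifiers -> bucket size in seconds.
RESOLUTIONS = {"1h": 3600, "1d": 86400}

def normalize_start(start, now=None):
    """Turn a start spec into a concrete, cache-friendly timestamp.

    Explicit timestamps pass through unchanged (no forced expiry).
    "now" or "now/<res>" is rounded down to the resolution bucket, and the
    bucket size doubles as a sensible max-age for the cached result.
    """
    if not start.startswith("now"):
        return start, None
    res = start.split("/", 1)[1] if "/" in start else "1d"
    secs = RESOLUTIONS[res]
    now = now if now is not None else datetime.now(timezone.utc).timestamp()
    bucket = int(now) - (int(now) % secs)
    stamp = datetime.fromtimestamp(bucket, timezone.utc).strftime("%Y%m%d%H%M")
    return stamp, secs

def cache_url(name, start, duration, now=None):
    """Build the query URL varnish would key on, plus caching headers."""
    norm, max_age = normalize_start(start, now)
    url = f"/sparql?saved_query={name}&start={norm}&duration={duration}"
    headers = {"Cache-Control": f"max-age={max_age}"} if max_age else {}
    return url, headers
```

With `start=now/1d`, every request inside the same UTC day produces the identical URL and a `max-age=86400` header, which is exactly the "one shared cached result per day" property described above.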
[Wikidata-bugs] [Maniphest] [Commented On] T125392: [Task] figure out the ratio of page views by logged-in vs. logged-out users
BBlack added a comment. In https://phabricator.wikimedia.org/T125392#1994242, @Milimetric wrote: > @BBlack - so you think cache_status is not even close to accurate? Do we > have other accurate measurements of it so we could compare to what extent > it's misleading? I'm happy to remove it from the data if it's really bad. On this topic, it is pretty misleading, and we do have other stats we look at more-manually to compare. We don't have a good singular, simple replacement for cache_status to include in analytics yet, though. What we do have (that we've looked at manually in some cases lately) is the `X-Cache` response header. That header has evolved a bit in how it's generated over the past couple of months so that it's less-misleading than it was before, but it still requires various regex operations to bin responses according to what exactly one is trying to measure. But for an example, this pseudo-code would be an accurate way to put all requests into 3 distinct non-overlapping bins based on X-Cache regex:

if (X-Cache ~ / hit/) {
    print "This is a real cache object hit";
} else if (X-Cache ~ / int/) {
    print "This response was generated internally by varnish (e.g. 301 redirect for HTTPS, desktop->mobile redirect on UA detect, some kinds of error response, etc)";
} else {
    print "This is a cache miss or a cache pass (pass would be due to uncacheable content, which is more-often true for logged-in users than others, but exists in both cases in notable numbers)";
}

However, I think X-Cache's raw data is still open to further modification. Ideally we'll build on top of this and start emitting some standard, simple header that can be one of N simple strings and reflects overall cache status bins (and hopefully with better detail as to miss-vs-pass and the nature of the pass to some degree). 
TASK DETAIL https://phabricator.wikimedia.org/T125392
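The binning pseudo-code above translates directly into a runnable form. This is an illustrative sketch (the sample X-Cache header values are invented; real values vary by cluster and layer count), using the same three regexes and the same precedence order:

```python
import re

def bin_x_cache(x_cache):
    """Put a response into one of 3 non-overlapping bins by X-Cache regex.

    Same logic as the pseudo-code above: " hit" anywhere means a real cache
    object hit at some layer; otherwise " int" means varnish generated the
    response internally; otherwise it was a miss or a pass.
    """
    if re.search(r" hit", x_cache):
        return "hit"           # a real cache object hit
    if re.search(r" int", x_cache):
        return "int"           # generated internally by varnish (e.g. HTTPS 301)
    return "miss-or-pass"      # miss, or pass due to uncacheable content
```

Note the checks must run in this order: an internally-generated redirect can never be a hit, but a multi-layer header could contain both "miss" and "hit" tokens, and a hit at any layer counts as a hit.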
[Wikidata-bugs] [Maniphest] [Commented On] T126730: [RFC] Caching for results of wikidata Sparql queries
BBlack added a comment. IIRC, the problem we've beat our heads against in past SPARQL-related tickets is the fact that SPARQL clients are using `POST` method for readonly queries, due to argument length issues and whatnot. On the surface, that's a dealbreaker for caching them, as `POST` isn't cacheable. The conflict here comes from a fundamental limitation of HTTP: the only idempotent/readonly methods have un-ignorable input data length restrictions. There are probably ways to design around that in a scratch design, but SPARQL software is already-written... TASK DETAIL https://phabricator.wikimedia.org/T126730
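One way to sketch the design-around mentioned above is the register-then-GET pattern: a long SPARQL query is registered once (via POST), and thereafter executed via a short, cacheable GET keyed on the query's hash. This is purely illustrative; the endpoint shape, the `saved_query` parameter, and the in-memory store are all invented for this sketch.

```python
import hashlib

# Stand-in for persistent named-query storage on the server side.
_saved = {}

def register_query(sparql):
    """POST handler sketch: save a long query under a short stable name."""
    name = hashlib.sha256(sparql.encode("utf-8")).hexdigest()[:16]
    _saved[name] = sparql
    return name  # client would then use: GET /sparql?saved_query=<name>

def execute_saved(name):
    """GET handler sketch: look up and (in a real service) run the query."""
    return _saved[name]
```

The GET URL stays well under any practical length limit regardless of query size, and because the name is a content hash, re-registering the same query is idempotent and yields the same cacheable URL.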
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. So, current thinking is that at least one (maybe two?) of the bumps are from moving what used to be synchronous HTCP purges during requests to JobRunner jobs which should be doing the same thing. However, assuming it's that alone (or even just investigating that part in isolation), we're still left with "why did the resulting rate of HTCP purges go up by unexpected multiples just from moving them to the jobqueue?". TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. Well then apparently the 10/s edits to all projects number I found before is complete bunk :) http://wikipulse.herokuapp.com/ has numbers for wikidata edits that approximately line up with yours, and then shows Wikipedias at about double that rate (which might be a reasonable interpretation of the distribution shown). TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Updated] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. heh so: https://phabricator.wikimedia.org/T113192 -> https://gerrit.wikimedia.org/r/#/c/258365/5 is probably the Jan 20 bump. TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Commented On] T125392: [Task] figure out the ratio of page views by logged-in vs. logged-out users
BBlack added a comment. FYI - "cache_status" is not an accurate reflection of anything. I'm not sure why we really even log it for analytics. The problem is that it only reflects some varnish state about the first of up to 3 layers of caching, and even then it does so poorly. TASK DETAIL https://phabricator.wikimedia.org/T125392
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. @ori - yeah that makes sense for the initial bump, and I think there may have even been a followup to do deferred purges, which may be one of the other multipliers, but I haven't found it yet (as in, insert an immediate job and also somehow insert one that fires a little later to cover race conditions). TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. Another data point from the weekend: In one sample I took Saturday morning, when I sampled for 300s, the top site being purged was srwiki, and something like 98% of the purges flowing for srwiki were all Talk: pages (well, with Talk: as %-encoded something in Serbian). When I visited random examples of the purged Talk pages, the vast majority of the ones I checked were content-free (as in, nobody had talked about the given article at all yet, it was showing the generic initial blob). These had to be coming from a job obviously, the question is what kind of job wants to rip through (probably) every talk page on a wiki, blank ones included, for purging (or alternatively, it was ripping through the entire page list, and I just happened to catch it on a batch of Talk: ones)? @faidon suggested a template used in those pages, but then what's triggering the template? Surely not wikidata on a template for blank talk pages? TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. Regardless, the average rate of HTCP these days is normally-flat-ish (a few scary spikes aside), and is mostly throttled by the jobqueue. The question still remains: what caused permanent, large bumps in the jobqueue htmlCacheUpdate insertion rate on ~Dec4, ~Dec11, and ~Jan20? TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Updated] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. @daniel - Sorry, I should have linked this earlier; I made a paste at the time: https://phabricator.wikimedia.org/P2547 . Note that `/%D0%A0%D0%B0%D0%B7%D0%B3%D0%BE%D0%B2%D0%BE%D1%80:` is the Serbian srwiki version of `/Talk:`. TASK DETAIL https://phabricator.wikimedia.org/T124418
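For anyone double-checking the decoding claim above, the %-encoded path prefix from the paste decodes with nothing more than Python's stdlib:

```python
from urllib.parse import unquote

# The purged srwiki paths from the paste all start with this prefix, which
# percent-decodes (UTF-8) to the Serbian Cyrillic "Talk:" namespace.
prefix = unquote("/%D0%A0%D0%B0%D0%B7%D0%B3%D0%BE%D0%B2%D0%BE%D1%80:")
```

`prefix` comes out as `/Разговор:`, i.e. the srwiki equivalent of `/Talk:`.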
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. Well, we have 3 different stages of rate-increase in the insert graph, so it could well be that we have 3 independent causes to look at here. And it's not necessarily true that any of them are buggy, but we need to understand what they're doing and why, because maybe something or other can be tweaked or tuned to be less wasteful. Fundamentally nothing really changed in the past month or two; it's not like we gained a 5x increase in human article editing rate... TASK DETAIL https://phabricator.wikimedia.org/T124418
[Wikidata-bugs] [Maniphest] [Commented On] T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
BBlack added a comment. Continuing with some stuff I was saying in IRC the other day. At the "new normal", we're seeing something in the approximate ballpark of 400/s articles purged (which is then multiplied commonly for ?action=history and mobile and ends up more like ~1600/s actual HTCP packets), whereas the edit rate across all projects is something like 10/s. That 400/s number used to be somewhere south of 100/s before December. TASK DETAIL https://phabricator.wikimedia.org/T124418