Hey,

Forgive my ignorance; I don't know much about the infrastructure of WDQS and how it works. I just want to mention how the application servers handle this. In the appservers, there are dedicated nodes both for Apache and for the replica database, so if a bot overdoes things on Wikipedia (which happens quite a lot), users won't feel anything; only the other bots take the hit. Routing based on the user agent seems hard, though, while it's easy in MediaWiki (if you hit api.php, we assume you're a bot).
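To make the appserver analogy concrete, the routing decision can be sketched as a front-end rule: classify each request by its User-Agent header and send generic agents to a separate backend pool, so a runaway bot degrades only that pool. This is a minimal sketch; the pool names and the list of "generic" agent prefixes are my own assumptions for illustration, not actual Wikimedia configuration.

```python
# Sketch: route requests to a backend pool based on the User-Agent header.
# Pool names and the prefix list are hypothetical, not real WMF config.

GENERIC_AGENT_PREFIXES = (
    "python-requests/", "java/", "curl/", "wget/", "okhttp/", "go-http-client/",
)

def choose_pool(user_agent: str) -> str:
    """Send requests with an empty or generic UA to a separate pool,
    so overload there does not affect interactive users."""
    ua = (user_agent or "").strip().lower()
    if not ua or ua.startswith(GENERIC_AGENT_PREFIXES):
        return "bot-pool"
    return "user-pool"

print(choose_pool("python-requests/2.22.0"))                        # bot-pool
print(choose_pool("MyCoolTool/1.0 (https://example.org; ops@example.org)"))  # user-pool
```

The catch, as the quoted message explains, is that this only separates traffic coarsely: everything in the generic bucket, well-behaved or not, shares one fate.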
Did you consider this for a more long-term solution?

Best

On Tue, 23 Jul 2019 at 09:43, Stas Malyshev <smalys...@wikimedia.org> wrote:
> Hello all!
>
> Here is (at last!) an update on what we are doing to protect the
> stability of Wikidata Query Service.
>
> For 4 years we have been offering Wikidata users the Query Service, a
> powerful tool that allows anyone to query the content of Wikidata
> without any identification needed. This means that anyone can use the
> service from a script and make heavy or very frequent requests.
> However, this freedom has led to the service being overloaded by too
> large a volume of queries, causing the issues or lag that you may have
> noticed.
>
> A reminder about the context:
>
> We have had a number of incidents where the public WDQS endpoint was
> overloaded by bot traffic. We don't think that any of that activity was
> intentionally malicious, but rather that the bot authors most probably
> don't understand the cost of their queries and the impact they have on
> our infrastructure. We've recently seen more distributed bots, coming
> from multiple IPs at cloud providers. This kind of pattern makes it
> harder and harder to filter or throttle an individual bot. The impact
> has ranged from increased update lag to full service interruption.
>
> What we have been doing:
>
> While we would love to allow anyone to run any query they want at any
> time, we're not able to sustain that load, and we need to be more
> aggressive in how we throttle clients. We want to be fair to our users
> and allow everyone to use the service productively. We also want the
> service to be available to the casual user and provide up-to-date access
> to the live Wikidata data. And while we would love to throttle only
> abusive bots, to be able to do that we need to be able to identify them.
>
> We have two main means of identifying bots:
>
> 1) their user agent and IP address
> 2) the pattern of their queries
>
> Identifying patterns in queries is done manually, by a person inspecting
> the logs. It takes time and can only be done after the fact. We can only
> start our identification process once the service is already overloaded.
> This is not going to scale.
>
> IP addresses are starting to be problematic. We see bots running on
> cloud providers, spreading their workloads across multiple instances
> with multiple IP addresses.
>
> We are left with user agents. But here we have a problem again. To
> block only abusive bots, we would need those bots to use a clearly
> identifiable user agent, so that we can throttle or block them and
> contact the author to work together on a solution. It is unlikely that
> an intentionally abusive bot will voluntarily provide a way to be
> blocked. So we need to be more aggressive about bots that use a
> generic user agent. We are not blocking those, but we are limiting the
> number of requests coming from generic user agents. This is a large
> bucket, with a lot of bots falling into this same category of "generic
> user agent". Sadly, it is also the bucket that contains many small
> bots that generate only a very reasonable load. So we are also
> impacting the bots that play fair.
>
> At the moment, if your bot is affected by our restrictions, configure a
> custom user agent that identifies you; this should be sufficient to give
> you enough bandwidth. If you are still running into issues, please
> contact us; we'll find a solution together.
>
> What's coming next:
>
> First, it is unlikely that we will be able to remove the current
> restrictions in the short term. We're sorry for that, but the
> alternative - the service being unresponsive or severely lagged for
> everyone - is worse.
>
> We are exploring a number of alternatives.
> Adding authentication to the service, and allowing higher quotas to
> bots that authenticate. Creating an asynchronous queue, which could
> allow running more expensive queries, but with longer deadlines. And we
> are in the process of hiring another engineer to work on these ideas.
>
> Thanks for your patience!
>
> WDQS Team
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata

--
Amir Sarabadani (he/him)
Software engineer

Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
https://wikimedia.de

Our vision is a world in which all people can share in the knowledge of humanity, use it, and add to it. Help us achieve this! https://spenden.wikimedia.de

Wikimedia Deutschland — Gesellschaft zur Förderung Freien Wissens e. V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
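For bot authors reading along: the quoted advice (configure a custom user agent that identifies you) can be applied like this with Python's standard library. The tool name and contact details below are placeholders to replace with your own; the endpoint URL is the public WDQS SPARQL endpoint mentioned in the thread.

```python
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
# Placeholder identity: use your real tool name and a reachable contact.
USER_AGENT = "MyWikidataBot/0.1 (https://example.org/mybot; mybot@example.org)"

def build_query_request(sparql: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request with an identifying UA."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    return urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={"User-Agent": USER_AGENT},
    )

req = build_query_request("SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 1")
print(req.get_header("User-agent"))
# To actually run the query: urllib.request.urlopen(req).read()
```

With a header like this, operators can throttle or contact one specific bot instead of clamping down on the whole "generic user agent" bucket.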
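As a closing illustration of the per-bucket limiting the quoted message describes: one small shared token bucket for all generic agents, and a more generous bucket per identified bot. All rates here are invented for illustration and are not WDQS's actual limits or enforcement mechanism.

```python
# Sketch of per-bucket request throttling with token buckets.
# Rates and the generic-agent check are illustrative assumptions.
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available, refilling based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Generous per-agent buckets for identified bots; one tight shared
# bucket for everything with a generic user agent.
named_buckets = defaultdict(lambda: TokenBucket(rate=10, capacity=20))
generic_bucket = TokenBucket(rate=1, capacity=2)

def allow_request(user_agent: str) -> bool:
    ua = (user_agent or "").lower()
    if not ua or ua.startswith(("python-requests/", "java/", "curl/")):
        return generic_bucket.allow()
    return named_buckets[user_agent].allow()

# Three immediate generic requests: the small shared bucket exhausts fast.
print([allow_request("python-requests/2.22.0") for _ in range(3)])  # [True, True, False]
```

This captures the fairness problem in the thread: every unidentified bot, polite or not, drains the same shared bucket, while a bot with its own user agent gets its own quota.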