> > Adding authentication to the service, and allowing higher quotas to bots that authenticate.
Awesome and expected.

> > Creating an asynchronous queue, which could allow running more expensive queries, but with longer deadlines.

Even more awesome! Will this be approachable: will my 2-hour query actually, finally, return results into my 1 GB csv.zip file?

Thad
https://www.linkedin.com/in/thadguidry/

On Tue, Jul 23, 2019 at 5:47 AM Amir Sarabadani <amir.sarabad...@wikimedia.de> wrote:

> Hey,
> Forgive my ignorance; I don't know much about the infrastructure of WDQS and how it works. I just want to mention how the application servers do it. On the appservers there are dedicated nodes for both Apache and the replica database, so if a bot overdoes things on Wikipedia (which happens quite a lot), users won't feel anything; only the other bots take the hit. Routing based on user agent seems hard, though, while it's easy in MediaWiki (if you hit api.php, we assume it's a bot).
>
> Have you considered this as a more long-term solution?
> Best
>
> On Tue, 23 Jul 2019 at 09:43, Stas Malyshev <smalys...@wikimedia.org> wrote:
>
>> Hello all!
>>
>> Here is (at last!) an update on what we are doing to protect the stability of Wikidata Query Service.
>>
>> For 4 years we have been offering Wikidata users the Query Service, a powerful tool that allows anyone to query the content of Wikidata without any identification needed. This means that anyone can use the service from a script and make heavy or very frequent requests. However, this freedom has led to the service being overloaded by too large a volume of queries, causing the issues or lag that you may have noticed.
>>
>> A reminder about the context:
>>
>> We have had a number of incidents where the public WDQS endpoint was overloaded by bot traffic. We don't think that any of that activity was intentionally malicious, but rather that the bot authors most probably don't understand the cost of their queries and the impact they have on our infrastructure. We've recently seen more distributed bots, coming from multiple IPs at cloud providers. This kind of pattern makes it harder and harder to filter or throttle an individual bot. The impact has ranged from increased update lag to full service interruption.
>>
>> What we have been doing:
>>
>> While we would love to allow anyone to run any query they want at any time, we're not able to sustain that load, and we need to be more aggressive in how we throttle clients. We want to be fair to our users and allow everyone to use the service productively. We also want the service to be available to the casual user and to provide up-to-date access to the live Wikidata data. And while we would love to throttle only abusive bots, to be able to do that we need to be able to identify them.
>>
>> We have two main means of identifying bots:
>>
>> 1) their user agent and IP address
>> 2) the pattern of their queries
>>
>> Identifying patterns in queries is done manually, by a person inspecting the logs. It takes time and can only be done after the fact. We can only start our identification process once the service is already overloaded. This is not going to scale.
>>
>> IP addresses are starting to be problematic. We see bots running on cloud providers and spreading their workloads across multiple instances, with multiple IP addresses.
>>
>> We are left with user agents. But here, we have a problem again.
>> To block only abusive bots, we would need those bots to use a clearly identifiable user agent, so that we can throttle or block them and contact the author to work together on a solution. It is unlikely that an intentionally abusive bot will voluntarily provide a way to be blocked. So we need to be more aggressive about bots which use a generic user agent. We are not blocking those, but we are limiting the number of requests coming from generic user agents. This is a large bucket, containing a lot of bots that all fall into the same "generic user agent" category. Sadly, it is also the bucket that contains many small bots that generate only a very reasonable load. And so we are also impacting the bots that play fair.
>>
>> At the moment, if your bot is affected by our restrictions, configure a custom user agent that identifies you; this should be sufficient to give you enough bandwidth. If you are still running into issues, please contact us; we'll find a solution together. (A minimal sketch of a request with a custom user agent appears after this quoted thread.)
>>
>> What's coming next:
>>
>> First, it is unlikely that we will be able to remove the current restrictions in the short term. We're sorry for that, but the alternative - the service being unresponsive or severely lagged for everyone - is worse.
>>
>> We are exploring a number of alternatives: adding authentication to the service and allowing higher quotas to bots that authenticate; and creating an asynchronous queue, which could allow running more expensive queries with longer deadlines. And we are in the process of hiring another engineer to work on these ideas.
>>
>> Thanks for your patience!
>>
>> WDQS Team
>
> --
> Amir Sarabadani (he/him)
> Software engineer
>
> Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
> Tel. (030) 219 158 26-0
> https://wikimedia.de
>
> Our vision is a world in which everyone can share in the sum of human knowledge, use it, and add to it. Help us achieve this! https://spenden.wikimedia.de
>
> Wikimedia Deutschland — Gesellschaft zur Förderung Freien Wissens e. V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as a charitable organization by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
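Following the advice in the quoted message about configuring a custom user agent, here is a minimal sketch of what that can look like for a script querying WDQS. It assumes Python with the `requests` library; the tool name, project URL, and contact address in the User-Agent string are placeholders to replace with your own, and the query itself is just an arbitrary example.

```python
# Minimal sketch: querying the WDQS SPARQL endpoint with a descriptive,
# identifiable User-Agent so operators can tell your traffic apart from
# the throttled "generic user agent" bucket (and contact you if needed).
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

HEADERS = {
    # Placeholder identity -- substitute your own tool name, URL, and contact address.
    "User-Agent": "ExampleWikidataBot/0.1 (https://example.org/bot; bot-owner@example.org)",
    "Accept": "application/sparql-results+json",
}

# An arbitrary example query: ten items that are instances of "house cat" (Q146).
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

def run_query(query: str) -> dict:
    """Send one SPARQL query and return the parsed JSON results."""
    response = requests.get(
        WDQS_ENDPOINT,
        params={"query": query},
        headers=HEADERS,
        timeout=60,  # client-side timeout; the server enforces its own limits too
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    for row in run_query(QUERY)["results"]["bindings"]:
        print(row["itemLabel"]["value"])
```

Including a project URL or contact email in the User-Agent string follows the general recommendation of the Wikimedia User-Agent policy, and gives the WDQS operators a way to reach you before they have to resort to throttling or blocking.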