Re: [Wikidata] Wikidata Query Service User-Agent requirements for script users

2019-07-24 Thread Lucas Werkmeister
On 23.07.19 20:23, Stas Malyshev wrote:
> Also, with live updates, long queries create other technical challenges
> (if a query is running for 2 hours, the database basically has to keep the
> snapshot it runs on for 2 hours, which may make it much less efficient).

Wait, does that mean that the current query service actually gives each
query a consistent view of the database for up to 60 seconds? I always
assumed that WDQS didn’t give you the same transactionality guarantees
as e.g. MySQL, because how could that possibly work with such a high
query and update rate…

Cheers,
Lucas



Re: [Wikidata] Wikidata Query Service User-Agent requirements for script users

2019-07-24 Thread Andrew Gray
Hi Stas,

One thing that I've been wondering about is whether we could take a
little bit of load off via caching.

At the moment, if you run the same query again within a minute or two,
it uses the cached results. But after a few minutes, anyone who
follows the link triggers a new run.

If a query is embedded somewhere, or it does the rounds on Twitter or
in the newsletter, it might get a long stream of visitors spread out
enough that each one misses the cache window, meaning we end up
recalculating it a lot.

For a lot of queries, of course, this is a good thing - we want people
to have the newest data, especially for maintenance queries. But for a
lot of others, either the data isn't going to change in the next day
(e.g. maps of cities) or it's so high-level that being a little out of
date won't matter much (e.g. high-level counts of groups of items where
all the results are in the tens of thousands anyway).

So the suggestion: would it be possible to have some kind of
comment/command (similar to #defaultView:Map) that keeps the results
cached for a day or two? This would make it an opt-in approach, and if
this is done as a comment then the user could remove it or tweak the
query to force an update. It certainly wouldn't solve the underlying
load issues - bots aren't likely to want longer cache times - but it
might help take a little bit of the load off.
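
To make it concrete, here is a toy sketch of how the service side could
honour such a comment. The #cacheMaxAge name, the two-day cap and all the
numbers are invented for illustration; nothing like this exists in WDQS
today.

import re

# Hypothetical handling of an opt-in cache comment such as
# "#cacheMaxAge:172800" at the top of a query, modelled on existing
# comment-style directives like #defaultView:Map. Only a sketch of the
# proposal, not an existing feature of the query service.

DEFAULT_MAX_AGE = 300        # assumed current short-lived cache window, in seconds
MAX_OPT_IN = 172800          # cap opt-in caching at two days
CACHE_HINT = re.compile(r"^#cacheMaxAge:(\d+)\s*$", re.MULTILINE)

def cache_max_age(sparql_query: str) -> int:
    """Return how long (in seconds) the result cache may keep this query's results."""
    match = CACHE_HINT.search(sparql_query)
    if not match:
        return DEFAULT_MAX_AGE               # no hint: keep the current behaviour
    return min(int(match.group(1)), MAX_OPT_IN)

query = """#cacheMaxAge:86400
SELECT ?city ?cityLabel WHERE { ?city wdt:P31 wd:Q515 . }"""
print(cache_max_age(query))                  # -> 86400: cached for a day, by request

Because the hint lives in the query text itself, removing or tweaking the
comment changes the query string and naturally forces a fresh run, which is
exactly the opt-out you'd want.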

It might also improve the user experience in some circumstances - if I
email someone a query which I can force to be cached, then I know that
when they open it, they'll get something promptly rather than waiting a
long time, and (if it's a complex query) I know for sure it'll run
rather than timing out.

Andrew.



Re: [Wikidata] Wikidata Query Service User-Agent requirements for script users

2019-07-23 Thread Thad Guidry
On Tue, Jul 23, 2019 at 1:23 PM Stas Malyshev 
wrote:

> Hi!
>
> > Will this be achievable: my 2-hour query will actually finally
> > return results into my 1 GB csv.zip file?
>
> Not sure about 2 hours, as again it'd be a service that would be open to
> a wide community, and time is the most limited resource of all - once a
> 2-hour query is running, the resource to serve it is consumed for 2 hours
> and is not available to anybody else. Even with batching, we only have
> 24 hours per day, in which we'd be able to run only 12 such queries
> (well, parallelism exists, but let's not complicate it too much for the
> sake of example), and then the 13th person would have to wait a whole
> day for their query to even be run. Without some limit you'd have
> to book it months in advance like a posh restaurant :) Of course, it's a
> consideration of resources available and demand for such queries, so
> we'd have to see what the precise limit is when we get there. Maybe
> there are no 13 people to run such queries and we'd be ok.
>
>
I was thinking the same thing.  I wouldn't write a 2-hour query anyway, just saying.
In actuality, I'd spend the day or two it takes to download the data dump.

> Also, with live updates, long queries create other technical challenges
> (if a query is running for 2 hours, the database basically has to keep the
> snapshot it runs on for 2 hours, which may make it much less efficient).
> We could of course have a non-live-updated database, but updating it then
> would be a bit tricky, as loading a full dump takes a week now and
> catching up for that week takes even more time (hello, Achilles, hello,
> Tortoise). We're working on improving those, but for now 2-hour queries
> may be poorly compatible with both the resources we have and the model
> we have. Shorter queries, though, may definitely be possible - we'd need
> to find the boundary that is safe given the current resources.
>
>
Yep, agreed.  It's a balancing act; we do that even in the enterprise,
where even extremely large companies still have budgets.  But the CEO and
his reports come first, yah? :)

Thanks for the explanations, Stas; they confirm my assumptions there.
Let's continue to focus on the 80% of common user queries, and for the 20%
of special cases like mine, point users to the data dumps and say "roll
your own, kid, and have fun while doing it!"

Thad
https://www.linkedin.com/in/thadguidry/


Re: [Wikidata] Wikidata Query Service User-Agent requirements for script users

2019-07-23 Thread Stas Malyshev
Hi!

> Forgive my ignorance. I don't know much about the infrastructure of WDQS
> and how it works. I just want to mention how the application servers do
> it. In appservers, there are dedicated nodes both for Apache and for the
> replica database. So if a bot overdoes things on Wikipedia (which happens
> quite a lot), users won't feel anything, but the other bots take the hit.
> Routing based on UA seems hard here, though, while it's easy in MediaWiki
> (if you hit api.php, we assume it's a bot).

We have two clusters - public and internal, with the latter serving only
Wikimedia tasks and thus isolated from outside traffic. However, we do not
have a practical way right now to separate bot and non-bot traffic, and
I don't think we currently have the resources for another cluster.

> Routing based on UA seems hard here, though, while it's easy in MediaWiki

I don't think our current LB setup can route based on user agent. There
could be a gateway that does that, but given that we don't have the
resources for another cluster, it's not too useful to spend time
on developing something like that for now.

Even if we did separate browser and bot traffic, we'd still have the same
problem on the bot cluster - most bots are benign and low-traffic, and we
want to do our best to enable them to function smoothly. But for this to
work, we need ways to weed out the outliers that consume too many
resources. In a way, the bucketing policy is a version of what you
described - if you use proper identification, you are judged on your own
traffic. If you use generic identification, you are bucketed with other
generic agents, and thus may be denied if that bucket is full. This is
not the best final solution, but experience so far shows it has reduced
the incidence of problems. Further ideas on how to improve it are of
course welcome.
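
Roughly, the policy amounts to something like this toy sketch (the limit,
the window, and the list of "generic" agents below are made up for
illustration and are not our actual configuration):

import time
from collections import defaultdict

# Toy request throttle keyed by user agent: each identifiable agent gets
# its own request budget, while every generic agent shares a single bucket.
GENERIC_AGENTS = {"", "python-requests", "java", "curl", "wget"}
LIMIT_PER_BUCKET = 120   # requests allowed per bucket per window (assumed)
WINDOW = 60.0            # window length in seconds (assumed)

buckets = defaultdict(lambda: {"start": time.monotonic(), "count": 0})

def bucket_key(user_agent: str) -> str:
    """Generic agents all land in one shared bucket; others get their own."""
    base = user_agent.split("/")[0].strip().lower()
    return "generic" if base in GENERIC_AGENTS else user_agent

def allow_request(user_agent: str) -> bool:
    bucket = buckets[bucket_key(user_agent)]
    now = time.monotonic()
    if now - bucket["start"] >= WINDOW:      # window expired: start a fresh one
        bucket["start"], bucket["count"] = now, 0
    if bucket["count"] >= LIMIT_PER_BUCKET:
        return False                         # bucket full: throttle this request
    bucket["count"] += 1
    return True

# A well-identified bot is judged only on its own traffic; a generic client
# competes with every other generic client for the same shared budget.
print(allow_request("MyCoolBot/1.0 (https://example.org; me@example.org)"))
print(allow_request("python-requests/2.31"))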

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Wikidata Query Service User-Agent requirements for script users

2019-07-23 Thread Thad Guidry
>
> Adding authentication to the service, and allowing higher quotas to bots
> that authenticate.


Awesome and expected.

Creating an asynchronous queue, which could allow running more expensive
> queries, but with longer deadlines.


Even more awesome!
Will this be achievable: my 2-hour query will actually finally return
results into my 1 GB csv.zip file?

Thad
https://www.linkedin.com/in/thadguidry/



Re: [Wikidata] Wikidata Query Service User-Agent requirements for script users

2019-07-23 Thread Amir Sarabadani
Hey,
Forgive my ignorance. I don't know much about the infrastructure of WDQS
and how it works. I just want to mention how the application servers do
it. In appservers, there are dedicated nodes both for Apache and for the
replica database. So if a bot overdoes things on Wikipedia (which happens
quite a lot), users won't feel anything, but the other bots take the hit.
Routing based on UA seems hard here, though, while it's easy in MediaWiki
(if you hit api.php, we assume it's a bot).
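
To illustrate the kind of split I mean, here is a toy sketch; the pool
names, the api.php rule and the User-Agent heuristics are invented for the
example, not how anything is actually configured:

# Sketch of appserver-style routing by request characteristics, so that
# automated traffic lands on dedicated nodes and interactive users keep
# their own capacity. All rules below are illustrative assumptions.
BOT_UA_HINTS = ("bot", "python-requests", "curl", "wget", "java")

def choose_pool(path: str, user_agent: str) -> str:
    """Return which backend pool should serve this request."""
    if path.startswith("/w/api.php"):
        return "bot-pool"                    # API hits are assumed to be automated
    ua = user_agent.lower()
    if any(hint in ua for hint in BOT_UA_HINTS):
        return "bot-pool"                    # scripted clients go to their own nodes
    return "interactive-pool"                # browsers keep their own capacity

print(choose_pool("/w/api.php", "MyCoolBot/1.0"))          # -> bot-pool
print(choose_pool("/sparql", "Mozilla/5.0 (X11; Linux)"))  # -> interactive-pool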

Did you consider this kind of separation as a more long-term solution?
Best



-- 
Amir Sarabadani (he/him)
Software engineer

Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
https://wikimedia.de

Our vision is a world in which all people can share in the knowledge of
humanity, use it, and add to it. Help us 

[Wikidata] Wikidata Query Service User-Agent requirements for script users

2019-07-23 Thread Stas Malyshev
Hello all!

Here is (at last!) an update on what we are doing to protect the
stability of Wikidata Query Service.

For 4 years we have been offering Wikidata users the Query Service, a
powerful tool that allows anyone to query the content of Wikidata
without any identification needed. This means that anyone can use the
service from a script and make heavy or very frequent requests.
However, this freedom has led to the service being overloaded by too
large a volume of queries, causing the issues or lag that you may have noticed.

A reminder about the context:

We have had a number of incidents where the public WDQS endpoint was
overloaded by bot traffic. We don't think that any of that activity was
intentionally malicious, but rather that the bot authors most probably
don't understand the cost of their queries and the impact they have on
our infrastructure. We've recently seen more distributed bots, coming
from multiple IPs from cloud providers. This kind of pattern makes it
harder and harder to filter or throttle an individual bot. The impact
has ranged from increased update lag to full service interruption.

What we have been doing:

While we would love to allow anyone to run any query they want at any
time, we're not able to sustain that load, and we need to be more
aggressive in how we throttle clients. We want to be fair to our users
and allow everyone to use the service productively. We also want the
service to be available to the casual user and provide up-to-date access
to the live Wikidata data. And while we would love to throttle only
abusive bots, to be able to do that we need to be able to identify them.

We have two main means of identifying bots:

1) their user agent and IP address
2) the pattern of their queries

Identifying patterns in queries is done manually, by a person inspecting
the logs. It takes time and can only be done after the fact. We can only
start our identification process once the service is already overloaded.
This is not going to scale.

IP addresses are starting to be problematic. We see bots running on
cloud providers, spreading their workloads across multiple instances
with multiple IP addresses.

We are left with user agents. But here, we have a problem again. To
block only abusive bots, we would need those bots to use a clearly
identifiable user agent, so that we can throttle or block them and
contact the author to work together on a solution. It is unlikely that
an intentionally abusive bot will voluntarily provide a way to be
blocked. So we need to be more aggressive about bots which are using a
generic user agent. We are not blocking those, but we are limiting the
number of requests coming from generic user agents. This is a large
bucket, with a lot of bots that are in this same category of "generic
user agent". Sadly, this is also the bucket that contains many small
bots that generate only a very reasonable load. And so we are also
impacting the bots that play fair.

At the moment, if your bot is affected by our restrictions, configure a
custom user agent that identifies you; this should be sufficient to give
you enough bandwidth. If you are still running into issues, please
contact us; we'll find a solution together.
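
For example, from a Python script this can be as simple as the sketch
below; the bot name, URL and contact address are placeholders to replace
with your own, following the Wikimedia User-Agent policy.

import requests

# Identify yourself so your traffic can be told apart from generic clients
# and we can contact you instead of throttling you. Replace the placeholders.
HEADERS = {
    "User-Agent": "MyWikidataBot/0.1 (https://example.org/my-bot; me@example.org)"
}

QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 5"

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers=HEADERS,
    timeout=60,
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"])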

What's coming next:

First, it is unlikely that we will be able to remove the current
restrictions in the short term. We're sorry for that, but the
alternative - service being unresponsive or severely lagged for everyone
- is worse.

We are exploring a number of alternatives: adding authentication to the
service, and allowing higher quotas to bots that authenticate; creating
an asynchronous queue, which could allow running more expensive queries,
but with longer deadlines. And we are in the process of hiring another
engineer to work on these ideas.

Thanks for your patience!

WDQS Team

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata