[Wikidata-bugs] [Maniphest] [Commented On] T252091: RFC: Site-wide edit rate limiting with PoolCounter

2020-05-30 Thread Dvorapa
Dvorapa added a comment.


  But anyway, it would be great to make Retry-After work (and not just switch 
between null and 5) and adapt tools to use it as discussed many times before.
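  As a rough illustration of what "adapt tools to use it" could look like on the client side, here is a minimal sketch (not pywikibot code; the endpoint and payload are placeholder assumptions) that honours a Retry-After header on every response, including 2xx:

```
# Minimal sketch of a client honouring Retry-After on every response.
# The API URL and payload are placeholders, not a real bot configuration.
import time
import requests

API = "https://www.wikidata.org/w/api.php"  # placeholder target

def submit_edit(session: requests.Session, payload: dict) -> requests.Response:
    resp = session.post(API, data=payload)
    # The server may send Retry-After as a pacing hint on 2xx, or on 429/503.
    retry_after = resp.headers.get("Retry-After")
    if retry_after is not None:
        try:
            delay = float(retry_after)
        except ValueError:
            delay = 5.0  # HTTP-date form not parsed in this sketch
        time.sleep(min(delay, 600))  # cap the wait defensively
    return resp
```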


2020-05-30 Thread Ladsgroup
Ladsgroup added a comment.


  In T252091#6171258 , 
@tstarling wrote:
  
  > I hope you don't mind if I contradict my previous comment a bit, since my 
thinking is still evolving on this.
  
  No worries at all. I'm also changing my mind quickly here.
  
  > One problem with using lag as the metric is that it doesn't go negative, so 
the integral will not be pulled down while the service is idle. We could 
subtract a target lag, say 1 minute, but that loses some of the supposed 
benefit of including an integral term. A better metric would be updater load, 
i.e. demand/capacity. When the load is more than 100%, the lag increases at a 
rate of 1 second per second, but there's no further information in there as to 
how heavily overloaded it is. When the load is less than 100%, lag decreases 
until it reaches zero. While it's decreasing, the slope tells you something 
about how underloaded it is, but once it hits zero, you lose that information.
  >
  > Load is average queue size, if you take the currently running batch as 
being part of the queue. WDQS currently does not monitor the queue size. I 
gather (after an hour or so of research, I'm new to all this) that with some 
effort, KafkaPoller could obtain an estimate of the queue size by subtracting 
the current partition offsets from KafkaConsumer.endOffsets() 
.
  >
  > Failing that, we can make a rough approximation from available data. We can 
get the average utilisation of the importer from the 
rdf-repository-import-time-cnt metric. You can see in Grafana 

 that the derivative of this metric hovers between 0 and 1 when WDQS is not 
lagged, and remains near 1 when WDQS is lagged. The metric I would propose is 
to add replication lag to this utilisation metric, appropriately scaled: 
//utilisation + K_lag * lag - 1// where K_lag is say 1/60s. This is a metric 
which is -1 at idle, 0 when busy with no lag, and 1 with 1 minute of lag. The 
control system would adjust the request rate to keep this metric (and its 
integral) at zero.
  >
  >> With PID, we need to define three constants K_p, K_i and K_d. If we had 
problem with finding the pool size, this is going to get three times more 
complicated (I didn't find a standard way to determine these coefficients, 
maybe I'm missing something obvious)
  >
  > One way to simplify it is with K_d=0, i.e. make it a PI controller. Having 
the derivative in there probably doesn't add much. Then it's only two times 
more complicated. Although I added K_lag so I suppose we are still at 3. The 
idea is that it shouldn't matter too much exactly what K_p and K_i are set to 
-- the system should be stable and have low lag with a wide range of parameter 
values. So you just pick some values and see if it works.
  >
  >> We currently don't have an infrastructure to hold the "maxlag" data over 
time so we can calculate its derivative and integral. Should we use redis? How 
it's going to look like? These are questions, I don't have answers for them. Do 
you have ideas for that?
  >
  > WDQS lag is currently obtained by having an ApiMaxLagInfo hook handler 
which queries Prometheus, caching the result. Prometheus has a query language 
which can perform derivatives ("rate") and integrals ("sum_over_time") on 
metrics. So it would be the same system as now, just with a different 
Prometheus query.
  
  I might be a little YAGNI here, but I would love to have the maxlag numbers kept over time and to build the PI controller on the maxlag value rather than on the lag of WDQS alone. Mostly because WDQS will hopefully be fixed and handled later, but there will always be some sort of edit rate bottleneck (job queue, replication, you name it). That said, if you think we should work on WDQS for now, I'm okay with it. My thinking was to start with a P controller based on maxlag, build the infrastructure to keep the data over time (maybe Prometheus? Query statsd? We already store all maxlag values there, but it seems broken at the moment), and add it there. I think oscillating around 3s is much better than oscillating around 5s, because above 5s the system doesn't accept the edit and the user has to re-send it.
  
  > The wording in RFC 7231 suggests to me that it is acceptable to use 
Retry-After in a 2xx response. "Servers send the "Retry-After" header field to 
indicate how long the user agent ought to wait before making a follow-up 
request." That seems pretty close to what we're doing.
  
  Ack. I think we should communicate this to the tool developers (and the pywikibot folks) so that they start honoring the header all the time.


2020-05-28 Thread dcausse
dcausse added a comment.


  In T252091#6171258 , 
@tstarling wrote:
  
  > Load is average queue size, if you take the currently running batch as 
being part of the queue. WDQS currently does not monitor the queue size. I 
gather (after an hour or so of research, I'm new to all this) that with some 
effort, KafkaPoller could obtain an estimate of the queue size by subtracting 
the current partition offsets from KafkaConsumer.endOffsets() 
.
  
  This metric is available in Grafana through `kafka_burrow_partition_lag`. The problem is that for various reasons we stopped polling updates from Kafka and are now consuming the RecentChanges API. The reasons we disabled it have since been fixed, so I believe we could enable it again.
  
  In the ideal case the updater runs at full speed most of the time, because the effect of maxlag propagates fast enough that the system in place does what it was designed for: make sure users don't query and see data that is too far out of date, and don't starve for too long once the threshold is green again.
  One problem the current maxlag strategy does not address properly is when a single server is lagged; situations like this start to happen:
  F31845471: lag_wdqs.png
  Since the median across all pooled servers is used, the effect of maxlag no longer propagates fast enough: highly lagged servers see the effect of an edit rate slowdown that happened 10 minutes ago, while others see their queue being emptied when they could have handled more. All of this being pretty much random (spikes across servers happen at different times), it exacerbates the oscillation even more. Was taking the max or the sum instead of the median ever evaluated?
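  (As a toy illustration of that aggregation question, with invented numbers: one badly lagged server leaves the median green while the max reacts at once.)

```
# Invented per-server lag values, not real WDQS measurements.
from statistics import median

per_server_lag = [2, 3, 2, 4, 3, 180]  # seconds; one outlier server

print(median(per_server_lag))  # 3   -> maxlag stays below the 5s threshold
print(max(per_server_lag))     # 180 -> throttling would kick in at once
```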
  
  As said in a previous comment, there will always be a bottleneck somewhere. I feel that having a single fixed limit makes it difficult to handle the variance in the edit rate, and could encourage us to keep tuning it down to resolve lag issues without ever knowing when the system could handle more.
  A solution built around Retry-After and a PID controller seems a bit more flexible to me; the main drawback is that it relies on well-behaved clients (which is currently the case).
  
  As for addressing the issue in the updater itself, we believe we have room for optimization by redesigning the way we perform updates. The current situation is clearly not ideal, but it can keep up with the update rate when bots are slowed down, which I hope gives us enough time to finish the work we started on this rewrite.


2020-05-27 Thread tstarling
tstarling added a comment.


  I hope you don't mind if I contradict my previous comment a bit, since my 
thinking is still evolving on this.
  
  One problem with using lag as the metric is that it doesn't go negative, so 
the integral will not be pulled down while the service is idle. We could 
subtract a target lag, say 1 minute, but that loses some of the supposed 
benefit of including an integral term. A better metric would be updater load, 
i.e. demand/capacity. When the load is more than 100%, the lag increases at a 
rate of 1 second per second, but there's no further information in there as to 
how heavily overloaded it is. When the load is less than 100%, lag decreases 
until it reaches zero. While it's decreasing, the slope tells you something 
about how underloaded it is, but once it hits zero, you lose that information.
  
  Load is average queue size, if you take the currently running batch as being part of the queue. WDQS currently does not monitor the queue size. I gather (after an hour or so of research, I'm new to all this) that with some effort, KafkaPoller could obtain an estimate of the queue size by subtracting the current partition offsets from KafkaConsumer.endOffsets().
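  For illustration, a rough sketch of that estimate using the kafka-python client rather than the Java KafkaPoller (the broker, consumer group and topic names are placeholders): per-partition lag is the end offset minus the current position.

```
# Sketch of the queue-size estimate: end offsets minus current positions,
# summed over the assigned partitions. Broker/group/topic are placeholders.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",   # placeholder broker
    group_id="wdqs-updater",              # placeholder consumer group
    enable_auto_commit=False,
)
consumer.subscribe(["mediawiki.revision-create"])  # placeholder topic
consumer.poll(timeout_ms=1000)  # join the group and receive an assignment

assigned = list(consumer.assignment())
end_offsets = consumer.end_offsets(assigned)
queue_size = sum(end_offsets[tp] - consumer.position(tp) for tp in assigned)
print("estimated pending updates:", queue_size)
```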
  
  Failing that, we can make a rough approximation from available data. We can get the average utilisation of the importer from the rdf-repository-import-time-cnt metric. You can see in Grafana that the derivative of this metric hovers between 0 and 1 when WDQS is not lagged, and remains near 1 when WDQS is lagged. The metric I would propose is to add replication lag to this utilisation metric, appropriately scaled: //utilisation + K_lag * lag - 1// where K_lag is, say, 1/60s. This is a metric which is -1 at idle, 0 when busy with no lag, and 1 with 1 minute of lag. The control system would adjust the request rate to keep this metric (and its integral) at zero.
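  In code form, that metric is simply the following (a minimal sketch, with K_lag = 1/60 per second as suggested):

```
# Direct transcription of the proposed metric: -1 when idle, 0 when busy
# with no lag, about +1 at one minute of lag.
def control_metric(utilisation: float, lag_seconds: float,
                   k_lag: float = 1.0 / 60.0) -> float:
    """utilisation + K_lag * lag - 1, as proposed above."""
    return utilisation + k_lag * lag_seconds - 1.0

print(control_metric(0.0, 0.0))    # -1.0 : idle
print(control_metric(1.0, 0.0))    #  0.0 : saturated, no lag
print(control_metric(1.0, 60.0))   # ~1.0 : saturated, one minute behind
```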
  
  > With PID, we need to define three constants K_p, K_i and K_d. If we had 
problem with finding the pool size, this is going to get three times more 
complicated (I didn't find a standard way to determine these coefficients, 
maybe I'm missing something obvious)
  
  One way to simplify it is with K_d=0, i.e. make it a PI controller. Having 
the derivative in there probably doesn't add much. Then it's only two times 
more complicated. Although I added K_lag so I suppose we are still at 3. The 
idea is that it shouldn't matter too much exactly what K_p and K_i are set to 
-- the system should be stable and have low lag with a wide range of parameter 
values. So you just pick some values and see if it works.
  
  > We currently don't have an infrastructure to hold the "maxlag" data over 
time so we can calculate its derivative and integral. Should we use redis? How 
it's going to look like? These are questions, I don't have answers for them. Do 
you have ideas for that?
  
  WDQS lag is currently obtained by having an ApiMaxLagInfo hook handler which 
queries Prometheus, caching the result. Prometheus has a query language which 
can perform derivatives ("rate") and integrals ("sum_over_time") on metrics. So 
it would be the same system as now, just with a different Prometheus query.
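  As a sketch of that approach over the Prometheus HTTP API (the server URL and the metric names below are illustrative guesses, not the real WDQS definitions):

```
# Fetch the pieces of the proposed metric from Prometheus; URL and metric
# names are placeholders, not the actual WDQS/Grafana definitions.
import requests

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder

def prom_scalar(expr: str) -> float:
    """Run an instant PromQL query and return the first sample's value."""
    data = requests.get(PROMETHEUS, params={"query": expr}).json()
    return float(data["data"]["result"][0]["value"][1])

utilisation = prom_scalar("avg(rate(rdf_repository_import_time_cnt[5m]))")
lag = prom_scalar("avg(wdqs_lag_seconds)")          # placeholder metric name
metric_now = utilisation + lag / 60.0 - 1.0

# The integral term can be pushed into PromQL itself via sum_over_time()
# over a subquery, e.g. sum_over_time((<expr>)[30m:1m]), instead of keeping
# history in Redis.
```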
  
  > I'm not sure "Retry-After" is a good header for 2xx responses. It's like 
"We accepted your edit but "retry" it after 2 seconds". I looked at RFC 7231 
and it doesn't explicitly say we can't use it in 2xx requests but I haven't 
seen anywhere use it in 2xx responses. We might be able to find another better 
header?
  
  The wording in RFC 7231 suggests to me that it is acceptable to use 
Retry-After in a 2xx response. "Servers send the "Retry-After" header field to 
indicate how long the user agent ought to wait before making a follow-up 
request." That seems pretty close to what we're doing.
  
  In summary, we query Prometheus for //utilisation + lag / 60 - 1//, both the 
most recent value and the sum over some longer time interval. The sum and the 
value are separately scaled, then they are added together, then the result is 
limited to some reasonable range like 0-600s. If it's >0, then we send it as a 
Retry-After header. Then we badger all bots into respecting the header.
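  A sketch of that final step, with placeholder gains (the point being, as noted above, that the exact values shouldn't matter much):

```
# PI-style control signal clamped to a sane Retry-After range.
# K_p, K_i and the 30-minute window are placeholder tuning choices.
def retry_after_seconds(metric_now: float, metric_sum_30m: float,
                        k_p: float = 120.0, k_i: float = 4.0,
                        cap: float = 600.0) -> float:
    """Return the Retry-After value in seconds (0 means: send no header)."""
    signal = k_p * metric_now + k_i * metric_sum_30m
    return max(0.0, min(cap, signal))
```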


2020-05-26 Thread Ladsgroup
Ladsgroup added a comment.


  In T252091#6154167 , 
@tstarling wrote:
  
  > This proposal is effectively a dynamic rate limit except that instead of 
delivering an error message when it is exceeded, we will just hold the 
connection open, forcing the bot to wait. That's expensive in terms of server 
resources -- we'd rather have the client wait using only its own resources. A 
rate limit has a tunable parameter (the rate) which is not really knowable. 
Similarly, this proposal has a tunable parameter (the pool size) which is not 
really knowable. You have to tune the pool size down until the replag stops 
increasing, but then if the nature of the edits changes, or if the hardware 
changes, the optimal pool size will change.
  >
  > I suggested at T202107  that the 
best method for globally controlling replication lag would be with a PID 
controller . A PID controller 
suppresses oscillation by having a memory of recent changes in the metric. The 
P (proportional) term is essentially as proposed at T240442 
 -- just back off proportionally as 
the lag increases. The problem with this is that it will settle into an 
equilibrium lag somewhere in the middle of the range. The I (integral) term 
addresses this by maintaining a rolling average and adjusting the control value 
until the average meets the desired value. This allows it to maintain 
approximately the same edit rate but with a lower average replication lag. The 
D (derivative) term causes the control value to be reduced more aggressively if 
the metric is rising quickly.
  >
  > My proposal is to use a PID controller to set the Retry-After header. 
Clients would be strongly encouraged to respect that header. We could have say 
maxlag=auto to opt in to this system.
  
  I quite like the idea of using PID but there are three notes I want to 
mention:
  
  - With PID, we need to define three constants: K_p, K_i and K_d. If we had trouble finding the pool size, this is going to get three times more complicated (I didn't find a standard way to determine these coefficients; maybe I'm missing something obvious).
  - We currently don't have infrastructure to hold the "maxlag" data over time so that we can calculate its derivative and integral. Should we use Redis? What is it going to look like? These are questions I don't have answers for. Do you have ideas?
  - I'm not sure "Retry-After" is a good header for 2xx responses. It's like "We accepted your edit, but 'retry' it after 2 seconds". I looked at RFC 7231 and it doesn't explicitly say we can't use it in 2xx responses, but I haven't seen it used anywhere on 2xx responses. Maybe we can find a better header?


2020-05-20 Thread Ladsgroup
Ladsgroup added a comment.


  In T252091#6150993 , @Joe 
wrote:
  
  >> - "The above suggests that the current rate limit is too high," this is 
not correct, the problem is that there is no rate limit for bots at all. The 
group explicitly doesn't have a rate limit. Adding such ratelimit was tried and 
caused lots of issues (even with a pretty high number).
  >
  > What kind of issues, specifically?
  >
  > I find the idea that we can't impose an upper limit to edits per minute 
bizarre, in abstract, but there might be good reasons for that.
  
  It broke MassMessage (T192690: Mass message broken on Wikidata after ratelimit workaround); see also the discussions in T184948: limit page creation and edit rate on Wikidata.
  
  In T252091#6151004 , @Joe 
wrote:
  
  > So, while I find the idea of using poolcounter to limit the editing 
**concurrency** (it's not rate-limiting, which is different) a good proposal, 
and in general something desirable to have (including the possibility we tune 
it down to zero if we're in a crisis for instance), I think the fundamental 
problem reported here is that WDQS can't ingest the updates fast enough.
  
  My opinion is that there will always be a bottleneck in the rate of digesting edits somewhere in the infrastructure; if we fix WDQS in the next couple of months, edits will also scale up and we might hit a similar issue in, for example, the search index update. See T243701#6152282
  
  > So the solution should be searched there; either we improve performance of 
WDQS in ingesting updates (and I see there are future plans for that) or we 
stop considering it when calculating maxLag. We should not limit the edits 
happening to wikidata just because a dependent system can't keep up the pace.
  
  On paper the dependency is one-way, but in reality it is not. When we didn't count WDQS lag into maxlag, the lag was sometimes as high as half a day (and growing). This actually caused issues, because lots of tools and systems that edit Wikidata use WDQS: they started doing basic GIGO, using the outdated data they were getting to add wrong data back to Wikidata, and this feedback loop caused problems. Also, it's reasonable to assume WDQS might be lagged by as much as half an hour, but when it's lagged for half a day it breaks lots of implicit assumptions of tool builders, similar to what would happen if the search index on Wikipedia started lagging behind by a day.
  
  In T252091#6154167 , 
@tstarling wrote:
  
  > This proposal is effectively a dynamic rate limit except that instead of 
delivering an error message when it is exceeded, we will just hold the 
connection open, forcing the bot to wait. That's expensive in terms of server 
resources -- we'd rather have the client wait using only its own resources. A 
rate limit has a tunable parameter (the rate) which is not really knowable. 
Similarly, this proposal has a tunable parameter (the pool size) which is not 
really knowable. You have to tune the pool size down until the replag stops 
increasing, but then if the nature of the edits changes, or if the hardware 
changes, the optimal pool size will change.
  >
  > I suggested at T202107  that the 
best method for globally controlling replication lag would be with a PID 
controller . A PID controller 
suppresses oscillation by having a memory of recent changes in the metric. The 
P (proportional) term is essentially as proposed at T240442 
 -- just back off proportionally as 
the lag increases. The problem with this is that it will settle into an 
equilibrium lag somewhere in the middle of the range. The I (integral) term 
addresses this by maintaining a rolling average and adjusting the control value 
until the average meets the desired value. This allows it to maintain 
approximately the same edit rate but with a lower average replication lag. The 
D (derivative) term causes the control value to be reduced more aggressively if 
the metric is rising quickly.
  >
  > My proposal is to use a PID controller to set the Retry-After header. 
Clients would be strongly encouraged to respect that header. We could have say 
maxlag=auto to opt in to this system.
  
  That sounds like a good alternative that needs exploring. I haven't thought about it in depth yet, but I promise to do so and come back to you.


2020-05-20 Thread tstarling
tstarling added a comment.


  Really the client has to wait every time, so there needs to be a delay hint 
header like Retry-After with every response. So it's not exactly maxlag=auto.


2020-05-20 Thread Addshore
Addshore added a comment.


  Tuning when in a crisis is probably a more accurate description of what we want to aim for, be that automatic or manual.
  
  The issue of the WDQS updater should indeed be seen as a separate issue, and it is being solved separately.
  
  Maxlag is currently the system being abused to provide some sort of rate limit on the site as a whole. You could say we have been in a bit of a constant crisis over the last 6 months, given the gap between expectations of the query service, which is critical to many workflows, and what the service was able to deliver.
  
  With that in mind, though, why do we have maxlag at all? We have the same problem with pure maxlag, as demonstrated at the weekend when one of the s8 DB servers was overwhelmed, with a lag of 9 for 12 hours.
  Another element of maxlag, the dispatch system, ended up at around 15 (I think) for the same period.
  But the effect of either of those systems reporting such a maxlag value is 0 edits by automated systems for a 12-hour period.
  That isn't really desired; instead, being able to control concurrency could be seen as an answer.
  
  We could look at this weekend's issue as an individual problem to fix, as with the query service, but as alluded to above, there will always be more crisis situations where this mechanism would help.
  
  I can also see this from the other side of the fence: if we were in a situation where Wikidata was negatively impacting enwiki, I imagine a response would be to set Wikidata to read-only for a period, or to use maxlag to slow down editing. That isn't really desirable either, and having a control mechanism, rather than just on or off, would be great.


2020-05-19 Thread Joe
Joe added a comment.


  So, while I find the idea of using poolcounter to limit the editing 
**concurrency** (it's not rate-limiting, which is different) a good proposal, 
and in general something desirable to have (including the possibility we tune 
it down to zero if we're in a crisis for instance), I think the fundamental 
problem reported here is that WDQS can't ingest the updates fast enough.
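  (For readers less familiar with the distinction, a toy sketch of a concurrency limit in the PoolCounter spirit, though not MediaWiki's actual PoolCounter client: the bound is on edits in flight, not edits per unit time.)

```
# Toy concurrency limit: at most N edits being saved at any instant,
# however many arrive per minute. A rate limit would count per time window.
import threading

EDIT_POOL = threading.BoundedSemaphore(value=8)   # placeholder pool size

def handle_edit(save_edit, *args):
    # Non-blocking acquire: if the pool is full, the request could instead
    # wait (holding the connection) or be told to come back later.
    if not EDIT_POOL.acquire(blocking=False):
        raise RuntimeError("edit pool full; client should retry later")
    try:
        return save_edit(*args)
    finally:
        EDIT_POOL.release()
```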
  
  So the solution should be sought there: either we improve the performance of WDQS in ingesting updates, or we stop considering it when calculating maxLag. We should not limit the edits happening to Wikidata just because a project with no dedicated engineering resources[1] can't keep up the pace.
  
  [1] This is my current understanding of the situation with WDQS, apologies in 
advance if that's not the case.


2020-05-19 Thread Joe
Joe added a comment.


  
  
  > - "The above suggests that the current rate limit is too high," this is not 
correct, the problem is that there is no rate limit for bots at all. The group 
explicitly doesn't have a rate limit. Adding such ratelimit was tried and 
caused lots of issues (even with a pretty high number).
  
  What kind of issues, specifically?
  
  I find the idea that we can't impose an upper limit on edits per minute bizarre, in the abstract, but there might be good reasons for that.


2020-05-13 Thread Ladsgroup
Ladsgroup added a comment.


  In T252091#6134865 , 
@Krinkle wrote:
  
  > @Ladsgroup wrote:
  >
  >> […] The edit rate on this wiki has been going up to 1,000 edits per minute 
and has been testing our infrastructure scalability […] The edits have been 
mostly done by bots […] and bot operators want to edit in full speed when the 
infrastructure is quiet and forcing an arbitrary number […] limits bots in 
times that **the infrastructure can actually take more**.
  >
  > (Emphasis mine).
  >
  > @Ladsgroup wrote:
  >
  >> WDQS updater can't keep up and […] we need to keep in mind that there 
always will be a bottleneck.
  >
  > It sounds like there are no times where the infrastructure can just handle 
it all at the current rate. The above suggests that the current rate limit is 
too high, because we can't keep up with that rate even at normal/quiet times. 
Right?
  
  No. Let me clarify:
  
  - By "the infrastructure can actually take more" I mean the times that there 
are less edits happening for example midnight when human edits are low. or days 
that a bot is broken/has nothing to do and other bots can go faster
  - "The above suggests that the current rate limit is too high," this is not 
correct, the problem is that there is no rate limit for bots at all. The group 
explicitly doesn't have a rate limit. Adding such ratelimit was tried and 
caused lots of issues (even with a pretty high number). In other words, inside 
the mediawiki, for bots, we are at mercy of them and based on contracts and API 
etiquettes, we tell them the pressure on the server and they adjust their speed 
based on that and maxlag is a proxy of a metric on the pressure of the server. 
If any bot doesn't respect maxlag, they'll be blocked. but the problem is that 
maxlag is not a good enough metrics to bots.
  
  > If we lower the rate limit, would this pattern not go away?
  
  As I said before, there's no rate limit for bots.
  
  > I suppose it could come back if bots use their burst capacity within a 
single minute, or when there are many different/new bots starting to do the 
same thing.
  
  Bursts of a lot of activity are fine; they make all bots stop so the system can recover. The problem right now is that the edit rate is too high virtually all the time.
  
  > In that case, the global protections of `maxlag` and kick in automatically 
to restore us. Is that not good enough? Would the global rate limit behave 
differently in practice?
  
  Yes, it would be different: it would keep the flow under control all the time instead of oscillating.
  
  > @Ladsgroup wrote:
  >
  >> […] This has been oscillating like this for months:
  >> F31805674: image.png 
  >
  > It isn't said explicitly, but it sounds the oscillating pattern is 
considered a problem. Is that right? What kinds of problems is it causing, and 
for whom/what? I can understand that regularly reaching a lag of 5s is not 
great, but it seems like an expected outcome if we set the bot maxlag to 5s. If 
we want the "situation normal" lag peaks to be lower, then we should set that 
maxlag parameter lower.
  
  Well, it is a big problem. Please read T243701: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service); as I mentioned, this pattern even broke pywikibot's CI (Travis).


2020-05-13 Thread daniel
daniel added a comment.


  For reference, Brad recently used PoolCounter to impose a limit on Special:Contributions; see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/551909


2020-05-13 Thread Krinkle
Krinkle added a comment.


  @Ladsgroup wrote:
  
  > […] The edit rate on this wiki has been going up to 1,000 edits per minute 
and has been testing our infrastructure scalability […] The edits have been 
mostly done by bots […] and bot operators want to edit in full speed when the 
infrastructure is quiet and forcing an arbitrary number […] limits bots in 
times that **the infrastructure can actually take more**.
  
  (Emphasis mine).
  
  @Ladsgroup wrote:
  
  > WDQS updater can't keep up and […] we need to keep in mind that there 
always will be a bottleneck.
  
  It sounds like there are no times where the infrastructure can just handle it 
all at the current rate. The above suggests that the current rate limit is too 
high, because we can't keep up with that rate even at normal/quiet times. Right?
  
  If we lower the rate limit, would this pattern not go away? I suppose it could come back if bots use their burst capacity within a single minute, or when there are many different/new bots starting to do the same thing. In that case, the global protections of `maxlag` would kick in automatically to restore us. Is that not good enough? Would the global rate limit behave differently in practice?
  
  @Ladsgroup wrote:
  
  > […] This has been oscillating like this for months:
  > F31805674: image.png 
  
  It isn't said explicitly, but it sounds like the oscillating pattern is considered a problem. Is that right? What kinds of problems is it causing, and for whom/what? I can understand that regularly reaching a lag of 5s is not great, but it seems like an expected outcome if we set the bot maxlag to 5s. If we want the "situation normal" lag peaks to be lower, then we should set that maxlag parameter lower.
