Re: [tor-dev] Bandwidth scanner: request for feedback
On Mon, Nov 19, 2018 at 7:36 AM teor wrote:
>
> Hi,
>
> We have deployed sbws on one bandwidth authority (longclaw).
>
> Here's a request for additional feedback, and a progress update:
>
>
> Request for Feedback: Relay Bandwidth Self-Tests
>
> Torflow and sbws use relays' self-reported observed bandwidths for
> load balancing. But relays can have really low bandwidths because
> they're new, or due to random path selection.
>
> In torflow, relays can get stuck in a low-bandwidth partition. sbws
> doesn't have partitions. But in both systems, low bandwidths can
> cause inaccurate or unstable load balancing.
>
> Since torflow and sbws need accurate self-reported relay bandwidths,
> some component of the Tor network needs to send enough bandwidth
> through every relay.
>
> Here are our current choices:
>
> Tor relays can do a regular bandwidth self-test, so that their
> first descriptor has an accurate bandwidth (up to some minimum). But
> the current self-test is too small, and buggy.
>
> sbws already sends bandwidth to all relays to measure them. sbws gets
> accurate bandwidths for most relays within 2 weeks, but the fastest
> relays can take a month to ramp up. (sbws starts measuring at the
> median relay bandwidth, and can double every 5 days.)
>
> Should we improve relay bandwidth self-tests? (#22453)
> Or should we rely on sbws to create the bandwidths it needs?
> What about test networks?

Hi!

I don't think I have the answers here, but maybe I can think aloud in
a useful way.

From my point of view, either of these is a fine idea, if it works.
We could decide based on a lot of factors, like:

  * Which one is easier to do?
  * Which creates the greater maintenance burden, moving forward?
  * Which is more robust if something breaks in the future?
  * Which consumes the most relay bandwidth?
  * Which requires SBWS to use the most bandwidth?

Maybe if we had those figured out, we'd have a better time deciding.

> Should we make bandwidths grow faster in sbws?
> Or is a ramp-up period of 2-5 weeks fast enough?

I think that's fast enough, though I'm not sure. How does it compare
with the current average torflow ramp-up time?

> (We won't modify and re-deploy torflow.)

___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
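To put rough numbers on the ramp-up question, here is a minimal sketch
of the arithmetic behind "starts measuring at the median relay
bandwidth, and can double every 5 days". It assumes the estimate grows
in whole doubling steps; the function name and parameters are
hypothetical and are not taken from the sbws code.

```python
import math


def rampup_days(start_bw: float, target_bw: float,
                doubling_days: float = 5.0) -> float:
    """Estimate days until a doubling estimate reaches target_bw.

    Assumption (not confirmed by the thread): the measured bandwidth
    starts at start_bw (e.g. the median relay bandwidth) and can at
    most double once per doubling_days.
    """
    if start_bw <= 0 or target_bw <= 0:
        raise ValueError("bandwidths must be positive")
    if target_bw <= start_bw:
        # Relays at or below the starting point need no ramp-up.
        return 0.0
    # Number of whole doublings needed to reach or exceed the target.
    doublings = math.ceil(math.log2(target_bw / start_bw))
    return doublings * doubling_days


# A relay 4x the median needs 2 doublings; one 64x the median needs 6.
print(rampup_days(1.0, 4.0))
print(rampup_days(1.0, 64.0))
```

Under these assumptions a relay at 4 times the median needs 10 days,
and a relay at 64 times the median needs 30 days, which is in line with
the "2 weeks" and "a month" figures quoted above.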
Re: [tor-dev] Bandwidth scanner: request for feedback
Hi,

We have deployed sbws on one bandwidth authority (longclaw).

Here's a request for additional feedback, and a progress update:


Request for Feedback: Relay Bandwidth Self-Tests

Torflow and sbws use relays' self-reported observed bandwidths for
load balancing. But relays can have really low bandwidths because
they're new, or due to random path selection.

In torflow, relays can get stuck in a low-bandwidth partition. sbws
doesn't have partitions. But in both systems, low bandwidths can
cause inaccurate or unstable load balancing.

Since torflow and sbws need accurate self-reported relay bandwidths,
some component of the Tor network needs to send enough bandwidth
through every relay.

Here are our current choices:

Tor relays can do a regular bandwidth self-test, so that their
first descriptor has an accurate bandwidth (up to some minimum). But
the current self-test is too small, and buggy.

sbws already sends bandwidth to all relays to measure them. sbws gets
accurate bandwidths for most relays within 2 weeks, but the fastest
relays can take a month to ramp up. (sbws starts measuring at the
median relay bandwidth, and can double every 5 days.)

Should we improve relay bandwidth self-tests? (#22453)
Or should we rely on sbws to create the bandwidths it needs?
What about test networks?

Should we make bandwidths grow faster in sbws?
Or is a ramp-up period of 2-5 weeks fast enough?
(We won't modify and re-deploy torflow.)


Progress Update

> On 30 Aug 2018, at 07:11, Mike Perry wrote:
>
> teor:
>>
>> What happens when sbws doesn't match torflow?
>>
>> https://trac.torproject.org/projects/tor/ticket/27339
>>
>> We suggest this rule:
>>
>> If an sbws deployment is within X% of an existing bandwidth
>> authority, sbws is ok. (The total consensus weights of the
>> existing bandwidth authorities are within 25% - 50% of each
>> other, see #25459.)

We have successfully used this rule to discover and fix some bugs in
sbws.

> I would like an additional criteria for when we finally replace torflow
> with sbws.
>
> Ideally, I would like us to perform A/B experiments to ensure that our
> performance metrics do not degrade in terms of average *or* quartile
> range/performance variance. (Ie: alternate torflow results for a week vs
> sbws for a week, and repeat for a few weeks). I realize this might be
> complicated for dirauth operators, though. Can we make it easier
> somehow, so that it is easy to switch which result files they are voting
> with?

We do not have the capacity to A/B test sbws and torflow.
(As far as I understand, we don't have enough people, and we don't have
enough servers.)

> If we can't do this, at minimum, we should definitely watch the change
> in our average and quartile variance performance metrics when we first
> switch to sbws.

We deployed sbws on 1/6 bandwidth authorities, and the performance of
the network has been stable:

https://metrics.torproject.org/torperf.html?start=2018-01-21=2018-11-19=all=public=50kb

(The drop in performance at the start of the year was due to extra
network load.)

> Additionally, if we ever change how sbws behaves to be different than
> torflow, I would like sbws to have a well-defined load balancing
> equilibrium goal, and I would like us to not change this load balancing
> equilibrium goal unless we perform A/B testing and compare the average
> and variance of our performance metrics.
>
> I'll explain what I mean by "load balancing equilibrium goal" below,
> when I try to explain the PID mechanism again.

sbws has adopted Torflow's load-balancing equilibrium goal.

Our priority is transitioning away from Torflow successfully.

We've deferred changes to the load-balancing goal until a later sbws
release. We may never make this change.

>> How long should sbws keep relay bandwidths?
>>
>> https://trac.torproject.org/projects/tor/ticket/27338
>>
>> Torflow uses the latest self-reported relay observed bandwidth
>> and bandwidth rate.
>>
>> Torflow uses a complex feedback loop for measured bandwidths.
>> We think sbws can use a simple average or exponentially
>> decaying weighted average.
>
> As I said in
> https://lists.torproject.org/pipermail/tor-dev/2017-December/012714.html,
> this feedback loop is disabled. I know you don't believe that the
> bandwidth auth spec is accurate, but I'm telling you it is.

Improving bandwidth measurement has been one of the most difficult
things I have done with Tor.

You're right: I don't know if the Torflow spec is accurate, because I
often struggle to find the information I need in the spec.

That's not anyone's fault: it's a difficult and complex topic.

But it does mean that I need your help to answer some questions about
Torflow.

> The point of the PID control stuff was to formalize the type of load
> balancing equilibrium goal that the bandwidth auths are using, and to
> experiment with convergence on a specific target load balancing
> equilibrium point (where that target equilibrium point is "all
Re: [tor-dev] Bandwidth scanner: request for feedback
On 29 August 2018 at 16:11, Mike Perry wrote:
> Ideally, I would like us to perform A/B experiments to ensure that our
> performance metrics do not degrade in terms of average *or* quartile
> range/performance variance. (Ie: alternate torflow results for a week vs
> sbws for a week, and repeat for a few weeks). I realize this might be
> complicated for dirauth operators, though. Can we make it easier
> somehow, so that it is easy to switch which result files they are voting
> with?

Having both voting files means running both scanners at the same time.
Depending on one's pipes, that might skew the results from the
scanners.

-tom
Re: [tor-dev] Bandwidth scanner: request for feedback
teor:
> Hi,
>
> Juga and pastly have been working hard on sbws.
>
> Sbws' results are now similar to torflow's results:
> https://trac.torproject.org/projects/tor/attachment/ticket/27135/20180826_081902.png

Congratulations, Juga and pastly!

> Now that sbws is close to torflow, we want some feedback on its
> design. We’ll work on the design at the tor meeting in September.
>
> Please feel free to give feedback by email, or on the tickets:
>
>
> What happens when sbws doesn't match torflow?
>
> https://trac.torproject.org/projects/tor/ticket/27339
>
> We suggest this rule:
>
> If an sbws deployment is within X% of an existing bandwidth
> authority, sbws is ok. (The total consensus weights of the
> existing bandwidth authorities are within 25% - 50% of each
> other, see #25459.)

I would like an additional criteria for when we finally replace torflow
with sbws.

Ideally, I would like us to perform A/B experiments to ensure that our
performance metrics do not degrade in terms of average *or* quartile
range/performance variance. (Ie: alternate torflow results for a week vs
sbws for a week, and repeat for a few weeks). I realize this might be
complicated for dirauth operators, though. Can we make it easier
somehow, so that it is easy to switch which result files they are voting
with?

If we can't do this, at minimum, we should definitely watch the change
in our average and quartile variance performance metrics when we first
switch to sbws.

Additionally, if we ever change how sbws behaves to be different than
torflow, I would like sbws to have a well-defined load balancing
equilibrium goal, and I would like us to not change this load balancing
equilibrium goal unless we perform A/B testing and compare the average
and variance of our performance metrics.

I'll explain what I mean by "load balancing equilibrium goal" below,
when I try to explain the PID mechanism again.

> How long should sbws keep relay bandwidths?
>
> https://trac.torproject.org/projects/tor/ticket/27338
>
> Torflow uses the latest self-reported relay observed bandwidth
> and bandwidth rate.
>
> Torflow uses a complex feedback loop for measured bandwidths.
> We think sbws can use a simple average or exponentially
> decaying weighted average.

As I said in
https://lists.torproject.org/pipermail/tor-dev/2017-December/012714.html,
this feedback loop is disabled. I know you don't believe that the
bandwidth auth spec is accurate, but I'm telling you it is. There's just
a lot going on there because the bwauths have required a long history of
experimentation to get to where they are now, just as sbws is now
encountering with trying to make various measurement and scaling
decisions. (As you A/B test ways to improve performance on the live
network, you tend to accumulate a lot of options for different ways of
doing things).

The point of the PID control stuff was to formalize the type of load
balancing equilibrium goal that the bandwidth auths are using, and to
experiment with convergence on a specific target load balancing
equilibrium point (where that target equilibrium point is "all relays
have the same spare capacity for one additional client stream").

The problem was that when you only use this criteria, faster relays run
out of CPU, memory, or sockets before this criteria was satisfied for
them. Hence all of the circuit failure reason statistics in the code
base (to try to back off on PID control if we hit a different limiting
factor other than bandwidth). Unfortunately, Tor does not provide enough
error code feedback to reliably determine if a relay is low on memory,
sockets, or CPU. Funding ended for the bandwidths auths before we could
implement proper overload error feedback in Tor, and we got funding for
me to work on Tor Browser instead.

With the parameters in the current consensus (currently bwauthpid=1, and
no others), the PID control is operating as only "Proportional control":
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/README.spec.txt#n476
(The default values for K_i and K_d are 0, as per Section 3.6 of the
spec).

In section 3.1 of the spec, I have a proof that using "Proportional
control" (ie PID control with no I or D) is equivalent to what we were
doing in Section 2.2. This means that Section 2.2 does describe what we
are doing now:
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/README.spec.txt#n390
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/README.spec.txt#n298

I left the PID code itself enabled (but in "Proportional-only" mode)
because it is cleaner, and it makes it formally clear that the bandwidth
authorities are actually measuring the difference in the ability of
relays to carry additional client traffic, and correcting for that
difference by adjusting weights in proportion to that difference.

I naively assumed that eventually Tor would get funding to implement
better feedback for CPU, memory, and socket overload. That
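The claimed equivalence between "Proportional control" and a plain
ratio adjustment can be checked numerically. The sketch below is only
an illustration: the error definition (measured stream bandwidth
relative to the network average) and every function name are
assumptions made for this example, and none of it is copied from the
Torflow code or spec.

```python
def pid_error(measured_bw: float, network_avg_bw: float) -> float:
    """Relative deviation of a relay's measurement from the average.

    Assumed error definition for illustration only.
    """
    return (measured_bw - network_avg_bw) / network_avg_bw


def proportional_update(base_bw: float, measured_bw: float,
                        network_avg_bw: float, k_p: float = 1.0) -> float:
    """PID update with the integral and derivative terms set to 0."""
    return base_bw + base_bw * k_p * pid_error(measured_bw, network_avg_bw)


def ratio_update(base_bw: float, measured_bw: float,
                 network_avg_bw: float) -> float:
    """Scale the base bandwidth by the measured/average ratio."""
    return base_bw * (measured_bw / network_avg_bw)


# With k_p = 1 the two updates agree:
#   base * (1 + (m - avg) / avg) == base * (m / avg)
for measured in (50.0, 100.0, 150.0):
    p = proportional_update(1000.0, measured, 100.0)
    r = ratio_update(1000.0, measured, 100.0)
    print(measured, p, r)
```

With K_p = 1 and K_i = K_d = 0, a relay that measures faster than the
average gets a proportionally larger weight and a slower relay gets a
proportionally smaller one, which is the same result as multiplying by
the ratio directly.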
[tor-dev] Bandwidth scanner: request for feedback
Hi,

Juga and pastly have been working hard on sbws.

Sbws' results are now similar to torflow's results:
https://trac.torproject.org/projects/tor/attachment/ticket/27135/20180826_081902.png

Now that sbws is close to torflow, we want some feedback on its
design. We’ll work on the design at the tor meeting in September.

Please feel free to give feedback by email, or on the tickets:


What happens when sbws doesn't match torflow?

https://trac.torproject.org/projects/tor/ticket/27339

We suggest this rule:

If an sbws deployment is within X% of an existing bandwidth
authority, sbws is ok. (The total consensus weights of the
existing bandwidth authorities are within 25% - 50% of each
other, see #25459.)


How long should sbws keep relay bandwidths?

https://trac.torproject.org/projects/tor/ticket/27338

Torflow uses the latest self-reported relay observed bandwidth
and bandwidth rate.

Torflow uses a complex feedback loop for measured bandwidths.
We think sbws can use a simple average or exponentially
decaying weighted average.


How should we scale sbws consensus weights?

https://trac.torproject.org/projects/tor/ticket/27340

If sbws' total consensus weight is different to torflow's total
consensus weight, how should we scale sbws?
(The weights might differ because the measurement method is
different, or because scanners and servers are in different
locations.)

In the bandwidth file spec, we suggest linear scaling.


How should we round sbws consensus weights?

https://trac.torproject.org/projects/tor/ticket/27337

Torflow currently rounds to 3 significant figures (which is
a maximum of 0.5%).

But I suggest 2 significant figures for sbws (or max 5%), because:
- tor has a daily usage cycle that varies by 10% - 20%
- existing bandwidth authorities vary by 25% - 50%

Proposal 276 contains a slightly more complicated rounding
algorithm, which we may want to implement in sbws or in tor:
https://gitweb.torproject.org/torspec.git/tree/proposals/276-lower-bw-granularity.txt


Does sbws need a maximum consensus weight fraction?

https://trac.torproject.org/projects/tor/ticket/27336

Torflow uses 5%, but I suggest 1%, because the largest
relay right now is only 0.5%.

T

--
teor

Please reply @torproject.org
New subkeys 1 July 2018

PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
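For readers who want to try the scaling, rounding, and capping
questions on concrete numbers, here is a minimal sketch. All three
function names are hypothetical, the weights are plain dicts rather
than any sbws data structure, and the rounding is simple round-half-up
to N significant figures, not the Proposal 276 algorithm.

```python
def scale_linear(weights: dict, target_total: float) -> dict:
    """Linearly rescale weights so that they sum to target_total."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    factor = target_total / total
    return {relay: w * factor for relay, w in weights.items()}


def round_sig(value: float, digits: int = 2) -> int:
    """Round a non-negative weight to `digits` significant figures."""
    value = int(value)
    if value <= 0:
        return 0
    magnitude = len(str(value)) - 1          # position of leading digit
    step = 10 ** max(magnitude - digits + 1, 0)
    return (value + step // 2) // step * step  # round half up


def cap_fraction(weights: dict, max_fraction: float = 0.01) -> dict:
    """Clip each weight to max_fraction of the uncapped total.

    Simplification: the total shrinks after clipping, so a real
    implementation might need to re-check the fraction afterwards.
    """
    limit = max_fraction * sum(weights.values())
    return {relay: min(w, limit) for relay, w in weights.items()}


print(round_sig(1234, 3), round_sig(1234, 2))
print(scale_linear({"a": 100, "b": 300}, 800))
print(cap_fraction({"a": 90, "b": 10}, 0.5))
```

The worst case for this kind of rounding sits just above a power of
ten: with 3 significant figures, 1005 becomes 1010 (about 0.5% off),
and with 2 significant figures, 105 becomes 110 (a little under 5%
off), which matches the error bounds quoted in the message.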