#26022: Fix a flaw in the noise-removing code in our onion service statistics --------------------------------+------------------------------ Reporter: karsten | Owner: metrics-team Type: defect | Status: needs_review Priority: Medium | Milestone: Component: Metrics/Statistics | Version: Severity: Normal | Resolution: Keywords: | Actual Points: Parent ID: | Points: Reviewer: | Sponsor: --------------------------------+------------------------------
Comment (by karsten): Replying to [comment:15 amj703]: > > > Why not throw out the outliers, then add the remaining, then do the adjustment? > > > > The way we're determining whether a reported value was an outlier or not is by extrapolating all reported values to network totals and discarding the lowest 25% and highest 25% of ''extrapolated'' values. But extrapolating values requires us to make these adjustment first, or we'd extrapolate to the wrong network totals. > > It sounds to me like it doesn't make a difference if throw out the outliers before adjustment or after adjustment. The adjustment preserves the relative ordering of the values, and so the bottom (and top) 25% of data points will remain the same. You're right. Agreed, this would work. > So here's a suggestion: (1) extrapolate by dividing the raw measurement by the relay's weight; (2) remove the top and bottom 25% of extrapolated values; (3) add the remaining raw values; (4) adjust that result by rounding towards the closest bin edge, subtracting num_relay*bin_size/2, and replacing a negative value with 0; and then (5) extrapolate the result by dividing by the total weight of the included relays. I see how that approach would work and produce similar results to our current approach. In particular, I think it does the following things differently: - We're currently rounding each reported statistic towards the closest bin edge, whereas your suggested approach would only do that once in step (4). You approach is less likely to introduce errors and relies more on noise to cancel out itself. This seems good. - Your approach replaces a negative end value with 0 in step (4) which our current approach doesn't. We could easily adopt this part, if we wanted. So far, it hasn't happened that we got a negative end result, though. > > Here's another idea, though: what if we change the way how we're removing noise by ''only'' subtracting `bin_size / 2` to undo the binning step as good as we can and leave the Laplace noise alone. Basically, we'd only account for the fact that relays always round up to the next multiple of `bin_size`, but we wouldn't do anything about the positive or negative noise. > > This isn't really the best guess about the value at the relays before noise is added. While not performing the noise-removal step makes it easier to explain how the metrics numbers are produced (and to implement them), I think it makes it harder to understand what they mean. This is a judgement call that I can see both sides of, though! Wait, my suggestion was to subtract `bin_size / 2` but not round to the closest bin edge. I think it produces the same result as your suggestion above, except that it does not round the end result to the closest bin edge in your step (4). However, and here comes the catch: I'm running out of time for experimenting with different approaches. My goal here was to fix a bug, not to try improved estimation algorithms. I can see how your suggestion might improve our results, but it's a bigger code change than a single method. I'm afraid I'll have to leave that as future work. :( I'll wait for the simplified `removeNoise()` method run to finish and will post a new graph that compares flawed, fixed, and simplied methods. > > Wait, we're talking about two different things: > > > > 1. Relays internally round ''up'' to the next multiple of `bin_size`. > > 2. metrics-web contains that `removeNoise()` method that this ticket is all about. > > Ah, right, thanks for clarifying that. Great! Sorry for confusing you earlier. -- Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26022#comment:16> Tor Bug Tracker & Wiki <https://trac.torproject.org/> The Tor Project: anonymity online
_______________________________________________ tor-bugs mailing list tor-bugs@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs