#26022: Fix a flaw in the noise-removing code in our onion service statistics
Reporter: karsten | Owner: metrics-team
Type: defect | Status: needs_review
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
Comment (by karsten):
Replying to [comment:15 amj703]:
> > > Why not throw out the outliers, then add the remaining, then do the
> > The way we're determining whether a reported value was an outlier or
> > not is by extrapolating all reported values to network totals and
> > discarding the lowest 25% and highest 25% of ''extrapolated'' values.
> > But extrapolating values requires us to make these adjustments first,
> > or we'd extrapolate to the wrong network totals.
> It sounds to me like it doesn't make a difference whether we throw out
> the outliers before adjustment or after adjustment. The adjustment
> preserves the relative ordering of the values, and so the bottom (and
> top) 25% of data points will remain the same.
You're right. Agreed, this would work.
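As a quick sanity check of that ordering argument, here's a toy demonstration (the values, `bin_size`, and function name are mine, not from the metrics-web code): a per-value adjustment like rounding towards the closest bin edge and subtracting `bin_size / 2` is monotone, so the same data points land in the top and bottom quartiles before and after adjustment.

```python
bin_size = 8

def adjust(raw):
    # Round towards the closest multiple of bin_size, then shift down by
    # half a bin; this transformation is monotone (non-decreasing).
    binned = round(raw / bin_size) * bin_size
    return binned - bin_size / 2

raw_values = [3, 10, 25, 31, 47, 52, 68, 90]

# Indices of the values sorted by raw value vs. by adjusted value.
order_before = sorted(range(len(raw_values)), key=lambda i: raw_values[i])
order_after = sorted(range(len(raw_values)), key=lambda i: adjust(raw_values[i]))

# The ordering is unchanged, so quartile membership is unchanged too.
print(order_before == order_after)  # True
```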
> So here's a suggestion: (1) extrapolate by dividing the raw measurement
> by the relay's weight; (2) remove the top and bottom 25% of extrapolated
> values; (3) add the remaining raw values; (4) adjust that result by
> rounding towards the closest bin edge, subtracting num_relays*bin_size/2,
> and replacing a negative value with 0; and then (5) extrapolate the
> result by dividing by the total weight of the included relays.
I see how that approach would work and produce results similar to our
current approach. In particular, I think it does the following things:
- We're currently rounding each reported statistic towards the closest
bin edge, whereas your suggested approach would only do that once, in step
(4). Your approach is less likely to introduce errors and relies more on
the noise canceling itself out. This seems good.
- Your approach replaces a negative end value with 0 in step (4), which
our current approach doesn't. We could easily adopt this part if we
wanted. So far, we haven't seen a negative end result.
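For concreteness, the five suggested steps can be sketched roughly as follows; the toy reports, weights, and variable names are mine, not from the metrics-web code, and "25% from each end" is simplified to dropping floor(n/4) values from each end.

```python
bin_size = 8
relays = [
    # (raw reported value, relay weight as a fraction of the network)
    (16, 0.05),
    (24, 0.10),
    (40, 0.15),
    (56, 0.20),
    (72, 0.25),
    (96, 0.10),
]

# (1) Extrapolate each report to a network total by dividing by the weight.
extrapolated = [(raw / weight, raw, weight) for raw, weight in relays]

# (2) Drop the top and bottom quarter of extrapolated values.
extrapolated.sort()
q = len(extrapolated) // 4
kept = extrapolated[q:len(extrapolated) - q]

# (3) Add the remaining raw values.
raw_sum = sum(raw for _, raw, _ in kept)

# (4) Round towards the closest bin edge, subtract num_relays * bin_size / 2,
#     and replace a negative result with 0.
rounded = round(raw_sum / bin_size) * bin_size
adjusted = max(0, rounded - len(kept) * bin_size / 2)

# (5) Extrapolate by the total weight of the included relays.
estimate = adjusted / sum(weight for _, _, weight in kept)
print(estimate)
```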
> > Here's another idea, though: what if we change the way we're removing
> > noise by ''only'' subtracting `bin_size / 2` to undo the binning step
> > as well as we can and leave the Laplace noise alone. Basically, we'd
> > only account for the fact that relays always round up to the next
> > multiple of `bin_size`, but we wouldn't do anything about the positive
> > or negative noise.
> This isn't really the best guess about the value at the relays before
> noise is added. While not performing the noise-removal step makes it
> easier to explain how the metrics numbers are produced (and to implement
> them), I think it makes it harder to understand what they mean. This is
> a judgement call that I can see both sides of, though!
Wait, my suggestion was to subtract `bin_size / 2` but not round to the
closest bin edge. I think it produces the same result as your suggestion
above, except that it does not round the end result to the closest bin
edge in your step (4).
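To make that suggestion concrete, here's a minimal sketch (function name and example reports are mine; the metrics-web method under discussion is `removeNoise()`): undo the relays' rounding up to the next multiple of `bin_size` by subtracting `bin_size / 2` from each report, and leave the added Laplace noise untouched, expecting it to cancel out across many reports.

```python
bin_size = 8

def remove_binning_only(reported, bin_size=bin_size):
    # A relay rounds its true count up to the next multiple of bin_size,
    # so the reported value overstates the truth by bin_size / 2 on
    # average; subtract that, and do nothing about the Laplace noise.
    return reported - bin_size / 2

reports = [16, 24, 8, 40]
adjusted = [remove_binning_only(r) for r in reports]
print(adjusted)  # [12.0, 20.0, 4.0, 36.0]
```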
However, and here comes the catch: I'm running out of time for
experimenting with different approaches. My goal here was to fix a bug,
not to try improved estimation algorithms. I can see how your suggestion
might improve our results, but it's a bigger code change than a single
method. I'm afraid I'll have to leave that as future work. :(
I'll wait for the simplified `removeNoise()` method run to finish and will
post a new graph that compares the flawed, fixed, and simplified methods.
> > Wait, we're talking about two different things:
> > 1. Relays internally round ''up'' to the next multiple of `bin_size`.
> > 2. metrics-web contains that `removeNoise()` method that this ticket
> > is all about.
> Ah, right, thanks for clarifying that.
Great! Sorry for confusing you earlier.
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26022#comment:16>