Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-25 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+-
 Reporter:  karsten |  Owner:  karsten
 Type:  defect  | Status:  closed
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:  fixed
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+-
Changes (by karsten):

 * status:  merge_ready => closed
 * resolution:   => fixed


Comment:

 Replying to [comment:20 iwakeh]:
 > Replying to [comment:18 karsten]:
 > > Please review commits [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26022=d297e80ba990f0163018e5ce5b2d9d131d9a1fb3
 d297e80] and [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26022=4a518a29306e6c372b207fc172dc82b1956c28ec
 4a518a2] in [https://gitweb.torproject.org/karsten/metrics-
 web.git/log/?h=task-26022 my task-26022 branch].
 >
 > Both commits are fine, pass tests and checks.

 Great! Merged and deployed.

 > I think there are other places in Metrics' code where rounding is
 performed implicitly by relying on integer operations.  These should be
 looked for and also be replaced by explicit rounding.  Maybe, a new
 ticket?

 I don't think that we have a similar issue in another part of metrics-web.
 After all, this only affects negative numbers, and we typically work with
 positive numbers. I also think I would have spotted such a case when
 specifying how we're processing data in metrics-web a few weeks ago.

 Closing. Thanks again!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-24 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+-
 Reporter:  karsten |  Owner:  karsten
 Type:  defect  | Status:  merge_ready
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+-
Changes (by iwakeh):

 * status:  needs_review => merge_ready


Comment:

 Replying to [comment:18 karsten]:
 > Please review commits [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26022=d297e80ba990f0163018e5ce5b2d9d131d9a1fb3
 d297e80] and [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26022=4a518a29306e6c372b207fc172dc82b1956c28ec
 4a518a2] in [https://gitweb.torproject.org/karsten/metrics-
 web.git/log/?h=task-26022 my task-26022 branch].

 Both commits are fine, pass tests and checks.
 I think there are other places in Metrics' code where rounding is
 performed implicitly by relying on integer operations.  These should be
 looked for and also be replaced by explicit rounding.  Maybe, a new
 ticket?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-19 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  karsten
 Type:  defect  | Status:  needs_review
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by amj703):

 Just want to say that I don't disagree with anything in your most recent
 comment!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-18 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  karsten
 Type:  defect  | Status:  needs_review
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--
Changes (by karsten):

 * priority:  Medium => High
 * status:  accepted => needs_review


Comment:

 Please review commits [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26022=d297e80ba990f0163018e5ce5b2d9d131d9a1fb3
 d297e80] and [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26022=4a518a29306e6c372b207fc172dc82b1956c28ec
 4a518a2] in [https://gitweb.torproject.org/karsten/metrics-
 web.git/log/?h=task-26022 my task-26022 branch].

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-18 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  karsten
 Type:  defect  | Status:  accepted
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--
Changes (by karsten):

 * owner:  metrics-team => karsten
 * status:  needs_review => accepted


Comment:

 Replying to [comment:16 karsten]:
 > I'll wait for the simplified `removeNoise()` method run to finish and
 will post a new graph that compares flawed, fixed, and simplied methods.

 Here's a comparison of simplified vs. fixed `removeNoise()` method:

 [[Image(hidserv-change-task-26022-3.png, 700px)]]

 I'll go ahead and prepare a patch that implements the fixed method.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-18 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--
Changes (by karsten):

 * Attachment "hidserv-change-task-26022-3.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-17 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by karsten):

 Replying to [comment:15 amj703]:
 > > > Why not throw out the outliers, then add the remaining, then do the
 adjustment?
 > >
 > > The way we're determining whether a reported value was an outlier or
 not is by extrapolating all reported values to network totals and
 discarding the lowest 25% and highest 25% of ''extrapolated'' values. But
 extrapolating values requires us to make these adjustment first, or we'd
 extrapolate to the wrong network totals.
 >
 > It sounds to me like it doesn't make a difference if throw out the
 outliers before adjustment or after adjustment. The adjustment preserves
 the relative ordering of the values, and so the bottom (and top) 25% of
 data points will remain the same.

 You're right. Agreed, this would work.

 > So here's a suggestion: (1) extrapolate by dividing the raw measurement
 by the relay's weight; (2) remove the top and bottom 25% of extrapolated
 values; (3) add the remaining raw values; (4) adjust that result by
 rounding towards the closest bin edge, subtracting num_relay*bin_size/2,
 and replacing a negative value with 0; and then (5) extrapolate the result
 by dividing by the total weight of the included relays.

 I see how that approach would work and produce similar results to our
 current approach. In particular, I think it does the following things
 differently:

  - We're currently rounding each reported statistic towards the closest
 bin edge, whereas your suggested approach would only do that once in step
 (4). You approach is less likely to introduce errors and relies more on
 noise to cancel out itself. This seems good.

  - Your approach replaces a negative end value with 0 in step (4) which
 our current approach doesn't. We could easily adopt this part, if we
 wanted. So far, it hasn't happened that we got a negative end result,
 though.

 > > Here's another idea, though: what if we change the way how we're
 removing noise by ''only'' subtracting `bin_size / 2` to undo the binning
 step as good as we can and leave the Laplace noise alone. Basically, we'd
 only account for the fact that relays always round up to the next multiple
 of `bin_size`, but we wouldn't do anything about the positive or negative
 noise.
 >
 > This isn't really the best guess about the value at the relays before
 noise is added. While not performing the noise-removal step makes it
 easier to explain how the metrics numbers are produced (and to implement
 them), I think it makes it harder to understand what they mean. This is a
 judgement call that I can see both sides of, though!

 Wait, my suggestion was to subtract `bin_size / 2` but not round to the
 closest bin edge. I think it produces the same result as your suggestion
 above, except that it does not round the end result to the closest bin
 edge in your step (4).

 However, and here comes the catch: I'm running out of time for
 experimenting with different approaches. My goal here was to fix a bug,
 not to try improved estimation algorithms. I can see how your suggestion
 might improve our results, but it's a bigger code change than a single
 method. I'm afraid I'll have to leave that as future work. :(

 I'll wait for the simplified `removeNoise()` method run to finish and will
 post a new graph that compares flawed, fixed, and simplied methods.

 > > Wait, we're talking about two different things:
 > >
 > >  1. Relays internally round ''up'' to the next multiple of `bin_size`.
 > >  2. metrics-web contains that `removeNoise()` method that this ticket
 is all about.
 >
 > Ah, right, thanks for clarifying that.

 Great! Sorry for confusing you earlier.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-17 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by amj703):

 > > Why not throw out the outliers, then add the remaining, then do the
 adjustment?
 >
 > The way we're determining whether a reported value was an outlier or not
 is by extrapolating all reported values to network totals and discarding
 the lowest 25% and highest 25% of ''extrapolated'' values. But
 extrapolating values requires us to make these adjustment first, or we'd
 extrapolate to the wrong network totals.

 It sounds to me like it doesn't make a difference if throw out the
 outliers before adjustment or after adjustment. The adjustment preserves
 the relative ordering of the values, and so the bottom (and top) 25% of
 data points will remain the same. So here's a suggestion: (1) extrapolate
 by dividing the raw measurement by the relay's weight; (2) remove the top
 and bottom 25% of extrapolated values; (3) add the remaining raw values;
 (4) adjust that result by rounding towards the closest bin edge,
 subtracting num_relay*bin_size/2, and replacing a negative value with 0;
 and then (5) extrapolate the result by dividing by the total weight of the
 included relays.

 > Here's another idea, though: what if we change the way how we're
 removing noise by ''only'' subtracting `bin_size / 2` to undo the binning
 step as good as we can and leave the Laplace noise alone. Basically, we'd
 only account for the fact that relays always round up to the next multiple
 of `bin_size`, but we wouldn't do anything about the positive or negative
 noise.

 This isn't really the best guess about the value at the relays before
 noise is added. While not performing the noise-removal step makes it
 easier to explain how the metrics numbers are produced (and to implement
 them), I think it makes it harder to understand what they mean. This is a
 judgement call that I can see both sides of, though!

 > Wait, we're talking about two different things:
 >
 >  1. Relays internally round ''up'' to the next multiple of `bin_size`.
 >  2. metrics-web contains that `removeNoise()` method that this ticket is
 all about.

 Ah, right, thanks for clarifying that.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-17 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by karsten):

 Here's an updated version of the graph from comment 5 that also goes back
 to late 2014 when we started gathering onion service statistics:

 [[Image(hidserv-change-full-task-26022.png​, 700px)]]

 Some thoughts:
  - Some of the numbers for 2015 produced by the fixed `removeNoise()`
 method are up to 15% smaller than those from the flawed method. During
 2015, very few relays were reporting onion service statistics, with the
 [https://metrics.torproject.org/hidserv-frac-
 reporting.html?start=2014-10-01=2018-05-17 fraction not going up
 before late 2015]. The reason for new values being smaller than old values
 is that we're not erroneously adding `bin_size` to negative reported
 statistics anymore.
  - There are very few cases in the first half of 2015 where new values are
 much larger than old values. I believe this is related to another bug in
 our code that made us terminate the module immediately if a consensus did
 not contain a `bandwidth-weights` line. I'm going to fix that, too, but
 it's unrelated to the flawed `removeNoise()` method.
  - The numbers starting in 2016 are almost the same in the new and old
 approach. That's what the previous graph in comment 5 showed, too.

 All in all, I'd say the fixed `removeNoise()` method works just fine.

 I'm starting another run now that uses the simplified `removeNoise()`
 method that only subtracts `bin_size` and does no
 rounding/truncating/flooring at all (as suggested earlier). That will take
 12+ hours.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-17 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--
Changes (by karsten):

 * Attachment "hidserv-change-full-task-26022.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-16 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by karsten):

 Replying to [comment:12 amj703]:
 > > We could sum up relay values first and then adjust the result.
 However, we'd lose the ability to discard outliers, which we're doing
 extensively with onion service statistics. After all, we're throwing out 2
 times 25% of reported values which we'd then include again.
 >
 > Why not throw out the outliers, then add the remaining, then do the
 adjustment?

 The way we're determining whether a reported value was an outlier or not
 is by extrapolating all reported values to network totals and discarding
 the lowest 25% and highest 25% of ''extrapolated'' values. But
 extrapolating values requires us to make these adjustment first, or we'd
 extrapolate to the wrong network totals.

 Here's another idea, though: what if we change the way how we're removing
 noise by ''only'' subtracting `bin_size / 2` to undo the binning step as
 good as we can and leave the Laplace noise alone. Basically, we'd only
 account for the fact that relays always round up to the next multiple of
 `bin_size`, but we wouldn't do anything about the positive or negative
 noise. Of course, we'd keep the remaining extrapolation step and outlier
 handling unchanged. Like this:

 {{{
   /** Removes noise from a reported stats value by subtracting half of the
 bin size. */
   private long removeNoise(long reportedNumber, long binSize) {
 return reportedNumber - binSize / 2;
   }
 }}}

 If this makes any sense, I could produce some numbers with this new, even
 simpler approach.

 > > Hang on. Relays always round ''up'' to the next multiple of
 `bin_size`. So, everything in `(-bin_size, 0]` will be reported as `0` and
 ''not'' as `-bin_size`.
 > >
 > > > I don’t think the “right side” rounding is happening with current
 use of the floor function, if it ever was. Maybe I’m wrong, but as I
 understand it Math.floorDiv((reportedNumber + binSize / 2) will round
 -0.75*binSize to -binSize.
 > >
 > > This part is correct. (The full "formula" is
 `Math.floorDiv((reportedNumber + binSize / 2), binSize) * binSize`.)
 >
 > These statements appear inconsistent. Is everything in (-bin_size, 0]
 rounded to 0, or is only [-bin_size/2,0] rounded to zero with [-bin_size,
 -bin_size/2) rounded to -bin_size? I think it's the latter, because
 Math.floorDiv((reportedNumber + binSize / 2), binSize) * binSize with
 reportedNumber=-0.75*binSize should evaluate to
 Math.floorDiv((-0.25*binSize), binSize) * binSize = -1 * binSize =
 -binSize. That appears consistent with how you've described Math.floorDiv
 and how the docs describe it at
 : "Returns
 the largest (closest to positive infinity) int value that is less than or
 equal to the algebraic quotient".

 Wait, we're talking about two different things:

  1. Relays internally round ''up'' to the next multiple of `bin_size`.
  2. metrics-web contains that `removeNoise()` method that this ticket is
 all about.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-16 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by amj703):

 > We could sum up relay values first and then adjust the result. However,
 we'd lose the ability to discard outliers, which we're doing extensively
 with onion service statistics. After all, we're throwing out 2 times 25%
 of reported values which we'd then include again.

 Why not throw out the outliers, then add the remaining, then do the
 adjustment?

 > Hang on. Relays always round ''up'' to the next multiple of `bin_size`.
 So, everything in `(-bin_size, 0]` will be reported as `0` and ''not'' as
 `-bin_size`.
 >
 > > I don’t think the “right side” rounding is happening with current use
 of the floor function, if it ever was. Maybe I’m wrong, but as I
 understand it Math.floorDiv((reportedNumber + binSize / 2) will round
 -0.75*binSize to -binSize.
 >
 > This part is correct. (The full "formula" is
 `Math.floorDiv((reportedNumber + binSize / 2), binSize) * binSize`.)

 These statements appear inconsistent. Is everything in (-bin_size, 0]
 rounded to 0, or is only [-bin_size/2,0] rounded to zero with [-bin_size,
 -bin_size/2) rounded to -bin_size? I think it's the latter, because
 Math.floorDiv((reportedNumber + binSize / 2), binSize) * binSize with
 reportedNumber=-0.75*binSize should evaluate to
 Math.floorDiv((-0.25*binSize), binSize) * binSize = -1 * binSize =
 -binSize. That appears consistent with how you've described Math.floorDiv
 and how the docs describe it at
 : "Returns
 the largest (closest to positive infinity) int value that is less than or
 equal to the algebraic quotient".

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-16 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by karsten):

 Replying to [comment:10 amj703]:
 > > Hmm, I believe that disallowing negative values by rounding them up to
 zero would bias the result unnecessarily. For example, imagine an extreme
 case where we picked a large `bin_size` value and an even larger `delta_f`
 value: most relays would observe values between 0 and `bin_size`, and they
 would add very large positive or negative noise values. In such a case we
 need both negative and positive values to come up with a result close to
 0.
 >
 > Sure, that makes sense as a reason to try and produce unbiased estimates
 for each relay’s value. It seems to me that there is another way to do
 this, which would involve adding the relay values first and then adjusting
 the result, given that the noise has already been summed and thus has zero
 bias (aka an expectation of zero). But the current method seems to be
 working just fine!

 We could sum up relay values first and then adjust the result. However,
 we'd lose the ability to discard outliers, which we're doing extensively
 with onion service statistics. After all, we're throwing out 2 times 25%
 of reported values which we'd then include again.

 > > Hmm, I'm not so sure about this one, either. Remember that we
 implemented the code at relays to report the ''right side'' of a bin, not
 the ''center''. Take another example where a relay observes exactly 0
 events over two days: on day 1 it adds a small positive noise value and
 reports `bin_size` as number, and on day 2 it adds a small negative noise
 value and reports 0 as number. If we subtract `bin_size / 2` from those
 reported values, we'll end up with 0 on average, which would be correct.
 But if we didn't, we'd end up with `bin_size / 2` as average, which seems
 wrong.
 >
 > As another example, consider that on day 1 a positive noise value in
 (bin_size/2, bin_size] is added, which results in bin_size being reported,
 and on day 2 a large-magnitude negative value in [-bin_size, -bin_size/2)
 is reported, which results in -bin_size being reported. Then subtracting
 bin_size/2 from those reported values ends up with -bin_size/2 as the
 average, which seems wrong.

 Hang on. Relays always round ''up'' to the next multiple of `bin_size`.
 So, everything in `(-bin_size, 0]` will be reported as `0` and ''not'' as
 `-bin_size`.

 > I don’t think the “right side” rounding is happening with current use of
 the floor function, if it ever was. Maybe I’m wrong, but as I understand
 it Math.floorDiv((reportedNumber + binSize / 2) will round -0.75*binSize
 to -binSize.

 This part is correct. (The full "formula" is
 `Math.floorDiv((reportedNumber + binSize / 2), binSize) * binSize`.)

 > > Does this make sense? If so, I think I'd prefer to leave the general
 formula to remove noise unchanged and only focus on fixing the
 implementation bug related to integer truncation.
 >
 > It definitely makes sense to just focus on the immediate issue
 discovered. I was just thinking through how this was supposed to work
 again and thought I would let you know how it appeared to me.

 Sure! It's good to revisit our design decisions from a few years back and
 to fix any other flaws coming up.

 So, if the truncate/floor bug remains the only flaw in our current code,
 and when the reprocessed numbers are finally done here, I'll merge and
 update the numbers.

 Thanks so much for sharing your thoughts here!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-16 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by amj703):

 > Hmm, I believe that disallowing negative values by rounding them up to
 zero would bias the result unnecessarily. For example, imagine an extreme
 case where we picked a large `bin_size` value and an even larger `delta_f`
 value: most relays would observe values between 0 and `bin_size`, and they
 would add very large positive or negative noise values. In such a case we
 need both negative and positive values to come up with a result close to
 0.

 Sure, that makes sense as a reason to try and produce unbiased estimates
 for each relay’s value. It seems to me that there is another way to do
 this, which would involve adding the relay values first and then adjusting
 the result, given that the noise has already been summed and thus has zero
 bias (aka an expectation of zero). But the current method seems to be
 working just fine!

 > Hmm, I'm not so sure about this one, either. Remember that we
 implemented the code at relays to report the ''right side'' of a bin, not
 the ''center''. Take another example where a relay observes exactly 0
 events over two days: on day 1 it adds a small positive noise value and
 reports `bin_size` as number, and on day 2 it adds a small negative noise
 value and reports 0 as number. If we subtract `bin_size / 2` from those
 reported values, we'll end up with 0 on average, which would be correct.
 But if we didn't, we'd end up with `bin_size / 2` as average, which seems
 wrong.

 As another example, consider that on day 1 a positive noise value in
 (bin_size/2, bin_size] is added, which results in bin_size being reported,
 and on day 2 a large-magnitude negative value in [-bin_size, -bin_size/2)
 is reported, which results in -bin_size being reported. Then subtracting
 bin_size/2 from those reported values ends up with -bin_size/2 as the
 average, which seems wrong. I don’t think the “right side” rounding is
 happening with current use of the floor function, if it ever was. Maybe
 I’m wrong, but as I understand it Math.floorDiv((reportedNumber + binSize
 / 2) will round -0.75*binSize to -binSize.

 > Does this make sense? If so, I think I'd prefer to leave the general
 formula to remove noise unchanged and only focus on fixing the
 implementation bug related to integer truncation.

 It definitely makes sense to just focus on the immediate issue discovered.
 I was just thinking through how this was supposed to work again and
 thought I would let you know how it appeared to me.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-16 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by karsten):

 Replying to [comment:7 amj703]:
 > Hey Karsten,

 Hey Aaron,

 > (My trac pseudonym is amj703 instead of ohmygodel. I've changed the cc).

 Oops, my bad!

 Thanks for your very quick response, after learning about this issue! :)

 > I agree you did find a bug in how the noisy numbers are adjusted. The
 change from integer division to floorDiv seems right to me.

 Great, I will fix that then.

 > One thing you might also consider doing to improve handling negative
 values is to disallow them (i.e. round up to zero). This could be done by
 the relay reporting its number. We probably discussed this option, and
 maybe the reason it wasn't chosen is that it slightly biases counts so
 their expected value isn't the true value (or at least the true rounded
 value).

 Hmm, I believe that disallowing negative values by rounding them up to
 zero would bias the result unnecessarily. For example, imagine an extreme
 case where we picked a large `bin_size` value and an even larger `delta_f`
 value: most relays would observe values between 0 and `bin_size`, and they
 would add very large positive or negative noise values. In such a case we
 need both negative and positive values to come up with a result close to
 0.

 > If that's the case, and negative values are allowed to prevent biasing,
 we should recognize that (1) values are already being biased because of
 the rounding (for which we minimize the worst-case bin by adding
 (-binSize/2) at the end), and (2) adding (-binSize/2) actually makes the
 bias worse for negative bin values.

 Hmm, I'm not so sure about this one, either. Remember that we implemented
 the code at relays to report the ''right side'' of a bin, not the
 ''center''. Take another example where a relay observes exactly 0 events
 over two days: on day 1 it adds a small positive noise value and reports
 `bin_size` as number, and on day 2 it adds a small negative noise value
 and reports 0 as number. If we subtract `bin_size / 2` from those reported
 values, we'll end up with 0 on average, which would be correct. But if we
 didn't, we'd end up with `bin_size / 2` as average, which seems wrong.

 Does this make sense? If so, I think I'd prefer to leave the general
 formula to remove noise unchanged and only focus on fixing the
 implementation bug related to integer truncation.

 Thanks again!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-15 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by amj703):

 Also, PrivCount shouldn't be affected by this because it doesn't use bins.
 The bins help make sure, over time, if the noise can be averaged out due
 to a stable (or otherwise derivable) count, the count is still protected
 to within the accuracy of the bins. The PrivCount implementation is
 designed for a research tool that isn't producing these measurement values
 consistently over time. Tor should consider adding this binning
 enhancement for its own implementation of PrivCount.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-15 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by amj703):

 Hey Karsten,

 (My trac pseudonym is amj703 instead of ohmygodel. I've changed the cc).

 I agree you did find a bug in how the noisy numbers are adjusted. The
 change from integer division to floorDiv seems right to me. One thing you
 might also consider doing to improve handling negative values is to
 disallow them (i.e. round up to zero). This could be done by the relay
 reporting its number. We probably discussed this option, and maybe the
 reason it wasn't chosen is that it slightly biases counts so their
 expected value isn't the true value (or at least the true rounded value).
 If that's the case, and negative values are allowed to prevent biasing, we
 should recognize that (1) values are already being biased because of the
 rounding (for which we minimize the worst-case bin by adding (-binSize/2)
 at the end), and (2) adding (-binSize/2) actually makes the bias worse for
 negative bin values.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-15 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--
Changes (by amj703):

 * cc: ohmygodel (removed)
 * cc: amj703 (added)


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-15 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by karsten):

 Replying to [comment:4 asn]:
 > Replying to [comment:3 karsten]:
 > > Replying to [comment:2 asn]:
 > > If the answers above make sense and you agree that this is a possible
 bug, I'll try to produce numbers using the fixed method.
 >
 > Yes, this does seem like a possible bug. Thanks for helping me
 understand. Would you like to try to produce numbers using the fixed
 method?

 Okay, I produced numbers using the fixed method for the time in 2018 so
 far, and the change is hardly visible. That's why I put the new numbers in
 relation to old numbers here:

 [[Image(hidserv-change-task-26022.png​, 700px)]]

 > As a final review step, we could also send a small email to Aaron
 Johnson, so that he can also verify the logic here.

 I did copy an ohmygodel person on this ticket, but I might have misspelled
 that pseudonym, or Trac email did not work for some reason. I'll send him
 a good old email to let him know.

 So, I'd say this is not a big thing. However, we should fix it anyway, and
 we should be sure not to make the same mistake again in PrivCount. After
 all, this is not Java-specific. It's related to how integers are
 truncated, which is likely the same in other programming languages.
 Another good reason to let Aaron know.

 If the plan to fix this small bug seems reasonable, I'll do that in the
 next few days. It's going to be that one-line change from the ticket
 summary.

 Thanks, asn!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-15 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--
Changes (by karsten):

 * Attachment "hidserv-change-task-26022.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-14 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by asn):

 Replying to [comment:3 karsten]:
 > Replying to [comment:2 asn]:
 >
 >
 > If the answers above make sense and you agree that this is a possible
 bug, I'll try to produce numbers using the fixed method.

 Yes, this does seem like a possible bug. Thanks for helping me understand.
 Would you like to try to produce numbers using the fixed method?

 As a final review step, we could also send a small email to Aaron Johnson,
 so that he can also verify the logic here.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-09 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by karsten):

 Replying to [comment:2 asn]:
 > Hey Karsten,

 Hi asn,

 > nice research! It's been a while since we looked at this stuff, so here
 are a
 > few questions as I try to understand what's going on:

 Thanks for looking into this! Those are good questions. Let me try to
 answer them:

 > a) I'm trying to understand the rounding step and it's significance. Do
 you remember what we meant by ''The idea is that it's most likely that
 noise was added to the closest right side of a bin than to the right side
 of another bin.''?

 The following is from my memory, not from looking at code:

 I ''think'' that, on relays, we're first binning the observed number and
 then adding noise. The noise distribution has a mean value of 0. So, even
 though we don't know how much noise was added, it's more likely that a
 small negative or small positive value was added.

 > b) Why do we think that the result of that division should be floored
 towards infinity? Is it because the graph looks weird?

 Floored towards ''negative'' infinity. The idea is that we're always
 subtracting something, rather than subtracting in case of positive numbers
 and adding in case of negative numbers. We're basically handling reported
 values spanning two bins exactly the same, just because they happen to be
 close to zero. This seems wrong.

 (To be honest, this is the first time that I'm even thinking about integer
 truncation going into two different "directions" depending on whether the
 input value is positive or negative. I could imagine that none of us had
 thought about this before.)

 > c) How did you produce that graph? I agree it doesn't look good.

 By generating sample values and feeding them into the code that removes
 noise. That's the "flawed" graph. And then another time into the fixed
 code which produces the "fixed" graph.

 > d) You say ''Right now, we're handling all negative reported values
 wrong by adding 1 binSize to them which we shouldn't do.''. Where do we do
 that, and why?

 Take the "fixed" graph and imagine you add 1024 to every value < -512.
 Then you have the "flawed" graph. I didn't mean to say that we're doing
 that for a reason. I meant that this is what we're doing right now,
 unknowingly, and that I believe that it's wrong.

 > Thanks! :)

 Thanks for digging into this!

 If the answers above make sense and you agree that this is a possible bug,
 I'll try to produce numbers using the fixed method.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-09 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--

Comment (by asn):

 Hey Karsten,

 nice research! It's been a while since we looked at this stuff, so here
 are a
 few questions as I try to understand what's going on:

 a) I'm trying to understand the rounding step and it's significance. Do
 you
remember what we meant by ''The idea is that it's most likely that
 noise was
added to the closest right side of a bin than to the right side of
 another
bin.''?

 b) Why do we think that the result of that division should be floored
 towards
infinity? Is it because the graph looks weird?

 c) How did you produce that graph? I agree it doesn't look good.

 d) You say ''Right now, we're handling all negative reported values wrong
 by
adding 1 binSize to them which we shouldn't do.''. Where do we do that,
 and
why?

 Thanks! :)

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-04 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--
Changes (by karsten):

 * status:  new => needs_review


Comment:

 [[Image(noise.png, 700px)]]

 Setting to needs_review mostly to discuss the suggestion. If we decide
 that this is a good idea, I'll create a proper branch and also think about
 fixing our existing statistics.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-04 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  new
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:
+--
Changes (by karsten):

 * Attachment "noise.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

[tor-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

2018-05-04 Thread Tor Bug Tracker & Wiki
#26022: Fix a flaw in the noise-removing code in our onion service statistics
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  defect  | Status:  new
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  |   Keywords:
Actual Points:  |  Parent ID:
   Points:  |   Reviewer:
  Sponsor:  |
+--
 I think I found a flaw in the aggregation part of our onion service
 statistics.

 For context, I'm quoting a paragraph from our technical report where we
 explain how we're [https://research.torproject.org/techreports
 /extrapolating-hidserv-stats-2015-01-31.pdf extrapolating network totals
 from hidden-service statistics]:

 ''"When processing hidden-service statistics, we need to handle the fact
 that they have been obfuscated by relays. As first step, we're attempting
 to remove the additive Laplace-distributed noise by rounding up or down to
 the nearest multiple of `bin_size`. The idea is that it's most likely that
 noise was added to the closest right side of a bin than to the right side
 of another bin. In step two, we're subtracting half of `bin_size`, because
 the relay added between 0 and `bin_size − 1` to the originally observed
 value."''

 In our Java code, these two steps turned into:

 {{{
   /** Removes noise from a reported stats value by rounding to the nearest
* right side of a bin and subtracting half of the bin size. */
   private long removeNoise(long reportedNumber, long binSize) {
 long roundedToNearestRightSideOfTheBin =
 ((reportedNumber + binSize / 2) / binSize) * binSize;
 long subtractedHalfOfBinSize =
 roundedToNearestRightSideOfTheBin - binSize / 2;
 return subtractedHalfOfBinSize;
   }
 }}}

 I think that this code has a flaw: in `(reportedNumber + binSize / 2) /
 binSize`, we use [https://en.wikipedia.org/wiki/Truncation integer
 truncation]. However, as described in that Wikipedia article, ''"for
 negative numbers truncation does not round in the same direction as the
 floor function: truncation always rounds toward zero, the floor function
 rounds towards negative infinity."''

 I'm going to attach a graph (once this ticket exists) that visualizes the
 effect for reported values around zero. In short: the step function has
 one really long step, and that seems flawed or at least inconsistent.

 My suggestion would be to change that Java code and use `Math.floorDiv()`
 rather than integer truncation. Something like this:

 {{{
 diff --git
 a/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
 b/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
 index f2abc789..888c8959 100644
 --- a/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
 +++ b/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
 @@ -245,7 +245,7 @@ public class Parser {
 * right side of a bin and subtracting half of the bin size. */
private long removeNoise(long reportedNumber, long binSize) {
  long roundedToNearestRightSideOfTheBin =
 -((reportedNumber + binSize / 2) / binSize) * binSize;
 +Math.floorDiv((reportedNumber + binSize / 2), binSize) * binSize;
  long subtractedHalfOfBinSize =
  roundedToNearestRightSideOfTheBin - binSize / 2;
  return subtractedHalfOfBinSize;
 }}}

 See the [https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html
 #floorDiv-long-long- JavaDocs for that method] for details. Quoting:
 ''"Normal integer division operates under the round to zero rounding mode
 (truncation). This operation instead acts under the round toward negative
 infinity (floor) rounding mode. The floor rounding mode gives different
 results than truncation when the exact result is negative."''

 I'm yet unclear whether this might affect overall results. Right now,
 we're handling all negative reported values wrong by adding 1 `binSize` to
 them which we shouldn't do. Of course, negative reported values should be
 the exception, not the rule, but there's a reason why we're keeping them
 in the process, just like we're keeping values that we think are too large
 because of noise. We'll have to see.

 But regardless, I think we need to fix this. Opinions?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs