Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-11-05 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  closed
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:  fixed
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  irl |Sponsor:  Sponsor13
+---
Changes (by karsten):

 * status:  merge_ready => closed
 * resolution:   => fixed


Comment:

 Replying to [comment:26 karsten]:
 > Replying to [comment:25 karsten]:
 > > Replying to [comment:24 irl]:
 > > > [...] but, we should create a new ticket for generalising the
 computePercentiles function with generics.
 > >
 > > I'll create a slightly different ticket before closing this one: we
 should share more code between modules, and a generic
 `computePercentiles()` would be one piece of code to consider here. I
 wouldn't want to merge a patch that starts with this part of the code. I'd
 want us to approach this from a top-down perspective where we generalize
 similar functionality and use it in all modules to reduce the overall
 amount of code and likelihood of bugs.
 >
 > Still not closing until I get around to doing this.

 Alright, #28342 now exists with many more ideas on this topic. Closing.
 Thanks!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-10-27 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+-
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  merge_ready
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  irl |Sponsor:  Sponsor13
+-

Comment (by karsten):

 Replying to [comment:25 karsten]:
 > Replying to [comment:24 irl]:
 > > I think this is ready to go, [...]
 >
 > Thanks for the review! Pushed to master, deploying now, step by step.
 Leaving this ticket open until everything's deployed.

 Everything's deployed now.

 > > [...] but, we should create a new ticket for generalising the
 computePercentiles function with generics.
 >
 > I'll create a slightly different ticket before closing this one: we
 should share more code between modules, and a generic
 `computePercentiles()` would be one piece of code to consider here. I
 wouldn't want to merge a patch that starts with this part of the code. I'd
 want us to approach this from a top-down perspective where we generalize
 similar functionality and use it in all modules to reduce the overall
 amount of code and likelihood of bugs.

 Still not closing until I get around to doing this.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-10-25 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+-
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  merge_ready
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  irl |Sponsor:  Sponsor13
+-

Comment (by karsten):

 Replying to [comment:24 irl]:
 > I think this is ready to go, [...]

 Thanks for the review! Pushed to master, deploying now, step by step.
 Leaving this ticket open until everything's deployed.

 > [...] but, we should create a new ticket for generalising the
 computePercentiles function with generics.

 I'll create a slightly different ticket before closing this one: we should
 share more code between modules, and a generic `computePercentiles()`
 would be one piece of code to consider here. I wouldn't want to merge a
 patch that starts with this part of the code. I'd want us to approach this
 from a top-down perspective where we generalize similar functionality and
 use it in all modules to reduce the overall amount of code and likelihood
 of bugs.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-10-24 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+-
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  merge_ready
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  irl |Sponsor:  Sponsor13
+-
Changes (by irl):

 * status:  needs_review => merge_ready


Comment:

 I think this is ready to go, but, we should create a new ticket for
 generalising the computePercentiles function with generics.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-10-11 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+--
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+--

Comment (by karsten):

 Fine to wait and prioritize other things first.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-10-11 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+--
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+--

Comment (by irl):

 As this is a larger review, I'd like to look at this after the current
 sponsor19 tasks, unless we think this is an immediate priority. If anyone
 else would like to do an initial review for now, then please do.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-10-10 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  accepted
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---
Changes (by karsten):

 * status:  needs_revision => accepted


Comment:

 Alright, I finally worked on a revision of my earlier patch series. I'd
 really like to get these fixes in, so that we're using the same definition
 of percentile everywhere in our code.

 So, I went through all of iwakeh's remarks and suggestions and implemented
 most of them.

 The suggestion that I did ''not'' implement is generalize and share the
 percentile computation code, for several reasons:
  - While I think that sharing code between modules will be useful, I'd
 rather want to approach that top-down rather than bottom-up from this
 rather small method.
  - These two methods are similar, but different. It's certainly possible
 to generalize them, but this has so far delayed this patch for too long.
  - We're talking about 15 to 25 lines of code here, and the hard parts
 happen in the Apache Commons Math library. Things might be different if we
 implemented our own percentile logic, but that's not the case here.

 Instead I went for separating the percentile computation code in both
 modules into their own methods and added unit tests.

 irl, please review [https://gitweb.torproject.org/user/karsten/metrics-
 web.git/log/?h=task-26035-2 my task-26035-2 branch], which contains the
 rebased patches discussed above plus three new commits with unit test
 fixes and implementations of iwakeh's remarks and suggestions.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-10-10 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+--
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  needs_review
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+--
Changes (by karsten):

 * status:  accepted => needs_review


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-29 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  needs_revision
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+
Changes (by karsten):

 * priority:  High => Medium


Comment:

 It seems unlikely that I'll get to this before the end of the month. De-
 prioritizing in order to focus on the tickets that we should really do in
 the remaining three days of May. This one can still happen in the first
 days of June.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-25 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  needs_revision
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+

Comment (by karsten):

 Thanks for the review and remarks/suggestions! I'll revise the Java
 patches over the weekend or early next week.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-24 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  needs_revision
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+
Changes (by iwakeh):

 * status:  needs_review => needs_revision


Comment:

 Replying to [comment:15 karsten]:
 > Replying to [comment:14 iwakeh]:
 > > Taking you up on your offer from comment:13, so I can concentrate on
 reviews and tickets of CollecTor.
 >
 > Alright, happy to implement this change.

 Thanks!

 >
 > Please review [https://gitweb.torproject.org/karsten/metrics-
 web.git/log/?h=task-26035 my task-26035 branch] with three commits:
 >
 >  - [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26035&id=4f92894a1ee5315b9e4a17b38f3cdb229612f0f1
 4f92894] changes how we're computing median and inter-quartile range in
 the censorship detector code, which is still written in Python. I tested
 the change by running on our user number estimates. I found that it
 changes 159 of 2447 days in our data (6.5%) and leaves the remaining days
 entirely unchanged. This also makes sense: with a slightly different
 median and inter-quartile range we either include a value or exclude it as
 outlier. I'd say we cannot conclude that one of the implementations is
 correct and the other is not. The new implementation will simply be more
 consistent throughout our code base.

 This looks fine and will make it easier to transfer this into Java later.

 >
 >  - [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26035&id=2685c78f13cbf9402d5ba0b4380df03f246e86e5
 2685c78] makes the same change to our advertised bandwidth statistics.
 Obviously, this changes results a bit, because we're now interpolating
 between actually reported advertised bandwidths rather than returning a
 value that was actually reported by one of the relays. Still, for the sake
 of consistency throughout our code base, we should switch.
 >
 >  - [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26035&id=f9c24cab1006bf5999c662e9d06767c59c71a3e6
 f9c24ca] makes the third change in this series, this time to the
 connbidirect module. The change is quite significant in years 2011 and
 2012 where we had just a handful of relays reporting these statistics.
 Then it does make a difference whether we're interpolating or not. Same
 argument in favor of doing it now.

 The advbwdist module has a new static method, which should be made more
 visible and thus facilitate re-use as well as testing.
 Of course, for re-use it needs to be made more generic and maybe also
 placed in a different class (maybe `**.stats.Utiliy`).

 Remarks & suggestions in no particular order:
 * the sorting step in advbwdist changes an input parameter, which is bad
 practice.
 * commons-math Percentile class doesn't require the input data to be
 sorted. (The javadoc comment only talks about sorting in order to explain
 what will happen for edge cases.)
 * Maybe rather use `doubleValue()` instead of casting a Number sub-type to
 a primitive.
 * Casting of percentile results could be performed by the caller, which
 could guarantee that there are only values of for example type short
 entered (see connbiderect).  Or, provide special utility methods that re-
 use code internally.
 * connbidirect uses similar code as advbwdist for almost the same
 computations. The input fraction list also get changed by the unnecessary
 sorting step (this might not matter in that case, but still)
 * The Java re-implementation of the python detector will also benefit from
 a percentile function.
 * The percentile input parameter `int[] percentiles` could be changed to
 `int ... percentiles`.

 Encapsulation and testability of this type of functionality that is needed
 throughout the code is essential and will also make documentation now and
 in future much easier.
 The functionality should especially be tested b/c of the large impact such
 changes have, i.e., re-computation of everything.  This should be revised.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-19 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+--
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  needs_review
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+--
Changes (by karsten):

 * status:  assigned => needs_review


Comment:

 Replying to [comment:14 iwakeh]:
 > Taking you up on your offer from comment:13, so I can concentrate on
 reviews and tickets of CollecTor.

 Alright, happy to implement this change.

 Please review [https://gitweb.torproject.org/karsten/metrics-
 web.git/log/?h=task-26035 my task-26035 branch] with three commits:

  - [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26035&id=4f92894a1ee5315b9e4a17b38f3cdb229612f0f1
 4f92894] changes how we're computing median and inter-quartile range in
 the censorship detector code, which is still written in Python. I tested
 the change by running on our user number estimates. I found that it
 changes 159 of 2447 days in our data (6.5%) and leaves the remaining days
 entirely unchanged. This also makes sense: with a slightly different
 median and inter-quartile range we either include a value or exclude it as
 outlier. I'd say we cannot conclude that one of the implementations is
 correct and the other is not. The new implementation will simply be more
 consistent throughout our code base.

  - [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26035&id=2685c78f13cbf9402d5ba0b4380df03f246e86e5
 2685c78] makes the same change to our advertised bandwidth statistics.
 Obviously, this changes results a bit, because we're now interpolating
 between actually reported advertised bandwidths rather than returning a
 value that was actually reported by one of the relays. Still, for the sake
 of consistency throughout our code base, we should switch.

  - [https://gitweb.torproject.org/karsten/metrics-
 web.git/commit/?h=task-26035&id=f9c24cab1006bf5999c662e9d06767c59c71a3e6
 f9c24ca] makes the third change in this series, this time to the
 connbidirect module. The change is quite significant in years 2011 and
 2012 where we had just a handful of relays reporting these statistics.
 Then it does make a difference whether we're interpolating or not. Same
 argument in favor of doing it now.

 I'm currently re-processing the descriptor archive for updated advbwdist
 statistics (second commit above). Re-doing the clients and connbidirect
 statistics using the updated code is much simpler. I hope to be ready to
 deploy the change in the next few days. Ideally, we'd be done with the
 review process by then, too. Thanks in advance!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-18 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  karsten
 Type:  enhancement | Status:  assigned
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---
Changes (by iwakeh):

 * owner:  iwakeh => karsten
 * status:  accepted => assigned


Comment:

 Taking you up on your offer from comment:13, so I can concentrate on
 reviews and tickets of CollecTor.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-16 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---

Comment (by karsten):

 Sounds like we agree on the approach. (Except that we'll want to take
 another look at e) and f) when the others are done.) Looking forward to
 reviewing code! If, for any reason, you prefer to work on something else
 instead, I'm happy to help here. Thanks!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-16 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---

Comment (by iwakeh):

 Replying to [comment:5 karsten]:

 Leaving out all commons-math related, b/c we agreed to use it.

 > Thanks, very useful! Let me first try to answer the open questions:
 >
 >  - What's up with a) and c) using slightly different percentile
 implementations? The reason is that we're including the 0th (minimum) and
 100th percentile (maximum) in a) which we're not in c). It's totally
 possible that what we're using right now for a) is a terrible hack. Maybe
 we should instead use the formula for c) in a) and handle percentile 0 or
 100 as a special case. Whatever the other implementations do.

 Well, c) would fail on `1.0`, but that wouldn't occur b/c only quartiles
 are computed. This ought to be fixed and both implementations will be the
 same except for the edge cases.

 >
 >  - What's up with e) and f) not being quartiles? What we're doing there
 is that we're computing the ''weighted'' quartiles. And again, it might be
 that it's a hack that we should rewrite. The goal should be to implement a
 weighted trimmed mean. The technical report probably has a better
 definition. What we cannot do, though, is use the exact same percentile
 definition as we're using for the other places.

 Well, I wouldn't call .25 times a value (fraction sum in the code) a
 quartile, and the code calculates the weighed mean of all intervals
 contained in `[sumFraction * 0.25, sumFraction * 0.75]`.  So, nothing to
 be done here.

 >
 > ...
 > I'm slightly leaning towards R-7 here.

 I don't feel strongly about this.

 >
 > ...
 > Except for Java where we'd have to implement something ourselves, which
 would also have to handle special cases 0 and 100.

 Yes the minimum and maximum need to be coded.

 >
 > ...
 > P.S.: Did I write something about trucks? I meant insect legs! Unless
 those have a spare leg mounted somewhere, too, in which case I'll think
 even harder about a good example. ;)

 Well, for insects the leg number is fixed to six, unless they loose a leg
 and live on later.  Might be best to stick to the values at hand ;-)

 So, I implement the changes decided in this and the previous comments.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-16 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  High|  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---
Changes (by karsten):

 * priority:  Medium => High


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-16 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---

Comment (by karsten):

 Heh, okay. I'm setting priority to high, just in case it helps with
 prioritizing tickets. Thanks!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-16 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---

Comment (by iwakeh):

 Replying to [comment:8 karsten]:
 > Sounds good!
 >
 > (Did you also see my longer comment above the one that you replied to?)

 Yes; still optimizing the reply length :-)

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-16 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---

Comment (by karsten):

 Sounds good!

 (Did you also see my longer comment above the one that you replied to?)

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-16 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---

Comment (by iwakeh):

 Replying to [comment:6 karsten]:
 > Replying to [comment:5 karsten]:
 > > By the way, do you feel strongly about avoiding Apache Commons Math?
 We'd only have to add it to metrics-web, and it would save us half a day
 of writing code and testing it. After all, we also rely on libraries for
 things like base64 encoding, which is not rocket science to implement
 ourselves. We wouldn't have to add it to the metrics-web .war file!
 >
 > Related aspect: When we rewrite the censorship detector (#21588), we'll
 have to compute mean and standard deviation of approximately 50 sample
 values. We'll probably want to use a library for that, too.


 Sure, commons-math might also be useful for future additions of new
 metrics and estimates.
 So, let's settle on adding the dependency.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-15 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---

Comment (by karsten):

 Replying to [comment:5 karsten]:
 > By the way, do you feel strongly about avoiding Apache Commons Math?
 We'd only have to add it to metrics-web, and it would save us half a day
 of writing code and testing it. After all, we also rely on libraries for
 things like base64 encoding, which is not rocket science to implement
 ourselves. We wouldn't have to add it to the metrics-web .war file!

 Related aspect: When we rewrite the censorship detector (#21588), we'll
 have to compute mean and standard deviation of approximately 50 sample
 values. We'll probably want to use a library for that, too.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-12 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---

Comment (by karsten):

 Thanks, very useful! Let me first try to answer the open questions:

  - What's up with a) and c) using slightly different percentile
 implementations? The reason is that we're including the 0th (minimum) and
 100th percentile (maximum) in a) which we're not in c). It's totally
 possible that what we're using right now for a) is a terrible hack. Maybe
 we should instead use the formula for c) in a) and handle percentile 0 or
 100 as a special case. Whatever the other implementations do.

  - What's up with e) and f) not being quartiles? What we're doing there is
 that we're computing the ''weighted'' quartiles. And again, it might be
 that it's a hack that we should rewrite. The goal should be to implement a
 weighted trimmed mean. The technical report probably has a better
 definition. What we cannot do, though, is use the exact same percentile
 definition as we're using for the other places.

  - I think you left out the Python code that is our current censorship
 detector. Which is fine, as I see how we could change that code to match
 what we're doing elsewhere.

 So, I guess the decision we need to make is whether we want to use R-1 or
 R-7 everywhere, right?

 I'm slightly leaning towards R-7 here.

 One reason is that, if we used R-1, we couldn't use R's default `median()`
 anymore, because that interpolates. I found a non-interpolating median
 implementation in Python, called
 [https://docs.python.org/3/library/statistics.html#statistics.median_low
 median_low] (or median_high). And I think the Tor daemon uses a low median
 for some things related to directory authority voting. But I believe it's
 not the standard.

 So, if we use R-7, we should have good tool support.

 Except for Java where we'd have to implement something ourselves, which
 would also have to handle special cases 0 and 100.

 By the way, do you feel strongly about avoiding Apache Commons Math? We'd
 only have to add it to metrics-web, and it would save us half a day of
 writing code and testing it. After all, we also rely on libraries for
 things like base64 encoding, which is not rocket science to implement
 ourselves. We wouldn't have to add it to the metrics-web .war file!

 P.S.: Did I write something about trucks? I meant insect legs! Unless
 those have a spare leg mounted somewhere, too, in which case I'll think
 even harder about a good example. ;)

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-11 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---

Comment (by iwakeh):

 This turned out to be longer than intended:

 a) ​Advertised bandwidth distribution:
  * With the notation in comment:1 the Java code uses essentially `result =
 val(floor((N-1)*percentile))` [https://gitweb.torproject.org/metrics-
 
web.git/tree/src/main/java/org/torproject/metrics/stats/advbwdist/Main.java#n124
 about here].
  * [https://gitweb.torproject.org/metrics-
 web.git/tree/src/main/R/advbwdist/aggregate.R#n13 R code] takes the median
 of the Java calculated percentiles.
 b) ​Advertised bandwidth of n-th fastest relays uses the resulting data
 from a).
 c) ​Fraction of connections used uni-/bidirectionally:
 [https://gitweb.torproject.org/metrics-
 
web.git/tree/src/main/java/org/torproject/metrics/stats/connbidirect/Main.java#n468
 Uses Java] and calculates `result = val(floor(N*percentile))` for the
 three percentiles .25, .5, and .75.
 d) ​Time to download files over Tor: uses percentile_cont
 e) ​Unique .onion addresses (version 2 only):
 [https://gitweb.torproject.org/metrics-
 
web.git/tree/src/main/java/org/torproject/metrics/stats/hidserv/Aggregator.java#n165
 The code] doesn't seem to calculate quartiles, rather checks that the
 interval is contained in the 25% to 75% interval of the fraction sum.
 Hmm, what am I missing here?
 f) Onion-service traffic (versions 2 and 3): same as e).

 In total, the Java calculations a) and c) use a discrete version of median
 calculation and differ in 'slight index shifting'.
 The calculations in e) and f) are not really 'quartiles'.
 The remaining calculations use R's median and postgresql percentile_cont,
 where R's standard median is calculated (in pseude code) as
 {{{
 #!C
 first = floor(percentile * N);
 second = ceil(percentile * N);
 /* if first==second the value of first is used. */
 if (first==second) {
   result = val(first);
 } else { /* if first and second differ, take the average. */
   result = val(first) + (0.5 * (val(second) - val(first)));
 }
 }}}

 So R's standard median is the same as 50%-percentile of type R-2 and also
 coincides with 50%-percentile of type R-7.

 There is a variety there.  The discrete types are easier to compute (when
 trying to reproduce the results for example).  Introducing the
 interpolation (or continuous) type in Java would mean to complicate the
 current code a little, but could be done w/o commons-math.
 Of course, the two calculations in a) and c) should be the same, but
 that's only a minor change and not related to the choice of percentile
 calculation.

 === I: R-1
 If we decide to use the discrete R-1 throughout.  If so, we'd need to
 * replace percentile_cont by percentile_dics, and
 * replace R's median function throughout by the 50%-percentile type R-1
 provided by utility function named `metricsmedian`.

 === II: R-7
 If we decide to use of the proportionate interpolation method R-7
 throughout, there are these work packages:

 * implement a simple utility interpolation function for Java (similar to
 postgresql's approach)
 * and make use of it in a) and c).
 * replace R's median function throughout by the 50%-percentile type R-7
 provided by  utility function named `metricsmedian`.


 In both cases the calculations e) and f) stay as they are, but need more
 documentation.


 
 PS: (Trucks often have a spare tire mounted somewhere, and in some
 countries they use three wheeled trucks ;-)

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-09 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+---
 Reporter:  karsten |  Owner:  iwakeh
 Type:  enhancement | Status:  accepted
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+---
Changes (by iwakeh):

 * owner:  metrics-team => iwakeh
 * status:  new => accepted


Comment:

 Accepting.

 Next steps:
 * find decisions and answers for the few open questions.
 * identify work packages and create sub-tickets.
 * implement.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-09 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  enhancement | Status:  new
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+--

Comment (by karsten):

 Thanks for this thoughtful response! I have just a few more thoughts on
 this:

 I'm not entirely sure how that pseudocode handles `percentile` values of
 0.0 and 1.0. Is `val()` 0-indexed or 1-indexed? In either case, we'll end
 up outside of values with one of the two percentiles. (PostgreSQL's
 `PERCENTILE_CONT` does support percentiles 0.0 and 1.0.)

 I also looked more closely at types `R-1` and `R-2` and figured that
 PostgreSQL's `PERCENTILE_DISC` is `R-1`, because it does not produce any
 averages. So, PostgreSQL implements `R-1` and `R-7`.

 (By the way, when I refer to them as `R-x`, that's mainly to simplify our
 discussion here. I'm happy to specify them with their formulas and only
 mention that they are what R defines as type x in their `quantile()`
 function.)

 Regarding R's `median()`, function, that produces the same result as `R-2`
 and `R-7`, right?

 I wonder if, for the sake of simplicity, we should avoid using
 `PERCENTILE_DISC` (which we're not using yet, AFAIK) and only use
 `PERCENTILE_CONT` and R's `median()`. That is, use `R-7` everywhere.

 I do agree that interpolation between two integers representing user
 numbers doesn't make as much sense. But we can always truncate or round
 results, if we believe that integers are less confusing.

 (I could imagine that if we were to compute percentiles of truly discrete
 variables like the number of tires mounted on trucks, we wouldn't want to
 return 7, but only actual sample values. I don't think that we need to
 worry about that here.)

 Regarding Apache Commons Math, we're not using that yet, and I don't feel
 strongly about adding it as dependency or implementing this quite simple
 function ourselves, say, in metrics-lib. Worth adding tests, I guess.

 Regarding Python, I'm amending my statement above a little bit. It's true
 that we're going to replace our last remaining Python code. Still, if we
 want to make our numbers reproducible, we'll have to accept that many of
 our users will want to reproduce them using Python. We should at least
 take a brief look how this would work.

 So, your possible steps make sense. Is this something you'd like to work
 on?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

2018-05-08 Thread Tor Bug Tracker & Wiki
#26035: Streamline sample quantile types used in the various modules
+--
 Reporter:  karsten |  Owner:  metrics-team
 Type:  enhancement | Status:  new
 Priority:  Medium  |  Milestone:
Component:  Metrics/Statistics  |Version:
 Severity:  Normal  | Resolution:
 Keywords:  |  Actual Points:
Parent ID:  | Points:
 Reviewer:  |Sponsor:  Sponsor13
+--

Comment (by iwakeh):

 Long description, true.

 Let's look at the postgresql implementation (using source of 9.6.8, but
 assuming it is not subject to frequent change,
 ./src/backend/utils/adt/orderedsetaggs.c).

 The computation of percentile_* as C-ish pseudo-code (which could easily
 be used for implementation in Java or python):
 {{{
 #!C
 /*
All values are sorted and indexed according to the order.
first and second are indices; val(k) is the value at index k;
percentile is the wanted percentile.  N is the count of values.
 */

 first = floor(percentile * N);
 second = ceil(percentile * N);
 /* if first==second the value of first is used. */
 if (first==second) {
   result = val(first);
 } else { /* if first and second differ the following interpolation is
 used. */
   proportion = (percentile * N) - first;
   /* the value is chosen between the values of first and second */
   result = val(first) + (proportion * (val(second) - val(first)));
 }

 /*---*/
 /* For comparison percentile_disc:  */
 result = val( ceil(N*percentile));
 }}}

 For values, where fractions make sense, e.g. seconds for onionperf
 results, the interpolation or continuous method could be used and for all
 else, e.g. user numbers etc., the simpler calculation (refered to as
 discontinuous) could be used.

 The classification R-x is not really a DIN and I wouldn't rely on an
 implementation without checking the source.
 Thus, using our own implementation in Java might be less trouble and
 smaller dependency counts/space used.  Python is not used anymore soon,
 and R is only used for the median and either could document the median
 calculation in R or implement the 0.5-percentile.

 Possible steps:
 * keep using postgresql percentile_cont
 * implement the equivalent Java functionality and replace commons-math, if
 this is the only functionality why it is included.
 * document percentile calculation once for Java together with postgresql
 * adapt the R median calculation, if necessary

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs