Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-29 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by karsten):

 Cherry-picked and pushed that 1-line commit.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-28 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by karsten):

 Replying to [comment:26 iwakeh]:
 > Please also merge [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/commit/?h=task-22983-2=e8e783d3a9ae4e220e70fbd14240d0979672ac2a
 this commit] to current master, to avoid [https://metrics.torproject.org
 /metrics-lib/index.html?org/torproject/descriptor/internal/package-
 summary.html javadoc for package 'internal'] on the metrics.tp.o web site.

 Huh, good catch. Will do tomorrow.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-24 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Please also merge [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/commit/?h=task-22983-2=e8e783d3a9ae4e220e70fbd14240d0979672ac2a
 this commit] to current master, to avoid [https://metrics.torproject.org
 /metrics-lib/index.html?org/torproject/descriptor/internal/package-
 summary.html javadoc for package 'internal'] on the metrics.tp.o web site.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-15 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Replying to [comment:22 karsten]:
 > Alright, I looked at the remaining commits (in
 [https://gitweb.torproject.org/user/karsten/metrics-
 lib.git/log/?h=task-22983-3 ​my task-22983-3 branch]). I'll start with the
 major issues/questions in no specific order:
 >
 >  1. Why do we compress previously uncompressed log files? I see the
 point of saving memory, but we'd be doing that by sacrificing CPU time.
 And if we later change to leaving file contents on disk and only storing
 offsets and lengths into files, it would be wasteful to store a compressed
 copy of the file in memory just in case the application might need it
 later. Ideally, we'd just store a reference to the `byte[]` in whatever
 compression we're given, including uncompressed.
 >
 >  2. I already brought this up in
 [https://trac.torproject.org/projects/tor/ticket/22983#comment:12 comment
 12] above but didn't see a response: "Shouldn't `getRawDescriptorBytes()`
 return the uncompressed bytes and a separate method `getCompressedBytes()`
 return the compressed bytes? Thinking of being consistent with other
 descriptors where `getRawDescriptorBytes()` returns uncompressed bytes,
 too. Not sure about this one." Related to this, should
 `getRawDescriptorLength()` return the length of the ''uncompressed'' byte
 array? (This possibly requires uncompressing the file on first invocation
 and storing the length in an attribute.) What do you think about changing
 this for the sake of library consistency?


 Here and in 1. I noticed that I had some implicit assumptions about log
 descriptors that led to the chosen implementation.  Once #23243 is
 answered these concerns can be addressed in a better way.


 >
 >  3. Why do sanitized log lines contain a trailing `-`, as in: `... 403
 294 \"-\" \"-\" -\n`? I know that the
 [https://gitweb.torproject.org/webstats.git/tree/src/sanitize.py Python
 script] also added that trailing dash, so I'm asking if you think there's
 a reason to keep that. The [https://httpd.apache.org/docs/2.4/logs.html
 Combined Log Format] does not specify one. If you think it can go away we
 should quickly check with Sebastian and then take it out.

 Also a discussion for the spec ticket #23243.  But in general I don't see
 a need for the trailing dash, I only reproduced the log-lines from the
 python implementation.

 >
 >  4. From the tests it seems like `POST` requests are kept, too. However,
 we should only keep `GET` and `HEAD` requests, just like the
 [https://gitweb.torproject.org/webstats.git/tree/src/sanitize.py Python
 script]. Likewise, `400` and `404` requests should be discarded. Maybe
 check for other deviations from the script yourself. And in the next
 review round we should compare the two sanitizers (Python and Java) using
 some real logs. Or do you still have logs to run some tests yourself?

 I did run such tests and some of the test files are taken from the real
 vs. python-cleaned logs (the real ones without pi info).  In #23234 we
 should craft the input and target formats; once that is done change
 implementation and tests accordingly.

 >
 >  5. `LogDescriptorImpl` should not sort logs by default as part part of
 the sanitizing step. That's a specific sanitizing technique for web server
 logs. It might be that a future log format only requires removing certain
 fields but not re-ordering log lines. Maybe there should be a second
 method `cleanLines(List)` in `InternalLogDescriptor.Sanitizer`,
 and the existing method should be renamed to `cleanLine(String)`. The
 default sanitizer should keep the order unchanged and simply return the
 list it gets.

 True, the re-ordering is web-server-access-log specific and should be
 moved.

 >
 > I also found a few minor issues where it might be easiest if I fix them
 myself. I'll do that in the next review round as one or more suggested
 commits, and if you agree with those changes, I'll squash them, and maybe
 we can include them in the coding conventions afterwards.

 Ok, if these are 'orthogonal' to the above topics, but please give #23234
 a higher priority.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing 

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-15 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Replying to [comment:23 iwakeh]:
 > Hmm, when looking through your comment:22 I notice many
 questions/concerns that are due to not having an explicit webserver-
 access-log specification.  As we need that anyway I'll open a ticket for
 writing such spec and add all the open assumptions about 'real' and
 sanitized web-server-access-logs there.  I think, from such an explicit
 spec it will be easier to find solution for the implementational/design
 details.
 >

 See ticket #23243 for more.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-15 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Hmm, when looking through your comment:22 I notice many questions/concerns
 that are due to not having an explicit webserver-access-log specification.
 As we need that anyway I'll open a ticket for writing such spec and add
 all the open assumptions about 'real' and sanitized web-server-access-logs
 there.  I think, from such an explicit spec it will be easier to find
 solution for the implementational/design details.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-15 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---
Changes (by karsten):

 * status:  needs_review => needs_revision


Comment:

 Alright, I looked at the remaining commits (in
 [https://gitweb.torproject.org/user/karsten/metrics-
 lib.git/log/?h=task-22983-3 ​my task-22983-3 branch]). I'll start with the
 major issues/questions in no specific order:

  1. Why do we compress previously uncompressed log files? I see the point
 of saving memory, but we'd be doing that by sacrificing CPU time. And if
 we later change to leaving file contents on disk and only storing offsets
 and lengths into files, it would be wasteful to store a compressed copy of
 the file in memory just in case the application might need it later.
 Ideally, we'd just store a reference to the `byte[]` in whatever
 compression we're given, including uncompressed.

  2. I already brought this up in
 [https://trac.torproject.org/projects/tor/ticket/22983#comment:12 comment
 12] above but didn't see a response: "Shouldn't `getRawDescriptorBytes()`
 return the uncompressed bytes and a separate method `getCompressedBytes()`
 return the compressed bytes? Thinking of being consistent with other
 descriptors where `getRawDescriptorBytes()` returns uncompressed bytes,
 too. Not sure about this one." Related to this, should
 `getRawDescriptorLength()` return the length of the ''uncompressed'' byte
 array? (This possibly requires uncompressing the file on first invocation
 and storing the length in an attribute.) What do you think about changing
 this for the sake of library consistency?

  3. Why do sanitized log lines contain a trailing `-`, as in: `... 403 294
 \"-\" \"-\" -\n`? I know that the
 [https://gitweb.torproject.org/webstats.git/tree/src/sanitize.py Python
 script] also added that trailing dash, so I'm asking if you think there's
 a reason to keep that. The [https://httpd.apache.org/docs/2.4/logs.html
 Combined Log Format] does not specify one. If you think it can go away we
 should quickly check with Sebastian and then take it out.

  4. From the tests it seems like `POST` requests are kept, too. However,
 we should only keep `GET` and `HEAD` requests, just like the
 [https://gitweb.torproject.org/webstats.git/tree/src/sanitize.py Python
 script]. Likewise, `400` and `404` requests should be discarded. Maybe
 check for other deviations from the script yourself. And in the next
 review round we should compare the two sanitizers (Python and Java) using
 some real logs. Or do you still have logs to run some tests yourself?

  5. `LogDescriptorImpl` should not sort logs by default as part part of
 the sanitizing step. That's a specific sanitizing technique for web server
 logs. It might be that a future log format only requires removing certain
 fields but not re-ordering log lines. Maybe there should be a second
 method `cleanLines(List)` in `InternalLogDescriptor.Sanitizer`,
 and the existing method should be renamed to `cleanLine(String)`. The
 default sanitizer should keep the order unchanged and simply return the
 list it gets.

 I also found a few minor issues where it might be easiest if I fix them
 myself. I'll do that in the next review round as one or more suggested
 commits, and if you agree with those changes, I'll squash them, and maybe
 we can include them in the coding conventions afterwards.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-15 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by karsten):

 Replying to [comment:20 iwakeh]:
 > Replying to [comment:19 karsten]:
 > > That's quite a few commits there, more than I can handle in a single
 chunk. But let me start going through them and put reviews here as I
 finish them.
 >
 > Yes, thanks for starting!
 >
 > >
 > > e0c5774 and e224680 look good. Already merged to master.
 > >
 > > Some suggestions for 77b143d:

 Fixes look good, squashed and merged to master.

 > > d687f44 looks good.

 Merged to master.

 > > I didn't finish my review of 76ae1e7, but here's some early feedback:

 Fixes so far look good, though I might have more thoughts on the
 compression part. I assumed that you wouldn't work on this branch without
 new input, so I went ahead and squashed the fixup commits and pushed
 everything to [https://gitweb.torproject.org/user/karsten/metrics-
 lib.git/log/?h=task-22983-3 my task-22983-3 branch] which I'll refer to
 from now on. Stay tuned! :)

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-14 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Replying to [comment:19 karsten]:
 > That's quite a few commits there, more than I can handle in a single
 chunk. But let me start going through them and put reviews here as I
 finish them.

 Yes, thanks for starting!

 >
 > e0c5774 and e224680 look good. Already merged to master.
 >
 > Some suggestions for 77b143d:
 >  - JSON is not exactly a compression type, it's a file content type.
 Maybe we can remove it now and fall back to the same PLAIN type that we
 fall back to for uncompressed logs?

 I didn't want to impose additional changes for generating the index.json*
 files.  I think it is also useful to state explicitly that files ending in
 '.json' are not compressed.
 Maybe, we can revisit the question later when applying more changes, e.g.,
 also adding enums for 'tar' and 'tar.gz' etc.?

 >  - The (duplicate) documentation in `FileType` doesn't help that much.
 I'd say let's either document all types or none of them. If I had to
 choose, I'd say none.

 Agreed, I removed the comments.

 >  - In `FileType#findType`, I think it's bad to catch `RuntimeException`,
 because that covers all kinds of programming errors made by application
 developers. For example, if they give us `ext = null` as parameter, we'll
 tell them it's a `PLAIN` file, but really they shouldn't give us that
 parameter. Better catch `IllegalArgumentException` only and let them catch
 their `NullPointerException` if they think that giving us `null` is a good
 idea. Related to this, let's not say in the documentation that we're not
 throwing any exceptions. (Of course, if the plan is to handle `null`
 exactly like this, let's put `IllegalArgumentException |
 NullPointerException` in the `catch` clause to indicate that we put it
 there intentionally.)

 True, the catch can be more restrictive; a fixup-commit contains the
 `IllegalArgumentException | NullPointerException` variation.

 >  - If you make changes to this commit, please make them as fixup/squash
 commit on top of the branch, so that I can continue reviewing subsequent
 commits.
 >
 > d687f44 looks good.
 >
 > I didn't finish my review of 76ae1e7, but here's some early feedback:
 >  - There's a file `IndexNode.java.orig` which should go away (in a later
 fixup commit).

 Removed.

 >  - The documentation of `WebServerAccessLogImpl` says that "the file is
 not read." Yet, it says a few sentences later that "it will be compressed
 to the default compression type" in case it's not compressed yet. That's a
 contradiction. And should we really do this? Or should we provide a way
 for the (internal) user to compress the file using the compression type of
 their choice?

 Good point!  I added a fixup commit, which provides a constructor arg for
 the default compression, but that will only be used, if the supplied log
 file doesn't have any compression extension.  All other cases, like
 changing compression type should be handled outside of metrics-lib using
 FileType enums directly.

 There is no contradiction, b/c the given bytes will be compressed, but not
 the file.  The File constructor argument is here to implement `Descriptor`
 interface's method `getDescriptorFile` (and also b/c we need filename and
 immediate parent path for meta-data).
 I think, it doesn't hurt to move the discussion and resulting changes to
 'the big picture' of ticket #22695.

 >
 > I'll resume the review as soon as I get to it.

 All changes are provided as fixup commits.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-14 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by karsten):

 That's quite a few commits there, more than I can handle in a single
 chunk. But let me start going through them and put reviews here as I
 finish them.

 e0c5774 and e224680 look good. Already merged to master.

 Some suggestions for 77b143d:
  - JSON is not exactly a compression type, it's a file content type. Maybe
 we can remove it now and fall back to the same PLAIN type that we fall
 back to for uncompressed logs?
  - The (duplicate) documentation in `FileType` doesn't help that much. I'd
 say let's either document all types or none of them. If I had to choose,
 I'd say none.
  - In `FileType#findType`, I think it's bad to catch `RuntimeException`,
 because that covers all kinds of programming errors made by application
 developers. For example, if they give us `ext = null` as parameter, we'll
 tell them it's a `PLAIN` file, but really they shouldn't give us that
 parameter. Better catch `IllegalArgumentException` only and let them catch
 their `NullPointerException` if they think that giving us `null` is a good
 idea. Related to this, let's not say in the documentation that we're not
 throwing any exceptions. (Of course, if the plan is to handle `null`
 exactly like this, let's put `IllegalArgumentException |
 NullPointerException` in the `catch` clause to indicate that we put it
 there intentionally.)
  - If you make changes to this commit, please make them as fixup/squash
 commit on top of the branch, so that I can continue reviewing subsequent
 commits.

 d687f44 looks good.

 I didn't finish my review of 76ae1e7, but here's some early feedback:
  - There's a file `IndexNode.java.orig` which should go away (in a later
 fixup commit).
  - The documentation of `WebServerAccessLogImpl` says that "the file is
 not read." Yet, it says a few sentences later that "it will be compressed
 to the default compression type" in case it's not compressed yet. That's a
 contradiction. And should we really do this? Or should we provide a way
 for the (internal) user to compress the file using the compression type of
 their choice?

 I'll resume the review as soon as I get to it.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-11 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---
Changes (by iwakeh):

 * status:  needs_revision => needs_review


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-10 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Replying to [comment:16 karsten]:
 > I think we agree on most points here.
 >
 > Just one suggestion for the `internal` package: we could label it as
 "beta" with the intention to making it a public API at a later stage,
 while reserving to break the API even between minor releases until we
 remove the "beta" label. This would give us some freedom while designing
 this new API to make mistakes and revise ideas.

 I implemented the changes along the lines we discussed in the above
 comments.
 There are plenty of tests for the new code (which already helped uncover
 subtle bugs) and javadoc for both the public and the internal parts, which
 should also address some of the questions from your first review comments.

 Please review seven commits [https://gitweb.torproject.org/user/iwakeh
 /metrics-lib.git/log/?h=task-22983-2 on this branch].  Three of these
 commits ([https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/commit/?h=task-22983-2=e0c5774529fb2597e80690b5cc625e54a2eeb245
 one], [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/commit/?h=task-22983-2=e2246804e366920c9c334d774575217be0c644f8
 two], [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/commit/?h=task-22983-2=d687f440ea570bf24fd01bdbf53365686f68a1aa
 three]) are tiny maintenance commits and the other four implement this
 task.

 As a result of this patch the CollecTor patch for webstats (#22428) will
 be very small and quite some code reduction will be possible in other
 CollecTor modules.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-01 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by karsten):

 I think we agree on most points here.

 Just one suggestion for the `internal` package: we could label it as
 "beta" with the intention to making it a public API at a later stage,
 while reserving to break the API even between minor releases until we
 remove the "beta" label. This would give us some freedom while designing
 this new API to make mistakes and revise ideas.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-08-01 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Replying to [comment:14 karsten]:
 > I'm game. Your suggestion doesn't at all sound crazy. The only reason I
 pushed back was that such a change requires a lot of consideration and
 discussion and shouldn't be included together with a new feature but on
 its own. A few thoughts:

 Fine :-)

 >  - How about we include a second package with a second set of interfaces
 for CollecTor and other hypothetical applications providing and/or
 sanitizing descriptors? Ideally, applications that simply consume
 descriptors wouldn't have to worry about that second set. And
 implementation classes could implement interfaces from both sets.

 Makes sense; I would suggest `org.torproject.descriptor.internal`.

 >  - I agree that the discussion of JSON handling parts in CollecTor and
 metrics-lib was related. Another aspect there is metrics-web that will
 soon need to parse CollecTor's index.json, too, in order to implement
 #22836.

 Yep.

 >  - I'm not opposed to including the sanitizing code in metrics-lib if
 it's hidden from the descriptor-consuming interfaces. CollecTor will still
 contain all the logic for timing downloads. But as soon as we put the
 sanitized bridge descriptor specification into metrics-web, CollecTor
 won't be the only place anymore that needs changing when we change
 sanitizing code anyway. So, it doesn't really matter whether the
 sanitizing code is in CollecTor or metrics-lib. I could imagine that we
 can remove some duplicate code by putting everything directly related to
 messing with descriptors in a single place.

 Agreed.  This would also allow tests that could catch problems when
 sanitation changes etc.

 >  - We should focus on one change at a time. That is, we could either
 start with adding web logs without any sanitizing support and duplicate
 some code in CollecTor. Or we could start with this discussion of
 extending scope of metrics-lib and put the web logs extension on hold. I'm
 fine with either approach.

 Well, as for the next steps I would go ahead and add a section to metrics-
 lib README.md about the `internal` API package, which will condense and
 finalize the discussion (new ticket?).
 Then, I'd also like to start immediately putting the definitions/design
 there into code for webstats.  The code for the webstats module is already
 there and just needs to be 'distributed' across products.
 The public-API addition to metrics-lib is defined (i.e., the above with
 the suggestions from your comment and w/o any sanitation related).  The
 additional internal-API is internal and only used in CollecTor's new
 webstats module, which leaves a lot freedom to adapt things, if necessary.

 > There, I didn't fully think this through, but I didn't want to delay
 this any further in case you want to give this more thoughts. I'll keep
 thinking about this a little, as time permits, but please let me know how
 you want to continue, and I'll try to help move it forward in that
 direction. Thanks!

 Thanks, too!
 I think, this is a good step forward.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-31 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by karsten):

 I'm game. Your suggestion doesn't at all sound crazy. The only reason I
 pushed back was that such a change requires a lot of consideration and
 discussion and shouldn't be included together with a new feature but on
 its own. A few thoughts:
  - How about we include a second package with a second set of interfaces
 for CollecTor and other hypothetical applications providing and/or
 sanitizing descriptors? Ideally, applications that simply consume
 descriptors wouldn't have to worry about that second set. And
 implementation classes could implement interfaces from both sets.
  - I agree that the discussion of JSON handling parts in CollecTor and
 metrics-lib was related. Another aspect there is metrics-web that will
 soon need to parse CollecTor's index.json, too, in order to implement
 #22836.
  - I'm not opposed to including the sanitizing code in metrics-lib if it's
 hidden from the descriptor-consuming interfaces. CollecTor will still
 contain all the logic for timing downloads. But as soon as we put the
 sanitized bridge descriptor specification into metrics-web, CollecTor
 won't be the only place anymore that needs changing when we change
 sanitizing code anyway. So, it doesn't really matter whether the
 sanitizing code is in CollecTor or metrics-lib. I could imagine that we
 can remove some duplicate code by putting everything directly related to
 messing with descriptors in a single place.
  - We should focus on one change at a time. That is, we could either start
 with adding web logs without any sanitizing support and duplicate some
 code in CollecTor. Or we could start with this discussion of extending
 scope of metrics-lib and put the web logs extension on hold. I'm fine with
 either approach.
 There, I didn't fully think this through, but I didn't want to delay this
 any further in case you want to give this more thoughts. I'll keep
 thinking about this a little, as time permits, but please let me know how
 you want to continue, and I'll try to help move it forward in that
 direction. Thanks!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-30 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Thanks for review!  I'll get to the details later, but first:

 A similar discussion came up during the work on JSON handling parts of
 CollecTor and metrics-lib.  Thinking about this for a while, I'd say we
 should step back a little and tackle the main discussion to find a
 solution for having a clean API without any (cross-product) code
 duplication:

 Goals:
 1) keep metric-lib API a pure client API
 2) avoid all code duplication throughout Metrics' products

 CollecTor is an unusual metrics-lib client, because CollecTor alters
 descriptors and even creates the final descriptors (e.g., adding
 annotations, sanitizing privacy problematic data).  Thus, there is some
 functional overlap (extracting bytes, (de)compression, JSON processing)
 with metrics-lib.

 Adding another Metrics product for these cross-concerns seems too much
 overhead in terms of dealing with the additional dependency in Metrics'
 products, committing to maintain another product and another public API.
 Extending metrics-lib's API (as with the above sanitize method) breaks 1.
 On the other hand, rewriting parsing code (like for bridge-descs-
 sanitation) or de/compression functionality violates 2.

 Suggestion:
 (hope the following doesn't sound crazy)
 What about granting CollecTor the special role (which it has anyway) by
 providing an 'internal' API in metrics-lib?
 This would mainly consist in adding interfaces in sub-packages of
 o.tp.descriptor that allow access to implementational functionality, which
 are explicitly **not** part of the public (as in published javadoc-API)
 API.

 The 'internal' interfaces serve as the special contract between CollecTor
 and metrics-lib, i.e., CollecTor code should not access metrics-lib's
 implementation, but only the 'internal' and public API to avoid code
 duplication.  (For example, I could envision an interface
 o.tp.d.log.InternalLogDescriptor that enables setting the descriptor's
 byte array etc.  And, in case you worry, I would not add a 'sanitize()'
 method to such an interface.)

 This approach seems to have the benefits a new cross-concerns Metrics
 product could offer without the hassle of another code-base and what all
 is attached to it.
 I think, it also would finally pave the way for trimming down and
 streamlining all CollecTor modules.

 If that makes some sense, I can provide branches for the web-log related
 code for both metrics-lib and CollecTor.  So, we can see how the approach
 works in code.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-28 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_revision
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---
Changes (by karsten):

 * status:  needs_review => needs_revision


Comment:

 Hmm, no, as of today I'm not convinced that this is a good idea. It might
 be and I'm not seeing it yet, or how it would work for other sanitized
 descriptors. But I'd rather merge something simple in the next couple of
 days and not rush this somewhat major design change. I agree that there's
 no actual sanitizing code in metrics-lib. But except for this new
 interface, there's also no other interface in metrics-lib where one can
 plug in sanitizing code. Avoiding duplicate code is good, but keeping
 interfaces small and intuitive is good, too. Let's open a ticket for that
 and do it in a separate step when we have more time to think about it.

 Here's some more feedback on the `LogDescriptor` interface:
  - The documentation of `TYPE` says that the name should include a string.
 But who should make sure that this is the case? The application developer?
 Can you rephrase that to say what is expected here?
  - Some of the method descriptions are written in 3rd person ("Returns
 ..."), some in 2nd person ("Return ..."). I believe that 3rd person is
 preferred, though we're not doing that consistently in the rest of
 metrics-lib. But we should at least try to do it consistently in one
 interface.
  - "yyymmdd" -> "mmdd" in `getLogDate()`
  - I'm unclear what to expect as return value from `getLogType()`. What
 types exist? Do I get a class name, or what?
  - Maybe rename `getLogMillis()` to `getLogDateMillis()` to indicate that
 we're just returning the milliseconds of the date at 00:00:00 UTC, not the
 milliseconds of whatever time of the day the log was produced.
  - Please avoid abbreviations like "msec" and instead write "milliseconds
 since the epoch", for consistency.
  - The JavaDoc for `getRawDescriptorBytes()` should not copy the known
 compression types, but refer to `getCompressionType()` for the list. Right
 now, there's already an inconsistency between the list regarding `bz2`.
  - Are uncompressed logs supported? The documentation for
 `getRawDescriptorBytes()` doesn't indicate that, nor does
 `getCompressionType()` say what it would return in that case.
  - Shouldn't `getRawDescriptorBytes()` return the uncompressed bytes and a
 separate method `getCompressedBytes()` return the compressed bytes?
 Thinking of being consistent with other descriptors where
 `getRawDescriptorBytes()` returns uncompressed bytes, too. Not sure about
 this one.
  - "added in future" -> "added in the future"
  - We briefly discussed above to include the first, say, 100 unrecognized
 lines in `getUnrecognizedLines()`, but the documentation says it doesn't
 reply any. Why? Because it's not implemented yet?
  - As explained above, let's take out the `Sanitizer` subinterface and
 related methods.

 I'd like to wait for your response and a revised branch before doing
 another review of the remaining code. I'm not around most of Saturday but
 can take a look after that. Or on Monday, if you'd rather enjoy your
 weekend. :) Thanks!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-28 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Replying to [comment:10 karsten]:
 > I need more time for this review. But here's a first question:
 >
 > Should we really put the sanitizing code into metrics-lib rather than
 CollecTor? That's an important design decision and a change to what we
 have been doing in the past. Where would this code be used other than in
 CollecTor? So far, metrics-lib has primarily been the client-side library
 for applications using CollecTor data. But this change would turn it into
 a library that both the CollecTor server and its clients depend on. I'm
 yet undecided whether this is a good idea or not. In any case, we should
 discuss this first.

 The sanitizing code **is not** part of metrics-lib.  Thus, we agree here.
 In the proposed patch metrics-lib enables adding sanitizing code from the
 'outside' using method
 {{{
 public void setSanitizer(LogDescriptor.Sanitizer sani);
 }}}
 and `Sanitizer` is just a functional interface (i.e., having one method)
 that can be fulfilled by a lambda expression once we go to Java8, but
 that's an aside.

 The given `Sanitizer` is applied when `sanitize()` is called.  The
 resulting lines are sorted by metrics-lib.

 A choice I made is to have a default identity sanitizer in case none was
 set instead of raising an exception.

 With this approach metrics-lib is sanitizer-code-agnostic, but provides
 all else (compression, de-compression, etc.), which avoids duplicating
 code and enables us to implement performance and space saving code 'under
 the hood' once it is needed.  Hope this explains my reasoning.

 CollecTor depends on metrics-lib already as it uses `Descriptor`s of all
 kinds as well as parser and reader from metrics-lib.

 (I noticed that I missed adding a comment to
 `LogDescriptor.setSantizer()`.  I'll add a fixup commit, but that
 shouldn't hinder review.)

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-28 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by karsten):

 I need more time for this review. But here's a first question:

 Should we really put the sanitizing code into metrics-lib rather than
 CollecTor? That's an important design decision and a change to what we
 have been doing in the past. Where would this code be used other than in
 CollecTor? So far, metrics-lib has primarily been the client-side library
 for applications using CollecTor data. But this change would turn it into
 a library that both the CollecTor server and its clients depend on. I'm
 yet undecided whether this is a good idea or not. In any case, we should
 discuss this first.

 I'll continue my review later today and post my findings here.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-27 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 The new ticket for LogLine implementation is #23046.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-27 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---
Changes (by iwakeh):

 * milestone:   => metrics-lib 2.1.0


Comment:

 Prerequisite for CollecTor 1.3.0

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-26 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by iwakeh):

 * status:  new => needs_review


Comment:

 Please review [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/log/?h=task-22983 this branch].

 The [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/commit/?h=task-22983=20a9f82d06adbf960f1da8ff9853e50c5c1c5e25
 first commit] adds the new interfaces and their implementations.

 LogDescriptor contains all methods that will be hopefully applicable for
 all log-types possible.
 WebServerAccessLog is the specialization for access-logs.

 LogDescriptor also offers a sub-interface:
 {{{
/**
* Providing a single function for removing sensitive data from a
* given Apache Access Log log line.
*/
   public interface Sanitizer {

 /** Returns a cleaned log line, i.e., without possibly privacy
  * sensitive values. */
 public String clean(String line);
   }
 }}}

 and a method `sanitize()`.  The latter applies the cleaning procedure to
 all log lines and sorts the resulting lines.  The default sanitizer
 returns the line w/o any changes.  This setup keeps all descriptor
 parsing, compression, un-compression in metrics-lib; CollecTor is not
 forced to re-implement parsing functionality and only needs to provide the
 log cleaning procedure.  (A similar approach could be thought up for
 bridge-sanitation, too.)

 The [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/commit/?h=task-22983=d4ece5649573f315a8c63f43e490c3594f35affd
 second commit] makes `DescriptorParser` aware of the new types and avoids
 implementation javadoc comment generation for the new package.

 All of the code is covered by tests which are added in
 [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/commit/?h=task-22983=e07bca5e9429b2b93bb2cd3c0ef6911ad42ec32e
 this commit].  Total coverage even improved by one percent :-)

 The addition of another sub-interface `LogDescriptor.LogLine` (and the
 extensions to WebServerAccessLogLine) will be part of a new ticket, which
 will also provide unrecognized lines for access-logs.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-21 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Replying to [comment:5 iwakeh]:
 > Replying to [comment:4 karsten]:
 >
 > Thanks for the valuable input!
 >
 > > Regarding the name, let's try to find something more descriptive. How
 about `WebServerLog` or even `ApacheHttpServerAccessLog`? Otherwise
 there's the risk of confusion with descriptor types added in the future,
 like a log file written by BridgeDB containing client requests for bridge
 addresses.
 >
 > I see an interface hierarchy here:
 > LogDescriptor as parent for all logs (then we drop 'Descriptor' from the
 names) and have the first extending interface WebServerAccessLog.  Later
 we can add others *Log interfaces like BridgeDbClientLog etc.
 >
 > So for now, I focus on the access-log integration and keep future
 extensions in mind for the design.

 Makes sense.

 > > Regarding the suggested interface, I think there's a short term and a
 long term part here.
 > >
 > > In the long term I think that it would be at least twice as useful if
 we read the log contents and added methods to read these parsed contents.
 It's true that this causes some development hassle. But that's why we do
 it once in the library rather than rely on possibly more than one
 application to get it right. And we can still include the raw descriptor
 bytes by storing the compressed bytes and inflate them upon request.
 >
 > Yes, partially I have this in CollecTor anyway for sanitizing the logs.
 I'll add generally useful functionality to the metrics-lib code.
 > Should we have a new package for the implementations like
 `org.torproject.descriptor.logs`?  The log processing and content differs
 from usual descriptors quite a bit.

 As long as we keep all types that are relevant for applications in
 `org.torproject.descriptor`, I don't mind adding new subpackages.

 > > Some comments on the interface:
 > >  - Let's include a subtype `Request` or similar for each line
 contained in the log file, and let's include a method `getRequests()` that
 returns `Iterable`.
 >
 > There could be a parent interface LogLine that is extended by an
 appropriate interface for each log type, like a Request interface for
 access-logs.
 > I think about it and definitly keep the design open for the addition,
 but would put it on lesser priority right now.

 Long term sounds fine.

 > >  - Due to the fact that we cannot include a `@type` annotation with a
 version number, `Request` should ideally include getters for all fields
 contained in Apache's Combined Log Format.
 > >  - Ideally, `getLogDate()` would return the date in milliseconds since
 the epoch to be conformant to the rest of metrics-lib, in which case it
 would probably be called `getLogMillis()`.
 >
 > Fine, but we only have the date no time here.  Thus, msec signals a
 precision we don't offer.
 > I don't feel strongly about that.

 Me neither, I just think that it's easier to handle timestamps from
 different data sources if they all use the same format.

 > >  - I'm unclear what `getCompressionType()` returns. I think I'd expect
 a `String` that is either `"gz"` or `"gz"`, but not a `byte[]`. Was that
 intended?
 >
 > Correct, this should read `String getCompressionType()`, just a typo.
 Actually, it might turn into an enum.
 >
 > >  - If we read and parse logs, we'll have to change
 `getUnrecognizedLines()` to return any unrecognized lines.
 >
 > Yes, maybe with an upper limit in case a log got mangled?

 Good idea. First 100?

 > > In the short term I can see how we might want to put the `Request`
 part on hold and only return metadata and uncompressed raw descriptor
 contents in this new descriptor type.
 >
 > Fine, as replied above.
 >
 > Do you have a rough estimate of the future log file sizes metrics-lib
 will have to deal with?

 No idea. I think some of the Apache logs are pretty large in uncompressed
 form. But other descriptors have grown a lot over time, too, like votes.
 And when we recently pondered appending all votes collected in a singly
 CollecTor sync run, our original expectation of the size turned out to be
 pretty useless. So, 20 times the size?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-21 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by iwakeh):

 Replying to [comment:4 karsten]:

 Thanks for the valuable input!

 > Regarding the name, let's try to find something more descriptive. How
 about `WebServerLog` or even `ApacheHttpServerAccessLog`? Otherwise
 there's the risk of confusion with descriptor types added in the future,
 like a log file written by BridgeDB containing client requests for bridge
 addresses.

 I see an interface hierarchy here:
 LogDescriptor as parent for all logs (then we drop 'Descriptor' from the
 names) and have the first extending interface WebServerAccessLog.  Later
 we can add others *Log interfaces like BridgeDbClientLog etc.

 So for now, I focus on the access-log integration and keep future
 extensions in mind for the design.

 >
 > Regarding the suggested interface, I think there's a short term and a
 long term part here.
 >
 > In the long term I think that it would be at least twice as useful if we
 read the log contents and added methods to read these parsed contents.
 It's true that this causes some development hassle. But that's why we do
 it once in the library rather than rely on possibly more than one
 application to get it right. And we can still include the raw descriptor
 bytes by storing the compressed bytes and inflate them upon request.

 Yes, partially I have this anyway for sanitizing the logs.  I'll add
 generally useful functionality to the metrics-lib code.
 Should we have a new package for the implementations like
 `org.torproject.logs`?  The log processing and content differs from usual
 descriptors quite a bit.

 >
 > Some comments on the interface:
 >  - Let's include a subtype `Request` or similar for each line contained
 in the log file, and let's include a method `getRequests()` that returns
 `Iterable`.

 There could be a parent interface LogLine that is extended by an
 appropriate interface for each log type, like a Request interface for
 access-logs.
 I think about it and definitly keep the design open for the addition, but
 would put it on lesser priority right now.

 >  - Due to the fact that we cannot include a `@type` annotation with a
 version number, `Request` should ideally include getters for all fields
 contained in Apache's Combined Log Format.
 >  - Ideally, `getLogDate()` would return the date in milliseconds since
 the epoch to be conformant to the rest of metrics-lib, in which case it
 would probably be called `getLogMillis()`.

 Fine, but we only have the date no time here.  Thus, msec signals a
 precision we don't offer.
 I don't feel strongly about that.

 >  - I'm unclear what `getCompressionType()` returns. I think I'd expect a
 `String` that is either `"gz"` or `"gz"`, but not a `byte[]`. Was that
 intended?

 Correct, this should read `String getCompressionType()`, just a typo.
 Actually, it might turn into an enum.

 >  - If we read and parse logs, we'll have to change
 `getUnrecognizedLines()` to return any unrecognized lines.

 Yes, maybe with an upper limit in case a log got mangled?

 >
 > In the short term I can see how we might want to put the `Request` part
 on hold and only return metadata and uncompressed raw descriptor contents
 in this new descriptor type.

 Fine, as replied above.

 Do you have a rough estimate of the future log file sizes metrics-lib will
 have to deal with?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-20 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Regarding the name, let's try to find something more descriptive. How
 about `WebServerLog` or even `ApacheHttpServerAccessLog`? Otherwise
 there's the risk of confusion with descriptor types added in the future,
 like a log file written by BridgeDB containing client requests for bridge
 addresses.

 Regarding the suggested interface, I think there's a short term and a long
 term part here.

 In the long term I think that it would be at least twice as useful if we
 read the log contents and added methods to read these parsed contents.
 It's true that this causes some development hassle. But that's why we do
 it once in the library rather than rely on possibly more than one
 application to get it right. And we can still include the raw descriptor
 bytes by storing the compressed bytes and inflate them upon request.

 Some comments on the interface:
  - Let's include a subtype `Request` or similar for each line contained in
 the log file, and let's include a method `getRequests()` that returns
 `Iterable`.
  - Due to the fact that we cannot include a `@type` annotation with a
 version number, `Request` should ideally include getters for all fields
 contained in Apache's Combined Log Format.
  - Ideally, `getLogDate()` would return the date in milliseconds since the
 epoch to be conformant to the rest of metrics-lib, in which case it would
 probably be called `getLogMillis()`.
  - I'm unclear what `getCompressionType()` returns. I think I'd expect a
 `String` that is either `"gz"` or `"gz"`, but not a `byte[]`. Was that
 intended?
  - If we read and parse logs, we'll have to change
 `getUnrecognizedLines()` to return any unrecognized lines.

 In the short term I can see how we might want to put the `Request` part on
 hold and only return metadata and uncompressed raw descriptor contents in
 this new descriptor type.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-20 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by iwakeh):

 To visualize comment:2, please see the following interface:

 {{{
 #!java
 /* Copyright 2017 The Tor Project
  * See LICENSE for licensing information */

 package org.torproject.descriptor;

 import java.util.List;

 /**
  * Interface for log-file descriptors.
  *
  * @since 2.1.0
  */
 public interface LogDescriptor extends Descriptor {

   /**
* Returns the hostname of the machine on which the logs were gathered.
*
* @since 2.1.0
*/
   public String getMeasuringHost();

   /**
* Returns the name of the host served and for which the logs were
 gathered.
*
* @since 2.1.0
*/
   public String getServedHost();

   /**
* Returns the date of the log in yyymmdd format.
*
* @since 2.1.0
*/
   public String getLogDate();

   /**
* Return the raw descriptor bytes of the compressed sanitized logs.
*
* The compression type is returned by {@link #getCompressionType()}
* and is one of 'gz' or 'xz'.  Additional types might be added in
* future.
*
* @since 2.1.0
*/
   @Override
   public byte[] getRawDescriptorBytes();

   /**
* Returns the compression type as one of 'gz' or 'xz'.
*
* Additional types might be added in future.
*
* @since 2.1.0
*/
   public byte[] getCompressionType();

   /**
* Will always return an empty List as logs don't have any descriptors.
*
* @since 2.1.0
*/
   @Override
   public List getAnnotations();

   /**
* Will always return an empty List as logs don't have any descriptors.
*
* @since 2.1.0
*/
   @Override
   public List getUnrecognizedLines();
 }

 }}}

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-20 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by iwakeh):

 Well, metrics-lib doesn't need to 'read' the contents of the logs.  Thus,
 the 'raw bytes' could be from the compressed content for now.  Using the
 compressed content might be the most efficient in terms of runtime memory
 and development hassle.

 A list of interface methods without the methods from `Descriptor`:

 * getMeasuringHost (taken from path, i.e. the immediate parent folder
 name)
 * getServedHost (first part up to a dash of the filename)
 * getLogDate  (filename's last part before extension)
 * getCompressionType (the compression type of the raw bytes, taken from
 the file extension)

 Maybe also for convenience
 * getUncompressedData (with a comment warning about the possibly huge
 deflate size)

 What is missing?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-20 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Random thought: these logs can be really huge and also quite repetitive.
 Maybe we can try to save a bit of memory by trying out a few things when
 parsing them. Ideas:
  1. Store indexes into the raw descriptor string rather than copies of
 strings.
  2. Alternatively, only store fields that have changed compared to the
 previous line. If a field has not changed, store a reference to the first
 line that contains the same field contents. This makes use of the fact
 that lines are sorted.
  3. Maybe don't even store the raw descriptor string and instead throw an
 exception when it gets called. Only do this if 2 requires a lot less
 memory than 1 above.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

[tor-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

2017-07-19 Thread Tor Bug Tracker & Wiki
#22983: add a descriptor interface and implementation for web-logs
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/metrics-lib  |Version:
 Severity:  Normal   |   Keywords:
Actual Points:   |  Parent ID:
   Points:   |   Reviewer:
  Sponsor:   |
-+--
 The webstats-log-files are the only files available on CollecTor (in
 future) that are not yet covered by metrics-lib.

 Should there be a 'LogDescriptor' interface and implementation in metrics-
 lib? Are there any reasons why not?

 The name `LogDescriptor` is just a working name; better naming suggestions
 welcome.
 The interface should extend `Descriptor` and have additional methods for
 retrieving the measuring host and the served host, a method for retrieving
 the date of the log.
 What else?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs