Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-10 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-------------------------------------+------------------------------
 Reporter:  iwakeh                   |          Owner:  metrics-team
     Type:  enhancement              |         Status:  needs_review
 Priority:  Medium                   |      Milestone:
Component:  Metrics/Metrics website  |        Version:
 Severity:  Normal                   |     Resolution:
 Keywords:                           |  Actual Points:
Parent ID:                           |         Points:
 Reviewer:                           |        Sponsor:
-------------------------------------+------------------------------

Comment (by karsten):

 Remaining next steps are:
  - Code the decisions (iwakeh)
  - Try out the code on actual logs (iwakeh; karsten can make more logs
 available)
  - Send draft to tor-dev@ and ask for feedback (karsten)

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-10 Thread Tor Bug Tracker & Wiki

Comment (by karsten):

 Please review [https://gitweb.torproject.org/karsten/metrics-
 web.git/log/?h=task-23243 my task-23243 branch] and
 [https://trac.torproject.org/projects/tor/attachment/ticket/23243/web-
 server-logs.pdf this PDF print-out] of the compiled web page.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-10 Thread Tor Bug Tracker & Wiki
Changes (by karsten):

 * Attachment "web-server-logs.pdf" added.



Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-10 Thread Tor Bug Tracker & Wiki

Comment (by karsten):

 (incorporating the last feedback above -- just wanted to say that there
 are no open questions anymore)


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-10 Thread Tor Bug Tracker & Wiki

Comment (by karsten):

 Alright, I'll turn the last draft we have into XML then.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-07 Thread Tor Bug Tracker & Wiki

Comment (by iwakeh):

 Replying to [comment:22 karsten]:
 > ...
 > > > > Similarly, only lines for 400 and 404 response codes are
 discarded.  Any bogus response code will be kept.
 > > >
 > > > Right, but I think 400 and 404 were the most problematic ones when
 we came up with the original plan for sanitizing these logs. Do you have
 any others in mind that we'd have to sanitize? If not, I'd say let's stick
 with 400 and 404 for now.
 > >
 > > No, the other way around.  As said above invalid (=bogus) response
 codes will be kept.  A better way to phrase this: we don't check, if a
 response code is a valid one and only sift out the valid 400 & 404.
 >
 > I'm not sure where this is going. Do you think we should check if status
 code is one that is currently defined by the HTTP protocol and only accept
 the ones that are, except for 400 and 404?

 That's the question.  Intuitively, I would not check for validity.  After
 all, the server sets the status code, so keeping it should not jeopardize
 privacy, assuming that faulty codes result from errors rather than from
 attacks.
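
 The status-code rule discussed here could be sketched roughly as follows.
 This is only an illustration in Python, not the code of the actual
 sanitizer; the names `DISCARDED_STATUS_CODES` and `keep_status` are
 hypothetical.  It assumes the status field has already been extracted
 from the log line as a string.

```python
# Hypothetical sketch: only 400 and 404 are sifted out.  No check is made
# whether the status code is one that HTTP currently defines.
DISCARDED_STATUS_CODES = {"400", "404"}


def keep_status(status: str) -> bool:
    """Return True unless the status field is exactly 400 or 404."""
    return status not in DISCARDED_STATUS_CODES
```

 With this rule an unexpected value such as "999" is kept, which matches
 the behavior described above: validity is not checked.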

 >
 > > > So, should I move forward with turning draft number five, numbered
 4, into XML?
 > >
 > > Yes :-)
 >
 > Will do as soon as the two open questions are resolved.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-07 Thread Tor Bug Tracker & Wiki

Comment (by iwakeh):

 Replying to [comment:21 karsten]:
 > Replying to [comment:19 iwakeh]:
 > > Replying to [comment:17 karsten]:
 > > > Replying to [comment:14 iwakeh]:
 > > > > Another addition:
 > > > >
 > > > > Even though Tor's Apache webservers are configured to only provide
 three ip addresses (e.g. `0.0.0.{0,1,2}`) all lines with different ips are
 accepted and sanitized to ip `0.0.0.0`.
 > > > >
 > > > > Or, should such lines be discarded?
 > > >
 > > > Right now addresses are kept as long as they start with `0.0.0.`,
 which seems plausible to me. The spec draft should also say that.
 > >
 > > Agreed.  My question was a different one: what about log lines that
 contain other ips (e.g. in case Apache suddenly logs more 11.22.33.44).
 Currently these would be replaced by 0.0.0.0 and the sanitized lines kept.
 >
 > Ah, hmm. I think that both the current script and the specification
 draft say that we ''drop'' any lines not starting with `0.0.0.`, but if a
 line matches that, we keep the `0.0.0.x` address unchanged.
 >
 > But I wonder if we should change that to "keep any address starting with
 `0.0.0.` and replace everything else with `0.0.0.0`". That way we could
 easily sanitize logs from web servers using different log formats that are
 compliant with Apache's Common Log Format. What do you think?

 Yes, that's what I would choose.
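
 A minimal sketch of the rule agreed on here ("keep any address starting
 with `0.0.0.` and replace everything else with `0.0.0.0`"), written in
 Python for illustration only.  The function name `sanitize_address` is
 hypothetical, and the sketch assumes that the last octet must be a number
 between 0 and 255, as the spec draft says.

```python
import re

# Matches "0.0.0." followed by one to three ASCII digits.
_PRIVACY_ADDRESS = re.compile(r"0\.0\.0\.([0-9]{1,3})")


def sanitize_address(address: str) -> str:
    """Keep 0.0.0.x addresses unchanged, replace anything else."""
    match = _PRIVACY_ADDRESS.fullmatch(address)
    if match and int(match.group(1)) <= 255:
        return address
    return "0.0.0.0"
```

 Under this rule a line logged with 11.22.33.44 is not dropped; only its
 address field is rewritten.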


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-07 Thread Tor Bug Tracker & Wiki

Comment (by karsten):

 Replying to [comment:20 iwakeh]:
 > Replying to [comment:18 karsten]:
 > > Replying to [comment:15 iwakeh]:
 > > > Similarly, only lines for 400 and 404 response codes are discarded.
 Any bogus response code will be kept.
 > >
 > > Right, but I think 400 and 404 were the most problematic ones when we
 came up with the original plan for sanitizing these logs. Do you have any
 others in mind that we'd have to sanitize? If not, I'd say let's stick
 with 400 and 404 for now.
 >
 > No, the other way around.  As said above invalid (=bogus) response codes
 will be kept.  A better way to phrase this: we don't check, if a response
 code is a valid one and only sift out the valid 400 & 404.

 I'm not sure where this is going. Do you think we should check if status
 code is one that is currently defined by the HTTP protocol and only accept
 the ones that are, except for 400 and 404?

 > > So, should I move forward with turning draft number five, numbered 4,
 into XML?
 >
 > Yes :-)

 Will do as soon as the two open questions are resolved.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-07 Thread Tor Bug Tracker & Wiki

Comment (by karsten):

 Replying to [comment:19 iwakeh]:
 > Replying to [comment:17 karsten]:
 > > Replying to [comment:14 iwakeh]:
 > > > Another addition:
 > > >
 > > > Even though Tor's Apache webservers are configured to only provide
 three ip addresses (e.g. `0.0.0.{0,1,2}`) all lines with different ips are
 accepted and sanitized to ip `0.0.0.0`.
 > > >
 > > > Or, should such lines be discarded?
 > >
 > > Right now addresses are kept as long as they start with `0.0.0.`,
 which seems plausible to me. The spec draft should also say that.
 >
 > Agreed.  My question was a different one: what about log lines that
 contain other ips (e.g. in case Apache suddenly logs more 11.22.33.44).
 Currently these would be replaced by 0.0.0.0 and the sanitized lines kept.

 Ah, hmm. I think that both the current script and the specification draft
 say that we ''drop'' any lines not starting with `0.0.0.`, but if a line
 matches that, we keep the `0.0.0.x` address unchanged.

 But I wonder if we should change that to "keep any address starting with
 `0.0.0.` and replace everything else with `0.0.0.0`". That way we could
 easily sanitize logs from web servers using different log formats that are
 compliant with Apache's Common Log Format. What do you think?


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-07 Thread Tor Bug Tracker & Wiki

Comment (by iwakeh):

 Replying to [comment:18 karsten]:
 > Replying to [comment:15 iwakeh]:
 > > Similarly, only lines for 400 and 404 response codes are discarded.
 Any bogus response code will be kept.
 >
 > Right, but I think 400 and 404 were the most problematic ones when we
 came up with the original plan for sanitizing these logs. Do you have any
 others in mind that we'd have to sanitize? If not, I'd say let's stick
 with 400 and 404 for now.

 No, the other way around.  As said above, invalid (= bogus) response codes
 will be kept.  A better way to phrase this: we don't check whether a
 response code is valid and only sift out the valid codes 400 and 404.

 >
 > So, should I move forward with turning draft number five, numbered 4,
 into XML?

 Yes :-)


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-07 Thread Tor Bug Tracker & Wiki

Comment (by iwakeh):

 Replying to [comment:17 karsten]:
 > Replying to [comment:14 iwakeh]:
 > > Another addition:
 > >
 > > Even though Tor's Apache webservers are configured to only provide
 three ip addresses (e.g. `0.0.0.{0,1,2}`) all lines with different ips are
 accepted and sanitized to ip `0.0.0.0`.
 > >
 > > Or, should such lines be discarded?
 >
 > Right now addresses are kept as long as they start with `0.0.0.`, which
 seems plausible to me. The spec draft should also say that.

 Agreed.  My question was a different one: what about log lines that
 contain other IP addresses (e.g. in case Apache suddenly logs something
 like 11.22.33.44)?  Currently these would be replaced by 0.0.0.0 and the
 sanitized lines kept.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-07 Thread Tor Bug Tracker & Wiki

Comment (by karsten):

 Replying to [comment:15 iwakeh]:
 > Similarly, only lines for 400 and 404 response codes are discarded.  Any
 bogus response code will be kept.

 Right, but I think 400 and 404 were the most problematic ones when we came
 up with the original plan for sanitizing these logs. Do you have any
 others in mind that we'd have to sanitize? If not, I'd say let's stick
 with 400 and 404 for now.

 So, should I move forward with turning draft number five, numbered 4, into
 XML?


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-07 Thread Tor Bug Tracker & Wiki

Comment (by karsten):

 Replying to [comment:14 iwakeh]:
 > Another addition:
 >
 > Even though Tor's Apache webservers are configured to only provide three
 ip addresses (e.g. `0.0.0.{0,1,2}`) all lines with different ips are
 accepted and sanitized to ip `0.0.0.0`.
 >
 > Or, should such lines be discarded?

 Right now addresses are kept as long as they start with `0.0.0.`, which
 seems plausible to me. The spec draft should also say that.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-07 Thread Tor Bug Tracker & Wiki

Comment (by karsten):

 Replying to [comment:13 iwakeh]:
 > The actual date (or system date) is only of concern for publishing the
 logs.  All other dates refer to the date the (original) log is finalized.
 I introduced the term 'reference date' for this.
 > The diff: [...]
 >
 > And, I don't see the necessity for stating that the files won't be
 changed in future.  This doesn't seem part of a spec here.  Anyway, we
 might want to re-sanitize these files, if suddenly there is a privacy
 issue with fields that seem benign now (as with bridge descriptors, for
 example).

 Agreed, those changes look good to me.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-07 Thread Tor Bug Tracker & Wiki

Comment (by iwakeh):

 Similarly, only lines for 400 and 404 response codes are discarded.  Any
 bogus response code will be kept.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-06 Thread Tor Bug Tracker & Wiki

Comment (by iwakeh):

 Another addition:

 Even though Tor's Apache webservers are configured to only provide three
 IP addresses (e.g. `0.0.0.{0,1,2}`), all lines with different IP addresses
 are accepted and sanitized to IP address `0.0.0.0`.

 Or, should such lines be discarded?


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-05 Thread Tor Bug Tracker & Wiki
Changes (by iwakeh):

 * Attachment "web-spec-4.zip" added.

 draft five, numbered 4 (trac keeps rejecting unzipped uploads as spam :-/)


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-05 Thread Tor Bug Tracker & Wiki

Comment (by iwakeh):

 The actual date (or system date) is only of concern for publishing the
 logs.  All other dates refer to the date the (original) log is finalized.
 I introduced the term 'reference date' for this.
 The diff:
 {{{
 --- webstats-spec.3.txt
 +++ webstats-spec.4.txt
 @@ -33,7 +33,7 @@

  Tor's webservers are configured to rotate logs at least once per day,
 which does not necessarily happen at 00:00:00 UTC. As a result, log files
 may contain requests from up to two UTC days and several log files may
 contain requests that have been started on the same UTC day.

 -All access log files written by Tor's webservers follow the naming
 convention .torproject.org-access.log-MMDD.
 +All access log files written by Tor's webservers follow the naming
 convention .torproject.org-access.log-MMDD, where 'MMDD'
 is the date of the rotation and finalization of the log file.  This date
 will be referred to as 'reference date' in the following sections.

  # Sanitizing steps

 @@ -41,16 +41,16 @@

  ## Discarding non-matching files

 -As first safeguard against publishing log files that are too sensitive,
 we discard all files not matching the naming convention for access logs.
 This is to prevent, for example, error logs from slipping through.
 +As first safeguard against publishing log files that are too sensitive,
 we discard all files not matching the naming convention for access logs.
 This is to prevent, for example, error logs from slipping through.  In
 addition, the log file's name is supposed to contain the reference date,
 which is used to determine the validity of log lines.  If the log file's
 name doesn't end in a date string of the format 'MMDD' the entire file
 is discarded.

  ## Discarding non-matching lines

 -Log files are expected to contain exactly 1 request per line. We process
 these files line by line and discard any lines not matching the following
 criteria:
 +Log files are expected to contain exactly 1 request per line.  We process
 these files line by line and discard any lines not matching the following
 criteria:

   - Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s
 %b") or a compatible format like one of Tor's privacy formats. It is
 acceptable if lines start with a format that is compatible to the Common
 Log Format and continue with additional fields. Those additional fields
 will later be discarded, but the line will not be discarded because of
 them.
   - The request IP address starts with "0.0.0.", followed by any number
 between 0 and 255.
 - - The time the request was received does not lie in the future.
 - - The date the request was received, after converting the request time
 to UTC, does not lie more than 1 day in the past. (Bulk imports of
 archived logs are exempt from this requirement.)
 + - The time the request was received does not lie in the future of the
 reference date.
 + - The date the request was received, after converting the request time
 to UTC, does not lie more than 1 day in the past of the reference date.
   - The request protocol is HTTP.
   - The request method is either GET or HEAD.
   - The final status of the request is neither 400 ("Bad Request") nor 404
 ("Not Found").
 @@ -83,7 +83,7 @@

  //MM/--access.log-
 MMDD[.xz]

 -Due to the fact that the date when a log file was rotated and the start
 date of contained requests may not always overlap, we need to delay
 publishing sanitized log files until the start date of requests in UTC
 plus 2 days. After this delay, all log files containing requests from that
 date are assumed to be processed. Sanitized log files are published and
 not further modified in the future. (Again, bulk imports of archived logs
 are exempt from this.)
 +Due to the fact that the date when a log file was rotated and the start
 date of contained requests may not always overlap, we need to delay
 publishing sanitized log files until the start date of requests in UTC
 plus 2 days. After this delay, all log files containing requests from that
 date are assumed to be processed.

  As last and certainly not least important sanitizing step, all rewritten
 log lines are sorted alphabetically, so that request order cannot be
 inferred from sanitized log files.
 }}}
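
 Some of the line criteria listed in the diff above can be sketched as
 follows.  This is a hedged Python illustration, not the sanitizer's code:
 the names `CLF_PREFIX`, `matches_criteria` and `sanitize_lines` are
 hypothetical, the check for "protocol is HTTP" is assumed to mean that the
 protocol field starts with "HTTP/", and the IP address and date criteria
 are left out for brevity.

```python
import re

# Common Log Format prefix: %h %l %u %t "%r" %>s %b
# Anything following the prefix (additional fields) is tolerated.
CLF_PREFIX = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>[^"\s]+)" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-)'
)


def matches_criteria(line: str) -> bool:
    """Check format, protocol, method and final status of one line."""
    match = CLF_PREFIX.match(line)
    if match is None:
        return False
    if not match.group("protocol").startswith("HTTP/"):  # assumption
        return False
    if match.group("method") not in ("GET", "HEAD"):
        return False
    return match.group("status") not in ("400", "404")


def sanitize_lines(lines):
    """Drop non-matching lines and sort the rest.

    Sorting uses Python's default string order as a stand-in for
    "sorted alphabetically", so request order cannot be inferred.
    """
    return sorted(line for line in lines if matches_criteria(line))
```

 Rewriting of the kept lines (for example discarding additional fields) is
 not shown here.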

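 And, I don't see the necessity for stating that the files won't be
 changed in future.  This doesn't seem part of a spec here.  Anyway, we
 might want to re-sanitize these files, if suddenly there is a privacy
 issue with fields that seem benign now (as with bridge descriptors, for
 example).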

Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-05 Thread Tor Bug Tracker & Wiki

Comment (by karsten):

 Okay, I tried to specify that, but please
 [https://trac.torproject.org/projects/tor/attachment/ticket/23243
 /webstats-spec.3.txt review carefully]. The part that made this a bit more
 complex was that there are actually 2 places where we need to look at
 dates/times: 1) when deciding about discarding lines that are too old or
 too new and 2) when deciding when to publish a sanitized file and never
 ever touch it again. Maybe I overcomplicated this, so if you see a way to
 simplify what I wrote, please say so!
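
 The second of the two places, deciding when to publish, can be sketched
 like this.  It is an illustrative Python fragment only, with hypothetical
 names (`PUBLISH_DELAY`, `earliest_publish_date`, `may_publish`), and it
 assumes that all dates are UTC dates and that the delay is the "start
 date of requests in UTC plus 2 days" from the draft.

```python
from datetime import date, timedelta

# Delay taken from the draft: start date of requests in UTC plus 2 days.
PUBLISH_DELAY = timedelta(days=2)


def earliest_publish_date(request_date: date) -> date:
    """Earliest UTC date on which the sanitized file may be published."""
    return request_date + PUBLISH_DELAY


def may_publish(request_date: date, today: date) -> bool:
    """True once the publishing delay for request_date has passed."""
    return today >= earliest_publish_date(request_date)
```

 For example, a sanitized file for 2017-09-01 would become publishable on
 2017-09-03 under this reading.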

 Here's the diff, if that helps reviewing:

 {{{
 diff --git a/webstats-spec.txt b/webstats-spec.txt
 index 7e46449..48c0287 100644
 --- a/webstats-spec.txt
 +++ b/webstats-spec.txt
 @@ -3,7 +3,6 @@ Tor webserver logs

  Next steps:
   - Replace webserver with web server which seems to be Less Bad English
 (karsten).
 - - Find out what exact delay we'll need for publishing sanitized logs
 (iwakeh?)
   - Turn this document into XML (karsten)
   - Code the decisions (iwakeh)
   - Try out the code on actual logs (iwakeh; karsten can make more logs
 available)
 @@ -30,6 +29,8 @@ LogFormat "0.0.0.2 - %u %{[%d/%b/%Y:00:00:00 %z]}t
 \"%r\" %>s %b \"%{Referer}i\"

  The main difference to Apache's Common Log Format is that request IP
 addresses are removed and the field is instead used to encode whether the
 request came in via http:// (0.0.0.0), via https:// (0.0.0.1), or via the
 site's onion service (0.0.0.2).

 +Tor's webservers are configured to use UTC as timezone, which is also
 highly recommended when rewriting request times to "00:00:00" in order for
 the subsequent sanitizing steps to work correctly. Alternatively, if the
 system timezone is not set to UTC, webservers should keep request times
 unchanged and let them be handled by the subsequent sanitizing steps.
 +
  Tor's webservers are configured to rotate logs at least once per day,
 which does not necessarily happen at 00:00:00 UTC. As a result, log files
 may contain requests from up to two UTC days and several log files may
 contain requests that have been started on the same UTC day.

  All access log files written by Tor's webservers follow the naming
 convention .torproject.org-access.log-MMDD.
 @@ -48,6 +49,8 @@ Log files are expected to contain exactly 1 request per
 line. We process these f

   - Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s
 %b") or a compatible format like one of Tor's privacy formats. It is
 acceptable if lines start with a format that is compatible to the Common
 Log Format and continue with additional fields. Those additional fields
 will later be discarded, but the line will not be discarded because of
 them.
   - The request IP address starts with "0.0.0.", followed by any number
 between 0 and 255.
 + - The time the request was received does not lie in the future.
 + - The date the request was received, after converting the request time
 to UTC, does not lie more than 1 day in the past. (Bulk imports of
 archived logs are exempt from this requirement.)
   - The request protocol is HTTP.
   - The request method is either GET or HEAD.
   - The final status of the request is neither 400 ("Bad Request") nor 404
 ("Not Found").
 @@ -80,9 +83,7 @@ Sanitized log files may additionally be sorted into
 directories by virtual host

  //MM/--access.log-
 MMDD[.xz]

 -Due to the fact that the date when a log file was rotated and the start
 date of contained requests may not always overlap, we need to delay
 publishing sanitized log files until all log files containing requests
 from that date are guaranteed to be processed. After this delay, the
 sanitized log files are published and not further modified.
 -
 -XXX What's the delay? End of UTC day + 24 hours? Check current script!
 +Due to the fact that the date when a log file was rotated and the start
 date of contained requests may not always overlap, we need to delay
 publishing sanitized log files until the start date of requests in UTC
 plus 2 days. After this delay, all log files containing requests from that
 date are assumed to be processed. Sanitized log files are published and
 not further modified in the future. (Again, bulk imports of archived logs
 are exempt from this.)

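  As last and certainly not least important sanitizing step, all rewritten
 log lines are sorted alphabetically, so that request order cannot be
 inferred from sanitized log files.
 }}}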

Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-05 Thread Tor Bug Tracker & Wiki
Changes (by karsten):

 * Attachment "webstats-spec.3.txt" added.

 Fourth draft


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-05 Thread Tor Bug Tracker & Wiki

Comment (by iwakeh):

 For example, in a log file with date 20170903 (=: x) lines are processed
 according to their dates:
 * line a with date 20170903 (x) is written to the sanitized log file with
 date 20170903
 * line b with date 20170902 (x - 1) is written to the sanitized log file
 with date 20170902
 * line c with date 20170904 (x + y, where y > 0) is dropped and the date
 is logged on debug level
 * line d with date 20170901 (x - 1 - y, where y > 0) is dropped and the
 date is logged on debug level

 And, system date 20170903 (x) is the earliest publishing date for a
 sanitized log file with date 20170901 (x - 2).
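
 A minimal Python sketch of the rules above, for illustration only. The
 function names `target_date` and `earliest_publish_date` are hypothetical
 and are not taken from the existing script; dates are compared as plain
 calendar dates, which assumes all timestamps are in UTC.

```python
from datetime import date, timedelta
from typing import Optional


def target_date(file_date: date, line_date: date) -> Optional[date]:
    """Return the date of the sanitized log file a line is written to.

    Returns None if the line is dropped (its date would then be logged
    on debug level by the caller).
    """
    if line_date == file_date:
        return file_date  # line a: same date as the log file
    if line_date == file_date - timedelta(days=1):
        return line_date  # line b: one day earlier, goes to the earlier file
    return None  # lines c and d: later, or more than one day earlier


def earliest_publish_date(log_date: date) -> date:
    """Earliest system date on which a sanitized log file may be published."""
    return log_date + timedelta(days=2)


x = date(2017, 9, 3)
for line_date in (x, x - timedelta(days=1), x + timedelta(days=1),
                  x - timedelta(days=2)):
    print(line_date, "->", target_date(x, line_date))
```

 With x = 20170903 this maps lines a and b to 20170903 and 20170902 and
 drops lines c and d, and `earliest_publish_date(date(2017, 9, 1))` yields
 20170903.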


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-05 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Do you mind giving an example with exact timestamps?


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-05 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by iwakeh):

 Answer to the open question
 'How does the script handle dates differing from the log file's date?' (cf.
 [https://gitweb.torproject.org/webstats.git/tree/src/sanitize.py#n56
 python implementation])
 Currently, log lines that have a date after the log-file's date or more
 than a day before the log-file's date are dropped and the dates are logged
 by the script.  Lines from a day earlier are written to the earlier log
 file.

 Thus, sanitized logs can be published once they are two days old.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-05 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 [https://trac.torproject.org/projects/tor/attachment/ticket/23243
 /webstats-spec.2.txt Here's another draft] based on today's discussion.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-05 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by karsten):

 * Attachment "webstats-spec.2.txt" added.

 Third draft based on karsten's first draft and a discussion with iwakeh


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-01 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 So, I started reviewing your draft and made a few tweaks here and there.
 But then I moved around things to better reflect the order of sanitizing
 steps and to better reason about why we're doing these steps. In the end I
 decided to start over and write a
 [https://trac.torproject.org/projects/tor/attachment/ticket/23243
 /webstats-spec.txt new draft] mostly based on my memory, plus a few other
 sources except for your draft. The idea was to start from scratch. It's
 not supposed to replace your draft entirely, because your draft describes
 some parts better than mine. Maybe we can somehow combine our drafts.
 Though we'll first have to discuss a few open questions (marked with XXX
 in my draft). Maybe something for a pad meeting early next week?


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-09-01 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by karsten):

 * Attachment "webstats-spec.txt" added.

 karsten's first draft


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-08-23 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_review
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by iwakeh):

 * status:  needs_information => needs_review


Comment:

 The status change somehow didn't 'arrive' at trac yesterday.  Here it is.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-08-22 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_information
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---
Changes (by iwakeh):

 * Attachment "weblog-spec.html.xz" added.



Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-08-22 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_information
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Please review a first draft in [https://gitweb.torproject.org/user/iwakeh
 /metrics-web.git/log/?h=task-23243 this branch].
 I also attach a simple html version here.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-08-22 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_information
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by karsten):

 Replying to [comment:3 iwakeh]:
 > Some questions before I begin to distill a small description:
 >
 > It seems the sanitized clean log lines should adhere to the Common Log
 > Format:
 > {{{
 > LogFormat "%h %l %u %t \"%r\" %>s %b"
 > }}}
 >
 > Thus, the sanitized log format will be changed from currently:
 > `0.0.0.0 - - [10/Mar/2017:00:00:00 +0000] "GET / HTTP/1.0" 200 3018 "-" "-" -`
 > to
 > `0.0.0.0 - - [10/Mar/2017:00:00:00 +0000] "GET / HTTP/1.0" 200 3018`.
 > Can we agree on this?

 Yes.

 > When parsing, all lines whose beginning doesn't fit the Common Log
 > Format are considered unparseable.  Lines whose beginning matches CLF
 > are considered parseable, and all trailing content is ignored.

 Yes.

 > POST requests are to be dropped?

 Yes.
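
 The three decisions above could be sketched in Python roughly as follows.
 This is only an illustration under assumptions: the regular expression and
 the name `sanitize_line` are hypothetical, request strings containing
 escaped quotes are not handled, and the sample line assumes a `+0000`
 offset.

```python
import re
from typing import Optional

# Prefix matching the Common Log Format: %h %l %u %t "%r" %>s %b
CLF_PREFIX = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)(?=\s|$)'
)


def sanitize_line(line: str) -> Optional[str]:
    """Return the CLF prefix of a line, or None if the line is discarded."""
    match = CLF_PREFIX.match(line)
    if match is None:
        return None  # beginning does not fit CLF: unparseable
    method = match.group("request").split(" ", 1)[0]
    if method == "POST":
        return None  # POST requests are dropped
    return match.group(0)  # trailing content after the CLF prefix is ignored


sample = ('0.0.0.0 - - [10/Mar/2017:00:00:00 +0000] '
          '"GET / HTTP/1.0" 200 3018 "-" "-" -')
print(sanitize_line(sample))
```

 For the sample line this prints the line without the trailing
 `"-" "-" -` fields.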


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-08-22 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_information
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by iwakeh):

 Some questions before I begin to distill a small description:

 It seems the sanitized clean log lines should adhere to the Common Log
 Format:
 {{{
 LogFormat "%h %l %u %t \"%r\" %>s %b"
 }}}

 Thus, the sanitized log format will be changed from currently:
 `0.0.0.0 - - [10/Mar/2017:00:00:00 +0000] "GET / HTTP/1.0" 200 3018 "-" "-" -`
 to
 `0.0.0.0 - - [10/Mar/2017:00:00:00 +0000] "GET / HTTP/1.0" 200 3018`.
 Can we agree on this?
 When parsing, all lines whose beginning doesn't fit the Common Log
 Format are considered unparseable.  Lines whose beginning matches CLF
 are considered parseable, and all trailing content is ignored.


 POST requests are to be dropped?


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-08-15 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_information
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---

Comment (by karsten):

 Replying to [ticket:23243 iwakeh]:
 > This document should answer the following questions:

 Good idea to start such a document! I'll start filling information below.

 > * What will the raw input data look like?
 >  - compressed logs

 Very likely, though compression shouldn't be a strict requirement.

 >  - varying dates in log-lines despite the file being tagged with a
 >    single date

 Yes, to a certain degree. We'll have to ask the admins for details, but I
 believe that the date in the file name is put in when rotating logs and
 that the date per line is when the host started processing a request. Now,
 it's possible that some requests are received before midnight and
 completed after midnight. And depending on when the log is rotated it's
 possible that some requests are started on the day before the log was
 rotated and finished after rotating the log.

 >  - are there only GET log-lines of 200 responses to be expected?

 No, there might be other methods and other response codes.

 >  - size could be huge (in future)

 Yes.

 >  - exact input format (if possible to define)

 Good question. We should ideally support Apache's Combined Log Format,
 even though we'd currently only receive Tor's privacy* log formats:

 {{{
 LogFormat "0.0.0.0 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacy
 LogFormat "0.0.0.1 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyssl
 LogFormat "0.0.0.2 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyhs
 }}}

 And there's already the first contradiction: The `%{Age}o` part is not
 contained in the Combined Log Format:

 {{{
 LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
 }}}

 Maybe we require lines to start with the Common Log Format and ignore any
 further fields? Needs discussion.

 >  - meta-data is provided in paths and filenames

 Yep.

 >  - ...
 > * What will sanitized stored (on disk) logs look like?
 >  - cleaned log-lines, define exact format, give examples (as this might
 >    deviate from the current Python sanitization)
 >  - meta-data is provided in paths and filenames
 >  - should files be reassembled, i.e., only log lines of a given date in
 >    a descriptor for that log date?

 Yes! That's important! Otherwise we'll leak which lines for a given date
 were logged before or after rotating logs, which narrows the time frame
 to much less than 24 hours. We'll have to do this.
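
 As a rough sketch of what such reassembling could look like in Python:
 lines are grouped by the date found in their timestamp field, regardless
 of which rotated file they came from. The name `group_by_date` and the
 `YYYYMMDD` key format are assumptions made for this example, and the date
 is taken as written in the line, without any time zone conversion.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Date part of a timestamp such as [10/Mar/2017:00:00:00 ...]
DATE_IN_LINE = re.compile(r"\[(\d{2})/([A-Za-z]{3})/(\d{4}):")
MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}


def group_by_date(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group log lines by the date in their timestamp (key: YYYYMMDD)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        match = DATE_IN_LINE.search(line)
        if match is None or match.group(2) not in MONTHS:
            continue  # no usable date: skip the line in this sketch
        day, month, year = match.groups()
        groups[f"{year}{MONTHS[month]:02d}{day}"].append(line)
    return dict(groups)
```

 Each resulting group would then be written to the sanitized log file for
 that date, so that a published file only contains lines of a single date.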

 >  - should storage (on disk) be in compressed files (as opposed to storing
 >    other descriptors uncompressed)?

 Yes. Configurable by the application, but yes.

 >  - Should such logs be stored (on disk) in reasonably sized chunks (once
 >    a GB size is reached)?

 No, compression should already reduce the size enough so that we'll never
 run into such sizes. Never!

 >  - ...
 >
 > Please add more.

 Looks like a good start! Will add more as more comes to mind.


Re: [tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-08-15 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+---
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  needs_information
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+---
Changes (by iwakeh):

 * status:  new => needs_information


Comment:

 Added to metrics-web as the spec will likely also face the public there.
 This is the basis for metrics-lib ticket #22983.


[tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

2017-08-15 Thread Tor Bug Tracker & Wiki
#23243: write a spec for web-server-access log descriptors
-+--
 Reporter:  iwakeh   |  Owner:  metrics-team
 Type:  enhancement  | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Metrics website  |Version:
 Severity:  Normal   |   Keywords:
Actual Points:   |  Parent ID:
   Points:   |   Reviewer:
  Sponsor:   |
-+--
 This document should answer the following questions:

 * What will the raw input data look like?
  - compressed logs
  - varying dates in log-lines despite the file being tagged with a single
 date
  - are there only GET log-lines of 200 responses to be expected?
  - size could be huge (in future)
  - exact input format (if possible to define)
  - meta-data is provided in paths and filenames
  - ...
 * What will sanitized stored (on disk) logs look like?
  - cleaned log-lines, define exact format, give examples (as this might
 deviate from the current Python sanitization)
  - meta-data is provided in paths and filenames
  - should files be reassembled, i.e., only log lines of a given date in a
 descriptor for that log date?
  - should storage (on disk) be in compressed files (as opposed to storing
 other descriptors uncompressed)?
  - Should such logs be stored (on disk) in reasonably sized chunks (once a
 GB size is reached)?
  - ...

 Please add more.
