#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------
Changes (by iwakeh):
* status: merge_ready => needs_revision

Comment:

The spec might need to be extended:

The implementation of the CollecTor webstats module raised further questions about how the original logs are supplied. One piece of information that is so far supplied only indirectly is the cue for when a log is finished.

In detail:

* Functionality for bulk imports of log files is necessary. Thus, the implementation can no longer rely on the system date to decide when a log day is complete. (This distinguishes between the reference date as defined in the spec and the 'log for a day', which means that all log lines for a given date are available.)
* Implicit assumption: input log files can be empty or contain no valid lines, as long as their naming pattern matches the rules.
* The current spec allows only one input log per reference date (per virtual plus physical host).
* Log lines for a particular log day could be spread over two successive log files (as defined in the current spec).
* Implicit cue: all log lines are available for a certain reference date when the log for the reference date and its successor are available. This also means that a log for a day without an immediate successor is not complete, i.e., it won't be processed. The cue could be given in the form of an empty successor log file. This cue has to be supplied from outside and cannot be determined by the implementation.

Related is another question from #22428 comment:36:

> Here's another, related question: what happens if a web server rotates logs more often than once per day? At least that's something that we write in the specification. I'm not sure how this would work with file names, so maybe we in fact require that logs are rotated exactly once per day, and we just didn't write that in the specification yet.

However, it seems rather restrictive to prescribe exact log rotation intervals just so that logs can be sanitized afterwards. Maybe we should be less restrictive here.
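The implicit cue from the list above (one input log per reference date, complete only once its immediate successor is present) can be illustrated with a minimal Python sketch. The function name `complete_reference_dates` and the idea of passing the available reference dates as a plain collection are assumptions made for illustration; this is not the CollecTor implementation.

```python
from datetime import date, timedelta


def complete_reference_dates(available_dates):
    """Return the reference dates whose immediate successor is also available.

    Hypothetical helper: `available_dates` holds the reference dates for
    which an input log file (possibly empty) exists for one virtual plus
    physical host.
    """
    available = set(available_dates)
    # A log without an immediate successor is not complete and is skipped.
    return sorted(d for d in available if d + timedelta(days=1) in available)


if __name__ == "__main__":
    dates = [date(2017, 10, 5), date(2017, 10, 6), date(2017, 10, 8)]
    for d in complete_reference_dates(dates):
        print(d.isoformat())
```

In this example only 2017-10-05 qualifies: 2017-10-06 has no log for 2017-10-07, and 2017-10-08 has no successor yet. An empty file for the following date would be enough to release a pending day.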
It doesn't really matter if the log lines for a certain day are spread over two or more input files. Currently, only one input file per reference date is possible (the first wins). More input files could be supplied by extending the input log name pattern with a dash followed by an integer, e.g., `scrubbed.torproject.org-access.log-20171006-77.gz`. In that case it should be required that

* counting starts with one (an arbitrary choice).
* there are no gaps, i.e., if there is a file with 3, there have to be files with 2 and 1 for the same combination of virtual host, physical host, and date.

Again, a cue is needed for when the log day is complete. As above, this could be the input file with number 1 for the immediate successor by reference date. And this cue could be an empty file.

Remarks: The way the cue is given is arbitrary, but the currently suggested implementation already works with the method described above. The naming pattern is just an arbitrary suggestion, so improvements are welcome.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:45>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
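The numbering rules suggested in the comment (counting starts with one, no gaps, and file number 1 of the immediate successor date serves as the completeness cue) could be checked roughly as follows. This is only a sketch: the regular expression `NAME`, the function `processable_days`, and the simplification of grouping by virtual host and date alone (assuming the files of one physical host are handled together) are assumptions for illustration, not part of the spec.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical pattern for the suggested naming scheme, e.g.
# scrubbed.torproject.org-access.log-20171006-77.gz
NAME = re.compile(
    r"^(?P<vhost>.+)-access\.log-(?P<date>\d{8})-(?P<seq>\d+)(?:\.\w+)?$"
)


def processable_days(file_names):
    """Return sorted (vhost, date) pairs whose numbered files can be processed."""
    seqs = defaultdict(set)
    for name in file_names:
        m = NAME.match(name)
        if m is None:
            continue  # name does not follow the assumed pattern
        day = datetime.strptime(m.group("date"), "%Y%m%d").date()
        seqs[(m.group("vhost"), day)].add(int(m.group("seq")))

    def gapless(numbers):
        # Counting starts with one and has no gaps: {1, 2, ..., n}.
        return numbers == set(range(1, len(numbers) + 1))

    result = []
    for (vhost, day), numbers in seqs.items():
        successor = seqs.get((vhost, day + timedelta(days=1)), set())
        # Cue: file number 1 of the immediate successor date exists.
        if gapless(numbers) and 1 in successor:
            result.append((vhost, day))
    return sorted(result)
```

With files numbered 1 and 2 for 20171006, files 1 and 3 for 20171007, and file 1 for 20171008, only 20171006 would be reported: 20171007 has a gap, and 20171008 has no successor file yet.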
_______________________________________________
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs