#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------

Comment (by karsten):

 A few thoughts:

  - I don't have a good answer to the question of putting virtual and
    physical host name into a file name. Underscores might work. Trying to
    solve the bigger issue first.
  - We have little influence on the naming of input files. We can define
    reasonable requirements that are ideally met by existing input files,
    and we can reject future input files not matching these requirements.
  - I don't imply that we should alter sanitized log files after
    publication. We shouldn't. That would be pretty bad.
  - The specification is a moving target, because it was ambiguous and the
    implementation would have been fragile. It's a valid use case that a
    physical host does not see a single request for a given virtual host
    for days, and the implementation (and specification) did not cover
    that case.

 But let's look at the idea to process all input files and produce output
 files for all dates except the first and last UTC days. And let's ignore
 performance considerations for now. Why would it not solve the "when is a
 log ready for publication" question? Can you give an example?

 Not sure if this is what you have in mind, but I think we ''cannot''
 handle the case of log files "in the middle" being missing in one run and
 being present in a subsequent run. For example, if we receive input files
 with requests from the following dates:

  - 2017-11-01 and 2017-11-02
  - 2017-11-02 and 2017-11-03
  - (gap)
  - 2017-11-04 and 2017-11-05
  - 2017-11-05 and 2017-11-06

 We would produce output files for:

  - (skip 2017-11-01, because first UTC date)
  - 2017-11-02
  - 2017-11-03
  - 2017-11-04
  - 2017-11-05
  - (skip 2017-11-06, because last UTC date)

 Now, if we later find another file filling the gap with the following
 contained request dates:

  - 2017-11-03 and 2017-11-04

 We ''couldn't'' update the output files for 2017-11-03 and 2017-11-04
 anymore! We would simply leave them unchanged, containing just the
 requests we processed earlier. But is this a bug we should be able to
 handle?
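
 To make the example above concrete, here is a minimal Python sketch of
 the rule. It is not the implementation. The names `publishable_dates`,
 `run`, `skip_first` and `skip_last` are made up for illustration, and it
 assumes that each input file can be reduced to the set of UTC dates of
 the requests it contains, and that "first" and "last" refer to the
 earliest and latest dates seen across all input files.

```python
from datetime import date


def publishable_dates(input_files, skip_first=1, skip_last=1):
    """Return the sorted UTC dates that would get an output file.

    input_files: iterable of sets of datetime.date, one set per input file.
    skip_first / skip_last: hypothetical knobs for how many of the earliest
    and latest dates to hold back.
    """
    seen = sorted({d for dates in input_files for d in dates})
    if skip_first + skip_last >= len(seen):
        return []
    return seen[skip_first:len(seen) - skip_last]


def run(input_files, published):
    """One processing run: only dates not published before are written."""
    new = [d for d in publishable_dates(input_files) if d not in published]
    published.update(new)
    return new


def nov(day):
    return date(2017, 11, day)


# Input files from the example, as sets of contained request dates.
first_batch = [{nov(1), nov(2)}, {nov(2), nov(3)},
               {nov(4), nov(5)}, {nov(5), nov(6)}]
late_file = {nov(3), nov(4)}  # shows up only in the second run

published = set()
first_run = run(first_batch, published)
second_run = run(first_batch + [late_file], published)

print([d.isoformat() for d in first_run])
# ['2017-11-02', '2017-11-03', '2017-11-04', '2017-11-05']
print(second_run)
# []
```

 The second run writes nothing: 2017-11-03 and 2017-11-04 were already
 published in the first run, so the requests in the late file never make
 it into an output file.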
 It seems like a bug in the log-copying script combined with bad timing.
 During normal operation and in the bulk-import case this should not
 happen.

 Note that if you think that cutting off the first and last days is not
 enough, we could easily change that to cutting off the first two and last
 two days. Or the first day and the last two. Or the first three and last
 three. Whatever we think works best.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:52>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs