#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------

Comment (by iwakeh):

 Replying to [comment:46 karsten]:
 > I'm not sure if we can resolve these questions by hard thinking.

 Well, we need to work on thoughtful decision making. There are not that
 many questions above except yours:

 > ... what happens if a web server rotates logs more often than once per
 > day? At least that's something that we write in the specification. I'm
 > not sure how this would work with file names, so maybe we in fact
 > require that logs are rotated exactly once per day, and we just didn't
 > write that in the specification yet. However, it seems rather
 > restrictive to prescribe exact log rotation intervals in order to
 > sanitize logs subsequently. Maybe we should be less restrictive here.

 The current webstat code and the spec require one log per day. So, if
 someone decides to rotate logs more often than that, the spec and code
 will have to be adapted. Thus, it seems this question refers to a
 hypothetical change (afaik). In comment:45 I point out that this is a
 small issue for the implementation, based on the reasoning that rotated
 logs usually add a number or a time or both to the log file name. Either
 way is easily adapted.

 > - Would it help to know the log and log rotation configuration used on
 > the various Tor web servers?

 Unless you have reason to think that the current logging procedures are
 going to change, or have already changed, from the one-log-per-day
 schema, this is not necessary.

 > - Would it help to have access to the current host that sanitizes web
 > server logs?

 I think there are no questions regarding the current process.

 > - Does the existing code for sanitizing web server logs contain any
 > more hints on the input data?

 We put all the information from the current code into the spec and the
 current implementation suggestion.
 The old code also uses a 'cue' (as mentioned in comment:45):
 `sanitize.py` returns the sanitized log file name for the day before the
 processed log file, which is the cue for the calling shell script that
 this file is now complete and can be published.

 Both the old and the suggested new version of webstats need an outside
 cue, and without this cue an input log day would not be published.

 Now, focussing on the new implementation of the webstats module for
 CollecTor, there are several ways of preventing log file loss:

 1. Make sure by outside means that there is no day without a log (e.g.,
 by providing an empty file for that day using 'touch'). This would work
 without additional implementation for CollecTor, and it works for bulk
 imports as well as daily processing. As a result there will be a
 sanitized log for each day offered by CollecTor; some might be empty.
 2. For bulk processing, a property could signal CollecTor to use all logs
 without insisting on an uninterrupted chain. This still requires outside
 measures for making sure no log lines are lost and might result in days
 without any logs, unless CollecTor creates empty ones.
 3. Devise a mechanism that enables more automated processing of an
 interrupted chain of logs. This seems error-prone and will result in many
 edge cases.

 I think 1. is the easiest in terms of operation (i.e., providing input
 logs) and implementation (it's there already). In addition, the
 uninterrupted chain of (possibly empty) sanitized logs is also easy to
 verify and understand. An empty file could result from no log line being
 valid or from no log being available for that day.

 So, in order to move forward, one of the above methods needs to be chosen
 (or a new one made up).

 The other question, about smaller log rotation intervals, is only
 relevant if that is put into practice. If so, it should be a
 straightforward task to adapt the code.

 Hope this makes some sense. Is there anything else missing here?
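
 To illustrate option 1 from the list above, here is a minimal sketch of
 the "outside means": a small script that walks a date range and creates
 an empty placeholder (the equivalent of 'touch') for every day that has
 no input log. The file naming scheme in `log_name` and the function names
 are assumptions made up for this example; they are not taken from the
 spec or from the existing webstats code.

```python
from datetime import date, timedelta
from pathlib import Path


def log_name(day):
    # Hypothetical naming scheme for one rotated log per day;
    # adjust to whatever the web servers actually produce.
    return f"access.log-{day:%Y%m%d}"


def fill_gaps(log_dir, first, last):
    """Create an empty log for each day in [first, last] without one.

    Returns the list of days for which a placeholder was created, so
    the operator can see where the chain of logs was interrupted.
    """
    created = []
    day = first
    while day <= last:
        path = Path(log_dir) / log_name(day)
        if not path.exists():
            path.touch()  # same effect as 'touch' on the command line
            created.append(day)
        day += timedelta(days=1)
    return created


if __name__ == "__main__":
    # Example: make sure the first week of a month has no gaps.
    for d in fill_gaps(".", date(2017, 11, 1), date(2017, 11, 7)):
        print("created empty log for", d.isoformat())
```

 Running this before handing the directory to CollecTor would give an
 uninterrupted chain of (possibly empty) input logs, without requiring
 any change to the sanitizing code itself.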

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:47>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs