#20228: Append all votes with same valid-after time to a single file in `recent/`
-------------------------------+---------------------
 Reporter:  karsten            |          Owner:
     Type:  enhancement        |         Status:  new
 Priority:  High               |      Milestone:
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:                     |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+---------------------
Changes (by karsten):

 * priority:  Medium => High

Comment:

I'd like us to move forward here, ideally with descriptors grouped by download time and both of us being fully convinced that it's the best way forward. :)

So, let me give you some background on where the `recent/` folder comes from.

A few years back, there was just the `archive/` folder with tarballs that were updated every few days. All services like Tor Metrics, ExoneraTor, and Onionoo were running on the same host as CollecTor and using CollecTor's directory structure for importing new descriptors. This was very convenient for running these services, but of course very fragile and very impossible for others to run similar services.

That's when I turned CollecTor into its own service. The new CollecTor service had a local directory called `rsync/`, the predecessor of `recent/`, which had just the newest files that other services would download via `rsync` rather than HTTP. The idea was to provide the latest 72 hours of descriptors, so that services can miss updates for up to 3 days (a weekend) without having to fall back to importing tarballs from the `archive/` directory. This fixed the problem of running all services on one machine, but it didn't allow others to run services.

We quickly learned that rsyncing thousands or even hundreds of thousands of files did not scale, so we appended many small descriptors into one file per CollecTor update run.

At some point we made that `rsync/` directory available via HTTP as `recent/` and taught Onionoo et al. to download descriptors from there instead of relying on a local `rsync` command to magically fetch them. This is when other services could first enter the game. It's also when users started browsing the `recent/` directory to have an easy way to download descriptors, but that was mostly coincidence and a nice side effect.

Now we're considering changing the directory structure to make it even more efficient for services to keep up to date.

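To make the two grouping schemes concrete, here is a minimal sketch in Python. It is an illustration only, not CollecTor's code: the record fields (`valid_after`, `run`, `raw`), the timestamps, and the `group_votes` helper are all made up for this example.

```python
from collections import defaultdict

# Hypothetical vote records; field names, timestamps, and contents are
# illustrative assumptions, not real CollecTor data.
votes = [
    {"valid_after": "2016-09-20-12-00-00", "run": "2016-09-20-12-05-00", "raw": b"vote A\n"},
    {"valid_after": "2016-09-20-12-00-00", "run": "2016-09-20-12-05-00", "raw": b"vote B\n"},
    # A vote that was only fetched during the following update run.
    {"valid_after": "2016-09-20-12-00-00", "run": "2016-09-20-13-05-00", "raw": b"vote C\n"},
]


def group_votes(votes, field):
    """Append the raw bytes of all votes sharing the same value of `field`."""
    grouped = defaultdict(list)
    for vote in votes:
        grouped[vote[field]].append(vote["raw"])
    # One concatenated blob per group, standing in for one file in `recent/`.
    return {name: b"".join(parts) for name, parts in grouped.items()}


by_valid_after = group_votes(votes, "valid_after")  # one file per valid-after time
by_run = group_votes(votes, "run")                  # one file per update run
```

In this toy example, grouping by valid-after time yields a single file that changes when the late vote C arrives, so a service that already has A and B would fetch them again. Grouping by update run yields a second file containing only C.
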
Merging votes into single files reduces the `index.json*` size while keeping the service exactly as useful for other services. What we'd make a bit more difficult is accessibility for humans, because they cannot locate a vote as easily anymore.

Also consider a feature request that people ask for every so often: provide a search for raw descriptors. This is something that folks like directory authority operators or others who debug the network would find really useful. And these folks might be sad that votes are appended to single files and stored by download time rather than valid-after time. But it's again coincidence that votes are easily locatable by valid-after time. On the other hand, if a user searches for something different, like a relay fingerprint or IP address, they'll likely have to download the latest few votes and search locally.

So, we might even go one step further and store ''all'' descriptors in the `recent/` folder by download time. That would include consensuses, of which there is usually only one per CollecTor update run. The upside would be that it'd become more obvious that all file names contain the download time, not the published or valid-after time.

All in all, I'd like to consider the `recent/` folder as an update channel for services rather than something that humans browse. I'm not going to stop them from doing that, but I'm very hesitant to make the original use case of that directory less useful by supporting this new use case. And we would make it less useful by forcing services to download multiple files containing many descriptors they already know.

Somebody should go and write a descriptor database that takes CollecTor's `recent/` folder as input and provides a search interface that returns raw descriptors.

I hope this makes sense. Please let me know if it doesn't! And thanks for reading this wall of text.
;)

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/20228#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs