#19791: Use CollecTor's index.json for download; adapt current download to use new date format ---------------------------------+----------------------------------- Reporter: karsten | Owner: iwakeh Type: defect | Status: merge_ready Priority: High | Milestone: metrics-lib 1.4.0 Component: Metrics/metrics-lib | Version: Severity: Normal | Resolution: Keywords: | Actual Points: Parent ID: | Points: Reviewer: | Sponsor: ---------------------------------+----------------------------------- Changes (by karsten):
* status: needs_review => merge_ready Comment: Replying to [comment:11 iwakeh]: > Replying to [comment:10 karsten]: > [...] > > > Regarding the new subpackage `org.torproject.descriptor.index`, I'm yet unclear whether that's supposed to be "visible" to API users or not, that is, if they're supposed to import and/or use those classes directly. If it is, that's going to grow the overall interface quite a bit, requiring us to keep the public parts stable in subsequent releases. For example, I'm unclear whether CollecTor is supposed to instantiate its own `*Node` classes when writing its `index.json*` files. Ideally, metrics- lib would hide away those classes by providing some kind of `DescriptorIndexGenerator` interface that takes a local directory as parameter and writes one or more `index.json[*]` files, but I'm not sure what your plan is there. If the plan is to hide that package from users, how about we rename it to `org.torproject.descriptor.impl.index` to make that clearer? We could even create similar implementation subpackages in the future to clean up the `impl` package a bit, but that's not something that users would have to care about. > > I'd like to add the package as '''alpha'''. Warning that it can change still in unexpected ways. Thus, it can be used in CollecTor and from the findings there we can decide to remove it from the public interface, or change it w/o being backward compatible. > I provide a commit with the package-info and more javadoc below. > > The *Node classes are of use on their own for clients that want to know about a CollecTor's instance files. They are there anyway and maintained, so others can just use them as well. Okay, I'm fine postponing this decision, and adding that alpha notice should be clear enough. However, I'm yet unconvinced by the reasons you're giving. Ideally, clients shouldn't have to deal with the details of an index.json file. We should give them interfaces that are powerful enough to do all the things they'd want to do. Also, my past experience is that maintaining code can be very different depending on whether it's exposed in an API or not. I'd rather want to provide fewer details in interfaces than what's already there. But, no need to find a conclusion for this today, we can still think about this for 1.5.0. > > Can we avoid changing the parameter usage of `DescriptorIndexCollector#collectDescriptors()` by overloading that method in the interface? How about we add a new method `collectDescriptor(String collecTorIndexUrl, String collecTorIndexPath, String[] remoteDirectories, ...)` and use an implementation-specific default for `collecTorIndexPath` in the current method, such as `"/index.html"` for `DescriptorCollectorImpl` and `"/index.json.xz"` for `DescriptorIndexCollector`? If we retain backward compatibility here, I'd say we could switch to `DescriptorIndexCollector` in 1.5.0 when this class has seen some more testing. Otherwise I'd suggest waiting for 2.0.0. > > I wouldn't want to change the interface for that. Both are URL strings and whoever uses the non-default implementation must know what their doing, i.e. at least read javadoc of the class they explicitly request to be used. At the point DescriptorIndexCollector becomes default we might not want to keep maintaining the old implementation and simply remove it. Then the additional method wouldn't make sense any more. My point is that we could do the switch to the new implementation in a version 1.5.0 if it remains backward-compatible, whereas we'd have to call that version 2.0.0 otherwise. So, we can merge this code as is for 1.4.0, but we should aim for a backward-compatible version if we want to make it the new default in 1.5.0. Here's an even better suggestion: we accept both, base URLs (like `"https://collector.torproject.org"`) and full URLs to index files (like `"https://collector.torproject.org/index.html"` or `"https://collector.torproject.org/index.json.xz"`), in the current `collectDescriptors()` method, and if no file is specified, we'll use an implementation-specific default (`"/index.html"` in `DescriptorCollectorImpl` and `"/index.json.xz"` in `DescriptorIndexCollector`). That doesn't break current applications that just provide a base URL, and it enables future applications to select their favorite compression type. Likewise, no need to make a decision today, but let's consider this for 1.5.0. > [...] > > > Is `FileNode#lastModifiedMillis()` called often? If so, it may be better to use `ParseHelper#getDateFormat()` to obtain a thread-safe `DateFormat` instance for the given format and store it in a map. That avoids creating single-use instances of that class over and over. We'd have to make that method public for this, but that shouldn't be an issue. However, if it's not called very often, we can skip this. > > > > I'm aware of `ParseHelper#getDateFormat()`, but I didn't want to introduce dependencies to the `impl` package. > Actually, `ParseHelper` should be split into a utility class that is part of the interface and the implementation specific part (should be part of the redesign ticket #19640). Once, this class exists the `getDateFormat` method should be used; until then it wont break I think. Ah, fair enough. Happy to discuss this more on that other ticket. > [...] > > > Can we pick a smaller `test.json` and compressed equivalents in `src/test/resources/`? For example, we could take the `index.json` from the main CollecTor instance and throw out all files that haven't been modified in the past, say, four weeks. Or did you include a large file on purpose to check something that cannot be tested with a smaller file? > > > > I wanted to use the 'real' file as it is provided by the current main CollecTor without any edits, because I only have these small toy test files otherwise. There can be a timeout to the test method and we'd notice if that takes too long? Okay to leave these files in. No need to add a timeout to the test, as we'll notice slow tests pretty easily, and then we can always take them out. I was also more concerned about the file size than the test execution time, but I admit that file size doesn't really matter here. Never mind. > > I'd say once we have resolved these questions, it's good to go into 1.4.0, even today. Thanks for working on this! > > Fine, thanks for all the things you spotted! > > Please review the [https://gitweb.torproject.org/user/iwakeh/metrics- lib.git/?h=task-19791 new commits] based on your branch above. All changes look good. I'll squash your latest commits and mine into 32f628f, run some tests, and release 1.4.0. Let's just not forget the issues I'm raising above for 1.5.0. -- Ticket URL: <https://trac.torproject.org/projects/tor/ticket/19791#comment:12> Tor Bug Tracker & Wiki <https://trac.torproject.org/> The Tor Project: anonymity online _______________________________________________ tor-bugs mailing list tor-bugs@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs