Hi Kostas,

On 11/06/14 04:48, Kostas Jakeliunas wrote:
> Hi all!
>
> On Mon, Jun 9, 2014 at 10:22 AM, Karsten Loesing <[email protected]>
> wrote:
>> On 09/06/14 01:26, Damian Johnson wrote:
>>> Oh, and another quick thought - you once mentioned that a descriptor
>>> search service would make ExoneraTor obsolete, and in looking it over
>>> I agree. The search functionality ExoneraTor provides is trivial. The
>>> only reason it requires such a huge database is because it's storing a
>>> copy of every descriptor ever made.
>>>
>>> I suspect the actual right solution isn't to rewrite ExoneraTor at
>>> all, but rather develop a new service that can be queried for this
>>> descriptor data. That would make for a *much* more worthwhile project.
>>>
>>> ExoneraTor? Nice to have. Descriptor archive service? Damn useful. :)
>>
>> I agree, that was the idea behind Kostas' GSoC project last year. And I
>> still think it's a good idea. It's just not trivial to get right.
>
> Indeed, not trivial at all!
>
> I'll use this space to mention the running metrics archive backend
> modulo ExoneraTor stuff / what could be sorta-relevant here.
>
> fwiw, the onionoo-like backend is still running at an obscure address:port:
> http://ts.mkj.lt:5555/

Would you want to put the summary you wrote here on that page? And
would you want me to add a sentence or two about your service, together
with a link, to the CollecTor page?

https://collector.torproject.org/#references

What would I write?

> TL;DR "what can I do with that" is: look at:
>
> https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md
>
> In particular, regarding ExoneraTor-like queries (incl. arbitrary
> subnet / part-of-ip lookups):
>
> https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exonerator-type-relay-participation-lookup
>
> Not sure if it's worth discussing all the weaknesses of this archive
> backend in this thread, but the short relevant version is that the
> ExoneraTor-like functionality does mostly work, but I would need to
> look into it to see how reliable the results are ("is this relay ip
> address field really the one we should be using?", etc.)
>
> But what's nice is that it is possible to do arbitrary queries on all
> consensuses since ~2008, with no date specified (if you don't want
> to.) (Which is to say, "it's possible", not necessarily "this is the
> right way to do the solution for the problems in this thread")
>
> So e.g. this is the ip address where moria runs, and we want to see
> what relays have ever run on it:
>
> http://ts.mkj.lt:5555/details?search=128.31.0.34
>
> Take the fingerprint of the one that is currently running (moria1),
> and look up its last 500 statuses (in a kind of condensed/summary
> form):
> http://ts.mkj.lt:5555/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE31&condensed=true
>
> "from", "to" date ranges can be specified as e.g. 2009, 2009-02,
> 2009-02-10, 2009-02-10 02:00:00. limit/offset/parameters/etc.
> specified here:
> https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md
>
> (Descriptors/digests aren't currently included (I think they used to),
> but they can be, etc.)
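
A minimal sketch of how the example URLs above are put together, assuming only the parameter names that appear in this thread (search, lookup, condensed, from, to). The helper build_url is hypothetical and not part of torsearch; the sketch only builds query strings and does not contact the server:

```python
from urllib.parse import urlencode

# Address quoted in this thread; it may no longer be reachable.
BASE = "http://ts.mkj.lt:5555"


def build_url(endpoint, **params):
    """Hypothetical helper: join an endpoint and its query parameters."""
    # "from" is a Python keyword, so callers pass from_ and it is renamed here.
    if "from_" in params:
        params["from"] = params.pop("from_")
    # Drop unset parameters so that the date range stays optional.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}/{endpoint}?{query}"


# Relays that have ever run on moria's address.
print(build_url("details", search="128.31.0.34"))

# Condensed statuses for moria1, with an explicit date range.
print(build_url(
    "statuses",
    lookup="9695DFC35FFEB861329B9F1AB04C46397020CE31",
    condensed="true",
    from_="2009-02",
    to="2009-02-10",
))
```

Whether the "to" bound is inclusive is not stated in this thread; the API document linked above would be the place to check.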
>
> The point is probably mostly about "this is some evidence that it can be
> done."
> ("But there are nuances, things are imperfect, time is needed, etc.")
>
> The question really is regarding the actual scope of this rewrite, I suppose.
>
> I'd probably agree with Karsten that just doing a port of the
> ExoneraTor functionality as it currently is on
> exonerator.torproject.org would be the safe bet. See how that goes,
> venture into more exotic lands later on maybe, etc. (That doesn't mean
> that I wouldn't be excited to put the current backend to good use,
> and/or use the knowledge I gained to help you folks in some way!)
>
>>
>> Regarding your comment about storing a copy of every descriptor ever
>> made, I believe that users trust ExoneraTor's results more if they see
>> the actual descriptors that lead to results. Of course, I'm saying that
>> without knowing what ExoneraTor users actually want. But let's not drop
>> descriptor copies from the database easily.
>>
>> And, heh, when you say that the search functionality ExoneraTor provides
>> is trivial, a little part of me is dying. It's the part that spent a
>> few weeks on getting the search functionality fast enough for
>> production. That was not at all trivial. The oraddress24, oraddress48,
>> and exitaddress24 fields as well as the indexes are the result of me
>> running lots and lots of sample queries and wondering about Postgres'
>> EXPLAIN ANALYZE results. Just saying that it's not going to be trivial
>> to generalize the search functionality towards other fields than IP
>> addresses and dates.
>
> Hear hear, I can only imagine! These things and exonerator stuff is
> not easy to be done in a way that would provide **consistently**
> good/great performance.
>
> I spent some days of the last summer also looking at EXPLAIN ANALYZE
> results (it was a great feeling to start to understand what they mean
> and how I can make them better), but eventually things start making
> sense.
> (And when they do, I also get that same feeling that NoSQL
> stuff doesn't magically solve things.)
>
>>
>> If others want to follow, here's the SQL code I'm talking about:
>>
>> https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql
>>
>> So, I'm happy to talk about writing a searchable descriptor archive. It
>> could _start_ with ExoneraTor's functionality (minus the target address
>> and port thing discussed in that other email), and then we could
>> consider adding more searches.
>
> fwiw, imho this sounds like a sane plan to me. (Of course it could
> also be possible to work on the onionoo-like archive backend (or fork
> it, or smash it into parts and steal some of them, etc., but I can see
> why this might yield unclear deliverables, etc.) (So a short document
> of "what is wanted" would help, yeah.)
>
>>
>> Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so
>> let me make one remark about optimizing Postgres defaults: I wrote quite
>> a few database queries in the past, and some of them perform horribly
>> (relay search) whereas others perform really well (ExoneraTor). I
>> believe that the majority of performance gains can be achieved by
>> designing good tables, indexes, and queries. Only as a last resort we
>> should consider optimizing the Postgres defaults.
>
> Ha, at this point I probably have a sort of "premature optimizer"
> label in your mind, Karsten. :) (And I kinda deserved it by at one
> point focusing on very-low-level postgres caching mechanisms last
> summer, etc etc.)
>
> I've actually come to really appreciate good schema and query
> design[1] and the wonders that they do. That being said, I'd actually
> be curious to know how large the indexes of relay-search and current
> exonerator are.[2] I (still) bet increasing postgres' shared_buffers
> and effective_cache_size (totally normal practice!) might help! (Oh,
> is this one of those vim-vs-emacs things? If it is, sorry.)

I just deleted most of the database contents behind the relay-search
service a few days ago. But I might even have agreed there that some
PostgreSQL tweaking would have helped. It was a bad database design,
mostly because it was built for a different purpose (data aggregation
for the metrics website), so it's a bad example.

But let me give you some numbers on current ExoneraTor (I manually
deleted the parts of the output that we don't care about here):

exonerator=> \dt+
     Name      |  Size
---------------+--------
 consensus     | 16 GB
 descriptor    | 31 GB
 exitlistentry | 558 MB
 statusentry   | 50 GB
(4 rows)

exonerator=> \di+
                  Name                   |     Table     |  Size
-----------------------------------------+---------------+---------
 consensus_pkey                          | consensus     | 1280 kB
 descriptor_pkey                         | descriptor    | 1930 MB
 exitlistentry_exitaddress24_scanneddate | exitlistentry | 82 MB
 exitlistentry_exitaddress_scanneddate   | exitlistentry | 82 MB
 exitlistentry_pkey                      | exitlistentry | 173 MB
 statusentry_oraddress24_validafterdate  | statusentry   | 5470 MB
 statusentry_oraddress48_validafterdate  | statusentry   | 4629 MB
 statusentry_oraddress_validafterdate    | statusentry   | 5509 MB
 statusentry_pkey                        | statusentry   | 10 GB
(9 rows)

Happy to run some EXPLAIN ANALYZE queries for you if you tell me what to
run.

If we're going to optimize the ExoneraTor database, should we move this
discussion to a ticket?

All the best,
Karsten

> But the point is that (to invoke a cliche) there is no free lunch, and
> (2) postgresql can really do wonders and scale well when used right.
>
>>
>> You realize that a searchable descriptor archive focuses much more on
>> database optimization than the ExoneraTor rewrite from Java to Python
>> (which would leave the database untouched)?
>>
>
> "leaving database untouched" probably implies (very) significantly
> less work, so it would be a nice/clear starting point. (caveat, i may
> be lacking context, etc.)
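
For the shared_buffers question, it may help to total the sizes in the \dt+ and \di+ listings above. This is only a sketch: the figures are copied from those listings, psql rounds them, and the helper names (to_mb, total_gb) are made up for illustration:

```python
# Sizes copied from the \di+ and \dt+ listings above. psql rounds them,
# so the totals are approximate.
UNITS_IN_MB = {"kB": 1 / 1024, "MB": 1, "GB": 1024}

INDEX_SIZES = {
    "consensus_pkey": "1280 kB",
    "descriptor_pkey": "1930 MB",
    "exitlistentry_exitaddress24_scanneddate": "82 MB",
    "exitlistentry_exitaddress_scanneddate": "82 MB",
    "exitlistentry_pkey": "173 MB",
    "statusentry_oraddress24_validafterdate": "5470 MB",
    "statusentry_oraddress48_validafterdate": "4629 MB",
    "statusentry_oraddress_validafterdate": "5509 MB",
    "statusentry_pkey": "10 GB",
}

TABLE_SIZES = {
    "consensus": "16 GB",
    "descriptor": "31 GB",
    "exitlistentry": "558 MB",
    "statusentry": "50 GB",
}


def to_mb(size):
    """Convert a psql size string such as '82 MB' to megabytes."""
    value, unit = size.split()
    return float(value) * UNITS_IN_MB[unit]


def total_gb(sizes):
    """Sum a mapping of name -> size string and return gigabytes."""
    return sum(to_mb(size) for size in sizes.values()) / 1024


print(f"indexes: about {total_gb(INDEX_SIZES):.1f} GB")
print(f"tables:  about {total_gb(TABLE_SIZES):.1f} GB")
```

By this count the indexes add up to roughly 27 GB, most of it on statusentry, against roughly 98 GB of table data.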
>
>
> [1]: also, fun things like "sometimes indexes won't be used because a
> sequential read will be faster, because if parts of indexes to be used
> are in various parts across the disk (not all of them are in memory),
> random seek + read a bit into memory + repeat is slower than 'just
> read a lot of continuous data into memory'", etc etc.)
>
> [2]: if you're feeling adventurous, you can run this on each of the
> postgres databases, to see how large the indexes (among all other
> things) are, and which parts of them are loaded into memory
> https://github.com/wfn/torsearch/blob/master/misc/buffercache.sql
>
> --
>
> Kostas.
>
> 0x0e5dce45 @ pgp.mit.edu
>

_______________________________________________
tor-dev mailing list
[email protected]
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
