On 24/06/14 18:53, Kostas Jakeliunas wrote:
> Hi Karsten,

Hi Kostas,

> On Tue, Jun 17, 2014 at 10:13 AM, Karsten Loesing
> <[email protected]> wrote:
>> Hi Kostas,
>>
>> On 11/06/14 04:48, Kostas Jakeliunas wrote:
>>> Hi all!
>>>
>>> On Mon, Jun 9, 2014 at 10:22 AM, Karsten Loesing <[email protected]>
>>> wrote:
>>>> On 09/06/14 01:26, Damian Johnson wrote:
>>>>> Oh, and another quick thought - you once mentioned that a descriptor
>>>>> search service would make ExoneraTor obsolete, and in looking it over
>>>>> I agree. The search functionality ExoneraTor provides is trivial. The
>>>>> only reason it requires such a huge database is because it's storing a
>>>>> copy of every descriptor ever made.
>>>>>
>>>>> I suspect the actual right solution isn't to rewrite ExoneraTor at
>>>>> all, but rather develop a new service that can be queried for this
>>>>> descriptor data. That would make for a *much* more worthwhile project.
>>>>>
>>>>> ExoneraTor? Nice to have. Descriptor archive service? Damn useful. :)
>>>>
>>>> I agree, that was the idea behind Kostas' GSoC project last year. And I
>>>> still think it's a good idea. It's just not trivial to get right.
>>>
>>> Indeed, not trivial at all!
>>>
>>> I'll use this space to mention the running metrics archive backend
>>> modulo ExoneraTor stuff / what could be sorta-relevant here.
>>>
>>> fwiw, the onionoo-like backend is still running at an obscure address:port:
>>> http://ts.mkj.lt:5555/
>>
>> Would you want to put the summary you wrote here to that link?
>
> Moved the whole setup to work on port 80 (via uWSGI, with nginx as the
> reverse proxy) ("ts.mkj.lt:5555/some/request" now transparently
> perma-redirects to "ts.mkj.lt/some/request"), and put a simple very
> short summary on the index:
>
> http://ts.mkj.lt/
> (have you heard of this new edgy font, "Times New Roman"?)
> Let me know if something is too confusing or reads funny, etc. I can
> elaborate more in the beginning or after the examples, too.

Looks good to me!

>> And would you want me to add a sentence or two about your service
>> together with a link to the CollecTor page?
>>
>> https://collector.torproject.org/#references
>
> Ok!
>
>> What would I write?
>
> something like? --
>
> The Searchable Metrics Archive backend allows users to search and
> explore relay metrics data (consensuses and descriptors), present and
> past. It covers the years 2008-now and provides an Onionoo-like API.
>
> does that make sense?

It does! Tweaked a tiny bit and put online.

>>> TL;DR "what can I do with that" is: look at:
>>>
>>> https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md
>>>
>>> In particular, regarding ExoneraTor-like queries (incl. arbitrary
>>> subnet / part-of-ip lookups):
>>>
>>> https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exonerator-type-relay-participation-lookup
>>>
>>> Not sure if it's worth discussing all the weaknesses of this archive
>>> backend in this thread, but the short relevant version is that the
>>> ExoneraTor-like functionality does mostly work, but I would need to
>>> look into it to see how reliable the results are ("is this relay ip
>>> address field really the one we should be using?", etc.)
>>>
>>> But what's nice is that it is possible to do arbitrary queries on all
>>> consensuses since ~2008, with no date specified (if you don't want
>>> to.) (Which is to say, "it's possible", not necessarily "this is the
>>> right way to do the solution for the problems in this thread")
>>>
>>> So e.g. this is the ip address where moria runs, and we want to see
>>> what relays have ever run on it:
>>>
>>> http://ts.mkj.lt:5555/details?search=128.31.0.34
>>>
>>> Take the fingerprint of the one that is currently running (moria1),
>>> and look up its last 500 statuses (in a kind of condensed/summary
>>> form):
>>> http://ts.mkj.lt:5555/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE31&condensed=true
>>>
>>> "from", "to" date ranges can be specified as e.g.
>>> 2009, 2009-02, 2009-02-10, 2009-02-10 02:00:00.
>>> limit/offset/parameters/etc. specified here:
>>> https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md
>>>
>>> (Descriptors/digests aren't currently included (I think they used to),
>>> but they can be, etc.)
>>>
>>> The point is probably mostly about "this is some evidence that it can be
>>> done."
>>> ("But there are nuances, things are imperfect, time is needed, etc.")
>>>
>>> The question really is regarding the actual scope of this rewrite, I
>>> suppose.
>>>
>>> I'd probably agree with Karsten that just doing a port of the
>>> ExoneraTor functionality as it currently is on
>>> exonerator.torproject.org would be the safe bet. See how that goes,
>>> venture into more exotic lands later on maybe, etc. (That doesn't mean
>>> that I wouldn't be excited to put the current backend to good use,
>>> and/or use the knowledge I gained to help you folks in some way!)
>>>
>>>>
>>>> Regarding your comment about storing a copy of every descriptor ever
>>>> made, I believe that users trust ExoneraTor's results more if they see
>>>> the actual descriptors that lead to results. Of course, I'm saying that
>>>> without knowing what ExoneraTor users actually want. But let's not drop
>>>> descriptor copies from the database easily.
>>>>
>>>> And, heh, when you say that the search functionality ExoneraTor provides
>>>> is trivial, a little part of me is dying. It's the part that spent a
>>>> few weeks on getting the search functionality fast enough for
>>>> production. That was not at all trivial. The oraddress24, oraddress48,
>>>> and exitaddress24 fields as well as the indexes are the result of me
>>>> running lots and lots of sample queries and wondering about Postgres'
>>>> EXPLAIN ANALYZE results. Just saying that it's not going to be trivial
>>>> to generalize the search functionality towards other fields than IP
>>>> addresses and dates.
>>>
>>> Hear hear, I can only imagine!
>>> These things and exonerator stuff is
>>> not easy to be done in a way that would provide **consistently**
>>> good/great performance.
>>>
>>> I spent some days of the last summer also looking at EXPLAIN ANALYZE
>>> results (it was a great feeling to start to understand what they mean
>>> and how I can make them better), but eventually things start making
>>> sense. (And when they do, I also get that same feeling that NoSQL
>>> stuff doesn't magically solve things.)
>>>
>>>>
>>>> If others want to follow, here's the SQL code I'm talking about:
>>>>
>>>> https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql
>>>>
>>>> So, I'm happy to talk about writing a searchable descriptor archive. It
>>>> could _start_ with ExoneraTor's functionality (minus the target address
>>>> and port thing discussed in that other email), and then we could
>>>> consider adding more searches.
>>>
>>> fwiw, imho this sounds like a sane plan to me. (Of course it could
>>> also be possible to work on the onionoo-like archive backend (or fork
>>> it, or smash it into parts and steal some of them, etc., but I can see
>>> why this might yield unclear deliverables, etc.) (So a short document
>>> of "what is wanted" would help, yeah.)
>>>
>>>>
>>>> Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so
>>>> let me make one remark about optimizing Postgres defaults: I wrote quite
>>>> a few database queries in the past, and some of them perform horribly
>>>> (relay search) whereas others perform really well (ExoneraTor). I
>>>> believe that the majority of performance gains can be achieved by
>>>> designing good tables, indexes, and queries. Only as a last resort we
>>>> should consider optimizing the Postgres defaults.
>>>
>>> Ha, at this point I probably have a sort of "premature optimizer"
>>> label in your mind, Karsten. :) (And I kinda deserved it by at one
>>> point focusing on very-low-level postgres caching mechanisms last
>>> summer, etc etc.)
>>>
>>> I've actually come to really appreciate good schema and query
>>> design[1] and the wonders that they do. That being said, I'd actually
>>> be curious to know how large the indexes of relay-search and current
>>> exonerator are.[2] I (still) bet increasing postgres' shared_buffers
>>> and effective_cache_size (totally normal practice!) might help! (Oh,
>>> is this one of those vim-vs-emacs things? If it is, sorry.)
>>
>> I just deleted most of the database contents behind the relay-search
>> service a few days ago. But I might even have agreed there that some
>> PostgreSQL tweaking would have helped. It was a bad database design,
>> mostly because it was built for a different purpose (data aggregation
>> for metrics website), so it's a bad example.
>>
>> But let me give you some numbers on current ExoneraTor (manually deleted
>> part of the output which we don't care about here):
>>
>> exonerator=> \dt+
>>      Name      |  Size
>> ---------------+--------
>>  consensus     | 16 GB
>>  descriptor    | 31 GB
>>  exitlistentry | 558 MB
>>  statusentry   | 50 GB
>> (4 rows)
>>
>> exonerator=> \di+
>>                   Name                    |     Table     |  Size
>> -----------------------------------------+---------------+---------
>>  consensus_pkey                          | consensus     | 1280 kB
>>  descriptor_pkey                         | descriptor    | 1930 MB
>>  exitlistentry_exitaddress24_scanneddate | exitlistentry | 82 MB
>>  exitlistentry_exitaddress_scanneddate   | exitlistentry | 82 MB
>>  exitlistentry_pkey                      | exitlistentry | 173 MB
>>  statusentry_oraddress24_validafterdate  | statusentry   | 5470 MB
>>  statusentry_oraddress48_validafterdate  | statusentry   | 4629 MB
>>  statusentry_oraddress_validafterdate    | statusentry   | 5509 MB
>>  statusentry_pkey                        | statusentry   | 10 GB
>> (9 rows)
>
> Looks nice! :) thanks! (just for fun, the largest index on my side is
> one "statusentry_substr_validafter_idx", which is an index on two
> columns (a (SUBSTR() of) relay nickname and the consensus valid after
> (DESC)), and it's currently at 7004 MB.)
> Anyway, "these sizes make sense" is all I can think of right now!

Good to hear.

>> Happy to run some EXPLAIN ANALYZE queries for you if you tell me what to
>> run.
>
> okay, maybe I'll think of something some time, and if I do, I can
> either open a ticket, or create a new email thread, unless this is
> kind-of-ok for this thread.
>
> (Regarding "what part of $something is in memory", I remember the
> "disk read" (or was it "buffer read") words in EXPLAIN ANALYZE being
> useful. Also, sometimes postgres really mis-assumes on how much it'll
> have to read, and how much it ends up reading (it's all there in the
> results iirc, but you probably know all that.) In which case a VACUUM
> should help (of course), etc.)

Feel free to start a new thread or create a ticket for this. To be
honest, I didn't run EXPLAIN ANALYZE on this database for quite a while.
I just assume everything works fine.

>> If we're going to optimize the ExoneraTor database, should we move this
>> discussion to a ticket?
>
> Derailment with technicalities is always a looming danger I guess, but
> at this point I'm not even sure what you and Damian (and possibly
> others) are planning to do with the current ExoneraTor. I assume
> current ExoneraTor performance is good as it currently stands, so this
> part of the thread/thoughtspace can be closed for the time being as
> far as I can see. (And I could open a ticket if I think of something
> interesting to do regarding diagnosing/optimizing the ExoneraTor
> database.)
>
> I suppose there's still no consensus whether a python-exonerator
> should aim to replicate current ExoneraTor's functionality (and, say,
> use the current database), or whether it should do more(tm). (Happy to
> participate in some form of discussion at the dev meeting, if my input
> can be useful!)

Damian won't be in Paris, AFAIK. But sure, happy to discuss more next week.

All the best,
Karsten
_______________________________________________
tor-dev mailing list
[email protected]
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
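
For anyone who wants to try the example queries quoted in this thread, here is a minimal Python sketch that only builds the request URLs. It assumes the endpoints (`details`, `statuses`) and the parameter names (`search`, `lookup`, `condensed`, `from`, `to`) are passed as plain query-string parameters, as the quoted URLs suggest; `build_query` and `BASE_URL` are hypothetical names for illustration, and the host may no longer be reachable.

```python
from urllib.parse import urlencode

# Host taken from the thread (after the move to port 80); an assumption
# that it is still the right place to send requests.
BASE_URL = "http://ts.mkj.lt"


def build_query(endpoint, **params):
    """Return a query URL for an endpoint such as 'details' or 'statuses'.

    Hypothetical helper: 'from' is a Python keyword, so callers pass
    'from_' and the trailing underscore is stripped here.
    """
    cleaned = {
        key.rstrip("_"): value
        for key, value in params.items()
        if value is not None
    }
    return f"{BASE_URL}/{endpoint}?{urlencode(cleaned)}"


# Which relays have ever run on this IP address?
details_url = build_query("details", search="128.31.0.34")

# Condensed status history for one fingerprint, limited to a date range.
statuses_url = build_query(
    "statuses",
    lookup="9695DFC35FFEB861329B9F1AB04C46397020CE31",
    condensed="true",
    from_="2009-02",
    to="2009-02-10",
)

print(details_url)
print(statuses_url)
```

The sketch deliberately stops at building URLs; the actual parameter semantics (limit, offset, date formats) are described in the onionoo_api.md document linked in the thread.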
