Re: [tor-dev] Relay Database: Existing Schemas?
On Thu, Apr 16, 2015 at 4:53 PM, Karsten Loesing <kars...@torproject.org> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 15/04/15 21:18, nusenu wrote:
>> Hi,
>>
>> I'm planning to store relay data in a database for analysis. I
>> assume others have done so as well, so before going ahead and
>> designing a db schema I'd like to make sure I didn't miss
>> pre-existing db schemas one could build on.
>>
>> Data to be stored:
>> - (most) descriptor fields
>> - everything that onionoo provides in a details record (geoip, asn,
>>   rdns, tordnsel, cw, ...)
>> - historic records
>>
>> I didn't find anything matching so far, so I'll go ahead, but if
>> you know of other existing relay db schemas I'd like to hear about
>> it.
>>
>> thanks, nusenu
>>
>> "GSoC2013: Searchable Tor descriptor archive" (Kostas Jakeliunas)
>> https://www.google-melange.com/gsoc/project/details/google/gsoc2013/wfn/5866452879933440
>> https://lists.torproject.org/pipermail/tor-dev/2013-May/004923.html
>> https://lists.torproject.org/pipermail/tor-dev/2013-September/005357.html
>> https://github.com/wfn/torsearch (btw, someone knows the license
>> of this?)
>
> Cc'ing Kostas for this question.

Hi nusenu,

I've been going through old mail, and on 2015-04-16 you asked about a license (see above). I've just added a LICENSE file - can't hurt (standard BSD 3-clause).

If you're still by any chance collating (ha) and/or want to talk about schema design for descriptors, I'm happy to. (I personally would not lose hope for RDBMSes for large datasets - not until one gets into *actually* big data - say, terabytes at least, or more - but of course it gets nuanced real fast.)

--
Kostas.
0x0e5dce45 @ pgp.mit.edu

> >>> This is true: the summary/details documents (just like in Onionoo
> >>> proper) deal with the *last* known info about relays.
> >> ernie
> >> https://gitweb.torproject.org/metrics-db.git/plain/doc/manual.pdf
> >> (didn't find db/tordir.sql mentioned in the pdf)
>
> That file lives here now:
>
> https://gitweb.torproject.org/metrics-web.git/tree/modules/legacy/db/tordir.sql
>
> A better schema might be the following one, though. It's smaller, but
> it's better documented:
>
> https://gitweb.torproject.org/exonerator.git/tree/db/exonerator.sql
>
>> "Instructions for setting up relay descriptor database"
>> https://lists.torproject.org/pipermail/tor-dev/2010-March/001783.html
>
> That's five years old. I'd say ignore that one.
>
>> "Set up descriptor database for other researchers"
>> https://trac.torproject.org/projects/tor/ticket/1643
>
> Also five years old. Better ignore.
>
> Hope that helps.
>
> All the best,
> Karsten
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
> Comment: GPGTools - http://gpgtools.org
>
> iQEcBAEBAgAGBQJVL9rcAAoJEJD5dJfVqbCrFZgIAIEv/Yi4sNoa8clYVAxuk0Sh
> FFbRDT0kLs19t/DgTwUtB6jD4Lh0akMc806AaIFgfCdL+QwcG0llBfZnSsrbszoH
> Xoi226PRx9lPITrA7KYds4PUZfqIqg3ECpNsKNa4PLB7SlQdNfJQ1wDngcwu2CrF
> Hk+zHbu0gfSkfZRBqxt5aJLTFXR0aBYybF4d6sPJ4OW5Al2U8r9DYysXc0xALvwq
> bvEDFctV1wkDgA3mP3guRrXImXYT1AQPFFlz0TR1eBruuSJBiPKIv7Fs/ocns4aR
> OhxIEaKBaAO+HkvyxDcZ1ukXldR13s3MUPD0XvvZ8xQRCBZpNMygqTMi6pIjTN4=
> =a0Nb
> -----END PGP SIGNATURE-----

_______________________________________________
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
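For readers who want a feel for what such a schema involves without opening the linked SQL files, here is a deliberately minimal sketch in SQLite. Table and column names are invented for illustration - they are not taken from tordir.sql or exonerator.sql:

```python
import sqlite3

# Hypothetical minimal schema for storing relay descriptors; the real,
# much richer schemas live in metrics-web's tordir.sql and exonerator.sql.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE descriptor (
        digest      TEXT PRIMARY KEY,   -- SHA-1 digest of the descriptor
        fingerprint TEXT NOT NULL,      -- relay identity fingerprint
        nickname    TEXT,
        published   TEXT NOT NULL,      -- UTC timestamp, ISO 8601
        or_address  TEXT NOT NULL,
        raw         BLOB                -- full descriptor body, for display
    )
""")
# Lookups are almost always "by fingerprint over a date range".
conn.execute("CREATE INDEX desc_fpr ON descriptor (fingerprint, published)")

conn.execute(
    "INSERT INTO descriptor VALUES (?, ?, ?, ?, ?, ?)",
    ("d0" * 20, "A" * 40, "moria1", "2015-04-16 12:00:00", "128.31.0.34", b""),
)
row = conn.execute(
    "SELECT nickname FROM descriptor WHERE fingerprint = ?", ("A" * 40,)
).fetchone()
print(row[0])  # moria1
```

The `raw` blob column reflects the point made later in this thread: keeping full descriptor copies is what makes the database big, but it is also what lets a service show users the actual descriptors behind a result.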
[tor-dev] [GSoC] BridgeDB Twitter Distributor report
Progress/activities since last time:

* incorporating BridgeRequests together with an initial bridge request API over JSON (it's easier to do both, as they are tightly related). The bridge request API is based on isis' initial fix/12029-dist-api_r1;
* a bogus server-side bridge provider that implements the JSON API: just something that gives fake bridges based on the request (which is handled/contained in BridgeRequest). (Will have server-side code real soon now - I had hoped to have it by now.)
* my churn_rewrite could probably make use of bridgedb's current approach to pickled storage. (It's also worth switching to twisted.spread.jelly, mostly for security.)
* experimenting with sending images over Twitter DMs. The Twitter API does not support images in DMs, but the web client as well as various mobile apps support attaching images to DMs (images end up in the Twitter CDN, served over ton.twitter.com, which is good). Some progress here: the web client's DM send requests (where image files can be attached) are contained; the bot should be able to send images in DMs soon, emulating a normal web user agent (but using the two Twitter APIs for all other activities and DMs).
* once BridgeRequests + the request API (client-side + my mock server-side thing) are done, the bot will have approached a not-far-from-functional state.

Apologies for the late report.

--
Kostas.
0x0e5dce45 @ pgp.mit.edu
[tor-dev] [GSoC] BridgeDB Twitter Distributor report
Hi all,

preferring existing code over shiny code, and being mad late, I

* (re)wrote a simple but working churn control mechanism[1], which uses a general persistable storage system:
  * in particular, the bot now has a central storage controller which takes care of storage handlers, which in turn may be of different varieties. Each variety knows how to handle its own kind of storage containers (simple objects with data as attributes). Some of them may be persistable, others necessarily ephemeral (wipe data on close);
  * right now we only make use of simple pickle-dump-to-file-and-gzip persistable storage; we use it for churn control and for challenge responses; everything is self-contained, so to speak;
  * we hash the user Twitter handles (unique usernames / screen names) and round up bridges-last-given-at timestamps;
  * we handle bot shutdown by catching the appropriate signal (then properly closing down the Twitter stream listener and asking the storage controller to close down the handlers);
  * we use the storage system in the core bot via a general bot state object (which is itself oblivious to how storage is actually implemented);
* wrote a simple and generic challenge-response system[2] (which makes use of the persistent storage):
  * instead of doing something very smart, we use a general CR system which takes care of particular challenge-responses; the general CR is usable as-is; the particular CR objects can be easily subclassed (and that's what we do now);
  * the current mock/bogus CR system that is in place (for testing etc.) is a naive text-based question-answer CR, which asks users to add the number of characters in their Twitter username to a given verbal/English-word number;
* I should now finish up with ``BridgeRequest``s, which are the proper way to handle bridge requests in the bot while doing challenge-responses (the current interaction between the core bot and the CR system will lead / has been leading nowhere);
* also, there's the question of whether the cached (and hashed) answers to CRs should be persisted to storage (in case the bot gets shut down while some challenges are pending) in the first place.

I've been unable to find[3] or come up with a concept of a user-friendly *text-based* CR that would stand against any kind of thief who is able to create lots of Twitter users and to write twenty-line scripts solving any text-based challenges/questions presented. Either it will be a problem so difficult that it is more easily solved by a computer than by a human (hence unfeasible general-UX-wise), or it will be so symmetrical that one only has to view the source (if even that) to come up with a script trivially solving the challenge presented. Hence I've been slowly moving on with the captcha-over-Twitter-direct-messages idea, which is not pretty, but which would at least ensure that we don't give up bridges more easily than in, say, the current IPDistributor.

[1]: https://github.com/wfn/twidibot/compare/master...churn_rewrite
[2]: https://github.com/wfn/twidibot/compare/churn_rewrite...simple_cr2
[3]: it's quite hard to find anything of use in the chatroom-problem / text-based challenge-response area. Basically, it would be great to have a reverse Turing test[4] that is not about captcha/OCR. I realize this is in itself a very ambitious topic.
[4]: some context on early CAPTCHAs / precursors (I have been trying to familiarize myself with the general area): http://www2.parc.com/istl/projects/captcha/docs/pessimalprint.pdf

--
Kostas.
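The hash-the-handles-and-round-the-timestamps idea above can be sketched roughly like this. This is a toy illustration, not the bot's actual code - the constants, function names, and in-memory dict are all invented; the real implementation lives in the churn_rewrite branch linked in [1]:

```python
import hashlib
import hmac
import time

ROUND_TO = 3600 * 3       # granularity of stored "last given at" timestamps
MIN_INTERVAL = 3600 * 24  # hypothetical churn limit: one request per day
SECRET = b"rotate-me"     # HMAC key, so stored handles aren't reversible

_last_given = {}          # hashed handle -> rounded timestamp

def _key(handle):
    # Store only an HMAC of the screen name, never the name itself,
    # mirroring BridgeDB's approach to remembering email addresses.
    return hmac.new(SECRET, handle.lower().encode(), hashlib.sha256).hexdigest()

def may_receive_bridges(handle, now=None):
    """Return True iff this handle hasn't been served bridges recently."""
    now = time.time() if now is None else now
    k = _key(handle)
    last = _last_given.get(k)
    if last is not None and now - last < MIN_INTERVAL:
        return False
    # Round the stored timestamp so it reveals less about the user.
    _last_given[k] = now - (now % ROUND_TO)
    return True

print(may_receive_bridges("alice", now=1000000))       # True
print(may_receive_bridges("alice", now=1000000 + 60))  # False: too soon
```

The `_last_given` dict here stands in for the gzip-pickled persistable storage the report describes; only the hashed keys and coarsened timestamps would ever hit disk.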
0x0e5dce45 @ pgp.mit.edu
[tor-dev] GSoC: BridgeDB Twitter Distributor report
Hi all,

in the past couple of weeks I've been doing more of the same - namely:

* fleshing out churn control in the bot;
* finishing a generic challenge-response system (I'm also now considering making it into a Zope Interface);
* a subclassed text-based challenge-response;
* incorporating isis' IRequestBridges and BridgeRequests into the bot's bridge request processing part;
* a fake bridge-line-from-descriptor generator within the bot (didn't really do much re: the latter).

Unfortunately, all those parts are not yet ready for redeployment of the bot, and are either buggy or not finished (inclusion/use of BridgeRequests). This is partly due to me having a bit less time in the last two weeks (a fault of my own; on the plus side, I've learned to use a soldering iron properly!) My plan is to finish the things that are near completion, and do another midterm-worthy status update very soon. I'll also be present during the developer meeting hackdays (2nd-4th), and hope to use them to flesh out ideas, etc. with isis/sysrqb.

--
Kostas.
0x0e5dce45 @ pgp.mit.edu
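For readers following the generic-CR-system thread through these reports, a minimal sketch of what such an interface might look like. Class and method names here are invented (the real code is in the simple_cr2 branch mentioned earlier); a Zope Interface version would declare the same methods via `zope.interface.Interface` instead of an abstract base class:

```python
from abc import ABC, abstractmethod

class ChallengeResponse(ABC):
    """Generic CR: subclasses supply the challenge and check answers."""

    @abstractmethod
    def make_challenge(self, user_id):
        """Return a challenge string to send to the user."""

    @abstractmethod
    def check_answer(self, user_id, answer):
        """Return True iff the answer solves this user's challenge."""

class CharCountCR(ChallengeResponse):
    # Mock text-based CR, like the one described in the reports: add the
    # number of characters in your username to a given spelled-out number.
    NUMBER = 7
    WORDS = {7: "seven"}

    def make_challenge(self, user_id):
        return ("add the number of characters in your username to "
                + self.WORDS[self.NUMBER])

    def check_answer(self, user_id, answer):
        try:
            return int(answer) == len(user_id) + self.NUMBER
        except ValueError:
            return False

cr = CharCountCR()
print(cr.check_answer("alice", "12"))  # True: len("alice") + 7 == 12
```

The point of the split is exactly what the report says: the bot talks only to the `ChallengeResponse` interface, so the naive text CR can later be swapped for a CAPTCHA-over-DM implementation without touching the core bot.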
[tor-dev] GSoC: BridgeDB Twitter Distributor report
Hey all,

in the past weeks I've been working on understanding what can be done using the Twitter APIs and their media support in the Twitter CDN (for a later captcha implementation), as well as on improving my existing Twitter bridge distributor bot PoC. I've written some broken code, but it's alright. More details below.

Distributor bot improvements included working on a churn rate control mechanism which securely stores Twitter user IDs (with code and design ideas from BridgeDB's HMAC approach to remembering e.g. email addresses in the EmailDistributor), and implementing a (mostly) bogus text-based challenge-response system. (The latter is mostly so that we have a generic design for doing challenge-responses in this distributor - we'll be able to replace it later on with a decent CAPTCHA, for example. It's just nice to have a generic system and a thing for testing out the bot, etc.)

I've also looked into using isis' new and shiny BridgeRequest objects to process user (well) 'bridge requests' in a non-hacky way; this should also eventually result in a bridge request syntax compatible with (a subset of) GetTor commands. But I still need to figure out the best way to use BridgeRequests, so nothing interesting to show yet.

TODO:

* (still yet to) summarize a nice meeting I've had with sysrqb and isis. No definite conclusions were reached, but there were (iirc) some nice ideas about a generic BridgeDB API that could be used by third-party components, etc. (i.e. it might be worth pursuing even if the Social Distributor is to be implemented at some later point);
* clean up my mess, test that new code doesn't fail, and push new things onto https://github.com/wfn/twidibot/ (the current (old) code there does work, if anyone's curious to run it);
* figure out BridgeRequests and the new IRequestBridges (ha!) interface, and use these in the Twitter bot;
* be able to 'serve' the bot fake bridge data so it could process it in a way that may be compatible with a future BridgeDB API (i.e., hopefully this bot will be able to run as a third-party thing, separate from core bridgedb - this is hopefully how future distributors will/should work). This way the bot will be more/actually 'realistic' in the way it serves the current bogus bridge lines to users. (I thought I'd have this by now, but I don't. Hrm.);
* continue looking into captcha systems modulo what can be used in the Twitter context;
* look into bridgedb buckets and what I can help re: them, so the bridgedb API could happen sooner rather than later. (Old todo list item, did not yet touch it.)

All in all: I need to write more non-broken code and fewer words, and just continue with the current bot. Have a nice day/night/thing!

--
Kostas.
0x0e5dce45 @ pgp.mit.edu
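The bogus-bridge-line idea from the TODO list is easy to mock up. Here is a hypothetical generator: the plain `Bridge [transport] addr:port fingerprint` line shape follows standard torrc bridge-line syntax, but everything else - the function name, the omission of transport-specific arguments such as obfs4's `cert=` - is illustrative only:

```python
import random

def fake_bridge_line(transport=None, rng=random):
    # Generate a syntactically plausible (but entirely fake) bridge line,
    # for exercising the bot without talking to a real BridgeDB.
    ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
    port = rng.randint(1024, 65535)
    fingerprint = "".join(rng.choice("0123456789ABCDEF") for _ in range(40))
    parts = ["Bridge"]
    if transport:
        parts.append(transport)
    parts += ["%s:%d" % (ip, port), fingerprint]
    return " ".join(parts)

# Seeding the RNG makes test output reproducible.
line = fake_bridge_line("obfs4", rng=random.Random(1))
print(line)
```

A generator like this slots in behind the same code path a future BridgeDB API client would use, which is the "realistic bogus bridge lines" point made above.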
Re: [tor-dev] Python ExoneraTor
Hi all!

On Mon, Jun 9, 2014 at 10:22 AM, Karsten Loesing <kars...@torproject.org> wrote:
> On 09/06/14 01:26, Damian Johnson wrote:
>> Oh, and another quick thought - you once mentioned that a descriptor
>> search service would make ExoneraTor obsolete, and in looking it over
>> I agree. The search functionality ExoneraTor provides is trivial. The
>> only reason it requires such a huge database is because it's storing
>> a copy of every descriptor ever made. I suspect the actual right
>> solution isn't to rewrite ExoneraTor at all, but rather develop a new
>> service that can be queried for this descriptor data. That would make
>> for a *much* more worthwhile project. ExoneraTor? Nice to have.
>> Descriptor archive service? Damn useful. :)
>
> I agree, that was the idea behind Kostas' GSoC project last year. And
> I still think it's a good idea. It's just not trivial to get right.

Indeed, not trivial at all! I'll use this space to mention the running metrics archive backend, modulo ExoneraTor stuff / what could be sorta-relevant here.

fwiw, the onionoo-like backend is still running at an obscure address:port: http://ts.mkj.lt:/

TL;DR - for what can be done with it, look at: https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md

In particular, regarding ExoneraTor-like queries (incl. arbitrary subnet / part-of-ip lookups): https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exonerator-type-relay-participation-lookup

Not sure if it's worth discussing all the weaknesses of this archive backend in this thread, but the short relevant version is that the ExoneraTor-like functionality does mostly work, but I would need to look into it to see how reliable the results are (is this relay ip address field really the one we should be using?, etc.) But what's nice is that it is possible to do arbitrary queries on all consensuses since ~2008, with no date specified (if you don't want to). (Which is to say, it's possible - not that this is necessarily the right way to solve the problems in this thread.)

So e.g. this is the ip address where moria runs, and we want to see what relays have ever run on it:

http://ts.mkj.lt:/details?search=128.31.0.34

Take the fingerprint of the one that is currently running (moria1), and look up its last 500 statuses (in a kind of condensed/summary form):

http://ts.mkj.lt:/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE31&condensed=true

from/to date ranges can be specified as e.g. 2009, 2009-02, 2009-02-10, 2009-02-10 02:00:00. limit/offset/other parameters are specified here: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md

(Descriptors/digests aren't currently included (I think they used to be), but they can be, etc.)

The point is probably mostly that this is some evidence that it can be done. (But there are nuances, things are imperfect, time is needed, etc.) The question really is regarding the actual scope of this rewrite, I suppose. I'd probably agree with Karsten that just doing a port of the ExoneraTor functionality as it currently is on exonerator.torproject.org would be the safe bet. See how that goes, venture into more exotic lands later on maybe, etc. (That doesn't mean that I wouldn't be excited to put the current backend to good use, and/or use the knowledge I gained to help you folks in some way!)

> Regarding your comment about storing a copy of every descriptor ever
> made, I believe that users trust ExoneraTor's results more if they see
> the actual descriptors that lead to results. Of course, I'm saying
> that without knowing what ExoneraTor users actually want. But let's
> not drop descriptor copies from the database easily.
>
> And, heh, when you say that the search functionality ExoneraTor
> provides is trivial, a little part of me is dying. It's the part that
> spent a few weeks on getting the search functionality fast enough for
> production. That was not at all trivial. The oraddress24, oraddress48,
> and exitaddress24 fields as well as the indexes are the result of me
> running lots and lots of sample queries and wondering about Postgres'
> EXPLAIN ANALYZE results. Just saying that it's not going to be trivial
> to generalize the search functionality towards other fields than IP
> addresses and dates.

Hear hear, I can only imagine! These things and the ExoneraTor stuff are not easy to do in a way that provides **consistently** good/great performance. I spent some days last summer also looking at EXPLAIN ANALYZE results (it was a great feeling to start to understand what they mean and how I can make them better), but eventually things start making sense. (And when they do, I also get that same feeling that NoSQL stuff doesn't magically solve things.)

> If others want to follow, here's the SQL code I'm talking about:
>
> https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql
>
> So, I'm happy to talk about writing a searchable descriptor archive.
> It could _start_ with
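For readers following along: the oraddress24/oraddress48 fields mentioned above work by storing the truncated network prefix of each address in its own indexed column, so "other addresses in the same /24 (or /48)" becomes a plain indexed equality lookup instead of a range scan. A sketch of just the truncation step - illustrative Python, not the exonerator.sql implementation:

```python
import ipaddress

def oraddress24(addr):
    # First 24 bits of an IPv4 address: the value an indexed
    # "oraddress24"-style column would store next to the full address.
    net = ipaddress.ip_network("%s/24" % addr, strict=False)
    return str(net.network_address)

def oraddress48(addr):
    # First 48 bits of an IPv6 address, same idea.
    net = ipaddress.ip_network("%s/48" % addr, strict=False)
    return str(net.network_address)

# Two addresses in the same /24 map to the same key, so a same-subnet
# query is a single equality match on the truncated column.
print(oraddress24("128.31.0.34"))  # 128.31.0.0
print(oraddress24("128.31.0.99"))  # 128.31.0.0
```

In the database, the truncated value is precomputed at import time; the index on it is what makes the subnet searches fast enough for production.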
Re: [tor-dev] Python ExoneraTor
On Tue, Jun 10, 2014 at 10:38 AM, Karsten Loesing <kars...@torproject.org> wrote:
> On 10/06/14 05:41, Damian Johnson wrote:
>>> let me make one remark about optimizing Postgres defaults: I wrote
>>> quite a few database queries in the past, and some of them perform
>>> horribly (relay search) whereas others perform really well
>>> (ExoneraTor). I believe that the majority of performance gains can
>>> be achieved by designing good tables, indexes, and queries. Only as
>>> a last resort should we consider optimizing the Postgres defaults.
>>> You realize that a searchable descriptor archive focuses much more
>>> on database optimization than the ExoneraTor rewrite from Java to
>>> Python (which would leave the database untouched)?
>>
>> Are other datastore models such as Splunk or MongoDB useful? [Splunk
>> has a free yet proprietary limited binary... those having historical
>> woes and takebacks, mentioned just for example here.] Earlier I
>> mentioned the idea of Dynamo. Unless I'm mistaken this lends itself
>> pretty naturally to addresses as a hash key, and descriptor dates as
>> the range key. Lookups would then be O(log(n)) where n is the total
>> number of descriptors an address has published (... that is to say
>> very, very quick). This would be a fun project to give Boto a try.
>> *sigh*... there really should be more hours in the day...
>
> Quoting my reply to Damian to a similar question earlier in the
> thread: "I'm wary about moving to another database, especially NoSQL
> ones and/or cloud-based ones. They don't magically make things faster,
> and Postgres is something I understand quite well by now. [...] Not
> saying that DynamoDB can't be the better choice, but switching the
> database is not a priority for me." If somebody wants to give, say,
> MongoDB a try, I'd be interested in seeing the performance comparison
> to the current Postgres schema. When you do, please consider all three
> search_* functions that the current schema offers, including searches
> for other IPv4 addresses in the same /24 and other IPv6 addresses in
> the same /48.

Personally, the only NoSQL thing I've come across (and have had some really good experiences with in the past) was Redis, which is a kind of in-memory key-value store with some nice simple data structures (like sets, and operations on sets - so if you can reduce your problem to, e.g., sets and set operations, Redis might be a good fit). (I think isis is actually experimenting with Redis right now, for prop 226-bridgedb-database-improvements.txt.)

If the things that you store in Redis can't be made to fit into memory, you'll probably have a bad time. So to generalize: if some relational data which needs to be searchable can be made to fit into memory (we can guarantee it wouldn't exceed x GB [for t time]), offloading that part onto some key-value (or some more elaborate) system *might* make sense.

Also, I mixed up the link in footnote [2]. It should have linked to this diagnostic Postgres query: https://github.com/wfn/torsearch/blob/master/misc/list_indexes_in_memory.sql

--
regards
Kostas
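To make the "sets and operations on sets" point concrete, here is the pattern in miniature, with plain Python dict-of-sets standing in for Redis keys. With a real Redis one would issue SADD/SINTER through a client library against a running server; the flag-index example is invented for illustration:

```python
# Emulate the Redis set pattern in memory: each key names a set,
# sadd() is Redis SADD, sinter() is Redis SINTER.
store = {}

def sadd(key, *members):
    store.setdefault(key, set()).update(members)

def sinter(*keys):
    sets = [store.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()

# E.g. index relay fingerprints by flag, then ask for relays
# carrying both flags with a single set intersection.
sadd("flag:Exit", "FPR1", "FPR2")
sadd("flag:Stable", "FPR2", "FPR3")
print(sorted(sinter("flag:Exit", "flag:Stable")))  # ['FPR2']
```

The caveat in the paragraph above applies directly: this only wins if the whole index fits in memory, since both Redis and this toy keep everything resident.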
Re: [tor-dev] Introducing CollecTor (was: Spinning off Directory Archive from Metrics Portal)
On Fri, Jun 6, 2014 at 1:18 PM, Philipp Winter <p...@nymity.ch> wrote:
> On Wed, Jun 04, 2014 at 04:54:03PM +0200, Karsten Loesing wrote:
>> On 25/05/14 10:35, Karsten Loesing wrote:
>>> I'm continuously tweaking the Metrics Portal [0] in the attempt to
>>> make it more useful. My latest idea is to finally spin off the
>>> Directory Archive part from it, which is the part that serves
>>> descriptor tarballs.
>>
>> Ta-da! === https://collector.torproject.org/ ===
>
> New website! Looks great!

Seconded - very awesome indeed! I added the service to: https://trac.torproject.org/projects/tor/wiki/org/operations/Infrastructure

>> - Recently published descriptors can now be accessed much more
>>   easily: https://collector.torproject.org/recent/
>
> That's a very useful feature.

Am I right to assume that any service/program/client that relied on rsyncing the recent/ folder from metrics should migrate to using https://collector.torproject.org/recent/ ?

One thing that's neat with rsync is that it can take care of any lapses in service (on either the metrics data backend side, or on the client-which-is-downloading-the-data side) - it will just automagically mirror all the consensuses (if this is needed by the client/program/etc.) Of course, it's very easy to just make the client check if it has any lapses/holes in its (historical) view of the needed data, and to make it re-download (wget, whatever) the missing parts as needed. Just wanted to make sure there'll be no rsync-recent-metrics-data service any more (correct me if I got this wrong.)

>> - Preliminary logo suggested by Jeroen and very quickly put together:
>>   https://people.torproject.org/~karsten/volatile/collector-logo.png
>>   -- if you're a graphic designer and want to contribute one hour of
>>   your time to design that for real, please contact me!
>
> Hmm, that seems to be the octopus which is part of USA-247's logo:
> http://en.wikipedia.org/wiki/USA-247

Quite sure this was some cheeky intended satire :) Really like the logo, btw ;)

> Hopefully, somebody can contribute a better one.
>
> Cheers,
> Philipp

Kostas
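The check-for-holes idea mentioned above is indeed easy to script on the client side: consensuses are published hourly, so a local mirror is complete iff every hourly filename in the range is present. A sketch, assuming the `YYYY-MM-DD-HH-00-00-consensus` naming used in the descriptor archives (adjust to whatever the client actually stores):

```python
from datetime import datetime, timedelta

def expected_consensuses(start, end):
    # One consensus per hour, named like "2014-06-06-13-00-00-consensus".
    t = start
    while t <= end:
        yield t.strftime("%Y-%m-%d-%H-00-00-consensus")
        t += timedelta(hours=1)

def missing(have, start, end):
    # Return the filenames a client still needs to re-download
    # (via wget against /recent/, or from the monthly tarballs).
    return [name for name in expected_consensuses(start, end)
            if name not in have]

have = {"2014-06-06-00-00-00-consensus", "2014-06-06-02-00-00-consensus"}
gaps = missing(have, datetime(2014, 6, 6, 0), datetime(2014, 6, 6, 2))
print(gaps)  # ['2014-06-06-01-00-00-consensus']
```

This recovers the self-healing property rsync gave for free: run the check after each fetch, and re-download only the gaps.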
[tor-dev] New BridgeDB Distributor (was: Re: New BridgeDB Distributor (Twitter/SocialDistributor intersections, etc.))
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

With isis' and sysrqb's permission, moving the new BridgeDB Distributor (and maybe general bridgedb distributor architecture discussion) thread onto tor-dev@.

On 04/15/2014 10:30 PM, Kostas Jakeliunas wrote:
> On 03/29/2014 10:08 AM, Matthew Finkel wrote:
>> (I took the liberty of making this readable again :))
>>
>> On Fri, Mar 28, 2014 at 08:00:17PM +0200, Kostas Jakeliunas wrote:
>>> isis wrote:
>>>> Kostas Jakeliunas transcribed 7.9K bytes:
>>>>> Hey isis, wfn here. [...]
>>>>
>>>> Hi! Howdy! I'm super excited to hear you're interested in working
>>>> on this! [...]
>>>
>>> [...] a couple of questions (more like inconcrete musings) [...]:
>>> Would you personally think that incorporating some ideas from
>>> #7520[1] (Design and implement a social distributor for BridgeDB)
>>> would be within the scope of a ~three+ month project? The way I see
>>> it, if a twitter (or, say, xmpp+otr as mentioned by you/others on
>>> IRC) distributor were to be planned, it would either need to
>>> - incorporate some form of churn rate control / Sybil attack
>>> prevention, via e.g. recaptcha (I see that twitter direct
>>> (=personal) messages can include images; they'll probably be served
>>> by one of twitter's media CDNs (would need to look things up), but
>>> it's probably safe to assume that as long as twitter itself is not
>>> blocked, those CDNs won't be, either);
>>>>
>>>> Yes, this stuff is already built, and wouldn't be too hard to
>>>> incorporate. However, as I'm sure you already understand, there is
>>>> no Proof of Work system which actually works for users while
>>>> keeping adversaries out.
>>>
>>> For sure, we always have to keep this in mind. Hopefully there's a
>>> compromise that kinda-works, and eventually, given some more
>>> metrics/diagnostic info intersected with OONI hopefully being able
>>> to say which bridges don't work from which countries, it'll be
>>> possible to actually carry out tests in a
>>> kind-of-scientific/not-blind-guessing way..
>>
>> At this point I just assume our adversary will always have more
>> resources than us no matter which mechanism we use. More people, more
>> compute power/time, more money. At this point I think we only have
>> two things that they don't. We have more bridges and more love for
>> people. Leveraging this is... not easy, however. :( POW is useful in
>> some cases, for example, to prevent an asshole from crawling bridgedb
>> so that they can add all bridges to a blacklist. When dealing with
>> state-level adversaries I agree with isis that they're of little use.
>
> Agree.
>
>>> - or take an idea from the social distributor in #7520,
>>> namely/probably, implement some form of token system.
>>>>
>>>> This is not very doable in 6 weeks. It also, sadly, requires the DB
>>>> backend work (which I'll be doing over the next three months, but
>>>> might take more time).
>>>
>>> Aha, understood, yes. So basically, ideally I'd write code that
>>> could *later on* be easily extendable in relevant ways. But no
>>> tokens for now.
>>
>> Ideally this sounds like a good idea, however I'm not sure we (or at
>> least I) have a good handle on what bridgedb will look like in 6-12
>> months. It's undergoing a lot of change right now. Don't interpret
>> this as saying this is a bad idea, because the more abstract and
>> extensible you make this distributor, the more useful it will be. I'm
>> just a little worried about writing something for the future. Perhaps
>> there's a good way to design and plan for this, though.
>
> Yeah, understood. As I understand it, isis is changing some things in
> bridgedb (bridgedb.Distributor, etc) right now / these days. For now,
> the idea is to have a thing that works that is more or less completely
> decoupled from the bridgedb codebase. If we do this right, it will
> hopefully be relatively easy to then integrate it in a way that will
> make sense at that point in time (e.g. as part of
> bridgedb.Distributor, *or* as a client to a core RESTful
> distributor/api/service that gives bridges to other 'third-party'
> distributors (see below.)) It might be possible to have some
> simplistic token system with pre-chosen seed nodes, etc.
>
> Of course, security and privacy implications ahoy - first and
> foremost, this would result in more than zero places/people knowing
> the entire social graph, unless your and other people's ideas (the
> whole Pandora's box of; I should attempt an honest read of rBridge,
> et al.; have only skimmed as of now) re: oblivious transfer, etc. were
> incorporated. Here it becomes quite difficult to define short-ish term
> deliverables, of course. I know that you did quite a lot of research
> on the private/secure social distributor idea.
>>
>> Really, you don't want to get into this stuff. Or do, but don't do it
>> for GSoC. I've spent the past year painfully writing proofs to
>> correct the errors in that paper, and discovered some major problems
>> for anonymity in old tried-and-true cryptographic primitives in the
>> process. This is a HUGE project.
>
> Sounds insanely intense, in both a good and a bad way! It's
[tor-dev] GSoC: BridgeDB Twitter Distributor
Hi all,

I'm excited to be able to spend another summer-of-code together with Tor (how impudent!) :) My name is Kostas (wfn on OFTC); my primary mentor is isis and my secondary mentor is sysrqb. I'll be working on writing a new BridgeDB Distributor[1].

I've set my primary task to designing and implementing a Twitter distributor bot (see proposal[2]): a Twitter bot answers personal (direct) messages, does rate control if needed, and gives bridge lines to users. There should be enough time to at least start on another distributor (right now I'm thinking about an XMPP-based one, as the federated nature of the network allows for some neat censorship circumvention approaches.) But there's also value in implementing a generic/core distributor that could give bridges to third-party distribution systems over a (say) RESTful API. We'll see how things go, but the core task for now is a Twitter-based distributor.

For further ideas, discussion, etc., see a separate tor-dev@ thread: https://lists.torproject.org/pipermail/tor-dev/2014-April/006742.html

Ideas are very much welcome indeed!

[1]: https://www.torproject.org/getinvolved/volunteer.html.en#newBridgedbDistributor
[2]: http://kostas.mkj.lt/gsoc2014/gsoc2014.html

--
Kostas.
0x0e5dce45 @ pgp.mit.edu
Re: [tor-dev] Incorporating your torsearch changes into Onionoo
On Wed, Oct 23, 2013 at 2:32 PM, Karsten Loesing <kars...@torproject.org> wrote:
> Oops! Sorry for the delay in responding! Responding now.
>
> On 10/11/13 4:05 PM, Kostas Jakeliunas wrote:
>> On Fri, Oct 11, 2013 at 12:00 PM, Karsten Loesing
>> <kars...@torproject.org> wrote:
>>> Hi Kostas,
>>>
>>> should we move this thread to tor-dev@?
>>
>> Hi Karsten! Sure.
>>
>>> From our earlier conversation about your GSoC project: in
>>> particular, we should discuss how to integrate your project into
>>> Onionoo. I could imagine that we:
>>>
>>> - create a database on the Onionoo machine;
>>> - run your database importer cronjob right after the current
>>>   Onionoo cronjob;
>>> - make your code produce statuses documents and store them on disk,
>>>   similar to details/weights/bandwidth documents;
>>> - let the ResourceServlet use your database to return the
>>>   fingerprints to return documents for; and
>>> - extend the ResourceServlet to support the new statuses documents.
>>>
>>> Maybe I'm overlooking something and you have a better plan? In any
>>> case, we should take the path that implies writing as little code
>>> as possible to integrate your code in Onionoo. Let me know what you
>>> think!
>>
>> Sounds good. Responding to particular points:
>>
>>> - create a database on the Onionoo machine;
>>> - run your database importer cronjob right after the current
>>>   Onionoo cronjob;
>>
>> These should be no problem and make perfect sense. It's always best
>> to use raw SQL table creation routines to make sure the database
>> looks exactly like the one on the dev machine, I guess (cf. using
>> SQLAlchemy abstractions to do that (I did that before)). The current
>> SQL script to do that is at [1].
>
> I'll look over it.
>
>> For example, I'd (still) like to generate some plots showing the
>> chances of two fingerprints having the same substring (this is for
>> the intermediate fingerprint table.) (One axis would be substring
>> length, the other would be the probability in (portions of) %.) As
>> of now, we still use substr(fingerprint, 0, 12), and it is reflected
>> in the schema. Overall though, no particular snags here.
>
> I don't follow.
>
> But before we get into details here, I must admit that I was too
> optimistic about running your code on the current Onionoo machine. I
> ran a few benchmark tests on it last week to compare it to new
> hardware, and those tests almost made it fall over. We should not even
> think about adding new load to the current machine.
>
> New plan: can you run an Onionoo instance with your changes on a
> different machine? (If you need anything from me, like a tarball of
> the status/ and out/ directories, I'm happy to provide them to you.)
> I think we should run this instance for a while to see how reliable
> it is. And once we're confident enough, we'll likely have new
> hardware for the new Onionoo, so that we can move it there.

This sounds like a very good idea. Ok, I can try and do this. Sorry for delaying my response as well; I'll try and follow up with what I need (if anything).

>>> - make your code produce statuses documents and store them on disk,
>>>   similar to details/weights/bandwidth documents;
>>
>> Right, so if we are planning to support all V3 network statuses for
>> all fingerprints, how are we to store all the status documents? The
>> idea is to preprocess and serve static JSON documents, correct (as
>> in the current Onionoo)? (cf. the idea of simply caching documents:
>> if we serve a particular status document, it gets cached, and
>> depending on the query parameters (date range restriction, e.g.) it
>> may be set not to expire at all.) Or should we try and actually
>> store all the statuses (the condensed status document version [2],
>> of course)?
>
> Let's do it as the current Onionoo does it. This code does not exist,
> right?

I've done some small testing on a local system; it seems the Onionoo way is plausible, since the generation of all the old(er) status etc. documents needs to happen only once. (Obviously - but now I understand this means the number of resulting status documents and their size is not such a big deal after all.) I don't have good code for it as of yet.

>>> - let the ResourceServlet use your database to return the
>>>   fingerprints to return documents for; and
>>> - extend the ResourceServlet to support the new statuses documents.
>>
>> Sounds good. I assume you are very busy with other things as well,
>> so ideally maybe you had in mind that I could try and do the Java
>> part? :) Though, since you are much more familiar with (your own)
>> code, you could probably do it faster than me. Not sure. Any
>> particular technical issues/nuances here (re: ResourceServlet)?
>
> Can you give it a try? Happy to help with specific questions about
> ResourceServlet, and I'll try hard to reply faster this time. Again,
> sorry for the delay!

Okay! I've been tinkering a bit, actually. Will see if I can produce something decent and reliable.

Best wishes,
Kostas.

[1]: https://github.com/wfn/torsearch/blob/master/db/db_create.sql
[2
Re: [tor-dev] Incorporating your torsearch changes into Onionoo
On Fri, Oct 11, 2013 at 12:00 PM, Karsten Loesing kars...@torproject.org wrote: Hi Kostas, should we move this thread to tor-dev@? Hi Karsten! Sure. From our earlier conversation about your GSoC project: In particular, we should discuss how to integrate your project into Onionoo. I could imagine that we: - create a database on the Onionoo machine; - run your database importer cronjob right after the current Onionoo cronjob; - make your code produce statuses documents and store them on disk, similar to details/weights/bandwidth documents; - let the ResourceServlet use your database to return the fingerprints to return documents for; and - extend the ResourceServlet to support the new statuses documents. Maybe I'm overlooking something and you have a better plan? In any case, we should take the path that implies writing as little code as possible to integrate your code into Onionoo. Let me know what you think! Sounds good. Responding to particular points: - create a database on the Onionoo machine; - run your database importer cronjob right after the current Onionoo cronjob; These should be no problem and make perfect sense. It's best to use raw SQL table-creation routines to make sure the database looks exactly like the one on the dev machine (rather than using SQLAlchemy abstractions for that, as I did before). The current SQL script to do that is at [1]. I'll look over it. For example, I'd (still) like to generate some plots showing the chances of two fingerprints having the same substring (this is for the intermediate fingerprint table). (One axis would be substring length, the other the collision probability in %.) As of now, we still use substr(fingerprint, 0, 12), and this is reflected in the schema. Overall though, no particular snags here. 
- make your code produce statuses documents and store them on disk, similar to details/weights/bandwidth documents; Right, so if we are planning to support all V3 network statuses for all fingerprints, how are we to store all the status documents? The idea is to preprocess and serve static JSON documents, correct (as in the current Onionoo)? (cf. the idea of simply caching documents: if we serve a particular status document, it gets cached, and depending on the query parameters (date range restriction, e.g.) it may be set not to expire at all.) Or should we try and actually store all the statuses (the condensed status document version [2], of course)? - let the ResourceServlet use your database to return the fingerprints to return documents for; and - extend the ResourceServlet to support the new statuses documents. Sounds good. I assume you are very busy with other things as well, so ideally maybe you had in mind that I could try and do the Java part? :) Though, since you are much more familiar with (your own) code, you could probably do it faster than me. Not sure. Any particular technical issues/nuances here (re: ResourceServlet)? cheerio Kostas. [1]: https://github.com/wfn/torsearch/blob/master/db/db_create.sql [2]: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md#network-status-entry-documents (e.g. http://ts.mkj.lt:/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE31condensed=true ) ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
Re: [tor-dev] Searchable metrics archive - Onionoo-like API available online for probing
On Mon, Sep 2, 2013 at 2:20 PM, Karsten Loesing kars...@torproject.org wrote: On 8/23/13 3:12 PM, Kostas Jakeliunas wrote: [snip] Hi Kostas, I finally managed to test your service and take a look at the specification document. Hey Karsten! Awesome, thanks a bunch! The few tests I tried ran pretty fast! I didn't hammer the service, so maybe there are still bottlenecks that I didn't find. But AFAICS, you did a great job there! Thanks for doing some poking! There is probably room for quite a bit more parallel benchmarking to be done, but at least in principle (and from what I've observed / benchmarked so far), if a single query runs in good time, it's rather safe to assume that scaling to multiple queries at the same time will not be a big problem. There's always a limit of course, which I haven't yet observed (and which I should be able to / would do well to find, ideally). This is, however, one of the strengths of PostgreSQL in any case: very nice parallel-query scaling. Of course, since the queries are, more or less, always disk-I/O-bound, there could still be hidden sneaky bottlenecks, that is very true for sure. Thanks for writing down the specification. So, would it be accurate to say that you're mostly not touching summary, status, bandwidth, and weights resources, but that you're adding a new fifth resource, statuses? In other words, does the attached diagram visualize what you're going to add to Onionoo? Some explanations: - summary and details documents contain only the last known information about a relay or bridge, but those are on a pretty high detail level (at least for details documents). In contrast to the current Onionoo, your service returns summary and details documents for relays that didn't run in the last week, so basically since 2007. However, you're not going to provide summary or details for arbitrary points in time, right? (Which is okay, I'm just asking if I understood this correctly.) 
(Nice diagram - useful!) Responding to particular points / nuances: summary and details documents contain only the last known information about a relay or bridge, but those are on a pretty high detail level (at least for details documents) This is true: the summary/details documents (just like in Onionoo proper) deal with the *last* known info about relays. That is how it works now, anyway. As per our subsequent IRC chat, we will now assume this is how it is intended to be. The way I see it from the perspective of my original project goals etc., the summary and details (+ bandwidth and weights) documents are meant for Onionoo {near-, full-}compatibility; they must stay Onionoo-like. The new network status document is the old "browse the archives and extract info" part: it is one of the ways of exposing an interface to the whole database (after all, we do store all the flags and nicknames and IP addresses for *all* the network statuses). However, you're not going to provide summary or details for arbitrary points in time, right? (Which is okay, I'm just asking if I understood this correctly.) There is no reason why this wouldn't be possible. (I experimented with new search parameters, but haven't pushed them to master / changed the backend instance that is currently running.) A query involving date ranges could, for example, be something akin to: "get a listing of details documents for relays which match this $nickname / $address / $fingerprint, and which have run (been listed in consensuses dated) from $startDate to $endDate" (this would use the new ?from=.., ?to=.. parameters, which you've mentioned / clarified earlier). As per our IRC chat, I will add these parameters / query options not only to the network status document, but also to the summary and details documents. - bandwidth and weights documents always contain information covering the whole lifetime of a relay or bridge, where recent events have higher detail level. 
Again, you're not going to change anything here besides providing these documents for relays and bridges that are offline for more than a week. - statuses have the same level of detail for any time in the past. These documents are new. They're designed for the relay search service and for a simplified version of ExoneraTor (which doesn't care about exit policies and doesn't provide original descriptor contents). There are no statuses documents for bridges, right? Yes yes. No documents for bridges, for now. I'm not sure of the priority of the task of including bridges - it would sure be awesome to have bridges as well. For now, I assume that everything else should be finished (the protocol, the final scalable database schema/setup, etc.) before embarking on this point. The status entry API point is indeed about getting info from the whole archives, at the same detail level for any portion of the archives. (I should have articulated this / put into a design doc before, but this important nuance
[tor-dev] [GSoC 2013] Status report - Searchable metrics archive
Hello! Updating on my Searchable Tor metrics archive project. (As is very evident) I'm very open to naming suggestions. :) To the best of my understanding and current satisfaction, I solved the database bottlenecks, or at least I am, as of now, satisfied with the current output from my benchmarking utility. Things may change, but I am confident (and have evidence to argue) that the whole thing runs swell at least on Amazon m2.2xlarge instances. For fun and profit, a part of the database (which, for now, holds status entries only in the range [2010-01-01 00:00:00, 2013-05-31 23:00:00]) - namely, what is currently used by the Onionoo-like API - is now available online (not on EC2, though); I will now write a separate email so that everyone can inspect it. I should now move on with implementing / extending the Onionoo API, in particular working on date-range queries and refining/rewriting the "list status entries" API point (see below). I need to carefully plan some things, and always keep an updated API document. (Also need to update and publish a separate, more detailed specification document.) More concrete report points: - re-examined my benchmarking approach, and wrote a rather simple but effective set of benchmarking tools (more like a simple script) [1] that can hopefully be used outside this project as well; at the very least, together with the profiling and query_info tools, it is powerful (but also simple) enough to be used to test all kinds of bottlenecks in ORMs and elsewhere. - used this tool to generate benchmark reports on EC2 and on the (less powerful) dev server, and with different schema settings (usually rather minor schema changes that do not require re-importing all the data). - came up with a triple-table schema that renders our queries quickly: we first do a search (using whatever criteria (e.g. 
nickname, fingerprint, address, running), if any) on a table which has a column with unique fingerprints; extract the relevant fingerprints; JOIN with the main status entry table, which is much larger; and get the final results. Benchmarked using this schema. details If we are only extracting a list of the latest status entries (with distinct on fingerprint), we can do LIMITs and OFFSETs already on the fingerprint table, before the JOIN. This helps us quite a bit. On the other hand, nickname searches etc. are also efficient. As of now, I have re-enabled nickname+address+fingerprint substring search (not from the middle (LIKE %substring%), but from the beginning of a substring (LIKE substring%), which is still nice), and all is well. Updated the higher-level ORM to reflect this new table [2] (I've yet to change some column names, though - but these are cosmetics.) /details - found a way to generate the SQL queries that I need to generate using the higher-level SQLAlchemy SQL API using various SQLAlchemy-provided primitives, and always observing the resulting query statements. This is good, because everything becomes more modular: much easier to shape the query depending on the query parameters received, etc. (while still retaining it in sane order.) - hence (re)wrote a part of the Onionoo-like API that uses the new schema and the SQLAlchemy primitives. Extended the API a bit. [3] - wrote a very hacky API point for getting a list of status entries for a given fingerprint. I simply wanted a way (for myself and people) to query this kind of a relation easily and externally. It now works as part of the API. This part will probably need some discussion. - wrote a (kind of a stub) document explaining the current Onionoo-like API, what can be queried, what can be returned, what kinds of parameters work. [4] Will extend this later on. 
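The two-step search described above (narrow down on the small unique-fingerprint table, paginate there, and only then join into the much larger status-entry table) can be illustrated in miniature with plain Python structures; the names and row shapes here are illustrative, not the actual schema:

```python
def lookup_statuses(fingerprint_table, statusentry_rows,
                    search=None, limit=5, offset=0):
    """Mimic the triple-table plan: match on the small table first using a
    from-the-start prefix match (LIKE 'substring%'), apply LIMIT/OFFSET
    there, then 'join' into the large status-entry table."""
    s = (search or "").lower()
    matched = [fp for fp, nickname in fingerprint_table
               if fp.lower().startswith(s) or nickname.lower().startswith(s)]
    page = set(matched[offset:offset + limit])  # paginate before the join
    return [row for row in statusentry_rows if row["fingerprint"] in page]

# tiny stand-ins for the ~170K-row fingerprint table and ~67M-row status table
fp_table = [("9695DFC35FFE", "gabelmoo"), ("A1B2C3D4E5F6", "moria2")]
statuses = [
    {"fingerprint": "9695DFC35FFE", "validafter": "2013-05-31 23:00:00"},
    {"fingerprint": "A1B2C3D4E5F6", "validafter": "2010-01-01 00:00:00"},
    {"fingerprint": "A1B2C3D4E5F6", "validafter": "2010-01-01 01:00:00"},
]
rows = lookup_statuses(fp_table, statuses, search="moria")
```

In SQL the same shape is a subquery or CTE over the fingerprint table with LIMIT/OFFSET, joined against statusentry; the win is that pagination touches only the small relation.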
while writing the doc and rewriting part of the API, I stumbled upon a few things that make clear that I've made some shortcuts that may hurt later on. Will be happy to elaborate on them later on / separately. I need to carefully plan a few things, and then try rewriting the Onionoo API yet again, this time including more parameters and fields returned. TL;DR: yay, a working database backend! I might give *one* more update soon detailing things I might have forgotten about re: this report - I don't want to make a habit of delaying reports (which I have consistently done), so I'm reporting what I have now. [1]: https://github.com/wfn/torsearch/blob/master/torsearch/benchmark.py [2]: https://github.com/wfn/torsearch/blob/master/torsearch/models.py [3]: https://github.com/wfn/torsearch/blob/master/torsearch/onionoo_api.py [4]: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md -- Kostas (wfn on OFTC) 0x0e5dce45 @ pgp.mit.edu ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive
On Wed, Aug 14, 2013 at 1:33 PM, Karsten Loesing kars...@torproject.org wrote: Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more likely that we can run something requiring this extension on a torproject.org machine. Still, requiring extensions should be the last resort if no other solution can be found. Leaving out searches for nickname substrings is a valid solution for now. Got it. Do you have a list of searches you're planning to support? These are the ones that should *really* be supported: - ?search=nickname - ?search=fingerprint - ?lookup=fingerprint - ?search=address [done some limited testing, currently not focusing on this] The lookup parameter is basically the same as search=fingerprint with the additional requirement that fingerprint must be 40 characters long. So, this is the current search parameter. I agree, these would be good to support. You might also add another parameter ?address=address for ExoneraTor. That should, in theory, be just a subset of the search parameter. Oh yes, makes a lot of sense, OK. By the way: I considered having the last consensus (all the data for at least the /summary document, or /details as well) be stored in memory (this is possible) (probably as a hashtable where key = fingerprint, value = all the fields we'd need to return) so that when the backend is queried without any search criteria, it would be possible to avoid hitting the database (which is always nice), and just dump the last consensus. (There's also caching of course, which we could discuss at a (probably quite a bit) later point.) - ?running=boolean This one is tricky. So far, Onionoo looks only at the very latest consensus or bridge status to decide if a relay or bridge is running or not. But now you're adding archives to Onionoo, so that people can search for a certain consensus or certain bridge status in the past, or they can search for a time interval of consensuses or bridge statuses. 
How do you define that a relay or bridge is running, or more importantly, included as not running? Agree, this is not clear. (And whatever ends up being done, this should be well documented and clearly articulated, of course.) For me at least, 'running' implies the question of whether a given relay/bridge is running *right now*, i.e. whether it is present in the very last consensus. (Here's where that hashtable (with fingerprints as keys) in memory might be able to help: no need to run a separate query / do an inner join / whatnot; it would depend on whether there's a LIMIT involved, though, etc.) I'm not sure which one is more useful (intuitively for me, the "is it running *right now*" reading is more useful). Do you mean that it might make sense to have a field (or have running be it) indicating whether a given relay/bridge was present in the last consensus in the specified date range? If this is what you meant, then the "return all that are/were not running" clause would indeed be kind of... peculiar (semantically - it wouldn't be very obvious what it's doing). Maybe it'd be simpler to first answer: what would be the most useful case? How do you define that a relay or bridge [should be] included as not running? Could you rephrase, maybe? Do you mean that it might be difficult to construct sane queries to check for this condition? Or that the situation where - a from..to date range is specified - ?running=false is specified would be rather confusing ("exclude those nodes which are running *right now*", with 'now' possibly having nothing to do with the date range)? - ?flag=flag [every kind of clause which further narrows down the query is not bad; the current db model supports all the flags that Stem does, and each flag has its own column] I'd say leave this one out until there's an actual use case. Ok, I won't focus on these now; just wanted to say that these should be possible without much ado/problems. 
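The in-memory hashtable idea from this exchange - the last consensus keyed by fingerprint, used both to answer no-criteria queries and to decide "running right now" without hitting the database - might be sketched like this (field names are hypothetical):

```python
class LastConsensusCache:
    """In-memory map of fingerprint -> summary fields from the latest
    consensus, so a no-criteria query or a running-right-now check can
    skip the database entirely."""

    def __init__(self):
        self._relays = {}
        self.valid_after = None

    def update(self, valid_after, entries):
        """Replace the cache whenever a new consensus is imported."""
        self.valid_after = valid_after
        self._relays = {e["fingerprint"]: e for e in entries}

    def is_running(self, fingerprint):
        """'Running' in the running-right-now sense: listed in the very
        last consensus."""
        return fingerprint in self._relays

    def summary(self):
        """Dump the whole last consensus, e.g. for a bare /summary query."""
        return list(self._relays.values())

cache = LastConsensusCache()
cache.update("2013-08-14 13:00:00",
             [{"fingerprint": "A" * 40, "nickname": "moria2"}])
```

This deliberately sidesteps the date-range ambiguity discussed above: the cache only ever answers the running-right-now interpretation.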
- ?first_seen_days=range - ?last_seen_days=range As per the plan, the db should be able to return a list of status entries / validafter ranges (which can be used in {first,last}_seen_days) given some fingerprint. Oh, I think there's a misunderstanding of these two fields. These fields are only there to search for relays or bridges that have first appeared or were last seen on a given day. You'll need two new parameters, say, from=datetime and to=datetime (or start=datetime and end=datetime) to define a valid-after range for your search. Ah! I wasn't paying attention here. :) Ok, all good. Thanks as always! Regards Kostas. ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
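A toy version of the from=datetime / to=datetime parameters proposed above, applied to in-memory status entries rather than SQL (the entry shape is made up for illustration):

```python
from datetime import datetime

def filter_by_validafter(entries, from_dt=None, to_dt=None, nickname=None):
    """Hypothetical ?from= / ?to= handling: keep status entries whose
    consensus valid-after time falls inside the requested range."""
    result = []
    for e in entries:
        if nickname is not None and e["nickname"].lower() != nickname.lower():
            continue
        if from_dt is not None and e["validafter"] < from_dt:
            continue
        if to_dt is not None and e["validafter"] > to_dt:
            continue
        result.append(e)
    return result

entries = [
    {"nickname": "moria2", "validafter": datetime(2010, 1, 1, 0)},
    {"nickname": "moria2", "validafter": datetime(2012, 6, 1, 0)},
    {"nickname": "gabelmoo", "validafter": datetime(2012, 6, 1, 0)},
]
hits = filter_by_validafter(entries, from_dt=datetime(2011, 1, 1),
                            nickname="moria2")
```

In the real backend the same range would of course become a WHERE clause on the indexed validafter column, which is why keeping that index in memory matters elsewhere in the thread.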
Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive
On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing kars...@torproject.org wrote: I suggest putting pg_prewarm on the future work list. I sense there's a lot of unused potential in stock PostgreSQL. Tweaking the database at this point has the words "premature optimization" written on it in big letters for me. Also, to be very clear here, a tool that requires custom tweaks to PostgreSQL has minimal chances of running on torproject.org machines in the future. The current plan is that we'll have a dedicated database machine operated by our sysadmins that not even the service operator will have shell access to. Oh, understood then, OK, no extensions (at least) for now. Apropos: as of my current (limited) understanding, it might be difficult to support, for example, nickname substring searches without a (supported, official) extension. One such extension is pg_trgm [1], which is in the contrib/ directory in 9.1, and is just one make install away. But for now, I'll assume this is not possible / we should avoid this. So, why do you join descriptors and network statuses in the search process? At the Munich dev meeting I suggested joining the tables already in the import process. What do you think about that idea? Yes, I had made a half-hearted attempt to normalize the two tables some time ago, for a small amount of descriptors and status entries; I'll be trying out this scheme in full (I will need to re-import a major part of the data (which I didn't do then) to be able to see if it scales well) after I try something else. (Namely, using a third table of unique fingerprints (the statusentry table currently holds ~170K unique fingerprints vs. ~67M rows in total) and (non-unique) nicknames for truly quick fingerprint lookup and nickname search; I did experiment with this as well, but I worked with a small subset of the overall data in that case, too; and I think I can do a better job now.) 
It had seemed to me that the bottleneck was in having to sort too large a number of rows, but now I understand (if only just a bit) more about the 'explain analyze' output and can see that the 'Nested Loop' procedure, which is what does the join in the query discussed, is expensive and is part of the bottleneck, so to speak. So I'll look into that after properly benchmarking stuff with the third table. (By the way, for future reference, we do have to test out different ideas on a substantial subset of the overall data, as scaling is not, so to say, linear.) :) https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql We use the following indexes while executing that query: * lower(nickname) on descriptor * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry Using only the first 12 characters sounds like a fine approach to speed things up. But why 12? Why not 10 or 14? This is probably something you should annotate as a parameter to find a good value for later in the process. (I'm not saying that 12 is a bad number. It's perfectly fine for now, but it might not be the best number.) Yes, this is as unscientific as it gets. As of now, we're using a raw SQL query, but I'll be encapsulating the queries properly soon (so we can easily attach different WHERE clauses, etc.), at which point I'll make it into a parameter. I did do some tests, but nothing extensive; I just made sure the indexes can fit into memory whole, which was the main constraint. Will do some tests. Also, would it keep indexes smaller if you took something other than base16 encoding for fingerprints? What about base64? Or is there a binary type in PostgreSQL that works fine for indexes? Re: the latter, no binary type for B-trees (which are the default index type in pgsql) as far as I can see. But it's a good idea / approach, so I'll look into it, thanks! 
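On the base16-vs-base64-vs-binary question, the raw key widths are easy to compare. Actual on-disk B-tree size also depends on tuple headers and page layout, so treat this only as a rough lower bound; and for what it's worth, bytea columns can, as far as I know, be indexed by plain B-trees in PostgreSQL, so the binary option may not be off the table:

```python
import base64
import binascii

fp_hex = "9695DFC35FFEB861329B9F1AB04C46397020CE31"  # base16: 40 chars
fp_raw = binascii.unhexlify(fp_hex)                  # raw SHA-1: 20 bytes
fp_b64 = base64.b64encode(fp_raw).decode("ascii")    # base64: 28 chars w/ '='

print(len(fp_hex), len(fp_raw), len(fp_b64))
```

So base64 shaves keys from 40 to 28 characters, and raw bytes halve them to 20 - a rough upper bound on how much smaller the full-fingerprint index could get before page-level overhead is considered.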
On the whole though, as long as all the indexes occupy only a subset of pgsql's internal buffers, there shouldn't be a problem / that's not the problem, afaik. But, if we're making a well-researched ORM/database design, I should look into it. Do you have a list of searches you're planning to support? These are the ones that should *really* be supported: - ?search=nickname - ?search=fingerprint - ?lookup=fingerprint - ?search=address [done some limited testing, currently not focusing on this] - ?running=boolean - ?flag=flag [every kind of clause which further narrows down the query is not bad; the current db model supports all the flags that Stem does, and each flag has its own column] - ?first_seen_days=range - ?last_seen_days=range As per the plan, the db should be able to return a list of status entries / validafter ranges (which can be used in {first,last}_seen_days) given some fingerprint. Thanks for your feedback and reply! Kostas. [1]: http://www.postgresql.org/docs/9.1/static/pgtrgm.html ___ tor-dev mailing list tor-dev@lists.torproject.org
Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive
Karsten, this won't be a very short email, but I honestly swear I did revise it a couple of times. :) This is not urgent by any measure, so whenever you find time to reply will be fine. Ctrl+F "Observe:" for some precise data / support for my plan re: using the pg_prewarm extension. On Mon, Aug 12, 2013 at 2:16 PM, Karsten Loesing kars...@torproject.org wrote: On 8/10/13 9:28 PM, Kostas Jakeliunas wrote: * I don't think we can avoid using certain postgresql extensions (if only one) - which means that deploying will always take more than apt-get / pip install, but I believe it is needed; Can you give an example of a query that won't be executed efficiently without this extension and just fine with it? Maybe we can tweak that query somehow so it works fine on a vanilla PostgreSQL. Happy to give that some thought. I'd really want to avoid using stuff that is not in Debian. Or rather, if we really need to add non-standard extensions, we need more than thinking and believing that it's unavoidable. :) First off, the general idea. I know this might not sound convincing (see below re: this), but any query that uses an index will take significantly longer to execute if it needs to load parts of the index from disk. More precisely, query-time deviation and max(query_time) correlate inversely with the percentage of the index in question that is in memory. The larger the index, the more difficult it is to 'prep' it into cache, and the more unpredictable query execution time gets. Take a look at the query used to join descriptors and network statuses given some nickname (could be any other criterion, e.g. 
fingerprint or IP address): https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql We use the following indexes while executing that query: * lower(nickname) on descriptor * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry (this one is used to efficiently join descriptor table with statusentry: (fingerprint, descriptor) pair is completely unique in the descriptor table, and it is fairly unique in the statusentry table (whereas a particular fingerprint usually has lots and lots of rows in statusentry)); this index uses only substrings because otherwise, it will hog memory on my remote development machine (not EC2), leaving not much for other indexes; this composite substring index still takes ~2.5GB for status entries (only) in the range between [2010-01; 2013-05] as of now * validafter on statusentry (the latter *must* stay in memory, as we use it elsewhere as well; for example, when not given a particular search criterion, we want to return a list of status entries (with distinct fingerprints) sorted by consensus validafter in descending order) We also want to keep a fingerprint index on the descriptor table because we want to be able to search / look up by fingerprint. I'm thinking of a way to demonstrate the efficiency of having the whole index in memory. For now, let me summarize what I have observed, intersect with what is relevant now: running the aforementioned query on some nickname that we haven't queried for since the last restart of postgresql, it might take, on average, about 1.5 to 3 seconds to execute on EC2, and considerably longer on my development db if it is a truly popular nickname (otherwise, more or less the same amount of time); sometimes a bit longer - up to ~4s (ideally it should be rather uniform since the indexes are *balanced* trees, but.. and autovacuum is enabled.) 
Running that same query later on (after we've run other queries after that first one), it will take <= 160ms to execute and return results (this is a conservative number; usually it's much faster (see below)). Running EXPLAIN (ANALYZE, BUFFERS) shows that what happened was that there was no [disk] read next to index operations - only buffer hits. This means that there was no need to read from disk during all the sorting - only when we knew which rows to return did we need to actually read them from disk. (There are some nuances, but at least this will be true for PostgreSQL >= 9.2 [1], which I haven't tried yet - there might be some pleasant surprises re: query time. Last I checked, the Debian repository contains postgresql 9.1.9.) Observe: 1a. Run that query looking for 'moria2' for the first time since postgresql restart - the relay is an old one, only one distinct fingerprint, relatively few status entries: http://sprunge.us/cEGh 1b. Run that same query later on: http://sprunge.us/jiPg (notice: no reads, only hits; notice query time) 2a. Run the query on 'gabelmoo' (a ton of status entries) for the first time (development machine, query time is rather insane indeed): http://sprunge.us/fQEK 2b. Run that same query on 'gabelmoo' later on: http://sprunge.us/fDDV PostgreSQL is rather clever: it will keep the more often used parts of indexes in cache. What pg_prewarm simply does is: * load all (or the ones critical for us) indexes to memory (and load them whole), which is possible
[tor-dev] [GSoC 2013] Status report - Searchable metrics archive
Hello, another busy benchmarking + profiling period for database querying, but this time more rigorous and awesome. * wrote a generic query analyzer which logs query statements, EXPLAIN, ANALYZE, spots and informs of particular queries that yield inefficient query plans; * wrote a very simple but rather exhaustive profiler (using python's cProfile) which logs query times, function calls, etc.; output is used to see which parts of the e.g. backend are slow during API calls; output can be easily used to construct a general query 'profile' for a particular database, etc.; [1] * benchmarked lots of different queries using these tools, recorded query times, was able to observe deviations/discrepancies; * uploaded the whole database and benchmarked briefly on an amazon EC2 m2.2xlarge instance; * concluded that, provided there is enough memory to cache *and hold* the indexes in cache, query times are good; * in particular, tested the following query scheme extensively: [2] (see comments there as well if curious); concluded that it runs well; * opted for testing raw SQL queries (from within Flask/python) - so far, translating them into ORM queries (while being careful) resulted in degraded performance; if we have to end up using raw SQL, I will create a way to encapsulate them nicely; * made sure data importing is not slowed and remains a quick-enough procedure; * researched PostgreSQL stuff, especially its two-layer caching; I now have an understanding of the way pgsql caches things in memory, how statistics on index usage are gathered and used for maintaining buffer_cache, etc. The searchable metrics archive would work best when all of its indexes are kept in memory. * to this end, looked into buffer cache hibernation [3], etc.; I think pg_prewarm [4, 5] would serve our purpose well. (Apparently many business/etc. solutions do find cache prewarming relevant - pity it's not supported in stock PostgreSQL.) 
The latter means that: * I don't think we can avoid using certain postgresql extensions (if only one) - which means that deploying will always take more than apt-get / pip install, but I believe it is needed; * next on my agenda is testing pg_prewarm on EC2 and, hopefully, putting our beloved database bottleneck problem to rest. I had planned to expose the EC2 instance for public tor-dev inquiry (and ended up delaying the status report yet again), but I'll have to do this separately. This is possible, however. Sorry for the delayed report. ## More generally, I'm happy with my queer queries [2] now; the two constraints/goals of * being able to run Onionoo-like queries on the whole descriptor / status entry database * being able to get a list of status entries for a particular relay will hopefully be put to rest very soon. The former is done, provided I have no trouble setting up a database index precaching system (which will ensure that all queries of the same syntax/scheme run quickly enough). Overall, I'm spending a bit too much time on a specific problem, but at least I have a more intimate lower-level knowledge of PostgreSQL, which turns out to be very relevant to this project. I hope to be able to soon move on to extending Onionoo support and providing a clean API for getting lists of consensuses in which a particular relay was present. And maybe start with the frontend. :) Kostas. [1]: https://github.com/wfn/torsearch/commit/8e6f16a07c40f7806e98e9c71c1ce0f8e3849911 [2]: https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql [3]: http://postgresql.1045698.n5.nabble.com/patch-for-new-feature-Buffer-Cache-Hibernation-td4370109.html [4]: http://www.postgresql.org/message-id/ca+tgmobrrrxco+t6gcqrw_djw+uf9zedwf9bejnu+rb5teb...@mail.gmail.com [5]: http://raghavt.blogspot.com/2012/04/caching-in-postgresql.html ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
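A minimal stand-in for the cProfile-based profiler mentioned in this report: run one query/API function under the profiler and capture the top cumulative entries. The real tool lives in the torsearch repo; this only shows the core idea.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report of the
    top-10 entries by cumulative time) - enough to spot slow paths in
    an API call or ORM query."""
    pr = cProfile.Profile()
    pr.enable()
    result = func(*args, **kwargs)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()

# example: profile any callable, here a trivial stand-in for a query
result, report = profile_call(sum, range(1000))
```

Pointing this at each backend endpoint and diffing reports between schema variants is essentially the benchmarking workflow described above.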
Re: [tor-dev] Onionoo protocol/implementation nuances / Onionoo-connected metrics project stuff
It should also be possible to do efficient *estimated* COUNTs using reltuples [1, 2], provided the DB is regularly VACUUMed + ANALYZEd (postgres-specific awesomeness) - i.e. if everything is set up right, COUNTs would be efficient. This would be nice not only because one could run very quick queries asking e.g. "how many consensuses include nickname LIKE %moo% between [daterange1, daterange2]?" (if e.g. full-text search is set up), but also because, if we sometimes have to resort to returning an arbitrary subset of results (or a subset sorted however we wish, with the sorting done only on that small subset, if that makes sense), we'd still be able to report how many other results match the given criteria, and so on. The usefulness of all this really depends on the intended use cases, and I suppose some discussion could be had here: who would use an Onionoo system covering all or most of the descriptor+consensus archives, hopefully with an extended set of filter/result options, and how? [1]: http://www.varlena.com/GeneralBits/120.php [2]: http://wiki.postgresql.org/wiki/Slow_Counting ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
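The reltuples trick from [1, 2] is a single catalog lookup. A minimal sketch (the table name is a placeholder; in real use the string would be executed against the database, and the estimate is only as fresh as the last VACUUM/ANALYZE):

```python
def estimated_count_sql(table):
    """SQL for a fast *estimated* row count taken from the planner's
    statistics in pg_class, instead of an exact (and, on a large
    table, slow) COUNT(*). Sketch only - in production the table name
    should be validated rather than interpolated."""
    return ("SELECT reltuples::bigint AS estimate "
            "FROM pg_class WHERE relname = '%s';" % table)

print(estimated_count_sql("statusentry"))
```

An exact COUNT(*) over the filtered subset can still be run when precision matters; the estimate is for the "about N results match" case described above.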
[tor-dev] [GSoC '13] Tor status report - Searchable metrics archive
Hey all, I apologize for the unusual timing of this status report - I ended up delaying it beyond measure, so better now than later, I guess. I can reiterate it plus any updates soon; it's just that I figure I'm long overdue on informing tor-dev of what's going on.

I started my project [1] later than is usual, and more or less immediately ran into what I deemed to be a database / ORM scaling issue (the thing I'd actually been trying to avoid since writing the proposal), or at least ORM behaviour that was suboptimal for what we have in mind: delivering (first and foremost) a searchable metrics archive backend/database which incorporates, as of the current plan, server descriptors (relays and bridges - it turns out one server descriptor model can happily service both) and server/router statuses across a multi-year timespan (currently using v3 consensus documents only), and which provides querying functionality that can extract relations between the two. The 'querying with relations between the two' part, when tested on a broader span of data, seemed to be causing me trouble. I ended up allocating a probably inefficiently large amount of time to this problem, rewriting the backend part and trying to optimize the queries which underlay the ORM (it turns out I didn't need to strip off the ORM abstraction - I learned a few things about SQLAlchemy that way; I will follow up with an email pointing to the current code (sorry)).

* The current iteration of the ORM model / backend (which is actually very simple) solves this problem.
* Stem descriptor and network status mapping to the ORM works, and is nicely (enough) integrated with the data import (from downloaded metrics archives) tools, as well as with an API for making queries on the ORM.
* Implemented a partial Onionoo-protocol-adhering backend (without compression and without some fields) for ?summary and ?details Onionoo queries.
* Still tidying everything up.
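The "Stem descriptor mapping to ORM" step above is conceptually simple because the attribute names line up. A hedged sketch of the idea - the class, field names, and stub descriptor below are illustrative inventions, not the project's actual schema or Stem's API:

```python
class ServerDescriptorModel:
    """Stand-in for the ORM model; field names are hypothetical."""
    FIELDS = ("fingerprint", "nickname", "address", "or_port", "published")

    def __init__(self, **kw):
        for field in self.FIELDS:
            setattr(self, field, kw.get(field))

    @classmethod
    def from_descriptor(cls, desc):
        # One mapping serves relay and bridge server descriptors alike,
        # since the attribute names correspond one-to-one to columns.
        return cls(**{f: getattr(desc, f, None) for f in cls.FIELDS})

class FakeDescriptor:
    """Stub standing in for a parsed descriptor object."""
    fingerprint, nickname, address = "A" * 40, "moria1", "128.31.0.34"

row = ServerDescriptorModel.from_descriptor(FakeDescriptor())
print(row.nickname, row.or_port)
```

Missing attributes simply map to NULL-able columns (here, `or_port`), which is what lets one model cover both relay and sanitized bridge descriptors.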
And *finally* writing a design document outlining what we actually ended up with, and what is required for full Onionoo integration. Code review will happen pretty soon, and hopefully we'll have some discussion on where to go from here. Karsten mentioned that it might be possible to use the existing Onionoo incarnation to continue providing bandwidth weight etc. data (basically stuff from extra-info), and it might be possible to join the two systems into an Onionoo-supporting backend which would cover all or a majority of the available archives. Another (or further) avenue would be to continue with the initially proposed plan to extend the query format, and to build a frontend which would make use of the extended query format. Expect another email with links to (decent) code. [1]: http://kostas.mkj.lt/gsoc2013/gsoc2013.html ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
Re: [tor-dev] Metrics Plans
Hi, forgot to reply to this email earlier on.

On Tue, Jun 11, 2013 at 6:02 PM, Damian Johnson ata...@torproject.org wrote:

I can try experimenting with this later on (when we have the full / needed importer working, e.g.), but it might be difficult to scale indeed (not sure, of course). Do you have any specific use cases in mind? (actually curious, could be interesting to hear.)

The advantage of being able to reconstruct Descriptor instances is simpler usage (and hence more maintainable code). [...] Obviously we'd still want to do raw SQL queries for high traffic applications. However, for applications where maintainability trumps speed this could be a nice feature to have.

Oh, very nice, this would indeed be great, and this kind of usage would, I suppose, facilitate the new tool's function as a simplifying 'glue' that reduces multiple tools/applications into one. In any case, since the model for a descriptor can be mapped to/from Stem's Descriptor instance, this should be possible. (More) raw SQL queries for the backend's internal usage would still be used - yes, this makes sense.

* After making the schema update the importer could then run over this raw data table, constructing Descriptor instances from it and performing updates for any missing attributes.

I can't say I can easily see the specifics of how all this would work, but if we had an always-up-to-date data model (mediated by the Stem Relay Descriptor class, but not necessarily), this might work. (The ORM - Stem Descriptor object mapping itself is trivial, so all is well in that regard.)

I'm not sure if I entirely follow. As I understand it the importer...

* Reads raw rsynced descriptor data.
* Uses it to construct stem Descriptor instances.
* Persists those to the database.

My suggestion is that for the first step it could read the rsynced descriptors *or* the raw descriptor content from the database itself.
This means that the importer could be used not only to populate new descriptors, but also to back-fill after a schema update. That is to say, adding a new column would simply be...

* Perform the schema update.
* Run the importer, which...
  * Reads raw descriptor data from the database.
  * Uses it to construct stem Descriptor instances.
  * Performs an UPDATE for anything that's out of sync or missing from the database.

Aha, got it - this would actually be a brilliant way to do it. :) That is,

My suggestion is that for the first step it could read the rsynced descriptors *or* the raw descriptor content from the database itself. This means that the importer could be used not only to populate new descriptors, but also to back-fill after a schema update.

is definitely possible, and doing UPDATEs could indeed be automated that way. OK, so since I'm writing the new database importer incarnation now, it's definitely possible to put each descriptor's raw contents/text into a separate, non-indexed field. This would then simply be a matter of satisfying disk space constraints, and no more. There could/should be a way of switching this raw import option off, IMO. Kostas.

On Tue, Jun 11, 2013 at 6:02 PM, Damian Johnson ata...@torproject.org wrote: I can try experimenting with this later on (when we have the full / needed importer working, e.g.), but it might be difficult to scale indeed (not sure, of course). Do you have any specific use cases in mind? (actually curious, could be interesting to hear.) The advantages of being able to reconstruct Descriptor instances is simpler usage (and hence more maintainable code). I.e., usage could be as simple as...

from tor.metrics import descriptor_db

# Fetches all of the server descriptors for a given date. These are provided as
# instances of...
#   stem.descriptor.server_descriptor.RelayDescriptor
for desc in descriptor_db.get_server_descriptors(2013, 1, 1):
    # print the addresses of only the exits
    if desc.exit_policy.is_exiting_allowed():
        print desc.address

Obviously we'd still want to do raw SQL queries for high traffic applications. However, for applications where maintainability trumps speed this could be a nice feature to have.
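The back-fill workflow discussed above (schema update, then re-parse the stored raw content and UPDATE missing columns) can be sketched end-to-end. A hedged illustration: stdlib sqlite3 stands in for PostgreSQL, the schema and the toy `parse_platform` helper are made up, and the real importer would rebuild full stem Descriptor instances instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE descriptor (digest TEXT PRIMARY KEY, "
             "nickname TEXT, raw TEXT)")
# Initial import stores the raw descriptor text in a non-indexed column.
conn.execute("INSERT INTO descriptor VALUES (?, ?, ?)",
             ("d1", "moria1", "nickname moria1\nplatform Tor 0.2.4"))

# A later schema update adds a column the first import never parsed...
conn.execute("ALTER TABLE descriptor ADD COLUMN platform TEXT")

def parse_platform(raw):
    # Toy parser; the real importer would construct a stem Descriptor
    # from the raw text and copy whatever attributes are missing.
    for line in raw.splitlines():
        if line.startswith("platform "):
            return line[len("platform "):]

# ...so the importer back-fills it from the stored raw content.
rows = conn.execute("SELECT digest, raw FROM descriptor "
                    "WHERE platform IS NULL").fetchall()
for digest, raw in rows:
    conn.execute("UPDATE descriptor SET platform = ? WHERE digest = ?",
                 (parse_platform(raw), digest))

backfilled = conn.execute("SELECT platform FROM descriptor").fetchone()[0]
print(backfilled)
```

No rsync re-download is needed: everything required to repopulate a new column is already sitting in the raw-content field, at the cost of the extra disk space noted above.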
Re: [tor-dev] Metrics Plans
Hi!

Maybe we should focus on a 'grand unified backend' rather than splitting Kostas' summer between both a backend and frontend? If he could replace the backends of the majority of our metrics services then that would greatly simplify the metrics ecosystem.

I'm mostly interested in the back-end, too. But I think it won't be as much fun for Kostas if he can't also work on something that's visible to users. I don't know what he prefers though.

Honestly, I would actually be up for focusing, if need be, exclusively on the backend part. It would also probably (hopefully) prove to be the most beneficial to the overall ecosystem of tools. However, such a plan would imply that the final goal (ideally) is to have a replacement for Onionoo, which means that it would have to be reliably stable and scalable, so that multiple frontends could all use it at once. (It will have to be stable in any case, of course.) I think this would be a great goal, but if we can define and isolate development stages to a great extent, I think having two goals - (a) Onionoo replacement; (b) descriptor search+browse frontend - at the same time is OK, and either one of them could be dropped/reduced during the process.

I think I understand, but I'm not sure. Just to get this right, is either of these states the planned end state of your GSoC project?

1) descriptor database supporting efficient queries, separate API similar to Onionoo's, front-end application using new search parameters;
2) descriptor database supporting efficient queries, full integration with Onionoo API, no special front-end application using new search parameters; or
3) descriptor database supporting efficient queries, full integration with Onionoo API, front-end application using Onionoo's new search parameters.

Yes - thanks for helping to nicely articulate them, by the way - in the sense that *any* of these end states would qualify, from my perspective at least, as a success for this project.
As I said, I think it is possible to work on things without fear of making redundant effort while also not restricting ourselves to one particular end state of the three, until some significantly later point in time. This is because it is possible to first do the efficient database, then implement a subset of the Onionoo-like API (with a possibility of diverging from the Onionoo standard later, if a need arises at some point), and finally - optionally/hopefully - work on the client-side frontend application. I'd still like to do the frontend if the rest can be done in a subset of the whole timeline; I'd also perhaps like to work/tinker on it after the official GSoC timeline; but if (in mid-summer) it turns out that making an Onionoo replacement is possible (the new backend/database scales well for complex queries and so on, and implementing the whole Onionoo API is realistic/easy), I can simply focus on the backend.

Note that there's no Onionoo client that uses bridge data, yet. We have been planning to add bridge support to Atlas for a while, but this hasn't happened yet. But in general, bridge data is quite similar to relay data. There are some specifics because of sanitized descriptor parts, but in general, data structures are similar.

Understood. Bridge data / sanitized descriptors seem similar indeed, and should fit in nicely.

I think it's an advantage here that Onionoo itself has a front-end and a back-end part. The back-end processes data once per hour and writes it to the file system. The front-end is a single Java servlet that does all the filtering and sorting in memory and reads larger JSON files from disk. What we could do is: keep the back-end running, so that it keeps producing details, bandwidth, and weights files, and only replace the servlet by a Python thing that also knows how to respond to more complex search queries.

Yes, this sounds great!
Basically delegating bandwidth and weights calculation to what we have already, and focusing on queries etc. I will have to look into the actual Onionoo backend implementation - namely, how much of the 'produce static JSON files including descriptor data' part can be reused. In any case, I don't think that having Onionoo (compatibility, etc.) as an additional set of variables / potential deliverables should pose a problem. This was a vague/generic reply, but I will eventually follow up with more things. Kostas.
Re: [tor-dev] Remote descriptor fetching
Hi folks! Indeed, this would be pretty bad. I'm not convinced that moria1 provides truncated responses though. It could also be that it compresses results for every new request and that compressed responses randomly differ in size, but are still valid compressions of the same input. Kostas, do you want to look more into this and open a ticket if this really turns out to be a bug? I did check each downloaded file, each was different in size etc., but not all of them were valid, from a shallow look at things (just chucking the file to zlib and seeing what comes out). Ok, I'll try looking into this. :) do note that exams etc. are still ongoing, so this will get pushed back, if anybody figures things out earlier, then great! Tor clients use the ORPort to fetch descriptors. As I understand it the DirPort has been pretty well unused for years, in which case a regression there doesn't seem that surprising. Guess we'll see. Noted - OK, will see! Re: python url request parallelization: @Damian: in the past when I wanted to do concurrent urllib requests, I simply used threading.Thread. There might be caveats here, I'm not familiar with the specifics. I can (again, (maybe quite a bit) later) try cooking something up to see if such a simple parallelization approach would work? (I should probably just try and do it when I have time, maybe will turn out some specific solution is needed and you guys will have solved it by then anyway.) Cheers Kostas. ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
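The plain threading.Thread approach mentioned above is short enough to sketch. This is a hedged illustration of the pattern, not the eventual fetcher: the `fetch` callable is injected (in real use it would be something like urllib's URL opener) so the sketch stays network-free, and there is no error handling or thread cap:

```python
import threading

def fetch_all(urls, fetch):
    """Run `fetch(url)` for each URL on its own thread and collect the
    results. Simple parallelization per the discussion above; real code
    would add timeouts, error capture, and a bound on thread count."""
    results = {}

    def worker(url):
        results[url] = fetch(url)  # dict writes are GIL-safe here

    threads = [threading.Thread(target=worker, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Stub fetch function standing in for a real HTTP request.
print(fetch_all(["dir1", "dir2"], lambda u: "response from " + u))
```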
Re: [tor-dev] Metrics Plans
Ah, forgot to add my footnote to the dirspec - we all know the link, but in any case: [1]: https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt This was in the context of discussing which fields from 2.1 to include. On Tue, Jun 11, 2013 at 12:34 AM, Kostas Jakeliunas kos...@jakeliunas.comwrote: Here, I think it is realistic to try and use and import all the fields available from metrics-db-*. My PoC is overly simplistic in this regard: only relay descriptors, and only a limited subset of data fields is used in the schema, for the import. I'm not entirely sure what fields that would include. Two options come to mind... * Include just the fields that we need. This would require us to update the schema and perform another backfill whenever we need something new. I don't consider this 'frequent backfill' requirement to be a bad thing though - this would force us to make it extremely easy to spin up a new instance which is a very nice attribute to have. * Make the backend a more-or-less complete data store of descriptor data. This would mean schema updates whenever there's a dir-spec addition [1]. An advantage of this is that the ORM could provide us with stem Descriptor instances [2]. For high traffic applications though we'd probably still want to query the backend directly since we usually won't care about most descriptor attributes. In truth, I'm not sure here, either. I agree that it basically boils down to either of the two aforementioned options. I'm okay with any of them. I'd like to, however, see how well the db import scales if we were to import all relay descriptor fields. There aren't a lot of them (dirspec [1]), if we don't count extra-info of course and only want to deal with the Router descriptor format (2.1). So I think I should try working with those fields, and see if the import goes well and quickly enough. 
I plan to write simple Python timeit / timing report macros that can easily be attached to / detached from functions - a simple and clean way to measure things, and so on.

[...] An advantage of [a more-or-less complete data store of descriptor data] is that the ORM could provide us with stem Descriptor instances [2]. For high traffic applications though we'd probably still want to query the backend directly since we usually won't care about most descriptor attributes.

I can try experimenting with this later on (when we have the full / needed importer working, e.g.), but it might be difficult to scale indeed (not sure, of course). Do you have any specific use cases in mind? (actually curious, could be interesting to hear.) The [2] footnote is noted, I'll think about it.

The idea would be to import all data as DB fields (so, indexable), but it makes sense to also import the raw text lines to be able to e.g. supply the frontend application with raw data if needed, as the current tools do. But I think this could be made a separate table, with descriptor id as primary key, which means it can be done later on if need be and would not cause a problem. I guess there's no need for this right now.

I like this idea. A couple of advantages that this could provide us are...

* The importer can provide warnings when our present schema is out of sync with stem's Descriptor attributes (i.e. there has been a new dir-spec addition).
* After making the schema update the importer could then run over this raw data table, constructing Descriptor instances from it and performing updates for any missing attributes.

The 'schema/format mismatch report' idea sounds really good! Surely if we are to try for Onionoo compatibility / eventual replacement, but in any case, this seems like a very useful thing for the future. I will keep this in mind for the upcoming database importer rewrite.
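The attachable/detachable timing macro idea above maps naturally onto a decorator. A minimal sketch (stdlib only; the decorated `query` function is a placeholder for a real database call):

```python
import functools
import time

def timed(fn):
    """Attachable timing 'macro': wraps fn and records each call's
    duration on the wrapper itself, so reports can be generated later.
    functools.wraps also stores the original as __wrapped__, which is
    how the macro can be detached again (fn = fn.__wrapped__)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.timings.append(time.perf_counter() - start)
    wrapper.timings = []
    return wrapper

@timed
def query(n):
    # placeholder for an actual database query
    return sum(range(n))

query(1000)
query(1000)
print(len(query.timings))
```

Because the timings live on the wrapper, several functions can be instrumented independently and their profiles compared after a benchmarking run.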
* After making the schema update the importer could then run over this raw data table, constructing Descriptor instances from it and performing updates for any missing attributes. I can't say I can easily see the specifics of how all this would work, but if we had an always-up-to-date data model (mediated by Stem Relay Descriptor class, but not necessarily), this might work.. (The ORM - Stem Descriptor object mapping itself is trivial, so all is well in that regard.) On Wed, May 29, 2013 at 5:49 PM, Damian Johnson ata...@torproject.orgwrote: Here, I think it is realistic to try and use and import all the fields available from metrics-db-*. My PoC is overly simplistic in this regard: only relay descriptors, and only a limited subset of data fields is used in the schema, for the import. I'm not entirely sure what fields that would include. Two options come to mind... * Include just the fields that we need. This would require us to update the schema and perform another backfill whenever we need something new. I don't consider this 'frequent backfill' requirement to be a bad thing though - this would force
Re: [tor-dev] Metrics Plans
Hello! (@tor-dev: I will also write a separate email introducing the GSoC project at hand.)

This GSoC idea started a year back as a searchable descriptor application, totally unrelated to Onionoo. It was when I read Kostas' proposal that I started thinking about an integration with Onionoo. That's why the plan is still a bit vague. We should work together with Kostas very soon to clarify the plan.

Indeed, as it currently stands, the extent of the proposed backend part of the searchable descriptor project is unclear. The original plan was not to aim for a universal backend which could ideally, for example, service existing web-side Metrics etc. project applications. The idea was to hopefully be able to replace the relay and consensus search/lookup tools with a single, more powerful 'search and browse descriptor archives' application. However, I completely agree that an integrated, reusable backend sounds more exciting and could potentially/hopefully make the broader Tor metrics-* ecosystem more uniform, if that's the word - reducing the tool/component count. I think this is doable if the tasks/steps of this project are kept somewhat isolated, so that incremental development can happen and it's not an all-or-nothing gamble (obviously that is the way it is intended to be, but I think this would be an important aspect of this project in particular as well.)

Maybe we should focus on a 'grand unified backend' rather than splitting Kostas' summer between both a backend and frontend? If he could replace the backends of the majority of our metrics services then that would greatly simplify the metrics ecosystem.

I'm mostly interested in the back-end, too. But I think it won't be as much fun for Kostas if he can't also work on something that's visible to users. I don't know what he prefers though.

Honestly, I would actually be up for focusing, if need be, exclusively on the backend part.
It would also probably (hopefully) prove to be the most beneficial to the overall ecosystem of tools. However, such a plan would imply that the final goal (ideally) is to have a replacement for Onionoo, which means that it would have to be reliably stable and scalable, so that multiple frontends could all use it at once. (It will have to be stable in any case, of course.) I think this would be a great goal, but if we can define and isolate development stages to a great extent, I think having two goals - (a) Onionoo replacement; (b) descriptor search+browse frontend - at the same time is OK, and either one of them could be dropped/reduced during the process. This is what I'd have in mind, generally speaking, in terms of general incremental deliverables / sub-projects, which can be done sequentially:

1. Work out the relay schema for (a) relay descriptors; (b) consensus statuses; (c) *bridge summaries; (d) *bridge network statuses.

Here, I think it is realistic to try to use and import all the fields available from metrics-db-*. My PoC is overly simplistic in this regard: only relay descriptors, and only a limited subset of data fields is used in the schema for the import. I think it is realistic to import the bridge data used and reported by Onionoo. Here is the good, 'incremental' part, I think: the Onionoo protocol/design is useful in itself, as a clean relay-processing design (what comes in, and in what form it comes out). I think it makes sense to do the DB schema with the fields used and reported by Onionoo in mind. Even if the project ends up not aiming to even be compatible with Onionoo (in terms of its API endpoints, or perhaps not reporting everything (e.g. guard probability) - though I would like to aim for compatibility, as would all of you, I suppose!), I think there should be little to no duplication of effort when designing the schema and the descriptor/data import part of the backend. The bridge data can later be dropped.
I will soon try looking closer at whether the schema can be made such that it may later be very easily *extended* to include bridge data, but it might be safer to at least have the whole schema from the beginning for processing db-R, db-B and db-P, and e.g. simply not work on the actual bridge data import at first (depending on priorities.)

2. Implement the data import part: again, the focus would be on importing all possible fields available from, most importantly, metrics-db-R - more fields in relay descriptors, and also consensus statuses. Descriptors (IDs) in consensuses will refer to relay descriptors; it must be possible to efficiently query the consensus table as well, to ask "in which statuses has this descriptor been present?" These two parts are crucial whether the project is to aim for Onionoo replacement and/or also provide a search/browse frontend.

3. Implement Onionoo-compatible search queries, and (maybe only) a subset of result fields. Again, I don't see why using the Onionoo protocol/design shouldn't work here in any case. (Other Onionoo-specific nuances, like
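The descriptor-to-status relation in step 2 can be sketched as two tables plus an index on the referencing side. A hedged sketch: table and column names are illustrative, and stdlib sqlite3 stands in for PostgreSQL so it runs anywhere:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE descriptor (
  descriptor_id TEXT PRIMARY KEY,   -- descriptor digest
  nickname TEXT
);
CREATE TABLE statusentry (
  validafter TEXT,                  -- consensus valid-after time
  descriptor_id TEXT REFERENCES descriptor(descriptor_id)
);
-- The index is what makes 'which statuses referenced this
-- descriptor?' efficient over millions of status entries.
CREATE INDEX ix_statusentry_desc ON statusentry (descriptor_id);
""")
conn.execute("INSERT INTO descriptor VALUES ('d1', 'moria1')")
conn.executemany("INSERT INTO statusentry VALUES (?, 'd1')",
                 [("2013-05-01 00:00:00",), ("2013-05-01 01:00:00",)])

# "In which statuses has this descriptor been present?"
rows = conn.execute(
    "SELECT validafter FROM statusentry WHERE descriptor_id = 'd1' "
    "ORDER BY validafter").fetchall()
print(len(rows))
```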
[tor-dev] Searchable Tor descriptor archive - GSoC 2013 project
Greetings! I'm a student who will be working on the Searchable Tor descriptor archive as part of Google Summer of Code. Yay! I've been following Tor development for a while and hope that this opportunity will be my way of sneaking into the development kitchen of Tor. In any case, I hope to stay around for a longer time to come.

The original GSoC project proposal is based on one of the Tor project ideas available [1] and is part of the Tor Metrics project [2]. The GSoC proposal itself is also available to read [3] (TXT; if there's any interest, I can work on reformatting.) My primary mentor is Karsten and my secondary mentor is Damian. I will quote the abstract from the proposal to sum up the high-level goals of this project:

I'd like to create a more integrated and powerful descriptor archival search and browse system. (The current tools are very restrictive and the experience disjointed.) To do this, I'll write an archival browsing application wherein the results are interactive: they may act as further search filters. Together with a search string input tool which will have more filtering options, the application will provide a more cohesive archival browse search experience and will be a more efficient tool.

So as of now, we have an array of tools for inspecting, searching for, and getting aggregate data about running relays. (For an overview, see the Tools page in the Metrics portal. [4]) These tools include relay search, consensus info, exit-by-IP search, and quite a few more; furthermore, two Onionoo [5] based applications/tools: Atlas and Compass. This project proposes to:

- implement a more powerful backend that would allow one to search for all available relays since mid-2007 (I should have clarified this in the previous discussions, and Karsten already includes this bit; i.e., since v2 statuses became available [6]; I guess this can also be discussed).
More powerful here means, first and foremost, covering all (>= v2) archival data (relay descriptors and consensuses at the very least), and furthermore (at least per the original proposal), supporting more complex queries: we'd be looking into, I think, minimally, combined AND/OR filters referring to a wider range of data fields available in the archival data, plus the ability to specify multiple date ranges. Referring to consensus-related data while searching for relays, and vice versa, would also be possible. (The capabilities would therefore also include those of exoneraTor.)

- implement backend results which would, as of current standing, aim for Onionoo compatibility (again, see the protocol design in [5]), or perhaps supersede it while providing backwards compatibility (e.g. returning paginated lists of consensus status entries in which a specified relay was present.)

- (as per the original proposal,) implement a more powerful archival descriptor search & browse tool (frontend) which would provide a more uniform 'looking up relays / searching by many criteria / further refining search on the results page' experience - refining search results, i.e. adjusting filters, would be semantically the same as entering search criteria in the beginning; hence a more interactive experience, and a more powerful search/browse tool.

The goals and design of the project have yet to be clarified, however. There is ongoing discussion (see another tor-dev thread [7], e.g.) on whether the focus could be to create a backend which would speak the full Onionoo protocol and therefore be a potential replacement not only for relay search and exoneraTor, but also for other components: all applications presently speaking Onionoo could be made to talk to the new backend, for example. The overall count of components will hopefully be reduced in any case, but ideally we would end up with a much more integrated Tor Metrics (and maybe beyond) ecosystem. Many open questions, however - see again [7].
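The combined AND/OR filters with multiple date ranges reduce to composing a parameterized WHERE clause. A hedged sketch of one way to build it - column names are illustrative, and a real implementation would validate field names against a whitelist:

```python
def build_where(filters, date_ranges):
    """Compose a WHERE clause: field filters joined by AND, plus an
    OR-joined set of date ranges, all parameterized (no string
    interpolation of values, to avoid SQL injection)."""
    clauses, params = [], []
    for field, value in filters:
        clauses.append("%s = ?" % field)   # field names must be whitelisted
        params.append(value)
    if date_ranges:
        ranges = " OR ".join("(validafter BETWEEN ? AND ?)"
                             for _ in date_ranges)
        clauses.append("(" + ranges + ")")
        for lo, hi in date_ranges:
            params.extend([lo, hi])
    return " AND ".join(clauses), params

sql, params = build_where([("nickname", "moria1")],
                          [("2008-01-01", "2008-12-31")])
print(sql)
```

Extending the same builder to OR-groups of field filters, or to LIKE matches, is mechanical once the AND/OR composition and parameter ordering are in place.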
Obviously discussions are very welcome indeed! I'm wfn on OFTC (#tor-dev, #nottor), also reachable via XMPP phistophe...@jabber.org, and am very much up for any kind of chat. :) I'll be busy with exams in the first three weeks of June, though - but will find time for sure! Regards Kostas. [1] https://www.torproject.org/getinvolved/volunteer#metricsSearch [2] https://metrics.torproject.org/ [3] http://kostas.mkj.lt/gsoc2013.txt [4] https://metrics.torproject.org/tools.html [5] https://onionoo.torproject.org/ [6] https://metrics.torproject.org/data.html#relaydesc [7] https://lists.torproject.org/pipermail/tor-dev/2013-May/004940.html ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
Re: [tor-dev] Remote descriptor fetching
On Tue, May 28, 2013 at 2:50 AM, Damian Johnson ata...@torproject.org wrote:

So far, so good. By my read of the man pages this means that gzip or python's zlib module should be able to handle the decompression. However, I must be missing something...

% wget http://128.31.0.34:9131/tor/server/all.z
[...]
% python
>>> import zlib
>>> with open('all.z') as desc_file:
...     print zlib.decompress(desc_file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream

This seemed peculiar, so I tried it out. Each time I wget all.z from that address, it's always a different one; I guess that's how it should be, but it seems that sometimes not all of it gets downloaded (hence the actually legit zlib error.) I was able to make it work after my second download attempt (with your exact code); zlib handles it well. So far it's worked every time since. This is probably not good if the source may sometimes deliver an incomplete stream. TL;DR: try wget'ing multiple times and get even more puzzled (?) ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
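The "incomplete or truncated stream" behaviour is easy to reproduce and to guard against with stdlib zlib alone. A small sketch (synthetic payload; real code would feed the downloaded bytes in instead):

```python
import zlib

payload = b"router descriptor " * 500
blob = zlib.compress(payload)
truncated = blob[:-10]  # simulate an incomplete download

# One-shot decompression of a truncated stream raises Error -5,
# exactly as in the traceback above...
failed = False
try:
    zlib.decompress(truncated)
except zlib.error:
    failed = True

# ...whereas a decompressobj consumes the valid prefix without raising,
# and its eof flag makes truncation detectable before trusting the data.
d = zlib.decompressobj()
partial = d.decompress(truncated)
print(failed, d.eof)
```

So a fetcher could decompress incrementally and treat `eof` still being False at end-of-download as "retry the request", rather than crashing on the one-shot call.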
Re: [tor-dev] Iran
Have there been any attempts to produce a pluggable transport which would emulate http? (Ah, I suppose there's been quite a bit of discussion indeed. ( https://trac.torproject.org/projects/tor/ticket/8676, etc.))

On Sun, May 5, 2013 at 9:58 PM, Kostas Jakeliunas kos...@jakeliunas.com wrote:

If we had a PT that encapsulated obfs3 inside the body of http then this may work.

I'm probably missing some previous discussions which might have covered it, but: have there been any attempts to produce a pluggable transport which would emulate http? Basically, have the transport use http headers, and put all encrypted data in the body (possibly prepending it with some html tags even)? This sounds like a nice idea.

On Sun, May 5, 2013 at 9:41 PM, Matthew Finkel matthew.fin...@gmail.com wrote:

On Sun, May 05, 2013 at 04:18:56PM +0300, George Kadianakis wrote: tor-admin tor-ad...@torland.me writes: On Sunday 05 May 2013 14:50:51 George Kadianakis wrote:

It would be interesting to learn which ports they currently whitelist, except for the usual HTTP/HTTPS. I also wonder if they just block based on TCP port, or whether they also have DPI heuristics. On the Tor side, it seems like we should start looking into #7875: https://trac.torproject.org/projects/tor/ticket/7875

I am wondering if there is a way for a user to ask BridgeDB for a bridge with a specific port?

If I remember correctly BridgeDB tries (in a best-effort manner) to give users bridges that are listening on port 443. Obfuscated bridges that bind on 443 are not very common (because of #7875) so I guess that not many obfuscated bridges on 443 are given out. In any case, I don't think that a user can explicitly ask BridgeDB for a bridge on a specific port, but this might be a useful feature request (especially if this 'filtering based on TCP port' tactic continues).
This may be a good feature to have, in general, but it does not sound like this will solve the current problem in Iran. The last report says they're whitelisting ports *and* protocols[1]. So even if a user attempts to use obfs3 on port 443 it'll likely be blocked because obfs3 is not a look-like-https protocol. If we had a PT that encapsulated obfs3 inside the body of http then this may work. CDA also says SSL/TLS connections are throttled to 5% of the normal speed [2], so that's no fun either. [1] https://twitter.com/CDA/status/331006059923795968 [2] https://twitter.com/CDA/status/331040305648369664 ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev ___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
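To make the http-encapsulation idea concrete, here is a toy sketch of what such a transport might put on the wire. The framing, the `/upload` path, the header values, and both helper names are invented purely for illustration; a real PT would additionally need realistic request patterns, bidirectional framing, and handling of payloads that happen to contain the delimiter bytes:

```python
# Toy sketch: wrap an (already obfuscated) payload in an ordinary-looking
# HTTP POST so that a port/protocol whitelist which only passes HTTP traffic
# lets it through. NOT a real pluggable transport wire format.
def http_encapsulate(payload, host="example.com"):
    """Frame obfuscated bytes as the body of a plausible HTTP request."""
    body = b"<html><body>" + payload + b"</body></html>"
    headers = (
        "POST /upload HTTP/1.1\r\n"
        "Host: %s\r\n"
        "Content-Type: text/html\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (host, len(body))
    )
    return headers.encode("ascii") + body

def http_decapsulate(request):
    """Strip the HTTP framing and the html wrapper to recover the payload."""
    _, _, body = request.partition(b"\r\n\r\n")
    if not (body.startswith(b"<html><body>") and body.endswith(b"</body></html>")):
        raise ValueError("not an encapsulated payload")
    return body[len(b"<html><body>"):-len(b"</body></html>")]
```

A DPI box that only checks "is this well-formed HTTP on an allowed port" would pass this; one that inspects body entropy or timing obviously would not, which is where the StegoTorus/FTE-style approaches come in.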
Re: [tor-dev] Iran
> If we had a PT that encapsulated obfs3 inside the body of http then
> this may work.

I'm probably missing some previous discussions which might have covered it, but: have there been any attempts to produce a pluggable transport which would emulate http? Basically, have the transport use http headers, and put all the encrypted data in the body (possibly even prepending it with some html tags)? This sounds like a nice idea.

On Sun, May 5, 2013 at 9:41 PM, Matthew Finkel <matthew.fin...@gmail.com> wrote:

> On Sun, May 05, 2013 at 04:18:56PM +0300, George Kadianakis wrote:
>> tor-admin <tor-ad...@torland.me> writes:
>>> On Sunday 05 May 2013 14:50:51 George Kadianakis wrote:
>>>> It would be interesting to learn which ports they currently
>>>> whitelist, apart from the usual HTTP/HTTPS. I also wonder if they
>>>> just block based on TCP port, or whether they also have DPI
>>>> heuristics. On the Tor side, it seems like we should start looking
>>>> into #7875:
>>>> https://trac.torproject.org/projects/tor/ticket/7875
>>>
>>> I am wondering if there is a way for a user to ask BridgeDB for a
>>> bridge with a specific port?
>>
>> If I remember correctly, BridgeDB tries (in a best-effort manner) to
>> give users bridges that are listening on port 443. Obfuscated bridges
>> that bind on 443 are not very common (because of #7875), so I guess
>> that not many obfuscated bridges on 443 are given out. In any case, I
>> don't think that a user can explicitly ask BridgeDB for a bridge on a
>> specific port, but this might be a useful feature request (especially
>> if this filtering-by-TCP-port tactic continues).
>
> This may be a good feature to have in general, but it does not sound
> like it will solve the current problem in Iran. The last report says
> they're whitelisting ports *and* protocols [1]. So even if a user
> attempts to use obfs3 on port 443, it'll likely be blocked because
> obfs3 is not a look-like-https protocol. If we had a PT that
> encapsulated obfs3 inside the body of http then this may work. CDA
> also says SSL/TLS connections are throttled to 5% of the normal speed
> [2], so that's no fun either.
>
> [1] https://twitter.com/CDA/status/331006059923795968
> [2] https://twitter.com/CDA/status/331040305648369664

___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
Re: [tor-dev] Iran
(Sorry, last email for now --)

I see that StegoTorus is an Obfsproxy fork that extends it to a) split Tor streams across multiple connections to avoid packet size signatures, and b) embed the traffic flows in traces that look like html, javascript, or pdf. However, its public repo seems not to have been updated for more than nine months. [1]

Also, 'Format-Transforming Encryption' looks interesting, but I take it there's not much in terms of implementation beyond a research paper [2] (which does look interesting).

[1] https://gitweb.torproject.org/stegotorus.git
[2] https://eprint.iacr.org/2012/494

On Sun, May 5, 2013 at 10:08 PM, Kostas Jakeliunas <kos...@jakeliunas.com> wrote:

>> have there been any attempts to produce a pluggable transport which
>> would emulate http?
>
> (Ah, I suppose there's been quite a bit of discussion already.
> (https://trac.torproject.org/projects/tor/ticket/8676, etc.))
>
> On Sun, May 5, 2013 at 9:58 PM, Kostas Jakeliunas <kos...@jakeliunas.com> wrote:
>
>>> If we had a PT that encapsulated obfs3 inside the body of http then
>>> this may work.
>>
>> I'm probably missing some previous discussions which might have covered
>> it, but: have there been any attempts to produce a pluggable transport
>> which would emulate http? Basically, have the transport use http
>> headers, and put all the encrypted data in the body (possibly even
>> prepending it with some html tags)? This sounds like a nice idea.
>>
>> On Sun, May 5, 2013 at 9:41 PM, Matthew Finkel <matthew.fin...@gmail.com> wrote:
>>
>>> On Sun, May 05, 2013 at 04:18:56PM +0300, George Kadianakis wrote:
>>>> tor-admin <tor-ad...@torland.me> writes:
>>>>> On Sunday 05 May 2013 14:50:51 George Kadianakis wrote:
>>>>>> It would be interesting to learn which ports they currently
>>>>>> whitelist, apart from the usual HTTP/HTTPS. I also wonder if they
>>>>>> just block based on TCP port, or whether they also have DPI
>>>>>> heuristics. On the Tor side, it seems like we should start looking
>>>>>> into #7875:
>>>>>> https://trac.torproject.org/projects/tor/ticket/7875
>>>>>
>>>>> I am wondering if there is a way for a user to ask BridgeDB for a
>>>>> bridge with a specific port?
>>>>
>>>> If I remember correctly, BridgeDB tries (in a best-effort manner) to
>>>> give users bridges that are listening on port 443. Obfuscated bridges
>>>> that bind on 443 are not very common (because of #7875), so I guess
>>>> that not many obfuscated bridges on 443 are given out. In any case, I
>>>> don't think that a user can explicitly ask BridgeDB for a bridge on a
>>>> specific port, but this might be a useful feature request (especially
>>>> if this filtering-by-TCP-port tactic continues).
>>>
>>> This may be a good feature to have in general, but it does not sound
>>> like it will solve the current problem in Iran. The last report says
>>> they're whitelisting ports *and* protocols [1]. So even if a user
>>> attempts to use obfs3 on port 443, it'll likely be blocked because
>>> obfs3 is not a look-like-https protocol. If we had a PT that
>>> encapsulated obfs3 inside the body of http then this may work. CDA
>>> also says SSL/TLS connections are throttled to 5% of the normal speed
>>> [2], so that's no fun either.
>>>
>>> [1] https://twitter.com/CDA/status/331006059923795968
>>> [2] https://twitter.com/CDA/status/331040305648369664

___ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
[tor-dev] GSoC 2013 / Tor project ideas - Searchable Tor descriptor archive - (pre-)proposal
Hello Karsten and everyone else :)

(TL;DR: I would like to work on the searchable Tor descriptor archive project idea, and am considering drafting up a GSoC application.)

I'm a student backend+frontend programmer from Lithuania who'd be very much interested in contributing to the Tor project via Google Summer of Code (well, ideally at least; the plan is to volunteer some time to Tor in any case, but that has yet to happen, and GSoC is simply too awesome an opportunity not to try).

The 'searchable Tor descriptor/metrics archive' project idea [1] would, I think, best fit my previous experience and general interests in terms of contributing to the Tor project. The searchable archive project idea in itself has a rather clear list of goals / generic constraints, and since I haven't contributed any code to the Tor project before, working from an existing general project idea (building a more concrete design proposal on top of it) probably makes the most sense.

This particular project would match my previous Python backend programming experience: building backends to work with large datasets / databases -- crafting efficient ORMs and responsive APIs to interact with them. [2] Applying those skills to something which is ideologically close to my heart and whose purpose is very obvious to me sounds thrilling!

(This year, as far as Python frameworks are concerned, I've mostly been working with Flask, and have some (limited) experience with Django from before that. As far as a proof of concept for the searchable archive is concerned, I'm considering trying some things out with Flask, since it allows for quick prototyping.)

I'd like to try to work out an implementation/design draft for what I could / would like to do (this is a preliminary email - I know I'm a bit late!)
Ideally it (and a simple proof-of-concept search form - browsable/clickable results / relay descriptor navigation page) would serve as the base for my GSoC application, but I have to be realistic about being rather late to apply and not having participated in either Tor or GSoC before. I'd like to work out an application draft if possible, though. (Were I to get accepted, I would be able to forgo part-time work this summer, or would only need to take passive care of a couple of already-running backends.)

I've read through the Tor Metrics portal pages (esp. Data Formats), and am trying to get acquainted with the existing archiving solution (reading the 'metrics-web' Java source (under metrics-web/src/org/torproject/ernie/web) to see how the descriptor etc. archives are currently parsed / imported into Postgres and so on), first and foremost to be able to evaluate the scope of what I'd like to write. I will presently work on a more specific list of constraints for the searchable archive project idea, and can then try producing a GSoC application draft.

Just to get an idea of what kind of system I'd be building / working on, at the very least we'd be looking into:

- (re)building the archival / metrics data update system - the method proposed in [1] was a simple rsync over ssh / etc. to keep the data in sync with the descriptor data collection point. If possible, it would help if the rsync could work with uncompressed archives - rsync is intelligent enough not to need to send *that* much excess data, and diffing is more efficient with uncompressed data. A simple rsync script (which can be run as a cron job) would work here.

- a Python script (probably to be run through cron) to import the archives into the DB. It can stat files so as to only import new/modified ones.
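The stat-based incremental import mentioned above might look like the following sketch: remember each archive file's (mtime, size) pair between cron runs and only hand changed files to the importer. The directory layout, the `seen` bookkeeping, and the idea of persisting it between runs are my own illustrative assumptions, not existing metrics code:

```python
import os

def changed_files(directory, seen):
    """Yield files whose (mtime, size) differ from the last run.

    `seen` maps path -> (mtime, size) and is updated in place, so it can
    be persisted (e.g. as JSON or a pickle) between cron invocations.
    """
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        st = os.stat(path)
        stamp = (st.st_mtime, st.st_size)
        if seen.get(path) != stamp:
            seen[path] = stamp
            yield path  # new or modified -- hand to the importer
```

Because the script only needs `seen` plus a directory of archives, it stays semi-standalone in the sense described below: the DB import step can be swapped out without touching the change detection.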
The good thing about such an approach is that the script could work as a semi-standalone tool (it would still need the DB / ORM design), and could therefore be used in conjunction with other, different tools - and it would be built as an atomic target during the implementation process. I heard you guys like modular project design proposals ;) who doesn't! We already have metrics-utils/exonerator/exonerator.py (which works as a semantically aware descriptor archive grep tool), so maybe some archive parsing logic can be reused. The more pertinent things here would be to:

- build the ORM for storing all the archival data in the DB. Postgres is preferred and could work, especially since probably a large part of the current ORM logic could be reused here. (I've taken a glance at the current architecture and it makes good sense to me, but I haven't looked further, nor have I done any benchmarking with the existing ORM (except for some web-based relay search test queries, which don't really count).) It is very important to build an ORM which would scale well data-wise and would suit our queries well.

- query logic and types - the idea would be to allow incremental query-building - on the SQL level, WHERE clauses can be incrementally
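To illustrate the kind of incremental query-building meant here, a minimal sketch using psycopg2-style `%s` placeholders; each supplied search criterion appends one parameterized WHERE clause, so criteria compose freely. The `descriptor` table and its column names are invented for illustration:

```python
def build_query(nickname=None, address=None, running=None):
    """Compose a parameterized SELECT from whichever criteria were given."""
    clauses, params = [], []
    if nickname is not None:
        clauses.append("nickname = %s")
        params.append(nickname)
    if address is not None:
        clauses.append("address = %s")
        params.append(address)
    if running is not None:
        clauses.append("running = %s")
        params.append(running)
    sql = "SELECT * FROM descriptor"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

Keeping the values in a separate params list (rather than interpolating them into the SQL string) is what lets the driver do proper escaping, which matters for a public-facing search form.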