Re: [tor-dev] Relay Database: Existing Schemas?

2016-02-26 Thread Kostas Jakeliunas
On Thu, Apr 16, 2015 at 4:53 PM, Karsten Loesing <kars...@torproject.org> wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 15/04/15 21:18, nusenu wrote:
>> Hi,
>>
>> I'm planing to store relay data in a database for analysis. I
>> assume others have done so as well, so before going ahead and
>> designing a db schema I'd like to make sure I didn't miss
>> pre-existing db schemas one could build on.
>>
>> Data to be stored: - (most) descriptor fields - everything that
>> onionoo provides in a details record (geoip, asn, rdns, tordnsel,
>> cw, ...) - historic records
>>
>> I didn't find something matching so far, so I'll go ahead, but if
>> you know of other existing relay db schemas I'd like to hear about
>> it.
>>
>> thanks, nusenu
>>
>>
>>
>>
>> "GSoC2013: Searchable Tor descriptor archive" (Kostas Jakeliunas)
>> https://www.google-melange.com/gsoc/project/details/google/gsoc2013/wfn/5866452879933440
>>
>> https://lists.torproject.org/pipermail/tor-dev/2013-May/004923.html
>>
>>
>> https://lists.torproject.org/pipermail/tor-dev/2013-September/005357.html
>> https://github.com/wfn/torsearch (btw, does someone know the license
>> of this?)
>
> Cc'ing Kostas for this question.

Hi nusenu,

I've been going through old mail, and on 2015-04-16 you asked
about a license (see above).

Just added a LICENSE file - can't hurt (standard BSD 3-clause).

If you're still by any chance collating (ha) and/or want to talk about
schema design for descriptors, let me know. (I personally would not lose
hope for RDBMSes for large datasets - not until one gets into *actually*
big data - say, terabytes at least, or more - but of course it gets
nuanced real fast.)

--

Kostas.

0x0e5dce45 @ pgp.mit.edu

>
>>> This is true: the summary/details documents (just like in Onionoo
>>>  proper) deal with the *last* known info about relays.
>>
>>
>> ernie
>> https://gitweb.torproject.org/metrics-db.git/plain/doc/manual.pdf
>> (didn't find db/tordir.sql mentioned in the pdf)
>
> That file lives here now:
>
> https://gitweb.torproject.org/metrics-web.git/tree/modules/legacy/db/tordir.sql
>
> A better schema might be the following one though.  It's smaller, but
> it's better documented:
>
> https://gitweb.torproject.org/exonerator.git/tree/db/exonerator.sql
>
>> "Instructions for setting up relay descriptor database"
>> https://lists.torproject.org/pipermail/tor-dev/2010-March/001783.html
>
> That's five years old.  I'd say ignore that one.
>
>> "Set up descriptor database for other researchers"
>> https://trac.torproject.org/projects/tor/ticket/1643
>
> Also five years old.  Better ignore.
>
> Hope that helps.
>
> All the best,
> Karsten
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1
> Comment: GPGTools - http://gpgtools.org
>
> iQEcBAEBAgAGBQJVL9rcAAoJEJD5dJfVqbCrFZgIAIEv/Yi4sNoa8clYVAxuk0Sh
> FFbRDT0kLs19t/DgTwUtB6jD4Lh0akMc806AaIFgfCdL+QwcG0llBfZnSsrbszoH
> Xoi226PRx9lPITrA7KYds4PUZfqIqg3ECpNsKNa4PLB7SlQdNfJQ1wDngcwu2CrF
> Hk+zHbu0gfSkfZRBqxt5aJLTFXR0aBYybF4d6sPJ4OW5Al2U8r9DYysXc0xALvwq
> bvEDFctV1wkDgA3mP3guRrXImXYT1AQPFFlz0TR1eBruuSJBiPKIv7Fs/ocns4aR
> OhxIEaKBaAO+HkvyxDcZ1ukXldR13s3MUPD0XvvZ8xQRCBZpNMygqTMi6pIjTN4=
> =a0Nb
> -END PGP SIGNATURE-
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


[tor-dev] [GSoC] BridgeDB Twitter Distributor report

2014-07-26 Thread Kostas Jakeliunas
Progress/activities since last time:

  * incorporating BridgeRequests together with an initial bridge request
API over JSON (it's easier to do both as they are tightly related). The
bridge request API is based on isis' initial fix/12029-dist-api_r1;

  * a bogus server-side bridge provider that implements the JSON API: just
something that gives fake bridges based on the request (which is
handled/contained in BridgeRequest). (Will have server-side code real soon
now; I'd hoped to have it by now.)

  * my churn_rewrite could probably make use of bridgedb's current approach
to pickled storage. (It's also worth switching to twisted.spread.jelly for
(mostly) security)

  * experimenting with sending images over Twitter DMs. The Twitter API does
not support images in DMs, but the web client as well as various mobile apps
support attaching images to DMs (images end up in the Twitter CDN, served over
ton.twitter.com, which is good). Some progress here: the web client's DM
send requests (where image files can be attached) are contained; the bot
should be able to send images in DMs soon, emulating a normal web user
agent (but using the two Twitter APIs for all other activities and DMs).

  * once BridgeRequests + the request API (client-side + my mock server-side
thing) are done, the bot will have approached a not-far-from-functional
state

Apologies for the late report.

--

Kostas.

0x0e5dce45 @ pgp.mit.edu
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


[tor-dev] [GSoC] BridgeDB Twitter Distributor report

2014-07-13 Thread Kostas Jakeliunas
Hi all,

preferring existing code over shiny code and being mad late, I

  * (re)wrote a simple but working churn control mechanism[1], which uses

  * a general persistable storage system:

* in particular, the bot now has a central storage controller
which takes care of storage handlers which, in turn, may be of
different varieties. Each variety knows how to handle its own kind of
storage containers (simple objects with data as attributes). Some of
them may be persistable, others necessarily ephemeral (wipe data on
close);
* right now we only make use of simple
pickle-dump-to-file-and-gzip persistable storage; we use it for churn
control and for challenge responses; everything is self-contained so
to speak;
* we hash the user twitter handles (unique usernames / screen
names) and round up bridges-last-given-at timestamps;
* we handle bot shutdown by catching the appropriate signal (then
properly closing down the twitter stream listener and asking the
storage controller to close down the handlers);
* we use the storage system in the core bot via a general bot
state object (which is itself oblivious to how storage is actually
implemented);

  * wrote a simple and generic challenge-response system[2] (which
makes use of the persistent storage);
* instead of doing something very smart, we use a general CR
system which takes care of particular challenge-responses; the general
CR is usable as-is; the particular CR objects can be easily subclassed
(and that's what we do now);
    * the current mock/bogus CR system that is in place (for testing
etc.) is a naive text-based question-answer CR, which asks the users
to add the number of characters in their Twitter username to a given
verbal/English-word number (see the sketch after this list);
* I should now finish up with ``BridgeRequest``s, which are the
proper way to handle bridge requests in the bot while doing
challenge-responses (the current interaction between the core bot and
the CR system will lead / has been leading nowhere);
* also, there's a question to be had whether the cached (and
hashed) answers to CRs should be persisted to storage (if bot gets
shutdown while some challenges are pending) in the first place.
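
Roughly, the naive text CR mentioned above could look like this (a
made-up, standalone sketch, not the actual twidibot code; the word
table and class name are invented for illustration):

    import random

    WORDS = {3: "three", 5: "five", 7: "seven", 9: "nine"}  # tiny verbal-number table

    class NaiveTextCR(object):
        """Asks the user to add len(screen_name) to a number given in words."""

        def challenge(self, screen_name):
            number = random.choice(list(WORDS))
            question = ("What do you get if you add the number of characters "
                        "in your username to %s?" % WORDS[number])
            expected = len(screen_name) + number
            return question, expected

        def verify(self, answer, expected):
            try:
                return int(answer.strip()) == expected
            except ValueError:
                return False

    # q, expected = NaiveTextCR().challenge("some_user")
    # NaiveTextCR().verify("14", expected)  # True iff the user added correctly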

I've been unable to find[3] or to come up with a concept of a
user-friendly *text-based* CR that would stand against any kind of
thief who is able to create lots of Twitter users and to write
twenty-line scripts solving any text-based challenges/questions
presented. Either it will be a difficult problem that is more easily
solved by a computer than by a human (hence unfeasible
general-UX-wise), or it will be symmetrical, in the sense that one
only has to view the source (if even that) to come up with a script
trivially solving the challenge presented.

Hence I've been slowly moving on with the
captcha-over-twitter-direct-messages idea, which is not pretty, but
which would at least ensure that we don't give up bridges more easily
than in, say, the current IPDistributor.

[1]: https://github.com/wfn/twidibot/compare/master...churn_rewrite
[2]: https://github.com/wfn/twidibot/compare/churn_rewrite...simple_cr2

[3] it's quite hard to find anything of use in the chatroom problem
/ text-based challenge response area. Basically, it would be great
to have a reverse Turing test[4] that is not about captcha/OCR. I
realize this is in itself a very ambitious topic.
[4]: some context on early CAPTCHAs / precursors (have been trying to
familiarize myself with the general area),
http://www2.parc.com/istl/projects/captcha/docs/pessimalprint.pdf

--

Kostas.

0x0e5dce45 @ pgp.mit.edu
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


[tor-dev] GSoC: BridgeDB Twitter Distributor report

2014-06-21 Thread Kostas Jakeliunas
Hi all,

in the past couple of weeks I've been doing more of the same - namely,
fleshing out churn control in the bot; finishing a generic
challenge-response system (I'm also now considering making it into a
Zope Interface); a subclassed text-based challenge-response;
incorporating isis' IRequestBridges and BridgeRequests into the bot's
bridge request processing part; and a fake bridge-line-from-descriptor
generator within the bot (didn't really do much re: the latter).

Unfortunately, all those parts are not yet ready for redeployment of
the bot, and are either buggy or not finished (inclusion/use of
BridgeRequests.) This is partly due to me having a bit less time in
the last two weeks (a fault of my own; on the plus side, I've learned
to use the soldering iron properly!) My plan is to finish the things
that are near completion, and do another midterm-worthy status update
very soon. I'll also be present during the developer meeting
hackdays (2nd-4th), and hope to use them to flesh out ideas, etc. with
isis/sysrqb.

--

Kostas.

0x0e5dce45 @ pgp.mit.edu
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


[tor-dev] GSoC: BridgeDB Twitter Distributor report

2014-06-10 Thread Kostas Jakeliunas
Hey all,

in the past weeks I've been working on understanding what can be done
using Twitter APIs and its media support in its CDN (for a later
captcha implementation), as well as on improving my existing Twitter
bridge distributor bot PoC. I've written some broken code, but it's
alright. More details below.


Distributor bot improvements included working on adding a churn rate
control mechanism which securely stores Twitter user IDs (with code
and design ideas from BridgeDB's HMAC approach to remembering e.g.
email addresses in the EmailDistributor), and implementing a (mostly)
bogus text-based challenge-response system (this is mostly so that we
have a generic design for doing challenge-responses in this
distributor - we'll be able to later on replace it with a decent
CAPTCHA, for example. It's just nice to have a generic system and a
thing for testing out the bot, etc.)
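
To make the churn idea above concrete, a minimal standalone sketch
(this is not BridgeDB's or the bot's actual code; the HMAC key handling
and the one-week interval are assumptions):

    import hmac
    import hashlib
    import time

    CHURN_INTERVAL = 7 * 24 * 3600   # assumed: hand out bridges at most once a week
    HOUR = 3600

    class ChurnControl(object):
        """Remembers *hashed* Twitter user IDs and coarse last-given timestamps."""

        def __init__(self, key):
            self._key = key          # secret HMAC key, so raw user IDs are never stored
            self._last_given = {}    # hmac(user_id) -> timestamp rounded up to the hour

        def _hash(self, user_id):
            return hmac.new(self._key, str(user_id).encode(), hashlib.sha256).hexdigest()

        def may_give_bridges(self, user_id, now=None):
            now = time.time() if now is None else now
            last = self._last_given.get(self._hash(user_id))
            return last is None or now - last >= CHURN_INTERVAL

        def record_given(self, user_id, now=None):
            now = time.time() if now is None else now
            rounded = ((int(now) // HOUR) + 1) * HOUR   # round *up* to the next hour
            self._last_given[self._hash(user_id)] = rounded

    # ctl = ChurnControl(b"some-long-random-secret")
    # if ctl.may_give_bridges(12345): ctl.record_given(12345)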

I've also looked into using isis' new and shiny BridgeRequest objects
to process user (well) 'bridge requests' in a non-hacky way; this
should also eventually result in a bridge request syntax compatible
with (a subset of) GetTor commands. But I still need to figure out the
best way to use BridgeRequests, so nothing interesting to show yet.

TODO

 * (still yet to) summarize a nice meeting I've had with sysrqb and
isis. No definite conclusions were reached, but there were (iirc) some
nice ideas about a generic BridgeDB API that could be used by third
party components, etc. (i.e. it might be worth pursuing even if the
Social Distributor is to be implemented at some later point.)

 * clean up my mess, test that new code doesn't fail, and push new things
onto https://github.com/wfn/twidibot/ (the current (old) code there does
work, if anyone's curious to run it)

 * figure out BridgeRequests, the new IRequestBridges (ha!) interface,
and use these in the twitter bot

 * be able to 'serve' the bot fake bridge data so it could process it
in a way that may be compatible with a future BridgeDB API (i.e.,
hopefully this bot will be able to run as a third-party-thing,
separate from core bridgedb. This is hopefully how future distributors
will/should work.) This way the bot will be more/actually 'realistic'
in the way it serves current bogus bridge lines to users. (I thought
I'd have this by now, but I don't. Hrm.)

 * continue looking into captcha systems modulo what can be used in
the twitter context

 * look into bridgedb buckets and what I can help re: them, so the
bridgedb API could happen sooner than later. (Old todo list item, did
not yet touch it.)

All in all, need to write more non-broken code, fewer words, and just
continue with the current bot.

Have a nice day/night/thing!

--

Kostas.

0x0e5dce45 @ pgp.mit.edu
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Python ExoneraTor

2014-06-10 Thread Kostas Jakeliunas
Hi all!

On Mon, Jun 9, 2014 at 10:22 AM, Karsten Loesing kars...@torproject.org wrote:
 On 09/06/14 01:26, Damian Johnson wrote:
 Oh, and another quick thought - you once mentioned that a descriptor
 search service would make ExoneraTor obsolete, and in looking it over
 I agree. The search functionality ExoneraTor provides is trivial. The
 only reason it requires such a huge database is because it's storing a
 copy of every descriptor ever made.

 I suspect the actual right solution isn't to rewrite ExoneraTor at
 all, but rather develop a new service that can be queried for this
 descriptor data. That would make for a *much* more worthwhile project.

 ExoneraTor? Nice to have. Descriptor archive service? Damn useful. :)

 I agree, that was the idea behind Kostas' GSoC project last year.  And I
 still think it's a good idea.  It's just not trivial to get right.

Indeed, not trivial at all!

I'll use this space to mention the running metrics archive backend
modulo ExoneraTor stuff / what could be sorta-relevant here.

fwiw, the onionoo-like backend is still running at an obscure address:port:
http://ts.mkj.lt:/

TL;DR what can I do with that is: look at:

https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md

In particular, regarding ExoneraTor-like queries (incl. arbitrary
subnet / part-of-ip lookups):

https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exonerator-type-relay-participation-lookup

Not sure if it's worth discussing all the weaknesses of this archive
backend in this thread, but the short relevant version is that the
ExoneraTor-like functionality does mostly work, but I would need to
look into it to see how reliable the results are (is this relay IP
address field really the one we should be using?, etc.)

But what's nice is that it is possible to do arbitrary queries on all
consensuses since ~2008, with no date specified (if you don't want
to). (Which is to say, it's possible; not that this is necessarily the
right way to solve the problems in this thread.)

So e.g. this is the ip address where moria runs, and we want to see
what relays have ever run on it:

http://ts.mkj.lt:/details?search=128.31.0.34

Take the fingerprint of the one that is currently running (moria1),
and look up its last 500 statuses (in a kind of condensed/summary
form): 
http://ts.mkj.lt:/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE31&condensed=true

from/to date ranges can be specified as e.g. 2009, 2009-02,
2009-02-10, 2009-02-10 02:00:00. limit/offset/other parameters etc. are
specified here:
https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md
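
For anyone who wants to poke at this from Python, something like the
following would do (a requests-based sketch; the instance's port is not
shown above, so the base URL is left as a parameter here, and the
query() helper is just made up for illustration - the parameter names
are the ones mentioned in this mail):

    import requests

    def query(base_url, resource, **params):
        """GET e.g. /details or /statuses from the onionoo-like backend."""
        resp = requests.get("%s/%s" % (base_url.rstrip("/"), resource), params=params)
        resp.raise_for_status()
        return resp.json()

    # BASE_URL = the ts.mkj.lt address above (port omitted here on purpose)
    # relays = query(BASE_URL, "details", search="128.31.0.34")
    # statuses = query(BASE_URL, "statuses",
    #                  lookup="9695DFC35FFEB861329B9F1AB04C46397020CE31",
    #                  condensed="true",
    #                  **{"from": "2009-02", "to": "2009-03", "limit": 100})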

(Descriptors/digests aren't currently included (I think they used to be),
but they can be, etc.)

The point is mostly that this is some evidence that it can be done.
(But there are nuances, things are imperfect, time is needed, etc.)

The question really is regarding the actual scope of this rewrite, I suppose.

I'd probably agree with Karsten that just doing a port of the
ExoneraTor functionality as it currently is on
exonerator.torproject.org would be the safe bet. See how that goes,
venture into more exotic lands later on maybe, etc. (That doesn't mean
that I wouldn't be excited to put the current backend to good use,
and/or use the knowledge I gained to help you folks in some way!)


 Regarding your comment about storing a copy of every descriptor ever
 made, I believe that users trust ExoneraTor's results more if they see
 the actual descriptors that lead to results.  Of course, I'm saying that
 without knowing what ExoneraTor users actually want.  But let's not drop
 descriptor copies from the database easily.

 And, heh, when you say that the search functionality ExoneraTor provides
 is trivial, a little part of me is dying.  It's the part that spent a
 few weeks on getting the search functionality fast enough for
 production.  That was not at all trivial.  The oraddress24, oraddress48,
 and exitaddress24 fields as well as the indexes are the result of me
 running lots and lots of sample queries and wondering about Postgres'
 EXPLAIN ANALYZE results.  Just saying that it's not going to be trivial
 to generalize the search functionality towards other fields than IP
 addresses and dates.

Hear hear, I can only imagine! These things and the ExoneraTor stuff are
not easy to do in a way that would provide **consistently**
good/great performance.

I also spent some days last summer looking at EXPLAIN ANALYZE
results (it was a great feeling to start to understand what they mean
and how I can make them better), but eventually things start making
sense. (And when they do, I also get that same feeling that NoSQL
stuff doesn't magically solve things.)


 If others want to follow, here's the SQL code I'm talking about:

 https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql

 So, I'm happy to talk about writing a searchable descriptor archive.  It
 could _start_ with 

Re: [tor-dev] Python ExoneraTor

2014-06-10 Thread Kostas Jakeliunas
On Tue, Jun 10, 2014 at 10:38 AM, Karsten Loesing
kars...@torproject.org wrote:
 On 10/06/14 05:41, Damian Johnson wrote:
 let me make one remark about optimizing Postgres defaults: I wrote quite
 a few database queries in the past, and some of them perform horribly
 (relay search) whereas others perform really well (ExoneraTor).  I
 believe that the majority of performance gains can be achieved by
 designing good tables, indexes, and queries.  Only as a last resort we
 should consider optimizing the Postgres defaults.

 You realize that a searchable descriptor archive focuses much more on
 database optimization than the ExoneraTor rewrite from Java to Python
 (which would leave the database untouched)?

 Are other datastore models such as splunk or MongoDB useful?
 [splunk has a free yet proprietary limited binary... those having
 historical woes and takebacks, mentioned just for example here.]

 Earlier I mentioned the idea of Dynamo. Unless I'm mistaken this lends
 itself pretty naturally to addresses as a hash key, and descriptor
 dates as the range key. Lookups would then be O(log(n)) where n is the
 total number of descriptors an address has published (... that is to
 say very, very quick).

 This would be a fun project to give Boto a try. *sigh*... there really
 should be more hours in the day...

 Quoting my reply to Damian to a similar question earlier in the thread:

 I'm wary about moving to another database, especially NoSQL ones and/or 
 cloud-based ones.  They don't magically make things faster, and Postgres is 
 something I understand quite well by now. [...] Not saying that DynamoDB 
 can't be the better choice, but switching the database is not a priority for 
 me.

 If somebody wants to give, say, MongoDB a try, I'd be interested in
 seeing the performance comparison to the current Postgres schema.  When
 you do, please consider all three search_* functions that the current
 schema offers, including searches for other IPv4 addresses in the same
 /24 and other IPv6 addresses in the same /48.

Personally, the only NoSQL thing I've come across (and have had some
really good experiences with in the past) was Redis, which is a kind
of key-value store-in-memory, with some nice simple data structures
(like sets, and operations on sets. So if you can reduce your problem
to (e.g.) sets and set operations, Redis might be a good fit.)

(I think that isis is actually experimenting with Redis right now, to
do prop 226-bridgedb-database-improvements.txt)

If the things that you store in Redis can't be made to fit into
memory, you'll probably have a bad time.

So to generalize, if some relational data which needs to be searchable
can be made to fit into memory (we can guarantee it wouldn't exceed x
GB [for t time]), offloading that part onto some key-value (or some
more elaborate) system *might* make sense.
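
To illustrate the "reduce your problem to sets" point with a toy
example (assumes a local Redis server and the redis-py client; keys and
data below are made up):

    import redis

    r = redis.Redis()  # assumes a Redis server on localhost:6379

    # One set of relay fingerprints per consensus hour (toy data):
    r.sadd("consensus:2014-06-10-12", "FP_A", "FP_B", "FP_C")
    r.sadd("consensus:2014-06-10-13", "FP_B", "FP_C", "FP_D")

    # Set operations are then cheap, e.g. "which relays were in both consensuses?":
    both = r.sinter("consensus:2014-06-10-12", "consensus:2014-06-10-13")
    print(sorted(both))  # [b'FP_B', b'FP_C']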

Also, I mixed up the link in footnote [2]. It should have linked to
this diagnostic postgres query:

https://github.com/wfn/torsearch/blob/master/misc/list_indexes_in_memory.sql

--

regards
Kostas
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Introducing CollecTor (was: Spinning off Directory Archive from Metrics Portal)

2014-06-06 Thread Kostas Jakeliunas
On Fri, Jun 6, 2014 at 1:18 PM, Philipp Winter p...@nymity.ch wrote:
 On Wed, Jun 04, 2014 at 04:54:03PM +0200, Karsten Loesing wrote:
 On 25/05/14 10:35, Karsten Loesing wrote:
  I'm continuously tweaking the Metrics Portal [0] in the attempt to make
  it more useful.  My latest idea is to finally spin off the Directory
  Archive part from it, which is the part that serves descriptor tarballs.

 Ta-da!   === https://collector.torproject.org/ ===   New website!

 Looks great!

Seconded - very awesome indeed!


 I added the service to:
 https://trac.torproject.org/projects/tor/wiki/org/operations/Infrastructure

  - Recently published descriptors can now be accessed much more easily:
 https://collector.torproject.org/recent/

 That's a very useful feature.


Am I right to assume that any service/program/client that relied on
rsyncing the recent/ folder from metrics should migrate to using
https://collector.torproject.org/recent/ ?

One thing that's neat with rsync is that it can take care of any
lapses in service (on either the metrics data backend side, or on the
client-which-is-downloading-the-data side) - it will just
automagically mirror all the consensuses (if this is needed by the
client/program/etc.)

Of course, it's very easy to just make the client check if it has any
lapses/holes in its (historical) view of the needed data, and to make
it re-download (wget, whatever) the missing parts as needed.

Just wanted to make sure there'll be no rsync-recent-metrics-data
service any more (correct me if i got this wrong.)
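
The "check for holes and re-fetch" part is easy enough; a sketch
(assuming CollecTor's hourly consensus filenames of the form
YYYY-MM-DD-HH-00-00-consensus under recent/relay-descriptors/consensuses/ -
worth double-checking the exact layout before relying on it):

    import os
    from datetime import datetime, timedelta

    RECENT = "https://collector.torproject.org/recent/relay-descriptors/consensuses/"

    def missing_consensuses(local_dir, start, end):
        """Yield URLs of hourly consensuses between start and end not present locally."""
        t = start.replace(minute=0, second=0, microsecond=0)
        while t <= end:
            name = t.strftime("%Y-%m-%d-%H-00-00-consensus")
            if not os.path.exists(os.path.join(local_dir, name)):
                yield RECENT + name
            t += timedelta(hours=1)

    # for url in missing_consensuses("archive/", datetime(2014, 6, 1), datetime(2014, 6, 2)):
    #     print(url)  # fetch with wget/urllib as needed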

  - Preliminary logo suggested by Jeroen and very quickly put together:
 https://people.torproject.org/~karsten/volatile/collector-logo.png -- if
 you're a graphic designer and want to contribute one hour of your time
 to design that for real, please contact me!

 Hmm, that seems to be the octopus which is part of USA-247's logo:
 http://en.wikipedia.org/wiki/USA-247


Quite sure this was some cheeky intended satire :)
Really like the logo, btw ;)

 Hopefully, somebody can contribute a better one.

 Cheers,
 Philipp

Kostas
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


[tor-dev] New BridgeDB Distributor (was: Re: New BridgeDB Distributor (Twitter/SocialDistributor intersections, etc.))

2014-04-22 Thread Kostas Jakeliunas
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA512

With isis' and sysrqb's permission, moving the new BridgeDB
Distributor (and maybe general bridgedb distributor architecture
discussion) thread onto tor-dev@.

On 04/15/2014 10:30 PM, Kostas Jakeliunas wrote:
 On 03/29/2014 10:08 AM, Matthew Finkel wrote:
 (I took the liberty of making this readable again :))
 
 On Fri, Mar 28, 2014 at 08:00:17PM +0200, Kostas Jakeliunas
 wrote:
 isis wrote:
 Kostas Jakeliunas transcribed 7.9K bytes:
 Hey isis,
 
 wfn here. [...]
 
 
 Hi!
 
 
 Howdy!
 
 I'm super excited to hear you're interested in working on this! 
 [...]
 [...] a couple of questions (more like inconcrete musings)
 [...]:
 
 Would you personally think that incorporating some ideas
 from #7520[1] (Design and implement a social distributor
 for BridgeDB) would be within the scope of a ~three+ month
 project? The way I see it, if a twitter (or, say, xmpp+otr
 as mentioned by you/others on IRC) distributor were to be 
 planned, it would either need to
 
 - incorporate some form of churn rate control / Sybil
 attack prevention, via e.g. recaptcha (I see that twitter
 direct (=personal) messages can include images; they'll
 probably be served by one of twitter media CDNs (would need
 to look things up), but it's probably safe to assume that
 as long as twitter itself is not blocked, those CDNs won't
 be, either);
 
 Yes, this stuff is already built, and wouldn't be too hard
 to incorporate. However, as I'm sure you already understand,
 there is no Proof of Work system which actually works for
 users while keeping adversaries out.
 
 For sure, we always have to keep this in mind. Hopefully
 there's a compromise that kinda-works, and eventually, given
 some more metrics/diagnostic info intersected with OONI
 hopefully being able to say which bridges don't work from which
 countries, it'll be possible to actually carry out tests in a
 kind-of-scientific/not-blind-guessing way..
 
 
 At this point I just assume our adversary will always have more 
 resources than us no matter which mechanism we use. More people,
 more compute power/time, more money. At this point I think we
 only have two things that they don't. We have more bridges and
 more love for people. Leveraging this is...not easy, however. :(
 POW is useful in some cases, for example, to prevent an asshole
 from crawling bridgedb so that they can add all bridges to a
 blacklist. When dealing with state-level adversaries I agree with
 isis that they're of little use.
 
 
 Agree.
 
 
 - or take an idea from the social distributor in #7520,
 namely/probably, implement some form of token system.
 
 
 This is not very doable in 6 weeks. It also, sadly, requires
 the DB backend work (which I'll be doing over the next three
 months, but might take more time).
 
 Aha, understood, yes. So basically, ideally I'd write code that
 could *later on* be easily extendable in relevant ways. But no
 tokens for now.
 
 
 Ideally this sounds like a good idea, however I'm not sure we (or
 at least I) have a good handle on what bridgedb will look like in
 6-12 months. It's undergoing a lot of change right now. Don't
 interpret this as saying this is a bad idea because the more
 abstract and extensible you make this distributor the more useful
 it will be. I'm just a little worried about writing something for
 the future. Perhaps there's a good way to design and plan for
 this, though.
 
 
 Yeah, understood. As I understand it, isis is changing some things
 in bridgedb (bridgedb.Distributor, etc) right now / these days.
 
 For now, the idea is to have a thing that works that is more or
 less completely decoupled from the bridgedb codebase. If we do this
 right, it will hopefully be relatively easy to then integrate it in
 a way that will make sense at that point in time (e.g. as part of 
 bridgedb.Distributor, *or* as a client to a core RESTful 
 distributor/api/service that gives bridges to other 'third-party' 
 distributors (see below.))
 
 It might be possible to have some simplistic token system
 with pre-chosen seed nodes, etc. Of course, security and
 privacy implications ahoy - first and foremost, this would
 result in more than zero places/people knowing the entire
 social graph, unless your and other people's ideas (the
 whole Pandora box of; I should attempt an honest read of
 rBridge, et al.; have only skimmed as of now) re: oblivious
 transfer, etc. were incorporated. Here it becomes quite
 difficult to define short-ish term deliverables of course.
 I know that you did quite a lot of research on the
 private/secure social distributor idea.
 
 Really, you don't want to get into this stuff. Or do, but
 don't do it for GSoC. I've spent the past year painfully
 writing proofs to correct the erro rs in that paper, and
 discovered some major problems for anonymity in old 
 tried-and-true cryptographic primitives in the process.
 
 This is a HUGE project.
 
 Sounds insanely intense, in both a good and a bad way! It's

[tor-dev] GSoC: BridgeDB Twitter Distributor

2014-04-22 Thread Kostas Jakeliunas
Hi all,

I'm excited to be able to spend another summer-of-code together with
Tor (how impudent!) :) My name is Kostas (wfn on OFTC), primary mentor
is isis and secondary mentor is sysrqb.

I'll be working on writing a new BridgeDB Distributor[1]. I've set my
primary task to designing and implementing a Twitter-distributor-bot
(see proposal[2]): a Twitter bot answers personal (direct) messages,
does rate control if needed, and gives bridge lines to users.

There should be enough time to at least start on another distributor
(right now I'm thinking about an XMPP-based one, as the
federated-network-nature allows for some neat censorship circumvention
approaches.) But there's also value in implementing a generic/core
distributor that could give bridges to third-party distribution
systems over a (say) RESTful API. Will see how things go, but core
task for now is a twitter-based distributor.

For further ideas, discussion, etc., see a separate tor-dev@ thread:
https://lists.torproject.org/pipermail/tor-dev/2014-April/006742.html

Ideas are very much welcome indeed!

[1]: 
https://www.torproject.org/getinvolved/volunteer.html.en#newBridgedbDistributor
[2]: http://kostas.mkj.lt/gsoc2014/gsoc2014.html

--

Kostas.

0x0e5dce45 @ pgp.mit.edu
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Incorporating your torsearch changes into Onionoo

2013-10-25 Thread Kostas Jakeliunas
On Wed, Oct 23, 2013 at 2:32 PM, Karsten Loesing kars...@torproject.org wrote:

 On 10/11/13 4:05 PM, Kostas Jakeliunas wrote:

 Oops!  Sorry for the delay in responding!  Responding now.

  On Fri, Oct 11, 2013 at 12:00 PM, Karsten Loesing 
 kars...@torproject.orgwrote:
 
  Hi Kostas,
 
  should we move this thread to tor-dev@?
 
 
  Hi Karsten!
 
  sure.
 
 From our earlier conversation about your GSoC project:
  In particular, we should discuss how to integrate your project into
  Onionoo.  I could imagine that we:
 
   - create a database on the Onionoo machine;
   - run your database importer cronjob right after the current Onionoo
  cronjob;
   - make your code produce statuses documents and store them on disk,
  similar to details/weights/bandwidth documents;
   - let the ResourceServlet use your database to return the
  fingerprints to return documents for; and
   - extend the ResourceServlet to support the new statuses documents.
 
  Maybe I'm overlooking something and you have a better plan?  In any
  case, we should take the path that implies writing as little code as
  possible to integrate your code in Onionoo.
 
  Let me know what you think!
 
 
  Sounds good. Responding to particular points:
 
   - create a database on the Onionoo machine;
   - run your database importer cronjob right after the current Onionoo
  cronjob;
 
  These should be no problem and make perfect sense. It's always best to
 use
  raw SQL table creation routines to make sure the database looks exactly
  like the one on the dev machine I guess (cf. using SQLAlchemy
 abstractions
  to do that (I did that before)).
 
  Current SQL script to do that is at [1]. I'll look over it. For example,
  I'd (still) like to generate some plots showing the chances of two
  fingerprints having the same substring (this is for the intermediate
  fingerprint table.) (One axis would be substring length, another would be
  the possibility in (portions of) %.) As of now, we still use
  substr(fingerprint, 0, 12), and it is reflected in the schema.
 
  Overall though, no particular snags here.

 I don't follow.  But before we get into details here, I must admit that
 I was too optimistic about running your code on the current Onionoo
 machine.  I ran a few benchmark tests on it last week to compare it to
 new hardware, and those tests almost made it fall over.  We should not
 even think about adding new load to the current machine.

 New plan: can you run an Onionoo instance with your changes on a
 different machine?  (If you need anything from me, like a tarball of the
 status/ and out/ directories, I'm happy to provide them to you.)  I
 think we should run this instance for a while to see how reliable it is.
  And once we're confident enough, we'll likely have new hardware for the
 new Onionoo, so that we can move it there.


This sounds like a very good idea. Ok, I can try and do this. Sorry for
delaying my response as well, I'll try and follow up with what I need (if
anything).

  - make your code produce statuses documents and store them on disk,
  similar to details/weights/bandwidth documents;
 
  Right, so if we are planning to support all V3 network statuses for all
  fingerprints, how are we to store all the status documents? The idea is
 to
  preprocess and serve static JSON documents, correct (as in the current
  Onionoo)? (cf. the idea of simply caching documents: if we serve a
  particular status document, it gets cached, and depending on the query
  parameters (date range restriction, e.g.) it may be set not to expire at
  all.)
 
  Or should we try and actually store all the statuses (the condensed
 status
  document version [2], of course)?

 Let's do it as the current Onionoo does it.  This code does not exist,
 right?


I've done some small testing on a local system, and it seems the Onionoo way
is plausible, since the generation of all the old(er) status etc. documents
needs to happen only once (obvious, but now I understand it means the
number of resulting status documents and their size are not such a big deal
after all). I don't have good code for it as of yet.


   - let the ResourceServlet use your database to return the
  fingerprints to return documents for; and
   - extend the ResourceServlet to support the new statuses documents.
 
  Sounds good. I assume you are very busy with other things as well, so
  ideally maybe you had in mind that I could try and do the Java part? :)
  Though, since you are much more familiar with (your own) code, you could
  probably do it faster than me. Not sure.
  Any particular technical issues/nuances here (re: ResourceServlet)?

 Can you give it a try?  Happy to help with specific questions about
 ResourceServlet, and I'll try hard to reply faster this time.  Again,
 sorry for the delay!


Okay! I've been tinkering a bit, actually. Will see if I can produce
something decent and reliable.

Best wishes
Kostas.


  [1]: https://github.com/wfn/torsearch/blob/master/db/db_create.sql
  [2

Re: [tor-dev] Incorporating your torsearch changes into Onionoo

2013-10-11 Thread Kostas Jakeliunas
On Fri, Oct 11, 2013 at 12:00 PM, Karsten Loesing kars...@torproject.org wrote:

 Hi Kostas,

 should we move this thread to tor-dev@?


Hi Karsten!

sure.

From our earlier conversation about your GSoC project:
  In particular, we should discuss how to integrate your project into
  Onionoo.  I could imagine that we:
 
   - create a database on the Onionoo machine;
   - run your database importer cronjob right after the current Onionoo
  cronjob;
   - make your code produce statuses documents and store them on disk,
  similar to details/weights/bandwidth documents;
   - let the ResourceServlet use your database to return the
  fingerprints to return documents for; and
   - extend the ResourceServlet to support the new statuses documents.
 
  Maybe I'm overlooking something and you have a better plan?  In any
  case, we should take the path that implies writing as little code as
  possible to integrate your code in Onionoo.

 Let me know what you think!


Sounds good. Responding to particular points:

  - create a database on the Onionoo machine;
  - run your database importer cronjob right after the current Onionoo
 cronjob;

These should be no problem and make perfect sense. It's always best to use
raw SQL table creation routines to make sure the database looks exactly
like the one on the dev machine I guess (cf. using SQLAlchemy abstractions
to do that (I did that before)).

The current SQL script to do that is at [1]. I'll look over it. For example,
I'd (still) like to generate some plots showing the chances of two
fingerprints having the same substring (this is for the intermediate
fingerprint table). (One axis would be substring length, the other the
collision probability in %.) As of now, we still use
substr(fingerprint, 0, 12), and it is reflected in the schema.

Overall though, no particular snags here.
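
(A rough way to get at those numbers without plotting anything is the
usual birthday approximation; the ~170K unique fingerprints figure is
the one mentioned elsewhere in these mails, so treat it as an
assumption:)

    from math import exp

    def prefix_collision_probability(n_fingerprints, prefix_hex_chars):
        """Birthday approximation: P(some two fingerprints share the first N hex chars)."""
        buckets = 16 ** prefix_hex_chars
        k = n_fingerprints
        return 1.0 - exp(-k * (k - 1) / (2.0 * buckets))

    for length in (8, 10, 12, 14):
        print("%2d hex chars: P(collision) ~ %.6f"
              % (length, prefix_collision_probability(170000, length)))
    # 12 hex chars (48 bits) already make a collision among ~170K fingerprints very unlikely.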

  - make your code produce statuses documents and store them on disk,
 similar to details/weights/bandwidth documents;

Right, so if we are planning to support all V3 network statuses for all
fingerprints, how are we to store all the status documents? The idea is to
preprocess and serve static JSON documents, correct (as in the current
Onionoo)? (cf. the idea of simply caching documents: if we serve a
particular status document, it gets cached, and depending on the query
parameters (date range restriction, e.g.) it may be set not to expire at
all.)

Or should we try and actually store all the statuses (the condensed status
document version [2], of course)?

  - let the ResourceServlet use your database to return the
 fingerprints to return documents for; and
  - extend the ResourceServlet to support the new statuses documents.

Sounds good. I assume you are very busy with other things as well, so
ideally maybe you had in mind that I could try and do the Java part? :)
Though, since you are much more familiar with (your own) code, you could
probably do it faster than me. Not sure.
Any particular technical issues/nuances here (re: ResourceServlet)?

cheerio
Kostas.

[1]: https://github.com/wfn/torsearch/blob/master/db/db_create.sql
[2]:
https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md#network-status-entry-documents
(e.g.
http://ts.mkj.lt:/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE31&condensed=true
 )
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Searchable metrics archive - Onionoo-like API available online for probing

2013-09-02 Thread Kostas Jakeliunas
On Mon, Sep 2, 2013 at 2:20 PM, Karsten Loesing kars...@torproject.org wrote:

 On 8/23/13 3:12 PM, Kostas Jakeliunas wrote:
  [snip]

 Hi Kostas,

 I finally managed to test your service and take a look at the
 specification document.


Hey Karsten!

Awesome, thanks a bunch!

The few tests I tried ran pretty fast!  I didn't hammer the service, so
 maybe there are still bottlenecks that I didn't find.  But AFAICS, you
 did a great job there!


Thanks for doing some poking! There is probably room for quite a bit more
parallelized benchmarking (not sure of the term) to be done, but at least in
principle (and from what I've observed / benchmarked so far), if a single
query runs in good time, it's rather safe to assume that scaling to
multiple queries at the same time will not be a big problem. There's always
a limit of course, which I haven't yet observed (and which I should be able
to / would do well to find, ideally.) This is, however, one of the
strengths of PostgreSQL in any case: very nice parallel-query-scaling. Of
course, since the queries are, more or less, always disk i/o-bound, there
still could be hidden sneaky bottlenecks, that is very true for sure.


 Thanks for writing down the specification.

 So, would it be accurate to say that you're mostly not touching summary,
 status, bandwidth, and weights resources, but that you're adding a new
 fifth resource statuses?

 In other words, does the attached diagram visualize what you're going to
 add to Onionoo?  Some explanations:

 - summary and details documents contain only the last known information
 about a relay or bridge, but those are on a pretty high detail level (at
 least for details documents).  In contrast to the current Onionoo, your
 service returns summary and details documents for relays that didn't run
 in the last week, so basically since 2007.  However, you're not going to
 provide summary or details for arbitrary points in time, right?  (Which
 is okay, I'm just asking if I understood this correctly.)


(Nice diagram, useful!) Responding to particular points / nuances:

summary and details documents contain only the last known information
 about a relay or bridge, but those are on a pretty high detail level (at
 least for details documents)


This is true: the summary/details documents (just like in Onionoo proper)
deal with the *last* known info about relays. That is how it works now,
anyway.

As per our subsequent IRC chat, we will now assume this is how it is
intended to be. The way I see it from the perspective of my original
project goals etc., the summary and details (+ bandwidth and weights)
documents are meant for Onionoo {near-, full-}compatibility; they must stay
Onionoo-like. The new network status document is the olden archive browse
and info extract part: it is one of the ways of exposing an interface to
the whole database (after all, we do store all the flags and nicknames and
IP addresses for *all* the network statuses.)

However, you're not going to
 provide summary or details for arbitrary points in time, right?  (Which
 is okay, I'm just asking if I understood this correctly.)


There is no reason why this wouldn't be possible. (I experimented with new
search parameters, but haven't pushed them to master / changed the backend
instance that is currently running.)

A query involving date ranges could, for example, be something akin to,

get a listing of details documents for relays which match this $nickname /
$address / $fingerprint, and which have run (been listed in consensuses
dated) from $startDate to $endDate. (would use new ?from=.., ?to=..
parameters, which you've mentioned / clarified earlier.)

As per our IRC chat, I will add these parameters / query options not only
to the network status document, but also to the summary and details
documents.


 - bandwidth and weights documents always contain information covering
 the whole lifetime of a relay or bridge, where recent events have higher
 detail level.  Again, you're not going to change anything here besides
 providing these documents for relays and bridges that are offline for
 more than a week.

 - statuses have the same level of detail for any time in the past.
 These documents are new.  They're designed for the relay search service
 and for a simplified version of ExoneraTor (which doesn't care about
 exit policies and doesn't provide original descriptor contents).  There
 are no statuses documents for bridges, right?


Yes & yes. No documents for bridges, for now. I'm not sure of the priority
of the task of including bridges - it would sure be awesome to have bridges
as well. For now, I assume that everything else should be finished (the
protocol, the final scalable database schema/setup, etc.) before embarking
on this point.

The status entry API point is indeed about getting info from the whole
archives, at the same detail level for any portion of the archives.

(I should have articulated this / put into a design doc before, but this
important nuance

[tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-23 Thread Kostas Jakeliunas
Hello!

Updating on my Searchable Tor metrics archive project. (As is very evident)
I'm very open for naming suggestions. :)

To the best of my understanding and current satisfaction, I solved the
database bottlenecks, or at least I am, as of now, satisfied with the
current output from my benchmarking utility. Things may change, but I am
confident (and have support to argue) that the whole thing runs swell at
least on amazon m2.2xlarge instances.

For fun and profit, a part of the database (which has, for now, status
entries only in the range [2010-01-01 00:00:00, 2013-05-31 23:00:00]),
namely what is currently used by the Onionoo-like API, is now available
online (not on EC2, though) - I will now write a separate email so that
everyone can inspect it.

I should now move on with implementing / extending the Onionoo API, in
particular, working on date range queries, and refining/rewriting the list
status entries API point (see below). Need to carefully plan some things,
and always keep an updated API document. (Also need to update and publish a
separate, more detailed specification document.)

More concrete report points:

   - re-examined my benchmarking approach, and wrote a rather simple but
   effective set of benchmarking tools (more like a simple script) [1] that
   can be hopefully used outside this project as well; at the very least,
   together with the profiling and the query_info tools, it is powerful (but
   also simple) enough to be used to test all kinds of bottlenecks in ORMs and
   elsewhere.

   - used this tool to generate benchmark reports on EC2 and on the (less
   powerful) dev server, and with different schema settings (usually rather
   minor schema changes that do not require re-importing all the data)

   - came up with a triple-table schema that makes our queries fast:
   we first do a search (using whatever criteria (e.g. nickname,
   fingerprint, address, running), if any) on a table which has a column with
   unique fingerprints; extract the relevant fingerprints; JOIN with the main
   status entry table, which is much larger; and get the final results.
   Benchmarked using this schema.

   (Details: If we are only extracting a list of the latest status entries
   (with distinct on fingerprint), we can do LIMITs and OFFSETs already on the
   fingerprint table, before the JOIN. This helps us quite a bit. On the other
   hand, nickname searches etc. are also efficient. As of now, I have
   re-enabled nickname+address+fingerprint substring search (not from the
   middle (LIKE %substring%), but from the beginning of a substring (LIKE
   substring%), which is still nice), and all is well. Updated the
   higher-level ORM to reflect this new table [2] (I've yet to change some
   column names, though - but these are cosmetics.) See the query sketch
   after this list.)

   - found a way to generate the SQL queries that I need using
   the higher-level SQLAlchemy SQL API and various SQLAlchemy-provided
   primitives, while always observing the resulting query statements. This is
   good, because everything becomes more modular: much easier to shape the
   query depending on the query parameters received, etc. (while still
   keeping it in a sane order.)

   - hence (re)wrote a part of the Onionoo-like API that uses the new
   schema and the SQLAlchemy primitives. Extended the API a bit. [3]

   - wrote a very hacky API point for getting a list of status entries for
   a given fingerprint. I simply wanted a way (for myself and people) to query
   this kind of a relation easily and externally. It now works as part of the
   API. This part will probably need some discussion.

   - wrote a (kind of a stub) document explaining the current Onionoo-like
   API, what can be queried, what can be returned, what kinds of parameters
   work. [4] Will extend this later on.
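
A rough sketch of the two-step "filter on the small fingerprint table
first, then JOIN" query shape described in the list above (table and
column names are simplified guesses here, not the exact schema from
db_create.sql, which also uses fingerprint-prefix columns):

    import psycopg2

    QUERY = """
        WITH matching AS (
            SELECT fingerprint
            FROM fingerprint            -- small table: one row per unique fingerprint
            WHERE nickname LIKE %s      -- prefix search, e.g. 'moria%%'
            ORDER BY fingerprint
            LIMIT %s OFFSET %s
        )
        SELECT s.*
        FROM statusentry s              -- large table: one row per status entry
        JOIN matching m ON m.fingerprint = s.fingerprint
        ORDER BY s.validafter DESC;
    """

    def search_status_entries(dsn, nickname_prefix, limit=50, offset=0):
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(QUERY, (nickname_prefix + "%", limit, offset))
                return cur.fetchall()

    # rows = search_status_entries("dbname=tordir", "moria")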


While writing the doc and rewriting part of the API, I stumbled upon a few
things that make it clear that I've taken some shortcuts that may hurt later
on. Will be happy to elaborate on them later on / separately. I need to
carefully plan a few things, and then try rewriting the Onionoo API yet
again, this time including more parameters and fields returned.

TL;DR yay, a working database backend!

I might give *one* more update detailing things I might have forgotten
about soon re: this report - I don't want to make a habit of delaying
reports (which I have consistently done), so reporting what I have now.

[1]: https://github.com/wfn/torsearch/blob/master/torsearch/benchmark.py
[2]: https://github.com/wfn/torsearch/blob/master/torsearch/models.py
[3]: https://github.com/wfn/torsearch/blob/master/torsearch/onionoo_api.py
[4]: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md

--

Kostas (wfn on OFTC)

0x0e5dce45 @ pgp.mit.edu
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-15 Thread Kostas Jakeliunas
On Wed, Aug 14, 2013 at 1:33 PM, Karsten Loesing kars...@torproject.org wrote:


 Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more
 likely that we can run something requiring this extension on a
 torproject.org machine.  Still, requiring extensions should be the last
 resort if no other solution can be found.  Leaving out searches for
 nickname substrings is a valid solution for now.


Got it.

  Do you have a list of searches you're planning to support?
 
 
  These are the ones that should *really* be supported:
 
 - ?search=nickname
 - ?search=fingerprint
 - ?lookup=fingerprint
 - ?search=address [done some limited testing, currently not focusing
 on
 this]

 The lookup parameter is basically the same as search=fingerprint with
 the additional requirement that fingerprint must be 40 characters long.
  So, this is the current search parameter.

 I agree, these would be good to support.

 You might also add another parameter ?address=address for ExoneraTor.
 That should, in theory, be just a subset of the search parameter.


Oh yes, makes a lot of sense, OK.

By the way: I considered having the last consensus (all the data for at
least the /summary document, or /details as well) be stored in memory (this
is possible) (probably as a hashtable where key = fingerprint, value = all
the fields we'd need to return) so that when the backend is queried without
any search criteria, it would be possible to avoid hitting the database
(which is always nice), and just dump the last consensus. (There's also
caching of course, which we could discuss at a (probably quite a bit) later
point.)
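
(A tiny sketch of that in-memory idea, purely illustrative - the class
and field names below are made up:)

    class LastConsensusCache(object):
        """Keeps the latest consensus in memory, keyed by fingerprint, so a
        query with no search criteria never has to hit the database."""

        def __init__(self):
            self._relays = {}  # fingerprint -> dict of summary/details fields

        def update(self, consensus_rows):
            """Replace the cache whenever a new consensus has been imported."""
            self._relays = {row["fingerprint"]: row for row in consensus_rows}

        def summary(self):
            return list(self._relays.values())

        def lookup(self, fingerprint):
            return self._relays.get(fingerprint)

    # cache = LastConsensusCache()
    # cache.update([{"fingerprint": "9695...", "nickname": "moria1", "running": True}])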


 - ?running=boolean

 This one is tricky.  So far, Onionoo looks only at the very latest
 consensus or bridge status to decide if a relay or bridge is running or
 not.

 But now you're adding archives to Onionoo, so that people can search for
 a certain consensus or certain bridge status in the past, or they can
 search for a time interval of consensuses or bridge statuses.  How do
 you define that a relay or bridge is running, or more importantly
 included as not running?


Agree, this is not clear. (And whatever ends up being done, this should be
well documented and clearly articulated (of course.))

For me at least, 'running' implies the clause whether a given relay/bridge
is running *right now*, i.e. whether it is present in the very last
consensus. (Here's where that hashtable (with fingerprints as keys) in
memory might be able to help: no need to run a separate query / do an inner
join / whatnot; it would depend on whether there's a LIMIT involved though,
etc.)

I'm not sure which one is more useful (intuitively, for me the "whether it
is running *right now*" one is more useful.) Do you mean that it might make
sense to have a field (or have running be it) indicating whether a given
relay/bridge was present in the last consensus in the specified date range?
If this is what you meant, then the "return all that are/were not running"
clause would indeed be kind of.. peculiar (semantically - it wouldn't be
very obvious what it's doing.)

Maybe it'd be simpler to first answer, what would be the most useful case?

 How do you define that a relay or bridge [should be] included as not
running?

Could you rephrase maybe? Do you mean that it might be difficult to
construct sane queries to check for this condition? Or that the situation
where

   - a from..to date range is specified
   - ?running=false is specified

would be rather confusing ('exclude those nodes which are running *right
now*', with 'now' possibly having nothing to do with the date range)?

 - ?flag=flag [every kind of clause which further narrows down the
 query
 is not bad; the current db model supports all the flags that Stem
 does, and
 each flag has its own column]

 I'd say leave this one out until there's an actual use case.


Ok, I won't focus on these now; just wanted to say that these should be
possible without much ado/problems.


 - ?first_seen_days=range
 - ?last_seen_days=range
 
  As per the plan, the db should be able to return a list of status
 entries /
  validafter ranges (which can be used in {first,last}_seen_days) given
 some
  fingerprint.

 Oh, I think there's a misunderstanding of these two fields.  These
 fields are only there to search for relays or bridges that have first
 appeared or were last seen on a given day.

 You'll need two new parameters, say, from=datetime and to=datetime (or
 start=datetime and end=datetime) to define a valid-after range for your
 search.


Ah! I wasn't paying attention here. :) Ok, all good.

Thanks as always!
Regards
Kostas.
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-13 Thread Kostas Jakeliunas
On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing kars...@torproject.org wrote:

 I suggest putting pg_prewarm on the future work list.  I sense there's a
 lot of unused potential in stock PostgreSQL.  Tweaking the database at
 this point has the word premature optimization written on it in big
 letters for me.


 Also, to be very clear here, a tool that requires custom tweaks to
 PostgreSQL has minimal chances of running on torproject.org machines in
 the future.  The current plan is that we'll have a dedicated database
 machine operated by our sysadmins that not even the service operator
 will have shell access to.


Oh, understood then, OK, no extensions (at least) for now.

Apropos: as of my current (limited) understanding, it might be difficult to
support, for example, nickname sub-string searches without a (supported,
official) extension. One such extension is pg_trgm [1], which is in the
contrib/ directory in 9.1, and is just one make install away. But for now,
I'll assume this is not possible / we should avoid this.

So, why do you join descriptors and network statuses in the search
 process?  At the Munich dev meeting I suggested joining the tables
 already in the import process.  What do you think about that idea?


Yes, I had made a half-hearted attempt to normalize the two tables some
time ago, for a small number of descriptors and status entries; I'll be
trying out this scheme in full (will need to re-import a major part of the
data (which I didn't do then) to be able to see if it scales well) after I
try something else. (Namely, using a third table of unique fingerprints
(the statusentry table currently holds ~170K unique fingerprints vs. ~67M
rows in total) and (non-unique) nicknames for truly quick fingerprint
lookup and nickname search; I did experiment with this as well, but I
worked with a small subset of overall data in that case, too; and I think I
can do a better job now.)

It had seemed to me that the bottleneck was in having to sort too large a
number of rows, but now I understand (if only just a bit) more of the
'explain analyze' output and can see that the 'Nested Loop' node, which is
what does the join in the query discussed, is expensive and is part of
the bottleneck, so to speak. So I'll look into that after properly
benchmarking stuff with the third table. (By the way, for future reference,
we do have to test out different ideas on a substantial subset of the overall
data, as the scaling behaviour is not, so to say, linear.) :)


  https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
 
  We use the following indexes while executing that query:
 
   * lower(nickname) on descriptor
 
   * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on
 statusentry

 Using only the first 12 characters sounds like a fine approach to speed
 up things.  But why 12?  Why not 10 or 14?  This is probably something
 you should annotate as parameter to find a good value for later in the
 process.  (I'm not saying that 12 is a bad number.  It's perfectly fine
 for now, but it might not be the best number.)


Yes, this is as unscientific as it gets. As of now, we're using a raw SQL
query, but I'll be encapsulating them properly soon (so we can easily
attach different WHERE clauses, etc.), at which point I'll make it into a
parameter. I did do some tests, but nothing extensive; just made sure the
indexes can fit into memory whole, which was the main constraint. Will do
some tests.


 Also, would it keep indexes smaller if you took something else than
 base16 encoding for fingerprints?  What about base64?  Or is there a
 binary type in PostgreSQL that works fine for indexes?


Re: latter, no binary type for B-Trees (which is the default index type in
pgsql) as far as I can see. But it's a good idea / approach, so I'll look
into it, thanks! On the whole though, as long as all the indexes occupy
only a subset of pgsql's internal buffers, there shouldn't be a problem /
that's not the problem, afaik. But, if we're making a well-researched
ORM/database design, I should look into it.


 Do you have a list of searches you're planning to support?


These are the ones that should *really* be supported:

   - ?search=nickname
   - ?search=fingerprint
   - ?lookup=fingerprint
   - ?search=address [done some limited testing, currently not focusing on
   this]
   - ?running=boolean
   - ?flag=flag [every kind of clause which further narrows down the query
   is not bad; the current db model supports all the flags that Stem does, and
   each flag has its own column]
   - ?first_seen_days=range
   - ?last_seen_days=range

As per the plan, the db should be able to return a list of status entries /
validafter ranges (which can be used in {first,last}_seen_days) given some
fingerprint.
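To make the parameter handling concrete, here is a minimal sketch (in Flask, which the backend already uses) of how a couple of these parameters might be turned into WHERE clauses; the table and column names are assumptions, and the real code will encapsulate this differently:

# Sketch only: turn a few of the query parameters listed above into WHERE
# clauses.  Table and column names are assumptions, not the real schema.
from flask import Flask, request, jsonify
import psycopg2

app = Flask(__name__)

@app.route('/details')
def details():
    clauses, params = [], []
    if 'search' in request.args:
        # nickname prefix search; fingerprint lookup would be handled similarly
        clauses.append("lower(nickname) LIKE %s")
        params.append(request.args['search'].lower() + '%')
    if 'running' in request.args:
        clauses.append("is_running = %s")
        params.append(request.args['running'].lower() == 'true')
    where = (' WHERE ' + ' AND '.join(clauses)) if clauses else ''
    cur = psycopg2.connect("dbname=tordir").cursor()
    cur.execute("SELECT fingerprint, nickname FROM descriptor" + where +
                " ORDER BY fingerprint LIMIT 50", params)
    return jsonify(relays=cur.fetchall())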

Thanks for your feedback and reply!

Kostas.


[1]: http://www.postgresql.org/docs/9.1/static/pgtrgm.html
___
tor-dev mailing list
tor-dev@lists.torproject.org

Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-12 Thread Kostas Jakeliunas
Karsten,

this won't be a very short email, but I honestly swear I did revise it a
couple of times. :) This is not urgent by any measure, so whenever you find
time to reply will be fine. Ctrl+F for "Observe" to jump to some precise data /
support for my plan re: using the pg_prewarm extension.

On Mon, Aug 12, 2013 at 2:16 PM, Karsten Loesing kars...@torproject.org wrote:

 On 8/10/13 9:28 PM, Kostas Jakeliunas wrote:
* I don't think we can avoid using certain postgresql extensions (if
 only
  one) - which means that deploying will always take more than apt-get &&
 pip
  install, but I believe it is needed;

 Can you give an example of a query that won't be executed efficiently
 without this extension and just fine with it?  Maybe we can tweak that
 query somehow so it works fine on a vanilla PostgreSQL.  Happy to give
 that some thoughts.

 I'd really want to avoid using stuff that is not in Debian.  Or rather,
 if we really need to add non-standard extensions, we need more than
 thinking and believing that it's unavoidable. :)


First off, the general idea. I know this might not sound convincing (see
below re: this), but any query that uses an index will take significantly
longer to execute if it needs to load parts of the index from disk. More
precisely, query time deviation and max(query_time) correlate inversely
with the percentage of the index in question that is in memory. The larger the
index, the more difficult it is to 'prep' it into cache, the more
unpredictable query exec time gets.

Take a look at the query used to join descriptors and network statuses
given some nickname (could be any other criterion, e.g. fingerprint or IP
address):

https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql

We use the following indexes while executing that query:

 * lower(nickname) on descriptor

 * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry
(this one is used to efficiently join descriptor table with statusentry:
(fingerprint, descriptor) pair is completely unique in the descriptor
table, and it is fairly unique in the statusentry table (whereas a
particular fingerprint usually has lots and lots of rows in statusentry));
this index uses only substrings because otherwise, it will hog memory on my
remote development machine (not EC2), leaving not much for other indexes;
this composite substring index still takes ~2.5GB for status entries (only)
in the range between [2010-01; 2013-05] as of now

 * validafter on statusentry (the latter *must* stay in memory, as we use
it elsewhere as well; for example, when not given a particular search
criterion, we want to return a list of status entries (with distinct
fingerprints) sorted by consensus validafter in descending order)

We also want to keep a fingerprint index on the descriptor table because we
want to be able to search / look up by fingerprint.
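For concreteness, a rough sketch of the DDL for the indexes described above (index and table names are placeholders following this discussion, not the exact schema):

# Sketch only: the indexes described above as plain DDL, run through
# psycopg2.  Index, table and column names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=tordir")
cur = conn.cursor()

cur.execute("CREATE INDEX ix_descriptor_nickname_lower "
            "ON descriptor (lower(nickname))")
cur.execute("CREATE INDEX ix_descriptor_fingerprint "
            "ON descriptor (fingerprint)")
cur.execute("CREATE INDEX ix_statusentry_fingerprint_digest_substr "
            "ON statusentry (substr(fingerprint, 0, 12), "
            "substr(lower(digest), 0, 12))")
cur.execute("CREATE INDEX ix_statusentry_validafter "
            "ON statusentry (validafter)")
conn.commit()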

I'm thinking of a way to demonstrate the efficiency of having the whole
index in memory. For now, let me summarize what I have observed, intersected
with what is relevant now: running the aforementioned query on some
nickname that we haven't queried for since the last restart of postgresql,
it might take, on average, about 1.5 to 3 seconds to execute on EC2, and
considerably longer on my development db if it is a truly popular nickname
(otherwise, more or less the same amount of time); sometimes a bit longer -
up to ~4s (ideally it should be rather uniform since the indexes are
*balanced* trees, but.. and autovacuum is enabled.)

Running that same query later on (after we've run other queries after that
first one), it will take <= 160ms to execute and return results (this is a
conservative number, usually it's much faster (see below)). Running EXPLAIN
(ANALYZE, BUFFERS) shows that what happened was that there was no [disk]
read next to index operations - only buffer hit. This means that there
was no need to read from disk during all the sorting - only when we knew
which rows to return did we need to actually read them from disk. (There
are some nuances, but at least this will be true for PostgreSQL >= 9.2 [1],
which I haven't tried yet - there might be some pleasant surprises re:
query time. Last I checked, the Debian 7.0 repository contains postgresql
9.1.9.)

Observe:

1a. Run that query looking for 'moria2' for the first time since postgresql
restart - relay is an old one, only one distinct fingerprint, relatively
few status entries: http://sprunge.us/cEGh

1b. Run that same query later on: http://sprunge.us/jiPg (notice: no reads,
only hits; notice query time)

2a. Run query on 'gabelmoo' (a ton of status entries) for the first time
(development machine, query time is rather insane indeed):
http://sprunge.us/fQEK

2b. Run that same query on 'gabelmoo' later on: http://sprunge.us/fDDV
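
For reference, a minimal sketch of how the buffer hit / disk read distinction above can be checked programmatically; the connection string and query text are placeholders (the real query is the one in misc/nested_join.sql):

# Sketch only: run EXPLAIN (ANALYZE, BUFFERS) on a query and flag plan nodes
# that had to read pages from disk instead of hitting the buffer cache.
import psycopg2

QUERY = "SELECT * FROM descriptor WHERE lower(nickname) = lower(%s)"

cur = psycopg2.connect("dbname=tordir").cursor()
cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + QUERY, ('moria2',))

for (line,) in cur.fetchall():
    print(line)
    if 'read=' in line:
        # "Buffers: shared hit=... read=..." means this node touched the disk
        print('  ^ disk read here - these pages were not cached')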

PostgreSQL is rather clever: it will keep the more often used parts of
indexes in cache. What pg_prewarm simply does is:

 * load all (or, for us, the critical) indexes into memory (and load them
whole), which is possible

[tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-10 Thread Kostas Jakeliunas
Hello,

another busy benchmarking + profiling period for database querying, but
this time more rigorous and awesome.

  * wrote a generic query analyzer which logs query statements and their
EXPLAIN ANALYZE output, and spots and reports particular queries that yield
inefficient query plans;
  * wrote a very simple but rather exhaustive profiler (using python's
cProfile) which logs query times, function calls, etc.; output is used to
see which parts of the e.g. backend are slow during API calls; output can
be easily used to construct a general query 'profile' for a particular
database, etc.; [1]
  * benchmarked lots of different queries using these tools, recorded query
times, was able to observe deviations/discrepancies;
  * uploaded the whole database and benchmarked briefly on an amazon EC2
m2.2xlarge instance;
  * concluded that, provided there is enough memory to cache *and hold* the
indexes in cache, query times are good;
  * in particular, tested the following query scheme extensively: [2] (see
comments there as well if curious); concluded that it runs well;
  * opted for testing raw SQL queries (from within Flask/python) - so far,
translating them into ORM queries (while being careful) resulted in
degraded performance; if we have to end up using raw SQL, I will create a
way to encapsulate them nicely;
  * made sure data importing is not slowed and remains a quick-enough
procedure;
  * researched PostgreSQL stuff, especially its two-layer caching; I now
have an understanding of the way pgsql caches things in memory, how
statistics on index usage are gathered and used for maintaining
buffer_cache, etc.
The searchable metrics archive would work best when all of its indexes are
kept in memory.
  * to this end, looked into buffer cache hibernation [3], etc.; I think
pg_prewarm [4, 5] would serve our purpose well. (Apparently many
business/etc. solutions do find cache prewarming relevant - pity it's not
supported in stock PostgreSQL.)

The latter means that
  * I don't think we can avoid using certain postgresql extensions (if only
one) - which means that deploying will always take more than apt-get && pip
install, but I believe it is needed;
 * next on my agenda is testing pg_prewarm on EC2 and, hopefully, putting
our beloved database bottleneck problem to rest.
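
For illustration, a rough sketch of what that prewarming step could look like, assuming the pg_prewarm extension is built and installed, and using placeholder index names:

# Sketch only: load the critical indexes into the buffer cache right after a
# restart, so the first queries don't pay the disk-read penalty.
import psycopg2

INDEXES = [
    'ix_descriptor_nickname_lower',
    'ix_statusentry_fingerprint_digest_substr',
    'ix_statusentry_validafter',
]

cur = psycopg2.connect("dbname=tordir").cursor()
for index in INDEXES:
    # pg_prewarm(regclass) reads the whole relation into shared buffers and
    # returns the number of blocks it touched
    cur.execute("SELECT pg_prewarm(%s)", (index,))
    print('%s: %s blocks prewarmed' % (index, cur.fetchone()[0]))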

I planned to expose the EC2 for public tor-dev inquiry (and ended up
delaying status report yet again), but I'll have to do this separately.
This is possible, however. Sorry for the delayed report.

##

More generally,

I'm happy with my queer queries [2] now;
the two constraints/goals of

  * being able to run Onionoo-like queries on the whole descriptor / status
entry database
  * being able to get a list of status entries for a particular relay

will hopefully be put to rest very soon. The former is done, provided I
have no trouble setting up a database index precaching system (which will
ensure that all queries of the same syntax/scheme run quick enough.)

Overall, I'm spending a bit too much time on a specific problem, but at
least I have a more intimate lower-level knowledge of PostgreSQL, which
turns out to be very relevant to this project. I hope to be able to soon
move to extending Onionoo support and providing a clean API for getting
lists of consensuses in which a particular relay was present. And maybe
start with the frontend. :)

Kostas.

[1]:
https://github.com/wfn/torsearch/commit/8e6f16a07c40f7806e98e9c71c1ce0f8e3849911
[2]: https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
[3]:
http://postgresql.1045698.n5.nabble.com/patch-for-new-feature-Buffer-Cache-Hibernation-td4370109.html
[4]:
http://www.postgresql.org/message-id/ca+tgmobrrrxco+t6gcqrw_djw+uf9zedwf9bejnu+rb5teb...@mail.gmail.com
[5]: http://raghavt.blogspot.com/2012/04/caching-in-postgresql.html
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Onionoo protocol/implementation nuances / Onionoo-connected metrics project stuff

2013-07-29 Thread Kostas Jakeliunas
It should also be possible to do efficient *estimated* COUNTs (using
reltuples [1, 2]), provided the DB can be regularly VACUUMed + ANALYZEd
(postgres-specific awesomeness) - i.e. if everything is set up right, doing
COUNTs would be efficient. This would be nice not only because one could run
very quick queries asking e.g. "how many consensuses include nickname LIKE
'%moo%' between [daterange1, daterange2]?" (if e.g. full text search is set
up), but also because, if we have to resort to sometimes returning an
arbitrary subset of results (or results sorted however we wish, but with the
sorting done on a small subset of results, if that makes sense), we'd be able
to also report how many other results matching these particular criteria
there are, and so on. The usefulness of all this really depends on the
intended use cases, and I suppose here some discussion could be had: who /
how would an Onionoo system covering all / most of the descriptor+consensus
archives, and hopefully having an extended set of filter / result options,
be used?

[1]: http://www.varlena.com/GeneralBits/120.php
[2]: http://wiki.postgresql.org/wiki/Slow_Counting
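
For illustration, a minimal sketch of the reltuples trick (the table name and connection parameters are placeholders):

# Sketch only: estimated row count from planner statistics instead of a full
# COUNT(*) scan; accuracy depends on how recently the table was VACUUMed /
# ANALYZEd.
import psycopg2

cur = psycopg2.connect("dbname=tordir").cursor()
cur.execute("SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
            ('statusentry',))
print('estimated rows in statusentry: %d' % cur.fetchone()[0])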
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


[tor-dev] [GSoC '13] Tor status report - Searchable metrics archive

2013-07-22 Thread Kostas Jakeliunas
Hey all,

I apologize for this unusual timing for a status report, but I ended up
delaying it beyond measure, so better now than later I guess. I can
reiterate it + any updates soon, it's just that I figure I'm long overdue
on informing tor-dev on what's going on.

I started my project [1] later than usual, and more or less immediately ran
into what I deemed to be a database / ORM scaling issue (the thing I'd
actually been trying to avoid since writing the proposal), or at least ORM
behaviour that was suboptimal for what we have in mind: delivering (first and
foremost) a searchable metrics archive backend/database which incorporates,
as of the current plan, server descriptors (relays and bridges - it turns out
a single server descriptor model can happily service both) and server/router
statuses across a timespan of a few years (currently using v3 consensus
documents only), and which provides querying functionality that can extract
relations between the two. The 'querying with relations between the two'
part seemed to cause trouble when tested on a broader span of data. I ended
up allocating a probably inefficiently large amount of time to this problem,
rewriting the backend part and trying to optimize the queries underlying the
ORM (it turns out I didn't need to strip off the ORM abstraction - I learned
a few things about SQLAlchemy that way - and I will follow up with an email
pointing to the current code (sorry)).

  * The current iteration of the ORM model / backend (which actually is
very simple) solves this problem.
  * Stem descriptor and network status mapping to ORM works, and is nicely
(enough) integrated with the data import (from downloaded metrics archive)
tools, as well as an API to make queries on the ORM.
  * Implemented a partial Onionoo-protocol-adhering (without compression
and without some fields) backend for ?summary and ?details Onionoo queries.
  * Still tidying everything up. And *finally* writing a design document
outlining what we actually ended up with, and what is required till full
Onionoo integration.

Code review will happen pretty soon, and hopefully we'll have some
discussion upon where to go from here. Karsten mentioned that it might be
possible to use the existing Onionoo incarnation to continue providing
bandwidth weight etc. data (basically stuff from extra-info), and it might
be possible to join the two systems into an Onionoo-supporting backend
which will cover all / majority of archives available. Another (or) further
avenue would be to continue with the initial proposed plan to extend the
query format; and to build a frontend which would make use of the extended
query format. Expect another email with links to (decent) code.

[1]: http://kostas.mkj.lt/gsoc2013/gsoc2013.html
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Metrics Plans

2013-07-01 Thread Kostas Jakeliunas
Hi,

forgot to reply to this email earlier on..

On Tue, Jun 11, 2013 at 6:02 PM, Damian Johnson ata...@torproject.org wrote:

  I can try experimenting with this later on (when we have the full /
 needed
  importer working, e.g.), but it might be difficult to scale indeed (not
  sure, of course). Do you have any specific use cases in mind? (actually
  curious, could be interesting to hear.)

 The advantages of being able to reconstruct Descriptor instances is
 simpler usage (and hence more maintainable code).

 [...]

 Obviously we'd still want to do raw SQL queries for high traffic
 applications. However, for applications where maintainability trumps
 speed this could be a nice feature to have.


Oh, very nice, this would indeed be great, and this kind of usage would, I
suppose, facilitate the new tool's function as a simplifying 'glue' that
reduces multiple tools/applications into one. In any case, since the model
for a descriptor can be mapped to/from Stem's Descriptor instance, this
should be possible. (More) raw SQL queries for the backend's internal usage
would still be used - yes, this makes sense.


  * After making the schema update the importer could then run over this
  raw data table, constructing Descriptor instances from it and
  performing updates for any missing attributes.
 
  I can't say I can easily see the specifics of how all this would work,
 but
  if we had an always-up-to-date data model (mediated by Stem Relay
 Descriptor
  class, but not necessarily), this might work.. (The ORM - Stem
 Descriptor
  object mapping itself is trivial, so all is well in that regard.)

 I'm not sure if I entirely follow. As I understand it the importer...

 * Reads raw rsynced descriptor data.
 * Uses it to construct stem Descriptor instances.
 * Persists those to the database.

 My suggestion is that for the first step it could read the rsynced
 descriptors *or* the raw descriptor content from the database itself.
 This means that the importer could be used to not only populate new
 descriptors, but also back-fill after a schema update.

 That is to say, adding a new column would simply be...

 * Perform the schema update.
 * Run the importer, which...
   * Reads raw descriptor data from the database.
   * Uses it to construct stem Descriptor instances.
   * Performs an UPDATE for anything that's out of sync or missing from
 the database.


Aha, got it - this would actually probably be a brilliant way to do it. :)
that is,

 My suggestion is that for the first step it could read the rsynced
 descriptors *or* the raw descriptor content from the database itself.
 This means that the importer could be used to not only populate new
 descriptors, but also back-fill after a schema update.

is definitely possible, and doing UPDATEs could indeed be automated that
way. Ok, so since I'm writing the new database importer incarnation now,
it's definitely possible to put each descriptor's raw contents/text into a
separate, non-indexed field. This would then simply be a matter of
satisfying disk space constraints, and no more. There could/should be a way
of switching this raw import option off, IMO.
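
To make the back-fill idea concrete, a minimal sketch under the assumption that raw descriptor text lives in its own table; all table and column names here are placeholders rather than the actual schema:

# Sketch only: re-parse raw descriptor text kept in a separate table and
# back-fill a column added in a schema update.
import psycopg2
from stem.descriptor.server_descriptor import RelayDescriptor

conn = psycopg2.connect("dbname=tordir")
read_cur, write_cur = conn.cursor(), conn.cursor()

read_cur.execute("SELECT descriptor_id, raw_contents FROM descriptor_raw")
for descriptor_id, raw_contents in read_cur:
    desc = RelayDescriptor(raw_contents)
    # back-fill a (hypothetical) newly added column from the re-parsed descriptor
    write_cur.execute("UPDATE descriptor SET platform = %s "
                      "WHERE descriptor_id = %s", (desc.platform, descriptor_id))
conn.commit()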

Kostas.

On Tue, Jun 11, 2013 at 6:02 PM, Damian Johnson ata...@torproject.org wrote:

  I can try experimenting with this later on (when we have the full /
 needed
  importer working, e.g.), but it might be difficult to scale indeed (not
  sure, of course). Do you have any specific use cases in mind? (actually
  curious, could be interesting to hear.)

 The advantages of being able to reconstruct Descriptor instances is
 simpler usage (and hence more maintainable code). Ie, usage could be
 as simple as...

 

 from tor.metrics import descriptor_db

 # Fetches all of the server descriptors for a given date. These are
 provided as
 # instances of...
 #
 #   stem.descriptor.server_descriptor.RelayDescriptor

 for desc in descriptor_db.get_server_descriptors(2013, 1, 1):
   # print the addresses of only the exits
   if desc.exit_policy.is_exiting_allowed():
     print desc.address

 

 Obviously we'd still want to do raw SQL queries for high traffic
 applications. However, for applications where maintainability trumps
 speed this could be a nice feature to have.

  * After making the schema update the importer could then run over this
  raw data table, constructing Descriptor instances from it and
  performing updates for any missing attributes.
 
  I can't say I can easily see the specifics of how all this would work,
 but
  if we had an always-up-to-date data model (mediated by Stem Relay
 Descriptor
  class, but not necessarily), this might work.. (The ORM - Stem
 Descriptor
  object mapping itself is trivial, so all is well in that regard.)

 I'm not sure if I entirely follow. As I understand it the importer...

 * Reads raw rsynced descriptor data.
 * Uses it to construct stem Descriptor instances.
 * Persists those to the database.

 My 

Re: [tor-dev] Metrics Plans

2013-06-10 Thread Kostas Jakeliunas
Hi!

 Maybe we should focus on a 'grand unified backend' rather than
  splitting Kostas' summer between both a backend and frontend? If he
  could replace the backends of the majority of our metrics services
  then that would greatly simplify the metrics ecosystem.
 
  I'm mostly interested in the back-end, too.  But I think it won't be as
  much fun for Kostas if he can't also work on something that's visible to
  users.  I don't know what he prefers though.
 
  Honestly, I would actually be up for focusing, if need be, exclusively on
  the backend part. It would also probably (hopefully) prove to be the most
  beneficial to the overall ecosystem of tools. However, such a plan would
  imply that the final goal (ideally) is to have a replacement for Onionoo,
  which means that it would have to be reliably stable and scalable, so
 that
  multiple frontends could all use it at once. (It will have to be stable
 in
  any case, of course.) I think this would be a great goal, but if we can
  define and isolate development stages to a great extent, I think having
 two
  goals: (a) Onionoo replacement; (b) descriptor search+browse frontend -
 at
  the same time is OK, and either one of them could be dropped/reduced
 during
  the process -

 I think I understand, but I'm not sure.  Just to get this right, is
 either of these states the planned end state of your GSoC project?

 1) descriptor database supporting efficient queries, separate API
 similar to Onionoo's, front-end application using new search parameters;

 2) descriptor database supporting efficient queries, full integration
 with Onionoo API, no special front-end application using new search
 parameters; or

 3) descriptor database supporting efficient queries, full integration
 with Onionoo API, front-end application using Onionoo's new search
 parameters.


Yes - thanks for helping to nicely articulate them by the way - in the
sense that *any* of these end states would qualify, from my perspective at
least, as a success for this project. As I said, I think it is possible to
work on things without fear of making redundant effort while also not
restricting ourselves to one particular end state of the three, until some
significantly later point in time. This is because it is possible to
firstly do the efficient database, then implement a subset of the
Onionoo-like API (with a possibility for diverging from the Onionoo
standard later if a need arises at some point later on), and finally -
optionally/hopefully - work on the client-side frontend application. I'd
still like to do the frontend if the rest can be done in a subset of the
whole timeline; I'd also perhaps like to work/tinker on it after the
official GSoC timeline; but if (in mid-summer) it turns out that making an
Onionoo replacement is possible (the new backend/database scales well for
complex queries and so on, and implementing the whole Onionoo API is
realistic/easy), I can simply focus on the backend.

 Note that there's no Onionoo client that uses bridge data, yet.  We have
 been planning to add bridge support to Atlas for a while, but this
 hasn't happened yet.

 But in general, bridge data is quite similar to relay data.  There are
 some specifics because of sanitized descriptor parts, but in general,
 data structures are similar.

Understood. Bridge data / sanitized descriptors seem similar indeed, should
fit in nicely.

I think it's an advantage here that Onionoo itself has a front-end and a
 back-end part.  The back-end processes data once per hour and writes it
 to the file system.  The front-end is a single Java servlet that does
 all the filtering and sorting in memory and reads larger JSON files from
 disk.  What we could do is: keep the back-end running, so that it keeps
 producing details, bandwidth, and weights files, and only replace the
 servlet by a Python thing that also knows how to respond to more complex
 search queries.


Yes, this sounds great! Basically delegating bandwidth and weights
calculation to what we have already, and focusing on queries etc. I will
have to look into the actual Onionoo backend implementation, namely, how
much of the "produce static JSON files including descriptor data" part can be
reused.

In any case, I don't think that having Onionoo(-compatibility, etc.) as an
additional set of variables / potential deliverables should pose a problem.

This was a vague/generic reply, but I will eventually follow up with more
things.

Kostas.

On Wed, May 29, 2013 at 5:34 PM, Karsten Loesing kars...@torproject.org wrote:

 On 5/29/13 4:05 AM, Kostas Jakeliunas wrote:
  Hello!
  (@tor-dev: will also write a separate email, introducing the GSoC project
  at hand.)
 
  This GSoc idea started a year back as a searchable descriptor search
  application, totally unrelated to Onionoo.  It was when I read Kostas'
  proposal that I started thinking about an integration with Onionoo.
  That's why the plan is still a bit vague.  We should work together with
  Kostas very soon

Re: [tor-dev] Remote descriptor fetching

2013-06-10 Thread Kostas Jakeliunas
Hi folks!

 Indeed, this would be pretty bad.  I'm not convinced that moria1
 provides truncated responses though.  It could also be that it
 compresses results for every new request and that compressed responses
 randomly differ in size, but are still valid compressions of the same
 input.  Kostas, do you want to look more into this and open a ticket if
 this really turns out to be a bug?

I did check each downloaded file, each was different in size etc., but not
all of them were valid, from a shallow look at things (just chucking the
file to zlib and seeing what comes out).

Ok, I'll try looking into this. :) Do note that exams etc. are still
ongoing, so this will get pushed back; if anybody figures things out
earlier, then great!

 Tor clients use the ORPort to fetch descriptors. As I understand it
 the DirPort has been pretty well unused for years, in which case a
 regression there doesn't seem that surprising. Guess we'll see.

Noted - OK, will see!

Re: python url request parallelization: @Damian: in the past, when I wanted
to do concurrent urllib requests, I simply used threading.Thread. There
might be caveats here; I'm not familiar with the specifics. I can (again,
maybe quite a bit later) try cooking something up to see if such a simple
parallelization approach would work. (I should probably just try and do it
when I have time; maybe it will turn out some specific solution is needed,
and you guys will have solved it by then anyway.)
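
For what it's worth, a minimal sketch of that simple threading approach (the URLs are only illustrative; moria1's DirPort address is taken from the earlier thread):

# Sketch only: fetch a couple of directory resources concurrently with plain
# threading.Thread.
import threading
import urllib2

URLS = [
    'http://128.31.0.34:9131/tor/server/all.z',
    'http://128.31.0.34:9131/tor/status-vote/current/consensus.z',
]

results = {}

def fetch(url):
    # each thread stores the raw (still compressed) response body
    results[url] = urllib2.urlopen(url).read()

threads = [threading.Thread(target=fetch, args=(url,)) for url in URLS]
for t in threads:
    t.start()
for t in threads:
    t.join()

for url, body in results.items():
    print('%s: %d bytes' % (url, len(body)))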

Cheers
Kostas.
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Metrics Plans

2013-06-10 Thread Kostas Jakeliunas
Ah, forgot to add my footnote to the dirspec - we all know the link, but in
any case:

[1]: https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt

This was in the context of discussing which fields from 2.1 to include.

On Tue, Jun 11, 2013 at 12:34 AM, Kostas Jakeliunas
kos...@jakeliunas.com wrote:

  Here, I think it is realistic to try and use and import all the fields
 available from metrics-db-*.
  My PoC is overly simplistic in this regard: only relay descriptors, and
 only a limited subset of data fields is used in the schema, for the import.

 I'm not entirely sure what fields that would include. Two options come
 to mind...

 * Include just the fields that we need. This would require us to
 update the schema and perform another backfill whenever we need
 something new. I don't consider this 'frequent backfill' requirement
 to be a bad thing though - this would force us to make it extremely
 easy to spin up a new instance which is a very nice attribute to have.

 * Make the backend a more-or-less complete data store of descriptor
 data. This would mean schema updates whenever there's a dir-spec
 addition [1]. An advantage of this is that the ORM could provide us
 with stem Descriptor instances [2]. For high traffic applications
 though we'd probably still want to query the backend directly since we
 usually won't care about most descriptor attributes.


 In truth, I'm not sure here, either.  I agree that it basically boils down
 to either of the two aforementioned options. I'm okay with any of them. I'd
 like to, however, see how well the db import scales if we were to import
 all relay descriptor fields. There aren't a lot of them (dirspec [1]), if
 we don't count extra-info of course and only want to deal with the Router
 descriptor format (2.1). So I think I should try working with those
 fields, and see if the import goes well and quickly enough. I plan to do
 simple python timeit / timing report macros that can be attached to /
 detached from functions easily; that would be a simple and clean way to
 measure things and so on.
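
For illustration, a minimal sketch of such a timing helper as a decorator; the function and path below are made up:

# Sketch only: a timing report decorator that can be attached to / detached
# from import functions.
import functools
import time

def timed(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            print('%s took %.3fs' % (func.__name__, time.time() - start))
    return wrapper

@timed
def import_server_descriptors(path):
    # placeholder for the actual import step
    time.sleep(0.1)

import_server_descriptors('/srv/metrics/server-descriptors-2013-05')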

  [...] An advantage of [more-or-less complete data store of descriptor
  data] is that the ORM could provide us

  with stem Descriptor instances [2]. For high traffic applications
  though we'd probably still want to query the backend directly since we
  usually won't care about most descriptor attributes.

 I can try experimenting with this later on (when we have the full / needed
 importer working, e.g.), but it might be difficult to scale indeed (not
 sure, of course). Do you have any specific use cases in mind? (actually
 curious, could be interesting to hear.) [2] fn is noted, I'll think about
 it.


  The idea would be import all data as DB fields (so, indexable), but it
 makes sense to also import raw text lines to be able to e.g. supply the
 frontend application with raw data if needed, as the current tools do. But
 I think this could be made to be a separate table, with descriptor id as
 primary key, which means this can be done later on if need be, would not
 cause a problem. I guess there's no need to this right now.

 I like this idea. A couple advantages that this could provide us are...

 * The importer can provide warnings when our present schema is out of
 sync with stem's Descriptor attributes (ie. there has been a new
 dir-spec addition).

 * After making the schema update the importer could then run over this
 raw data table, constructing Descriptor instances from it and
 performing updates for any missing attributes.


 The 'schema/format mismatch report' idea sounds like a really good idea!
 Surely if we are to try for Onionoo compatibility / eventual replacement,
 but in any case, this seems like a very useful thing for the future. I will
 keep this in mind for the nearest future / database importer rewrite.

  * After making the schema update the importer could then run over this
  raw data table, constructing Descriptor instances from it and
  performing updates for any missing attributes.

 I can't say I can easily see the specifics of how all this would work, but
 if we had an always-up-to-date data model (mediated by Stem Relay
 Descriptor class, but not necessarily), this might work.. (The ORM - Stem
 Descriptor object mapping itself is trivial, so all is well in that regard.)

 On Wed, May 29, 2013 at 5:49 PM, Damian Johnson ata...@torproject.org wrote:

  Here, I think it is realistic to try and use and import all the fields
 available from metrics-db-*.
  My PoC is overly simplistic in this regard: only relay descriptors, and
 only a limited subset of data fields is used in the schema, for the import.

 I'm not entirely sure what fields that would include. Two options come
 to mind...

 * Include just the fields that we need. This would require us to
 update the schema and perform another backfill whenever we need
 something new. I don't consider this 'frequent backfill' requirement
 to be a bad thing though - this would force

Re: [tor-dev] Metrics Plans

2013-05-28 Thread Kostas Jakeliunas
Hello!
(@tor-dev: will also write a separate email, introducing the GSoC project
at hand.)

This GSoc idea started a year back as a searchable descriptor search
 application, totally unrelated to Onionoo.  It was when I read Kostas'
 proposal that I started thinking about an integration with Onionoo.
 That's why the plan is still a bit vague.  We should work together with
 Kostas very soon to clarify the plan.


Indeed, as it currently stands, the extent of the proposed backend part of
the searchable descriptor project is unclear. The original plan was not to
aim for a universal backend which could ideally, for example, service
existing web-side Metrics etc. project applications. The idea was to
hopefully be able to replace relay and consensus search/lookup tools with a
single and more powerful search and browse descriptor archives
application.

However I completely agree that an integrated, reusable backend sounds more
exciting and could potentially/hopefully make the broader Tor metrics-*
ecosystem more uniform, if that's the word - reducing the tool/component
count. I think this is doable if the tasks/steps of this project are
somewhat isolated, so that incremental development can happen, and it's not
an all-or-nothing gamble (obviously that is the way it is intended to be,
but I think this would be an important aspect of this project in particular
as well.)

 Maybe we should focus on a 'grand unified backend' rather than
  splitting Kostas' summer between both a backend and frontend? If he
  could replace the backends of the majority of our metrics services
  then that would greatly simplify the metrics ecosystem.

 I'm mostly interested in the back-end, too.  But I think it won't be as
 much fun for Kostas if he can't also work on something that's visible to
 users.  I don't know what he prefers though.


Honestly, I would actually be up for focusing, if need be, exclusively on
the backend part. It would also probably (hopefully) prove to be the most
beneficial to the overall ecosystem of tools. However, such a plan would
imply that the final goal (ideally) is to have a replacement for Onionoo,
which means that it would have to be reliably stable and scalable, so that
multiple frontends could all use it at once. (It will have to be stable in
any case, of course.) I think this would be a great goal, but if we can
define and isolate development stages to a great extent, I think having two
goals: (a) Onionoo replacement; (b) descriptor search+browse frontend - at
the same time is OK, and either one of them could be dropped/reduced during
the process - this is what I'd have in mind, generally speaking, in terms
of general, let's say incremental deliverables / sub-projects, which can be
done sequentially:

1. Work out the relay schema for (a) relay descriptors; (b)
consensus-statuses; (c) *bridge summaries; (d) *bridge network statuses;

Here, I think it is realistic to try and use and import all the fields
available from metrics-db-*. My PoC is overly simplistic in this regard:
only relay descriptors, and only a limited subset of data fields is used in
the schema, for the import. I think it is realistic to import bridge data
used and reported by Onionoo. Here is the good, 'incremental' part I think:
the Onionoo protocol/design is useful in itself, as a clean relay
processing (what comes in and in what form it comes out) design. I think
it makes sense to do the DB schema having the fields used and reported by
Onionoo in mind. Even if the project ends up not aiming to even be
compatible with Onionoo (in terms of its API endpoints, or perhaps not
reporting everything (e.g. guard probability) - though I would like to aim
for compatibility, as would all of you, I suppose!), I think there should
be little to no duplication of effort when designing the schema and the
descriptor/data import part of the backend. The bridge data can later be
dropped. I will soon try looking closer if the schema can be made such that
it may later be very easily *extended* to include bridges data, but it
might be safer to at least have the whole schema from the beginning for
processing db-R, db-B and db-P, and e.g. simply not work on actual bridge
data import at first (depending on priorities.)

2. Implement data import part: so again, the focus would be on importing
all possible fields available from, most importantly, metrics-db-R. More
fields in relay descriptors, and also consensus statuses. Descriptors (IDs)
in consensuses will refer to relay descriptors; it must be possible to
efficiently query the consensus table as well, to ask "in which statuses has
this descriptor been present?"
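
As a rough illustration of that kind of lookup, with placeholder table/column names and an example fingerprint:

# Sketch only: "in which statuses has this descriptor been present?"
import psycopg2

fingerprint = '0123456789ABCDEF0123456789ABCDEF01234567'  # example value

cur = psycopg2.connect("dbname=tordir").cursor()
cur.execute("SELECT validafter FROM statusentry "
            "WHERE fingerprint = %s ORDER BY validafter", (fingerprint,))
for (validafter,) in cur.fetchall():
    print(validafter)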

These two parts are crucial whether the project is to aim for Onionoo
replacement, and/or also provide a search & browse frontend.

3. Implement Onionoo-compatible search queries, and (maybe only) a subset
of result fields. Again, I don't see why using the Onionoo protocol/design
shouldn't work here in any case. (Other Onionoo-specific nuanses, like

[tor-dev] Searchable Tor descriptor archive - GSoC 2013 project

2013-05-28 Thread Kostas Jakeliunas
Greetings!

I'm a student who will be working on the Searchable Tor descriptor archive
as part of Google Summer of Code. Yay!

I've been following Tor development for a while and hope that this
opportunity will be my way of sneaking into the development kitchen of Tor.
In any case, I hope to stay around for a longer time to come.

The original GSoC project proposal is based on one of the Tor project ideas
available [1] and is part of the Tor Metrics project [2]. The GSoC proposal
itself is also available to read [3] (TXT; if there's any interest, I can
work on reformatting.) My primary mentor is Karsten and my secondary mentor
is Damian.

I will quote the abstract from the proposal to sum up the high-level goals
of this project:

I'd like to create a more integrated and powerful descriptor archival
 search and browse system. (The current tools are very restrictive and the
 experience disjointed.) To do this, I'll write an archival browsing
 application wherein the results are interactive: they may act as further
 search filters. Together with a search string input tool which will have
 more filtering options, the application will provide a more cohesive
 archival browse & search experience and will be a more efficient tool.


So as of now, we have an array of tools for inspecting, searching for and
getting aggregate data about running relays. (For an overview, see the
Tools page in the Metrics portal. [4]) These tools include relay search,
consensus info, exit-by-IP search, and quite a few more; furthermore, two
Onionoo [5] based applications/tools: Atlas and Compass.

This project proposes to:

   - implement a more powerful backend that would allow one to search for
   all available relays since mid-2007 (I should have clarified in the
   previous discussions, and Karsten already includes this bit; i.e., since v2
   statuses became available [6]; I guess this can also be discussed). "More
   powerful" here means, first and foremost, all (>= v2) archival data
   (relay descriptors and consensuses at the very least), and furthermore (at
   least per the original proposal), involving more complex queries: we'd be
   looking into, I think, minimally, combined AND/OR filters referring to a
   wider range of data fields available in the archival data and the ability
   to specify multiple date ranges. Referring to consensus-related data while
   searching for relays and vice versa would also be possible. (The
   capabilities would therefore also include those of exoneraTor.)

   - implement backend results which would, as of current standing, aim for
   Onionoo compatibility (again see protocol design in [5]), or perhaps
   supersede it while providing backwards compatibility (e.g. returning
   paginated lists of consensus-status-entries where a specified relay was
   present.)

   - (as per original proposal,) implement a more powerful archival
   descriptor search & browse tool (frontend) which would provide a more
   uniform looking up relays / searching by using many criteria / further
   refining search in the results page experience - refining search
   results, i.e. adjusting filters would be semantically the same as entering
   search criteria in the beginning; hence a more interactive experience, a
   more powerful search/browse tool.


The goals and design of the project have to be clarified, however. There is
ongoing discussion (see another tor-dev thread [7] e.g.) whether perhaps
the focus could be to create a backend which would speak the full Onionoo
protocol and therefore be a potential replacement not only for relay search
and exoneraTor, but also for other components: all presently-speaking
Onionoo applications could be made to talk to the new backend, for example.
The overall count of components will hopefully be reduced in any case, but
ideally, we would end up with a much more integrated Tor Metrics (and maybe
beyond) ecosystem.
Many open questions, however - see again [7]. Obviously discussions are
very welcome indeed!

I'm wfn on OFTC (#tor-dev, #nottor), also reachable via XMPP 
phistophe...@jabber.org, and am very much up for any kind of chat. :) I'll
be busy with exams in the first three weeks of June, though - but will find
time for sure!

Regards
Kostas.


[1] https://www.torproject.org/getinvolved/volunteer#metricsSearch

[2] https://metrics.torproject.org/

[3] http://kostas.mkj.lt/gsoc2013.txt

[4] https://metrics.torproject.org/tools.html

[5] https://onionoo.torproject.org/

[6] https://metrics.torproject.org/data.html#relaydesc

[7] https://lists.torproject.org/pipermail/tor-dev/2013-May/004940.html
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Remote descriptor fetching

2013-05-28 Thread Kostas Jakeliunas
On Tue, May 28, 2013 at 2:50 AM, Damian Johnson ata...@torproject.org wrote:

 So far, so good. By my read of the man pages this means that gzip or
 python's zlib module should be able to handle the decompression.
 However, I must be missing something...

 % wget http://128.31.0.34:9131/tor/server/all.z

 [...]

 % python
  import zlib
  with open('all.z') as desc_file:
 ...   print zlib.decompress(desc_file.read())
 ...
 Traceback (most recent call last):
   File stdin, line 2, in module
 zlib.error: Error -5 while decompressing data: incomplete or truncated
 stream


This seemed peculiar, so I tried it out. Each time I wget all.z from that
address, it's always a different one; I guess that's how it should be, but
it seems that sometimes not all of it gets downloaded (hence the actually
legit zlib error.)

I was able to make it work after my second download attempt (with your
exact code); zlib handles it well. So far it's worked every time since.

This is probably not good if the source may sometimes deliver an incomplete
stream.

TL;DR try wget'ing multiple times and getting even more puzzled (?)
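
In the meantime, a minimal retry loop along those lines (same URL as in the quoted snippet; the retry count is arbitrary):

# Sketch only: re-fetch all.z until zlib accepts it as a complete stream,
# since the response sometimes appears to arrive truncated.
import urllib2
import zlib

URL = 'http://128.31.0.34:9131/tor/server/all.z'

descriptors = None
for attempt in range(5):
    data = urllib2.urlopen(URL).read()
    try:
        descriptors = zlib.decompress(data)
        break
    except zlib.error as exc:
        print('attempt %d failed: %s' % (attempt + 1, exc))

if descriptors is not None:
    print('got %d bytes of descriptors' % len(descriptors))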
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Iran

2013-05-05 Thread Kostas Jakeliunas
 have there been any attempts to produce a pluggable transport which would
emulate http?

(Ah, I suppose there's been quite a bit of discussion indeed. (
https://trac.torproject.org/projects/tor/ticket/8676, etc.))

On Sun, May 5, 2013 at 9:58 PM, Kostas Jakeliunas kos...@jakeliunas.com wrote:

  If we had a PT that encapsulated obfs3 inside
 the body of http then this may work.

 I'm probably missing some previous discussions which might have covered
 it, but: have there been any attempts to produce a pluggable transport
 which would emulate http? Basically, have the transport use http headers,
 and put all encrypted data in the body (possibly prepending it with some
 html tags even)? This sounds like a nice idea.


 On Sun, May 5, 2013 at 9:41 PM, Matthew Finkel 
  matthew.fin...@gmail.com wrote:

 On Sun, May 05, 2013 at 04:18:56PM +0300, George Kadianakis wrote:
  tor-admin tor-ad...@torland.me writes:
 
   On Sunday 05 May 2013 14:50:51 George Kadianakis wrote:
   It would be interesting to learn which ports they currently
 whitelist,
   except from the usual HTTP/HTTPS.
  
   I also wonder if they just block based on TCP port, or whether they
   also have DPI heuristics.
  
   On the Tor side, it seems like we should start looking into #7875:
   https://trac.torproject.org/projects/tor/ticket/7875
   ___
   I am wondering if here is there a way for a user to ask bridgedb for
 a bridge
   with a specific port?
   ___
   tor-dev mailing list
   tor-dev@lists.torproject.org
   https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
 
  If I remember correctly BridgeDB tries (in a best-effort manner) to
  give users bridges that are listening on port 443. Obfuscated bridges
  that bind on 443 are not very common (because of #7875) so I guess
  that not many obfuscated bridges on 443 are given out.
 
  In any case, I don't think that a user can explicitly ask BridgeDB for
  a bridge on a specific port, but this might be a useful feature
  request (especially if this filtering based on TCP port tactic
  continues).

 This may be a good feature to have, in general, but it does not sound like
 this will solve the current problem in Iran. The last report says
 they're whitelisting ports *and* protocols[1]. So even if a user attempts
 to use obfs3 on port 443 it'll likely be blocked because obfs3 is not a
 look-like-https protocol. If we had a PT that encapsulated obfs3 inside
 the body of http then this may work. CDA also says SSL/TLS connections
 are throttled to 5% of the normal speed [2], so that's no fun either.

 [1] https://twitter.com/CDA/status/331006059923795968
 [2] https://twitter.com/CDA/status/331040305648369664
 ___
 tor-dev mailing list
 tor-dev@lists.torproject.org
 https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev



___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Iran

2013-05-05 Thread Kostas Jakeliunas
 If we had a PT that encapsulated obfs3 inside
the body of http then this may work.

I'm probably missing some previous discussions which might have covered it,
but: have there been any attempts to produce a pluggable transport which
would emulate http? Basically, have the transport use http headers, and put
all encrypted data in the body (possibly prepending it with some html tags
even)? This sounds like a nice idea.

On Sun, May 5, 2013 at 9:41 PM, Matthew Finkel matthew.fin...@gmail.com wrote:

 On Sun, May 05, 2013 at 04:18:56PM +0300, George Kadianakis wrote:
  tor-admin tor-ad...@torland.me writes:
 
   On Sunday 05 May 2013 14:50:51 George Kadianakis wrote:
   It would be interesting to learn which ports they currently whitelist,
   except from the usual HTTP/HTTPS.
  
   I also wonder if they just block based on TCP port, or whether they
   also have DPI heuristics.
  
   On the Tor side, it seems like we should start looking into #7875:
   https://trac.torproject.org/projects/tor/ticket/7875
   ___
   I am wondering if here is there a way for a user to ask bridgedb for a
 bridge
   with a specific port?
   ___
   tor-dev mailing list
   tor-dev@lists.torproject.org
   https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
 
  If I remember correctly BridgeDB tries (in a best-effort manner) to
  give users bridges that are listening on port 443. Obfuscated bridges
  that bind on 443 are not very common (because of #7875) so I guess
  that not many obfuscated bridges on 443 are given out.
 
  In any case, I don't think that a user can explicitly ask BridgeDB for
  a bridge on a specific port, but this might be a useful feature
  request (especially if this filtering based on TCP port tactic
  continues).

 This may be a good feature to have, in general, but it does not sound like
 this will solve the current problem in Iran. The last report says
 they're whitelisting ports *and* protocols[1]. So even if a user attempts
 to use obfs3 on port 443 it'll likely be blocked because obfs3 is not a
 look-like-https protocol. If we had a PT that encapsulated obfs3 inside
 the body of http then this may work. CDA also says SSL/TLS connections
 are throttled to 5% of the normal speed [2], so that's no fun either.

 [1] https://twitter.com/CDA/status/331006059923795968
 [2] https://twitter.com/CDA/status/331040305648369664
 ___
 tor-dev mailing list
 tor-dev@lists.torproject.org
 https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] Iran

2013-05-05 Thread Kostas Jakeliunas
(Sorry, last email for now --) I see that StegoTorus is an Obfsproxy fork
that extends it to a) split Tor streams across multiple connections to
avoid packet size signatures, and b) embed the traffic flows in traces that
look like html, javascript, or pdf. However, its public repo seems not to
have been updated for more than nine months. [1] Also,
'Format-Transforming Encryption' looks interesting, but I take it there's not
much in terms of implementation beyond a research paper [2] (which also looks
interesting).

[1] https://gitweb.torproject.org/stegotorus.git
[2] https://eprint.iacr.org/2012/494

On Sun, May 5, 2013 at 10:08 PM, Kostas Jakeliunas kos...@jakeliunas.com wrote:

  have there been any attempts to produce a pluggable transport which
 would emulate http?

 (Ah, I suppose there've been quite a bit of discussion indeed. (
 https://trac.torproject.org/projects/tor/ticket/8676, etc.))


 On Sun, May 5, 2013 at 9:58 PM, Kostas Jakeliunas 
  kos...@jakeliunas.com wrote:

  If we had a PT that encapsulated obfs3 inside
 the body of http then this may work.

 I'm probably missing some previous discussions which might have covered
 it, but: have there been any attempts to produce a pluggable transport
 which would emulate http? Basically, have the transport use http headers,
 and put all encrypted data in the body (possibly prepending it with some
 html tags even)? This sounds like a nice idea.


 On Sun, May 5, 2013 at 9:41 PM, Matthew Finkel 
  matthew.fin...@gmail.com wrote:

 On Sun, May 05, 2013 at 04:18:56PM +0300, George Kadianakis wrote:
  tor-admin tor-ad...@torland.me writes:
 
   On Sunday 05 May 2013 14:50:51 George Kadianakis wrote:
   It would be interesting to learn which ports they currently
 whitelist,
   except from the usual HTTP/HTTPS.
  
   I also wonder if they just block based on TCP port, or whether they
   also have DPI heuristics.
  
   On the Tor side, it seems like we should start looking into #7875:
   https://trac.torproject.org/projects/tor/ticket/7875
   ___
   I am wondering if here is there a way for a user to ask bridgedb for
 a bridge
   with a specific port?
   ___
   tor-dev mailing list
   tor-dev@lists.torproject.org
   https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
 
  If I remember correctly BridgeDB tries (in a best-effort manner) to
  give users bridges that are listening on port 443. Obfuscated bridges
  that bind on 443 are not very common (because of #7875) so I guess
  that not many obfuscated bridges on 443 are given out.
 
  In any case, I don't think that a user can explicitly ask BridgeDB for
  a bridge on a specific port, but this might be a useful feature
  request (especially if this filtering based on TCP port tactic
  continues).

 This may be a good feature to have, in general, but it does not sound
 like
 this will solve the current problem in Iran. The last report says
 they're whitelisting ports *and* protocols[1]. So even if a user attempts
 to use obfs3 on port 443 it'll likely be blocked because obfs3 is not a
 look-like-https protocol. If we had a PT that encapsulated obfs3 inside
 the body of http then this may work. CDA also says SSL/TLS connections
 are throttled to 5% of the normal speed [2], so that's no fun either.

 [1] https://twitter.com/CDA/status/331006059923795968
 [2] https://twitter.com/CDA/status/331040305648369664
 ___
 tor-dev mailing list
 tor-dev@lists.torproject.org
 https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev




___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


[tor-dev] GSoC 2013 / Tor project ideas - Searchable Tor descriptor archive - (pre-)proposal

2013-04-29 Thread Kostas Jakeliunas
Hello Karsten and everyone else :)

(TL;DR: would like to work on the searchable Tor descriptor archive project
idea; considering drafting up a GSoC application)

I'm a student & backend+frontend programmer from Lithuania who'd be very
much interested in contributing to the Tor project via Google Summer of
Code (well, ideally at least; the plan would be to volunteer some time to
Tor in any case, but it's yet to happen, and GSoC is simply too awesome an
opportunity not to try) -

The 'searchable Tor descriptor/metrics archive' project idea [1] would, I
think, best fit in with my previous experience and general interests in
terms of contributing to the Tor project. The searchable archive project
idea in itself has a rather clear list of goals / generic constraints, and
since I haven't contributed any code to the Tor project before, working
with an existing general project idea (building a more concrete design
proposal on top of it) probably makes most sense.

This particular project, I think, would match my previous Python backend
programming experience: building backends to work with large datasets /
databases -- crafting efficient ORMs and responsive APIs to interact with
them. [2]

Applying the knowledge/skills learned to something which is ideologically
close at heart and the purpose of which is very obvious to me sounds
thrilling! (This year, as far as Python frameworks are concerned, I've
mostly been exposed to and have been working with Flask - I have some (limited)
experience with Django from before that. As far as a proof-of-concept for the
searchable archive is concerned, I'm considering trying some things out
with Flask, since it allows me to do some quick prototyping.)

I'd like to try and work out an implementation/design draft for what I
could / would like to do (this is a preliminary email - I know I'm a bit
late!) Ideally it (and a simple proof of concept search form -
browseable/clickable results / relay descriptor navigation page) would
serve as the base for my GSoC application, but I have to be realistic about
me being rather late to apply and not having participated in either Tor
or GSoC before. I'd like to work out an application draft if possible,
though. (Were I to get accepted, I would be able to forgo any part-time
work this summer, or would only need to take passive care of a couple of
already running backends.)

I've read into the Tor Metrics portal pages (esp. Data Formats), and am
trying to get acquainted with the existing archiving solution (reading into
the 'metrics-web' java source (under
metrics-web/src/org/torproject/ernie/web) to see how the descriptor etc.
archives are currently parsed / imported into Postgres and so on), to first
and foremost be able to evaluate the scope of what I'd like to write.

I will presently work on a more specific list of constraints for the
searchable archive project idea. I can then try producing a GSoC
application draft.

Just to get an idea of what kind of system I'd be building / working on -
at the very least, we'd be looking into:

   - (re)building the archival / metrics data update system - the proposed
   method in [1] was a simple rsync over ssh / etc. to keep the data in sync
   with the descriptor data collection point. If possible, it would help if
   the rsync could work with uncompressed archives - rsync is intelligent
   enough not to need to send *that* much excess data - and diffing is more
   efficient with uncompressed data.
   A simple rsync script (can be run as a cron job) would work here.

   - a python script (probably to be run through cron) to import the
   archives into DB. Can stat files to only need to import new/modified ones,
   e.g. The good thing about such an approach is that the script could work as
   a semi-standalone (would still need the DB / ORM design), therefore could
   be used in conjunction with other, different tools - and it would be built
   as an atomic target during the implementation process - I heard you guys
   like modular project design proposals ;) who doesn't like them!
   We already have metrics-utils/exonerator/exonerator.py (which works as a
   semantically-aware descriptor archive grep tool) - some archive parsing
   logic can be reused maybe - the more pertinent thing here would be to

   - build the ORM for storing all the archival data in DB. Postgres is
   preferred and could work, especially since probably a large part of the
   current ORM logic could be used here (I've taken a glance at the current
   architecture, it makes good sense to me, but I haven't looked further,
   neither have I done any benchmarking with the existing ORM (except for some
   web-based relay search test queries which don't really count.))

   - it is very important to build an ORM which would scale well data-wise,
   and would suit our queries well.

   - query logic and types - the idea would be to allow incremental
   query-building - on the SQL level, WHERE clauses can be incrementally