On 8/13/13 3:17 PM, Kostas Jakeliunas wrote:
> On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing
> <[email protected]> wrote:
>>
>> I suggest putting pg_prewarm on the future work list. I sense there's a
>> lot of unused potential in stock PostgreSQL. Tweaking the database at
>> this point has the words "premature optimization" written on it in big
>> letters for me.
>>
>> Also, to be very clear here, a tool that requires custom tweaks to
>> PostgreSQL has minimal chances of running on torproject.org machines in
>> the future. The current plan is that we'll have a dedicated database
>> machine operated by our sysadmins that not even the service operator
>> will have shell access to.
>
> Oh, understood then, okay: no extensions, at least for now.
>
> Apropos: as far as my current (limited) understanding goes, it might be
> difficult to support, for example, nickname substring searches without a
> (supported, official) extension. One such extension is pg_trgm [1],
> which is in the contrib/ directory in 9.1 and is just one `make install`
> away. But for now, I'll assume this is not possible / we should avoid it.
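For future reference, the pg_trgm approach mentioned above would look roughly like this (a sketch only; the `descriptor` table and `nickname` column come from the indexes discussed later in this thread, everything else is illustrative):

```sql
-- Enable the extension (ships with postgresql-contrib-9.1):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A trigram GIN index on the nickname lets LIKE '%substr%' queries
-- use an index scan instead of a full sequential scan:
CREATE INDEX descriptor_nickname_trgm
    ON descriptor USING gin (lower(nickname) gin_trgm_ops);

-- Substring search that can then use the index:
SELECT fingerprint, nickname
FROM descriptor
WHERE lower(nickname) LIKE '%moria%';
```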
Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more
likely that we can run something requiring this extension on a
torproject.org machine. Still, requiring extensions should be the last
resort if no other solution can be found. Leaving out searches for
nickname substrings is a valid solution for now.

>> So, why do you join descriptors and network statuses in the search
>> process? At the Munich dev meeting I suggested joining the tables
>> already in the import process. What do you think about that idea?
>
> Yes, I had made a half-hearted attempt to normalize the two tables some
> time ago, for a small number of descriptors and status entries; I'll be
> trying out this scheme in full (I will need to re-import a major part of
> the data (which I didn't do then) to see if it scales well) after I try
> something else.

Okay.

> (Namely, using a third table of unique fingerprints (the statusentry
> table currently holds ~170K unique fingerprints vs. ~67M rows in total)
> and (non-unique) nicknames for truly quick fingerprint lookup and
> nickname search; I did experiment with this as well, but I worked with
> a small subset of the overall data in that case, too, and I think I can
> do a better job now.)
>
> It had seemed to me that the bottleneck was in having to sort too large
> a number of rows, but I now understand (if only just a bit) more about
> the 'explain analyze' output and can see that the 'Nested Loop' step,
> which is what performs the join in the query discussed, is expensive
> and is part of the bottleneck, so to speak. So I'll look into that
> after properly benchmarking things with the third table.

Sounds like another fine thing to test, I agree.

> (By the way, for future reference, we do have to test out different
> ideas on a substantial subset of the overall data, as the scaling
> behavior is not, so to say, linear.) :)

Good point. Yes, we should keep this in mind for the future.
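The "third table" idea described above could be sketched like this (hypothetical schema; the ~170K vs. ~67M figures are from the discussion, and the `fingerprint_id` column is an assumed foreign key, not existing code):

```sql
-- One row per unique fingerprint (~170K rows instead of ~67M):
CREATE TABLE fingerprint (
    id          serial PRIMARY KEY,
    fingerprint character(40) NOT NULL UNIQUE,
    nickname    character varying(19)  -- not unique; last known nickname
);

-- A nickname search then scans the small table first and only
-- afterwards joins against the ~67M-row statusentry archive:
SELECT s.validafter
FROM statusentry s
JOIN fingerprint f ON f.id = s.fingerprint_id
WHERE lower(f.nickname) = lower('moria1');
```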
>>> https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
>>>
>>> We use the following indexes while executing that query:
>>>
>>>  * lower(nickname) on descriptor
>>>
>>>  * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on
>>> statusentry
>>
>> Using only the first 12 characters sounds like a fine approach to
>> speed things up. But why 12? Why not 10 or 14? This is probably
>> something you should annotate as a parameter to find a good value for
>> later in the process. (I'm not saying that 12 is a bad number. It's
>> perfectly fine for now, but it might not be the best number.)
>
> Yes, this is as unscientific as it gets. As of now, we're using a raw
> SQL query, but I'll be encapsulating it properly soon (so we can
> easily attach different WHERE clauses, etc.), at which point I'll make
> it into a parameter. I did do some tests, but nothing extensive; I
> just made sure the indexes can fit into memory whole, which was the
> main constraint. Will do some tests.
>
>> Also, would it keep indexes smaller if you took something other than
>> base16 encoding for fingerprints? What about base64? Or is there a
>> binary type in PostgreSQL that works fine for indexes?
>
> Re: the latter, no binary type for B-trees (which is the default index
> type in pgsql) as far as I can see. But it's a good idea / approach,
> so I'll look into it, thanks! On the whole though, as long as all the
> indexes occupy only a subset of pgsql's internal buffers, there
> shouldn't be a problem / that's not the problem, afaik. But if we're
> making a well-researched ORM/database design, I should look into it.
>
>> Do you have a list of searches you're planning to support?
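The prefix index under discussion, with the prefix length called out as the parameter worth benchmarking, might look like this (a sketch; the `substr(..., 0, 12)` expressions are taken verbatim from the thread):

```sql
-- The prefix length (12 here) trades index size against the number of
-- false-positive matches that must be re-checked against full values:
CREATE INDEX statusentry_fp_digest_prefix
    ON statusentry (substr(fingerprint, 0, 12),
                    substr(lower(digest), 0, 12));

-- A query must repeat the same expressions to use the index, then
-- re-check the full fingerprint to discard prefix collisions:
SELECT validafter
FROM statusentry
WHERE substr(fingerprint, 0, 12) = substr('<full fingerprint>', 0, 12)
  AND fingerprint = '<full fingerprint>';
```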
> These are the ones that should *really* be supported:
>
>  - ?search=nickname
>  - ?search=fingerprint
>  - ?lookup=fingerprint
>  - ?search=address [done some limited testing, currently not focusing
>    on this]

The lookup parameter is basically the same as search=fingerprint with
the additional requirement that the fingerprint must be 40 characters
long. So this matches the current search parameter. I agree, these
would be good to support. You might also add another parameter
?address=address for ExoneraTor. That should, in theory, be just a
subset of the search parameter.

>  - ?running=<boolean>

This one is tricky. So far, Onionoo looks only at the very latest
consensus or bridge status to decide if a relay or bridge is running or
not. But now you're adding archives to Onionoo, so that people can
search for a certain consensus or bridge status in the past, or they
can search over a time interval of consensuses or bridge statuses. How
do you define that a relay or bridge is running, or, more importantly,
included as not running?

>  - ?flag=flag [every kind of clause which further narrows down the
>    query is not bad; the current db model supports all the flags that
>    Stem does, and each flag has its own column]

I'd say leave this one out until there's an actual use case.

>  - ?first_seen_days=range
>  - ?last_seen_days=range
>
> As per the plan, the db should be able to return a list of status
> entries / valid-after ranges (which can be used in
> {first,last}_seen_days) given some fingerprint.

Oh, I think there's a misunderstanding of these two fields. These
fields are only there to search for relays or bridges that first
appeared or were last seen on a given day. You'll need two new
parameters, say, from=datetime and to=datetime (or start=datetime and
end=datetime), to define a valid-after range for your search. Hope
this makes sense.
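The distinction between the two kinds of time parameters above could be sketched as follows (hypothetical queries against the statusentry table; the suggested from=/to= parameter names are from the discussion, the date literals are just examples):

```sql
-- ?from= / ?to= would bound the valid-after interval being searched:
SELECT fingerprint, validafter
FROM statusentry
WHERE validafter >= '2013-08-01 00:00:00'   -- ?from=
  AND validafter <= '2013-08-31 23:59:59'   -- ?to=
ORDER BY validafter;

-- whereas ?first_seen_days= constrains when a relay *first* appeared,
-- i.e. the minimum valid-after over all its status entries:
SELECT fingerprint, min(validafter) AS first_seen
FROM statusentry
GROUP BY fingerprint
HAVING min(validafter) >= now() - interval '30 days';
```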
All the best,
Karsten

_______________________________________________
tor-dev mailing list
[email protected]
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
