Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Josh Berkus wrote:
> Mischa,
>
> Okay, although given the track record of page-based sampling for
> n-distinct, it's a bit like looking for your keys under the
> streetlight, rather than in the alley where you dropped them :-)
> Bad analogy, but funny.
>
> The issue with page-based vs. pure random sampling is that to sample,
> for example, 10% of rows purely randomly would actually mean loading
> 50% of pages. With 20% of rows, you might as well scan the whole
> table.
>
> Unless, of course, we use indexes for sampling, which seems like a
> *really good* idea to me.

But doesn't an index only sample one column at a time, whereas with
page-based sampling you can sample all of the columns at once? And not
all columns have indexes, though it could be assumed that if a column
doesn't have an index, it doesn't matter as much for calculations such
as n_distinct. But if you had 5 indexed columns in your table, doing it
index-wise would mean making 5 passes instead of just one.

Though I agree that page-based sampling is important for performance
reasons.

John =:->
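The rows-vs-pages arithmetic above can be checked with a small model. Assuming sampled rows are picked uniformly and independently (a simplification; the numbers depend entirely on how many rows fit per page), the chance that a given page holds at least one sampled row is 1 - (1 - f)^R for sample fraction f and R rows per page. The thread's "10% of rows means 50% of pages" figure corresponds to roughly 7 rows per page; with more rows per page the fraction is even higher.

```python
def expected_page_fraction(sample_frac, rows_per_page):
    """Probability that a given page contains at least one sampled row,
    under a toy model of uniform independent row sampling (not
    PostgreSQL's actual sampler)."""
    return 1.0 - (1.0 - sample_frac) ** rows_per_page

# With ~7 rows per page, a 10% row sample already touches ~52% of pages:
print(round(expected_page_fraction(0.10, 7), 2))   # 0.52
# ...and a 20% sample touches ~79% -- close to a full scan:
print(round(expected_page_fraction(0.20, 7), 2))   # 0.79
```

With wider pages (say 100 rows each), even a 10% row sample touches essentially every page, which is the point of the complaint.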
Re: [HACKERS] Recognizing range constraints (was Re: [PERFORM] Plan
Tom Lane wrote:
> "Jim C. Nasby" [EMAIL PROTECTED] writes:
>> On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
>>> Can anyone suggest a more general rule? Do we need for example to
>>> consider whether the relation membership is the same in two clauses
>>> that might be opposite sides of a range restriction? It seems like
>>>   a.x > b.y AND a.x < b.z
>>
>> In a case like this, you could actually look at the data in b and
>> see what the average range size is.
>
> Not with the current statistics --- you'd need some kind of
> cross-column statistics involving both y and z. (That is, I doubt it
> would be helpful to estimate the average range width by taking the
> difference of independently-calculated mean values of y and z ...)
> But yeah, in principle it would be possible to make a non-default
> estimate.
>
> regards, tom lane

Actually, I think he was saying to do a nested loop, and for each item
in the nested loop, re-evaluate whether an index or a sequential scan
is more efficient. I don't think Postgres re-plans once it has started,
though you could test this in a plpgsql function.

John =:->
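The "look at the data in b" idea can be illustrated with a toy sketch: rather than assuming a default selectivity for each inequality independently, sample rows of b and count how often a sampled a.x actually falls between y and z. The data, domain, and names below are invented for the example; the column names follow the thread's `a.x > b.y AND a.x < b.z` clause.

```python
import random

random.seed(42)

# Synthetic stand-ins for a.x and b.(y, z).
a_x = [random.uniform(0, 100) for _ in range(1000)]
b_yz = []
for _ in range(200):
    y = random.uniform(0, 100)
    b_yz.append((y, y + random.uniform(0, 10)))  # ranges ~5 units wide on average

def sampled_selectivity(a_sample, b_sample):
    """Fraction of (a, b) pairs satisfying b.y < a.x < b.z, measured
    directly on the samples instead of derived from per-column stats."""
    hits = sum(1 for x in a_sample for (y, z) in b_sample if y < x < z)
    return hits / (len(a_sample) * len(b_sample))

est = sampled_selectivity(a_x, b_yz)
# With ~5-unit ranges over a 0..100 domain, est comes out near 0.05 --
# far below what multiplying two default inequality guesses would give.
```

This is exactly the kind of cross-column information (the joint distribution of y and z per row) that Tom notes the per-column statistics cannot provide.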
Re: [HACKERS] [PERFORM] How to read query plan
Miroslav Šulc wrote:
> Tom Lane wrote:
>> ... I think the reason this is popping to the top of the runtime is
>> that the joins are so wide (an average of ~85 columns in a join tuple
>> according to the numbers above). Because there are lots of
>> variable-width columns involved, most of the time the fast path for
>> field access doesn't apply and we end up going to nocachegetattr ---
>> which itself is going to be slow because it has to scan over so many
>> columns. So the cost is roughly O(N^2) in the number of columns.
>
> As there are a lot of varchar(1) columns in the AdDevicesSites table,
> wouldn't it be helpful to change them to char(1)? Would that solve the
> variable-width problem at least for some fields and speed the query
> up?

I'm guessing there really wouldn't be a difference. I think varchar()
and char() are stored the same way, just char() is always space-padded.
I believe they are both varlena types, so they are still variable
length.

>> As a short-term hack, you might be able to improve matters if you can
>> reorder your LEFT JOINs to have the minimum number of columns
>> propagating up from the earlier join steps. In other words, make the
>> later joins add more columns than the earlier ones, as much as you
>> can.
>
> That will be hard, as the main table, which contains most of the
> fields, is LEFT JOINed with the others. I'll look at it if I find some
> way to improve it.

One thing you could try is to select just the primary keys from the
main table, and then later join back to that table to get the rest of
the columns. It is a little bit hackish, but if it makes your query
faster, you might want to try it.

> I'm not sure whether I understand the process of performing the plan,
> but I imagine that the data from AdDevicesSites are retrieved only
> once when they are loaded and maybe stored in memory. Are the columns
> stored in the order they are in the SQL command? If so, wouldn't it be
> useful to move all varchar fields to the end of the SELECT query? I'm
> just guessing because I don't know at all how a database server is
> implemented and what it really does.

I don't think they are stored in the order of the SELECT portion. I'm
guessing they are loaded and saved as you go. But the order of the LEFT
JOINs at the end is probably important.

John =:->
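Tom's O(N^2) remark can be sketched with a toy model of attribute lookup (not PostgreSQL's actual code): once a variable-width column appears in a tuple, the byte offsets of all later columns can no longer be cached, so locating column k means walking over columns 0..k-1 again. The widths below are made up; only the shape of the cost is the point.

```python
def fetch_all_offsets(widths):
    """Toy model of nocachegetattr-style access: with no cached offsets,
    locating column k requires re-walking columns 0..k-1.
    Returns (offsets, total walk steps)."""
    ops = 0
    offsets = []
    for k in range(len(widths)):
        off = 0
        for w in widths[:k]:  # rescan from the start for every column
            off += w
            ops += 1
        offsets.append(off)
    return offsets, ops

# ~85 columns per join tuple, as in the thread; widths are invented.
_, ops = fetch_all_offsets([5] * 85)
# ops == 85*84/2 == 3570 walk steps -- quadratic in the column count,
# versus ~85 steps if every offset could be cached (fixed-width columns).
```

This is why narrowing the tuples that flow through the early joins (e.g. the select-primary-keys-then-join-back trick above) can pay off out of proportion to the bytes saved.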