Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql
On Thu, Mar 20, 2014 at 10:45 AM, Tom Lane wrote:
> Robert Haas writes:
>> So you might think that the problem here is that we're assuming
>> uniform density. Let's say there are a million rows in the table, and
>> there are 100 that match our criteria, so the first one is going to
>> happen 1/10,000th of the way through the table. Thus we set SC =
>> 0.0001 * TC, and that turns out to be an underestimate if the
>> distribution isn't as favorable as we're hoping. However, that is NOT
>> what we are doing. What we are doing is setting SC = 0. I mean, not
>> quite 0, but yeah, effectively 0. Essentially we're assuming that no
>> matter how selective the filter condition may be, it will match *the
>> very first row*.
>
> I think this is wrong. Yeah, the SC may be 0 or near it, but the time
> to fetch the first tuple is estimated as SC + (TC-SC)/N.

Hmm, you're right, and experimentation confirms that the total cost of
the limit comes out to about TC/selectivity. So scratch that theory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql
On Thu, Mar 20, 2014 at 8:51 PM, Tom Lane wrote:
> Atri Sharma writes:
>> Now, why cannot we take the estimate of all the buckets behind the
>> bucket in which our value is present? Will that estimate not give us
>> the fraction of tuples that are expected to be before the first
>> matching row?
>
> Uh, no, not unless you assume that the table happens to be perfectly
> sorted by the column's value.

Yes, that is true. So, if an attribute has an index present, can we do
this somehow?

Regards,

Atri

-- 
Regards,

Atri
*l'apprenant*
Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql
Atri Sharma writes:
> Now, why cannot we take the estimate of all the buckets behind the
> bucket in which our value is present? Will that estimate not give us
> the fraction of tuples that are expected to be before the first
> matching row?

Uh, no, not unless you assume that the table happens to be perfectly
sorted by the column's value.

			regards, tom lane
Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql
On Thu, Mar 20, 2014 at 8:10 PM, Robert Haas wrote:
> On Tue, Mar 18, 2014 at 2:41 PM, Tom Lane wrote:
>> Atri Sharma writes:
>>> One of the factors that leads to bad estimates is that the histogram
>>> of the values of a column maintained by the planner gets stale over
>>> time as the data in the column changes. So, the histogram is no
>>> longer an accurate view of the data, and it leads to bad selectivity
>>> estimates.
>>
>> TBH, this is so far down the list of problems that it'll be a long
>> time before we need to worry about it. It's certainly not the number
>> one priority for any project to model risk in the planner.
>>
>> The thing that I think is probably the number one problem is
>> estimates that depend on an assumption of uniform distribution of
>> sought-after rows among those encountered by a scan. This is usually
>> where bad plans for LIMIT queries are coming from. We could certainly
>> add some sort of fudge factor to those costs, but I'd like to have a
>> more-or-less principled framework for doing so.
>
> I think the problem is, in some sense, more basic than that. I think
> the kind of query we're talking about here is:
>
> SELECT * FROM foo WHERE unlikely ORDER BY indexed_column LIMIT 1
>
> Assume for the sake of argument that there are 100 rows that would be
> returned in the absence of the limit. Let SC and TC be the startup
> cost and total cost of the index scan. As a matter of general policy,
> we're going to say that the cost of this is SC + 0.01 * (TC - SC).
> What makes this path look appealing to the planner is that SC is
> small relative to TC. If we knew, for example, that we weren't going
> to find the first match until 90% of the way through the index scan,
> then we could set SC = 90% * TC and, all else being equal, the
> planner would make the right decision.
>
> So you might think that the problem here is that we're assuming
> uniform density. Let's say there are a million rows in the table, and
> there are 100 that match our criteria, so the first one is going to
> happen 1/10,000th of the way through the table. Thus we set SC =
> 0.0001 * TC, and that turns out to be an underestimate if the
> distribution isn't as favorable as we're hoping. However, that is NOT
> what we are doing. What we are doing is setting SC = 0. I mean, not
> quite 0, but yeah, effectively 0. Essentially we're assuming that no
> matter how selective the filter condition may be, it will match *the
> very first row*.

Cannot we reuse the same histogram we have in the planner right now for
this?

I mean, AFAIK, the heuristic we have is that we divide the histogram
into equal-size buckets and then find the bucket in which our predicate
value lies, then take some part of that bucket plus the rest of the
buckets before that bucket, right?

So, suppose the query is

SELECT * FROM table WHERE a > 10

we shall find the bucket that 10 lies in, right? Now, why cannot we
take the estimate of all the buckets behind the bucket in which our
value is present? Will that estimate not give us the fraction of tuples
that are expected to be before the first matching row?

It's pretty wild, but I wanted to know if my understanding of this
scenario is correct or not.

Regards,

Atri

-- 
Regards,

Atri
*l'apprenant*
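Atri's bucket-summing idea can be sketched against an equi-depth histogram. The function and boundary list below are illustrative names, not planner code, and (as Tom points out elsewhere in the thread) the result bounds the position of the first matching row only if the table happens to be sorted on the column:

```python
import bisect

def fraction_below(bounds, value):
    """Estimate the fraction of rows with column value < `value`, using
    an equi-depth histogram given as a sorted list of bucket boundaries.
    Each of the len(bounds) - 1 buckets holds the same fraction of rows."""
    nbuckets = len(bounds) - 1
    if value <= bounds[0]:
        return 0.0
    if value >= bounds[-1]:
        return 1.0
    # Index of the bucket containing `value`.
    i = bisect.bisect_right(bounds, value) - 1
    # Whole buckets below it, plus linear interpolation inside it.
    within = (value - bounds[i]) / (bounds[i + 1] - bounds[i])
    return (i + within) / nbuckets
```

For `WHERE a > 10` against boundaries `[0, 25, 50, 75, 100]`, this says about 10% of the rows sort below 10; that is the "buckets behind our bucket" estimate the message describes.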
Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql
Robert Haas writes:
> So you might think that the problem here is that we're assuming
> uniform density. Let's say there are a million rows in the table, and
> there are 100 that match our criteria, so the first one is going to
> happen 1/10,000th of the way through the table. Thus we set SC =
> 0.0001 * TC, and that turns out to be an underestimate if the
> distribution isn't as favorable as we're hoping. However, that is NOT
> what we are doing. What we are doing is setting SC = 0. I mean, not
> quite 0, but yeah, effectively 0. Essentially we're assuming that no
> matter how selective the filter condition may be, it will match *the
> very first row*.

I think this is wrong. Yeah, the SC may be 0 or near it, but the time
to fetch the first tuple is estimated as SC + (TC-SC)/N.

			regards, tom lane
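Tom's correction amounts to amortizing the run cost (TC - SC) evenly over the N rows the node would return, so the first tuple is charged SC + (TC-SC)/N rather than SC alone. A minimal sketch of that arithmetic (illustrative names, not PostgreSQL's actual costing code):

```python
def limit_cost(sc, tc, nrows, k=1):
    """Estimated cost of fetching the first k of nrows tuples from a
    node with startup cost sc and total cost tc: the run cost is spread
    evenly across the rows, so the first tuple costs sc + (tc-sc)/nrows."""
    return sc + (tc - sc) * k / nrows
```

With SC near zero, LIMIT 1 over 100 matching rows comes out to about TC/100, which lines up with Robert's experimental observation elsewhere in the thread that the total cost of the limit is roughly TC divided by the match count.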
Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql
On Tue, Mar 18, 2014 at 2:41 PM, Tom Lane wrote:
> Atri Sharma writes:
>> One of the factors that leads to bad estimates is that the histogram
>> of the values of a column maintained by the planner gets stale over
>> time as the data in the column changes. So, the histogram is no
>> longer an accurate view of the data, and it leads to bad selectivity
>> estimates.
>
> TBH, this is so far down the list of problems that it'll be a long
> time before we need to worry about it. It's certainly not the number
> one priority for any project to model risk in the planner.
>
> The thing that I think is probably the number one problem is
> estimates that depend on an assumption of uniform distribution of
> sought-after rows among those encountered by a scan. This is usually
> where bad plans for LIMIT queries are coming from. We could certainly
> add some sort of fudge factor to those costs, but I'd like to have a
> more-or-less principled framework for doing so.

I think the problem is, in some sense, more basic than that. I think
the kind of query we're talking about here is:

SELECT * FROM foo WHERE unlikely ORDER BY indexed_column LIMIT 1

Assume for the sake of argument that there are 100 rows that would be
returned in the absence of the limit. Let SC and TC be the startup cost
and total cost of the index scan. As a matter of general policy, we're
going to say that the cost of this is SC + 0.01 * (TC - SC). What makes
this path look appealing to the planner is that SC is small relative to
TC. If we knew, for example, that we weren't going to find the first
match until 90% of the way through the index scan, then we could set
SC = 90% * TC and, all else being equal, the planner would make the
right decision.

So you might think that the problem here is that we're assuming uniform
density. Let's say there are a million rows in the table, and there are
100 that match our criteria, so the first one is going to happen
1/10,000th of the way through the table. Thus we set SC = 0.0001 * TC,
and that turns out to be an underestimate if the distribution isn't as
favorable as we're hoping. However, that is NOT what we are doing. What
we are doing is setting SC = 0. I mean, not quite 0, but yeah,
effectively 0. Essentially we're assuming that no matter how selective
the filter condition may be, it will match *the very first row*.

So we're not assuming the average case and getting hosed when things
come out worse than average. We're assuming the *best* case. So unless
things happen to really swing in our favor, we get hosed.

Now it might be that a fudge factor of 2 or 1.5 or 10 or 3 or 17 is
appropriate, so that we actually assume we're going to have to scan a
little more of the index than we expect. That can perhaps be justified
by the possibility that there may actually be NO rows matching the
filter condition, and we'll have to try scanning the entire index to
get off the ground. We could also try to come up with a mathematical
model for that. But that fudge factor would presumably be a multiplier
on the effort of finding the first tuple. And right now we assume that
finding the first tuple will be trivial. So I think we should fix THAT
problem first, and then if that turns out to be insufficient, we can
worry about what further fudging is required.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
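The contrast in the message above can be put side by side in a short sketch. This follows Robert's arithmetic, which equates the fraction of the scan before the first match with the match density; the function name and the TC value are illustrative, not planner code:

```python
def uniform_startup_fraction(total_rows, matching_rows):
    """Fraction of the scan expected before the first match under the
    uniform-density argument in the message: with matching_rows matches
    spread through total_rows, the match density is matching_rows/total_rows,
    and that fraction of the run cost would be charged as startup cost."""
    return matching_rows / total_rows

# Robert's example: a million rows, 100 matches, and (say) TC = 10,000.
# Uniform density would set SC = 0.0001 * TC ...
sc_uniform = uniform_startup_fraction(1_000_000, 100) * 10_000.0
# ... whereas the costing he describes effectively charges SC ~ 0,
# i.e. the best case, not the average case.
```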
Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql
Atri Sharma writes:
> One of the factors that leads to bad estimates is that the histogram
> of the values of a column maintained by the planner gets stale over
> time as the data in the column changes. So, the histogram is no longer
> an accurate view of the data, and it leads to bad selectivity
> estimates.

TBH, this is so far down the list of problems that it'll be a long time
before we need to worry about it. It's certainly not the number one
priority for any project to model risk in the planner.

The thing that I think is probably the number one problem is estimates
that depend on an assumption of uniform distribution of sought-after
rows among those encountered by a scan. This is usually where bad plans
for LIMIT queries are coming from. We could certainly add some sort of
fudge factor to those costs, but I'd like to have a more-or-less
principled framework for doing so.

			regards, tom lane
Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql
On Tue, Mar 18, 2014 at 11:43 PM, Josh Berkus wrote:
>> On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane wrote:
>>> Yeah. I would like to see the planner's cost estimates extended to
>>> include some sort of uncertainty estimate, whereupon risk-averse
>>> people could ask it to prefer low-uncertainty plans over
>>> high-uncertainty ones (the plans we typically choose for ORDER BY
>>> ... LIMIT queries being great examples of the latter). But it's a
>>> long way from wishing that to making it so. Right now it's not even
>>> clear (to me anyway) how we'd measure or model such uncertainty.

I have been thinking of some ways to attach a risk estimate to each
selectivity that our planner gives. I think a way to do it is as
follows.

One of the factors that leads to bad estimates is that the histogram of
the values of a column maintained by the planner gets stale over time
as the data in the column changes. So, the histogram is no longer an
accurate view of the data, and it leads to bad selectivity estimates.

One thing we can try is to add a factor of error that we feel the given
selectivity can have. This lets us factor in the probability that the
data has changed, and estimate the difference between the current
histogram and the histogram of the data actually present in the column.

We can use the Central Limit Theorem
(http://en.wikipedia.org/wiki/Central_limit_theorem). Essentially, the
theorem says that, given a distribution with finite mean and variance,
we can take random independent samples from the data and calculate the
mean and standard deviation of each sample. If we take a large enough
number of samples and plot their means, those means will follow a
normal distribution. What is interesting is that this can allow us to
predict the SD of a given dataset from the curve, and the SD should be
directly proportional to the deviation the data has from the given
planner histogram.

I am no mathematician, hence it's hard for me to explain; I think this
link [1] will be more helpful. So, we can have a probability value for
the random variable, and that shall model the confidence we have in our
estimate.

I may be wrong in some parts, but I hope I have been able to convey the
general idea. If this idea develops, I shall be happy to work on it,
but my hands are full with ROLLUPS right now, so for my part it shall
take time. I just want to float the idea and get a general feel for it
right now.

Please let me know your comments and feedback on this.

Regards,

Atri

[1]: http://www.theriac.org/DeskReference/viewDocument.php?id=177

-- 
Regards,

Atri
*l'apprenant*
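The Central Limit Theorem mechanism Atri describes — repeated random samples whose means cluster normally around the true mean — can be sketched as follows. All names are illustrative; this demonstrates the statistical idea, not a planner patch:

```python
import random
import statistics

def sample_mean_distribution(data, sample_size, nsamples, rng=None):
    """Draw nsamples random samples of sample_size values from the
    column's current data and return the center and spread of the sample
    means. By the CLT the means are approximately normal around the true
    mean, with spread ~ sd/sqrt(sample_size); comparing that against
    what the stored histogram predicts could flag a stale histogram."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    means = [statistics.fmean(rng.sample(data, sample_size))
             for _ in range(nsamples)]
    return statistics.fmean(means), statistics.stdev(means)
```

For a column holding 0..999 (true mean 499.5), samples of 50 values give means centered near 499.5 with a spread close to the theoretical sd/sqrt(50); a center or spread far from what the histogram implies would indicate drift.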
Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql
> On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane wrote:
>> Yeah. I would like to see the planner's cost estimates extended to
>> include some sort of uncertainty estimate, whereupon risk-averse
>> people could ask it to prefer low-uncertainty plans over
>> high-uncertainty ones (the plans we typically choose for ORDER BY ...
>> LIMIT queries being great examples of the latter). But it's a long
>> way from wishing that to making it so. Right now it's not even clear
>> (to me anyway) how we'd measure or model such uncertainty.

This is not a model, but here are some starting thoughts.

A "high risk" plan has three components:

a) our statistical data is out-of-date or inadequate;
b) the potential execution time if our estimates of selectivity are
   wrong is high;
c) the cost ratio of certain operations is wrong.

Factor (a) can be modeled two ways:

1. If last_analyze was a long time ago, the risk is increased.
   (Ideally, we'd have some idea of the change rate on the table versus
   the last analyze time; right now we don't have those stats.)

2. Certain patterns, such as multi-column selectivity and GIN/GiST
   selectivity, are known to have poor estimates and to be higher risk.
   Certainly selectivity functions which have been programmed with a
   flat coefficient (like the default 0.05 selectivity for gist_ops)
   could also return a fairly high risk factor.

Factor (b) can be modeled simply by estimating the cost of a plan where
all row estimates are changed by 10X, or better yet by a calculation
based on the risk factor from (a). This would give us the "failure
cost" of the bad plan. Note that we need to estimate in both
directions, for both higher and lower estimates; "abort early" plans
fail because the rows returned are lower than expected, for example.

Estimating (b) would be expensive if we did every combination of the
entire plan with wrong estimates, so I'm wondering if it would be
adequate to just estimate the node selectivity being off on a per-node
basis.

Factor (c) we can't realistically estimate at all (i.e., if we knew the
cost factor was wrong, we'd fix it), so I suggest ignoring it for risk
estimation.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
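Factor (b) above — re-costing a plan with its row estimates moved 10X in both directions and treating the spread as the failure cost — can be sketched generically. `cost_fn` and all names here are hypothetical stand-ins; real numbers would come from the planner's own cost model:

```python
def risk_spread(cost_fn, est_rows, factor=10.0):
    """Re-run a plan's cost model with the row estimate deflated and
    inflated by `factor` (10X as suggested above) and report how far the
    cost can move in each direction from the point estimate. Both
    directions matter: "abort early" plans blow up when fewer rows than
    expected qualify, other plans when more do."""
    low = cost_fn(est_rows / factor)
    mid = cost_fn(est_rows)
    high = cost_fn(est_rows * factor)
    return min(low, high) - mid, max(low, high) - mid
```

For a linear cost model the spread is modest and symmetric-ish; for a plan whose cost explodes with row count (say, a nested loop), the upside term would dwarf the downside, marking the plan as high risk.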