Re: Bitmap scan is undercosted? - boolean correlation
Jeff Janeswrites: > On Dec 3, 2017 15:31, "Tom Lane" wrote: >> Jeff Janes writes: >>> But I do see that ties within the logical order of the column values are >>> broken to agree with the physical order. That is wrong, right? Is there >>> any argument that this is desirable? >> Uh ... what do you propose doing instead? We'd have to do something with >> ties, and it's not so obvious this way is wrong. > Let them be tied. If there are 10 distinct values, number the values 0 to > 9, and all rows of a given distinct values get the same number for the > logical order axis. > Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to > me. Although if we switched btree to store duplicate values with tid as a > tie breaker, then maybe it wouldn't be as obviously wrong. I thought some more about this. What we really want the correlation stat to do is help us estimate how randomly an index-ordered scan will access the heap. If the values we've sampled are all unequal then there's no particular issue. However, if we have some group of equal values, we do not really know what order an indexscan will visit them in. The existing correlation calculation is making the *most optimistic possible* assumption, that such a group will be visited exactly in heap order --- and that assumption isn't too defensible. IIRC, a freshly built b-tree will behave that way, because the initial sort of a btree breaks ties using heap TIDs; but we don't maintain that property during later insertions. In any case, given that we do this calculation without regard to any specific index, we can't reasonably expect to model exactly what the index will do. It would be best to adopt some middle-of-the-road assumption about what the heap visitation order will be for a set of duplicate values: not exactly heap order, but I think we should not use a worst-case assumption either, since the btree may retain some amount of its initial ordering. BTW, I disagree that "correlation = zero" is the right answer for this particular example. If the btree is freshly built, then an index-order scan would visit all the heap pages in sequence to fetch "f" rows, and then visit them all in sequence again to fetch "t" rows, which is a whole lot better than the purely random access that zero correlation implies. So I think 0.8 or so is actually a perfectly reasonable answer when the index is fresh. The trouble is just that it'd behoove us to derate that answer somewhat for the probability that the index isn't fresh. My first thought for a concrete fix was to use the mean position of a group of duplicates for the purposes of the correlation calculation, but on reflection that's clearly wrong. We do have an idea, from the data we have, whether the duplicates are close together in the heap or spread all over. Using only mean position would fail to distinguish those cases, but really we'd better penalize the spread-all-over case. I'm not sure how to do that. Or we could leave this calculation alone and try to move towards keeping equal values in TID order in btrees. Not sure about the downsides of that, though. regards, tom lane
Re: Bitmap scan is undercosted? - boolean correlation
On Dec 3, 2017 15:31, "Tom Lane"wrote: Jeff Janes writes: > On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby wrote: >> It thinks there's somewhat-high correlation since it gets a list of x >> and y values (integer positions by logical and physical sort order) and >> 90% of the x list (logical value) are the same value ('t'), and the >> CTIDs are in order on the new index, so 90% of the values are 100% >> correlated. > But there is no index involved (except in the case of the functional > index). The correlation of table columns to physical order of the table > doesn't depend on the existence of an index, or the physical order within > an index. > But I do see that ties within the logical order of the column values are > broken to agree with the physical order. That is wrong, right? Is there > any argument that this is desirable? Uh ... what do you propose doing instead? We'd have to do something with ties, and it's not so obvious this way is wrong. Let them be tied. If there are 10 distinct values, number the values 0 to 9, and all rows of a given distinct values get the same number for the logical order axis. Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to me. Although if we switched btree to store duplicate values with tid as a tie breaker, then maybe it wouldn't be as obviously wrong. Cheers, Jeff
Re: Bitmap scan is undercosted? - boolean correlation
Jeff Janeswrites: > On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby wrote: >> It thinks there's somewhat-high correlation since it gets a list of x >> and y values (integer positions by logical and physical sort order) and >> 90% of the x list (logical value) are the same value ('t'), and the >> CTIDs are in order on the new index, so 90% of the values are 100% >> correlated. > But there is no index involved (except in the case of the functional > index). The correlation of table columns to physical order of the table > doesn't depend on the existence of an index, or the physical order within > an index. > But I do see that ties within the logical order of the column values are > broken to agree with the physical order. That is wrong, right? Is there > any argument that this is desirable? Uh ... what do you propose doing instead? We'd have to do something with ties, and it's not so obvious this way is wrong. regards, tom lane
Re: Bitmap scan is undercosted? - boolean correlation
On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzbywrote: > On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote: > > I think the non-extended stats code also has trouble with booleans. > > pg_stats gives me a correlation of 0.8 or higher for the flag column. > > It's not due to the boolean though; you see the same thing if you do: > CREATE INDEX aaa_f ON aaa((flag::text)); > ANALYZE aaa; > correlation | 0.81193 > > or: > ALTER TABLE aaa ADD flag2 int; UPDATE aaa SET flag2= flag::int > correlation | 0.81193 > > I think it's caused by having so few (2) values to correlate. > > most_common_vals | {f,t} > most_common_freqs | {0.9014,0.0986} > correlation| 0.822792 > > It thinks there's somewhat-high correlation since it gets a list of x and y > values (integer positions by logical and physical sort order) and 90% of > the x > list (logical value) are the same value ('t'), and the CTIDs are in order > on > the new index, so 90% of the values are 100% correlated. > But there is no index involved (except in the case of the functional index). The correlation of table columns to physical order of the table doesn't depend on the existence of an index, or the physical order within an index. But I do see that ties within the logical order of the column values are broken to agree with the physical order. That is wrong, right? Is there any argument that this is desirable? It looks like it could be fixed with a few extra double calcs per distinct value. Considering we already sorted the sample values using SQL-callable collation dependent comparators, I doubt a few C-level double calcs is going to be meaningful. Cheers, Jeff