Re: Bitmap scan is undercosted? - boolean correlation

2017-12-05 Thread Tom Lane
Jeff Janes  writes:
> On Dec 3, 2017 15:31, "Tom Lane"  wrote:
>> Jeff Janes  writes:
>>> But I do see that ties within the logical order of the column values are
>>> broken to agree with the physical order.  That is wrong, right?  Is there
>>> any argument that this is desirable?

>> Uh ... what do you propose doing instead?  We'd have to do something with
>> ties, and it's not so obvious this way is wrong.

> Let them be tied.  If there are 10 distinct values, number the values 0 to
> 9, and all rows of a given distinct values get the same number for the
> logical order axis.
> Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
> me.  Although if we switched btree to store duplicate values with tid as a
> tie breaker, then maybe it wouldn't be as obviously wrong.

I thought some more about this.  What we really want the correlation stat
to do is help us estimate how randomly an index-ordered scan will access
the heap.  If the values we've sampled are all unequal then there's no
particular issue.  However, if we have some group of equal values, we
do not really know what order an indexscan will visit them in.  The
existing correlation calculation is making the *most optimistic possible*
assumption, that such a group will be visited exactly in heap order ---
and that assumption isn't too defensible.  IIRC, a freshly built b-tree
will behave that way, because the initial sort of a btree breaks ties
using heap TIDs; but we don't maintain that property during later
insertions.  In any case, given that we do this calculation without regard
to any specific index, we can't reasonably expect to model exactly what
the index will do.  It would be best to adopt some middle-of-the-road
assumption about what the heap visitation order will be for a set of
duplicate values: not exactly heap order, but I think we should not use
a worst-case assumption either, since the btree may retain some amount
of its initial ordering.

BTW, I disagree that "correlation = zero" is the right answer for this
particular example.  If the btree is freshly built, then an index-order
scan would visit all the heap pages in sequence to fetch "f" rows, and
then visit them all in sequence again to fetch "t" rows, which is a whole
lot better than the purely random access that zero correlation implies.
So I think 0.8 or so is actually a perfectly reasonable answer when the
index is fresh.  The trouble is just that it'd behoove us to derate that
answer somewhat for the probability that the index isn't fresh.

My first thought for a concrete fix was to use the mean position of
a group of duplicates for the purposes of the correlation calculation,
but on reflection that's clearly wrong.  We do have an idea, from the
data we have, whether the duplicates are close together in the heap
or spread all over.  Using only mean position would fail to distinguish
those cases, but really we'd better penalize the spread-all-over case.
I'm not sure how to do that.

Or we could leave this calculation alone and try to move towards keeping
equal values in TID order in btrees.  Not sure about the downsides of
that, though.

regards, tom lane



Re: Bitmap scan is undercosted? - boolean correlation

2017-12-03 Thread Jeff Janes
On Dec 3, 2017 15:31, "Tom Lane"  wrote:

Jeff Janes  writes:
> On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby 
wrote:
>> It thinks there's somewhat-high correlation since it gets a list of x
>> and y values (integer positions by logical and physical sort order) and
>> 90% of the x list (logical value) are the same value ('t'), and the
>> CTIDs are in order on the new index, so 90% of the values are 100%
>> correlated.

> But there is no index involved (except in the case of the functional
> index).  The correlation of table columns to physical order of the table
> doesn't depend on the existence of an index, or the physical order within
> an index.

> But I do see that ties within the logical order of the column values are
> broken to agree with the physical order.  That is wrong, right?  Is there
> any argument that this is desirable?

Uh ... what do you propose doing instead?  We'd have to do something with
ties, and it's not so obvious this way is wrong.


Let them be tied.  If there are 10 distinct values, number the values 0 to
9, and all rows of a given distinct values get the same number for the
logical order axis.

Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
me.  Although if we switched btree to store duplicate values with tid as a
tie breaker, then maybe it wouldn't be as obviously wrong.

Cheers,

Jeff


Re: Bitmap scan is undercosted? - boolean correlation

2017-12-03 Thread Tom Lane
Jeff Janes  writes:
> On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby  wrote:
>> It thinks there's somewhat-high correlation since it gets a list of x
>> and y values (integer positions by logical and physical sort order) and
>> 90% of the x list (logical value) are the same value ('t'), and the
>> CTIDs are in order on the new index, so 90% of the values are 100%
>> correlated.

> But there is no index involved (except in the case of the functional
> index).  The correlation of table columns to physical order of the table
> doesn't depend on the existence of an index, or the physical order within
> an index.

> But I do see that ties within the logical order of the column values are
> broken to agree with the physical order.  That is wrong, right?  Is there
> any argument that this is desirable?

Uh ... what do you propose doing instead?  We'd have to do something with
ties, and it's not so obvious this way is wrong.

regards, tom lane



Re: Bitmap scan is undercosted? - boolean correlation

2017-12-03 Thread Jeff Janes
On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby  wrote:

> On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> > I think the non-extended stats code also has trouble with booleans.
> > pg_stats gives me a correlation  of 0.8 or higher for the flag column.
>
> It's not due to the boolean though; you see the same thing if you do:
> CREATE INDEX aaa_f ON aaa((flag::text));
> ANALYZE aaa;
> correlation | 0.81193
>
> or:
> ALTER TABLE aaa ADD flag2 int; UPDATE aaa SET flag2= flag::int
> correlation | 0.81193
>
> I think it's caused by having so few (2) values to correlate.
>
> most_common_vals   | {f,t}
> most_common_freqs  | {0.9014,0.0986}
> correlation| 0.822792
>
> It thinks there's somewhat-high correlation since it gets a list of x and y
> values (integer positions by logical and physical sort order) and 90% of
> the x
> list (logical value) are the same value ('t'), and the CTIDs are in order
> on
> the new index, so 90% of the values are 100% correlated.
>

But there is no index involved (except in the case of the functional
index).  The correlation of table columns to physical order of the table
doesn't depend on the existence of an index, or the physical order within
an index.

But I do see that ties within the logical order of the column values are
broken to agree with the physical order.  That is wrong, right?  Is there
any argument that this is desirable?

It looks like it could be fixed with a few extra double calcs per distinct
value.  Considering we already sorted the sample values using SQL-callable
collation dependent comparators, I doubt a few C-level double calcs is
going to be meaningful.

Cheers,

Jeff