Re: [PERFORM] Index scan cost calculation

2015-12-03 Thread Glyn Astill

> From: Jim Nasby <jim.na...@bluetreble.com>
>To: Jeff Janes <jeff.ja...@gmail.com>; Glyn Astill <glynast...@yahoo.co.uk> 
>Cc: Pgsql-performance <pgsql-performance@postgresql.org>
>Sent: Wednesday, 2 December 2015, 22:32
>Subject: Re: [PERFORM] Index scan cost calculation
> 
>
>On 11/30/15 5:03 PM, Jeff Janes wrote:
>> It thinks the combination of (show, type, best, block)  is enough to
>> get down to a single row.  One index adds "flag" to that (which is not
>> useful to the query) and the other adds "row" to that, which is useful
>> but the planner doesn't think it is because once you are down to a
>> single tuple additional selectivity doesn't help.
>
>It occurs to me that maybe you could force this behavior by building an 
>index on a row() instead of on the individual fields. IE:
>
>CREATE INDEX ... ON( row(show, type, best, block, row) )
>
>You would then have to query based on that:
>
>WHERE row(show, type, best, block, row) = row( 'Trans Siberian 
>Orchestra', 'Music', true, 1, 1 )
>
>You mentioned legacy code which presumably you can't modify to do that, 
>but maybe there's a way to trick the planner into it with a view...
>
>CREATE VIEW AS
>SELECT r.show, r.type, r..., etc, etc
>   FROM ( SELECT *, row(show, type, best, block, row) AS r FROM table ) a
>;
>
>When you stick a where clause on that there's a chance it'd get turned 
>into WHERE row() = row()... but now that I see it I'm probably being 
>over optimistic about that. You could probably force the issue with an 
>ON SELECT ON table DO INSTEAD rule, but IIRC those aren't supported.


Thanks, interesting idea, but no cigar.

For the moment just ensuring the seats_index01 is the last index created seems 
to suffice, fragile though it is.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index scan cost calculation

2015-12-02 Thread Jim Nasby

On 11/30/15 5:03 PM, Jeff Janes wrote:

It thinks the combination of (show, type, best, block)  is enough to
get down to a single row.  One index adds "flag" to that (which is not
useful to the query) and the other adds "row" to that, which is useful
but the planner doesn't think it is because once you are down to a
single tuple additional selectivity doesn't help.


It occurs to me that maybe you could force this behavior by building an 
index on a row() instead of on the individual fields. IE:


CREATE INDEX ... ON( row(show, type, best, block, row) )

You would then have to query based on that:

WHERE row(show, type, best, block, row) = row( 'Trans Siberian 
Orchestra', 'Music', true, 1, 1 )


You mentioned legacy code which presumably you can't modify to do that, 
but maybe there's a way to trick the planner into it with a view...


CREATE VIEW AS
SELECT r.show, r.type, r..., etc, etc
  FROM ( SELECT *, row(show, type, best, block, row) AS r FROM table ) a
;

When you stick a where clause on that there's a chance it'd get turned 
into WHERE row() = row()... but now that I see it I'm probably being 
over optimistic about that. You could probably force the issue with an 
ON SELECT ON table DO INSTEAD rule, but IIRC those aren't supported.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index scan cost calculation

2015-12-01 Thread Glyn Astill
>
>Clauses that can't be used in an "indexable" way are excluded from the
>index selectivity, but not from the total query selectivity.
>
>> Or is it just likely that the selection of the new index is just by chance?
>
>Bingo.
>


Got it, thanks! Very much appreciated.


Glyn


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index scan cost calculation

2015-11-30 Thread Jeff Janes
On Mon, Nov 30, 2015 at 6:03 AM, Glyn Astill  wrote:
>
>
>
>
> If I create the index show+best+block+row+seat then the planner appears to 
> favour that, and all is well.  Despite the startup cost estimate being the 
> same, and total cost being 0.01 higher.  This is something I fail to 
> understand fully.

I think usually Index scans that are estimated to be within 1% of each
other are considered to be identical.  Which one gets chosen then
depends on what order they are considered in, which I think is in
implementation dependent detail. Usually it is the most recently
created one, which would explain why you got the plan switch with the
new index.


> Tom stated the index choice is due to a selectivity underestimate.  I think 
> this may be because there is actually a correlation between "best"+"block" 
> and "type", but from Toms reply my understanding was that total selectivity 
> for the query is calculated as the product of the individual selectivities in 
> the where clause.

I think the problem here is not with total query selectivity estimate,
but rather selectivity estimates of the indexes.

It thinks the combination of (show, type, best, block)  is enough to
get down to a single row.  One index adds "flag" to that (which is not
useful to the query) and the other adds "row" to that, which is useful
but the planner doesn't think it is because once you are down to a
single tuple additional selectivity doesn't help.


> Are particular equality clauses ever excluded from the calculation as a 
> result of available indexes or otherwise?

Clauses that can't be used in an "indexable" way are excluded from the
index selectivity, but not from the total query selectivity.

> Or is it just likely that the selection of the new index is just by chance?

Bingo.

Cheers,

Jeff


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index scan cost calculation

2015-11-30 Thread Glyn Astill


> From: Jeff Janes <jeff.ja...@gmail.com>
> To: Glyn Astill <glynast...@yahoo.co.uk>
> Cc: Pgsql-performance <pgsql-performance@postgresql.org>
> Sent: Saturday, 28 November 2015, 19:25
> Subject: Re: [PERFORM] Index scan cost calculation
> 
> 
> Why does the index seats_index02 exist in the first place?  It looks
> like an index designed for the benefit of a single query.  In which
> case, could flag column be moved up front?  That should prevent it
> from looking falsely enticing.
> 
> A column named "flag" is not usually the type of thing you expect to
> see a range query on, so moving it leftward in the index should not be
> a problem.
> 


Unfortunately it's not possible to move flag left in this scenario.

As you say it's an issue that would not really exist in normal SQL access. The 
main issue is the way it's required for ordering; The index in question is used 
 by a legacy language that accesses records sequentially as if they were direct 
from isam files it used historically via a driver.  In some cases it steps 
through records on a particular show+type until a flag changes and carries on 
unless particular values are seen.


If I create the index show+best+block+row+seat then the planner appears to 
favour that, and all is well.  Despite the startup cost estimate being the 
same, and total cost being 0.01 higher.  This is something I fail to understand 
fully.

Tom stated the index choice is due to a selectivity underestimate.  I think 
this may be because there is actually a correlation between "best"+"block" and 
"type", but from Toms reply my understanding was that total selectivity for the 
query is calculated as the product of the individual selectivities in the where 
clause. Are particular equality clauses ever excluded from the calculation as a 
result of available indexes or otherwise?  Or is it just likely that the 
selection of the new index is just by chance?


Either way I my understanding here is definitely lacking.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index scan cost calculation

2015-11-28 Thread Jeff Janes
On Thu, Nov 26, 2015 at 8:11 AM, Glyn Astill  wrote:
> Hi All,
>
> Using pg 9.4.5 I'm looking at a table set up by a 3rd party application and 
> trying to figure out why a particular index is being chosen over another for 
> updates/deletes.
>
> From what I can see the reason is that plans using either index have the same 
> exactly the same cost.  So rather I'm asking if there's something glaringly 
> obvious I'm missing, or is there anything I can to to get better estimates.
>
> The table is as follows and has  ~ 50M rows, ~ 4.5GB in size:
>
> CREATE TABLE tickets.seats
> (
>   recnum serial NOT NULL,
>   show numeric(8,0) NOT NULL,
>   type numeric(4,0) NOT NULL,
>   block character varying(8) NOT NULL,
>   "row" numeric(14,0) NOT NULL,
>   seat numeric(8,0) NOT NULL,
>   flag character varying(15) NOT NULL,
>   transno numeric(8,0) NOT NULL,
>   best numeric(4,0) NOT NULL,
>   "user" character varying(15) NOT NULL,
>   "time" numeric(10,0) NOT NULL,
>   date date NOT NULL,
>   date_reserved timestamp NOT NULL
> );
>
> Indexes:
>   "seats_index01" PRIMARY KEY, btree (show, type, best, block, "row", seat)   
>// (1094 MB)
>   "seats_index00" UNIQUE, btree (recnum)  
> // (2423 MB)
>   "seats_index02" UNIQUE, btree (show, type, best, block, flag, "row", seat, 
> recnum)  // (2908 MB)


Why does the index seats_index02 exist in the first place?  It looks
like an index designed for the benefit of a single query.  In which
case, could flag column be moved up front?  That should prevent it
from looking falsely enticing.

A column named "flag" is not usually the type of thing you expect to
see a range query on, so moving it leftward in the index should not be
a problem.

Cheers,

Jeff


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Index scan cost calculation

2015-11-26 Thread Glyn Astill
Hi All,

Using pg 9.4.5 I'm looking at a table set up by a 3rd party application and 
trying to figure out why a particular index is being chosen over another for 
updates/deletes.

>From what I can see the reason is that plans using either index have the same 
>exactly the same cost.  So rather I'm asking if there's something glaringly 
>obvious I'm missing, or is there anything I can to to get better estimates.

The table is as follows and has  ~ 50M rows, ~ 4.5GB in size:

CREATE TABLE tickets.seats
(
  recnum serial NOT NULL,
  show numeric(8,0) NOT NULL,
  type numeric(4,0) NOT NULL,
  block character varying(8) NOT NULL,
  "row" numeric(14,0) NOT NULL,
  seat numeric(8,0) NOT NULL,
  flag character varying(15) NOT NULL,
  transno numeric(8,0) NOT NULL,
  best numeric(4,0) NOT NULL,
  "user" character varying(15) NOT NULL,
  "time" numeric(10,0) NOT NULL,
  date date NOT NULL,
  date_reserved timestamp NOT NULL
);

Indexes:
  "seats_index01" PRIMARY KEY, btree (show, type, best, block, "row", seat) 
 // (1094 MB)
  "seats_index00" UNIQUE, btree (recnum)
  // (2423 MB)
  "seats_index02" UNIQUE, btree (show, type, best, block, flag, "row", seat, 
recnum)  // (2908 MB)

default_statistics target is 100, and the following columns are non-default:

attname | attstattarget
+---
show  |  1000
type   |  1000
block|  2000
row|  1000
seat   |  1000
flag   |  1000
best   |  1000

Increasing these further appears to make no noticeable difference. (pg_stats 
here for these columns here: http://pastebin.com/2WQQec7N)

An example query below shows that in some cases the seats_index02 index is 
being chosen:

# analyze verbose seats;
INFO:  analyzing "tickets.seats"
INFO:  "seats": scanned 593409 of 593409 pages, containing 50926456 live rows 
and 349030 dead rows; 60 rows in sample, 50926456 estimated total rows

# begin;
BEGIN
# explain analyze delete from seats where ("show" = 58919 AND "type" = 1 AND 
"best" = 10 AND "block" = 'GMA' AND "row" =26 AND "seat" = 15);
QUERY PLAN
-
Delete on seats  (cost=0.56..4.59 rows=1 width=6) (actual time=0.480..0.480 
rows=0 loops=1)
->  Index Scan using seats_index02 on seats  (cost=0.56..4.59 rows=1 width=6) 
(actual time=0.452..0.453 rows=1 loops=1)
Index Cond: ((show = 58919::numeric) AND (type = 1::numeric) AND (best = 
10::numeric) AND ((block)::text = 'GMA'::text) AND ("row" = 26::numeric) AND 
(seat = 15::numeric))
Planning time: 2.172 ms
Execution time: 0.531 ms
(5 rows)

But from my naive standpoint, seats_index01 is a better candidate:

# abort; begin;
ROLLBACK
BEGIN

# update pg_index set indisvalid = false where indexrelid = 
'seats_index02'::regclass;
# explain analyze delete from seats where ("show" = 58919 AND "type" = 1 AND 
"best" = 10 AND "block" = 'GMA' AND "row" =26 AND "seat" = 15);
QUERY PLAN
-
Delete on seats  (cost=0.56..4.59 rows=1 width=6) (actual time=0.103..0.103 
rows=0 loops=1)
->  Index Scan using seats_index01 on seats  (cost=0.56..4.59 rows=1 width=6) 
(actual time=0.078..0.080 rows=1 loops=1)
Index Cond: ((show = 58919::numeric) AND (type = 1::numeric) AND (best = 
10::numeric) AND ((block)::text = 'GMA'::text) AND ("row" = 26::numeric) AND 
(seat = 15::numeric))
Planning time: 0.535 ms
Execution time: 0.146 ms
(5 rows)


In this instance, the time difference is not huge, however in some seemingly 
random cases where there are a lot of rows with only the "seat" column 
differing the choice of seats_index02 is much larger ~ 70ms vs 0.something ms 
with seats_index01

I suspect some of the seemingly random cases could be where there's been an 
update, followed by a delete since the last analyze, despite auto analyze 
running fairly frequently.

Any suggestions appreciated.

Thanks
Glyn


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index scan cost calculation

2015-11-26 Thread Glyn Astill
- Original Message -

> From: Glyn Astill <glynast...@yahoo.co.uk>
> To: Pgsql-performance <pgsql-performance@postgresql.org>
> Sent: Thursday, 26 November 2015, 16:11
> Subject: [PERFORM] Index scan cost calculation
> 
> Hi All,
> 
> Using pg 9.4.5 I'm looking at a table set up by a 3rd party application and 
> trying to figure out why a particular index is being chosen over another for 
> updates/deletes.
> 
> From what I can see the reason is that plans using either index have the same 
> exactly the same cost.  So rather I'm asking if there's something 
> glaringly obvious I'm missing, or is there anything I can to to get better 
> estimates.
> 
> The table is as follows and has  ~ 50M rows, ~ 4.5GB in size:
> 
> CREATE TABLE tickets.seats
> (
>   recnum serial NOT NULL,
>   show numeric(8,0) NOT NULL,
>   type numeric(4,0) NOT NULL,
>   block character varying(8) NOT NULL,
>   "row" numeric(14,0) NOT NULL,
>   seat numeric(8,0) NOT NULL,
>   flag character varying(15) NOT NULL,
>   transno numeric(8,0) NOT NULL,
>   best numeric(4,0) NOT NULL,
>   "user" character varying(15) NOT NULL,
>   "time" numeric(10,0) NOT NULL,
>   date date NOT NULL,
>   date_reserved timestamp NOT NULL
> );
> 
> Indexes:
>   "seats_index01" PRIMARY KEY, btree (show, type, best, block, 
> "row", seat)  // (1094 MB)
>   "seats_index00" UNIQUE, btree (recnum)  
> // (2423 MB)
>   "seats_index02" UNIQUE, btree (show, type, best, block, flag, 
> "row", seat, recnum)  // (2908 MB)

> 


^^ If those first two sizes look wrong, it's because they are; they should be 
the other way around.

> default_statistics target is 100, and the following columns are non-default:
> 
> attname | attstattarget
> +---
> show  |  1000
> type   |  1000
> block|  2000
> row|  1000
> seat   |  1000
> flag   |  1000
> best   |  1000
> 
> Increasing these further appears to make no noticeable difference. (pg_stats 
> here for these columns here: http://pastebin.com/2WQQec7N)
> 
> An example query below shows that in some cases the seats_index02 index is 
> being 
> chosen:
> 
> # analyze verbose seats;
> INFO:  analyzing "tickets.seats"
> INFO:  "seats": scanned 593409 of 593409 pages, containing 50926456 
> live rows and 349030 dead rows; 60 rows in sample, 50926456 estimated 
> total 
> rows
> 
> # begin;
> BEGIN
> # explain analyze delete from seats where ("show" = 58919 AND 
> "type" = 1 AND "best" = 10 AND "block" = 
> 'GMA' AND "row" =26 AND "seat" = 15);
> QUERY PLAN
> -
> Delete on seats  (cost=0.56..4.59 rows=1 width=6) (actual time=0.480..0.480 
> rows=0 loops=1)
> ->  Index Scan using seats_index02 on seats  (cost=0.56..4.59 rows=1 width=6) 
> (actual time=0.452..0.453 rows=1 loops=1)
> Index Cond: ((show = 58919::numeric) AND (type = 1::numeric) AND (best = 
> 10::numeric) AND ((block)::text = 'GMA'::text) AND ("row" = 
> 26::numeric) AND (seat = 15::numeric))
> Planning time: 2.172 ms
> Execution time: 0.531 ms
> (5 rows)
> 
> But from my naive standpoint, seats_index01 is a better candidate:
> 
> # abort; begin;
> ROLLBACK
> BEGIN
> 
> # update pg_index set indisvalid = false where indexrelid = 
> 'seats_index02'::regclass;
> # explain analyze delete from seats where ("show" = 58919 AND 
> "type" = 1 AND "best" = 10 AND "block" = 
> 'GMA' AND "row" =26 AND "seat" = 15);
> QUERY PLAN
> -
> Delete on seats  (cost=0.56..4.59 rows=1 width=6) (actual time=0.103..0.103 
> rows=0 loops=1)
> ->  Index Scan using seats_index01 on seats  (cost=0.56..4.59 rows=1 width=6) 
> (actual time=0.078..0.080 rows=1 loops=1)
> Index Cond: ((show = 58919::numeric) AND (type = 1::numeric) AND (best = 
> 10::numeric) AND ((block)::text = 'GMA'::text) AND ("row" = 
> 26::numeric) AND (seat = 15::numeric))
> Planning time: 0.535 ms
> Execution time: 0.146 ms
> (5 rows)
> 
> 
> In this instance, the time difference is not huge, however in some se

Re: [PERFORM] Index scan cost calculation

2015-11-26 Thread Tom Lane
Glyn Astill  writes:
> Using pg 9.4.5 I'm looking at a table set up by a 3rd party application and 
> trying to figure out why a particular index is being chosen over another for 
> updates/deletes.
> From what I can see the reason is that plans using either index have the same 
> exactly the same cost.  So rather I'm asking if there's something glaringly 
> obvious I'm missing, or is there anything I can to to get better estimates.

I think what's happening is that it's estimating that exactly one index
tuple needs to be visited in both cases, so that the cost estimates come
out the same.  That's correct in the one case but overly optimistic in the
other; the misestimate likely is a consequence of the index columns being
interdependent.  For instance, if "type" can be predicted from the other
columns then specifying it isn't really adding anything to the query
selectivity, but the planner won't know that.  We can conclude from the
results you've shown that the planner thinks that show+type+best+block
is sufficient to uniquely determine a table entry, which implies that
at least some of those columns are strongly correlated with row+seat.

The problem will probably go away by itself as your table grows, but
if you don't want to wait, you might want to reflect on which of the index
columns might be (partially?) functionally dependent on the other columns,
and whether you could redesign the key structure to avoid that.

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index scan cost calculation

2015-11-26 Thread Glyn Astill
- Original Message -

> From: Tom Lane <t...@sss.pgh.pa.us>
> To: Glyn Astill <glynast...@yahoo.co.uk>
> Cc: Pgsql-performance <pgsql-performance@postgresql.org>
> Sent: Thursday, 26 November 2015, 16:44
> Subject: Re: [PERFORM] Index scan cost calculation
> 
> Glyn Astill <glynast...@yahoo.co.uk> writes:
>>  Using pg 9.4.5 I'm looking at a table set up by a 3rd party application 
> and trying to figure out why a particular index is being chosen over another 
> for 
> updates/deletes.
>>  From what I can see the reason is that plans using either index have the 
> same exactly the same cost.  So rather I'm asking if there's something 
> glaringly obvious I'm missing, or is there anything I can to to get better 
> estimates.
> 
> I think what's happening is that it's estimating that exactly one index
> tuple needs to be visited in both cases, so that the cost estimates come
> out the same.  That's correct in the one case but overly optimistic in the
> other; the misestimate likely is a consequence of the index columns being
> interdependent.  For instance, if "type" can be predicted from the 
> other
> columns then specifying it isn't really adding anything to the query
> selectivity, but the planner won't know that.  We can conclude from the
> results you've shown that the planner thinks that show+type+best+block
> is sufficient to uniquely determine a table entry, which implies that
> at least some of those columns are strongly correlated with row+seat.
> 
> The problem will probably go away by itself as your table grows, but
> if you don't want to wait, you might want to reflect on which of the index
> columns might be (partially?) functionally dependent on the other columns,
> and whether you could redesign the key structure to avoid that.


Many thanks for the explanation, is such a functional  dependency assumed 
purely based optimistically on statistics gathered by analyze?  My (ignorant) 
thinking was that those sorts of decisions would only be made from keys or 
constraints on the table.


There's no way to determine a particular seat+row combination from 
show+type+best+block or vice versa.

We need show+type+best+block+row+seat to identify an individual row, but 
approximately 90% of the table has just a space " " for the value of "block", 
and zeros for both "best" and "row", and for each of those you could say any 
show+type would almost certainly have row+seat combinations of 0+1, 0+2 and so 
on.


Unfortunately it's an unnormalized legacy structure that I can't really change.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index scan cost calculation

2015-11-26 Thread Tom Lane
Glyn Astill  writes:
>> From: Tom Lane 
>> The problem will probably go away by itself as your table grows, but
>> if you don't want to wait, you might want to reflect on which of the index
>> columns might be (partially?) functionally dependent on the other columns,
>> and whether you could redesign the key structure to avoid that.

> Many thanks for the explanation, is such a functional  dependency assumed 
> purely based optimistically on statistics gathered by analyze?

Well, the point is that the selectivities associated with the individual
WHERE clauses are assumed independent, which allows us to just multiply
them together.  If they're not really independent then multiplication
gives a combined selectivity that's too small.  But without cross-column
statistics there's not much we can do to get a better estimate.

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance