RE: Bad estimates

2017-11-22 Thread Artur Zając
Thank you for your response.

The exact clause is not important (I used the binary & operator only as an
example); I was trying to illustrate a general class of problems.

Now I did another test:

alter table xyz add x int;
alter table xyz add y int;
alter table xyz add z int;
update xyz set x=gs,y=gs,z=gs;

create index xyza_i1 on xyz ((x%200));
create index xyza_i2 on xyz ((y%200));
create index xyza_i3 on xyz ((z%200));

vacuum full verbose xyz;

And now:

explain analyze select gs from xyz where (x%200)=1 and (y%200)=1 and
(z%200)=1;

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on xyz  (cost=2782.81..2786.83 rows=1 width=4) (actual time=134.827..505.642 rows=50000 loops=1)
   Recheck Cond: (((z % 200) = 1) AND ((y % 200) = 1) AND ((x % 200) = 1))
   Heap Blocks: exact=50000
   ->  BitmapAnd  (cost=2782.81..2782.81 rows=1 width=0) (actual time=108.712..108.712 rows=0 loops=1)
         ->  Bitmap Index Scan on xyza_i3  (cost=0.00..927.43 rows=50000 width=0) (actual time=22.857..22.857 rows=50000 loops=1)
               Index Cond: ((z % 200) = 1)
         ->  Bitmap Index Scan on xyza_i2  (cost=0.00..927.43 rows=50000 width=0) (actual time=26.058..26.058 rows=50000 loops=1)
               Index Cond: ((y % 200) = 1)
         ->  Bitmap Index Scan on xyza_i1  (cost=0.00..927.43 rows=50000 width=0) (actual time=23.079..23.079 rows=50000 loops=1)
               Index Cond: ((x % 200) = 1)
 Planning time: 0.340 ms
 Execution time: 513.171 ms
(12 rows)

Estimates are exactly the same because the planner assumes that if the first
clause reduces the row count by a factor of n, the second by m and the third
by o, then applying all of them together reduces the result by n*m*o. This is
the general behaviour, whether or not statistics exist for the expressions.
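For illustration (assuming the default selectivity of 1/200 per clause,
applied to the 10000000-row table), that multiplication reproduces the rows=1
estimate above:

select 10000000 * power(1.0/200, 3);  -- 1.25, i.e. rows=1 after rounding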

You suggest:

> If you're not wedded to this data representation, consider replacing that
> integer flags column with a bunch of boolean columns.  You might or might
> not want indexes on the booleans, but in any case ANALYZE would create
> stats that would allow decent estimates for "WHERE boolval".

But did you ever think about something like this?

CREATE STATISTICS ON (x&1) FROM xyz;

(using syntax similar to CREATE STATISTICS from PostgreSQL 10).
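For comparison, the actual PostgreSQL 10 syntax accepts only plain column
references, for example:

CREATE STATISTICS xyz_stats (dependencies) ON x, y, z FROM xyz;

so as far as I know there is currently no way to collect statistics for an
expression without creating an index on it.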

Sometimes it is not possible to split one column into many, and as far as I
know it is not worth creating an index when a column has only a few distinct
values.


Artur Zając


-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Wednesday, November 22, 2017 4:02 PM
To: Artur Zając 
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: Bad estimates

Artur Zając writes:
[ poor estimates for WHERE clauses like "(gs & 1) = 1" ]

Don't hold your breath waiting for that to get better on its own.
You need to work with the planner, not expect it to perform magic.
It has no stats that would help it discover what the behavior of that sort
of WHERE clause is; nor is there a good reason for it to think that the
selectivity of such a clause is only 0.5 rather than something more in line
with the usual behavior of an equality constraint on an integer value.

One way you could attack the problem, if you're wedded to this data
representation, is to create expression indexes on the terms "(gs & x)"
for all the values of x you use.  Not only would that result in better
estimates (after an ANALYZE) but it would also open the door to satisfying
this type of query through an index search.  A downside is that updating all
those indexes could make DML on the table pretty expensive.

If you're not wedded to this data representation, consider replacing that
integer flags column with a bunch of boolean columns.  You might or might
not want indexes on the booleans, but in any case ANALYZE would create stats
that would allow decent estimates for "WHERE boolval".

regards, tom lane





Re: Bad estimates

2017-11-22 Thread Laurenz Albe
Artur Zając wrote:
> We have table created like this:
> 
> CREATE TABLE xyz AS SELECT generate_series(1,10000000,1) AS gs;
> 
> Now:
> 
> explain analyze select * from xyz where gs&1=1;

>  Seq Scan on xyz  (cost=0.00..260815.38 rows=68920 width=4)
>   (actual time=0.044..2959.728 rows=5000000 loops=1)
>    Filter: ((gs & 1) = 1)
>    Rows Removed by Filter: 5000000
[...]
> And one more clause:
> 
> explain analyze select * from xyz where gs&1=1 and gs&2=2 and gs&4=4;

>  Seq Scan on xyz  (cost=0.00..398655.62 rows=2 width=4)
>   (actual time=0.052..3329.422 rows=1250000 loops=1)
>    Filter: (((gs & 1) = 1) AND ((gs & 2) = 2) AND ((gs & 4) = 4))
>    Rows Removed by Filter: 8750000

> As we can see, the estimates differ significantly from the actual record
> count - only three clauses reduce the estimated number of records from
> 10000000 to 2.
> 
> I noticed that each additional clause reduces the number about 200 times,
> and the define DEFAULT_NUM_DISTINCT is responsible for this behaviour.
> 
> I think that this constant should be lower, or maybe the estimation using
> DEFAULT_NUM_DISTINCT should be done once per table.

The problem is that the expression "gs & 1" is a black box for the
optimizer; it cannot estimate how selective the condition is and falls
back to a default value that is too low.

You can create an index to
a) improve the estimate
and
b) speed up the queries:

CREATE INDEX ON xyz ((gs & 1), (gs & 2), (gs & 4));

Don't forget to ANALYZE afterwards.
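To see the improvement, a minimal check along these lines should do:

ANALYZE xyz;
EXPLAIN ANALYZE SELECT * FROM xyz WHERE gs&1=1 AND gs&2=2 AND gs&4=4;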

Yours,
Laurenz Albe



RE: Bad estimates

2017-11-22 Thread Alex Ignatov
It doesn’t help in this case.

--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From: Don Seiler [mailto:d...@seiler.us]
Sent: Wednesday, November 22, 2017 5:49 PM
To: Artur Zając
Cc: pgsql-performance@lists.postgresql.org
Subject: Re: Bad estimates

I'm assuming you never analyzed the table after creation & data load? What
does this show you:

select * from pg_stat_all_tables where relname='xyz';

Don.

--
Don Seiler
www.seiler.us


Re: Bad estimates

2017-11-22 Thread Tom Lane
Artur Zając writes:
[ poor estimates for WHERE clauses like "(gs & 1) = 1" ]

Don't hold your breath waiting for that to get better on its own.
You need to work with the planner, not expect it to perform magic.
It has no stats that would help it discover what the behavior of
that sort of WHERE clause is; nor is there a good reason for it
to think that the selectivity of such a clause is only 0.5 rather
than something more in line with the usual behavior of an equality
constraint on an integer value.

One way you could attack the problem, if you're wedded to this data
representation, is to create expression indexes on the terms "(gs & x)"
for all the values of x you use.  Not only would that result in better
estimates (after an ANALYZE) but it would also open the door to satisfying
this type of query through an index search.  A downside is that updating
all those indexes could make DML on the table pretty expensive.
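Concretely, with the flag values from your example that would be something
like:

CREATE INDEX ON xyz ((gs & 1));
CREATE INDEX ON xyz ((gs & 2));
CREATE INDEX ON xyz ((gs & 4));
ANALYZE xyz;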

If you're not wedded to this data representation, consider replacing that
integer flags column with a bunch of boolean columns.  You might or might
not want indexes on the booleans, but in any case ANALYZE would create
stats that would allow decent estimates for "WHERE boolval".
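A minimal sketch of that approach (the flag column names here are just
illustrative):

ALTER TABLE xyz
  ADD COLUMN flag1 bool,
  ADD COLUMN flag2 bool,
  ADD COLUMN flag4 bool;
UPDATE xyz SET flag1 = ((gs & 1) = 1),
               flag2 = ((gs & 2) = 2),
               flag4 = ((gs & 4) = 4);
ANALYZE xyz;
EXPLAIN ANALYZE SELECT * FROM xyz WHERE flag1 AND flag2 AND flag4;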

regards, tom lane



Re: Bad estimates (DEFAULT_UNK_SEL)

2017-11-22 Thread Justin Pryzby
On Wed, Nov 22, 2017 at 03:29:54PM +0100, Artur Zając wrote:
> CREATE TABLE xyz AS SELECT generate_series(1,10000000,1) AS gs;
> 
> db=# explain analyze select * from xyz where gs&1=1;
>  Seq Scan on xyz  (cost=0.00..260815.38 rows=68920 width=4) (actual time=0.044..2959.728 rows=5000000 loops=1)
...
> newrr=# explain analyze select * from xyz where gs&1=1 and gs&2=2 and gs&4=4;
>  Seq Scan on xyz  (cost=0.00..398655.62 rows=2 width=4) (actual time=0.052..3329.422 rows=1250000 loops=1)

> I noticed that each additional clause reduces the number about 200 times,
> and the define DEFAULT_NUM_DISTINCT is responsible for this behaviour.

I think it's actually:

src/include/utils/selfuncs.h-/* default selectivity estimate for boolean and null test nodes */
src/include/utils/selfuncs.h-#define DEFAULT_UNK_SEL    0.005

..which is 1/200.

Note, you can do this, which helps a bit by collecting stats for the index
expr:

postgres=# CREATE INDEX ON xyz((gs&1));
postgres=# ANALYZE xyz;
postgres=# explain analyze SELECT * FROM xyz WHERE gs&1=1 AND gs&2=2 AND gs&4=4;
 Bitmap Heap Scan on xyz  (cost=91643.59..259941.99 rows=124 width=4) (actual time=472.376..2294.035 rows=1250000 loops=1)
   Recheck Cond: ((gs & 1) = 1)
   Filter: (((gs & 2) = 2) AND ((gs & 4) = 4))
   Rows Removed by Filter: 3750000
   Heap Blocks: exact=44248
   ->  Bitmap Index Scan on xyz_expr_idx  (cost=0.00..91643.55 rows=4962016 width=0) (actual time=463.477..463.477 rows=5000000 loops=1)
         Index Cond: ((gs & 1) = 1)

Justin



Re: Bad estimates

2017-11-22 Thread Don Seiler
I'm assuming you never analyzed the table after creation & data load? What
does this show you:

select * from pg_stat_all_tables where relname='xyz';
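
In particular, if last_analyze and last_autoanalyze are both NULL, the table
has never had statistics collected, e.g.:

select relname, last_analyze, last_autoanalyze, n_live_tup
from pg_stat_all_tables where relname = 'xyz';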

Don.

-- 
Don Seiler
www.seiler.us