Re: [PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread Hustler DBA
Thanks you guys are correct... the size of the table caused the optimizer
to do a seq scan instead of using the index. I tried it on a  24 MB and 1
GB table and the expected index was used.



On Fri, Feb 17, 2017 at 7:04 PM, Tom Lane  wrote:

> "David G. Johnston"  writes:
> > On Fri, Feb 17, 2017 at 3:49 PM, Tomas Vondra <
> tomas.von...@2ndquadrant.com>
> > wrote:
> >> That may seem a bit strange, but I'd bet it finds the short value in
> some
> >> statistic (MCV, histogram) ans so can provide very accurate estimate.
>
> > ​I'm not seeing how any of the statistic columns would capture a value
> that
> > doesn't actually appear in the table...(actual ... row=0)​
>
> I think it's the other way around.  It found
> '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' in the stats, concluded
> (accurately) that there would be five matches, and on the strength of that
> decided that a seqscan over this very tiny table would be faster than an
> indexscan.  In the other case, the short string exists neither in the
> table nor the stats, and the default estimate is turning out to be that
> there's a single match, for which it likes the indexscan solution.  This
> is all pretty unsurprising if '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'
> is in the most-common-values list.  Anything that's *not* in that list
> is going to get a smaller rowcount estimate.  (I don't think that the
> string length, per se, has anything to do with it.)
>
> I'm not sure what performance problem the OP was looking to solve,
> but expecting experiments on toy-sized tables to give the same plans
> as you get on large tables is a standard mistake when learning to work
> with the PG planner.
>
> Also, if toy-sized tables are all you've got, meaning the whole database
> can be expected to stay RAM-resident at all times, it'd be a good idea
> to reduce random_page_cost to reflect that.  The default planner cost
> settings are meant for data that's mostly on spinning rust.
>
> regards, tom lane
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>


Re: [PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread Tom Lane
"David G. Johnston"  writes:
> On Fri, Feb 17, 2017 at 3:49 PM, Tomas Vondra 
> wrote:
>> That may seem a bit strange, but I'd bet it finds the short value in some
>> statistic (MCV, histogram) ans so can provide very accurate estimate.

> ​I'm not seeing how any of the statistic columns would capture a value that
> doesn't actually appear in the table...(actual ... row=0)​

I think it's the other way around.  It found
'01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' in the stats, concluded
(accurately) that there would be five matches, and on the strength of that
decided that a seqscan over this very tiny table would be faster than an
indexscan.  In the other case, the short string exists neither in the
table nor the stats, and the default estimate is turning out to be that
there's a single match, for which it likes the indexscan solution.  This
is all pretty unsurprising if '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'
is in the most-common-values list.  Anything that's *not* in that list
is going to get a smaller rowcount estimate.  (I don't think that the
string length, per se, has anything to do with it.)

I'm not sure what performance problem the OP was looking to solve,
but expecting experiments on toy-sized tables to give the same plans
as you get on large tables is a standard mistake when learning to work
with the PG planner.

Also, if toy-sized tables are all you've got, meaning the whole database
can be expected to stay RAM-resident at all times, it'd be a good idea
to reduce random_page_cost to reflect that.  The default planner cost
settings are meant for data that's mostly on spinning rust.

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread David G. Johnston
On Fri, Feb 17, 2017 at 3:49 PM, Tomas Vondra 
wrote:

>  That may seem a bit strange, but I'd bet it finds the short value in some
> statistic (MCV, histogram) ans so can provide very accurate estimate.


​​ ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29 rows=1
width=0) (actual time=0.043..0.043 rows=0 loops=1)

​I'm not seeing how any of the statistic columns would capture a value that
doesn't actually appear in the table...(actual ... row=0)​

Unless there is some prefix matching going on here since the short value is
a substring(1, n) of the longer one which does appear 5 times.

​I guess maybe because the value doesn't appear it uses the index (via IOS)
to confirm absence (or near absence, i.e., 1) while, knowing the larger
value appears 5 times out of 223, it decides a quick table scan is faster
than any form of double-lookup (whether on the visibility map or the heap).

​https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html​

​David J.​


Re: [PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread Hustler DBA
Yes, both queries are the same, I just shorten the parameter value to see
what would have happened. The database that I inherited has a column that
stores GUID/UUIDs in a varchar(255) and a select on that table on that
column is doing a FULL TABLE SCAN (seq scan). All the values in the column
are 36 characters long. The table is 104 KB.

I realize that there was no index on that column so when I created the
index and tried to search on a parameter value, it doesn't use the index,
but when I shorten the parameter value then the optimizer decides to use an
index for the search.



On Fri, Feb 17, 2017 at 5:52 PM, Tomas Vondra 
wrote:

> On 02/17/2017 11:42 PM, David G. Johnston wrote:
>
>> On Fri, Feb 17, 2017 at 3:19 PM, Hustler DBA > >wrote:
>>
>>
>>
>> my_db=# create index tab_idx1 on tab(ID);
>>
>> CREATE INDEX
>> my_db=# explain (analyze, buffers) select count(*) from tab where ID
>> = '01625cfa-2bf8-45cf' ;
>>   QUERY
>> PLAN
>> 
>> 
>> ---
>>  Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual
>> time=0.048..0.048 rows=1 loops=1)
>>Buffers: shared read=2
>>->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29
>> rows=1 width=0) (actual time=0.043..0.043 rows=0 loops=1)
>>  Index Cond: (ID = '01625cfa-2bf8-45cf'::text)
>>
>>
>>
>>->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
>> time=0.031..0.108 rows=5 loops=1)
>>  Filter: ((ID)::text =
>> '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
>>  Rows Removed by Filter: 218
>>  Buffers: shared hit=12
>>  Planning time: 0.122 ms
>>  Execution time: 0.180 ms
>> (8 rows)
>>
>>
>> ​IIRC the only reason the first query cares to use the index is because
>> it can perform an Index Only Scan and thus avoid touching the heap at
>> all.  If it cannot avoid touching the heap the planner is going to just
>> use a sequential scan to retrieve the records directly from the heap and
>> save the index lookup step.
>>
>>
> I don't follow - the queries are exactly the same in both cases, except
> the parameter value. So both cases are eligible for index only scan.
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>


Re: [PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread Tomas Vondra

On 02/17/2017 11:42 PM, David G. Johnston wrote:

On Fri, Feb 17, 2017 at 3:19 PM, Hustler DBA >wrote:


my_db=# create index tab_idx1 on tab(ID);

CREATE INDEX
my_db=# explain (analyze, buffers) select count(*) from tab where ID
= '01625cfa-2bf8-45cf' ;
  QUERY
PLAN

---
 Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual
time=0.048..0.048 rows=1 loops=1)
   Buffers: shared read=2
   ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29
rows=1 width=0) (actual time=0.043..0.043 rows=0 loops=1)
 Index Cond: (ID = '01625cfa-2bf8-45cf'::text)



   ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
time=0.031..0.108 rows=5 loops=1)
 Filter: ((ID)::text =
'01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
 Rows Removed by Filter: 218
 Buffers: shared hit=12
 Planning time: 0.122 ms
 Execution time: 0.180 ms
(8 rows)


​IIRC the only reason the first query cares to use the index is because
it can perform an Index Only Scan and thus avoid touching the heap at
all.  If it cannot avoid touching the heap the planner is going to just
use a sequential scan to retrieve the records directly from the heap and
save the index lookup step.



I don't follow - the queries are exactly the same in both cases, except 
the parameter value. So both cases are eligible for index only scan.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread Alvaro Herrera
Hustler DBA wrote:
> I am seeing this strange behavior, I don't know if this is by design by
> Postgres.
> 
> I have an index on a column which is defined as "character varying(255)".
> When the value I am searching for is of a certain length, the optimizer
> uses the index but when the value is long, the optimizer doesn't use the
> index but does a seq scan on the table. Is this by design? How can I make
> the optimizer use the index no matter what the size/length of the value
> being searched for?

As I recall, selectivity for strings is estimated based on the length of
the string.  Since your sample string looks suspiciously like an UUID,
perhaps you'd be better served by using an UUID column for it, which may
give better results.  This would prevent you from using the shortened
version for searches (which I suppose you can do with LIKE using the
varchar type), but you could replace it with something like this:

select *
from tab
where ID between '01625cfa-2bf8-45cf--' and
  '01625cfa-2bf8-45cf--';

Storage (both the table and indexes) is going to be more efficient this
way too.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread Tomas Vondra

Hi,

On 02/17/2017 11:19 PM, Hustler DBA wrote:

I am seeing this strange behavior, I don't know if this is by design by
Postgres.

I have an index on a column which is defined as "character
varying(255)". When the value I am searching for is of a certain length,
the optimizer uses the index but when the value is long, the optimizer
doesn't use the index but does a seq scan on the table. Is this by
design? How can I make the optimizer use the index no matter what the
size/length of the value being searched for?



AFAIK there are no such checks, i.e. the optimizer does not consider the 
length of the value when deciding between scan types.




PostgreSQL version: 9.4



That's good to know, but we also need information about the table 
involved in your queries. I'd bet the table is tiny (it seems to be just 
12 pages, so ~100kB), making the indexes rather useless.



my_db=# explain (analyze, buffers) select count(*) from tab where ID =
'01625cfa-2bf8-45cf' ;
  QUERY PLAN

---
 Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual time=0.048..0.048
rows=1 loops=1)
   Buffers: shared read=2
   ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29 rows=1
width=0) (actual time=0.043..0.043 rows=0 loops=1)
 Index Cond: (ID = '01625cfa-2bf8-45cf'::text)
 Heap Fetches: 0
 Buffers: shared read=2
 Planning time: 0.250 ms
 Execution time: 0.096 ms
(8 rows)

my_db=# explain (analyze, buffers) select count(*) from tab where ID =
'01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' ;
QUERY PLAN

---
 Aggregate  (cost=14.80..14.81 rows=1 width=0) (actual time=0.115..0.115
rows=1 loops=1)
   Buffers: shared hit=12
   ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
time=0.031..0.108 rows=5 loops=1)
 Filter: ((ID)::text = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
 Rows Removed by Filter: 218
 Buffers: shared hit=12
 Planning time: 0.122 ms
 Execution time: 0.180 ms
(8 rows)


The only difference I see is that for the long value the planner expects 
5 rows, while for the short one it expects 1 row. That may seem a bit 
strange, but I'd bet it finds the short value in some statistic (MCV, 
histogram) ans so can provide very accurate estimate. While for the 
longer one, it ends up using some default (0.5% for equality IIRC) or 
value deduced from ndistinct. Or something like that.


The differences between the two plans are rather negligible, both in 
terms of costs (8.3 vs. 14.81) and runtime (0.1 vs 0.2 ms). The choice 
of a sequential scan seems perfectly reasonable for such tiny tables.


FWIW it's impossible to draw conclusions based on two EXPLAIN ANALYZE 
executions. The timing instrumentation from EXPLAIN ANALYZE may have 
significant impact impact (different for each plan!). You also need to 
testing with more values and longer runs, not just a single execution 
(there are caching effects etc.)


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread David G. Johnston
On Fri, Feb 17, 2017 at 3:19 PM, Hustler DBA  wrote:

>
> my_db=# create index tab_idx1 on tab(ID);
>
> CREATE INDEX
> my_db=# explain (analyze, buffers) select count(*) from tab where ID = '
> 01625cfa-2bf8-45cf' ;
>   QUERY PLAN
>
> 
> 
> ---
>  Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual time=0.048..0.048
> rows=1 loops=1)
>Buffers: shared read=2
>->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29 rows=1
> width=0) (actual time=0.043..0.043 rows=0 loops=1)
>  Index Cond: (ID = '01625cfa-2bf8-45cf'::text)
>
>

>->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
> time=0.031..0.108 rows=5 loops=1)
>  Filter: ((ID)::text = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea
> '::text)
>  Rows Removed by Filter: 218
>  Buffers: shared hit=12
>  Planning time: 0.122 ms
>  Execution time: 0.180 ms
> (8 rows)
>

​IIRC the only reason the first query cares to use the index is because it
can perform an Index Only Scan and thus avoid touching the heap at all.  If
it cannot avoid touching the heap the planner is going to just use a
sequential scan to retrieve the records directly from the heap and save the
index lookup step.

David J.


[PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread Hustler DBA
I am seeing this strange behavior, I don't know if this is by design by
Postgres.

I have an index on a column which is defined as "character varying(255)".
When the value I am searching for is of a certain length, the optimizer
uses the index but when the value is long, the optimizer doesn't use the
index but does a seq scan on the table. Is this by design? How can I make
the optimizer use the index no matter what the size/length of the value
being searched for?


PostgreSQL version: 9.4


my_db=# explain (analyze, buffers) select count(*) from tab where ID =
'01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' ;
QUERY PLAN

---
 Aggregate  (cost=14.80..14.81 rows=1 width=0) (actual time=0.114..0.114
rows=1 loops=1)
   Buffers: shared hit=12
   ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
time=0.025..0.109 rows=5 loops=1)
 Filter: ((ID)::text = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
 Rows Removed by Filter: 218
 Buffers: shared hit=12
 Planning time: 0.155 ms
 Execution time: 0.167 ms
(8 rows)

my_db=# create index tab_idx1 on tab(ID);

CREATE INDEX
my_db=# explain (analyze, buffers) select count(*) from tab where ID = '
01625cfa-2bf8-45cf' ;
  QUERY PLAN

---
 Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual time=0.048..0.048
rows=1 loops=1)
   Buffers: shared read=2
   ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29 rows=1
width=0) (actual time=0.043..0.043 rows=0 loops=1)
 Index Cond: (ID = '01625cfa-2bf8-45cf'::text)
 Heap Fetches: 0
 Buffers: shared read=2
 Planning time: 0.250 ms
 Execution time: 0.096 ms
(8 rows)

my_db=# explain (analyze, buffers) select count(*) from tab where ID = '
01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' ;
QUERY PLAN

---
 Aggregate  (cost=14.80..14.81 rows=1 width=0) (actual time=0.115..0.115
rows=1 loops=1)
   Buffers: shared hit=12
   ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
time=0.031..0.108 rows=5 loops=1)
 Filter: ((ID)::text = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
 Rows Removed by Filter: 218
 Buffers: shared hit=12
 Planning time: 0.122 ms
 Execution time: 0.180 ms
(8 rows)

my_db=#