Thank you so much!
So to recap the general way to reproduce this issue is:
create table products(id int primary key, type varchar);
insert into products select generate_series(1,10000), 'aaa';
insert into products select generate_series(10001,20000), 'bbb';
create index idx on products(type);
prepare stmt as select * from products where type = $1 limit 1;
Which, after a handful of executions, devolves into:
explain analyze execute stmt ('ccc');
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.03 rows=1 width=8) (actual time=1.821..1.821 rows=0 loops=1)
   ->  Seq Scan on products  (cost=0.00..339.00 rows=1 width=8) (actual time=1.819..1.819 rows=0 loops=1)
         Filter: ((type)::text = $1)
         Rows Removed by Filter: 20000
 Total runtime: 1.843 ms
(5 rows)
So if I am searching for 'ccc', the prepared plan eventually
"optimises" to what the statistics suggest is the better mechanism:
just scanning the table to find the first hit.
However, a fairly common pattern I use is to check for "lack of
presence" of a value.
For example: if the product type 'ccc' is not in the table do this.
Unfortunately the optimiser deoptimises this class of operation.
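My rough mental model of why the flipped plan looks so cheap, as a toy
sketch in Python (this is illustrative only, not PostgreSQL's actual
costing code; the 339 total cost is from the plan above, and the
1/n_distinct guess for an unknown parameter is my assumption):

```python
# Toy model of how a LIMIT over a seq scan gets costed (illustrative
# only; function and variable names are mine, not PostgreSQL's).

def limit_over_seqscan_cost(total_scan_cost, estimated_matches, limit=1):
    """LIMIT charges only for the fraction of the scan it expects to
    run before finding `limit` matching rows, assuming matches are
    spread uniformly through the table."""
    fraction = min(1.0, limit / estimated_matches)
    return total_scan_cost * fraction

# A parameter-blind ("generic") plan can't see 'ccc', so it guesses
# the value matches ~1/n_distinct of the table -- thousands of rows.
table_rows = 20000
n_distinct = 2
generic_estimate = table_rows / n_distinct

cheap_looking = limit_over_seqscan_cost(339.0, generic_estimate)
print(cheap_looking)  # well under 1 cost unit, so the seq scan "wins"

# But when the value is absent, the scan actually reads every row
# before giving up -- the worst case the estimate never considered.
```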
I tried the exact example above with an "int" instead of a "varchar"
in the "type" column and was not able to reproduce the issue. I wonder
if there is some sort of different handling for strings vs numbers.
Unfortunately my actual table in play has a rather bad schema, the
"archetype" column really should be an int. That said we only have 2
general archetypes at the moment (private message and topic) and the
occasional single "banner" outlier, which may or may not be there. So
the data modelling is pretty hostile to performance. I have some ideas
on how to solve my particular problem but I do have some general
concerns.
Ruby on Rails is just about to ship a new version that heavily changes
the mechanics of query execution. For example, Product.where(name:
"foo").first will now result in a prepared query whereas in the past
it would just send the raw query. Overall this approach is better and
saner, but my general concern is that our API offers no "escape hatch"
for these outlier conditions. You can disable prepared statements
globally, but cannot disable them for a single call. I will raise this
particular concern with the team.
My second question/concern is that I feel I am totally
misunderstanding the changes to 'plancache.c'. I thought the decision
about which plan to use was based purely on the value sent into the
prepared query, but it seems the planner completely ignores the value
in some cases. (For example, I was expecting "aaa" and "ccc" to result
in completely different plans.)
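For my own sanity, here is the decision rule as I currently understand
it, sketched in Python (heavily simplified, with made-up names and
costs; I believe the real logic lives around choose_custom_plan() in
plancache.c, so please correct me if I have this wrong):

```python
# Sketch of the prepared-statement plan-choice heuristic as I
# understand it (simplified; names are illustrative, not the real API).

def use_custom_plan(num_executions, avg_custom_cost, generic_cost,
                    replan_threshold=5):
    # The first few executions always plan with the actual parameter
    # values ("custom" plans).
    if num_executions < replan_threshold:
        return True
    # Afterwards, if the cached parameter-blind ("generic") plan does
    # not look more expensive than the average custom plan, it wins --
    # and the concrete value ('aaa' vs 'ccc') is never looked at again.
    return generic_cost > avg_custom_cost

# Early executions plan against the real value:
print(use_custom_plan(1, 650.0, 0.03))  # True
# After the threshold, a cheap-looking generic plan takes over:
print(use_custom_plan(6, 650.0, 0.03))  # False
```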
Thank you so much for your time, patience, and general awesomeness.
On Fri, Nov 14, 2014 at 11:34 AM, Tom Lane wrote:
> Sam Saffron writes:
>> I have hit a rather odd issue with prepared queries on both pg 9.3 and 9.4
>> beta.
>> I have this table (copy at http://samsaffron.com/testing.db.gz) with a
>> very odd performance profile:
>
> Interesting case. The issue seems to be that your statistics look like
> this:
>
> select * from pg_stats where tablename = 'topics' and attname = 'archetype';
> -[ RECORD 1 ]----------+--------------------------
> schemaname             | public
> tablename              | topics
> attname                | archetype
> inherited              | f
> null_frac              | 0
> avg_width              | 12
> n_distinct             | 2
> most_common_vals       | {private_message,regular}
> most_common_freqs      | {0.604957,0.395043}
> histogram_bounds       |
> correlation            | 0.612985
> most_common_elems      |
> most_common_elem_freqs |
> elem_count_histogram   |
> (1 row)
>
> That is, archetype consists of 60% 'private_message', 40% 'regular', and
> absolutely nothing else. So the condition archetype = 'banner' is very
> selective, and a plan built knowing that that is the parameter value will
> use the index:
>
> # explain select * from topics where archetype = 'banner' limit 1;
> QUERY PLAN
> ------------------------------------------------------------------------------
>  Limit  (cost=0.29..651.49 rows=1 width=520)
>    ->  Index Scan using idx11 on topics  (cost=0.29..651.49 rows=1 width=520)
> Index Cond: ((arc