Re: [PERFORM] CREATE STATISTICS and join selectivity

2017-11-15 Thread David Rowley
On 16 November 2017 at 09:19, Justin Pryzby <pry...@telsasoft.com> wrote:
> I just noticed that PG10 CREATE STATISTICS (dependencies) doesn't seem to work
> for joins on multiple columns; is that right?

Unfortunately, for now, they're not used for join selectivity
estimates, only for the base rel selectivities. That's all there was
time for with PG10. This is highly likely to be improved sometime in
the future.
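
For illustration (table, column and statistics names here are made up, not
from this thread): on PG10,

  CREATE STATISTICS ord_dep_stats (dependencies) ON city, zip FROM orders;
  ANALYZE orders;

improves the estimate for a base-rel qual such as
WHERE city = 'Wellington' AND zip = '6011', but a join qual like
ON o.city = a.city AND o.zip = a.zip still gets the standard
per-column join selectivity estimates.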

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Unnecessary DISTINCT while primary key in SQL

2017-11-05 Thread David Rowley
On 5 November 2017 at 04:20, 刘瑞 <whx20...@gmail.com> wrote:
> CREATE TABLE test_tbl ( k INT PRIMARY KEY, col text)
> INSERT into test_tbl select generate_series(1,1000), 'test';
>
> SQL with DISTINCT:
> test=# explain analyze select distinct col, k from test_tbl order by k limit
> 1000;
> QUERY PLAN
> --
>  Limit  (cost=1277683.22..1277690.72 rows=1000 width=36) (actual
> time=12697.994..12698.382 rows=1000 loops=1)
>->  Unique  (cost=1277683.22..1329170.61 rows=6864985 width=36) (actual
> time=12697.992..12698.311 rows=1000 loops=1)
>  ->  Sort  (cost=1277683.22..1294845.68 rows=6864985 width=36)
> (actual time=12697.991..12698.107 rows=1000 loops=1)
>Sort Key: k, col
>Sort Method: external sort  Disk: 215064kB
>->  Seq Scan on test_tbl  (cost=0.00..122704.85 rows=6864985
> width=36) (actual time=0.809..7561.215 rows=1000 loops=1)
>  Planning time: 2.368 ms
>  Execution time: 12728.471 ms
> (8 rows)

The current planner does not make much of an effort to record which
columns remain distinct at each level. I have ideas on how to improve
this, and they would include improving your case here.

9.6 did improve a slight variation of your query, but this was for
GROUP BY instead of DISTINCT. There's probably no reason why the same
optimisation could not be applied to DISTINCT; I just didn't think of
it when writing the patch.

The item from the release notes [1] reads "Ignore GROUP BY columns
that are functionally dependent on other columns"

So, if you were to write the query as:

explain analyze select col, k from test_tbl group by col, k order by k
limit 1000;

It should run much more quickly, although still not as optimal as it could be.

[1] https://www.postgresql.org/docs/9.6/static/release-9-6.html

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Cheaper subquery scan not considered unless offset 0

2017-10-29 Thread David Rowley
On 30 October 2017 at 00:24, Benjamin Coutu <ben.co...@zeyos.com> wrote:
>   ->  Index Scan using "PK_items_ID" on items a  (cost=0.42..1.05 rows=1 
> width=332) (actual time=0.003..0.003 rows=1 loops=15521 total=46.563)

I've never seen EXPLAIN output like that before.

Is this some modified version of PostgreSQL?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] How does max_parallel_workers_per_gather change load averages?

2017-10-03 Thread David Rowley
On 4 October 2017 at 08:48, Ben Nachtrieb  wrote:
> I have 2 cores and my max_parallel_workers_per_gather = 2 and
> max_worker_processes = 8, but my load averages are between 8 and 5 with
> scheduled at 1/189 to 5/195. Are these so high because I increased
> max_parallel_workers_per_gather? My understanding is that if my load
> averages are greater than my number of cores the system is overloaded.
> Should I think about it differently once I increase
> max_parallel_workers_per_gather? How should I think about it?

Parallel query is not 100% efficient. For example, adding twice the
CPU will, in theory, never double the performance; there's always some
overhead to this. It's really only useful on systems with spare
CPU cycles to perform this extra work. You don't seem to have much to
spare, so you may get along better if you disable parallel query.
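
If you want to try that, parallel query can be disabled with (session
level shown; it can also be set in postgresql.conf):

  SET max_parallel_workers_per_gather = 0;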




Re: [PERFORM] query of partitioned object doesnt use index in qa

2017-09-20 Thread David Rowley
On 21 September 2017 at 04:15, Mike Broers <mbro...@gmail.com> wrote:
> Ultimately I think this is just highlighting the need in my environment to
> set random_page_cost lower (we are on an SSD SAN anyway..), but I don't think
> I have a satisfactory reason why the row estimates are so bad in the QA
> planner and why it doesn't use that partition index there.

Without the index there are no stats to allow the planner to perform a
good estimate on "e.body->>'SID' is not null", so it applies a default
of 99.5%. As a simple example, suppose you have a partition with 1
million rows: applying 99.5% to that gives 995000 rows. Now if you add
the selectivity for "e.validation_status_code = 'P'", let's say that's
50%, the row estimate for the entire WHERE clause would be 497500
(1000000 * 0.995 * 0.5). Since the 99.5% is applied in both cases, the
only variable part is validation_status_code. Perhaps
validation_status_code = 'P' is much more common in QA than in
production.

You can look at the stats as gathered by ANALYZE with:

\x on
select * from pg_stats where tablename = 'event__' and attname
= 'validation_status_code';
\x off
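
If the stats are the issue, one way to give the planner statistics on
that expression is to index it, since ANALYZE gathers stats for
expression indexes (a sketch only; the index name is made up, and the
expression must match the query exactly):

  CREATE INDEX event_sid_expr_idx ON event__ ((body->>'SID'));
  ANALYZE event__;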

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] query of partitioned object doesnt use index in qa

2017-09-13 Thread David Rowley
On 14 September 2017 at 08:28, Mike Broers <mbro...@gmail.com> wrote:
> I have a query of a partitioned table that uses the partition index in
> production but uses sequence scans in qa.  The only major difference I can
> tell is the partitions are much smaller in qa.  In production the partitions
> range in size from around 25 million rows to around 60 million rows, in QA
> the partitions are between 4 and 12 million rows.  I would think this would
> be big enough to get the planner to prefer the index but this is the major
> difference between the two database as far as I can tell.


QA:

> │   ->  Seq Scan on event__ e_1
> (cost=0.00..2527918.06 rows=11457484 width=782)│
>

Production:
>
> │   ->  Index Scan using
> ix_event__00011162_landing_id on event__00011162 e_1  (cost=0.56..15476.59
> rows=23400 width=572)   │


If QA has between 4 and 12 million rows, and the planner's row
estimate for the condition is that 11457484 rows are going to match, a
seq scan is likely best here. If those estimates are off then it might
be worth double-checking that your nightly ANALYZE is working
correctly on QA.

The planner may be able to be coaxed into using the index with a
higher effective_cache_size and/or a lower random_page_cost setting,
although you really should be looking at those row estimates first.
Showing us the EXPLAIN ANALYZE would have been much more useful so
that we could have seen if those are accurate or not.
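
For a quick session-level experiment, something like this (values are
illustrative, not recommendations):

  SET effective_cache_size = '8GB';
  SET random_page_cost = 1.1;  -- closer to SSD-style random I/O costs

then re-run EXPLAIN ANALYZE and see whether the index is chosen.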

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Query plan for views and WHERE clauses, Luke is not using the index

2017-08-21 Thread David Rowley
On 19 August 2017 at 04:46, kimaidou <kimai...@gmail.com> wrote:
> When we call the WHERE on the view:
>
> EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
> SELECT *
> FROM "qgep"."vw_qgep_reach"
> WHERE "progression_geometry" &&
> st_makeenvelope(1728327.03249295568093657,8240789.26074041239917278,1728608.10987572139129043,8240958.16933418624103069,3949)
>
>
> The query plan is "wrong", as PostgreSQL seems to consider it should do a
> seq scan on the tables, and only afterwards filter with the WHERE:
> https://explain.depesz.com/s/wXV
>
> The query takes about 1 second instead of less than 100ms.
>
> Do you have any hint on this kind of issue ?

This is by design due to the DISTINCT ON() clause. Only quals which
filter columns which are in the DISTINCT ON can be safely pushed down.

Consider the following, where I've manually pushed the WHERE clause.

postgres=# create table tt (a int, b int);
CREATE TABLE
postgres=# create index on tt (a);
CREATE INDEX
postgres=# insert into tt values(1,1),(1,2),(2,1),(2,2);
INSERT 0 4
postgres=# select * from (select distinct on (a) a,b from tt order by
a,b) tt where b = 2;
 a | b
---+---
(0 rows)


postgres=# select * from (select distinct on (a) a,b from tt where b =
2 order by a,b) tt;
 a | b
---+---
 1 | 2
 2 | 2
(2 rows)

Note the results are not the same.

If I'd done WHERE a = 2, then the planner would have pushed the qual
down into the subquery.

More reading in check_output_expressions() in allpaths.c:

/* If subquery uses DISTINCT ON, check point 3 */
if (subquery->hasDistinctOn &&
    !targetIsInSortList(tle, InvalidOid, subquery->distinctClause))
{
    /* non-DISTINCT column, so mark it unsafe */
    safetyInfo->unsafeColumns[tle->resno] = true;
    continue;
}

The comment for point 3 reads:

 * 3. If the subquery uses DISTINCT ON, we must not push down any quals that
 * refer to non-DISTINCT output columns, because that could change the set
 * of rows returned.  (This condition is vacuous for DISTINCT, because then
 * there are no non-DISTINCT output columns, so we needn't check.  Note that
 * subquery_is_pushdown_safe already reported that we can't use volatile
 * quals if there's DISTINCT or DISTINCT ON.)


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Inefficient max query when using group by

2017-05-04 Thread David Rowley
On 4 May 2017 at 22:52,  <jesse.hieta...@vaisala.com> wrote:
> I have a performance problem with my query. As a simplified example, I have
> a table called Book, which has three columns: id, released (timestamp) and
> author_id. I have a need to search for the latest books released by multiple
> authors, at a specific point in the history. This could be latest book
> between beginning of time and now, or latest book released last year etc. In
> other words, only the latest book for each author, in specific time window.
> I have also a combined index for released and author_id columns.
>
> First, I tried a simple query that selects maximum value of released and the
> author_id, which are grouped by the author_id (then later do a join by these
> author_id, released columns to get the whole rows).  Performance of this
> query is pretty bad (Execution time around 250-300ms for five authors). See
> query and query plan in the link below:
>
> https://gist.github.com/jehie/ca9fac16b6e3c19612d815446a0e1bc0
>
>
>
> The execution time seems to grow linearly when the number of author_ids
> increase (50ms per author_id). I don’t completely understand why it takes so
> long for this query to execute and why it does not use the directional index
> scan?
>
> I also tried second query using limit (where I can only ask for one
> author_id at a time, so cannot use this directly when searching for books of
> multiple author), which performs nicely (0.2ms):
>
> https://gist.github.com/jehie/284e7852089f6debe22e05c63e73027f
>
>
>
> So, any ideas how to make multiple-author lookups (like in the first query)
> perform better? Or any other ideas?

Yes, you could sidestep the whole issue by using a LATERAL join.

Something like:

EXPLAIN ANALYZE
SELECT b.released, b.author_id
FROM (VALUES('1'),('2'),('3'),('4'),('5')) a (author_id)
CROSS JOIN LATERAL (SELECT released, author_id
                    FROM book
                    WHERE author_id = a.author_id
                      AND released <= to_timestamp(2e9)
                      AND released >= to_timestamp(0)
                    ORDER BY released DESC
                    LIMIT 1) b;

or you could write a function which just runs that query. Although,
with the above or the function method, if you give this enough
authors, then it'll eventually become slower than the problem query.
Perhaps if you know the number of authors will not be too great, then
you'll be ok.
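
For what it's worth, each LATERAL lookup above can be answered by a
single (backwards) index scan if there's an index leading on author_id.
A sketch, assuming the existing combined index isn't already ordered
(author_id, released):

  CREATE INDEX book_author_released_idx ON book (author_id, released);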

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Slow query with 3 table joins

2017-04-25 Thread David Rowley
On 26 April 2017 at 15:19, Alessandro Ferrucci
<alessandroferru...@gmail.com> wrote:
> After about 40 inutes the slow query finally finished and the result of the
> EXPLAIN plan can be found here:
>
> https://explain.depesz.com/s/BX22

> Index Scan using field_unit_id_idx on field  (cost=0.00..8746678.52 
> rows=850149 width=8) (actual time=0.030..2414345.998 rows=10315 loops=1)"

This estimate seems a long way off. Are the stats up-to-date on the
table? Try again after running: ANALYZE field;

It might also be a good idea to ANALYZE all the tables. Is auto-vacuum
switched on?

The plan in question would work better if you create an index on
field (field_name, unit_id); but I think if you update the stats the
plan will switch.

A HashJoin, hashing "unit"  and index scanning on field_field_name_idx
would have been a much smarter plan choice for the planner to make.

Also how many distinct field_names are there? SELECT COUNT(DISTINCT
field_name) FROM field;

You may want to increase the histogram buckets on that column if
there are more than 100 field names and the number of rows with each
field name is highly variable: ALTER TABLE field ALTER COLUMN
field_name SET STATISTICS <n>; 100 is the default, and 10000 is the
maximum.
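
For example (the value 1000 is purely illustrative):

  ALTER TABLE field ALTER COLUMN field_name SET STATISTICS 1000;
  ANALYZE field;  -- the new target takes effect at the next analyze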


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Insert Concurrency

2017-04-17 Thread David Rowley
On 18 April 2017 at 14:55, ROBERT PRICE <rprice...@hotmail.com> wrote:
> I come from an Oracle background and am porting an application to postgres.
> App has a table that will contain 100 million rows and has to be loaded by a
> process that reads messages off a SQS queue and makes web service calls to
> insert records one row at a time in a postgres RDS instance. I know slow by
> slow is not the ideal approach but I was wondering if postgres had
> partitioning or other ways to tune concurrent insert statements. Process
> will run 50 - 100 concurrent threads.

Have you tested performance and noticed that it is insufficient for
your needs? Or do you just assume PostgreSQL suffers from the same
issue as Oracle in regards to INSERT contention on a single table?

You may like to look at pgbench [1] to test the performance if you've
not done so already.

[1] https://www.postgresql.org/docs/9.6/static/pgbench.html
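
For example, with a file insert.sql containing a single INSERT that
resembles your workload (the file name and numbers are made up; the
flags are standard pgbench options):

  pgbench -n -f insert.sql -c 50 -j 4 -T 60 mydb

runs it with 50 concurrent clients for 60 seconds, which should tell
you whether the contention you fear actually materialises.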

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Fast HashJoin only after a cluster/recreate table

2016-04-01 Thread David Rowley
  Index Cond: ((es09codemp = 1) AND
> (es09tipdoc = t3.es09tipdoc) AND (es09numdoc = t3.es09numdoc))
>  Filter: (es09tipdoc ~~ '%'::text)
>  Buffers: shared hit=6706
>->  Sort  (cost=37.35..38.71 rows=547 width=32) (actual
> time=0.592..2.206 rows=2919 loops=1)
>  Sort Key: t2.es08tipdoc
>  Sort Method: quicksort  Memory: 67kB
>  Buffers: shared hit=7
>  ->  Seq Scan on es08t t2  (cost=0.00..12.47 rows=547
> width=32) (actual time=0.003..0.126 rows=547 loops=1)
>Buffers: shared hit=7
>  ->  Materialize  (cost=0.56..287644.85 rows=716126 width=23)
> (actual time=0.027..68577.800 rows=993087854 loops=1)
>Buffers: shared hit=75342
>->  GroupAggregate  (cost=0.56..278693.28 rows=716126
> width=15) (actual time=0.025..4242.453 rows=3607573 loops=1)
>  Group Key: es09t1.es09codemp, es09t1.es09tipdoc,
> es09t1.es09numdoc
>  Buffers: shared hit=75342
>  ->  Index Only Scan using es09t1_pkey on es09t1
>  (cost=0.56..199919.49 rows=7161253 width=15) (actual time=0.016..1625.031
> rows=7160921 loops=1)
>Index Cond: (es09codemp = 1)
>Heap Fetches: 51499
>Buffers: shared hit=75342
>  Planning time: 50.129 ms
>  Execution time: 380419.435 ms
>


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PERFORM] Merge joins on index scans

2016-02-28 Thread David Rowley
On 27 February 2016 at 11:07, James Parks <james.pa...@meraki.net> wrote:
>
> CREATE TABLE a (id bigint primary key, nonce bigint);
> CREATE TABLE b (id bigint primary key, a_id bigint not null);
> CREATE INDEX a_idx ON b (a_id);
>
> The query:
>
> SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = ? ORDER BY b.id
> ASC;
>
> (skip down to [1] and [2] to see the query performance)
>
> What I know:
>
> If you force the query planner to use a merge join on the above query, it
> takes 10+ minutes to complete using the data as per below. If you force the
> query planner to use a hash join on the same data, it takes ~200
> milliseconds.

I believe I know what is going on here, but can you please test:

SELECT b.* FROM b WHERE EXISTS (SELECT 1 FROM a WHERE a.id = b.a_id
AND a.nonce = ?) ORDER BY b.id ASC;

using the merge join plan.

If this performs much better, then the problem is due to the merge
join mark/restore causing the join to have to transition through many
tuples which don't match the a.nonce = ? predicate. The mark and
restore is not required for the rewritten query, as this uses a semi
join rather than a regular inner join. With the semi join the executor
knows that it's only meant to match a single tuple in "a", so once the
first match is found it can move to the next row in the outer relation
without having to restore the scan back to where it started matching
that inner row again.

If I'm right, to get around the problem you could: create index on a
(nonce, id);

If such an index is out of the question then a patch has been
submitted for review which should fix this problem in (hopefully)
either 9.6 or 9.7:
https://commitfest.postgresql.org/9/129/
If you have a test environment handy, it would be nice if you could
test the patch on the current git head to see if this fixes your
problem. The findings would be quite interesting for me. Please note
this patch is for test environments only at this stage, not for
production use.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] PostgreSQL seems to create inefficient plans in simple conditional joins

2016-01-30 Thread David Rowley
On 31 January 2016 at 01:30, Hedayat Vatankhah <hedayat@gmail.com> wrote:
> Personally, I expect both queries below to perform exactly the same:
>
> SELECT
> t1.id, *
> FROM
> t1
> INNER JOIN
> t2 ON t1.id = t2.id
> where t1.id > -9223372036513411363;
>
> And:
>
> SELECT
> t1.id, *
> FROM
> t1
> INNER JOIN
> t2 ON t1.id = t2.id
> where t1.id > -9223372036513411363 and t2.id > -9223372036513411363;
>
> Unfortunately, they do not. PostgreSQL creates different plans for these
> queries, which results in very poor performance for the first one compared
> to the second (What I'm testing against is a DB with around 350 million
> rows in t1, and slightly less in t2).
>
> EXPLAIN output:
> First query: http://explain.depesz.com/s/uauk
> Second query: link: http://explain.depesz.com/s/uQd

Yes, unfortunately you've done about the only thing that you can do,
and that's just include both conditions in the query. Is there some
special reason why you can't just write the t2.id > ... condition in
the query too? or is the query generated dynamically by some software
that you have no control over?

I'd personally quite like to see improvements in this area, and even
wrote a patch [1] which fixes this problem too. The problem I had when
proposing the fix was that I was unable to report details about how
many people are hit by this planner limitation. The patch I proposed
caused a very small impact on planning time for many queries, and was
thought by many not to apply in enough cases for it to be worth
slowing down queries which cannot possibly benefit. Of course I agree
with this; I've no interest in slowing down planning on queries, but
at the same time I understand the annoyingly poor optimisation in this
area.

Although please remember the patch I proposed was merely a first draft
proposal. Not for production use.

[1] 
http://www.postgresql.org/message-id/flat/cakjs1f9fk_x_5hkcpcseimy16owe3empmmgsgwlckkj_rw9...@mail.gmail.com#cakjs1f9fk_x_5hkcpcseimy16owe3empmmgsgwlckkj_rw9...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] PostgreSQL seems to create inefficient plans in simple conditional joins

2016-01-30 Thread David Rowley
On 31 January 2016 at 06:14, Vitalii Tymchyshyn <v...@tym.im> wrote:
> It may be more for -hackers, but I often hear "this wont be used because of
> planning time increase". Now as I know we have statistics on real query time
> after few runs that is used to decide if plan should be switched.
> Can this statistics be used to apply advanced planning features for
> relatively long running queries? E.g. a parameter like
> sophisticated_planning_l1_threshold=500ms. If query runs over this
> threshold, replan it with more sophisticated features taking few more
> millis. Possibly different levels can be introduced. Also allow to set
> threshold to 0, saying "apply to all queries right away".
> Another good option is to threshold against cumulative query time. E.g. if
> there was 1 runs 0.5 millis each, it may be beneficial to spend few
> millis to get 0.2 millis each.

I agree with you. I was recently working with long-running queries on
a large 3TB database. I discovered a new optimisation was possible,
and wrote a patch to implement it. In testing, the extra work which the
optimiser performed took 7 microseconds, and this saved 6 hours of
execution time. Now, I've never been much of an investor in my life,
but a 3 billion times return on an investment seems quite favourable.
Of course, that's quite an extreme case, but it's hard to ignore that
the benefit is still significant in less extreme cases.

The idea you've mentioned here is very similar to what I brought up at
the developer meeting a few days ago; see the AOB section in [1].

Unfortunately I didn't really get many of the correct people on my
side with it, and some wanted examples of specific patches, which is
completely not what I wanted to talk about. I was more aiming for some
agreement for generic infrastructure to do exactly as you describe.

[1]  https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2016_Developer_Meeting


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Query order of magnitude slower with slightly different where clause

2016-01-17 Thread David Rowley
> Execution time: 5459.461 ms
>
>
> Please let me know if there is any more info I can provide to help figure
> out why it's choosing an undesirable plan with just a slight change in
> the clause.
>

Hi Adam,

This is fairly simple to explain. The reason you see better performance
with the single claim_id is that IN() clauses with a single item are
converted to a single equality expression. For example (just using system
tables so you can try this too, without having to create any special
tables):

# explain select * from pg_class where oid in(1);
 QUERY PLAN
-
 Index Scan using pg_class_oid_index on pg_class  (cost=0.27..8.29 rows=1
width=219)
   Index Cond: (oid = '1'::oid)

We get an index scan with the index condition: oid = 1.

If we have 2 items, then we don't get this.

# explain select * from pg_class where oid in(1,2);
   QUERY PLAN
-
 Bitmap Heap Scan on pg_class  (cost=8.56..14.03 rows=2 width=219)
   Recheck Cond: (oid = ANY ('{1,2}'::oid[]))
   ->  Bitmap Index Scan on pg_class_oid_index  (cost=0.00..8.56 rows=2
width=0)
 Index Cond: (oid = ANY ('{1,2}'::oid[]))
(4 rows)

Now I also need to explain that PostgreSQL will currently push ONLY
equality expressions into other relations. For example, if we write:

# explain select * from pg_class pc inner join pg_attribute pa on pc.oid =
pa.attrelid where pc.oid in(1);
   QUERY PLAN

 Nested Loop  (cost=0.55..22.63 rows=4 width=422)
   ->  Index Scan using pg_class_oid_index on pg_class pc  (cost=0.27..8.29
rows=1 width=223)
 Index Cond: (oid = '1'::oid)
   ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute pa
 (cost=0.28..14.30 rows=4 width=203)
 Index Cond: (attrelid = '1'::oid)
(5 rows)

You can see that I only put pg_class.oid = 1 in the query, but internally
the query planner also added the pg_attribute.attrelid = 1. It was able to
do this due to the join condition dictating that pc.oid = pa.attrelid,
therefore this will always be equal, and since pc.oid = 1, then pa.attrelid
must also be 1.

If we have 2 items in the IN() clause, then this no longer happens:

# explain select * from pg_class pc inner join pg_attribute pa on pc.oid =
pa.attrelid where pc.oid in(1,2);
       QUERY PLAN

 Nested Loop  (cost=8.84..54.84 rows=15 width=422)
   ->  Bitmap Heap Scan on pg_class pc  (cost=8.56..14.03 rows=2 width=223)
 Recheck Cond: (oid = ANY ('{1,2}'::oid[]))
 ->  Bitmap Index Scan on pg_class_oid_index  (cost=0.00..8.56
rows=2 width=0)
   Index Cond: (oid = ANY ('{1,2}'::oid[]))
   ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute pa
 (cost=0.28..20.33 rows=8 width=203)
 Index Cond: (attrelid = pc.oid)
(7 rows)

In your case the claim_id = 'e8a38718-7997-4304-bbfa-138deb84aa82'::uuid
was pushed down into the subqueries, thus giving them less work to do, and
also the flexibility of using indexes on claim_id in the tables contained
within the subqueries. PostgreSQL currently does not push any inequality
predicates down at all.

A few months ago I did a little bit of work to try and lift this
restriction, although I only made it cover the >=, >, < and <= operators as
a first measure.

Details here:
http://www.postgresql.org/message-id/flat/cakjs1f9fk_x_5hkcpcseimy16owe3empmmgsgwlckkj_rw9...@mail.gmail.com#cakjs1f9fk_x_5hkcpcseimy16owe3empmmgsgwlckkj_rw9...@mail.gmail.com

If you didn't have the VIEW, you could manually push these predicates into
each subquery. However this is not really possible to do with the VIEW.
Perhaps something could be done with a function and using dynamic SQL to
craft a query manually, or you could just get rid of the view and have the
application build the query. If that's not an option then maybe you could
respond to the thread above to mention that you've been hit by this
problem and would +1 some solution to fix it, and perhaps cross-link to
this thread. I did have a little bit of a hard time in convincing people
that this was in fact a fairly common problem in the above thread, so it
would be nice to see people who have hit this problem respond to that.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PERFORM] Why is now()::date so much faster than current_date

2015-11-17 Thread David Rowley
On 17 November 2015 at 21:49, Thomas Kellerer <spam_ea...@gmx.net> wrote:

> Hello,
>
> I stumbled over this answer: http://stackoverflow.com/a/9717125/330315
> and this sounded quite strange to me.
>
> So I ran this on my Windows laptop with Postgres 9.4.5, 64bit and indeed
> now()::date is much faster than current_date:
>
>   explain analyze
>   select current_date
>   from   generate_series (1, 100);
>
>   Function Scan on generate_series  (cost=0.00..6.00 rows=1000 width=0)
> (actual time=243.878..1451.839 rows=100 loops=1)
>   Planning time: 0.047 ms
>   Execution time: 1517.881 ms
>
> And:
>
>   explain analyze
>   select now()::date
>   from   generate_series (1, 100);
>
>   Function Scan on generate_series  (cost=0.00..6.00 rows=1000 width=0)
> (actual time=244.491..785.819 rows=100 loops=1)
>   Planning time: 0.037 ms
>   Execution time: 826.612 ms
>
>
>
The key to this is in the EXPLAIN VERBOSE output:

postgres=# explain verbose select current_date;
QUERY PLAN
--
 Result  (cost=0.00..0.01 rows=1 width=0)
   Output: ('now'::cstring)::date
(2 rows)

You can see that the implementation of current_date requires using the
date_in() function as well as the date_out() function. date_in() parses the
'now' string, then the resulting date is converted back into a date string
with date_out().  Using now()::date does not have to parse any date
strings, it just needs to call date_out() to give the final output.

The reason for this is likely best explained by the comment in gram.y:

/*
* Translate as "'now'::text::date".
*
* We cannot use "'now'::date" because coerce_type() will
* immediately reduce that to a constant representing
* today's date.  We need to delay the conversion until
* runtime, else the wrong things will happen when
* CURRENT_DATE is used in a column default value or rule.
*
* This could be simplified if we had a way to generate
* an expression tree representing runtime application
* of type-input conversion functions.  (As of PG 7.3
* that is actually possible, but not clear that we want
* to rely on it.)
*
* The token location is attached to the run-time
* typecast, not to the Const, for the convenience of
* pg_stat_statements (which doesn't want these constructs
* to appear to be replaceable constants).
*/

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PERFORM] Having some problems with concurrent COPY commands

2015-10-13 Thread David Rowley
On 14 October 2015 at 08:33, Shaun Thomas <bonesmo...@gmail.com> wrote:

> On Tue, Oct 13, 2015 at 7:23 AM, Andres Freund <and...@anarazel.de> wrote:
> > and send the results.
>
> Whelp, I'm an idiot. I can't account for how I did it, but I can only
> assume I didn't export my ports in the tests properly. I ran
> everything again and there's a marked difference between 9.3 and 9.4.
> The parallel copy times still inflate, but only from 1.4s to 2.5s at 4
> procs. Though it gets a bit dicey after that.
>
>
>
Do the times still inflate in the same way if you perform the COPY before
adding the indexes to the table?
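
i.e. something like this (table and file names are made up):

  CREATE TABLE t (id int, payload text);
  COPY t FROM '/tmp/data.csv' WITH (FORMAT csv);
  CREATE INDEX t_id_idx ON t (id);  -- build the index after the load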

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PERFORM] Slow HashAggregate/cache access

2015-08-05 Thread David Rowley
On 6 August 2015 at 07:55, Andreas Joseph Krogh <andr...@visena.com> wrote:

 On Wednesday 5 August 2015 at 20:25:25, Maxim Boguk
 <maxim.bo...@gmail.com> wrote:

 [snip]

 ​I think I know where issue is.
 The PostgreSQL planner is unable to pass join conditions into a subquery
 with aggregate functions (it's a well-known limitation).
 [snip]


 I'm curious; will 9.5 help here as it has WHERE clause pushdown in
 subqueries with window functions?

 http://michael.otacoo.com/postgresql-2/postgres-9-5-feature-highlight-where-pushdown-with-window-function/



I've not looked at the query in any detail, but that particular patch won't
help, as it only allows pushdown of predicates into subqueries with window
functions where the predicate is part of all of the subquery's PARTITION BY
clauses.

The query in question has no window clauses, so qual pushdown is not
disabled for that reason.

Regards

David Rowley
--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PERFORM] Slow HashAggregate/cache access

2015-08-05 Thread David Rowley
On 6 August 2015 at 06:25, Maxim Boguk <maxim.bo...@gmail.com> wrote:



 On Wed, Aug 5, 2015 at 11:41 AM, Alexandre de Arruda Paes
 <adald...@gmail.com> wrote:

 Hi,

 First, sorry to compare Post with other database system, but I know
 nothing about Oracle...

 This customer have an application made with a framework thats generates
 the SQL statements (so, We can't make any query optimizations) .

 We did the following tests:

 1) Postgresql 9.3 and Oracle 10 in a desktop machine(8 GB RAM, 1 SATA
 disk,Core i5)
 2) Postgresql 9.3 in a server + FC storage (128 GB RAM, Xeon 32 cores,
 SAS disks)


 ​I think I know where issue is.
 The PostgreSQL planner is unable to pass join conditions into a subquery
 with aggregate functions (it's a well-known limitation).


I think this statement is quite misleading. Let's look at an example:

create table t1 (a int not null, v int not null);
create table t2 (a int not null);
insert into t1 select s.i,10 from generate_series(1,1000)
s(i),generate_series(1,1000);
insert into t2 select generate_series(1,1000);
create index on t1 (a);


explain select t2.a,s.sumv from (select a,sum(v) sumv from t1 group by a) s
inner join t2 on t2.a = s.a where t2.a = 1;
                              QUERY PLAN
------------------------------------------------------------------------
 Nested Loop  (cost=0.42..59.76 rows=1 width=12)
   ->  GroupAggregate  (cost=0.42..42.24 rows=1 width=8)
         Group Key: t1.a
         ->  Index Scan using t1_a_idx on t1  (cost=0.42..37.38 rows=969 width=8)
               Index Cond: (a = 1)
   ->  Seq Scan on t2  (cost=0.00..17.50 rows=1 width=4)
         Filter: (a = 1)
(7 rows)

As you can see, the predicate is pushed down just fine into a subquery with
aggregates.

The likely reason that PostgreSQL is not behaving the same as SQL Server
and Oracle is that its predicate pushdowns are limited to equality
operators only, as internally these are all represented by a series of
equivalence classes which in this case say that 1 = t2.a = t1.a,
therefore it's possible to apply t1.a = 1 at the lowest level.

These equivalence classes don't currently handle non-equality operators.
Here's an example:

explain select t2.a,s.sumv from (select a,sum(v) sumv from t1 group by a) s
inner join t2 on t2.a = s.a where t2.a <= 1;
                              QUERY PLAN
-------------------------------------------------------------------
 Hash Join  (cost=19442.51..19466.27 rows=1 width=12)
   Hash Cond: (t1.a = t2.a)
   ->  HashAggregate  (cost=19425.00..19435.00 rows=1000 width=8)
         Group Key: t1.a
         ->  Seq Scan on t1  (cost=0.00..14425.00 rows=1000000 width=8)
   ->  Hash  (cost=17.50..17.50 rows=1 width=4)
         ->  Seq Scan on t2  (cost=0.00..17.50 rows=1 width=4)
               Filter: (a <= 1)
(8 rows)

Notice the seq scan on t1 instead of the index scan on t1_a_idx.

A way around this is to manually push the predicate down into the subquery:

explain select t2.a,s.sumv from (select a,sum(v) sumv from t1 where t1.a <=
1 group by a) s inner join t2 on t2.a = s.a where t2.a <= 1;
                              QUERY PLAN
---------------------------------------------------------------------
 Nested Loop  (cost=0.42..21.98 rows=1 width=12)
   Join Filter: (t1.a = t2.a)
   ->  GroupAggregate  (cost=0.42..4.46 rows=1 width=8)
         Group Key: t1.a
         ->  Index Scan using t1_a_idx on t1  (cost=0.42..4.44 rows=1 width=8)
               Index Cond: (a <= 1)
   ->  Seq Scan on t2  (cost=0.00..17.50 rows=1 width=4)
         Filter: (a <= 1)
(8 rows)


The query in question is likely performing badly because of this:

 ->  Seq Scan on fr13t1  (cost=0.00..25072.50 rows=311800 width=21) (actual time=0.007..115.766 rows=311800 loops=1)
       Filter: (fr01codemp = '1'::smallint)
       Buffers: shared hit=21175

Just how selective is fr01codemp = '1'::smallint? Is there an index on
that column?

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PERFORM] hyperthreadin low performance

2015-07-21 Thread David Rowley
On 21 July 2015 at 14:59, Jeison Bedoya Delgado
<jeis...@audifarma.com.co> wrote:

 hi everyone,

 Recently I updated a database to a machine with RHEL7, but I see that the
 performance is better if the hyperthreading technology is deactivated and
 only 32 cores are used.

 Is it normal that the machine performance is better with 32 cores than
 with 64 cores?


You might be interested in
http://www.postgresql.org/message-id/53f4f36e.6050...@agliodbs.com

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [PERFORM] Query RE: Optimising UUID Lookups

2015-03-24 Thread David Rowley
On 21 March 2015 at 23:34, Roland Dunn <roland.d...@gmail.com> wrote:


 If we did add more RAM, would it be the effective_cache_size setting
 that we would alter? Is there a way to force PG to load a particular
 table into RAM? If so, is it actually a good idea?


Have you had a look at EXPLAIN (ANALYZE, BUFFERS) for the query?

Pay special attention to "Buffers: shared read=NNN" and "Buffers: shared
hit=NNN". If you're not reading any buffers between runs then the pages are
in the PostgreSQL shared buffers. By the looks of your config you have 10GB
of these. On the other hand, if you're getting buffer reads, then they're
either coming from disk or from the OS cache. PostgreSQL won't really know
the difference.

If you're not getting any buffer reads and it's still slow, then the
problem is not I/O.

Just for fun... What happens if you stick the 50 UUIDs in some table,
analyze it, then perform a join between the 2 tables, using IN() or
EXISTS()... Is that any faster?
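
A sketch of that test (names are made up; substitute your table and the
50 UUIDs):

  CREATE TEMP TABLE lookup_ids (id uuid PRIMARY KEY);
  INSERT INTO lookup_ids VALUES ('00000000-0000-0000-0000-000000000001');
  -- ... the other 49 UUIDs
  ANALYZE lookup_ids;
  SELECT t.* FROM my_table t
  WHERE EXISTS (SELECT 1 FROM lookup_ids l WHERE l.id = t.id);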

Also how well does it perform with: set enable_bitmapscan = off; ?

Regards

David Rowley


Re: [PERFORM] Postgres does not use indexes with OR-conditions

2014-11-06 Thread David Rowley
On Fri, Nov 7, 2014 at 5:16 PM, arhipov <arhi...@dc.baikal.ru> wrote:

 Hello,

 I have just come across an interesting Postgres behaviour with OR-conditions.
 Are there any chances that the optimizer will handle this situation in the
 future?

 select *
 from commons.financial_documents fd
 where fd.creation_time <= '2011-11-07 10:39:07.285022+08'
 order by fd.creation_time desc
 limit 200

 select *
 from commons.financial_documents fd
 where fd.creation_time = '2011-11-07 10:39:07.285022+08'
    or fd.creation_time < '2011-11-07 10:39:07.285022+08'
 order by fd.creation_time desc
 limit 200


It would certainly be possible, providing the constants compare equally,
but... Question: Would you really want to pay a, say, 1% increase in
planning time for ALL queries, so that you could have this unique case of
queries perform better at execution time?

Is there a valid reason why you don't just write the query with the <=
operator?

Regards

David Rowley


Re: [PERFORM] unnecessary sort in the execution plan when doing group by

2014-10-28 Thread David Rowley
On Tue, Oct 28, 2014 at 7:26 PM, Huang, Suya <suya.hu...@au.experian.com>
wrote:

  Hi,



 This is the Greenplum database 4.3.1.0.


Likely this is the wrong place to ask for help. The plan output that you've
pasted below looks very different to PostgreSQL's EXPLAIN output.



                              QUERY PLAN
---------------------------------------------------------------
 Gather Motion 24:1  (slice2; segments: 24)  (cost=31286842.08..31287447.81 rows=1683 width=536)
   Rows out:  15380160 rows at destination with 14860 ms to first row, 23856 ms to end, start offset by 104 ms.
   ->  HashAggregate  (cost=31286842.08..31287447.81 rows=1683 width=536)

---------------------------------------------------------------
 Gather Motion 24:1  (slice2; segments: 24)  (cost=152269717.33..157009763.41 rows=1196982 width=568)
   Rows out:  15380160 rows at destination with 35320 ms to first row, 70091 ms to end, start offset by 102 ms.
   ->  GroupAggregate  (cost=152269717.33..157009763.41 rows=1196982 width=568)


Most likely the reason you're getting the difference in plan is that the
planner has decided that there will be too many hash entries for a
hash table based on the 3 grouping columns. Look at the estimates: 1683 with
2 columns and 1196982 with the 3 columns. If those estimates turned out to
be true, then the hash table for 3 columns would be massively bigger than it
would be with 2 columns. With PostgreSQL you might see the plan changing if
you increased the work_mem setting. For Greenplum, I've no idea if that's
the same.
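
On PostgreSQL the experiment is simply (value illustrative):

  SET work_mem = '256MB';

then re-run EXPLAIN on the three-column GROUP BY and see whether the
plan switches to a HashAggregate.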

Databases are often not very good at knowing what the number of distinct
values would be over more than one column. Certain databases have solved
this with multi-column statistics, but PostgreSQL does not have these,
although I just noticed last night that someone is working on them.

Regards

David Rowley


Re: [PERFORM] IS NOT NULL and LEFT JOIN

2014-10-21 Thread David Rowley
On Tue, Oct 21, 2014 at 2:58 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

 Laurent Martelli <laurent.marte...@enercoop.org> writes:
  Do we agree that both queries are identical ?

 No, they *aren't* identical.  Go consult any SQL reference.  Left join
 conditions don't work the way you seem to be thinking: after the join,
 the RHS column might be null, rather than equal to the LHS column.



For what it's worth I'd say they are identical, at least if you discount
deferring foreign key constraints, or executing the query from within
a volatile function which was called by a query which just updated the
user_info table to break referential integrity.

The presence of the foreign key on contact_contact.user_info, which
references user_user_info.id, means that any non-null
contact_contact.user_info record must reference a valid user_user_info
record. The join is therefore not required to prove that a non-null
user_info contact record matches a user info record, and so the join
which checks that it exists is pretty much pointless in just about all
cases that you're likely to care about.

Although, saying that I'm still a bit confused about the question. Are you
asking if there's some way to get PostgreSQL to run the 1st query faster?
Or are you asking if both queries are equivalent?

Regards

David Rowley


Re: [PERFORM] IS NOT NULL and LEFT JOIN

2014-10-19 Thread David Rowley
On Sun, Oct 19, 2014 at 5:10 PM, Laurent Martelli
<laurent.marte...@enercoop.org> wrote:

  Hello there,

 I have a strange query plan involving an IS NOT NULL and a LEFT JOIN.

 I grant you that the query can be written without the JOIN on
 user_user_info,
 but it is generated like this by hibernate. Just changing the IS NOT NULL
 condition
 to the other side of useless JOIN makes a big difference in the query plan
 :

 -- THE BAD ONE : given the selectivity on c.name and c.email, barely more
 than one row will ever be returned


But it looks like you're ignoring the fact that the OR condition would
force the query to match not only the name and the email, but also any row
that finds a match in the user_user_info table, which, going by the
planner's estimates, is every row in the contact_contact table. This
is why the planner chooses a seq scan on the contact_contact table instead
of using the index on lower(name).

Is it really your intention to get all rows that match this martelli
contact with this email, and, along with that, every contact that
has a non-null user_info record?

I see that you have a foreign key on c.user_info referencing the user, so
this should match everything with a non-null user_info record.


explain analyze select c.*
from contact_contact c
left outer join user_user_info u on c.user_info=u.id
left outer join contact_address a on c.address=a.id
   where lower(c.name)='martelli'
 and c.email='ds...@ezrfz.com' or u.id is not null;

                                   QUERY PLAN
 ------------------------------------------------------------------------------
  Hash Left Join  (cost=1.83..2246.76 rows=59412 width=4012) (actual time=53.645..53.645 rows=0 loops=1)
    Hash Cond: (c.user_info = u.id)
    Filter: (((lower((c.name)::text) = 'martelli'::text) AND ((c.email)::text = 'ds...@ezrfz.com'::text)) OR (u.id IS NOT NULL))
    Rows Removed by Filter: 58247
    ->  Seq Scan on contact_contact c  (cost=0.00..2022.12 rows=59412 width=4012) (actual time=0.007..6.892 rows=58247 loops=1)
    ->  Hash  (cost=1.37..1.37 rows=37 width=8) (actual time=0.029..0.029 rows=37 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 2kB
          ->  Seq Scan on user_user_info u  (cost=0.00..1.37 rows=37 width=8) (actual time=0.004..0.015 rows=37 loops=1)
  Planning time: 0.790 ms
  Execution time: 53.712 ms

 -- THE GOOD ONE (test IS NOT NULL on contact0_.user_info instead of
 userinfo1_.id)
 explain analyze select c.*
from contact_contact c
left outer join user_user_info u on c.user_info=u.id
left outer join contact_address a on c.address=a.id
   where lower(c.name)='martelli'
 and c.email='ds...@ezrfz.com' or c.user_info is not null;
                                   QUERY PLAN
 ------------------------------------------------------------------------------
  Bitmap Heap Scan on contact_contact c  (cost=8.60..16.41 rows=1 width=4012) (actual time=0.037..0.037 rows=0 loops=1)
    Recheck Cond: (((email)::text = 'ds...@ezrfz.com'::text) OR (user_info IS NOT NULL))
    Filter: (((lower((name)::text) = 'martelli'::text) AND ((email)::text = 'ds...@ezrfz.com'::text)) OR (user_info IS NOT NULL))
    ->  BitmapOr  (cost=8.60..8.60 rows=2 width=0) (actual time=0.034..0.034 rows=0 loops=1)
          ->  Bitmap Index Scan on idx_contact_email  (cost=0.00..4.30 rows=2 width=0) (actual time=0.027..0.027 rows=0 loops=1)
                Index Cond: ((email)::text = 'ds...@ezrfz.com'::text)
          ->  Bitmap Index Scan on contact_contact_user_info_idx  (cost=0.00..4.30 rows=1 width=0) (actual time=0.007..0.007 rows=0 loops=1)
                Index Cond: (user_info IS NOT NULL)
  Planning time: 0.602 ms
  Execution time: 0.118 ms



If you look closely at the 2nd query plan, you'll see that no joins are
performed, and it's only the contact_contact table that's looked at. This
is because PostgreSQL sees that none of the columns from the 2 tables
which are being left joined to are used, and also that the columns that
you're joining to on these tables are unique, therefore joining to them
cannot duplicate any rows. Since these are left joins, a missing match
wouldn't filter out rows from the contact_contact table either, as it
would with INNER JOINs. The planner sees that these left joins are
pointless, so it just removes them from the plan.
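
A minimal demonstration of that join removal (made-up tables, any
reasonably recent PostgreSQL):

  create table u (id bigint primary key, name text);
  create table c (id bigint primary key, u_id bigint);
  explain select c.* from c left join u on u.id = c.u_id;
  -- The plan is a bare "Seq Scan on c": no columns of u are used and
  -- u.id is unique, so the left join cannot change the result.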

Regards

David Rowley


Re: [PERFORM] View has different query plan than select statement

2014-05-19 Thread David Rowley
On Mon, May 19, 2014 at 4:47 PM, Geoff Hull <geoff.h...@mccarthy.co.nz> wrote:

 I am sending this on behalf of my colleague who tried to post to this list
 last year but without success, then also tried
 pgsql-performance-ow...@postgresql.org but without getting a reply.

 I have recently re-tested this in P/G version 9.3.4 with the same results:

 Hi,

 I have created a table 'test_table' and index 'idx_test_table' with a view
 'v_test_table'. However the query plan used by the view does not use the
 index but when running the select statement itself it does use the index.
 Given that query specific hints are not available in Postgres 9.1 how can I
 persuade the view to use the same query plan as the select statement?

 Thanks,

 Tim


 --DROP table test_table CASCADE;

 -- create test table
 CREATE TABLE test_table (
 history_id SERIAL,
 id character varying(50) NOT NULL ,
 name character varying(50),
 CONSTRAINT test_table_pkey PRIMARY KEY (history_id)
 );

 -- create index on test table
 CREATE INDEX idx_test_table ON test_table (id);

 -- populate test table
 INSERT INTO test_table (id, name) SELECT *, 'Danger Mouse' FROM (SELECT
 md5(random()::text) from generate_series(1,1)) q;

 -- collect stats
 ANALYZE test_table;


 EXPLAIN (ANALYZE, BUFFERS)
 SELECT *
 FROM test_table
 WHERE id = '02b304b1c54542570d9f7bd39361f5b4';

 Index Scan using idx_test_table on test_table (cost=0.00..8.27 rows=1
 width=50) (actual time=0.021..0.022 rows=1 loops=1)
  Index Cond: ((id)::text = '02b304b1c54542570d9f7bd39361f5b4'::text)
  Buffers: shared hit=3
 Total runtime: 0.051 ms


 -- select statement with good plan

 EXPLAIN (ANALYZE, BUFFERS)
 SELECT id,
 CASE WHEN COALESCE(LAG(name) OVER (PARTITION BY id ORDER BY history_id),
 name || 'x') <> name
 then name
 end as name
 FROM test_table
 WHERE id = '02b304b1c54542570d9f7bd39361f5b4';

 WindowAgg  (cost=8.28..8.31 rows=1 width=50) (actual time=0.050..0.051 rows=1 loops=1)
   Buffers: shared hit=3
   ->  Sort  (cost=8.28..8.29 rows=1 width=50) (actual time=0.039..0.039 rows=1 loops=1)
         Sort Key: history_id
         Sort Method: quicksort Memory: 25kB
         Buffers: shared hit=3
         ->  Index Scan using idx_test_table on test_table  (cost=0.00..8.27 rows=1 width=50) (actual time=0.030..0.031 rows=1 loops=1)
               Index Cond: ((id)::text = '02b304b1c54542570d9f7bd39361f5b4'::text)
               Buffers: shared hit=3
 Total runtime: 0.102 ms


 --DROP VIEW v_test_table;

 CREATE OR REPLACE VIEW v_test_table AS
 SELECT id,
 CASE WHEN COALESCE(LAG(name) OVER (PARTITION BY id ORDER BY history_id),
 name || 'x') <> name
 then name
 end as name
 FROM test_table;


 -- Query via view with bad plan

 EXPLAIN (ANALYZE, BUFFERS)
 SELECT *
 FROM v_test_table
 WHERE id = '02b304b1c54542570d9f7bd39361f5b4';

 Subquery Scan on v_test_table  (cost=868.39..1243.39 rows=50 width=65) (actual time=26.115..33.327 rows=1 loops=1)
   Filter: ((v_test_table.id)::text = '02b304b1c54542570d9f7bd39361f5b4'::text)
   Buffers: shared hit=104, temp read=77 written=77
   ->  WindowAgg  (cost=868.39..1118.39 rows=1 width=50) (actual time=26.022..32.519 rows=1 loops=1)
         Buffers: shared hit=104, temp read=77 written=77
         ->  Sort  (cost=868.39..893.39 rows=1 width=50) (actual time=26.013..27.796 rows=1 loops=1)
               Sort Key: test_table.id, test_table.history_id
               Sort Method: external merge Disk: 608kB
               Buffers: shared hit=104, temp read=77 written=77
               ->  Seq Scan on test_table  (cost=0.00..204.00 rows=1 width=50) (actual time=0.010..1.804 rows=1 loops=1)
                     Buffers: shared hit=104
 Total runtime: 33.491 ms


 How can I get the view to use the same query plan as the select statement?


Hi Geoff,

Unfortunately the view is not making use of the index due to the presence
of the windowing function in the view. I think you would find that if that
was removed then the view would more than likely use the index again.

The reason for this is that currently the WHERE clause of the outer query
is not pushed down into the view due to some overly strict code which
completely disallows pushdowns of where clauses into sub queries that
contain windowing functions...

In your case, because you have this id in your PARTITION BY clause, it
is technically possible to push the where clause down into the sub
query. I wrote a patch a while back which lifts this restriction. It
unfortunately missed the boat for 9.4, but with any luck it will make it
into 9.5. If you're up for compiling Postgres from source, then you can
test the patch out:

http://www.postgresql.org/message-id/cahoyfk9ihosarntwc-nj5tphko4wcausd-1c_0wecogi9ue...@mail.gmail.com

It should apply to current HEAD without too much trouble.

Regards

David Rowley


Re: [PERFORM] DATE_TRUNC() and GROUP BY?

2013-12-20 Thread David Rowley
On Fri, Dec 20, 2013 at 1:35 PM, Dave Johansen <davejohan...@gmail.com> wrote:

 I just ran into an interesting issue on Postgres 8.4. I have a database
 with about 3 months of data and when I do following query:
 SELECT DATE_TRUNC('day', time) AS time_t, COUNT(*) FROM mytable GROUP BY
 time_t;

 EXPLAIN shows that it's doing a sort and then a GroupAggregate. There will
 only be ~90 outputs, so is there a way I can hint/force the planner to just
 do a HashAggregate?

 Just to see if it would change the plan, I tried increasing the work_mem
 up to 1GB and it still did the same plan.


PostgreSQL does not really have any stats on the selectivity of
date_trunc('day', time), so my guess is that it can only assume that it has
the same selectivity as the time column by itself, which is very untrue in
this case.
The group aggregate plan is chosen here as PostgreSQL thinks the hash
table is going to end up pretty big and decides that the group aggregate
will be the cheaper option.

I mocked up your data and on 9.4 I can get the hash aggregate plan to run
if I set the n_distinct value to 90 then analyze the table again. Even if
you could do this on 8.4 I'd not recommend it, as it will probably cause
havoc with other plans around the time column. I did also get the hash
aggregate plan to run if I created a functional index on date_trunc('day',
time) then ran analyze again. I don't have a copy of 8.4 around to see if
the planner will make use of the index in the same way.

What would be really nice is if we could create our own statistics on what
we want, something like:

CREATE STATISTICS name ON table (date_trunc('day', time));

That way postgres could have a better idea of the selectivity in this
situation.

I'd give creating the functional index a try, but keep in mind the overhead
that it will cause with inserts, updates and deletes.
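
The experiment looks like this (index name made up; note this only
works if "time" is timestamp without time zone, as date_trunc on
timestamptz is not immutable and so can't be indexed directly):

  CREATE INDEX mytable_day_idx ON mytable (date_trunc('day', "time"));
  ANALYZE mytable;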

Regards

David Rowley


 Thanks,
 Dave



Re: [PERFORM] PostgreSQL over internet

2013-01-27 Thread David Rowley


 From: pgsql-performance-ow...@postgresql.org
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of belal hamed
 Sent: 27 January 2013 13:16
 To: pgsql-performance@postgresql.org
 Subject: [PERFORM] PostgreSQL over internet


 Using Wireshark I monitored TCP packets and found the total data
 transmitted/received was 400B.
 It took about 2.5s to fetch the results. Why?


Are you sure there's not any QOS somewhere that is slowing down the packets
for port 5432 or whichever you're using for PostgreSQL?
Perhaps temporarily changing PostgreSQL's listening port to something else
might be a good test.
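
e.g. in postgresql.conf (test value only), followed by a restart and
pointing the client at the new port:

  port = 5433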

David





Re: [PERFORM] indexing for distinct search in timestamp based table

2008-08-28 Thread David Rowley
I once also had a similar performance problem when looking for all matching
rows between two timestamps. In fact, that's why I'm here today. The problem
was with MySQL. I had some tables of around 10 million rows and all my
searching was timestamp based. MySQL didn't do what I wanted. I found using
a CLUSTERED index with PostgreSQL to be lightning quick, yet mostly the
matching rows I was working with were not much over the 100k mark. I'm
wondering if clustering the table on ad_log_start_time will help cut down on
random reads.

That's if you can afford to block the users while PostgreSQL clusters the
table.

If you're inserting in order of the start_time column (which I was) then the
cluster should almost maintain itself (I think), providing you're not
updating or deleting, which I'd assume is the case since it looks like a log
table.
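
i.e. using the index from your schema (CLUSTER takes an exclusive lock
for the duration):

  CLUSTER ad_log USING ad_log_start_time;
  ANALYZE ad_log;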

David.

 
-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Rainer Mager
Sent: 28 August 2008 09:06
To: pgsql-performance@postgresql.org
Subject: [PERFORM] indexing for distinct search in timestamp based table

I'm looking for some help in speeding up searches. My table is pretty simple
(see below), but somewhat large, and continuously growing. Currently it has
about 50 million rows.

The table is (I know I have excessive indexes, I'm trying to get the
appropriate ones and drop the extras):
                            Table "public.ad_log"
    Column    |            Type             |                         Modifiers
--------------+-----------------------------+-------------------------------------------------------------
 ad_log_id    | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel_name | text                        | not null
 player_name  | text                        | not null
 ad_name      | text                        | not null
 start_time   | timestamp without time zone | not null
 end_time     | timestamp without time zone | not null
Indexes:
    "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
    "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name, ad_name, start_time, end_time)
    "ad_log_ad_and_start" btree (ad_name, start_time)
    "ad_log_ad_name" btree (ad_name)
    "ad_log_all" btree (channel_name, player_name, start_time, ad_name)
    "ad_log_channel_name" btree (channel_name)
    "ad_log_end_time" btree (end_time)
    "ad_log_player_and_start" btree (player_name, start_time)
    "ad_log_player_name" btree (player_name)
    "ad_log_start_time" btree (start_time)



The query I'm trying to speed up is below. In it the field tag can be one
of channel_name, player_name, or ad_name. I'm actually trying to return the
distinct values and I found GROUP BY to be slightly faster than using
DISTINCT. Also, any of those fields may be unspecified in the WHERE clauses
in which case we use '%', but it seems Postgres optimizes that pretty well.

SELECT field FROM ad_log 
WHERE channel_name LIKE :channel_name
AND player_name LIKE :player_name 
AND ad_name LIKE :ad_name 
AND start_time BETWEEN :start_date AND (date(:end_date) + 1)
GROUP BY field ORDER BY field


A typical query is:

explain analyze SELECT channel_name FROM ad_log WHERE channel_name LIKE '%'
AND ad_name LIKE '%' AND start_time BETWEEN '2008-07-01' AND
(date('2008-07-28') + 1) GROUP BY channel_name ORDER BY channel_name;

with the result being:
 
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual time=75460.187..75460.192 rows=15 loops=1)
   Sort Key: channel_name
   Sort Method:  quicksort  Memory: 17kB
   ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual time=75460.107..75460.114 rows=15 loops=1)
         ->  Bitmap Heap Scan on ad_log  (cost=285064.30..1129582.84 rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296 loops=1)
               Recheck Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date))
               Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~ '%'::text))
               ->  Bitmap Index Scan on ad_log_start_time  (cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443 rows=13701296 loops=1)
                     Index Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date))
 Total runtime: 75460.361 ms


It seems to me there should be some way to create an index to speed this up,
but the various ones I've tried so far haven't helped. Any suggestions would
be greatly appreciated.





Re: [PERFORM] Nested Loop join being improperly chosen

2008-08-28 Thread David Rowley
I had a similar problem here:
http://archives.postgresql.org/pgsql-bugs/2008-07/msg00026.php

Is the nested loop performing a LEFT join with yours? It's a little
difficult to tell just from the query plan you showed.

A workaround for mine was to use a full outer join and eliminate the extra
rows in the where clause. A bit of a hack, but it changed a 2 min query into
one that ran in under a second.
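
Schematically the rewrite was (not the original query; table names are
placeholders):

  -- instead of:  ... FROM a LEFT JOIN b ON b.a_id = a.id ...
  SELECT a.*, b.*
  FROM a FULL OUTER JOIN b ON b.a_id = a.id
  WHERE a.id IS NOT NULL;  -- drops b-only rows, restoring left-join semantics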

Of course this is not helping with your problem but at least may trigger
some more feedback.

David.


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Brad Ediger
Sent: 22 August 2008 16:26
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Nested Loop join being improperly chosen

Hello,
I'm having trouble with a Nested Loop being selected for a rather  
complex query; it turns out this is a pretty bad plan as the nested  
loop's row estimates are quite off (1 estimated / 1207881 actual). If  
I disable enable_nestloop, the query executes much faster (42 seconds  
instead of 605). The tables in the query have all been ANALYZEd just  
before generating these plans.

Here are the plans with and without enable_nestloop:

http://pastie.org/258043

The inventory table is huge; it currently has about 1.3 x 10^9 tuples.  
The items table has around 10,000 tuples, and the other tables in the  
query are tiny.

Any ideas or suggestions would be greatly appreciated. Thanks!
--
Brad Ediger


