Re: [PERFORM] Performance issues
On 22.3.2015 22:50, Vivekanand Joshi wrote:
> Any documentation regarding how to configure postgresql.conf file as per individual user?

That can't be done in postgresql.conf, but by ALTER ROLE commands.

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
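For example, a minimal sketch (the role name and values are illustrative, not from the thread):

    ALTER ROLE report_user SET work_mem = '128MB';
    ALTER ROLE report_user SET enable_nestloop = off;

    -- settings apply to the role's new sessions; inspect them with:
    SELECT * FROM pg_db_role_setting;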
Re: [PERFORM] Performance issues
Any documentation regarding how to configure postgresql.conf file as per individual user?

On 21 Mar 2015 13:10, Josh Krupka jkru...@gmail.com wrote:
> The other approaches (fixing the estimates, cost params, etc.) are the right way of fixing it. *However*, if you need a quick fix for just this report and can't find a way of setting it in Jaspersoft for just that report (I don't think it will let you run multiple SQL statements by default, maybe not at all), there are still a couple more options. You can define a new datasource in Jasper, point this report to that datasource, and configure that new datasource not to use nested loops. You could do that either by making the new datasource use a different user than everything else and disabling nested loops for that user in Postgres, or you could probably have the datasource initialization process disable nested loops.
Re: [PERFORM] Performance issues
The other approaches (fixing the estimates, cost params, etc.) are the right way of fixing it. *However*, if you need a quick fix for just this report and can't find a way of setting it in Jaspersoft for just that report (I don't think it will let you run multiple SQL statements by default, maybe not at all), there are still a couple more options. You can define a new datasource in Jasper, point this report to that datasource, and configure that new datasource not to use nested loops. You could do that either by making the new datasource use a different user than everything else and disabling nested loops for that user in Postgres, or you could probably have the datasource initialization process disable nested loops.
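A minimal sketch of the different-user route (the role name and grants are made up; adjust to your schema):

    CREATE ROLE jasper_reports LOGIN PASSWORD '...';
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO jasper_reports;
    ALTER ROLE jasper_reports SET enable_nestloop = off;

Every session the new Jasper datasource opens as this role then plans without nested loops, while all other users keep the default planner behavior.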
Re: [PERFORM] Performance issues
Vivekanand Joshi vjo...@zetainteractive.com writes:
> So, here is the first taste of success, and it gives me the confidence that, if properly worked out with good hardware and proper tuning, PostgreSQL could be a good replacement. Out of the 9 reports which need to be migrated to PostgreSQL, 3 are now running. Report 4 was giving an issue and I will look at it tomorrow. Just to inform you guys: the thing that helped most was setting enable_nestloop to false. Plans are now not miscalculated.
> But this is not a production-suitable setting. So what do you think, how do we get a workaround for this?

Consider just disabling that setting for the one or few odd queries you have which are known to plan badly:

    begin;
    set local enable_nestloop to false;
    select ...;
    commit/abort;

I'd say never make that sort of setting DB- or cluster-wide.

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consult...@comcast.net   p: 312.241.7800
Re: [PERFORM] Performance issues
2015-03-18 14:31 GMT-03:00 Vivekanand Joshi vjo...@zetainteractive.com:
> So, here is the first taste of success, and it gives me the confidence that, if properly worked out with good hardware and proper tuning, PostgreSQL could be a good replacement. Out of the 9 reports which need to be migrated to PostgreSQL, 3 are now running. Report 4 was giving an issue and I will look at it tomorrow. Just to inform you guys: the thing that helped most was setting enable_nestloop to false. Plans are now not miscalculated.
> But this is not a production-suitable setting. So what do you think, how do we get a workaround for this?

What about creating a read-only replica and applying this setting there?
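If the replica route is taken, the setting could live in the replica's own configuration, so the primary keeps normal planning. A sketch, assuming reporting sessions connect only to the replica:

    # in the replica's postgresql.conf (not the primary's):
    enable_nestloop = off

    -- then, in a session on the replica:
    SELECT pg_reload_conf();
    SHOW enable_nestloop;   -- should report 'off' in new sessions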
Re: [PERFORM] Performance issues
The issue here is that the queries are running inside Jasper Reports, so we cannot set this for only one single query. We access our reports from a web browser, which in turn runs the report from the application server (Jasper). This server connects to the PostgreSQL server. Inside a JRXML (Jasper report) file we cannot set this parameter. I am attaching a JRXML file to give you a feel for it; you can open the file in Notepad. I don't think we can set a server-level property in this file. So how about a workaround?

Vivek

-Original Message-
From: Jerry Sievers [mailto:gsiever...@comcast.net]
Sent: Thursday, March 19, 2015 12:06 AM
To: vjo...@zetainteractive.com
Cc: Tomas Vondra; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance issues

> Consider just disabling that setting for the one or few odd queries you have which are known to plan badly. I'd say never make that sort of setting DB- or cluster-wide.

AvgEmailsRecieved.jrxml
Description: Binary data
Re: [PERFORM] Performance issues
You can set it for the DB user, or use a stored proc.

Best regards,
Vitalii Tymchyshyn

On Wed, 18 Mar 2015 14:48, Vivekanand Joshi vjo...@zetainteractive.com wrote:
> The issue here is that the queries are running inside Jasper Reports, so we cannot set this for only one single query. [...] Inside a JRXML (Jasper report) file we cannot set this parameter. [...] So how about a workaround?
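The 'stored proc' variant can use a function-level setting: a GUC attached to a function applies only while the function runs. A sketch (the function name and body are illustrative, modeled on the count query from this thread):

    CREATE OR REPLACE FUNCTION report_target_count(d1 date, d2 date)
    RETURNS bigint
    LANGUAGE sql
    STABLE
    SET enable_nestloop = off   -- applies only inside this function
    AS $$
        SELECT COUNT(DISTINCT ph.target_id)
        FROM s_v_f_promotion_history_email ph
        JOIN s_v_d_campaign_hierarchy ch
          ON ph.touchpoint_execution_id = ch.touchpoint_execution_id
        WHERE ph.send_dt BETWEEN d1 AND d2;
    $$;

The Jasper report would then run SELECT report_target_count('2014-03-13', '2015-03-14'); as its single query.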
Re: [PERFORM] Performance issues
Hi,

On 18.3.2015 18:31, Vivekanand Joshi wrote:
> Just to inform you guys: the thing that helped most was setting enable_nestloop to false. Plans are now not miscalculated.

The estimates are still miscalculated, but you're forcing the database not to use the nested loop. The problem is that a nested loop may be appropriate in some cases (maybe only in a few places of the plan), so this is a really coarse-grained solution.

> But this is not a production-suitable setting. So what do you think, how do we get a workaround for this?

(a) Try to identify why the queries are poorly estimated, and rephrase them somehow. This is the best solution, but it takes time and expertise, and may not be feasible in some cases.

(b) Tweak the database structure, possibly introducing intermediate tables, materialized views (or tables maintained by triggers; this might work for the 'latest record' subquery), etc.

(c) Try to tweak the cost parameters, to make the nested loops more expensive (and thus less likely to be selected), but in a more gradual way than enable_nestloop=false. (A sketch follows below.)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
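One way option (c) could look; a sketch only, with illustrative starting values rather than recommendations, so re-check plans after each change:

    -- per session, or persistently for the reporting role via ALTER ROLE:
    SET random_page_cost = 6.0;        -- default 4.0; makes the repeated inner
                                       -- index scans of a nested loop look costlier
    SET cpu_index_tuple_cost = 0.01;   -- default 0.005; same direction, gentler
    EXPLAIN SELECT ...;                -- verify the plan actually changed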
Re: [PERFORM] Performance issues
So, here is the first taste of success, and it gives me the confidence that, if properly worked out with good hardware and proper tuning, PostgreSQL could be a good replacement. Out of the 9 reports which need to be migrated to PostgreSQL, 3 are now running. Report 4 was giving an issue and I will look at it tomorrow.

Just to inform you guys: the thing that helped most was setting enable_nestloop to false. Plans are now not miscalculated. But this is not a production-suitable setting. So what do you think, how do we get a workaround for this?

Regards, Vivek
Re: [PERFORM] Performance issues
The confusion for me here is this: I am getting results from the views in around 3 seconds (S_V_D_CAMPAIGN_HIERARCHY) and 25 seconds (S_V_F_PROMOTION_HISTORY_EMAIL). But when I use these two views as the joining tables in the query, it doesn't return any result at all.

As per my understanding, the planner is making a new plan, and that is costly, instead of using the output from the views, which is actually understandable. Is there a way we can do anything about it? I hope I am making some sense here.

Regards, Vivek
Re: [PERFORM] Performance issues
Tomas Vondra schrieb am 17.03.2015 um 15:43:
> On 17.3.2015 15:19, Thomas Kellerer wrote:
>> From my experience, rewriting something like the above using DISTINCT ON is usually faster.
> How do you get the last record (with respect to a timestamp column) using a DISTINCT ON?

You need to use ORDER BY ... DESC. See here: http://sqlfiddle.com/#!15/d4846/2

Btw: your row_number() usage wouldn't return the latest row either. It would return the oldest row.
Re: [PERFORM] Performance issues
This is the explain for a simple query:

    explain
    Select * from S_V_F_PROMOTION_HISTORY_EMAIL a
    inner join S_V_D_CAMPAIGN_HIERARCHY b
       on a.touchpoint_execution_id = b.touchpoint_execution_id;

http://explain.depesz.com/s/gse

I am wondering why the total cost here is low even though the result is not coming out.

Regards, Vivek
    Nested Loop (cost=41187.21..45053.54 rows=1 width=403)
      Join Filter: (s_f_touchpoint_execution_status_history_2.touchpoint_execution_id = s_f_touchpoint_execution_status_history.touchpoint_execution_id)
      -> Nested Loop Left Join (cost=18488.74..20591.32 rows=1 width=271)
         -> Nested Loop (cost=18488.61..20591.17 rows=1 width=263)
            -> Nested Loop Left Join (cost=18488.48..20591.00 rows=1 width=255)
               -> Hash Join (cost=18488.35..20590.84 rows=1 width=247)
                  Hash Cond: ((s_f_touchpoint_execution_status_history_1.touchpoint_execution_id = s_f_touchpoint_execution_status_history.touchpoint_execution_id) AND ((max(s_f_touchpoint_execution_status_history_1.creation_dt)) = s_f_touchpoint_execution_status_history.creation_dt))
                  -> HashAggregate (cost=6221.56..6986.10 rows=76454 width=16)
                     -> Seq Scan on s_f_touchpoint_execution_status_history s_f_touchpoint_execution_status_history_1 (cost=0.00..4766.04 rows=291104 width=16)
                  -> Hash (cost=12266.76..12266.76 rows=2 width=247)
                     -> Hash Join (cost=6261.89..12266.76 rows=2 width=247)
                        Hash Cond: (s_f_touchpoint_execution_status_history.touchpoint_execution_id = tp_exec.touchpoint_execution_id)
                        -> Seq Scan on s_f_touchpoint_execution_status_history (cost=0.00..5493.80 rows=136280 width=16)
                           Filter: (touchpoint_execution_status_type_id = ANY ('{3,4}'::integer[]))
                        -> Hash (cost=6261.88..6261.88 rows=1 width=231)
                           -> Nested Loop Left Join (cost=1955.40..6261.88 rows=1 width=231)
                              -> Nested Loop Left Join (cost=1955.27..6261.72 rows=1 width=222)
                                 -> Nested Loop (cost=1954.99..6261.41 rows=1 width=197)
                                    -> Nested Loop (cost=1954.71..6261.08 rows=1 width=173)
                                       Join Filter: (camp_exec.campaign_id = wave.campaign_id)
                                       -> Nested Loop (cost=1954.42..6255.85 rows=16 width=167)
                                          -> Hash Join (cost=1954.13..6249.70 rows=16 width=108)
                                             Hash Cond: ((tp_exec.touchpoint_id = tp_1
Re: [PERFORM] Performance issues
On 17.3.2015 16:10, Vivekanand Joshi wrote:
> The confusion for me here is this: I am getting results from the views in around 3 seconds (S_V_D_CAMPAIGN_HIERARCHY) and 25 seconds (S_V_F_PROMOTION_HISTORY_EMAIL). But when I use these two views as the joining tables in the query, it doesn't return any result.
> As per my understanding, the planner is making a new plan, and that is costly, instead of using the output from the views, which is actually understandable.

In general, yes. The problem is that the plan is constructed based on the estimates, and those are very inaccurate in this case. The planner may do various changes, but let's assume that does not happen, and that the plans are executed and the results joined. For example, what might happen is this:

    for each row in 's_v_d_campaign_hierarchy' (1 row expected):
        execute the s_v_f_promotion_history_email join (11644 rows expected)

But then it gets 45k rows from s_v_d_campaign_hierarchy, and ~400x more rows from s_v_f_promotion_history_email (I'm neglecting the join condition here, but that's not really significant). Kaboom!

In reality, the plan is reorganized (e.g. a different join order), but the misestimates are still lurking there.

> Is there a way we can do anything about it?

Rephrasing the query so that the planner can estimate it more accurately.

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [PERFORM] Performance issues
Hi Guys,

Next level of query is the following; if this works, I guess 90% of the problem will be solved.

    SELECT COUNT(DISTINCT TARGET_ID)
    FROM S_V_F_PROMOTION_HISTORY_EMAIL PH
    INNER JOIN S_V_D_CAMPAIGN_HIERARCHY CH
       ON PH.TOUCHPOINT_EXECUTION_ID = CH.TOUCHPOINT_EXECUTION_ID
    WHERE 1=1
      AND SEND_DT >= '2014-03-13'
      AND SEND_DT <= '2015-03-14'

In this query, I am joining two views which were made earlier with CTEs. I have replaced the CTEs with subqueries. The views were giving me output in around 5-10 minutes, and now I am getting the same result in around 3-4 seconds.

But when I executed the query written above, I am again stuck. I am attaching the query plan as well as the link: http://explain.depesz.com/s/REeu

I can see most of the time is spent inside a nested loop, and the total cost comes out to be cost=338203.81..338203.82. How do I take care of this?

I need to run this query in a report, so I cannot create a table like 'select * from views' and then join the tables. If I do that, I get the answer to the whole big query in some 6-7 seconds, but that is not feasible: a report (Jasper) can have only one single (big/small) query.

Let me know if you need any other information.

Thanks a ton!
Vivek
Re: [PERFORM] Performance issues
EXPLAIN ANALYZE didn't give a result even after three hours.

-Original Message-
From: Vivekanand Joshi [mailto:vjo...@zetainteractive.com]
Sent: Tuesday, March 17, 2015 1:11 PM
To: 'Jim Nasby'; 'Tomas Vondra'; 'Scott Marlowe'; 'Varadharajan Mukundan'
Cc: 'pgsql-performance@postgresql.org'
Subject: RE: [PERFORM] Performance issues

> Next level of query is the following; if this works, I guess 90% of the problem will be solved. [...] I am attaching the query plan as well as the link: http://explain.depesz.com/s/REeu
Re: [PERFORM] Performance issues
Hi,

On 17.3.2015 08:41, Vivekanand Joshi wrote:
> I can see most of the time is spent inside a nested loop, and the total cost comes out to be cost=338203.81..338203.82.

Most of that cost comes from this:

    Seq Scan on s_f_promotion_history base (cost=0.00..283,333.66 rows=1 width=32)
      Filter: ((send_dt >= '2014-03-13 00:00:00'::timestamp without time zone) AND (send_dt <= '2015-03-14 00:00:00'::timestamp without time zone))

That's a bit weird, I guess. If you analyze this part of the query separately, i.e.

    EXPLAIN ANALYZE SELECT * FROM s_f_promotion_history
    WHERE (send_dt >= '2014-03-13 00:00:00') AND (send_dt <= '2015-03-14 00:00:00')

what do you get?

I suspect it's used in an EXISTS, i.e. something like this:

    ... WHERE EXISTS (SELECT * FROM s_f_promotion_history
                      WHERE ... send_dt conditions ...
                      AND touchpoint_execution_id = s_f_touchpoint_execution_status_history_1.touchpoint_execution_id)

and this is transformed into a nested loop join. If there's a misestimate, this may be quite expensive; try to create an index on s_f_promotion_history (touchpoint_execution_id, send_dt).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [PERFORM] Performance issues
Hi Tomas,

This is what I am getting:

    EXPLAIN ANALYZE SELECT * FROM s_f_promotion_history
    WHERE (send_dt >= '2014-03-13 00:00:00') AND (send_dt <= '2015-03-14 00:00:00');

    Seq Scan on s_f_promotion_history (cost=0.00..28.66 rows=1 width=74) (actual time=711.023..1136.393 rows=1338 loops=1)
      Filter: ((send_dt >= '2014-03-13 00:00:00'::timestamp without time zone) AND (send_dt <= '2015-03-14 00:00:00'::timestamp without time zone))
      Rows Removed by Filter: 9998662
    Total runtime: 1170.682 ms

    CREATE INDEX idx_pr_history ON S_F_PROMOTION_HISTORY (touchpoint_execution_id, send_dt);

After creating the index:

    Index Scan using idx_pr_history on s_f_promotion_history (cost=0.43..254028.45 rows=1 width=74) (actual time=375.796..604.587 rows=1338 loops=1)
      Index Cond: ((send_dt >= '2014-03-13 00:00:00'::timestamp without time zone) AND (send_dt <= '2015-03-14 00:00:00'::timestamp without time zone))
    Total runtime: 604.733 ms

The query I gave you is the smallest query; it uses two views, and in both views I have replaced the CTEs with subqueries. When I join these two views, it does not complete at all.

Explain analyze plan for view s_v_f_promotion_history_email: http://explain.depesz.com/s/ure
Explain analyze plan for view s_v_d_campaign_hierarchy: http://explain.depesz.com/s/WxI

Regards, Vivek
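The rows=1 estimate against 1338 actual rows hints that the statistics on send_dt may be too coarse or stale. A hedged next step (not from the thread; the statistics target value is illustrative, and it may or may not move the estimate):

    ALTER TABLE s_f_promotion_history
        ALTER COLUMN send_dt SET STATISTICS 1000;  -- per-column override of default_statistics_target
    ANALYZE s_f_promotion_history;
    -- then re-run the EXPLAIN ANALYZE above and see if the estimate improves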
Re: [PERFORM] Performance issues
On 17.3.2015 12:07, Vivekanand Joshi wrote:
> EXPLAIN ANALYZE didn't give a result even after three hours.

In that case the only thing you can do is 'slice' the query into smaller parts (representing subtrees of the plan), and analyze those first. Look for misestimates (significant differences between estimated and actual row counts) and very expensive parts. We can't do that ourselves, because we don't have your data or queries, and without the explain analyze it's difficult to give advice.

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
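In practice, slicing can start with the two views on their own, comparing estimated and actual row counts node by node. A sketch (TIMING OFF just reduces instrumentation overhead):

    EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM s_v_f_promotion_history_email;
    EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM s_v_d_campaign_hierarchy;
    -- look for nodes where the estimated rows=... and the actual rows=...
    -- differ by orders of magnitude; those are the misestimates to chase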
Re: [PERFORM] Performance issues
Attaching the explain analyze file as well.

Vivek
s_v_f_promotion_history_email

    Hash Left Join (cost=34679.90..37396.37 rows=11644 width=148) (actual time=609.472..9070.675 rows=4559289 loops=1)
      Hash Cond: ((base.promo_hist_id = email.promo_hist_id) AND (base.audience_member_id = email.audience_member_id))
      -> Nested Loop (cost=32782.62..35004.21 rows=11644 width=74) (actual time=567.441..4096.969 rows=4559289 loops=1)
         Join Filter: (s_f_touchpoint_execution_status_history.touchpoint_execution_id = base.touchpoint_execution_id)
         -> Nested Loop (cost=32782.19..34504.16 rows=1 width=16) (actual time=337.484..884.438 rows=46454 loops=1)
            -> Nested Loop (cost=32781.90..34503.83 rows=1 width=24) (actual time=337.462..682.943 rows=71892 loops=1)
               -> Unique (cost=32781.61..34495.50 rows=1 width=8) (actual time=337.428..478.619 rows=76088 loops=1)
                  -> Merge Join (cost=32781.61..34495.50 rows=1 width=8) (actual time=337.427..454.249 rows=77090 loops=1)
                     Merge Cond: ((s_f_touchpoint_execution_status_history.touchpoint_execution_id = s_f_touchpoint_execution_status_history_1.touchpoint_execution_id) AND (s_f_touchpoint_execution_status_history.creation_dt = (max(s_f_touchpoint_execution_status_history_1.creation_dt))))
                     -> Sort (cost=19697.87..20098.14 rows=160107 width=16) (actual time=132.938..155.325 rows=160898 loops=1)
                        Sort Key: s_f_touchpoint_execution_status_history.touchpoint_execution_id, s_f_touchpoint_execution_status_history.creation_dt
                        Sort Method: quicksort Memory: 13687kB
                        -> Seq Scan on s_f_touchpoint_execution_status_history (cost=0.00..5857.68 rows=160107 width=16) (actual time=0.012..60.421 rows=160898 loops=1)
                           Filter: (touchpoint_execution_status_type_id = ANY ('{3,4,6
Re: [PERFORM] Performance issues
Just as I feared, the attached explain analyze results show significant misestimates, like these for example:

    Nested Loop (cost=32782.19..34504.16 rows=1 width=16) (actual time=337.484..884.438 rows=46454 loops=1)
    Nested Loop (cost=18484.94..20366.29 rows=1 width=776) (actual time=2445.487..3741.049 rows=45360 loops=1)
    Hash Left Join (cost=34679.90..37396.37 rows=11644 width=148) (actual time=609.472..9070.675 rows=4559289 loops=1)

There are plenty of nested loop joins - the optimizer believes there will be only a few rows in the outer relation, but gets orders of magnitude more tuples. And nested loops are terrible in that case.

In the case of the first view, it seems to be caused by this:

    Merge Cond: ((s_f_touchpoint_execution_status_history.touchpoint_execution_id = s_f_touchpoint_execution_status_history_1.touchpoint_execution_id) AND (s_f_touchpoint_execution_status_history.creation_dt = (max(s_f_touchpoint_execution_status_history_1.creation_dt))))

Especially the ':id = max(:id)' condition is probably giving the optimizer a hard time. This is a conceptually difficult problem (i.e. fixing this at the optimizer level is unlikely to happen any time soon, because it effectively means you have to predict the statistical properties of the aggregation).

You may try increasing the statistics target, which makes the stats a bit more detailed (the default on 9.4 is 100):

    SET default_statistics_target = 1000;
    ANALYZE;

But I don't really believe this might fix the problem.

But maybe it's possible to rewrite the query somehow? Let's experiment a bit - remove the aggregation, i.e. join directly to s_f_touchpoint_execution_status_history. It'll return wrong results, but the estimates should be better, so let's see what happens.

You may also try disabling nested loops - the other join algorithms usually perform better with large row counts:

    SET enable_nestloop = false;

This is not a production-suitable solution, but for experimenting that's OK.

ISTM what the aggregation (or the whole mergejoin) does is select the last s_f_touchpoint_execution_status_history record for each touchpoint_execution_id. There are better ways to determine that, IMHO. For example:

(1) Adding an 'is_last' flag to s_f_touchpoint_execution_status_history. This however requires maintaining that flag somehow, but the join would not be needed at all. The last IDs might instead be maintained in a separate table - the join would still be necessary, but it might be less intrusive and cheaper to maintain. (A sketch follows below.)

(2) Using window functions, e.g. like this:

    SELECT * FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
                                  ORDER BY max_creation_dt) AS rn
        FROM s_f_touchpoint_execution_status_history
    ) foo WHERE rn = 1

But estimating this is also rather difficult ...

(3) Using a temporary table / materialized view - this really depends on your requirements, load schedule, how you run the queries etc. It would however fix the estimation errors (probably).

The 2nd view seems to suffer because of the same issue (underestimates leading to the choice of nested loops), but caused by something else:

    Hash Join (cost=1954.13..6249.67 rows=13 width=108) (actual time=31.777..210.346 rows=72670 loops=1)
      Hash Cond: ((tp_exec.touchpoint_id = tp.touchpoint_id) AND (wave_exec.wave_id = tp.wave_id))

Estimating the cardinality of joins with multi-column conditions is difficult; no idea how to fix that at the moment.
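A sketch of how option (1) might look with a trigger-maintained flag; the trigger and function names are made up, and it assumes rows arrive in creation_dt order (out-of-order inserts would need an extra check against NEW.creation_dt):

    ALTER TABLE s_f_touchpoint_execution_status_history
        ADD COLUMN is_last boolean NOT NULL DEFAULT false;

    CREATE OR REPLACE FUNCTION mark_last_status() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        -- unset the flag on the row that was the latest until now
        UPDATE s_f_touchpoint_execution_status_history
           SET is_last = false
         WHERE touchpoint_execution_id = NEW.touchpoint_execution_id
           AND is_last;
        NEW.is_last := true;   -- the incoming row becomes the latest
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER trg_mark_last_status
        BEFORE INSERT ON s_f_touchpoint_execution_status_history
        FOR EACH ROW EXECUTE PROCEDURE mark_last_status();

The views can then filter on WHERE is_last instead of joining against the max(creation_dt) aggregate. For option (3), a materialized view over the same 'latest row' logic is one possibility (refreshed from cron or after each load):

    CREATE MATERIALIZED VIEW mv_last_tp_exec_status AS
        SELECT DISTINCT ON (touchpoint_execution_id) *
        FROM s_f_touchpoint_execution_status_history
        ORDER BY touchpoint_execution_id, creation_dt DESC;

    REFRESH MATERIALIZED VIEW mv_last_tp_exec_status;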
--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [PERFORM] Performance issues
On 17.3.2015 15:19, Thomas Kellerer wrote:
> Tomas Vondra schrieb am 17.03.2015 um 14:55:
>> (2) using window functions, e.g. like this:
>>     SELECT * FROM (
>>         SELECT *,
>>                ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
>>                                   ORDER BY max_creation_dt) AS rn
>>         FROM s_f_touchpoint_execution_status_history
>>     ) foo WHERE rn = 1
>> But estimating this is also rather difficult ...
> From my experience, rewriting something like the above using DISTINCT ON is usually faster.

How do you get the last record (with respect to a timestamp column) using a DISTINCT ON?

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [PERFORM] Performance issues
Tomas Vondra schrieb am 17.03.2015 um 14:55:
> (2) using window functions, e.g. like this:
>     SELECT * FROM (
>         SELECT *,
>                ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
>                                   ORDER BY max_creation_dt) AS rn
>         FROM s_f_touchpoint_execution_status_history
>     ) foo WHERE rn = 1
> But estimating this is also rather difficult ...

From my experience, rewriting something like the above using DISTINCT ON is usually faster. e.g.:

    select distinct on (touchpoint_execution_id) *
    from s_f_touchpoint_execution_status_history
    order by touchpoint_execution_id, max_creation_dt;
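As noted later in the thread, both forms as written return the oldest row per touchpoint_execution_id; to get the latest row, the timestamp must be sorted descending:

    select distinct on (touchpoint_execution_id) *
    from s_f_touchpoint_execution_status_history
    order by touchpoint_execution_id, max_creation_dt desc;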
Re: [PERFORM] Performance issues
On 17.3.2015 16:24, Thomas Kellerer wrote:
> Tomas Vondra schrieb am 17.03.2015 um 15:43:
>> How do you get the last record (with respect to a timestamp column) using a DISTINCT ON?
> You need to use ORDER BY ... DESC. See here: http://sqlfiddle.com/#!15/d4846/2

Nice, thanks!

> Btw: your row_number() usage wouldn't return the latest row either. It would return the oldest row.

Oh, right. I forgot the DESC in the window.

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [PERFORM] Performance issues
On 16.3.2015 20:43, Jim Nasby wrote:
> On 3/13/15 7:12 PM, Tomas Vondra wrote:
>> (4) I suspect many of the relations referenced in the views are not actually needed in the query, i.e. the join is performed but then it's just discarded because those columns are not used. Try to simplify the views as much as possible - remove all the tables that are not really necessary to run the query. If two queries need different tables, maybe defining two views is a better approach.
> A better alternative with multi-purpose views is to use an outer join instead of an inner join. With an outer join, if you ultimately don't refer to any of the columns in a particular table, Postgres will remove the table from the query completely.

Really? Because a quick test suggests otherwise:

    db=# create table test_a (id int);
    CREATE TABLE
    db=# create table test_b (id int);
    CREATE TABLE
    db=# explain select test_a.* from test_a left join test_b using (id);
                              QUERY PLAN
    ----------------------------------------------------------------
     Merge Left Join (cost=359.57..860.00 rows=32512 width=4)
       Merge Cond: (test_a.id = test_b.id)
       -> Sort (cost=179.78..186.16 rows=2550 width=4)
          Sort Key: test_a.id
          -> Seq Scan on test_a (cost=0.00..35.50 rows=2550 width=4)
       -> Sort (cost=179.78..186.16 rows=2550 width=4)
          Sort Key: test_b.id
          -> Seq Scan on test_b (cost=0.00..35.50 rows=2550 width=4)
    (8 rows)

Also, how would that work with duplicate rows in the referenced table?

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [PERFORM] Performance issues
On 3/16/15 3:59 PM, Tomas Vondra wrote:
> Really? Because a quick test suggests otherwise: [test with two plain int tables quoted above]
> Also, how would that work with duplicate rows in the referenced table?

Right, I neglected to mention that the omitted table must also be unique on the join key:

    decibel@decina.attlocal=# create table a(a_id serial primary key);
    CREATE TABLE
    decibel@decina.attlocal=# create table b(a_id int);
    CREATE TABLE
    decibel@decina.attlocal=# explain analyze select a.* from a left join b using(a_id);
                              QUERY PLAN
    ---------------------------------------------------------------------------
     Hash Right Join (cost=67.38..137.94 rows=2550 width=4) (actual time=0.035..0.035 rows=0 loops=1)
       Hash Cond: (b.a_id = a.a_id)
       -> Seq Scan on b (cost=0.00..35.50 rows=2550 width=4) (never executed)
       -> Hash (cost=35.50..35.50 rows=2550 width=4) (actual time=0.002..0.002 rows=0 loops=1)
          Buckets: 4096 Batches: 1 Memory Usage: 32kB
          -> Seq Scan on a (cost=0.00..35.50 rows=2550 width=4) (actual time=0.001..0.001 rows=0 loops=1)
     Planning time: 0.380 ms
     Execution time: 0.086 ms
    (8 rows)

    decibel@decina.attlocal=# alter table b add primary key(a_id);
    ALTER TABLE
    decibel@decina.attlocal=# explain analyze select a.* from a left join b using(a_id);
                              QUERY PLAN
    ---------------------------------------------------------------------------
     Seq Scan on a (cost=0.00..35.50 rows=2550 width=4) (actual time=0.001..0.001 rows=0 loops=1)
     Planning time: 0.247 ms
     Execution time: 0.029 ms
    (3 rows)

    decibel@decina.attlocal=# alter table a drop constraint a_pkey;
    ALTER TABLE
    decibel@decina.attlocal=# explain analyze select a.* from a left join b using(a_id);
                              QUERY PLAN
    ---------------------------------------------------------------------------
     Seq Scan on a (cost=0.00..35.50 rows=2550 width=4) (actual time=0.001..0.001 rows=0 loops=1)
     Planning time: 0.098 ms
     Execution time: 0.011 ms
    (3 rows)

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [PERFORM] Performance issues
on s_d_campaign_execution camp_exec (cost=0.29..0.37 rows=1 width=67) (actual time=0.013..0.013 rows=1 loops=72670)
      Index Cond: (campaign_execution_id = wave_exec.campaign_execution_id)
->  Index Scan using s_d_wave_pkey on s_d_wave wave (cost=0.29..0.31 rows=1 width=22) (actual time=0.003..0.003 rows=1 loops=72670)
      Index Cond: (wave_id = wave_exec.wave_id)
->  Index Scan using s_d_campaign_pkey on s_d_campaign camp (cost=0.29..0.32 rows=1 width=40) (actual time=0.003..0.003 rows=1 loops=72427)
      Index Cond: (campaign_id = camp_exec.campaign_id)
->  Index Scan using s_d_content_pkey on s_d_content content (cost=0.28..0.30 rows=1 width=33) (actual time=0.002..0.003 rows=1 loops=72427)
      Index Cond: (tp_exec.content_id = content_id)
->  Index Scan using s_d_message_type_pkey on s_d_message_type message_type (cost=0.13..0.15 rows=1 width=120) (actual time=0.001..0.002 rows=1 loops=72427)
      Index Cond: (tp_exec.message_type_id = message_type_id)
->  Index Scan using s_d_group_pkey on s_d_group grup (cost=0.13..0.15 rows=1 width=320) (actual time=0.001..0.002 rows=1 loops=72427)
      Index Cond: (camp_exec.group_id = group_id)
->  Index Scan using d_channel_pk on s_d_channel_type channel (cost=0.13..0.15 rows=1 width=120) (actual time=0.001..0.002 rows=1 loops=72427)
      Index Cond: (channel_type_id = tp.channel_type_id)
->  Index Scan using s_d_category_pkey on s_d_category CATEGORY (cost=0.13..0.15 rows=1 width=120) (actual time=0.001..0.002 rows=1 loops=67508)
      Index Cond: (camp.category_id = category_id)
->  CTE Scan on valid_executions (cost=0.00..0.02 rows=1 width=8) (actual time=0.004..6.803 rows=52997 loops=67508)
Total runtime: 966566.574 ms

Can you please take a look and let me know where the issue is?

-----Original Message-----
From: Gavin Flower [mailto:gavinflo...@archidevsys.co.nz]
Sent: Sunday, March 15, 2015 3:02 AM
To: Varadharajan Mukundan
Cc: Tomas Vondra; vjo...@zetainteractive.com; Scott Marlowe; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance issues

On 15/03/15 10:23, Varadharajan Mukundan wrote: Hi Gavin, Vivekanand in his first mail itself mentioned the below configuration of postgresql.conf. It looks good enough to me. Total Memory : 8 GB shared_buffers = 2GB work_mem = 64MB maintenance_work_mem = 700MB effective_cache_size = 4GB

Sorry, it didn't register when I read it! (Probably reading too fast)

On Sat, Mar 14, 2015 at 10:06 PM, Gavin Flower gavinflo...@archidevsys.co.nz wrote: On 14/03/15 13:12, Tomas Vondra wrote: On 14.3.2015 00:28, Vivekanand Joshi wrote: Hi Guys, So here is the full information attached as well as in the link provided below: http://pgsql.privatepaste.com/41207bea45 I can provide new information as well.

Thanks. We still don't have EXPLAIN ANALYZE - how long was the query running (I assume it got killed at some point)? It's really difficult to give you any advice because we don't know where the problem is. If EXPLAIN ANALYZE really takes too long (say, it does not complete after an hour / overnight), you'll have to break the query into parts and first tweak those independently. For example, in the first message you mentioned that a select from the S_V_D_CAMPAIGN_HIERARCHY view takes ~9 minutes, so start with that. Give us EXPLAIN ANALYZE for that query.

Few more comments:

(1) You're using CTEs - be aware that CTEs are not just aliases, but impact planning / optimization, and in some cases may prevent proper optimization. Try replacing them with plain views.

(2) Varadharajan Mukundan already recommended creating an index on s_f_promotion_history.send_dt. Have you tried that? You may also try creating an index on all the columns needed by the query, so that an Index Only Scan is possible.

(3) There are probably additional indexes that might be useful here. What I'd try is adding indexes on all columns that are either a foreign key or used in a WHERE condition. This might be overkill in some cases, but let's see.

(4) I suspect many of the relations referenced in the views are not actually needed in the query, i.e. the join is performed but then it's just discarded because those columns are not used. Try to simplify the views as much as possible - remove all the tables that are not really necessary to run the query. If two queries need different tables, maybe defining two views is a better approach.

(5) The vmstat / iostat data are pretty useless - what you provided are averages since the machine was started, but we need a few samples collected when the query is running. I.e. start the query, and then give us a few samples from these commands: iostat -x -k 1 and vmstat 1
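To make point (2) concrete, a minimal sketch; the table and columns appear in the plans in this thread, the index names are invented, and whether an Index Only Scan actually kicks in also depends on visibility-map coverage:

CREATE INDEX idx_promo_hist_send_dt
    ON s_f_promotion_history (send_dt);

-- covering variant for the report's COUNT(DISTINCT target_id) over a
-- send_dt range, so an Index Only Scan becomes possible:
CREATE INDEX idx_promo_hist_send_dt_target
    ON s_f_promotion_history (send_dt, target_id);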
Re: [PERFORM] Performance issues
Hey guys, thanks a lot. This is really helping. I am learning a lot. BTW, I changed the CTE into a subquery and it improved the performance by miles. I am getting the result in less than 3 seconds, though I am using a 24 GB RAM server. It is still a great turnaround time compared to the previous execution time. Now I will look into the bigger query. I read the EXPLAIN ANALYZE output and that helped a lot. I will be coming up with more questions tomorrow, as the bigger query still has some problems.

On 16 Mar 2015 23:55, Tomas Vondra tomas.von...@2ndquadrant.com wrote:

On 16.3.2015 18:49, Marc Mamin wrote: Hi Team, This is the EXPLAIN ANALYZE for one of the views: S_V_D_CAMPAIGN_HIERARCHY: FWIW, this is a somewhat more readable version of the plan: http://explain.depesz.com/s/nbB

In the future, please do two things:

(1) Attach the plan as a text file, because mail clients tend to screw things up (wrapping long lines). Unless the plan is trivial, of course - but pgsql-performance usually deals with complex stuff.

(2) Putting the plan on explain.depesz.com helps too, because it's considerably more readable (but always do 1, because resources placed somewhere else tend to disappear, and the posts then make very little sense, which is bad when searching the archives).

(3) Same for stuff pasted somewhere else - always attach it to the message. For example, I'd like to give you more accurate advice, but I can't as http://pgsql.privatepaste.com/41207bea45 is unavailable.

Rows Removed by Join Filter: 3577676116 That's quite a lot. You're possibly missing a clause in a join, resulting in a cross join. It is also helpful to put your result here: http://explain.depesz.com/ regards,

IMHO this is merely a consequence of using the CTE, which produces 52997 rows and is scanned 67508x as the inner relation of a nested loop. That gives you 3577721476 tuples in total, and only 45360 are kept (hence 3577676116 are removed). This is a prime example of why CTEs are not just aliases for subqueries, but may actually cause serious trouble. There are other issues (e.g. the row count estimate of the CTE is seriously off, most likely because of the HashAggregate in the outer branch), but that's a secondary issue IMHO.

Vivekanand, try this (in the order of intrusiveness):

(1) Get rid of the CTE, and just replace it with a subselect in the FROM part of the query, so instead of this:

WITH valid_executions AS (...)
SELECT ... FROM ... JOIN valid_executions ON (...)

you'll have something like this:

SELECT ... FROM ... JOIN (...) AS valid_executions ON (...)

This way the subselect will be optimized properly.

(2) Replace the CTE with a materialized view, or a temporary table. This has both advantages and disadvantages - the main advantage is that you can create indexes and collect statistics. The disadvantage is you have to refresh the MV, fill the temporary table, etc.

I expect (1) to improve the performance significantly, and (2) might improve it even further by fixing the misestimates.

regards

-- Tomas Vondra  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
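A minimal sketch of option (2), assuming the real valid_executions logic is boiled down to the status-history filter visible in the plans earlier in this thread (the actual CTE is more involved, so treat the view body as a stand-in):

CREATE MATERIALIZED VIEW mv_valid_executions AS
SELECT touchpoint_execution_id
FROM s_f_touchpoint_execution_status_history
WHERE touchpoint_execution_status_type_id IN (3, 4);  -- stand-in for the real CTE body

CREATE INDEX ON mv_valid_executions (touchpoint_execution_id);
ANALYZE mv_valid_executions;

-- after each data load:
REFRESH MATERIALIZED VIEW mv_valid_executions;

Unlike the inlined subselect, the materialized view gives the planner real row counts and an index, which addresses the misestimate mentioned above at the cost of the refresh step.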
Re: [PERFORM] Performance issues
On 3/13/15 7:12 PM, Tomas Vondra wrote: (4) I suspect many of the relations referenced in the views are not actually needed in the query, i.e. the join is performed but then it's just discarded because those columns are not used. Try to simplify the views as much as possible - remove all the tables that are not really necessary to run the query. If two queries need different tables, maybe defining two views is a better approach.

A better alternative with multi-purpose views is to use an outer join instead of an inner join. With an outer join, if you ultimately don't refer to any of the columns in a particular table, Postgres will remove the table from the query completely.

-- Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
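A small sketch of that idea, with table names borrowed from the plans in this thread (the view name is invented). Join removal only happens when the outer side is provably unique, e.g. joined on its primary key, and none of its columns are referenced:

-- hypothetical multi-purpose view using LEFT JOIN
CREATE VIEW v_campaign_wide AS
SELECT camp.campaign_id, camp.campaign_name, cat.category_name
FROM s_d_campaign camp
LEFT JOIN s_d_category cat ON cat.category_id = camp.category_id;

-- no category columns referenced, so the planner can drop the join
-- to s_d_category from the plan entirely:
SELECT campaign_id, campaign_name FROM v_campaign_wide;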
Re: [PERFORM] Performance issues
On 14/03/15 13:12, Tomas Vondra wrote: On 14.3.2015 00:28, Vivekanand Joshi wrote: Hi Guys, So here is the full information attached as well as in the link provided below: http://pgsql.privatepaste.com/41207bea45 I can provide new information as well.

Thanks. We still don't have EXPLAIN ANALYZE - how long was the query running (I assume it got killed at some point)? It's really difficult to give you any advice because we don't know where the problem is. If EXPLAIN ANALYZE really takes too long (say, it does not complete after an hour / overnight), you'll have to break the query into parts and first tweak those independently. For example, in the first message you mentioned that a select from the S_V_D_CAMPAIGN_HIERARCHY view takes ~9 minutes, so start with that. Give us EXPLAIN ANALYZE for that query.

Few more comments:

(1) You're using CTEs - be aware that CTEs are not just aliases, but impact planning / optimization, and in some cases may prevent proper optimization. Try replacing them with plain views.

(2) Varadharajan Mukundan already recommended creating an index on s_f_promotion_history.send_dt. Have you tried that? You may also try creating an index on all the columns needed by the query, so that an Index Only Scan is possible.

(3) There are probably additional indexes that might be useful here. What I'd try is adding indexes on all columns that are either a foreign key or used in a WHERE condition. This might be overkill in some cases, but let's see.

(4) I suspect many of the relations referenced in the views are not actually needed in the query, i.e. the join is performed but then it's just discarded because those columns are not used. Try to simplify the views as much as possible - remove all the tables that are not really necessary to run the query. If two queries need different tables, maybe defining two views is a better approach.

(5) The vmstat / iostat data are pretty useless - what you provided are averages since the machine was started, but we need a few samples collected when the query is running. I.e. start the query, and then give us a few samples from these commands:

iostat -x -k 1
vmstat 1

Would like to see if queries of this type can actually run in a postgres server? Why not? We're running DWH applications on tens/hundreds of GBs.

If yes, what would be the minimum requirements for hardware? We would like to migrate our whole solution to PostgreSQL, as we can spend on hardware as much as we can, but working on a proprietary appliance is becoming very difficult for us. That's difficult to say, because we really don't know where the problem is and how much the queries can be optimized.

I notice that no one appears to have suggested the default settings in postgresql.conf - these need changing, as they are initially set up for small machines; to let PostgreSQL take anywhere near full advantage of a box with large amounts of RAM, you need to change some of the configuration settings! For example 'temp_buffers' (default 8MB) and 'maintenance_work_mem' (default 16MB) should be drastically increased, and there are other settings that need changing. The precise values depend on many factors, but the initial values set by default are definitely far too small for your usage. Am assuming that you are looking at PostgreSQL 9.4.

Cheers, Gavin
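For reference, a hedged sketch of one way to apply such settings on 9.4 without hand-editing postgresql.conf; the maintenance_work_mem and effective_cache_size values are the ones already quoted in this thread, while the temp_buffers value is an arbitrary illustration, not a recommendation:

ALTER SYSTEM SET maintenance_work_mem = '700MB';
ALTER SYSTEM SET effective_cache_size = '4GB';
SELECT pg_reload_conf();  -- writes postgresql.auto.conf, then reloads

-- or per session, for a single reporting connection:
SET temp_buffers = '256MB';  -- arbitrary example value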
Re: [PERFORM] Performance issues
We might even consider taking expert advice on how to tune queries and the server, but if postgres is going to behave like this, I am not sure we would be able to continue with it. Having said that, I would say again that I am completely new to this territory, so I might miss lots and lots of things.

My two cents: Postgres out of the box might not be a good choice for data warehouse style queries; that is because it is optimized to run thousands of small queries (OLTP style processing) and not one big monolithic query. I've faced similar problems myself before, and here are a few tricks I followed to get my elephant to do real-time ad hoc analysis on a table with ~45 columns and a few billion rows in it:

1. Partition your table! Use constraint exclusion to the fullest extent (see the sketch after this message).
2. Fire multiple small queries distributed over partitions and aggregate them at the application layer. This is needed because you might want to exploit all your cores to the fullest extent (assuming that you've enough memory for an effective FS cache). If your dataset goes beyond the capability of a single system, try something like Stado (GridSQL).
3. Storing indexes on a RAM disk / faster disk (using tablespaces) and using them properly makes the system blazing fast. CAUTION: This requires some other infrastructure setup for backup and recovery.
4. If you're accessing a small set of columns in a big table and if you feel compressing the data helps a lot, give this FDW a try - https://github.com/citusdata/cstore_fdw
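A minimal sketch of point 1 on the 9.4-era releases discussed in this thread, which predate declarative partitioning: children are plain tables with CHECK constraints. The table and column names come from this thread; the child names are invented, and insert routing via a trigger is omitted:

-- parent stays empty; children hold the rows
CREATE TABLE s_f_promotion_history_2014 (
    CHECK (send_dt >= DATE '2014-01-01' AND send_dt < DATE '2015-01-01')
) INHERITS (s_f_promotion_history);

CREATE TABLE s_f_promotion_history_2015 (
    CHECK (send_dt >= DATE '2015-01-01' AND send_dt < DATE '2016-01-01')
) INHERITS (s_f_promotion_history);

-- with constraint_exclusion = partition (the default), this touches
-- only the 2015 child:
SELECT count(*) FROM s_f_promotion_history
WHERE send_dt >= '2015-03-01' AND send_dt < '2015-04-01';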
Re: [PERFORM] Performance issues
I am really worried about the performance of PostgreSQL, as we have almost 1.5 billion records in the promotion history table. Do you guys really think PostgreSQL can handle this much load? We have fact tables which are more than 15 GB in size, and we have to join with those tables in almost every query.

On 13 Mar 2015 18:40, Varadharajan Mukundan srinath...@gmail.com wrote: If the s_f_promotion_history table will have explosive growth, then it's worth considering partitioning by date and using constraint exclusion to speed up the queries. Else, it makes sense to get started with multiple partial indexes (like having an index for each week or something like that. You may want to start with a coarse-grained timeline for the index and then fine-grain it based on the needs).
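A hedged sketch of those per-week partial indexes; the table and its send_dt / target_id columns appear in this thread, while the index name and the particular week are invented:

CREATE INDEX idx_promo_hist_send_dt_2015w11
    ON s_f_promotion_history (send_dt)
    WHERE send_dt >= DATE '2015-03-09' AND send_dt < DATE '2015-03-16';

-- a query whose WHERE clause provably falls inside that week can use it:
SELECT count(DISTINCT target_id)
FROM s_f_promotion_history
WHERE send_dt >= '2015-03-09' AND send_dt < '2015-03-16';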
Re: [PERFORM] Performance issues
Since I was doing it only for testing purposes and on a development server which has only 8 GB of RAM, I used only 10m rows. But the original table has 1.5 billion rows. We will obviously be using a server with very high capacity, but I am not satisfied with the performance at all. This might be only a start, so I might get better performance later.

Yes, the view is complex and is created using almost 10 tables. The same goes for the other views as well, but this is what we are using in Netezza too, and we are getting results of the full report in less than 5 seconds. Add to that, this is only a very little part of the whole query used in the report. I will post the result of the whole query with EXPLAIN ANALYZE tomorrow.

We might even consider taking expert advice on how to tune queries and the server, but if postgres is going to behave like this, I am not sure we would be able to continue with it. Having said that, I would say again that I am completely new to this territory, so I might miss lots and lots of things.

On 14 Mar 2015 02:07, Tomas Vondra tomas.von...@2ndquadrant.com wrote: Hi, On 13.3.2015 20:59, Vivekanand Joshi wrote: I am really worried about the performance of PostgreSQL, as we have almost 1.5 billion records in the promotion history table. Do you guys really think PostgreSQL can handle this much load? We have fact tables which are more than 15 GB in size and we have to make joins with those tables in almost every query.

In the previous message you claimed the table has 10M rows ... That depends on what performance you're looking for. You'll have to provide considerably more information before we can help you. You might want to check this: https://wiki.postgresql.org/wiki/Slow_Query_Questions

You have not provided the full query, just a query apparently referencing views, so the actual query is way more complicated. Also, plain EXPLAIN is not really sufficient; we need EXPLAIN ANALYZE.

regards

-- Tomas Vondra  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [PERFORM] Performance issues
On 13.3.2015 21:46, Vivekanand Joshi wrote: Since I was doing it only for testing purposes and on a development server which has only 8 GB of RAM, I used only 10m rows. But the original table has 1.5 billion rows. We will obviously be using a server with very high capacity, but I am not satisfied with the performance at all. This might be only a start, so I might get better performance later.

OK, understood.

Yes, the view is complex and is created using almost 10 tables. The same goes for the other views as well, but this is what we are using in Netezza too, and we are getting results of the full report in less than 5 seconds. Add to that, this is only a very little part of the whole query used in the report.

Well, in the very first message you asked "Is the query written correctly as per PostgreSQL?" - how can we decide that when most of the query is hidden in some unknown view?

I will post the result of the whole query with EXPLAIN ANALYZE tomorrow.

Please also collect some information about the system using iostat, vmstat and such, so that we know what the bottleneck is.

We might even consider taking expert advice on how to tune queries and the server, but if postgres is going to behave like this, I am not sure we would be able to continue with it.

That's probably a good idea.

-- Tomas Vondra  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [PERFORM] Performance issues
Hi Guys, So here is the full information attached as well as in the link provided below: http://pgsql.privatepaste.com/41207bea45 I can provide new information as well.

Would like to see if queries of this type can actually run in a postgres server? If yes, what would be the minimum requirements for hardware? We would like to migrate our whole solution to PostgreSQL, as we can spend on hardware as much as we can, but working on a proprietary appliance is becoming very difficult for us.

Vivek

-----Original Message-----
From: Scott Marlowe [mailto:scott.marl...@gmail.com]
Sent: Saturday, March 14, 2015 3:56 AM
To: Varadharajan Mukundan
Cc: vjo...@zetainteractive.com; Tomas Vondra; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance issues

On Fri, Mar 13, 2015 at 4:03 PM, Varadharajan Mukundan srinath...@gmail.com wrote: We might even consider taking expert advice on how to tune queries and the server, but if postgres is going to behave like this, I am not sure we would be able to continue with it. Having said that, I would say again that I am completely new to this territory, so I might miss lots and lots of things.

My two cents: Postgres out of the box might not be a good choice for data warehouse style queries; that is because it is optimized to run thousands of small queries (OLTP style processing) and not one big monolithic query. I've faced similar problems myself before, and here are a few tricks I followed to get my elephant to do real-time ad hoc analysis on a table with ~45 columns and a few billion rows in it:

1. Partition your table! Use constraint exclusion to the fullest extent.
2. Fire multiple small queries distributed over partitions and aggregate them at the application layer. This is needed because you might want to exploit all your cores to the fullest extent (assuming that you've enough memory for an effective FS cache). If your dataset goes beyond the capability of a single system, try something like Stado (GridSQL).
3. Storing indexes on a RAM disk / faster disk (using tablespaces) and using them properly makes the system blazing fast. CAUTION: This requires some other infrastructure setup for backup and recovery.
4. If you're accessing a small set of columns in a big table and if you feel compressing the data helps a lot, give this FDW a try - https://github.com/citusdata/cstore_fdw

Agreed here. IF you're gonna run reporting queries against postgresql you have to optimize for fast seq scan stuff, i.e. an IO subsystem that can read a big table in hundreds of megabytes per second - gigabytes if you can get it. A lot of spinning drives on a fast RAID card or good software RAID can do this on the cheap, since a lot of times you don't need big drives if you have a lot of them. 24 cheap 1TB drives that each can read at ~100 MB/s can gang up on the data and you can read 100GB in a few seconds. But you can't deny physics. If you need to read a 2TB table it's going to take time.

If you're only running 1 or 2 queries at a time, you can crank up work_mem to something crazy like 1GB even on an 8GB machine. Stopping sorts from spilling to disk, or at least giving queries a big playground to work in, can make a huge difference. If you're gonna give big work_mem then definitely limit connections to a handful. If you need a lot of persistent connections then use a pooler.

The single biggest mistake people make in setting up reporting servers on postgresql is assuming that the same hardware that worked well for transactional stuff (a handful of SSDs and lots of memory) will also work well when you're working with TB data sets. The hardware you need isn't the same, and using that for a reporting server is gonna result in sub-optimal performance.

-- To understand recursion, one must first understand recursion.

Query used in report:

SELECT Z.SENT_IND AS IS_SENT,
       COALESCE(X.EMAILS, 0) AS EMAILS,
       COALESCE(X.PERCENT, 0) PERCENT
FROM (
    SELECT CASE WHEN SENT_IND = 1 THEN 1
                WHEN SENT_IND = 5 THEN 2
                WHEN SENT_IND = 10 THEN 3
                WHEN SENT_IND = 15 THEN 4
                WHEN SENT_IND = 20 THEN 5
                WHEN SENT_IND = 30 THEN 6
                WHEN SENT_IND = 50 THEN 7
                WHEN SENT_IND = 75 THEN 8
                WHEN SENT_IND = 100 THEN 9
                ELSE 10
           END AS SEND_RANK,
           COUNT(DISTINCT TARGET_ID) AS EMAILS,
           (COUNT(DISTINCT TARGET_ID)+0.0) / (
               SELECT COUNT(DISTINCT TARGET_ID)
               FROM S_V_F_PROMOTION_HISTORY_EMAIL PH
               INNER JOIN S_V_D_CAMPAIGN_HIERARCHY CH
Re: [PERFORM] Performance issues
Hi Vivekanand, From the query plan, we can see that a good amount of time is spent in this node:

Seq Scan on public.s_f_promotion_history base (cost=0.00..283334.00 rows=1296 width=74)
  Output: base.promo_hist_id, base.target_id, base.audience_member_id, base.touchpoint_execution_id, base.contact_group_id, base.content_version_execution_id, base.sent_ind, base.send_dt, base.creation_dt, base.modified_dt
  Filter: ((base.send_dt >= '2014-03-13 00:00:00'::timestamp without time zone) AND (base.send_dt <= '2015-03-14 00:00:00'::timestamp without time zone))

Can you try creating a (partial) index based on the filter fields? (Good tutorial @ https://devcenter.heroku.com/articles/postgresql-indexes). Did you try doing a VACUUM ANALYZE? Another approach worth trying out is partitioning the public.s_f_promotion_history table by date (BTW, what is the size and number of rows in this table?).

On Fri, Mar 13, 2015 at 12:44 PM, Vivekanand Joshi vjo...@zetainteractive.com wrote: Hi Team, I am a novice in this territory. We are trying to migrate a few jasper reports from Netezza to PostgreSQL. I have one report ready with me, but queries are taking too much time. To be honest, it is not giving any result most of the time. The same query in Netezza runs in less than 2-3 seconds.

This is the query:

SELECT COUNT(DISTINCT TARGET_ID)
FROM S_V_F_PROMOTION_HISTORY_EMAIL PH
INNER JOIN S_V_D_CAMPAIGN_HIERARCHY CH
    ON PH.TOUCHPOINT_EXECUTION_ID = CH.TOUCHPOINT_EXECUTION_ID
WHERE 1=1
  AND SEND_DT >= '2014-03-13' AND SEND_DT <= '2015-03-14'

Statistics:

Select Count(1) from S_V_F_PROMOTION_HISTORY_EMAIL
4559289
Time: 16781.409 ms

Select count(1) from S_V_D_CAMPAIGN_HIERARCHY;
 count
-------
 45360
(1 row)
Time: 467869.185 ms

EXPLAIN PLAN FOR QUERY:

Aggregate (cost=356422.36..356422.37 rows=1 width=8)
  Output: count(DISTINCT base.target_id)
  ->  Nested Loop (cost=68762.23..356422.36 rows=1 width=8)
        Output: base.target_id
        Join Filter: (base.touchpoint_execution_id = tp_exec.touchpoint_execution_id)
        ->  Nested Loop (cost=33927.73..38232.16 rows=1 width=894)
              Output: camp.campaign_id, camp.campaign_name, camp.initiative, camp.objective, camp.category_id, CATEGORY.category_name, camp_exec.campaign_execution_id, camp_exec.campaign_execution_name, camp_exec.group_id, grup.group_name, camp_exec.star (...)
              Join Filter: (tp_exec.touchpoint_execution_id = valid_executions.touchpoint_execution_id)
              CTE valid_executions
                ->  Merge Join (cost=30420.45..31971.94 rows=1 width=8)
                      Output: s_f_touchpoint_execution_status_history_2.touchpoint_execution_id
                      Merge Cond: ((s_f_touchpoint_execution_status_history_2.touchpoint_execution_id = s_f_touchpoint_execution_status_history_1_1.touchpoint_execution_id) AND (s_f_touchpoint_execution_status_history_2.creation_dt = (max(s_f_touchpoint_ex (...)
                      ->  Sort (cost=17196.30..17539.17 rows=137149 width=16)
                            Output: s_f_touchpoint_execution_status_history_2.touchpoint_execution_id, s_f_touchpoint_execution_status_history_2.creation_dt
                            Sort Key: s_f_touchpoint_execution_status_history_2.touchpoint_execution_id, s_f_touchpoint_execution_status_history_2.creation_dt
                            ->  Seq Scan on public.s_f_touchpoint_execution_status_history s_f_touchpoint_execution_status_history_2 (cost=0.00..5493.80 rows=137149 width=16)
                                  Output: s_f_touchpoint_execution_status_history_2.touchpoint_execution_id, s_f_touchpoint_execution_status_history_2.creation_dt
                                  Filter: (s_f_touchpoint_execution_status_history_2.touchpoint_execution_status_type_id = ANY ('{3,4}'::integer[]))
                      ->  Sort (cost=13224.15..13398.43 rows=69715 width=16)
                            Output: s_f_touchpoint_execution_status_history_1_1.touchpoint_execution_id, (max(s_f_touchpoint_execution_status_history_1_1.creation_dt))
                            Sort Key: s_f_touchpoint_execution_status_history_1_1.touchpoint_execution_id, (max(s_f_touchpoint_execution_status_history_1_1.creation_dt))
                            ->  HashAggregate (cost=6221.56..6918.71 rows=69715 width=16)
                                  Output: s_f_touchpoint_execution_status_history_1_1.touchpoint_execution_id,
Re: [PERFORM] Performance issues
10 million records in s_f_promotion_history table.

*From:* Varadharajan Mukundan [mailto:srinath...@gmail.com]
*Sent:* Friday, March 13, 2015 6:29 PM
*To:* vjo...@zetainteractive.com
*Cc:* pgsql-performance@postgresql.org
*Subject:* Re: [PERFORM] Performance issues
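For the size question above, a quick way to measure both on the server; these are standard catalog queries, nothing specific to this thread:

-- on-disk size including indexes and TOAST:
SELECT pg_size_pretty(pg_total_relation_size('s_f_promotion_history'));

-- planner's current row estimate (cheap, no full count):
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 's_f_promotion_history';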
Re: [PERFORM] Performance issues
Hi. Thanks for the comments. My data is right, and the result is exactly what I want, but as you say, I think what causes the query to be slow is the ST_Intersection, which creates the intersection between the vector grid (fishnet) and the country polygons. I will check with the postgis user list if they have any idea on how to speed up this query. Best, Andreas

2011/3/8 Tom Lane t...@sss.pgh.pa.us

Andreas Forø Tollefsen andrea...@gmail.com writes: This is a query I am working on now. It creates an intersection of two geometries. One is a grid of 0.5 x 0.5 decimal degree sized cells, while the other is the country geometries of all countries in the world for a certain year.

Hm, are you sure your data is right? Because the actual rowcounts imply that each country intersects about half of the grid cells, which doesn't seem right.

priogrid=# EXPLAIN ANALYZE SELECT priogrid_land.gid, gwcode, ST_Intersection(priogrid_land.cell, cshapeswdate.geom) FROM priogrid_land, cshapeswdate WHERE ST_Intersects(priogrid_land.cell, cshapeswdate.geom);

QUERY PLAN
------------------------------------------------------------------
Nested Loop (cost=0.00..12644.85 rows=43351 width=87704) (actual time=1.815..7074973.711 rows=130331 loops=1)
  Join Filter: _st_intersects(priogrid_land.cell, cshapeswdate.geom)
  ->  Seq Scan on cshapeswdate (cost=0.00..14.42 rows=242 width=87248) (actual time=0.007..0.570 rows=242 loops=1)
  ->  Index Scan using idx_priogrid_land_cell on priogrid_land (cost=0.00..7.15 rows=1 width=456) (actual time=0.069..5.604 rows=978 loops=242)
        Index Cond: (priogrid_land.cell && cshapeswdate.geom)
Total runtime: 7075188.549 ms
(6 rows)

AFAICT, all of the runtime is going into calculating the ST_Intersects and/or ST_Intersection functions. The two scans are only accounting for perhaps 5.5 seconds, and the join infrastructure isn't going to be terribly expensive, so it's got to be those functions. Not knowing much about PostGIS, I don't know if the functions themselves can be expected to be really slow. If it's not them, it could be the cost of fetching their arguments --- in particular, I bet the country outlines are very large objects and are toasted out-of-line. There's been some past discussion of automatically avoiding repeated detoastings in scenarios like the above, but nothing's gotten to the point of acceptance yet. Possibly you could do something to force detoasting in a subquery.

regards, tom lane
Re: [PERFORM] Performance issues
I have seen really complex geometries cause problems. If you have thousands of points when 10 would do, try ST_Simplify and see if it doesn't speed things up.

-Andy
Re: [PERFORM] Performance issues
Andy. Thanks. That is a great tip. I tried it but I get the error: NOTICE: ptarray_simplify returned a 2 pts array.

Query:

SELECT ST_Intersection(priogrid_land.cell, ST_Simplify(cshapeswdate.geom,0.1)) AS geom,
       priogrid_land.gid AS divider, gwcode, gwsyear, gweyear, startdate, enddate,
       capname, caplong, caplat, col, row, xcoord, ycoord
FROM priogrid_land, cshapeswdate
WHERE ST_Intersects(priogrid_land.cell, ST_Simplify(cshapeswdate.geom,0.1))
  AND cshapeswdate.gwsyear <= 1946 AND cshapeswdate.gweyear >= 1946
  AND cshapeswdate.startdate <= '1946/1/1';

2011/3/8 Andy Colson a...@squeakycode.net: I have seen really complex geometries cause problems. If you have thousands of points when 10 would do, try ST_Simplify and see if it doesn't speed things up. -Andy
Re: [PERFORM] Performance issues
Forgot to mention that the query terminates the connection because of a crash of the server process.
2011/3/8 Andreas Forø Tollefsen andrea...@gmail.com Andy. Thanks. That is a great tip. I tried it but I get the error: NOTICE: ptarray_simplify returned a 2 pts array.
Re: [PERFORM] Performance issues
On 3/8/2011 10:58 AM, Andreas Forø Tollefsen wrote: Andy. Thanks. That is a great tip. I tried it but I get the error: NOTICE: ptarray_simplify returned a 2 pts array. ew... that's not good. Seems like it simplified it down to a single point?
(not 100% sure that's what the error means, just a guess) Try getting some info about it:
select ST_NPoints(geom) AS before, ST_NPoints(ST_Simplify(geom, 0.1)) AS after from cshapeswdate;
Also try things like ST_IsSimple and ST_IsValid. I seem to recall sometimes needing ST_Points or ST_NumPoints instead of ST_NPoints. -Andy
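If ST_Simplify is collapsing small rings down to two points (which the NOTICE above suggests), ST_SimplifyPreserveTopology may be the safer tool - a hedged suggestion, not something proposed in the thread - since it avoids producing degenerate or invalid geometries:

SELECT ST_Intersection(p.cell, ST_SimplifyPreserveTopology(c.geom, 0.1)) AS geom
FROM priogrid_land p, cshapeswdate c
WHERE ST_Intersects(p.cell, ST_SimplifyPreserveTopology(c.geom, 0.1));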
Re: [PERFORM] Performance issues
Thanks, Ken. It seems like the tip to turn off synchronous_commit did the trick:
/usr/lib/postgresql/8.4/bin/pgbench -T 60 test1
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
duration: 60 s
number of transactions actually processed: 86048
tps = 1434.123199 (including connections establishing)
tps = 1434.183362 (excluding connections establishing)
Is this acceptable compared to others when considering my setup? Cheers, Andreas
2011/3/7 Kenneth Marshall k...@rice.edu On Mon, Mar 07, 2011 at 02:45:03PM +0100, Andreas Forø Tollefsen wrote: Hi, I am running PostgreSQL 8.4.7 with PostGIS 2.0 (for raster support). The server has mainly 1 user for spatial data processing. This involves queries that can take hours. This is running on an Ubuntu 10.10 server with a Core2Duo 6600 @ 2.4 GHz and 6 GB RAM. My postgresql.conf:
# - Memory -
shared_buffers = 1024MB
temp_buffers = 256MB
#max_prepared_transactions = 0
work_mem = 1024MB
maintenance_work_mem = 256MB
max_stack_depth = 7MB
wal_buffers = 8MB
effective_cache_size = 3072MB
Everything else is default. My pgbench results:
/usr/lib/postgresql/8.4/bin/pgbench -T 60 test1
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
duration: 60 s
number of transactions actually processed: 7004
tps = 116.728199 (including connections establishing)
tps = 116.733012 (excluding connections establishing)
My question is if these are acceptable results, or if someone can recommend settings which will improve my server's performance. Andreas
Your results are I/O limited. Depending upon your requirements, you may be able to turn off synchronous_commit, which can help. Your actual workload may be able to use batching to help as well. Your work_mem looks pretty darn high for a 6GB system. Cheers, Ken
Re: [PERFORM] Performance issues
On Mon, Mar 07, 2011 at 03:17:05PM +0100, Andreas Forø Tollefsen wrote: Thanks, Ken. It seems like the tip to turn off synchronous_commit did the trick: tps = 1434.123199 (including connections establishing) tps = 1434.183362 (excluding connections establishing) Is this acceptable compared to others when considering my setup? Cheers, Andreas
These are typical results for synchronous_commit off. The caveat is you must be able to handle losing transactions if you have a database crash, but your database is still intact. This differs from turning fsync off, in which a crash means you would need to restore from a backup. Cheers, Ken
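Since synchronous_commit can also be set per session or per transaction, the data-loss window can be confined to the bulk jobs that need the speed - a sketch, not something Ken spelled out:

-- Per session: only this connection's commits may be lost on a crash.
SET synchronous_commit = off;

-- Or per transaction, leaving everything else fully durable:
BEGIN;
SET LOCAL synchronous_commit = off;
-- ... bulk inserts/updates ...
COMMIT;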
Re: [PERFORM] Performance issues
Ok. Cheers. I will do some more testing on my heavy PostGIS queries, which often take hours to complete. Thanks. Andreas
2011/3/7 Kenneth Marshall k...@rice.edu These are typical results for synchronous_commit off. The caveat is you must be able to handle losing transactions if you have a database crash, but your database is still intact. This differs from turning fsync off, in which a crash means you would need to restore from a backup. Cheers, Ken
Re: [PERFORM] Performance issues
On Mon, 7 Mar 2011, Andreas Forø Tollefsen wrote: Ok. Cheers. I will do some more testing on my heavy PostGIS queries, which often take hours to complete.
I'd like to see hours-long queries :) EXPLAIN ANALYZE
Regards, Oleg
_____
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Re: [PERFORM] Performance issues
The synchronous_commit off increased the TPS, but not the speed of the below query. Oleg: This is a query I am working on now. It creates an intersection of two geometries. One is a grid of 0.5 x 0.5 decimal degree sized cells, while the other is the country geometries of all countries in the world for a certain year.
priogrid=# EXPLAIN ANALYZE SELECT priogrid_land.gid, gwcode, ST_Intersection(priogrid_land.cell, cshapeswdate.geom) FROM priogrid_land, cshapeswdate WHERE ST_Intersects(priogrid_land.cell, cshapeswdate.geom);
QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..12644.85 rows=43351 width=87704) (actual time=1.815..7074973.711 rows=130331 loops=1)
   Join Filter: _st_intersects(priogrid_land.cell, cshapeswdate.geom)
   ->  Seq Scan on cshapeswdate  (cost=0.00..14.42 rows=242 width=87248) (actual time=0.007..0.570 rows=242 loops=1)
   ->  Index Scan using idx_priogrid_land_cell on priogrid_land  (cost=0.00..7.15 rows=1 width=456) (actual time=0.069..5.604 rows=978 loops=242)
         Index Cond: (priogrid_land.cell && cshapeswdate.geom)
 Total runtime: 7075188.549 ms
(6 rows)
2011/3/7 Oleg Bartunov o...@sai.msu.su On Mon, 7 Mar 2011, Andreas Forø Tollefsen wrote: Ok. Cheers. I will do some more testing on my heavy PostGIS queries, which often take hours to complete. I'd like to see hours-long queries :) EXPLAIN ANALYZE
Re: [PERFORM] Performance issues
On Mon, Mar 07, 2011 at 10:49:48PM +0100, Andreas Forø Tollefsen wrote:
- The synchronous_commit off increased the TPS, but not the speed of the below query.
- priogrid=# EXPLAIN ANALYZE SELECT priogrid_land.gid, gwcode, ST_Intersection(priogrid_land.cell, cshapeswdate.geom) FROM priogrid_land, cshapeswdate WHERE ST_Intersects(priogrid_land.cell, cshapeswdate.geom);
- Nested Loop (cost=0.00..12644.85 rows=43351 width=87704) (actual time=1.815..7074973.711 rows=130331 loops=1)
- Total runtime: 7075188.549 ms
Your estimated and actuals are way off, have you analyzed those tables? Dave
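Answering Dave's question is cheap to do by hand - a minimal sketch with the table names from the thread (though note that with PostGIS operators the row estimates can stay rough even with fresh statistics):

ANALYZE priogrid_land;
ANALYZE cshapeswdate;
-- or, to reclaim dead space at the same time:
VACUUM ANALYZE priogrid_land;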
Re: [PERFORM] Performance issues
Andreas Forø Tollefsen andrea...@gmail.com writes: This is a query I am working on now. It creates an intersection of two geometries. One is a grid of 0.5 x 0.5 decimal degree sized cells, while the other is the country geometries of all countries in the world for a certain year.
Hm, are you sure your data is right? Because the actual rowcounts imply that each country intersects about half of the grid cells, which doesn't seem right.
AFAICT, all of the runtime is going into calculating the ST_Intersects and/or ST_Intersection functions. The two scans are only accounting for perhaps 5.5 seconds, and the join infrastructure isn't going to be terribly expensive, so it's got to be those functions. Not knowing much about PostGIS, I don't know if the functions themselves can be expected to be really slow. If it's not them, it could be the cost of fetching their arguments --- in particular, I bet the country outlines are very large objects and are toasted out-of-line. There's been some past discussion of automatically avoiding repeated detoastings in scenarios like the above, but nothing's gotten to the point of acceptance yet. Possibly you could do something to force detoasting in a subquery. regards, tom lane
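Tom's guess about outlines being toasted out-of-line is easy to verify - a sketch, not from the thread; pg_column_size() reports the size actually stored for each value:

SELECT gwcode,
       pg_column_size(geom) AS stored_bytes,
       ST_NPoints(geom)     AS points
FROM cshapeswdate
ORDER BY pg_column_size(geom) DESC
LIMIT 5;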
Re: [PERFORM] Performance issues with postgresql-8.4.0: Query gets stuck sometimes
On Fri, Jul 2, 2010 at 1:40 AM, Sachin Kumar sachin...@globallogic.com wrote: At times we have observed that postgres stops responding for several minutes, even couldn't fetch the number of entries in a particular table. One such instance happens when we execute the following steps:
Sounds sort of like a checkpoint spike.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [PERFORM] Performance issues with postgresql-8.4.0
On 29/06/10 15:01, Sachin Kumar wrote: At times we have observed that postgres stops responding for several minutes, even couldn't fetch the number of entries in a particular table.
Quick guess: checkpoints. Enable checkpoint logging, follow the logs, see if there's any correspondence. In general, looking at the logs might help you identify the issue.
One such instance happens when we execute the following steps: - Add few lakh entries (~20) to table X on the master DB. - After addition, slony starts replication on the slave DB. It takes several minutes (~25 mins) for replication to finish. - During this time (while replication is in progress), sometimes postgres stops responding, i.e. we couldn't even fetch the number of entries in any table (X, Y, etc).
Fetching the number of entries in a table - using count(...) - is actually a rather expensive operation, and a poor choice if you just want to see if the server is responsive. SELECT id FROM tablename LIMIT 1; where id is the primary key of the table would be a better option.
-- Craig Ringer
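Following Craig's advice on 8.4 means flipping one logging GUC and, if checkpoints do line up with the stalls, spreading them out - a sketch; the two tuning values are illustrative, not recommendations:

# postgresql.conf
log_checkpoints = on                  # log each checkpoint's start, end and timings
checkpoint_segments = 16              # the default of 3 forces very frequent checkpoints under write bursts
checkpoint_completion_target = 0.7    # spread checkpoint I/O over more of the interval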
Re: [PERFORM] Performance issues with postgresql-8.4.0: Query gets stuck sometimes
On Fri, Jul 2, 2010 at 1:40 AM, Sachin Kumar sachin...@globallogic.com wrote: Hi, We are using postgresql-8.4.0 on a 64-bit Linux machine (openSUSE 11.x). It's a master/slave deployment and slony-2.0.4.rc2 is used for DB replication (from master to slave).
You should really be running 8.4.4, not 8.4.0, as there are quite a few bug fixes since 8.4.0 was released. slony 2.0.4 is latest, and I'm not sure I trust it completely just yet, and am still running 1.2.latest myself. At least move forward from 2.0.4.rc2 to the 2.0.4 release.
At times we have observed that postgres stops responding for several minutes, even couldn't fetch the number of entries in a particular table.
Note that retrieving the number of entries in a table is not a cheap operation in pgsql. Try something cheaper like select * from sometable limit 1; and see if that responds. If that seems to hang, open another session and see what select * from pg_stat_activity has to say about waiting queries.
One such instance happens when we execute the following steps: - Add few lakh entries (~20) to table X on the master DB.
Note that most Westerners don't know what a lakh is. (100k I believe?)
- After addition, slony starts replication on the slave DB. It takes several minutes (~25 mins) for replication to finish.
- During this time (while replication is in progress), sometimes postgres stops responding, i.e. we couldn't even fetch the number of entries in any table (X, Y, etc).
I have seen some issues pop up during subscription of large sets like this. Most of the time you're just outrunning your IO subsystem. Occasionally a nasty interaction between slony, autovacuum, and user queries causes a problem.
Can you please let us know what could be the reason for such a behavior and how it can be fixed/improved.
You'll need to see what's happening on your end. If pg_stat_activity says your simple select * from X limit 1 is waiting, we'll go from there. If it returns but bigger queries take a long time you've got a different issue and probably need to monitor your IO subsystem with things like iostat, vmstat, iotop, etc.
Please let us know if any information is required wrt hardware details/configurations etc.
Always useful to have.
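The check Scott describes, written out for 8.4's catalog (procpid and current_query were renamed in later releases, so this is version-specific):

-- In a second session, list backends blocked waiting on locks:
SELECT procpid, usename, waiting, current_query
FROM pg_stat_activity
WHERE waiting;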
Re: [PERFORM] Performance issues when the number of records are around 10 Million
On Wed, May 12, 2010 at 1:45 AM, venu madhav venutaurus...@gmail.com wrote: [Venu] Yes, autovacuum is running every hour. I could see it in the log messages. All the configurations for autovacuum are disabled except that it should run every hour. This application runs on an embedded box, so I can't change the parameters as they affect the other applications running on it. Can you please explain what you mean by default parameters. autovacuum = on # enable autovacuum subprocess? autovacuum_naptime = 3600 # time between autovacuum runs, in secs
The default value for autovacuum_naptime is a minute. Why would you want to increase it by a factor of 60? That seems likely to result in I/O spikes, table bloat, and generally poor performance. There are dramatic performance improvements in PostgreSQL 8.3 and 8.4. Upgrading would probably help, a lot. The points already made about LIMIT some huge value are also right on target.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
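For contrast, a sketch of what Robert is pointing at: naptime only controls how often the autovacuum launcher wakes up; the per-table thresholds decide whether anything is actually vacuumed on a given pass.

# postgresql.conf - illustrative, not tuned values
autovacuum = on
autovacuum_naptime = 60               # the shipped default (1min), not 3600
# autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor control when a
# table has accumulated enough dead rows to be vacuumed at all.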
Re: [PERFORM] Performance issues when the number of records are around 10 Million
On Wed, May 12, 2010 at 3:17 AM, Jorge Montero jorge_mont...@homedecorators.com wrote: First, are you sure you are getting autovacuum to run hourly? Autovacuum will only vacuum when certain configuration thresholds are reached. You can set it to only check for those thresholds every so often, but no vacuuming or analyzing will be done unless they are hit, regardless of how often autovacuum checks the tables. Whenever you are dealing with time series, the default thresholds are often insufficient, especially when you are especially interested in the last few records on a large table.
[Venu] Yes, autovacuum is running every hour. I could see it in the log messages. All the configurations for autovacuum are disabled except that it should run every hour. This application runs on an embedded box, so I can't change the parameters as they affect the other applications running on it. Can you please explain what you mean by default parameters.
What are your autovacuum configuration parameters?
[Venu] Except these, all others are disabled.
#---
# AUTOVACUUM PARAMETERS
#---
autovacuum = on            # enable autovacuum subprocess?
autovacuum_naptime = 3600  # time between autovacuum runs, in secs
When were the two tables last autovacuumed and analyzed, according to pg_stat_user_tables?
[Venu] This is the content of pg_stat_user_tables for the two tables I am using in that query:
 relid | schemaname | relname   | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
-------+------------+-----------+----------+--------------+----------+---------------+-----------+-----------+-----------
 41188 | public     | event     |      117 |   1201705723 |      998 |          2824 |        28 |         0 |         0
 41209 | public     | signature |      153 |         5365 |        2 |            72 |         1 |         0 |         0
Could you post the output of explain analyze of your query?
snort=# EXPLAIN ANALYZE select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name, e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1, e.wifi_addr_2, e.view_status, bssid FROM event e, signature s WHERE s.sig_id = e.signature AND e.timestamp >= '1270449180' AND e.timestamp < '1273473180' ORDER BY e.cid DESC, e.cid DESC limit 21 offset 10539780;
QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=7885743.98..7885743.98 rows=1 width=287) (actual time=1462193.060..1462193.083 rows=14 loops=1)
   ->  Sort  (cost=7859399.66..7885743.98 rows=10537727 width=287) (actual time=1349648.207..1456496.334 rows=10539794 loops=1)
         Sort Key: e.cid
         ->  Hash Join  (cost=2.44..645448.31 rows=10537727 width=287) (actual time=0.182..139745.001 rows=10539794 loops=1)
               Hash Cond: ("outer".signature = "inner".sig_id)
               ->  Seq Scan on event e  (cost=0.00..487379.97 rows=10537727 width=104) (actual time=0.012..121595.257 rows=10539794 loops=1)
                     Filter: ((timestamp >= 1270449180::bigint) AND (timestamp < 1273473180::bigint))
               ->  Hash  (cost=2.35..2.35 rows=35 width=191) (actual time=0.097..0.097 rows=36 loops=1)
                     ->  Seq Scan on signature s  (cost=0.00..2.35 rows=35 width=191) (actual time=0.005..0.045 rows=36 loops=1)
 Total runtime: 1463829.145 ms
(10 rows)
Which default statistic collection parameters do you use? Have you changed them specifically for the tables you are using?
[Venu] These are the statistic collection parameters:
# - Query/Index Statistics Collector -
stats_start_collector = on
stats_command_string = on
#stats_block_level = off
stats_row_level = on
#stats_reset_on_server_start = off
Please let me know if you are referring to something else.
Which version of Postgres are you running? Which OS?
[Venu] Postgres version 8.1, and CentOS 5.1 is the operating system.
Thank you, Venu
venu madhav venutaurus...@gmail.com 05/11/10 3:47 AM wrote: Hi all, In my database application, I've a table whose records can reach 10M and insertions can happen at a faster rate like 100 insertions per second in the peak times. I configured postgres to do auto vacuum on an hourly basis. I have a frontend GUI application in CGI which displays the data from the database.
Re: [PERFORM] Performance issues when the number of records are around 10 Million
On Wed, May 12, 2010 at 3:22 AM, Shrirang Chitnis shrirang.chit...@hovservices.com wrote: Venu, For starters, 1) You have used the e.cid twice in ORDER BY clause.
[Venu] Actually the second cid acts as a secondary sort order if any other column in the table is used for sorting. In the query, since the primary sorting key was also cid, we are seeing it twice. I can remove it.
2) If you want last twenty records in the table matching the criteria of timestamp, why do you need the offset?
[Venu] It is part of a UI application where a user can ask for data between any dates. It has the options to browse through the data retrieved between those intervals.
3) Do you have indexes on sig_id, signature and timestamp fields?
[Venu] Yes, I do have indexes on those three.
If you do not get a good response after that, please post the EXPLAIN ANALYZE for the query.
snort=# EXPLAIN ANALYZE select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name, e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1, e.wifi_addr_2, e.view_status, bssid FROM event e, signature s WHERE s.sig_id = e.signature AND e.timestamp >= '1270449180' AND e.timestamp < '1273473180' ORDER BY e.cid DESC, e.cid DESC limit 21 offset 10539780;
QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=7885743.98..7885743.98 rows=1 width=287) (actual time=1462193.060..1462193.083 rows=14 loops=1)
   ->  Sort  (cost=7859399.66..7885743.98 rows=10537727 width=287) (actual time=1349648.207..1456496.334 rows=10539794 loops=1)
         Sort Key: e.cid
         ->  Hash Join  (cost=2.44..645448.31 rows=10537727 width=287) (actual time=0.182..139745.001 rows=10539794 loops=1)
               Hash Cond: ("outer".signature = "inner".sig_id)
               ->  Seq Scan on event e  (cost=0.00..487379.97 rows=10537727 width=104) (actual time=0.012..121595.257 rows=10539794 loops=1)
                     Filter: ((timestamp >= 1270449180::bigint) AND (timestamp < 1273473180::bigint))
               ->  Hash  (cost=2.35..2.35 rows=35 width=191) (actual time=0.097..0.097 rows=36 loops=1)
                     ->  Seq Scan on signature s  (cost=0.00..2.35 rows=35 width=191) (actual time=0.005..0.045 rows=36 loops=1)
 Total runtime: 1463829.145 ms
(10 rows)
Thank you, Venu Madhav.
Thanks, Shrirang Chitnis Sr. Manager, Applications Development HOV Services Office: (866) 808-0935 Ext: 39210 shrirang.chit...@hovservices.com www.hovservices.com
Re: [PERFORM] Performance issues when the number of records are around 10 Million
On Wed, May 12, 2010 at 5:25 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: venu madhav wrote: AND e.timestamp >= '1270449180' AND e.timestamp < '1273473180' ORDER BY e.cid DESC, e.cid DESC limit 21 offset 10539780
The second column acts as a secondary key for sorting if the primary sorting key is a different column. For this query both of them are same.
Any chance you could just leave the second one off in that case?
[Venu] Yes, that can be ignored. But I am not sure that removing it would reduce the time drastically.
This query is part of an application which allows the user to select time ranges and retrieve the data in that interval. Hence the time stamp.
Which, of course, is going to affect the number of rows. Which leaves me wondering how you know that once you select and sequence the result set you need to read past and ignore exactly 10539780 rows to get to the last page.
[Venu] For example: my database has 10539793 records. My application first calculates the count of the number of records in that interval. Then, based on the user's request to display 10/20/30/40 records on one page, it calculates how many records to skip when the last link is clicked.
To have it in some particular order we're doing order by.
Which will affect which rows are at any particular offset.
[Venu] Yes, by default it has the primary key for order by.
If the records are more in the interval,
How do you know that before you run your query?
[Venu] I calculate the count first.
we display in sets of 20/30 etc. The user also has the option to browse through any of those records hence the limit and offset.
Have you considered alternative techniques for paging? You might use values at the edges of the page to run a small query (limit, no offset) when they page. You might generate all the pages on the first pass and cache them for a while.
[Venu] If I generate all the pages at once, retrieving all the 10M records at once would take much longer, and since the request comes from a browser, there is a chance of the browser timing out. When the user asks for the last set of 20 records, this query gets executed.
The DESC on the ORDER BY makes it look like you're trying to use the ORDER BY to get to the end, but then your offset tells PostgreSQL to skip the 10.5 million result rows with the highest keys. Is the last page the one with the highest or lowest values for cid?
[Venu] The last page contains the lowest values of cid. By default we get the records in decreasing order of cid and then get the last 10/20.
Thank you, Venu. -Kevin
Re: [PERFORM] Performance issues when the number of records are around 10 Million
On Wed, May 12, 2010 at 7:26 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: venu madhav venutaurus...@gmail.com wrote: If the records are more in the interval, How do you know that before you run your query? I calculate the count first.
This and other comments suggest that the data is totally static while this application is running. Is that correct?
[Venu] No, the data gets added while the application is running. As I've mentioned before, it could be as fast as 100-400 records per second. And it is an important application which will be running 24/7.
If I generate all the pages at once, retrieving all the 10M records at once would take much longer
Are you sure of that? It seems to me that it's going to read all ten million rows once for the count and again for the offset. It might actually be faster to pass them just once and build the pages.
[Venu] Even if the retrieval is faster, the client which is viewing the database and the server where the data gets logged can be anywhere on the globe, so it is not feasible to send all 1 or 10 million records at once from the server to the client.
Also, you didn't address the issue of storing enough information on the page to read off either edge in the desired sequence with just a LIMIT and no offset. Last page or page up would need to reverse the direction on the ORDER BY. This would be very fast if you have appropriate indexes. Your current technique can never be made very fast.
[Venu] I actually didn't understand what you meant by storing enough information on the page to read off either edge in the desired sequence with just a LIMIT and no offset. What kind of information can we store to improve the performance? Reversing the order by is one thing I am trying, to figure out how fast it is. Thanks a lot for this suggestion.
Thank you, Venu. -Kevin
Re: [PERFORM] Performance issues when the number of records are around 10 Million
venu madhav venutaurus...@gmail.com wrote: Kevin Grittner kevin.gritt...@wicourts.gov wrote: I calculate the count first. This and other comments suggest that the data is totally static while this application is running. Is that correct? No, the data gets added while the application is running. As I've mentioned before, it could be as fast as 100-400 records per second. And it is an important application which will be running 24/7.
Then how can you trust that the count you run before selecting is accurate when you run the SELECT? Are they both in the same REPEATABLE READ or SERIALIZABLE transaction?
Also, you didn't address the issue of storing enough information on the page to read off either edge in the desired sequence with just a LIMIT and no offset. Last page or page up would need to reverse the direction on the ORDER BY. This would be very fast if you have appropriate indexes. Your current technique can never be made very fast. I actually didn't understand what you meant by storing enough information on the page to read off either edge in the desired sequence with just a LIMIT and no offset. What kind of information can we store to improve the performance?
Well, for starters, it's entirely possible that the hitlist approach posted by Craig James will work better for you than what I'm about to describe. Be sure to read this post carefully: http://archives.postgresql.org/pgsql-performance/2010-05/msg00058.php
The reason that might work better than the idea I was suggesting is that the combination of selecting on timestamp and ordering by something else might make it hard to use reasonable indexes to position and limit well enough for the technique I was suggesting to perform well. It's hard to say without testing.
For what I was describing, you must use an ORDER BY which guarantees a consistent sequence for the result rows. I'm not sure whether you always have that currently; if not, that's another nail in the coffin of your current technique, since the same OFFSET into the result might be different rows from one time to the next, even if data didn't change. If your ORDER BY can't guarantee a unique set of ordering values for every row in the result set, you need to add any missing columns from a unique index (usually the primary key) to the ORDER BY clause.
Anyway, once you are sure you have an ORDER BY which is deterministic, you make sure your software remembers the ORDER BY values for the first and last entries on the page. Then you can do something like (abstractly):
SELECT x, y, z FROM a, b
WHERE ts BETWEEN m AND n
  AND a.x = b.a_x
  AND (x, y) > (lastx, lasty)
ORDER BY x, y LIMIT 20;
With the right indexes, data distributions, selection criteria, and ORDER BY columns -- that *could* be very fast. If not, look at Craig's post.
-Kevin
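Applied to the schema in this thread, Kevin's abstract pattern might look like the sketch below. It assumes cid is unique (so it is a sufficient tiebreaker on its own), and :last_cid is a placeholder the application fills in from the edge of the previous page:

-- First page, newest cid first:
SELECT e.cid, e.timestamp, s.sig_name
FROM event e JOIN signature s ON s.sig_id = e.signature
WHERE e.timestamp >= '1270449180' AND e.timestamp < '1273473180'
ORDER BY e.cid DESC
LIMIT 20;

-- Next page: seek past the last cid shown instead of using OFFSET.
SELECT e.cid, e.timestamp, s.sig_name
FROM event e JOIN signature s ON s.sig_id = e.signature
WHERE e.timestamp >= '1270449180' AND e.timestamp < '1273473180'
  AND e.cid < :last_cid
ORDER BY e.cid DESC
LIMIT 20;

For the last page, flip to ORDER BY e.cid ASC LIMIT 20 and reverse the rows client-side, matching Kevin's note about reversing the direction for last page or page up.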
Re: [PERFORM] Performance issues when the number of records are around 10 Million
venu madhav wrote: AND e.timestamp >= '1270449180' AND e.timestamp < '1273473180' ORDER BY e.cid DESC, e.cid DESC limit 21 offset 10539780
The second column acts as a secondary key for sorting if the primary sorting key is a different column. For this query both of them are same.
Any chance you could just leave the second one off in that case?
This query is part of an application which allows the user to select time ranges and retrieve the data in that interval. Hence the time stamp.
Which, of course, is going to affect the number of rows. Which leaves me wondering how you know that once you select and sequence the result set you need to read past and ignore exactly 10539780 rows to get to the last page.
To have it in some particular order we're doing order by.
Which will affect which rows are at any particular offset.
If the records are more in the interval,
How do you know that before you run your query?
we display in sets of 20/30 etc. The user also has the option to browse through any of those records hence the limit and offset.
Have you considered alternative techniques for paging? You might use values at the edges of the page to run a small query (limit, no offset) when they page. You might generate all the pages on the first pass and cache them for a while.
When the user asks for the last set of 20 records, this query gets executed.
The DESC on the ORDER BY makes it look like you're trying to use the ORDER BY to get to the end, but then your offset tells PostgreSQL to skip the 10.5 million result rows with the highest keys. Is the last page the one with the highest or lowest values for cid?
-Kevin
Re: [PERFORM] Performance issues when the number of records are around 10 Million
venu madhav venutaurus...@gmail.com wrote: If the records are more in the interval, How do you know that before you run your query? I calculate the count first.
This and other comments suggest that the data is totally static while this application is running. Is that correct?
If I generate all the pages at once, retrieving all the 10M records at once would take much longer
Are you sure of that? It seems to me that it's going to read all ten million rows once for the count and again for the offset. It might actually be faster to pass them just once and build the pages.
Also, you didn't address the issue of storing enough information on the page to read off either edge in the desired sequence with just a LIMIT and no offset. Last page or page up would need to reverse the direction on the ORDER BY. This would be very fast if you have appropriate indexes. Your current technique can never be made very fast.
-Kevin
Re: [PERFORM] Performance issues when the number of records are around 10 Million
On 5/12/10 4:55 AM, Kevin Grittner wrote: venu madhav wrote: we display in sets of 20/30 etc. The user also has the option to browse through any of those records hence the limit and offset.
Have you considered alternative techniques for paging? You might use values at the edges of the page to run a small query (limit, no offset) when they page. You might generate all the pages on the first pass and cache them for a while.
Kevin is right. You need to use hitlists - a semi-temporary table that holds the results of your initial query. You're repeating a complex, expensive query over and over, once for each page of data that the user wants to see. Instead, using a hitlist, your initial query looks something like this:
create table hitlist_xxx(
  objectid integer,
  sortorder integer default nextval('hitlist_seq')
);
insert into hitlist_xxx (objectid)
(select ... your original query ... order by ...);
You store some object ID or primary key in the hitlist table, and the sequence records your original order. Then when your user asks for page 1, 2, 3 ... N, all you have to do is join your hitlist to your original data:
select ... from mytables join hitlist_xxx on (...)
where sortorder >= 100 and sortorder < 120;
which would instantly return page 5 of your data. To do this, you need a way to know when a user is finished so that you can discard the hitlist.
Craig
Re: [PERFORM] Performance issues when the number of records are around 10 Million
venu madhav venutaurus...@gmail.com wrote: When I try to get the last twenty records from the database, it takes around 10-15 mins to complete the operation.
Making this a little easier to read (for me, at least) I get this:
select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name, e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1, e.wifi_addr_2, e.view_status, bssid
FROM event e, signature s
WHERE s.sig_id = e.signature
  AND e.timestamp >= '1270449180'
  AND e.timestamp < '1273473180'
ORDER BY e.cid DESC, e.cid DESC
limit 21 offset 10539780;
Why the timestamp range, the order by, the limit, *and* the offset? On the face of it, that seems a bit confused. Not to mention that your ORDER BY has the same column twice.
Perhaps that OFFSET is not needed? It is telling PostgreSQL that whatever results are generated based on the rest of the query, read through and ignore the first ten and a half million. Since you said you had about ten million rows, you wanted the last 20, and the ORDER BY is DESCending, you're probably not going to get what you want.
What, exactly, *is* it you want again?
-Kevin
Re: [PERFORM] Performance issues when the number of records are around 10 Million
Venu, For starters, 1) You have used the e.cid twice in ORDER BY clause. 2) If you want last twenty records in the table matching the criteria of timestamp, why do you need the offset? 3) Do you have indexes on sig_id, signature and timestamp fields? If you do not get a good response after that, please post the EXPLAIN ANALYZE for the query.
Thanks, Shrirang Chitnis Sr. Manager, Applications Development HOV Services Office: (866) 808-0935 Ext: 39210 shrirang.chit...@hovservices.com www.hovservices.com
From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of venu madhav Sent: Tuesday, May 11, 2010 2:18 PM To: pgsql-performance@postgresql.org Subject: [PERFORM] Performance issues when the number of records are around 10 Million
Hi all, In my database application, I've a table whose records can reach 10M and insertions can happen at a faster rate like 100 insertions per second in the peak times. I configured postgres to do auto vacuum on an hourly basis. I have a frontend GUI application in CGI which displays the data from the database. When I try to get the last twenty records from the database, it takes around 10-15 mins to complete the operation. This is the query which is used:
select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name, e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1, e.wifi_addr_2, e.view_status, bssid
FROM event e, signature s
WHERE s.sig_id = e.signature AND e.timestamp >= '1270449180' AND e.timestamp < '1273473180'
ORDER BY e.cid DESC, e.cid DESC limit 21 offset 10539780;
Can any one suggest me a better solution to improve the performance. Please let me know if you've any further queries.
Thank you, Venu
Re: [PERFORM] Performance issues when the number of records are around 10 Million
select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name, e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1, e.wifi_addr_2, e.view_status, bssid FROM event e, signature s WHERE s.sig_id = e.signature AND e.timestamp >= '1270449180' AND e.timestamp < '1273473180' ORDER BY e.cid DESC, e.cid DESC limit 21 offset 10539780;
Anything with an offset that high is going to result in a sequential scan of most of the table.
-- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [PERFORM] Performance issues with large amounts of time-series data
Hi Tom, Greg, Thanks for your helpful suggestions - switching the BIGINT to FLOAT and fixing the postgresql.conf to better match my server configuration gave me about a 30% speedup on the queries. Because my data insert order was almost never the data retrieval order, I also got a significant (about 3x-10x) speedup by CLUSTERing the tables on an index that represented the most frequent query orders (main_id, timestamp, sub_id, device_id) - the queries that were taking a few seconds earlier now complete in a few hundred milliseconds (5s vs. 600ms in some instances). Thanks Again, Hrishikesh
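What Hrishikesh describes might look like the following sketch; the table and index names are made up (the thread only mentions 300/3600/86400-second tables), and CLUSTER ... USING is the 8.3+ spelling:

CREATE INDEX idx_stats_300_order ON stats_300 (main_id, timestamp, sub_id, device_id);
CLUSTER stats_300 USING idx_stats_300_order;
ANALYZE stats_300;   -- refresh planner statistics after the physical rewrite

CLUSTER takes an exclusive lock and rewrites the whole table, and new rows are not kept in order, so it has to be re-run periodically to retain the benefit.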
Re: [PERFORM] Performance issues with large amounts of time-series data
On Wed, 26 Aug 2009, Hrishikesh (हृषीकेश मेहेंदळे) wrote: key = {device_id (uint64), identifier (uint32), sub_identifier (uint32), unix_time} (these four taken together are unique)
You should probably tag these fields as NOT NULL to eliminate needing to consider that possibility during query planning. As of 8.3 this isn't as critical anymore, but it's still good practice.
-- * Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD
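In DDL terms, Greg's suggestion is the following (column names from the key description above; the stats_300 table name is a placeholder):

ALTER TABLE stats_300
  ALTER COLUMN device_id      SET NOT NULL,
  ALTER COLUMN identifier     SET NOT NULL,
  ALTER COLUMN sub_identifier SET NOT NULL,
  ALTER COLUMN unix_time      SET NOT NULL;

Since the four together are unique anyway, a composite primary key over them would enforce both NOT NULL and that uniqueness in one step.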
Re: [PERFORM] Performance issues with large amounts of time-series data
Hrishikesh (हृषीकेश मेहेंदळे) hashincl...@gmail.com writes: In my timing tests, the performance of PG is quite a lot worse than the equivalent BerkeleyDB implementation.
Are you actually comparing apples to apples? I don't recall that BDB has any built-in aggregation functionality. It looks to me like you've moved some work out of the client into the database.
1. Is there anything I can do to speed up performance for the queries?
Do the data columns have to be bigint, or would int be enough to hold the expected range? SUM(bigint) is a *lot* slower than SUM(int), because the former has to use numeric arithmetic whereas the latter can sum in bigint. If you want to keep the data on-disk as bigint, but you know the particular values being summed here are not that big, you could cast in the query (SUM(data_1::int) etc).
I'm also wondering if you've done something to force indexscans to be used. If I'm interpreting things correctly, some of these scans are traversing all/most of a partition and would be better off as seqscans.
shared_buffers = 128MB
This is really quite lame for the size of machine and database you've got. Consider knocking it up to 1GB or so.
regards, tom lane
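Tom's cast, sketched with the hypothetical names used in this thread (data_1, and the made-up stats_300 table from above):

-- bigint on disk, but accumulated with fast bigint arithmetic instead of numeric:
SELECT sub_identifier, SUM(data_1::int) AS total_1
FROM stats_300
WHERE device_id = 42
GROUP BY sub_identifier;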
Re: [PERFORM] Performance issues with large amounts of time-series data
Hi Tom, Thanks for your quick response.
2009/8/26 Tom Lane t...@sss.pgh.pa.us: hashincl...@gmail.com writes: In my timing tests, the performance of PG is quite a lot worse than the equivalent BerkeleyDB implementation. Are you actually comparing apples to apples? I don't recall that BDB has any built-in aggregation functionality. It looks to me like you've moved some work out of the client into the database.
I'm measuring end-to-end time, which includes the in-code aggregation with BDB (post DB fetch) and the in-query aggregation in PG.
1. Is there anything I can do to speed up performance for the queries? Do the data columns have to be bigint, or would int be enough to hold the expected range? SUM(bigint) is a *lot* slower than SUM(int), because the former has to use numeric arithmetic whereas the latter can sum in bigint. If you want to keep the data on-disk as bigint, but you know the particular values being summed here are not that big, you could cast in the query (SUM(data_1::int) etc).
For the 300-sec tables I probably can drop it to an integer, but the 3600 and 86400 tables (1 hr, 1 day) will probably need to be BIGINTs. However, given that I'm on a 64-bit platform (sorry if I didn't mention it earlier), does it make that much of a difference? How does a float (REAL) compare in terms of SUM()s?
I'm also wondering if you've done something to force indexscans to be used. If I'm interpreting things correctly, some of these scans are traversing all/most of a partition and would be better off as seqscans.
One thing I noticed is that if I specify what devices I want the data for (specifically, all of them, listed out as DEVICE IN (1,2,3,4,5...) in the WHERE clause), PG uses a Bitmap heap scan, while if I don't specify the list (which still gives me data for all the devices), PG uses a sequential scan. (I might have missed the DEVICE IN (...) in my earlier query). However, more often than not, the query _will_ be of the form DEVICE IN (...). If I actually execute the queries (on the psql command line), their runtimes are about the same (15s vs 16s).
shared_buffers = 128MB This is really quite lame for the size of machine and database you've got. Consider knocking it up to 1GB or so.
OK, I've bumped it up to 1 GB. However, that doesn't seem to make a huge difference (unless I need to do the same on libpqxx's connection object too).
Cheers, Hrishi
Re: [PERFORM] Performance issues with large amounts of time-series data
Hrishikesh (हृषीकेश मेहेंदळे) hashincl...@gmail.com writes: 2009/8/26 Tom Lane t...@sss.pgh.pa.us Do the data columns have to be bigint, or would int be enough to hold the expected range? For the 300-sec tables I probably can drop it to an integer, but the 3600 and 86400 tables (1 hr, 1 day) will probably need to be BIGINTs. However, given that I'm on a 64-bit platform (sorry if I didn't mention it earlier), does it make that much of a difference?
Even more so.
How does a float (REAL) compare in terms of SUM()s?
Casting to float or float8 is certainly a useful alternative if you don't mind the potential for roundoff error. On any non-ancient platform those will be considerably faster than numeric. BTW, I think that 8.4 might be noticeably faster than 8.3 for summing floats, because of the switch to pass-by-value for them.
regards, tom lane
Re: [PERFORM] Performance issues with large amounts of time-series data
2009/8/26 Tom Lane t...@sss.pgh.pa.us:

How does a float (REAL) compare in terms of SUM()s?

Casting to float or float8 is certainly a useful alternative if you don't mind the potential for roundoff error. On any non-ancient platform those will be considerably faster than numeric. BTW, I think that 8.4 might be noticeably faster than 8.3 for summing floats, because of the switch to pass-by-value for them.

It occurs to me we could build a special-case state variable which contains a bigint, switching to a numeric only if it actually overflows. This would be like my other suggestion with dates, only it would never be exposed. The final function would always convert to a numeric.

Alternatively we could change the numeric data type as was proposed aeons ago, but make it more general so it stores integers that fit in a bigint as a 64-bit integer internally. That would be more work but would be more generally useful. I'm not sure it would be possible to avoid generating palloc garbage for sum() that way, though.

-- greg http://mit.edu/~gsstark/resume.pdf
Re: [PERFORM] Performance issues migrating from 743 to 826
Matthew Lunnon [EMAIL PROTECTED] writes:

In this case the query takes 6.037 ms to run on 8.2.6 and 2.332 ms to run on 7.4.3.

The difference between 2ms and 6ms is pretty negligible. A single context switch or disk cache miss could throw the results off by that margin in either direction. But what plan does 7.4.3 come up with if you set enable_hashjoin = off? I'm curious whether it comes up with the same nested loops plan as 8.2 and what cost it says it has.

I think you need to find queries which take longer to have any reliable performance comparisons. Note that the configuration parameters here aren't the same at all; it's possible the change of effective_cache_size from 800k to 2GB is what's changing the cost estimation. I seem to recall a change in the arithmetic for calculating nested loop costs too which made it more aggressive in estimating cache effectiveness.

Incidentally, default_statistics_target=1000 is awfully aggressive. I found in the past that that caused the statistics table to become much larger and much slower to access. It may have caused some statistics to be toasted or it may have just been the sheer volume of data present. It will also make your ANALYZEs take a lot longer. I would suggest trying 100 first and incrementally raising it rather than jumping straight to 1000. And preferably only on the columns which really matter.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
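The suggested experiment, spelled out (the setting is enable_hashjoin; RESET restores the default afterwards):

    SET enable_hashjoin = off;
    EXPLAIN ANALYZE <the query under test>;
    RESET enable_hashjoin;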
Re: [PERFORM] Performance issues migrating from 743 to 826
Scott Marlowe wrote:

Whatever email agent you're using seems to be quoting in a way that doesn't get along well with gmail, so I'm just gonna chop most of it rather than have it quoted confusingly... Heck, I woulda chopped a lot anyway to keep it small. :)

Thanks again for your time. I'm using Thunderbird, maybe I need to upgrade.

On Jan 28, 2008 9:27 AM, Matthew Lunnon [EMAIL PROTECTED] wrote:

Scott Marlowe wrote: On Jan 28, 2008 5:41 AM, Matthew Lunnon [EMAIL PROTECTED] wrote: default_statistics_target = 1000

That's very high for the default. Planning times will be increased noticeably.

I had originally left the default_statistics_target at its default and then increased it to 100, but this did not seem to make much difference. I will reduce this down to something more normal again.

You do know that if you create a column when the default is 10, then increase the default, it won't change the column's stats target, right? So, assuming the table was first created, then you changed the default, you'll now need to do:

alter table xyz alter column abc set statistics 100;
analyze xyz;

for it to make any difference.

Thanks, I haven't looked into this yet; I'll take a look. When I changed the default_statistics_target it did take a very long time to do its analyze, so I assumed it was doing something.

The queries were on exactly the same data. My interpretation of what is going on here is that 8.2.6 seems to be leaving the filtering of market_id to the very last point, which is why it ends up with 189 rows at this point instead of the 2 that 7.4.3 has. 7.4.3 seems to do that filtering much earlier and so reduces the number of rows at a much earlier point in the execution of the query. I guess that this is something to do with the planner, which is why I tried increasing the default_statistics_target.

Ahh, I'm guessing it's something that your 7.4 database CAN use an index on and your 8.2 database can't use an index on. Like text in a non-C locale. Or something... Table def?

Thanks, I'll take a look at that. Is there any documentation on what 8.2.6 can't use in an index? It didn't seem to have complained about any of my indexes when I generated the database.

--
Matthew Lunnon Technical Consultant RWA Ltd. [EMAIL PROTECTED] Tel: +44 (0)29 2081 5056 www.rwa-net.co.uk
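Since a column only picks up an explicit target when one is set, a quick way to check what each column is actually using is the sketch below (attstattarget is -1 when the column falls back to default_statistics_target; market_group_relation is just one of the tables from the query, used as an example):

    SELECT attname, attstattarget
    FROM   pg_attribute
    WHERE  attrelid = 'market_group_relation'::regclass
    AND    attnum > 0 AND NOT attisdropped;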
Re: [PERFORM] Performance issues migrating from 743 to 826
Hi Scott,

Thanks for your time.

Regards, Matthew

Scott Marlowe wrote:

On Jan 28, 2008 5:41 AM, Matthew Lunnon [EMAIL PROTECTED] wrote:

Hi, I am investigating migrating from postgres 7.4.3 to postgres 8.2.6, but although the performance in 8.2.6 seems to be generally better, there are some instances where it seems to be markedly worse, by a factor of up to 10. The problem seems to occur when I join to more than 4 tables. Has anyone else experienced anything similar or got any suggestions as to what I might do? I am running on an Intel box with two hyper-threaded cores and 4GB of RAM. I have tweaked the postgres.conf files with these values and the query and explain output are below. In this case the query takes 6.037 ms to run on 8.2.6 and 2.332 ms to run on 7.4.3.

It looks like the data are not the same in these two environments.

8.2.6:
shared_buffers = 500MB
work_mem = 10MB
maintenance_work_mem = 100MB
effective_cache_size = 2048MB
default_statistics_target = 1000

That's very high for the default. Planning times will be increased noticeably.

I had originally left the default_statistics_target at its default and then increased it to 100, but this did not seem to make much difference. I will reduce this down to something more normal again.

Plan for 7.4:

Nested Loop  (cost=37.27..48.34 rows=1 width=458) (actual time=1.474..2.138 rows=14 loops=1)
  ->  Nested Loop  (cost=37.27..42.34 rows=1 width=282) (actual time=1.428..1.640 rows=2 loops=1)

This is processing 2 rows...

Total runtime: 2.332 ms

While this is processing 189 rows:

Nested Loop  (cost=0.00..30.39 rows=1 width=458) (actual time=0.123..5.841 rows=14 loops=1)
  ->  Nested Loop  (cost=0.00..29.70 rows=1 width=439) (actual time=0.099..4.590 rows=189 loops=1)

Hardly seems a fair comparison.

The queries were on exactly the same data. My interpretation of what is going on here is that 8.2.6 seems to be leaving the filtering of market_id to the very last point, which is why it ends up with 189 rows at this point instead of the 2 that 7.4.3 has. 7.4.3 seems to do that filtering much earlier and so reduces the number of rows at a much earlier point in the execution of the query. I guess that this is something to do with the planner, which is why I tried increasing the default_statistics_target.
Re: [PERFORM] Performance issues migrating from 743 to 826
On Jan 28, 2008 5:41 AM, Matthew Lunnon [EMAIL PROTECTED] wrote:

Hi, I am investigating migrating from postgres 7.4.3 to postgres 8.2.6, but although the performance in 8.2.6 seems to be generally better, there are some instances where it seems to be markedly worse, by a factor of up to 10. The problem seems to occur when I join to more than 4 tables. Has anyone else experienced anything similar or got any suggestions as to what I might do? I am running on an Intel box with two hyper-threaded cores and 4GB of RAM. I have tweaked the postgres.conf files with these values and the query and explain output are below. In this case the query takes 6.037 ms to run on 8.2.6 and 2.332 ms to run on 7.4.3.

It looks like the data are not the same in these two environments.

8.2.6:
shared_buffers = 500MB
work_mem = 10MB
maintenance_work_mem = 100MB
effective_cache_size = 2048MB
default_statistics_target = 1000

That's very high for the default. Planning times will be increased noticeably.

Plan for 7.4:

Nested Loop  (cost=37.27..48.34 rows=1 width=458) (actual time=1.474..2.138 rows=14 loops=1)
  ->  Nested Loop  (cost=37.27..42.34 rows=1 width=282) (actual time=1.428..1.640 rows=2 loops=1)

This is processing 2 rows...

Total runtime: 2.332 ms

While this is processing 189 rows:

Nested Loop  (cost=0.00..30.39 rows=1 width=458) (actual time=0.123..5.841 rows=14 loops=1)
  ->  Nested Loop  (cost=0.00..29.70 rows=1 width=439) (actual time=0.099..4.590 rows=189 loops=1)

Hardly seems a fair comparison.
Re: [PERFORM] Performance issues migrating from 743 to 826
Whatever email agent you're using seems to be quoting in a way that doesn't get along well with gmail, so I'm just gonna chop most of it rather than have it quoted confusingly... Heck, I woulda chopped a lot anyway to keep it small. :)

On Jan 28, 2008 9:27 AM, Matthew Lunnon [EMAIL PROTECTED] wrote:

Scott Marlowe wrote: On Jan 28, 2008 5:41 AM, Matthew Lunnon [EMAIL PROTECTED] wrote: default_statistics_target = 1000

That's very high for the default. Planning times will be increased noticeably.

I had originally left the default_statistics_target at its default and then increased it to 100, but this did not seem to make much difference. I will reduce this down to something more normal again.

You do know that if you create a column when the default is 10, then increase the default, it won't change the column's stats target, right? So, assuming the table was first created, then you changed the default, you'll now need to do:

alter table xyz alter column abc set statistics 100;
analyze xyz;

for it to make any difference.

The queries were on exactly the same data. My interpretation of what is going on here is that 8.2.6 seems to be leaving the filtering of market_id to the very last point, which is why it ends up with 189 rows at this point instead of the 2 that 7.4.3 has. 7.4.3 seems to do that filtering much earlier and so reduces the number of rows at a much earlier point in the execution of the query. I guess that this is something to do with the planner, which is why I tried increasing the default_statistics_target.

Ahh, I'm guessing it's something that your 7.4 database CAN use an index on and your 8.2 database can't use an index on. Like text in a non-C locale. Or something... Table def?
Re: [PERFORM] Performance issues migrating from 743 to 826
Hi Gregory/All,

Thanks for your time. Yes the difference is pretty small, but it does seem to be consistent. The problem that I have is that this is just part of the query; I have tried to break things down so that I can see where the time is being spent. I set the default_statistics_target to 1000 after going via 100, but it seemed to make no difference.

I have a confession to make though: this is not like for like. I did in fact have to add a couple of indexes to the data as the performance was so bad with 8.2.6. Very sorry for that, it doesn't help. The actual difference is from 2ms to 57ms when these indexes are removed, which is much more significant. Here is the like-for-like comparison with 8.2.6; the indexes were added to the market_group_relation table since it is doing a seq scan at the very end.

Nested Loop  (cost=0.00..54.03 rows=1 width=458) (actual time=0.279..57.457 rows=14 loops=1)
  Join Filter: (mgr.market_group_id = mgpr.market_group_id)
  ->  Nested Loop  (cost=0.00..29.19 rows=1 width=439) (actual time=0.102..4.867 rows=189 loops=1)
        ->  Nested Loop  (cost=0.00..28.91 rows=1 width=358) (actual time=0.095..3.441 rows=189 loops=1)
              ->  Nested Loop  (cost=0.00..20.60 rows=1 width=327) (actual time=0.082..1.639 rows=189 loops=1)
                    ->  Nested Loop  (cost=0.00..9.95 rows=1 width=147) (actual time=0.054..0.138 rows=27 loops=1)
                          ->  Seq Scan on market mrkt  (cost=0.00..1.65 rows=1 width=87) (actual time=0.020..0.020 rows=1 loops=1)
                                Filter: ((live <> 'X'::bpchar) AND (market_id = 10039))
                          ->  Index Scan using accommodation_price_panel_idx1 on accommodation_price_panel app  (cost=0.00..8.30 rows=1 width=60) (actual time=0.029..0.079 rows=27 loops=1)
                                Index Cond: ((contract_id = 16077) AND ((code)::text = 'LONHRL'::text) AND (code_type = 'IS'::bpchar))
                                Filter: (live <> 'X'::bpchar)
                    ->  Index Scan using daily_rates_pkey on daily_rates dr  (cost=0.00..10.63 rows=1 width=180) (actual time=0.021..0.041 rows=7 loops=27)
                          Index Cond: ((app.accommodation_price_panel_id = dr.accommodation_price_panel_id) AND (dr.room_type = 'Zk'::bpchar) AND (dr.board_type = 'BB'::bpchar) AND (dr.min_group_size = 0))
                          Filter: (((start_date >= '2008-05-22'::date) AND (start_date <= '2008-05-31'::date)) OR (('2008-05-22'::date >= start_date) AND ('2008-05-22'::date <= end_date)))
              ->  Index Scan using market_group_price_relation_pkey on market_group_price_relation mgpr  (cost=0.00..8.30 rows=1 width=35) (actual time=0.005..0.006 rows=1 loops=189)
                    Index Cond: (app.accommodation_price_panel_id = mgpr.price_panel_id)
        ->  Index Scan using market_group_pkey on market_group mg  (cost=0.00..0.27 rows=1 width=81) (actual time=0.003..0.004 rows=1 loops=189)
              Index Cond: (mgpr.market_group_id = mg.market_group_id)
              Filter: (live <> 'X'::bpchar)
  ->  Seq Scan on market_group_relation mgr  (cost=0.00..24.46 rows=30 width=31) (actual time=0.068..0.259 rows=30 loops=189)
        Filter: (10039 = market_id)
Total runtime: 57.648 ms

Gregory Stark wrote:

Matthew Lunnon [EMAIL PROTECTED] writes: In this case the query takes 6.037 ms to run on 8.2.6 and 2.332 ms to run on 7.4.3.

The difference between 2ms and 6ms is pretty negligible. A single context switch or disk cache miss could throw the results off by that margin in either direction. But what plan does 7.4.3 come up with if you set enable_hashjoin = off? I'm curious whether it comes up with the same nested loops plan as 8.2 and what cost it says it has.

I'll investigate and let you know.

I think you need to find queries which take longer to have any reliable performance comparisons.

Note that the configuration parameters here aren't the same at all; it's possible the change of effective_cache_size from 800k to 2GB is what's changing the cost estimation. I seem to recall a change in the arithmetic for calculating nested loop costs too which made it more aggressive in estimating cache effectiveness.

Incidentally, default_statistics_target=1000 is awfully aggressive. I found in the past that that caused the statistics table to become much larger and much slower to access. It may have caused some statistics to be toasted or it may have just been the sheer volume of data present. It will also make your ANALYZEs take a lot longer. I would suggest trying 100 first and incrementally raising it rather than jumping straight to 1000. And preferably only on the columns which really matter.

--
Matthew Lunnon Technical Consultant RWA Ltd. [EMAIL PROTECTED] Tel: +44 (0)29 2081 5056 www.rwa-net.co.uk
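For reference, the index that removes that repeated seq scan would look something like this (a sketch; the column comes from the plan's "Filter: (10039 = market_id)" line):

    CREATE INDEX market_group_relation_market_id_idx
        ON market_group_relation (market_id);
    ANALYZE market_group_relation;

With loops=189 on that seq scan, even a cheap scan is paid for 189 times, which accounts for most of the 57 ms.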
Re: [PERFORM] Performance Issues on Opteron Dual Core
I installed Ubuntu 5.10 on the production server (64-bit version), and sure enough the performance is like I expected. Opening up that table (320,000 records) takes 6 seconds, with CPU usage of one of the cores going up to 90%-100% for the 6 seconds. I assume only one core is being used per user / session / query?

Gregory

-Original Message-
From: Jim C. Nasby [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 04, 2006 12:47 PM
To: Gregory Stewart
Cc: Mark Kirkwood; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance Issues on Opteron Dual Core

All the machines I've been able to replicate this on have been SMP w2k3 machines running SP1. I've been unable to replicate it on anything not running w2k3, but the only SMP machine I've tested in that manner was an Intel with HT enabled. I now have an Intel with HT running w2k3 sitting in my office, but I haven't had a chance to fire it up and try it yet. Once I test that machine it should help narrow down whether this problem exists with HT machines (which someone on -hackers mentioned they had access to and could do testing with). If it does affect HT machines then I suspect that this is not an issue for XP...

On Tue, May 02, 2006 at 11:27:02PM -0500, Gregory Stewart wrote:

Jim,

Have you seen this happening only on W2k3? I am wondering if I should try out 2000 Pro or XP Pro. Not my first choice, but if it works...

-Original Message-
From: Jim C. Nasby [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 02, 2006 3:29 PM
To: Mark Kirkwood
Cc: Gregory Stewart; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance Issues on Opteron Dual Core

On Sun, Apr 30, 2006 at 10:59:56PM +1200, Mark Kirkwood wrote:

Pgadmin can give misleading times for queries that return large result sets over a network, due to:

1/ It takes time to format the (large) result set for display.
2/ It has to count the time spent waiting for the (large) result set to travel across the network.

You aren't running Pgadmin off the dev server are you? If not, check your network link to dev and prod - is one faster than the other? (etc). To eliminate Pgadmin and the network as factors, try wrapping your query in a 'SELECT count(*) FROM (your query here) AS a', and see if it changes anything!

FWIW, I've found problems running PostgreSQL on Windows in a multi-CPU environment on w2k3. It runs fine for some period, and then CPU and throughput drop to zero. So far I've been unable to track down any more information than that, other than the fact that I haven't been able to reproduce this on any single-CPU machines.

-- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com  work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf  cell: 512-569-9461
Re: [PERFORM] Performance Issues on Opteron Dual Core
FWIW, I've found problems running PostgreSQL on Windows in a multi-CPU environment on w2k3. It runs fine for some period, and then CPU and throughput drop to zero. So far I've been unable to track down any more information than that, other than the fact that I haven't been able to reproduce this on any single-CPU machines.

I have had previous correspondence about this with Magnus (search -general and -hackers). If you uninstall SP1 the problem goes away. We played a bit with potential fixes but didn't find any.

Interesting; does SP2 fix the problem? Anything we can do over here to help?

There is no SP2 for Windows 2003.

Have you tried this with latest-and-greatest CVS HEAD? Meaning with the new semaphore code that was committed a couple of days ago?

I'd be happy to test this if someone could provide a build, or if there's instructions somewhere for doing such a build...

Instructions are here: http://www.postgresql.org/docs/faqs.FAQ_MINGW.html

Let me know if you can't get that working and I can get a set of binaries for you.

//Magnus
Re: [PERFORM] Performance Issues on Opteron Dual Core
All the machines I've been able to replicate this on have been SMP w2k3 machines running SP1. I've been unable to replicate it on anything not running w2k3, but the only SMP machine I've tested in that manner was an Intel with HT enabled. I now have an Intel with HT running w2k3 sitting in my office, but I haven't had a chance to fire it up and try it yet. Once I test that machine it should help narrow down whether this problem exists with HT machines (which someone on -hackers mentioned they had access to and could do testing with). If it does affect HT machines then I suspect that this is not an issue for XP...

On Tue, May 02, 2006 at 11:27:02PM -0500, Gregory Stewart wrote:

Jim,

Have you seen this happening only on W2k3? I am wondering if I should try out 2000 Pro or XP Pro. Not my first choice, but if it works...

-Original Message-
From: Jim C. Nasby [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 02, 2006 3:29 PM
To: Mark Kirkwood
Cc: Gregory Stewart; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance Issues on Opteron Dual Core

On Sun, Apr 30, 2006 at 10:59:56PM +1200, Mark Kirkwood wrote:

Pgadmin can give misleading times for queries that return large result sets over a network, due to:

1/ It takes time to format the (large) result set for display.
2/ It has to count the time spent waiting for the (large) result set to travel across the network.

You aren't running Pgadmin off the dev server are you? If not, check your network link to dev and prod - is one faster than the other? (etc). To eliminate Pgadmin and the network as factors, try wrapping your query in a 'SELECT count(*) FROM (your query here) AS a', and see if it changes anything!

FWIW, I've found problems running PostgreSQL on Windows in a multi-CPU environment on w2k3. It runs fine for some period, and then CPU and throughput drop to zero. So far I've been unable to track down any more information than that, other than the fact that I haven't been able to reproduce this on any single-CPU machines.

-- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com  work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf  cell: 512-569-9461
Re: [PERFORM] Performance Issues on Opteron Dual Core
On Wed, May 03, 2006 at 09:29:15AM +0200, Magnus Hagander wrote:

FWIW, I've found problems running PostgreSQL on Windows in a multi-CPU environment on w2k3. It runs fine for some period, and then CPU and throughput drop to zero. So far I've been unable to track down any more information than that, other than the fact that I haven't been able to reproduce this on any single-CPU machines.

I have had previous correspondence about this with Magnus (search -general and -hackers). If you uninstall SP1 the problem goes away. We played a bit with potential fixes but didn't find any.

Interesting; does SP2 fix the problem? Anything we can do over here to help?

There is no SP2 for Windows 2003.

Have you tried this with latest-and-greatest CVS HEAD? Meaning with the new semaphore code that was committed a couple of days ago?

I'd be happy to test this if someone could provide a build, or if there's instructions somewhere for doing such a build...

-- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com  work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf  cell: 512-569-9461
Re: [PERFORM] Performance Issues on Opteron Dual Core
I am using the onboard NVRAID controller. It has to be configured in the BIOS, and Windows needs a RAID driver at install to even see the RAID drive. But the onboard controller still utilizes system resources, so it is not a pure software RAID, but a mix of hardware (controller) / software, I guess. But I don't really know a whole lot about it.

-Original Message-
From: Mark Kirkwood [mailto:[EMAIL PROTECTED]]
Sent: Sunday, April 30, 2006 7:04 PM
To: Gregory Stewart
Cc: Theodore Loscalzo
Subject: Re: [PERFORM] Performance Issues on Opteron Dual Core

Gregory Stewart wrote:

Theodore, thank you for your reply. I am using the onboard NVidia RAID that is on the Asus A8N-E motherboard, so it is a software RAID. But as I said, the CPU utilization on that machine is basically 0%. I also ran some system performance tests, and the machine flies, including the HD performance, all better than the dev machine which doesn't use RAID.

(Oops, sorry about so many mails.) Might be worth using Google or Technet to see if there are known performance issues with the (NVidia?) SATA controller on the A8N-E (as there seem to be a lot of crappy SATA controllers around at the moment). Also (I'm not a Windows guy), by software RAID, do you mean you are using the firmware RAID1 from the controller, or are you using Windows software RAID1 on the two disks directly?

Cheers, Mark
Re: [PERFORM] Performance Issues on Opteron Dual Core
Jim,

Have you seen this happening only on W2k3? I am wondering if I should try out 2000 Pro or XP Pro. Not my first choice, but if it works...

-Original Message-
From: Jim C. Nasby [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 02, 2006 3:29 PM
To: Mark Kirkwood
Cc: Gregory Stewart; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance Issues on Opteron Dual Core

On Sun, Apr 30, 2006 at 10:59:56PM +1200, Mark Kirkwood wrote:

Pgadmin can give misleading times for queries that return large result sets over a network, due to:

1/ It takes time to format the (large) result set for display.
2/ It has to count the time spent waiting for the (large) result set to travel across the network.

You aren't running Pgadmin off the dev server are you? If not, check your network link to dev and prod - is one faster than the other? (etc). To eliminate Pgadmin and the network as factors, try wrapping your query in a 'SELECT count(*) FROM (your query here) AS a', and see if it changes anything!

FWIW, I've found problems running PostgreSQL on Windows in a multi-CPU environment on w2k3. It runs fine for some period, and then CPU and throughput drop to zero. So far I've been unable to track down any more information than that, other than the fact that I haven't been able to reproduce this on any single-CPU machines.

-- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com  work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf  cell: 512-569-9461
Re: [PERFORM] Performance Issues on Opteron Dual Core
FWIW, I've found problems running PostgreSQL on Windows in a multi-CPU environment on w2k3. It runs fine for some period, and then CPU and throughput drop to zero. So far I've been unable to track down any more information than that, other than the fact that I haven't been able to reproduce this on any single-CPU machines.

I have had previous correspondence about this with Magnus (search -general and -hackers). If you uninstall SP1 the problem goes away. We played a bit with potential fixes but didn't find any.

Interesting; does SP2 fix the problem? Anything we can do over here to help?

There is no SP2 for Windows 2003.

Have you tried this with latest-and-greatest CVS HEAD? Meaning with the new semaphore code that was committed a couple of days ago?

//Magnus
Re: [PERFORM] Performance Issues on Opteron Dual Core
On Wednesday 03 May 2006 03:29, Magnus Hagander wrote:

FWIW, I've found problems running PostgreSQL on Windows in a multi-CPU environment on w2k3. It runs fine for some period, and then CPU and throughput drop to zero. So far I've been unable to track down any more information than that, other than the fact that I haven't been able to reproduce this on any single-CPU machines.

I have had previous correspondence about this with Magnus (search -general and -hackers). If you uninstall SP1 the problem goes away. We played a bit with potential fixes but didn't find any.

Interesting; does SP2 fix the problem? Anything we can do over here to help?

There is no SP2 for Windows 2003.

That's what I thought. Jim confused me there for a minute.

Have you tried this with latest-and-greatest CVS HEAD? Meaning with the new semaphore code that was committed a couple of days ago?

//Magnus

No I haven't. Worth a test on a rainy afternoon, I'd say...

jan

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!
Re: [PERFORM] Performance Issues on Opteron Dual Core
On Sun, Apr 30, 2006 at 10:59:56PM +1200, Mark Kirkwood wrote:

Pgadmin can give misleading times for queries that return large result sets over a network, due to:

1/ It takes time to format the (large) result set for display.
2/ It has to count the time spent waiting for the (large) result set to travel across the network.

You aren't running Pgadmin off the dev server are you? If not, check your network link to dev and prod - is one faster than the other? (etc). To eliminate Pgadmin and the network as factors, try wrapping your query in a 'SELECT count(*) FROM (your query here) AS a', and see if it changes anything!

FWIW, I've found problems running PostgreSQL on Windows in a multi-CPU environment on w2k3. It runs fine for some period, and then CPU and throughput drop to zero. So far I've been unable to track down any more information than that, other than the fact that I haven't been able to reproduce this on any single-CPU machines.

-- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com  work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf  cell: 512-569-9461
Re: [PERFORM] Performance Issues on Opteron Dual Core
On Tuesday 02 May 2006 16:28, Jim C. Nasby wrote:

On Sun, Apr 30, 2006 at 10:59:56PM +1200, Mark Kirkwood wrote:

Pgadmin can give misleading times for queries that return large result sets over a network, due to:

1/ It takes time to format the (large) result set for display.
2/ It has to count the time spent waiting for the (large) result set to travel across the network.

You aren't running Pgadmin off the dev server are you? If not, check your network link to dev and prod - is one faster than the other? (etc). To eliminate Pgadmin and the network as factors, try wrapping your query in a 'SELECT count(*) FROM (your query here) AS a', and see if it changes anything!

FWIW, I've found problems running PostgreSQL on Windows in a multi-CPU environment on w2k3. It runs fine for some period, and then CPU and throughput drop to zero. So far I've been unable to track down any more information than that, other than the fact that I haven't been able to reproduce this on any single-CPU machines.

I have had previous correspondence about this with Magnus (search -general and -hackers). If you uninstall SP1 the problem goes away. We played a bit with potential fixes but didn't find any.

jan

--
Jan de Visser [EMAIL PROTECTED]
Baruk Khazad! Khazad ai-menu!
Re: [PERFORM] Performance Issues on Opteron Dual Core
On Tue, May 02, 2006 at 06:49:48PM -0400, Jan de Visser wrote:

On Tuesday 02 May 2006 16:28, Jim C. Nasby wrote:

On Sun, Apr 30, 2006 at 10:59:56PM +1200, Mark Kirkwood wrote:

Pgadmin can give misleading times for queries that return large result sets over a network, due to:

1/ It takes time to format the (large) result set for display.
2/ It has to count the time spent waiting for the (large) result set to travel across the network.

You aren't running Pgadmin off the dev server are you? If not, check your network link to dev and prod - is one faster than the other? (etc). To eliminate Pgadmin and the network as factors, try wrapping your query in a 'SELECT count(*) FROM (your query here) AS a', and see if it changes anything!

FWIW, I've found problems running PostgreSQL on Windows in a multi-CPU environment on w2k3. It runs fine for some period, and then CPU and throughput drop to zero. So far I've been unable to track down any more information than that, other than the fact that I haven't been able to reproduce this on any single-CPU machines.

I have had previous correspondence about this with Magnus (search -general and -hackers). If you uninstall SP1 the problem goes away. We played a bit with potential fixes but didn't find any.

Interesting; does SP2 fix the problem? Anything we can do over here to help?

-- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com  work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf  cell: 512-569-9461
Re: [PERFORM] Performance Issues on Opteron Dual Core
Gregory Stewart wrote:

Hello,

We are currently developing a web application and have the webserver and PostgreSQL with our dev db running on a machine with these specs:

Win 2003 Standard
AMD Athlon XP 3000 / 2.1 GHZ
2 Gig ram
120 gig SATA HD
PostgreSQL 8.1.0
Default pgsql configuration + shared_buffers = 30,000

The performance of PostgreSQL and our web application is good on that machine, but we decided to build a dedicated database server for our production database that scales better and that we can also use for internal applications (CRM and so on). To make a long story short, we built a machine with these specs:

Windows 2003 Standard
AMD Opteron 165 Dual Core / running at 2 GHZ
2 gig ram
2 x 150 Gig SATA II HDs in RAID 1 mode (mirror)
PostgreSQL 8.1.3
Default pgsql configuration + shared_buffers = 30,000

Performance tests in Windows show that the new box outperforms our dev machine quite a bit in CPU, HD and memory performance. I did some EXPLAIN ANALYZE tests on queries and the results were very good, 3 to 4 times faster than our dev db. However one thing is really throwing me off. When I open a table with 320,000 rows / 16 fields in the pgadmin tool (v 1.4.0) it takes about 6 seconds on the dev server to display the result (all rows). During these 6 seconds the CPU usage jumps to 90%-100%.

When I open the same table on the new, faster, better production box, it takes 28 seconds!?! During these 28 seconds the CPU usage jumps to 30% for 1 second, and goes back to 0% for the remaining time while it is running the query. What is going wrong here? It is my understanding that PostgreSQL supports multi-core / multi-CPU environments out of the box, but to me it appears that it isn't utilizing either of the two CPUs available. I doubt that my server is so fast that it can perform this operation in idle mode. I played around with the shared buffers and tried out versions 8.1.3, 8.1.2, 8.1.0 with the same result. Has anyone experienced this kind of behaviour before? How representative is the query performance in pgadmin?

Pgadmin can give misleading times for queries that return large result sets over a network, due to:

1/ It takes time to format the (large) result set for display.
2/ It has to count the time spent waiting for the (large) result set to travel across the network.

You aren't running Pgadmin off the dev server are you? If not, check your network link to dev and prod - is one faster than the other? (etc). To eliminate Pgadmin and the network as factors, try wrapping your query in a 'SELECT count(*) FROM (your query here) AS a', and see if it changes anything!

Cheers, Mark
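Mark's wrapper, spelled out (the table name is illustrative): the server still does the full scan, but only a single row travels back, so client-side rendering and network transfer drop out of the measurement:

    SELECT count(*) FROM (SELECT * FROM mytable) AS a;

Running it under psql with \timing on gives a comparable client-side elapsed time on both boxes.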
Re: [PERFORM] Performance issues with custom functions
Tom Lane wrote:

This is fairly hard to read ... it would help a lot if you had shown the view definitions that the query relies on, so that we could match up the plan elements with the query a bit better.

I wasn't sure how helpful it would be. Here they are:

create view development.network as
select * from odbc_select('amsterdam', 'bob.dbo.network') as (
    network_id varchar,
    status_cd varchar,
    name varchar,
    network_action varchar,
    physical_type_cd varchar,
    service_type_cd varchar,
    parent_network_id varchar,
    commission_network_id varchar,
    rep_id varchar,
    tax_id varchar,
    url varchar,
    entry_method_cd varchar,
    entry_individual_type_cd varchar,
    entry_individual_id varchar,
    service varchar(30),
    cost_routine varchar(150),
    commission_rate numeric(5, 5),
    directory_number varchar(11),
    search_url varchar(200),
    member_rate numeric(15, 2),
    free_months numeric(18, 0),
    eligibility_hound varchar(60)
)

create view development.network_state as
select * from odbc_select('amsterdam', 'bob.dbo.network_state') as (
    network_id varchar,
    state_cd varchar,
    product varchar(100),
    status_cd varchar,
    entry_method_cd varchar,
    entry_individual_type_cd varchar,
    entry_individual_id varchar,
    logo_id int,
    from_date timestamp,
    thru_date timestamp
)

create view development.xlat_tbl as
select * from odbc_select('amsterdam', 'xlat_tbl') as (
    field_name varchar,
    field_value varchar,
    status_cd varchar,
    descr varchar,
    descrshort varchar,
    entry_method_cd varchar,
    entry_individual_type_cd varchar,
    entry_individual_id varchar
)

However, I'm thinking the problem is with this IN clause:

where pl.network_id in (select ns.network_id
                        from development.network_state ns
                        where ns.from_date < current_time
                        and (ns.thru_date > current_time or ns.thru_date is null)
                        and (ns.state_cd = pl.state_cd or ns.state_cd=''))

Because the sub-SELECT references pl.state_cd (an outer variable reference), there's no chance of optimizing this into a join-style IN. So the sub-SELECT has to be re-executed for each row of the outer query.

BTW, it's not apparent to me that your "flattened" query gives the same answers as the original. What if a pl row can join to more than one row of the ns output?

Well, I guess you are right. As far as the database can tell, the queries aren't the same. In practice, they are. network_state is essentially tracking our contract dates with different discount healthcare networks. from_date and thru_date track the timeframe we use that network, with thru_date being null for the current networks. Some networks cover all states, in which case state_cd is an empty string. Otherwise, there will be a row per state covered. I can't think of any way to enforce data integrity on this other than maybe via triggers. Is there any way to make things more clear to the database (both in general and on the postgres end of this)?

At the moment, the SQL Server table has the primary key defined as (network_id, state_cd, product), which is ok for now, but I'm realizing going forward this could be an issue if we ever stopped using a network in a state and then went back to it. I guess the next question is, is there any way I can give postgres hints about what constraints exist on the data in these views?

Ed
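One sketch of a fix for the key-reuse worry, on the SQL Server side (the constraint names are made up; the existing key would have to be dropped first): widen the key with from_date so the same network/state/product combination can recur across contract periods:

    -- drop the current three-column key (actual constraint name not shown in the thread)
    alter table bob.dbo.network_state drop constraint pk_network_state;
    -- re-add it with from_date so repeat contract periods don't collide
    alter table bob.dbo.network_state
        add constraint pk_network_state
        primary key (network_id, state_cd, product, from_date);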
Re: [PERFORM] Performance issues with custom functions
Edward Di Geronimo Jr. [EMAIL PROTECTED] writes: ... I'd like to know exactly what causes the bottleneck in the original query, and if there are other approaches to solving the issue in case I need them in future queries.

This is fairly hard to read ... it would help a lot if you had shown the view definitions that the query relies on, so that we could match up the plan elements with the query a bit better. However, I'm thinking the problem is with this IN clause:

where pl.network_id in (select ns.network_id
                        from development.network_state ns
                        where ns.from_date < current_time
                        and (ns.thru_date > current_time or ns.thru_date is null)
                        and (ns.state_cd = pl.state_cd or ns.state_cd=''))

Because the sub-SELECT references pl.state_cd (an outer variable reference), there's no chance of optimizing this into a join-style IN. So the sub-SELECT has to be re-executed for each row of the outer query.

BTW, it's not apparent to me that your "flattened" query gives the same answers as the original. What if a pl row can join to more than one row of the ns output?

regards, tom lane
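Because these views fetch through odbc_select(), each execution of that sub-SELECT is a round trip to the remote server, so per-row re-execution hurts far more than it would against a local table. One workaround sketch that keeps the same semantics ("pl" stands in for the outer table, which isn't shown in the thread) is to materialize the remote rows once into a temp table and correlate against that:

    -- pull the remote rows across ODBC once per session
    -- (current_timestamp used here since from_date/thru_date are timestamps)
    create temp table ns_local as
        select network_id, state_cd
        from development.network_state
        where from_date < current_timestamp
          and (thru_date > current_timestamp or thru_date is null);

    analyze ns_local;

    select pl.*
    from pl
    where pl.network_id in (select ns.network_id
                            from ns_local ns
                            where ns.state_cd = pl.state_cd
                               or ns.state_cd = '');

The sub-select is still re-executed for each outer row, but now against a small local table instead of an ODBC fetch.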