Re: [PERFORM] Suboptimal query plan when using expensive BCRYPT functions

2014-03-24 Thread Heikki Linnakangas

On 03/22/2014 02:59 AM, Erik van Zijst wrote:

Is there any way I can get postgres to perform the hash calculations
on the *result* of the other parts of the where clause, instead of the
other way around? Or else rewrite the query?


The planner doesn't know that the crypt function is expensive. That can 
be fixed with ALTER FUNCTION crypt(text, text) COST <high value>. Even 
with that, I'm not sure if the planner is smart enough to optimize the 
query the way you'd want, but it's worth a try.
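
For concreteness, the suggested command might look like this (the value 1000 is an arbitrary illustrative choice; the default cost for most functions is 100):

```sql
-- Tell the planner that crypt() is expensive relative to other predicates.
-- 1000 is an assumed example value, not a recommendation from this thread.
ALTER FUNCTION crypt(text, text) COST 1000;
```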


- Heikki


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Connection pooling - Number of connections

2014-03-24 Thread Brett Wooldridge
Hi, Brett Wooldridge here, one of the principals of HikariCP.  I thought
I'd wade into the conversation pool a little myself if you guys don't mind.

Speaking to David's point...
 Reaching the maxPoolSize from the minPoolSize means creating the
 connections at the crucial moment where the client application is in the
 desperate need of completing an important query/transaction which the
 primary responsibility since it cannot hold the data collected.

This was one of the reasons I was proposing the fixed pool design.  In my
experience, even in pools that maintain a minimum number of idle
connections, responding to spike demands is problematic.  If you have a
pool with say 30 max. connections, and a 10 minimum idle connection goal, a
sudden spike demand for 20 connections means the pool can satisfy 10
instantly but then is left to [try to] establish 10 connections before the
application's connectionTimeout (read acquisition timeout from the pool) is
reached.  This in turn generates a spike demand on the database slowing
down not only the connection establishments themselves but also slowing
down the completion of transactions that might actually return connections
to the pool.

As I think Tom noted in a slide deck I read somewhere, there is a knee in
the performance curve beyond which additional connections cause a drop in
TPS.  While users often think it is a good idea to have maxPoolSize of 100,
the reality is they can retire/reuse connections faster with a much smaller
pool.  I didn't see a pool of 2 or 3 dozen connections actually impacting
performance much when half of them are idle and half are executing
transactions (ie. the idle ones don't impact the overall performance much).

Finally, one of my contentions was, either your database server has
resources or it doesn't.  Either it has enough memory and processing power
for N connections or it doesn't.  If the pool is set below, near, or at
that capacity what is the purpose of releasing connections in that case?
 Yes, it frees up memory, but that memory is not really available for other
use given that at any instant the maximum capacity of the pool may be
demanded.
Instead, releasing resources only to try to reallocate them during a demand
peak seems counter-productive.
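
The fixed-pool idea can be sketched in a few lines (this is an illustration of the concept only, not HikariCP code; names like `FixedPool` and `acquire_timeout` are made up for the sketch): all connections are created up front, and acquisition simply blocks on a queue until the configured timeout, so a spike never triggers connection establishment.

```python
import queue

class FixedPool:
    """Minimal fixed-size pool sketch: every connection is created at
    startup; acquire() blocks with a timeout instead of opening more."""

    def __init__(self, connect, size, acquire_timeout):
        self._free = queue.Queue()
        self._timeout = acquire_timeout
        for _ in range(size):           # pay the connection cost up front
            self._free.put(connect())

    def acquire(self):
        try:
            return self._free.get(timeout=self._timeout)
        except queue.Empty:
            # analogous to the pool's connectionTimeout being reached
            raise TimeoutError("connectionTimeout reached")

    def release(self, conn):
        self._free.put(conn)

# Demo with dummy "connections".
pool = FixedPool(connect=lambda: object(), size=3, acquire_timeout=0.01)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()   # succeeds immediately; a free connection exists
```

A spike beyond the pool size makes waiters queue up rather than generating a burst of new connections against the database.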

I'd appreciate any shared thoughts on my presuppositions.

-Brett


Re: [PERFORM] Suboptimal query plan when using expensive BCRYPT functions

2014-03-24 Thread Erik van Zijst
On Mon, Mar 24, 2014 at 12:08 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 03/22/2014 02:59 AM, Erik van Zijst wrote:

 Is there any way I can get postgres to perform the hash calculations
 on the *result* of the other parts of the where clause, instead of the
 other way around? Or else rewrite the query?


 The planner doesn't know that the crypt function is expensive. That can be
 fixed with ALTER FUNCTION crypt(text, text) COST <high value>. Even with
 that, I'm not sure if the planner is smart enough to optimize the query the
 way you'd want, but it's worth a try.

Thanks. That's the kind of hint I was looking for.

I just gave it a go, but unfortunately it's not enough to get the
optimizer to change the plan, regardless of the cost value.
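
For the archives, one query-rewrite sketch that can force the evaluation order by hand is the classic `OFFSET 0` optimization fence (the table and column names below are assumptions for illustration; the actual schema never appears in this thread):

```sql
-- Sketch only: "users", "is_active" and "password" are assumed names.
-- The OFFSET 0 subquery is an optimization fence, so the cheap predicates
-- run first and crypt() is evaluated only on the surviving rows.
SELECT *
FROM (
  SELECT * FROM users WHERE is_active   -- cheap filters go here
  OFFSET 0
) AS filtered
WHERE filtered.password = crypt('input-password', filtered.password);
```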

Cheers,
Erik




Re: [PERFORM] slow join not using index properly

2014-03-24 Thread Stefan Amshey
Hi Ilya-

Thanks so much for taking a stab at optimizing that query.  I had to fiddle
a bit with your proposed version in order to get it to function. Here's what I
came up with in the end:

with RECURSIVE qq(cont_key, anc_key) AS
   (
 SELECT
   a1.context_key, a1.ancestor_key
 FROM
   virtual_ancestors a1
 UNION (
   SELECT
 a1.context_key, a1.ancestor_key
   FROM
 virtual_ancestors a1, qq
   WHERE
 a1.context_key > qq.cont_key
   ORDER BY
 a1.context_key LIMIT 1
 )
   )
   SELECT
 DISTINCT a.cont_key
   FROM
 qq a, collection_data, virtual_ancestors a2
   WHERE
 a.cont_key IS NOT NULL
 AND a.anc_key = collection_data.context_key
 AND collection_data.collection_context_key = a2.context_key
 AND a2.ancestor_key = 1072173;


I had to drop the MIN( a1.context_key ) and LIMIT 1 from your version off
of the first select statement in order to avoid syntax issues or other
errors. The version above does produce the same counts as the original, but
in the end it wasn't really a win for us. Here's the plan it produced:

HashAggregate  (cost=707724.36..707726.36 rows=200 width=4) (actual time=27638.844..27639.706 rows=3522 loops=1)
  Buffers: shared hit=79323, temp read=49378 written=47557
  CTE qq
    ->  Recursive Union  (cost=0.00..398869.78 rows=10814203 width=8) (actual time=0.018..20196.397 rows=10821685 loops=1)
          Buffers: shared hit=74449, temp read=49378 written=23779
          ->  Seq Scan on virtual_ancestors a1  (cost=0.00..182584.93 rows=10814193 width=8) (actual time=0.010..2585.411 rows=10821685 loops=1)
                Buffers: shared hit=74443
          ->  Limit  (cost=0.00..0.08 rows=1 width=8) (actual time=7973.297..7973.298 rows=1 loops=1)
                Buffers: shared hit=6, temp read=49378 written=1
                ->  Nested Loop  (cost=0.00..30881281719119.79 rows=389822567470830 width=8) (actual time=7973.296..7973.296 rows=1 loops=1)
                      Join Filter: (a1.context_key > qq.cont_key)
                      Rows Removed by Join Filter: 22470607
                      Buffers: shared hit=6, temp read=49378 written=1
                      ->  Index Scan using virtual_context_key_idx on virtual_ancestors a1  (cost=0.00..18206859.46 rows=10814193 width=8) (actual time=0.018..0.036 rows=3 loops=1)
                            Buffers: shared hit=6
                      ->  WorkTable Scan on qq  (cost=0.00..2162838.60 rows=108141930 width=4) (actual time=0.008..1375.445 rows=7490203 loops=3)
                            Buffers: temp read=49378 written=1
  ->  Hash Join  (cost=25283.37..308847.31 rows=2905 width=4) (actual time=449.167..27629.759 rows=13152 loops=1)
        Hash Cond: (a.anc_key = collection_data.context_key)
        Buffers: shared hit=79323, temp read=49378 written=47557
        ->  CTE Scan on qq a  (cost=0.00..216284.06 rows=10760132 width=8) (actual time=0.021..25265.179 rows=10821685 loops=1)
              Filter: (cont_key IS NOT NULL)
              Buffers: shared hit=74449, temp read=49378 written=47557
        ->  Hash  (cost=25282.14..25282.14 rows=98 width=4) (actual time=373.836..373.836 rows=2109 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 75kB
              Buffers: shared hit=4874
              ->  Hash Join  (cost=17557.15..25282.14 rows=98 width=4) (actual time=368.374..373.013 rows=2109 loops=1)
                    Hash Cond: (a2.context_key = collection_data.collection_context_key)
                    Buffers: shared hit=4874
                    ->  Index Only Scan using virtual_ancestors_pkey on virtual_ancestors a2  (cost=0.00..238.57 rows=272 width=4) (actual time=0.029..1.989 rows=1976 loops=1)
                          Index Cond: (ancestor_key = 1072173)
                          Heap Fetches: 917
                          Buffers: shared hit=883
                    ->  Hash  (cost=10020.40..10020.40 rows=602940 width=8) (actual time=368.057..368.057 rows=603066 loops=1)
                          Buckets: 65536  Batches: 1  Memory Usage: 23558kB
                          Buffers: shared hit=3991
                          ->  Seq Scan on collection_data  (cost=0.00..10020.40 rows=602940 width=8) (actual time=0.006..146.447 rows=603066 loops=1)
                                Buffers: shared hit=3991
Total runtime: 27854.200 ms


I also tried including the MIN( a1.context_key ) in the first select
statement as you had written it, but upon doing that it became necessary to
add a GROUP BY clause, and doing that changed the final number of rows
selected:

 ERROR:  column a1.ancestor_key must appear in the GROUP BY clause or be
 used in an aggregate function
 LINE 4:   min( a1.context_key ), ancestor_key

 ^

Including the LIMIT 1 at the end of the first select statement gave a
syntax error that I couldn't seem to 

Re: [PERFORM] Connection pooling - Number of connections

2014-03-24 Thread Gavin Flower

On 25/03/14 02:27, Brett Wooldridge wrote:
Hi, Brett Wooldridge here, one of the principals of HikariCP.  I 
thought I'd wade into the conversation pool a little myself if you 
guys don't mind.


Speaking to David's point...
 Reaching the maxPoolSize from the minPoolSize means creating the
 connections at the crucial moment where the client application is in the
 desperate need of completing an important query/transaction which the
 primary responsibility since it cannot hold the data collected.

This was one of the reasons I was proposing the fixed pool design.  In 
my experience, even in pools that maintain a minimum number of idle 
connections, responding to spike demands is problematic.  If you have 
a pool with say 30 max. connections, and a 10 minimum idle connection 
goal, a sudden spike demand for 20 connections means the pool can 
satisfy 10 instantly but then is left to [try to] establish 10 
connections before the application's connectionTimeout (read 
acquisition timeout from the pool) is reached.  This in turn generates 
a spike demand on the database slowing down not only the connection 
establishments themselves but also slowing down the completion of 
transactions that might actually return connections to the pool.


As I think Tom noted in a slide deck I read somewhere, there is a 
knee in the performance curve beyond which additional connections 
cause a drop in TPS.  While users often think it is a good idea to 
have maxPoolSize of 100, the reality is they can retire/reuse 
connections faster with a much smaller pool.  I didn't see a pool of 
2 or 3 dozen connections actually impacting performance much when half 
of them are idle and half are executing transactions (ie. the idle 
ones don't impact the overall performance much).


Finally, one of my contentions was, either your database server has 
resources or it doesn't.  Either it has enough memory and processing 
power for N connections or it doesn't.  If the pool is set below, 
near, or at that capacity what is the purpose of releasing connections 
in that case?  Yes, it frees up memory, but that memory is not really 
available for other use given that at any instant the maximum capacity 
of the pool may be demanded.
Instead, releasing resources only to try to reallocate them during a 
demand peak seems counter-productive.


I'd appreciate any shared thoughts on my presuppositions.

-Brett

Surely no code changes are required, as one can simply set the min and 
max pool sizes to be the same?



Cheers,
Gavin



Re: [PERFORM] Connection pooling - Number of connections

2014-03-24 Thread Brett Wooldridge
On Tue, Mar 25, 2014 at 5:24 AM, Gavin Flower gavinflo...@archidevsys.co.nz
 wrote:

  Surely no code changes are required, as one can simply set the min and
 max pool sizes to be the same?

 Cheers,
 Gavin


To be sure it can be implemented that way, but it's a question of design
targets.  For example, if a pool is allowed to grow and shrink, the design
might encompass a pool of threads that try to maintain the
configured-minimum idle connections to respond to spike demands.  And there
is the additional setting in the pool for users to [typically] misconfigure.

However, if the pool is fixed size, and attrition from the pool is only by
idle timeout (typically measured in minutes), the design does not need to
account for spike demand.  Connections that have dropped out can likely
be restored on demand, rather than something running constantly in
the background trying to maintain an idle level.

One of the attributes of HikariCP is a minimalistic set of configuration
options with sane defaults, and minimalistic code.  There are many
competitor pools offering dozens of settings ranging from banal to unsafe.
 HikariCP initially even offered some of these options, but one-by-one
they're getting the ax.  More and more we're trying to look at what is the
true core functionality that users need -- eliminate what is unnecessary
and easily misconfigured.

Thus a debate started over in our group, with some similar views as
expressed here (on both sides).  Guillaume Smet was concerned about the
impact of idle connections on active ones in PostgreSQL (in a fixed pool
scenario) and wanted to ask some of the experts over here.


Re: [PERFORM] Connection pooling - Number of connections

2014-03-24 Thread Gavin Flower

On 25/03/14 13:23, Brett Wooldridge wrote:
On Tue, Mar 25, 2014 at 5:24 AM, Gavin Flower 
gavinflo...@archidevsys.co.nz 
wrote:



Surely no code changes are required, as one can simply set the
min and max pool sizes to be the same?

Cheers,
Gavin


To be sure it can be implemented that way, but it's a question of 
design targets.  For example, if a pool is allowed to grow and shrink, 
the design might encompass a pool of threads that try to maintain the 
configured-minimum idle connections to respond to spike demands.  And 
there is the additional setting in the pool for users to [typically] 
misconfigure.


However, if the pool is fixed size, and attrition from the pool is 
only by idle timeout (typically measured in minutes), the design does 
not need to account for spike demand.  Connections that have dropped 
out can likely be restored on demand, rather than something running 
constantly in the background trying to maintain an idle level.


One of the attributes of HikariCP is a minimalistic set of 
configuration options with sane defaults, and minimalistic code. 
 There are many competitor pools offering dozens of settings ranging 
from banal to unsafe.  HikariCP initially even offered some of these 
options, but one-by-one they're getting the ax.  More and more we're 
trying to look at what is the true core functionality that users need 
-- eliminate what is unnecessary and easily misconfigured.


Thus a debate started over in our group, with some similar views as 
expressed here (on both sides).  Guillaume Smet was concerned about 
the impact of idle connections on active ones in PostgreSQL (in a 
fixed pool scenario) and wanted to ask some of the experts over here.


Would it be a valid option to switch in simpler code when min = max and 
both could be set to the same default?  This would allow more efficient 
code to be run for a fixed pool size and allow a sane default, while 
preserving the option to have a range, though obviously not as simple as 
only allowing a fixed pool size in terms of code complexity.




Re: [PERFORM] Connection pooling - Number of connections

2014-03-24 Thread Brett Wooldridge
Sure.  It's all just code.  It's not particularly a question of efficiency,
I'm sure it could be made equally efficient.  But simpler code-wise would
be not having two implementations, or not having one that is designed to
try to keep up with spike demands.  The question for this group was really
around PostgreSQL performance regarding the impact of idle connections.  If
there is a pool of 20 connections, and half of them are idle, what is the
impact performance-wise of the idle connections on the active connections?


On Tue, Mar 25, 2014 at 10:42 AM, Gavin Flower 
gavinflo...@archidevsys.co.nz wrote:

  On 25/03/14 13:23, Brett Wooldridge wrote:

 On Tue, Mar 25, 2014 at 5:24 AM, Gavin Flower 
 gavinflo...@archidevsys.co.nz wrote:

Surely no code changes are required, as one can simply set the min
 and max pool sizes to be the same?

  Cheers,
 Gavin


  To be sure it can be implemented that way, but it's a question of design
 targets.  For example, if a pool is allowed to grow and shrink, the design
 might encompass a pool of threads that try to maintain the
 configured-minimum idle connections to respond to spike demands.  And there
 is the additional setting in the pool for users to [typically] misconfigure.

  However, if the pool is fixed size, and attrition from the pool is only
 by idle timeout (typically measured in minutes), the design does not need
 to account for spike demand.  Connections that have dropped out can likely
 be restored on demand, rather than something running constantly in the
 background trying to maintain an idle level.

  One of the attributes of HikariCP is a minimalistic set of configuration
 options with sane defaults, and minimalistic code.  There are many
 competitor pools offering dozens of settings ranging from banal to unsafe.
  HikariCP initially even offered some of these options, but one-by-one
 they're getting the ax.  More and more we're trying to look at what is the
 true core functionality that users need -- eliminate what is unnecessary
 and easily misconfigured.

  Thus a debate started over in our group, with some similar views as
 expressed here (on both sides).  Guillaume Smet was concerned about the
 impact of idle connections on active ones in PostgreSQL (in a fixed pool
 scenario) and wanted to ask some of the experts over here.

   Would it be a valid option to switch in simpler code when min = max and
 both could be set to the same default?  This would allow more efficient
 code to be run for a fixed pool size and allow a sane default, while
 preserving the option to have a range, though obviously not as simple as
 only allowing a fixed pool size in terms of code complexity.




[PERFORM] pg_dump vs pg_basebackup

2014-03-24 Thread gianfranco caca
Hi,

Can anyone tell me the difference in performance between pg_dump and 
pg_basebackup if I want to back up a large database?

Thanks
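
For reference, the two tools do quite different things -- pg_dump makes a logical dump of one database, while pg_basebackup takes a physical copy of the whole cluster. A minimal invocation of each might look like this (database name and paths are placeholders):

```shell
# Logical dump of one database in compressed custom format
# (restore later with pg_restore)
pg_dump -Fc -f mydb.dump mydb

# Physical base backup of the entire cluster
# (tar format, gzip-compressed, with progress reporting)
pg_basebackup -D /backups/base -Ft -z -P
```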