Re: [PERFORM] Suboptimal query plan when using expensive BCRYPT functions
On 03/22/2014 02:59 AM, Erik van Zijst wrote:
> Is there any way I can get postgres to perform the hash calculations on the *result* of the other parts of the where clause, instead of the other way around? Or else rewrite the query?

The planner doesn't know that the crypt function is expensive. That can be fixed with ALTER FUNCTION crypt(text, text) COST <high value>. Even with that, I'm not sure if the planner is smart enough to optimize the query the way you'd want, but it's worth a try.

- Heikki

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
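Heikki's suggestion can be written out as follows; the cost value here is an arbitrary illustration (COST is measured in multiples of cpu_operator_cost, and the default for a C-language function like crypt is 1), so the number would need tuning for a real workload:

```sql
-- Tell the planner that crypt() is far more expensive than an ordinary
-- operator, so it prefers to evaluate cheaper predicates first.
-- 10000 is a guessed value for illustration, not a recommendation.
ALTER FUNCTION crypt(text, text) COST 10000;
```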
[PERFORM] Connection pooling - Number of connections
Hi, Brett Wooldridge here, one of the principals of HikariCP. I thought I'd wade into the conversation pool a little myself if you guys don't mind.

Speaking to David's point...
> Reaching the maxPoolSize from the minPoolSize means creating the connections at the crucial moment where the client application is in desperate need of completing an important query/transaction, which is the primary responsibility, since it cannot hold the data collected.

This was one of the reasons I was proposing the fixed pool design. In my experience, even in pools that maintain a minimum number of idle connections, responding to spike demands is problematic. If you have a pool with, say, 30 max connections and a 10 minimum idle connection goal, a sudden spike demand for 20 connections means the pool can satisfy 10 instantly but is then left to [try to] establish 10 connections before the application's connectionTimeout (read: acquisition timeout from the pool) is reached. This in turn generates a spike demand on the database, slowing down not only the connection establishments themselves but also the completion of transactions that might actually return connections to the pool.

As I think Tom noted in a slide deck I read somewhere, there is a knee in the performance curve beyond which additional connections cause a drop in TPS. While users often think it is a good idea to have a maxPoolSize of 100, the reality is they can retire/reuse connections faster with a much smaller pool. I haven't seen a pool of two or three dozen connections actually impacting performance much when half of them are idle and half are executing transactions (i.e. the idle ones don't impact the overall performance much).

Finally, one of my contentions was: either your database server has resources or it doesn't. Either it has enough memory and processing power for N connections or it doesn't. If the pool is set below, near, or at that capacity, what is the purpose of releasing connections in that case? Yes, it frees up memory, but that memory is not really available for other use given that at any instant the maximum capacity of the pool may be demanded. Releasing resources only to try to reallocate them during a demand peak seems counter-productive.

I'd appreciate any shared thoughts on my presuppositions.

-Brett
Re: [PERFORM] Suboptimal query plan when using expensive BCRYPT functions
On Mon, Mar 24, 2014 at 12:08 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> On 03/22/2014 02:59 AM, Erik van Zijst wrote:
>> Is there any way I can get postgres to perform the hash calculations on the *result* of the other parts of the where clause, instead of the other way around? Or else rewrite the query?
>
> The planner doesn't know that the crypt function is expensive. That can be fixed with ALTER FUNCTION crypt(text, text) COST <high value>. Even with that, I'm not sure if the planner is smart enough to optimize the query the way you'd want, but it's worth a try.

Thanks. That's the kind of hint I was looking for. I just gave it a go, but unfortunately it's not enough to get the optimizer to change the plan, regardless of the cost value.

Cheers,
Erik
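When raising the function cost isn't enough, a commonly used PostgreSQL workaround (a general trick, not something proposed in this thread) is to evaluate the cheap predicates in a subquery fenced off with OFFSET 0, so the expensive function only runs on the rows that survive. The table and column names below are hypothetical, purely for illustration:

```sql
-- Hypothetical sketch: apply cheap filters first, then run the expensive
-- crypt() comparison only on the surviving rows.  OFFSET 0 prevents the
-- planner from flattening the subquery into the outer query.
SELECT *
FROM (
    SELECT u.*
    FROM users u
    WHERE u.is_active          -- cheap predicates go here
    OFFSET 0
) candidates
WHERE candidates.password = crypt('entered-password', candidates.password);
```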
Re: [PERFORM] slow join not using index properly
Hi Ilya-

Thanks so much for taking a stab at optimizing that query. I had to fiddle a bit with your proposed version in order to get it to function. Here's what I came up with in the end:

WITH RECURSIVE qq(cont_key, anc_key) AS (
    SELECT a1.context_key, ancestor_key FROM virtual_ancestors a1
  UNION (
    SELECT a1.context_key, a1.ancestor_key
    FROM virtual_ancestors a1, qq
    WHERE context_key > qq.cont_key
    ORDER BY context_key
    LIMIT 1
  )
)
SELECT DISTINCT a.cont_key
FROM qq a, collection_data, virtual_ancestors a2
WHERE a.cont_key IS NOT NULL
  AND a.anc_key = collection_data.context_key
  AND collection_data.collection_context_key = a2.context_key
  AND a2.ancestor_key = 1072173;

I had to drop the MIN( a1.context_key ) and LIMIT 1 off of the first select statement in your version in order to avoid syntax issues or other errors. The version above does produce the same counts as the original, but in the end it wasn't really a win for us. Here's the plan it produced:

 HashAggregate  (cost=707724.36..707726.36 rows=200 width=4) (actual time=27638.844..27639.706 rows=3522 loops=1)
   Buffers: shared hit=79323, temp read=49378 written=47557
   CTE qq
     ->  Recursive Union  (cost=0.00..398869.78 rows=10814203 width=8) (actual time=0.018..20196.397 rows=10821685 loops=1)
           Buffers: shared hit=74449, temp read=49378 written=23779
           ->  Seq Scan on virtual_ancestors a1  (cost=0.00..182584.93 rows=10814193 width=8) (actual time=0.010..2585.411 rows=10821685 loops=1)
                 Buffers: shared hit=74443
           ->  Limit  (cost=0.00..0.08 rows=1 width=8) (actual time=7973.297..7973.298 rows=1 loops=1)
                 Buffers: shared hit=6, temp read=49378 written=1
                 ->  Nested Loop  (cost=0.00..30881281719119.79 rows=389822567470830 width=8) (actual time=7973.296..7973.296 rows=1 loops=1)
                       Join Filter: (a1.context_key > qq.cont_key)
                       Rows Removed by Join Filter: 22470607
                       Buffers: shared hit=6, temp read=49378 written=1
                       ->  Index Scan using virtual_context_key_idx on virtual_ancestors a1  (cost=0.00..18206859.46 rows=10814193 width=8) (actual time=0.018..0.036 rows=3 loops=1)
                             Buffers: shared hit=6
                       ->  WorkTable Scan on qq  (cost=0.00..2162838.60 rows=108141930 width=4) (actual time=0.008..1375.445 rows=7490203 loops=3)
                             Buffers: temp read=49378 written=1
   ->  Hash Join  (cost=25283.37..308847.31 rows=2905 width=4) (actual time=449.167..27629.759 rows=13152 loops=1)
         Hash Cond: (a.anc_key = collection_data.context_key)
         Buffers: shared hit=79323, temp read=49378 written=47557
         ->  CTE Scan on qq a  (cost=0.00..216284.06 rows=10760132 width=8) (actual time=0.021..25265.179 rows=10821685 loops=1)
               Filter: (cont_key IS NOT NULL)
               Buffers: shared hit=74449, temp read=49378 written=47557
         ->  Hash  (cost=25282.14..25282.14 rows=98 width=4) (actual time=373.836..373.836 rows=2109 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 75kB
               Buffers: shared hit=4874
               ->  Hash Join  (cost=17557.15..25282.14 rows=98 width=4) (actual time=368.374..373.013 rows=2109 loops=1)
                     Hash Cond: (a2.context_key = collection_data.collection_context_key)
                     Buffers: shared hit=4874
                     ->  Index Only Scan using virtual_ancestors_pkey on virtual_ancestors a2  (cost=0.00..238.57 rows=272 width=4) (actual time=0.029..1.989 rows=1976 loops=1)
                           Index Cond: (ancestor_key = 1072173)
                           Heap Fetches: 917
                           Buffers: shared hit=883
                     ->  Hash  (cost=10020.40..10020.40 rows=602940 width=8) (actual time=368.057..368.057 rows=603066 loops=1)
                           Buckets: 65536  Batches: 1  Memory Usage: 23558kB
                           Buffers: shared hit=3991
                           ->  Seq Scan on collection_data  (cost=0.00..10020.40 rows=602940 width=8) (actual time=0.006..146.447 rows=603066 loops=1)
                                 Buffers: shared hit=3991
 Total runtime: 27854.200 ms

I also tried including the MIN( a1.context_key ) in the first select statement as you had written it, but upon doing that it became necessary to add a GROUP BY clause, and doing that changed the final number of rows selected:

ERROR:  column a1.ancestor_key must appear in the GROUP BY clause or be used in an aggregate function
LINE 4:     min( a1.context_key ), ancestor_key
                                   ^

Including the LIMIT 1 at the end of the first select statement gave a syntax error that I couldn't seem to
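For reference, the shape Ilya's one-row-per-step recursion was presumably aiming for is the "loose index scan" emulation documented on the PostgreSQL wiki, which fetches one distinct key per recursive step through the index rather than recursing over the whole CTE result. This is a generic sketch against virtual_ancestors only, not a drop-in replacement for the full query above:

```sql
-- Generic "loose index scan" sketch: walk distinct context_key values one
-- at a time via an index on context_key, instead of scanning every row.
WITH RECURSIVE t AS (
    (SELECT context_key FROM virtual_ancestors ORDER BY context_key LIMIT 1)
  UNION ALL
    SELECT (SELECT a.context_key
            FROM virtual_ancestors a
            WHERE a.context_key > t.context_key
            ORDER BY a.context_key
            LIMIT 1)
    FROM t
    WHERE t.context_key IS NOT NULL
)
SELECT context_key FROM t WHERE context_key IS NOT NULL;
```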
Re: [PERFORM] Connection pooling - Number of connections
On 25/03/14 02:27, Brett Wooldridge wrote:
> Hi, Brett Wooldridge here, one of the principals of HikariCP. [...]
> This was one of the reasons I was proposing the fixed pool design. [...]
> If the pool is set below, near, or at that capacity what is the purpose of releasing connections in that case? [...]
> I'd appreciate any shared thoughts on my presuppositions. -Brett

Surely no code changes are required, as one can simply set the min and max pool sizes to be the same?

Cheers,
Gavin
Re: [PERFORM] Connection pooling - Number of connections
On Tue, Mar 25, 2014 at 5:24 AM, Gavin Flower gavinflo...@archidevsys.co.nz wrote:
> Surely no code changes are required, as one can simply set the min and max pool sizes to be the same?
>
> Cheers,
> Gavin

To be sure it can be implemented that way, but it's a question of design targets. For example, if a pool is allowed to grow and shrink, the design might encompass a pool of threads that try to maintain the configured minimum of idle connections in order to respond to spike demands. And there is the additional setting in the pool for users to [typically] misconfigure.

However, if the pool is a fixed size, and attrition from the pool is only by idle timeout (typically measured in minutes), the design does not need to account for spike demand. Connections that have dropped out can simply be restored on demand, rather than something running constantly in the background trying to maintain an idle level.

One of the attributes of HikariCP is a minimalistic set of configuration options with sane defaults, and minimalistic code. There are many competitor pools offering dozens of settings ranging from banal to unsafe. HikariCP initially even offered some of these options, but one by one they're getting the axe. More and more we're trying to look at what the true core functionality is that users need -- and eliminate what is unnecessary and easily misconfigured.

Thus a debate started over in our group, with some similar views as expressed here (on both sides). Guillaume Smet was concerned about the impact of idle connections on active ones in PostgreSQL (in a fixed pool scenario) and wanted to ask some of the experts over here.
Re: [PERFORM] Connection pooling - Number of connections
On 25/03/14 13:23, Brett Wooldridge wrote:
> To be sure it can be implemented that way, but it's a question of design targets. [...]
> Thus a debate started over in our group, with some similar views as expressed here (on both sides). Guillaume Smet was concerned about the impact of idle connections on active ones in PostgreSQL (in a fixed pool scenario) and wanted to ask some of the experts over here.

Would it be a valid option to switch in simpler code when min = max, and have both set to the same default? This would allow more efficient code to be run for a fixed pool size and allow a sane default, while preserving the option to have a range -- though obviously not as simple, in terms of code complexity, as only allowing a fixed pool size.
Re: [PERFORM] Connection pooling - Number of connections
Sure. It's all just code. It's not particularly a question of efficiency; I'm sure it could be made equally efficient. But simpler, code-wise, would be not having two implementations, or not having one that is designed to try to keep up with spike demands.

The question for this group was really around PostgreSQL performance regarding the impact of idle connections. If there is a pool of 20 connections, and half of them are idle, what is the impact performance-wise of the idle connections on the active connections?

On Tue, Mar 25, 2014 at 10:42 AM, Gavin Flower gavinflo...@archidevsys.co.nz wrote:
> Would it be a valid option to switch in simpler code when min = max and both could be set to the same default? [...]
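One empirical way to approach that question (my suggestion, not something proposed in the thread) is to watch the backend states from pg_stat_activity while benchmarking, to see how many of the pool's connections are actually doing work at any instant:

```sql
-- Count backends by state (the state column exists in PostgreSQL 9.2+).
-- Run this while the pool is under load; 'idle' rows are pooled
-- connections holding a backend but executing nothing.
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
```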
[PERFORM] pg_dump vs pg_basebackup
Hi,

Can anyone tell me the difference, and the relative performance, between pg_dump and pg_basebackup if I want to back up a large database?

Thanks