Re: [HACKERS] Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

2013-09-13 Thread Josh Berkus
On 09/13/2013 09:27 AM, Merlin Moncure wrote:
 I happen to be one of those couple people.  Load goes from 0.1 to
 500 without warning then back to 0.1 equally without warning.
 Unfortunately the server is in a different jurisdiction such that it
 makes deep forensic analysis impossible.  I think this is happening
 more and more often as postgres is becoming increasingly deployed on
 high(er) -end servers.  I've personally (alone) dealt with 4-5
 confirmed cases and there have been many more.  We have a problem.

Can you explain a bit more about this?  I'm currently grappling with a
db cluster which has periodic mysterious total LWLock paralysis, and is
configured with 8GB shared_buffers (and 512GB ram installed).

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

2013-09-13 Thread Merlin Moncure
On Fri, Sep 13, 2013 at 2:36 PM, Josh Berkus j...@agliodbs.com wrote:
 On 09/13/2013 09:27 AM, Merlin Moncure wrote:
 I happen to be one of those couple people.  Load goes from 0.1 to
 500 without warning then back to 0.1 equally without warning.
 Unfortunately the server is in a different jurisdiction such that it
 makes deep forensic analysis impossible.  I think this is happening
 more and more often as postgres is becoming increasingly deployed on
 high(er) -end servers.  I've personally (alone) dealt with 4-5
 confirmed cases and there have been many more.  We have a problem.

 Can you explain a bit more about this?  I'm currently grappling with a
 db cluster which has periodic mysterious total LWLock paralysis, and is
 configured with 8GB shared_buffers (and 512GB ram installed).

see thread: 
http://postgresql.1045698.n5.nabble.com/StrategyGetBuffer-optimization-take-2-td5766307.html
plus others.  In this particular case, s_b needs to be set to 2GB or
lower (they tried raising to 4GB for no good reason) and problem
reoccured.

what are the specific symptoms of your problem?  anything interesting
in pg_locks?  is $client willing to experiment with custom patches?

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

2013-09-13 Thread Josh Berkus
On 09/13/2013 12:55 PM, Merlin Moncure wrote:
 what are the specific symptoms of your problem?  anything interesting
 in pg_locks?  is $client willing to experiment with custom patches?

3 servers: 1 master, two replicas.
32-core Xeon, hyperthreaded to 64 cores
512GB RAM each
s_b set to 8GB
Load-balanced between all 3
~~ 11 different databases
combined database size around 600GB
using pgbouncer

Irregularly, during periods of high activity (although not necessarily
peak activity) one or another of the systems will go into paralysis,
with all backends apparently waiting on LWLocks (we need more tracing
information to completely confirm this).  Activity at this time is
usually somewhere between 50 and 100 concurrent queries (and 80 to 150
connections).  pg_locks doesn't think anything is waiting on a lock.

What's notable is that sometimes it's just *one* of the replicas which
goes into paralysis.  If the master gets this issue though, the replicas
experience it soon afterwards.  Increasing wal_buffers from 16GB to 64GB
seems to make this issue happen less frequently, but it doesn't go away
entirely.  Only a restart of the server, or killing all backend, ends
the lockup.

The workload is OLTP, essentially, around 20/80 write/read.  They use
PostGIS.  The other notable thing about their workload is that due to an
ORM defect, they get idle-in-transactions which last from 5 to 15
seconds several times a minute.

They are willing to use experimental patches, but only if those patches
can be applied only to a replica.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

2013-09-13 Thread Josh Berkus
On 09/13/2013 01:25 PM, Andres Freund wrote:
 
 
 Josh Berkus j...@agliodbs.com schrieb:
 
 What's notable is that sometimes it's just *one* of the replicas which
 goes into paralysis.  If the master gets this issue though, the
 replicas
 experience it soon afterwards.  Increasing wal_buffers from 16GB to
 64GB
 seems to make this issue happen less frequently, but it doesn't go away
 entirely.  
 
 Youre talking about MB, not GB here, right?

Oh, yes, right.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

2013-09-13 Thread Andres Freund


Josh Berkus j...@agliodbs.com schrieb:

What's notable is that sometimes it's just *one* of the replicas which
goes into paralysis.  If the master gets this issue though, the
replicas
experience it soon afterwards.  Increasing wal_buffers from 16GB to
64GB
seems to make this issue happen less frequently, but it doesn't go away
entirely.  

Youre talking about MB, not GB here, right?

Andres

-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

2013-09-13 Thread Merlin Moncure
On Fri, Sep 13, 2013 at 3:20 PM, Josh Berkus j...@agliodbs.com wrote:
 On 09/13/2013 12:55 PM, Merlin Moncure wrote:
 what are the specific symptoms of your problem?  anything interesting
 in pg_locks?  is $client willing to experiment with custom patches?

 3 servers: 1 master, two replicas.
 32-core Xeon, hyperthreaded to 64 cores
 512GB RAM each
 s_b set to 8GB
 Load-balanced between all 3
 ~~ 11 different databases
 combined database size around 600GB
 using pgbouncer

 Irregularly, during periods of high activity (although not necessarily
 peak activity) one or another of the systems will go into paralysis,
 with all backends apparently waiting on LWLocks (we need more tracing
 information to completely confirm this).  Activity at this time is
 usually somewhere between 50 and 100 concurrent queries (and 80 to 150
 connections).  pg_locks doesn't think anything is waiting on a lock.

 What's notable is that sometimes it's just *one* of the replicas which
 goes into paralysis.  If the master gets this issue though, the replicas
 experience it soon afterwards.  Increasing wal_buffers from 16GB to 64GB
 seems to make this issue happen less frequently, but it doesn't go away
 entirely.  Only a restart of the server, or killing all backend, ends
 the lockup.

 The workload is OLTP, essentially, around 20/80 write/read.  They use
 PostGIS.  The other notable thing about their workload is that due to an
 ORM defect, they get idle-in-transactions which last from 5 to 15
 seconds several times a minute.

 They are willing to use experimental patches, but only if those patches
 can be applied only to a replica.

ok, points similar:
*) master/slave config (two slaves for me)
*) 'big' server 256GB mem, 32 core
*) 80% approx. (perhaps more)
*) some spacial searching (but not very much)
*) OLTP
*) presentation of load, although in my case it did resolve anywhere
from 30 secs to half hour
*) aside from the spike, 100% healthy

points different
*) application side pooling: 96 app servers, max 5 connections each
(aside: are you using transaction mode pgbouncer?)
*) I saw gripes about relation extension in pg_locks

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

2013-09-13 Thread Josh Berkus
On 09/13/2013 01:58 PM, Merlin Moncure wrote:
 ok, points similar:
 *) master/slave config (two slaves for me)
 *) 'big' server 256GB mem, 32 core
 *) 80% approx. (perhaps more)
 *) some spacial searching (but not very much)
 *) OLTP
 *) presentation of load, although in my case it did resolve anywhere
 from 30 secs to half hour
 *) aside from the spike, 100% healthy
 
 points different
 *) application side pooling: 96 app servers, max 5 connections each
 (aside: are you using transaction mode pgbouncer?)

Yes.

 *) I saw gripes about relation extension in pg_locks

I'll check for that next time.

We're also working on seeing if we can reproduce this under test conditions.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

2013-09-13 Thread Merlin Moncure
On Fri, Sep 13, 2013 at 4:15 PM, Josh Berkus j...@agliodbs.com wrote:
 On 09/13/2013 01:58 PM, Merlin Moncure wrote:
 ok, points similar:
 *) master/slave config (two slaves for me)
 *) 'big' server 256GB mem, 32 core
 *) 80% approx. (perhaps more)
 *) some spacial searching (but not very much)
 *) OLTP
 *) presentation of load, although in my case it did resolve anywhere
 from 30 secs to half hour
 *) aside from the spike, 100% healthy

 points different
 *) application side pooling: 96 app servers, max 5 connections each
 (aside: are you using transaction mode pgbouncer?)

 Yes

Given that, I would be curious to see what dropping connection pool
down to somewhere between 32-64 connections would do.  This is a
band-aid obviously since not everyone can or wants to run pgbouncer.
But this first thing that pops into my mind in these situations is
that it might alleviate the symptoms.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers