Re: [HACKERS] Determine state of cluster (HA)

2017-10-16 Thread Jehan-Guillaume de Rorthais
On Mon, 16 Oct 2017 10:39:16 +0800
Craig Ringer  wrote:

> On 13 October 2017 at 08:50, Joshua D. Drake  wrote:

> > I had a long call with a firm developing front end proxy/cache/HA for
> > Postgres today. Essentially the software is a replacement for PGPool in
> > entirety but also supports analytics etc... When I was asking them about
> > pain points they talked about the below and I was wondering if this is a
> > problem we would like to solve.  
> 
> IMO: no one node knows the full state of the system, or can know it.

+1

> I'd love PostgreSQL to help users more with scaling, HA, etc. But I
> think it's a big job. We'd need:
> 
> - a node topology of some kind, communicated between nodes
> - heartbeat and monitoring
> - failover coordination
> - pooling/proxying
> - STONITH/fencing
> - etc.

And some items on this list cannot be in core. However, there are some things
PostgreSQL can do to make HA easier to deal with.

> That said, I do think it'd be very desirable for us to introduce a
> greater link from a standby to master:
> 
> - Get info about master. We should finish merging recovery.conf into
> postgresql.conf.

agree. +1.

To make things easier for the "cluster manager" piece outside of PostgreSQL, I
would add:

* being able to "demote" a master to a standby without a restart.
* being able to check the status of each node without eating a backend
  connection (to avoid hitting the "max_connections" limit); see the sketch
  after this list
* being able to monitor each step of a switchover (or "controlled
  failover": standby/master role swapping between two nodes)

> > b. Attempt to connect to the host directly, if not...
> > c. use the slave and use the hostname via dblink to connect to the master,
> > as the hostname , i.e. select * from dblink('" + connInfo + "
> > dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
> > This is necessary in the event the hostname used in the recovery.conf file
> > is not resolvable from the outside.  
> 
> OK, so "connect directly" here means from some 3rd party, the one
> doing the querying of the replica.

It seems to me the failover process is the one issuing all required commands to
move the master role to another available standby. The knowledge of the
orchestration and of the final status (if everything went well) lives in this
piece of software. If you want to know where your master is in an exotic or
complex setup, ask whoever was responsible for promoting it.

HA should stay as simple as possible. The more complex the architecture, the
more failure scenarios you will have.

> > 1.  The dblink call doesn't have a way to specify a timeout, so we have to
> > use Java futures to control how long this may take to a reasonable amount of
> > time;  
> 
> statement_timeout doesn't work?
> 
> If not, that sounds like a sensible, separate feature to add. Patches welcome!
> 
> > 2.  NAT mapping may result in us detecting IP ranges that are not accessible
> > to the application nodes.  
> 
> PostgreSQL can't do anything about this one.

You could get the master IP address from the "pg_stat_wal_receiver" view. But
this is still not enough. You might have separate dedicated networks for
applications and for pgsql replication. If you want a standby to tell the
application where to connect to the master, then you'll have to put this
information somewhere yourself, accessible from the application nodes.
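
For instance, a minimal sketch (assuming 9.6 or later, where this view exists
and exposes the connection string used by the WAL receiver):

  -- on the standby: where is the WAL receiver connected to?
  SELECT status, conninfo FROM pg_stat_wal_receiver;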

> > 3.  there is no easy way to monitor for state changes as they happen,
> > allowing faster failovers, everything has to be polled based on events;  

In the corosync world (the clustering piece of the Pacemaker ecosystem), node
failures are detected really fast, in about one second.

As for application failure (pgsql here), this will be polling, yes. But I fail
to imagine how a dying application could warn the cluster before dying: not
only when crashing (systemd could help there), but also, e.g., before entering
an infinite dummy loop or an exhausting one.

> It'd be pretty simple to write a function that sleeps in the backend
> until it's promoted. I don't know off the top of my head if we set all
> proc latches when we promote, but if we don't it's probably harmless
> and somewhat useful to do so.

As soon as the cluster manager has promoted a new master, it can trigger an
event to notify whatever you need.

> Either way, you'd do long-polling. Call the function and let the
> connection block until something interesting happens. Use TCP
> keepalives to make sure you notice if it dies. Have the function
> return when the state changes.

This would still rely on TCP keepalive frequency, back to polling :(
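
For illustration, here is a minimal sketch of the kind of "wait until promoted"
helper discussed above (a hypothetical function, not an existing API; it simply
polls pg_is_in_recovery() from inside the backend, whereas a real
implementation would rather wait on the proc latch as Craig suggests):

  -- hypothetical sketch: block the calling session until this node is promoted
  CREATE OR REPLACE FUNCTION wait_for_promotion(poll_interval float DEFAULT 1.0)
  RETURNS void LANGUAGE plpgsql AS $$
  BEGIN
    -- pg_is_in_recovery() becomes false once the standby has been promoted
    WHILE pg_is_in_recovery() LOOP
      PERFORM pg_sleep(poll_interval);
    END LOOP;
  END;
  $$;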

Regards,




[HACKERS] Replication stats from slaves

2015-12-04 Thread Jehan-Guillaume de Rorthais
Hi hackers,

We are developing and maintaining a Pacemaker resource agent for PostgreSQL
here. This agent is "stateful", i.e. it knows where the master is and where the
slaves are. See [1]. Now that this resource agent is close to what we wanted to
achieve, we will make an official announcement soon, with details & stuff. We
welcome feedback, help with issues, etc., but on github please, not in this
thread.

For the next version, we are now considering improving the switchover mechanism
with appropriate checks for every step. As a reminder, switchover in PostgreSQL
has been possible since the following commit:

  commit 985bd7d49726c9f178558491d31a570d47340459
  Author: Fujii Masao <fu...@postgresql.org>
  Date:   Wed Jun 26 02:14:37 2013 +0900

It requires:

  (1) shut down the master first
  (2) make sure the slave has received the shutdown checkpoint from the old master
  (3) promote the slave to master
  (4) start the old master as a slave

The problem here is step (2). After discussing this IRL with Magnus and
Heikki, they confirmed to me that checking it using pg_xlogdump is fine, e.g.
(reformatted):

  $ pg_xlogdump 000000010000000B00000014 -s 'B/14D845F8' -n1
  rmgr: XLOG
  len (rec/tot): 72/104, 
  tx: 0, 
  lsn: B/14D845F8, prev B/14D84590, bkp: 0000, 
  desc: checkpoint: redo B/14D845F8; 
  tli 1; prev tli 1; fpw true; 
  xid 0/6422462;oid 1183646; multi 1; offset 0; 
  oldest xid 712 in DB 1; oldest multi 1 in DB 1; oldest running xid 0;
  shutdown

This is possible from the resource agent point of view, but not really in a
clean way. It requires:

  * keeping in memory the last LSN of the master after shutdown
  * checking on the slave that this LSN has been received (see the sketch after
    this list)
  * checking the record is an rmgr XLOG record with shutdown information as its
    payload
  * checking this is the very last WAL record received (nothing after it).
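
For the "received" part, a minimal SQL sketch (assuming 9.4+, so the pg_lsn
type is available; the LSN here is just the one from the pg_xlogdump example
above):

  -- on the slave: has it received WAL up to the master's shutdown checkpoint?
  SELECT pg_last_xlog_receive_location() >= 'B/14D845F8'::pg_lsn AS received;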

First, looking at the last LSN and creating a cluster attribute (in the
Pacemaker context) from the old master to share it with the slaves is possible,
but not really elegant for a resource agent. Then, the -n1 in the sample
command above keeps pg_xlogdump from exiting with an error and rc=1. But it is
not compatible with the last check (very last WAL record), and I need to check
that the command succeeded.


A better solution here would be to be able to check, from a view on the slave
(say pg_stat_standby): when it was last connected to the master, the last WAL
restored by log shipping, the last LSN received by streaming replication and
flushed, and how/why the SR connection was closed. For instance, reasons for SR
disconnection might be: master shutdown, too much lag, connection reset.
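
As a purely hypothetical illustration of what the resource agent would like to
run (neither this view nor these columns exist today; the names are only a
sketch of the proposal above):

  -- hypothetical view and columns, as sketched above
  SELECT last_master_connection_time,
         last_restored_wal,
         last_received_lsn,
         last_flushed_lsn,
         disconnect_reason
  FROM pg_stat_standby;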

I can give such a patch a try after some acceptance and discussion of what
exactly we should expose in such a view.

Comments? Guidance? Thoughts? Other solutions?

Thanks!

Regards,

[1] https://github.com/dalibo/pgsql-resource-agent/tree/master/multistate
-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com




[HACKERS] Bad row estimation with indexed func returning bool

2015-09-11 Thread Jehan-Guillaume de Rorthais
Hi,

I faced a correlation problem on a query today and tried the usual trick
consisting of using a functional index and rewriting the query to use it.

However, after writing the function, indexing it and rewriting the query, I
faced an optimizer behavior I was not expecting. I wrote a short scenario to
reproduce it on current HEAD:

  CREATE TABLE correl AS SELECT (i-1)%2 AS i, i%2 AS j 
  FROM generate_series(1,100000) AS i;

  ANALYZE correl ;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE i = 1 AND j = 1;

  -- Seq Scan on correl (rows=25000) (rows=0)
  --Filter: ((i = 1) AND (j = 1))
  --Rows Removed by Filter: 100000
  -- Planning time: 0.356 ms
  -- Execution time: 21.937 ms

  CREATE FUNCTION fix_correl(int, int) RETURNS bool AS
'BEGIN RETURN $1 = 1 AND $2 = 1; END '
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql;

  CREATE INDEX ON correl ( fix_correl(i, j) );

  ANALYZE correl ;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl(i, j);

  -- Index Scan using correl_fix_correl_idx on correl  (rows=3) (rows=0)
  --Index Cond: (fix_correl(i, j) = true)
  --Filter: fix_correl(i, j)
  -- Planning time: 0.421 ms
  -- Execution time: 0.102 ms


Using a function returning an integer works as expected:

  CREATE FUNCTION fix_correl_add(int, int) RETURNS int AS
'BEGIN RETURN $1 + $2 ; END '
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql;

  CREATE INDEX ON correl ( fix_correl_add( i, j ) );

  ANALYZE correl ;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl_add( i, j ) = 2;
  -- Index Scan using correl_fix_correl_add_idx on correl  (rows=1) (rows=0)
  --Index Cond: (fix_correl_add(i, j) = 2) 
  -- Planning time: 0.462 ms
  -- Execution time: 0.102 ms


It also works as expected with a simple index on (i + j) with no function, but
I wanted to have the same conditions in both tests.

Why does the optimizer behave differently in the two cases? Why does it add a
Filter when index-scanning on correl_fix_correl_idx, which indexes booleans?

Please, find the complete scenario in attachment.

Regards,
-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
ioguix@erg:~$ dropdb correl 
ioguix@erg:~$ createdb correl
ioguix@erg:~$ psql -qtX correl 
correl=# CREATE TABLE correl AS SELECT (i-1)%2 AS i, i%2 AS j FROM 
generate_series(1,100000) AS i;

correl=# ANALYZE correl ;

correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE i = 1 AND j = 1;
 Seq Scan on correl  (cost=0.00..1943.00 rows=25000 width=8) (actual 
time=21.898..21.898 rows=0 loops=1)
   Filter: ((i = 1) AND (j = 1))
   Rows Removed by Filter: 100000
 Planning time: 0.356 ms
 Execution time: 21.937 ms

correl=# CREATE FUNCTION fix_correl(int, int) RETURNS bool AS '
BEGIN RETURN $1 = 1 AND $2 = 1; END '
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql;

correl=# CREATE INDEX ON correl ( fix_correl(i, j) );

correl=# ANALYZE correl ;

correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl(i, j);
 Index Scan using correl_fix_correl_idx on correl  (cost=0.29..4.56 rows=3 
width=8) (actual time=0.053..0.053 rows=0 loops=1)
   Index Cond: (fix_correl(i, j) = true)
   Filter: fix_correl(i, j)
 Planning time: 0.421 ms
 Execution time: 0.102 ms

correl=# SELECT * FROM pg_stats WHERE tablename ~ '^correl_';
schemaname | public
tablename  | correl_fix_correl_idx
attname| fix_correl
inherited  | f
null_frac  | 0
avg_width  | 1
n_distinct | 1
most_common_vals   | {f}
most_common_freqs  | {1}
histogram_bounds   | 
correlation| 1
most_common_elems  | 
most_common_elem_freqs | 
elem_count_histogram   | 


correl=# CREATE INDEX ON correl ( (i + j) );

correl=# ANALYZE correl ;

correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE ( i + j ) = 2;
 Index Scan using correl_expr_idx on correl  (cost=0.29..4.31 rows=1 width=8) 
(actual time=0.027..0.027 rows=0 loops=1)
   Index Cond: ((i + j) = 2)
 Planning time: 0.175 ms
 Execution time: 0.076 ms

correl=# CREATE FUNCTION fix_correl_add(int, int) RETURNS int AS '
BEGIN RETURN $1 + $2 ; END '
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql;

correl=# CREATE INDEX ON correl ( fix_correl_add( i, j ) );

correl=# ANALYZE correl ;

correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl_add( i, j ) = 2;
 Index Scan using correl_fix_correl_add_idx on correl  (cost=0.29..4.31 rows=1 
width=8) (actual time=0.052..0.052 rows=0 loops=1)
   Index Cond: (fix_correl_add(i, j) = 2)
 Planning time: 0.462 ms
 Execution time: 0.102 ms

correl=# SELECT * FROM pg_stats WHERE tablename ~ '^correl_';
schemaname | public
tablename  | correl_fix_correl_idx
attname| fix_correl
inherited  | f
null_frac  | 0
avg_width  | 1
n_distinct | 1
most_common_vals   | {f}
most_common_freqs  | {1}
histogram_bounds   | 
correlation| 1
most_common_elems

Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

2015-04-01 Thread Jehan-Guillaume de Rorthais
Hi,

As I'm writing a doc patch for 9.4 - 9.0, I'll discuss this formula below, as
it is the last one accepted by most of you.

On Mon, 3 Nov 2014 12:39:26 -0800
Jeff Janes jeff.ja...@gmail.com wrote:

 It looked to me that the formula, when descending from a previously
 stressed state, would be:
 
 greatest(1 + checkpoint_completion_target) * checkpoint_segments,
 wal_keep_segments) + 1 +
 2 * checkpoint_segments + 1

It lacks an opening parenthesis. I guess the formula is:

  greatest (
(1 + checkpoint_completion_target) * checkpoint_segments,
 wal_keep_segments
  ) 
  + 1 + 2 * checkpoint_segments + 1

 This assumes logs are filled evenly over a checkpoint cycle, which is
 probably not true because there is a spike in full page writes right after
 a checkpoint starts.
 
 But I didn't have a great deal of confidence in my analysis.

The only problem I have with this formula is that, considering
checkpoint_completion_target ~ 1 and wal_keep_segments = 0, it becomes
greatest((1 + 1) * checkpoint_segments, 0) + 1 + 2 * checkpoint_segments + 1,
that is:

  4 * checkpoint_segments + 2

which violates the well-known, observed and default one:

  3 * checkpoint_segments + 1

A value above this formula means the system cannot cope with the number of
files to flush. The doc is right about that:

   If, due to a short-term peak of log output rate, there
   are more than 3 * <varname>checkpoint_segments</varname> + 1
   segment files, the unneeded segment files will be deleted

The formula in the doc is wrong when wal_keep_segments > 0.

 The first line reflects the number of WAL that will be retained as-is,

I agree that these files MUST be retained: the set of checkpoint_segments WALs
being flushed and the checkpoint_completion_target ones written in the
meantime.

 the second is the number that will be recycled for future use before starting 
 to delete them.

I disagree, because the WAL files being written are actually consuming recycled
WALs in the meantime.

Your formula expects written files to be newly created and recycled ones to
never be touched, leading to this checkpoint_segments + 1 difference between
the formulas.

 My reading of the code is that wal_keep_segments is computed from the
 current end of WAL (i.e the checkpoint record), not from the checkpoint
 redo point.  If I distribute the part outside the 'greatest' into both
 branches of the 'greatest', I don't get the same answer as you do for
 either branch.

So the formula, using checkpoint_completion_target = 1, should be:

  greatest (
 checkpoint_segments,
 wal_keep_segments
  ) 
  + 2 * checkpoint_segments + 1
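
As a quick sanity check of this formula: with checkpoint_completion_target ~ 1
and wal_keep_segments = 0, it gives greatest(checkpoint_segments, 0)
+ 2 * checkpoint_segments + 1 = 3 * checkpoint_segments + 1, which matches the
well-known limit quoted from the documentation above.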

Please find attached to this email a documentation patch for 9.4 using this
formula.

Regards,
-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d2392b2..1ed780b 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,17 +546,18 @@
 
   <para>
    There will always be at least one WAL segment file, and will normally
-   not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
-   or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
-   files.  Each segment file is normally 16 MB (though this size can be
+   not be more than greatest(<varname>checkpoint_segments</>, <xref linkend="guc-wal-keep-segments">)
+   + (1 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname>
+   + 1 files.  Each segment file is normally 16 MB (though this size can be
    altered when building the server).  You can use this to estimate space
    requirements for <acronym>WAL</acronym>.
    Ordinarily, when old log segment files are no longer needed, they
    are recycled (that is, renamed to become future segments in the numbered
    sequence). If, due to a short-term peak of log output rate, there
-   are more than 3 * <varname>checkpoint_segments</varname> + 1
-   segment files, the unneeded segment files will be deleted instead
-   of recycled until the system gets back under this limit.
+   are more than greatest(<varname>checkpoint_segments</>, <varname>wal_keep_segments</varname>)
+   + 2 * <varname>checkpoint_segments</varname> + 1 segment files, the
+   unneeded segment files will be deleted instead of recycled until the system
+   gets back under this limit.
   </para>
 
   <para>



Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

2015-03-31 Thread Jehan-Guillaume de Rorthais
On Tue, 3 Mar 2015 11:15:13 -0500
Bruce Momjian br...@momjian.us wrote:

 On Tue, Oct 14, 2014 at 01:21:53PM -0400, Bruce Momjian wrote:
  On Tue, Oct 14, 2014 at 09:20:22AM -0700, Jeff Janes wrote:
   On Mon, Oct 13, 2014 at 12:11 PM, Bruce Momjian br...@momjian.us wrote:
   
   
   I looked into this, and came up with more questions.  Why is
   checkpoint_completion_target involved in the total number of WAL
   segments?  If checkpoint_completion_target is 0.5 (the default), the
   calculation is:
   
           (2 + 0.5) * checkpoint_segments + 1
   
   while if it is 0.9, it is:
   
           (2 + 0.9) * checkpoint_segments + 1
   
   Is this trying to estimate how many WAL files are going to be created
   during the checkpoint?  If so, wouldn't it be (1 +
   checkpoint_completion_target), not 2 +.  My logic is you have the
   old WAL files being checkpointed (that's the 1), plus you have new WAL
   files being created during the checkpoint, which would be
   checkpoint_completion_target * checkpoint_segments, plus one for the
   current WAL file.
   
   
   WAL is not eligible to be recycled until there have been 2 successful
   checkpoints.
   
   So at the end of a checkpoint, you have 1 cycle of WAL which has just
   become eligible for recycling,
   1 cycle of WAL which is now expendable but which is kept anyway, and
   checkpoint_completion_target worth of WAL which has occurred while the
   checkpoint was occurring and is still needed for crash recovery.
  
  OK, so based on this analysis, what is the right calculation?  This?
  
  (1 + checkpoint_completion_target) * checkpoint_segments + 1 +
  max(wal_keep_segments, checkpoint_segments)
 
 Now that we have min_wal_size and max_wal_size in 9.5, I don't see any
 value to figuring out the proper formula for backpatching.

I guess it worth backpatching the documentation as 9.4 - 9.1 will be supported
for somes the next 4 years

-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com




Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

2015-03-31 Thread Jehan-Guillaume de Rorthais
On Tue, 31 Mar 2015 08:24:15 +0200
Jehan-Guillaume de Rorthais j...@dalibo.com wrote:

 On Tue, 3 Mar 2015 11:15:13 -0500
 Bruce Momjian br...@momjian.us wrote:
 
  On Tue, Oct 14, 2014 at 01:21:53PM -0400, Bruce Momjian wrote:
   On Tue, Oct 14, 2014 at 09:20:22AM -0700, Jeff Janes wrote:
On Mon, Oct 13, 2014 at 12:11 PM, Bruce Momjian br...@momjian.us
wrote:


I looked into this, and came up with more questions.  Why is
checkpoint_completion_target involved in the total number of WAL
segments?  If checkpoint_completion_target is 0.5 (the default), the
calculation is:

        (2 + 0.5) * checkpoint_segments + 1

while if it is 0.9, it is:

        (2 + 0.9) * checkpoint_segments + 1

Is this trying to estimate how many WAL files are going to be
created during the checkpoint?  If so, wouldn't it be (1 +
checkpoint_completion_target), not 2 +.  My logic is you have the
old WAL files being checkpointed (that's the 1), plus you have new WAL
files being created during the checkpoint, which would be
checkpoint_completion_target * checkpoint_segments, plus one for the
current WAL file.


WAL is not eligible to be recycled until there have been 2 successful
checkpoints.

So at the end of a checkpoint, you have 1 cycle of WAL which has just
become eligible for recycling,
1 cycle of WAL which is now expendable but which is kept anyway, and
checkpoint_completion_target worth of WAL which has occurred while the
checkpoint was occurring and is still needed for crash recovery.
   
   OK, so based on this analysis, what is the right calculation?  This?
   
 (1 + checkpoint_completion_target) * checkpoint_segments + 1 +
 max(wal_keep_segments, checkpoint_segments)
  
  Now that we have min_wal_size and max_wal_size in 9.5, I don't see any
  value to figuring out the proper formula for backpatching.
 
 I guess it worth backpatching the documentation as 9.4 - 9.1 will be
 supported for somes the next 4 years

Sorry, lack of caffeine this morning. I fired off the mail before correcting
and finishing it:

I guess it worth backpatching the documentation as 9.4 - 9.1 will be supported
for some more years.

I'll give it a try this week.

Regards,
-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com




[HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

2014-09-18 Thread Jehan-Guillaume de Rorthais
Hi hackers,

We spent some time with Guillaume Lelarge studying this issue.

CreateRestartPoint() calls RemoveOldXlogFiles() to drop/recycle old WALs. The
latter calls XLogArchiveCheckDone() to check whether the given WAL can be
dropped. As our slave has archive_mode and archive_command set,
XLogArchivingActive() returns true, forcing XLogArchiveCheckDone() to look for
a .done file. If the corresponding .done file doesn't exist,
XLogArchiveCheckDone() calls XLogArchiveNotify() to actually create the .ready
file.


Now, on a standby, we are supposed to create the .done files by calling
XLogArchiveForceDone() either after a successful restore_command or when
streaming replication finishes and flushes a WAL. This seems like the root
cause of the trouble: maybe a race condition where XLogArchiveForceDone() is
not called for some WAL. As the .done file is never created, we end up calling
XLogArchiveNotify() to create a .ready file, and this WAL will never be
archived by the slave, sitting in the pg_xlog dir forever.

But worse, we experienced this on two 9.2.8 clusters and realized this version
is not supposed to have 256 segments per WAL cycle: a segment sequence is
supposed to end at the *FE one. I'm quoting back the history of my previous
mail as we clearly have some *FF files on the slave side!

0001040E00FF
0001041400DA
0001046E00FF
0001047000FF
00010485000F
000104850010
[...normal WAL sequence...]
000104850052

Note that it seems to happen only during streaming replication, not in pure log
shipping replication. We added the debug message on the master side to make
sure it never archives *FF files.

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Regards,


On Mon, 15 Sep 2014 17:37:24 +0200
Jehan-Guillaume de Rorthais j...@dalibo.com wrote:

 Hi hackers,
 
 An issue that seems related to this has been posted on pgsql-admin. See:
 
   
 http://www.postgresql.org/message-id/caas3tylnxaydz0+zhxlpdvtovhqovr+jsphp30o8kvwqqs0...@mail.gmail.com
 
 How can we help on this issue?
 
 Cheers,
 
 On Thu, 4 Sep 2014 17:50:36 +0200
 Jehan-Guillaume de Rorthais j...@dalibo.com wrote:
 
  Hi hackers,
  
  Since few months, we occasionally see .ready files appearing on some slave
   instances from various context. The two I have in mind are under 9.2.x.
  
  I tried to investigate a bit. These .ready files are created when a WAL file
  from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
  easily experience this by deleting such a file: it is created again at the
  next restartpoint or checkpoint received from the master.
  
  Looking at the WAL in pg_xlog folder corresponding to these .ready files,
  they are all much older than the current WAL cycle in both mtime and name
  logic sequence. As instance on one of these box we have currently 6 of
  those ghost WALs:
  
00021E5300FF
00021F1800FF
0002204700FF
000220BF00FF
0002214000FF
0002237000FF
0002255D00A8
0002255D00A9
[...normal WAL sequence...]
0002255E009D
  
  And on another box:
  
0001040E00FF
0001041400DA
0001046E00FF
0001047000FF
00010485000F
000104850010
[...normal WAL sequence...]
000104850052
  
  So it seems for some reasons, these old WALs were forgotten by the
  restartpoint mechanism when they should have been recylced/deleted. 
  
  For one of these servers, I could correlate this with some brutal
  disconnection of the streaming replication appearing in its logs. But there
  was no known SR disconnection on the second one.
  
  Any idea about this weird behaviour? What can we do to help you investigate
  further?
  
  Regards,
 
 
 



-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com




Re: [HACKERS] .ready files appearing on slaves

2014-09-15 Thread Jehan-Guillaume de Rorthais
Hi hackers,

An issue that seems related to this has been posted on pgsql-admin. See:

  
http://www.postgresql.org/message-id/caas3tylnxaydz0+zhxlpdvtovhqovr+jsphp30o8kvwqqs0...@mail.gmail.com

How can we help on this issue?

Cheers,

On Thu, 4 Sep 2014 17:50:36 +0200
Jehan-Guillaume de Rorthais j...@dalibo.com wrote:

 Hi hackers,
 
 Since few months, we occasionally see .ready files appearing on some slave
  instances from various context. The two I have in mind are under 9.2.x.
 
 I tried to investigate a bit. These .ready files are created when a WAL file
 from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
 easily experience this by deleting such a file: it is created again at the
 next restartpoint or checkpoint received from the master.
 
 Looking at the WAL in pg_xlog folder corresponding to these .ready files, they
 are all much older than the current WAL cycle in both mtime and name logic
 sequence. As instance on one of these box we have currently 6 of those ghost
 WALs:
 
   00021E5300FF
   00021F1800FF
   0002204700FF
   000220BF00FF
   0002214000FF
   0002237000FF
   0002255D00A8
   0002255D00A9
   [...normal WAL sequence...]
   0002255E009D
 
 And on another box:
 
   0001040E00FF
   0001041400DA
   0001046E00FF
   0001047000FF
   00010485000F
   000104850010
   [...normal WAL sequence...]
   000104850052
 
 So it seems for some reasons, these old WALs were forgotten by the
 restartpoint mechanism when they should have been recylced/deleted. 
 
 For one of these servers, I could correlate this with some brutal
 disconnection of the streaming replication appearing in its logs. But there
 was no known SR disconnection on the second one.
 
 Any idea about this weird behaviour? What can we do to help you investigate
 further?
 
 Regards,



-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com




[HACKERS] .ready files appearing on slaves

2014-09-04 Thread Jehan-Guillaume de Rorthais
Hi hackers,

For a few months now, we have occasionally seen .ready files appearing on some
slave instances in various contexts. The two I have in mind are under 9.2.x.

I tried to investigate a bit. These .ready files are created when a WAL file
from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
easily experience this by deleting such a file: it is created again at the next
restartpoint or checkpoint received from the master.
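
For what it's worth, a quick way to spot them from SQL (a hedged sketch,
assuming a superuser session on the standby and the pg_xlog layout used up to
9.x):

  -- list leftover .ready status files on the standby
  SELECT f
  FROM pg_ls_dir('pg_xlog/archive_status') AS f
  WHERE f LIKE '%.ready';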

Looking at the WALs in the pg_xlog folder corresponding to these .ready files,
they are all much older than the current WAL cycle, in both mtime and naming
sequence. For instance, on one of these boxes we currently have 6 of those
ghost WALs:

  00021E5300FF
  00021F1800FF
  0002204700FF
  000220BF00FF
  0002214000FF
  0002237000FF
  0002255D00A8
  0002255D00A9
  [...normal WAL sequence...]
  0002255E009D

And on another box:

  0001040E00FF
  0001041400DA
  0001046E00FF
  0001047000FF
  00010485000F
  000104850010
  [...normal WAL sequence...]
  000104850052

So it seems that, for some reason, these old WALs were forgotten by the
restartpoint mechanism when they should have been recycled/deleted.

For one of these servers, I could correlate this with some abrupt streaming
replication disconnections appearing in its logs. But there was no known SR
disconnection on the second one.

Any idea about this weird behaviour? What can we do to help you investigate
further?

Regards,
-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com




Re: [HACKERS] How can we make beta testing better?

2014-04-22 Thread Jehan-Guillaume de Rorthais
On Thu, 17 Apr 2014 16:42:21 -0700
Josh Berkus j...@agliodbs.com wrote:

 On 04/15/2014 09:53 PM, Rod Taylor wrote:
  A documented beta test process/toolset which does the following would help:
  1) Enables full query logging
  2) Creates a replica of a production DB, record $TIME when it stops.
  3) Allow user to make changes (upgrade to 9.4, change hardware, change
  kernel settings, ...)
  4) Plays queries from the CSV logs starting from $TIME mimicking actual
  timing and transaction boundaries
  
  If Pg can make it easy to duplicate activities currently going on in
  production inside another environment, I would be pleased to fire a couple
  billion queries through it over the next few weeks.
  
  #4 should include reporting useful to the project, such as a sampling of
  queries which performed significantly worse and a few relative performance
  stats for overall execution time.
 
 So we have some software we've been procrastinating on OSS'ing, which does:
 
 1) Takes full query CSV logs from a running postgres instance
 2) Runs them against a target instance in parallel
 3) Records response times for all queries
 
 tsung and pgreplay also do this, but have some limitations which make
 them impractical for a general set of logs to replay.

I've been working on another tool able to replay a scenario recorded directly
from a network dump (see [pgshark]). It works, can be totally transparent from
the application point of view, the tcpdump can run anywhere, and **ALL** the
real traffic can be replayed...but it needs some more work for reporting and
handling parallel sessions. The drawback of using libpcap is that you can lose
packets while capturing, and even a very large capture buffer cannot keep you
safe for hours of a high-speed scenario. So it might require multiple captures
and adjusting the buffer size to capture 100% of the traffic over the required
period.

I tried to quickly write a simple proxy using Perl POE to capture ALL the
traffic safely. My POC was doing nothing but forwarding packets, and IIRC a 30s
stress test with 10 or 20 sessions using pgbench showed a drop of ~60% in
performance. But it was a very quick, mono-process/mono-thread POC.

Maybe another path would be to be able to generate this traffic dump from
PostgreSQL itself (which only has the application level to deal with), in a
format we can feed to pgbench.

 What it would need is:
 
 A) scripting around coordinated backups
 B) Scripting for single-command runs, including changing pg.conf to
 record data.

Changing the pg.conf is pretty easy with ALTER SYSTEM now. But I'm sure we all
have some scripts out there doing this (at least I do).
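
For example (assuming 9.4+, where ALTER SYSTEM exists):

  -- record full query logs, then reload the configuration
  ALTER SYSTEM SET log_min_duration_statement = 0;
  SELECT pg_reload_conf();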

 C) tools to *analyze* the output data, including error messages.

That's what I lack in pgshark so far.

[pgshark] https://github.com/dalibo/pgshark

Cheers,
-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com




Re: [HACKERS] Unarchived WALs deleted after crash

2013-02-20 Thread Jehan-Guillaume de Rorthais

Just a quick top-post to thank you all for this fix, guys!

Cheers,

On 15/02/2013 18:43, Heikki Linnakangas wrote:
 On 15.02.2013 19:16, Fujii Masao wrote:
 On Sat, Feb 16, 2013 at 2:07 AM, Heikki Linnakangas 
 hlinnakan...@vmware.com  wrote:
 On 15.02.2013 18:10, Fujii Masao wrote:
 
 At least in 9.2, when the archived file is restored into
 pg_xlog, its xxx.done archive status file is created. So we
 don't need to check InArchiveRecovery when deleting old WAL
 files. Checking whether xxx.done exists is enough.
 
 Hmm, what about streamed WAL files? I guess we could go back to
 the pre-9.2 coding, and check WalRcvInProgress(). But I didn't
 actually like that too much, it seems rather random that old
 streamed files are recycled when wal receiver is running at the
 time of restartpoint, and otherwise not. Because whether wal
 receiver is running at the time the restartpoint happens has 
 little to do with which files were created by streaming
 replication. With the right pattern of streaming files from the
 master, but always being teporarily disconnected when the
 restartpoint runs, you could still accumulate WAL files
 infinitely.
 
 Walreceiver always creates .done file when it closes the 
 already-flushed WAL file and switches WAL file to next. So we
 also don't  need to check WalRcvInProgress().
 
 Ah, I missed that part of the patch.
 
 Okay, agreed, that's a better fix. I committed your forward-port of
 the 9.2 patch to master, reverted my earlier fix for this bug, and
 simply removed the 
 InArchiveRecovery/ArchiveRecoveryInProgress()/RecoveryInProgress() 
 condition from RemoveOldXlogFiles().
 
 - Heikki

-- 
Jehan-Guillaume de Rorthais
http://www.dalibo.com




[HACKERS] Unarchived WALs deleted after crash

2013-02-14 Thread Jehan-Guillaume de Rorthais
Hi,

I am facing an unexpected behavior on a 9.2.2 cluster that I can
reproduce on current HEAD.

On a cluster with archiving enabled but failing, after a crash of the
postmaster, the checkpoint occurring before leaving recovery mode deletes any
additional WALs, even those waiting to be archived.

Because of this, after recovering from the crash, a previous PITR backup cannot
be used to restore the instance to a time when archiving was failing. Any
slaves fed by WAL shipping or lagging behind in SR need to be recreated.

AFAICT, this is not documented and I would expect the WALs to be
archived by the archiver process when the cluster exits the recovery step.

Here is a simple scenario to reproduce this.

Configuration:

  wal_level = archive
  archive_mode = on
  archive_command = '/bin/false'
  log_checkpoints = on

Scenario:
  createdb test
  psql -c 'create table test as select i, md5(i::text) from
generate_series(1,300) as i;' test
  kill -9 $(head -1 $PGDATA/postmaster.pid)
  pg_ctl start

Using this scenario, the log file shows:

  LOG:  archive command failed with exit code 1
  DETAIL:  The failed archive command was: /bin/false
  WARNING:  transaction log file 00010001 could not
be archived: too many failures
  LOG:  database system was interrupted; last known up at 2013-02-14
16:12:58 CET
  LOG:  database system was not properly shut down; automatic recovery
in progress
  LOG:  crash recovery starts in timeline 1 and has target timeline 1
  LOG:  redo starts at 0/11400078
  LOG:  record with zero length at 0/13397190
  LOG:  redo done at 0/13397160
  LOG:  last completed transaction was at log time 2013-02-14
16:12:58.49303+01
  LOG:  checkpoint starting: end-of-recovery immediate
  LOG:  checkpoint complete: wrote 2869 buffers (17.5%); 0 transaction
log file(s) added, 9 removed, 7 recycled; write=0.023 s, sync=0.468 s,
total=0.739 s; sync files=2, longest=0.426 s, average=0.234 s
  LOG:  autovacuum launcher started
  LOG:  database system is ready to accept connections
  LOG:  archive command failed with exit code 1
  DETAIL:  The failed archive command was: /bin/false
  LOG:  archive command failed with exit code 1
  DETAIL:  The failed archive command was: /bin/false
  LOG:  archive command failed with exit code 1
  DETAIL:  The failed archive command was: /bin/false
  WARNING:  transaction log file 00010011 could not
be archived: too many failures

Before the kill, 00010001 was the WAL to archive.
After the kill, the checkpoint deleted 9 files before exiting recovery
mode, and 00010011 became the first WAL to archive.
00010001 through 00010010 were
removed or recycled.

Is this expected?
-- 
Jehan-Guillaume de Rorthais
http://www.dalibo.com




[HACKERS] Process waiting for ExclusiveLock on INSERT

2012-11-13 Thread Jehan-Guillaume de Rorthais

Hello,

I had a surprising question today from a customer about exclusive
locks on INSERT. His log file shows something like:


==
  LOG:  process 1881 still waiting for ExclusiveLock on extension of
relation 168318652 of database 58429185 after 1000.065 ms
  STATEMENT:  INSERT INTO  (...cols...) VALUES (...values...)
==


While asking on #postgresql and investigating the code, I think I have a
theory. When the system is I/O bound and backends are doing shared buffer
cleanups, at some point they have to extend a relation with new pages, which
requires an exclusive lock on the relation extension to forbid anyone else from
extending it at the same time. At that point, if multiple backends try to
extend the relation, one wins and all the others wait for the lock, leading to
messages in the log file when log_lock_waits is enabled.
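
A hedged way to observe this from SQL while the workload is running (the
relation extension lock shows up in pg_locks with locktype 'extend'):

  -- sessions waiting on (or holding) a relation extension lock
  SELECT pid, relation::regclass, mode, granted
  FROM pg_locks
  WHERE locktype = 'extend';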

This lock would come from src/backend/access/heap/hio.c:432:
==
  /*
   * Have to extend the relation.
   *
   * We have to use a lock to ensure no one else is extending the rel at the
   * same time, else we will both try to initialize the same new page.  We
   * can skip locking for new or temp relations, however, since no one else
   * could be accessing them.
   */
   needLock = !RELATION_IS_LOCAL(relation);

   if (needLock)
       LockRelationForExtension(relation, ExclusiveLock);
==


Is this theory correct, or should this issue be discussed a bit more on
this list?

PFA a small bash script, with a small PostgreSQL configuration in
comments, that reproduces this behavior very often on my laptop, e.g.:


==
  2012-11-13 23:15:51 CET [23137]: [1-1] user=postgres,db=test LOG:
process 23137 still waiting for ExclusiveLock on extension of relation
28118 of database 28115 after 100.086 ms
  2012-11-13 23:15:51 CET [23137]: [2-1] user=postgres,db=test
STATEMENT:  insert into test select i, md5(i::text) from
generate_series(1,1000) AS i
==

Regards,
-- 
Jehan-Guillaume de Rorthais
http://www.dalibo.com


test_exclusivelock_on_insert.sh
Description: application/shellscript



[HACKERS] Logging parameters values on bind error

2012-10-08 Thread Jehan-Guillaume de Rorthais

Hi,

We were struggling today with some Java code binding values that violate some
constraints in a prepared statement.

We don't own the code and couldn't run tests with it. So we tried to find out
whether PostgreSQL was able to log the bound values when the BIND operation
fails, and found that this is not possible in the current release: the error
raised while planning the statement is not caught.

PFA a patch that catches any error while creating the query plan and adds the
parameter values to the error message if log_statement or
log_min_duration_statement would have logged it.

Comments?
-- 
Jehan-Guillaume de Rorthais
http://www.dalibo.com
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***
*** 1715,1726  exec_bind_message(StringInfo input_message)
  
  	pq_getmsgend(input_message);
  
! 	/*
! 	 * Obtain a plan from the CachedPlanSource.  Any cruft from (re)planning
! 	 * will be generated in MessageContext.  The plan refcount will be
! 	 * assigned to the Portal, so it will be released at portal destruction.
! 	 */
! 	cplan = GetCachedPlan(psrc, params, false);
  
  	/*
  	 * Now we can define the portal.
--- 1715,1738 
  
  	pq_getmsgend(input_message);
  
! 	PG_TRY();
! 	{
! 		/*
! 		 * Obtain a plan from the CachedPlanSource.  Any cruft from (re)planning
! 		 * will be generated in MessageContext.  The plan refcount will be
! 		 * assigned to the Portal, so it will be released at portal destruction.
! 		 */
! 
! 		cplan = GetCachedPlan(psrc, params, false);
! 	}
! 	PG_CATCH();
! 	{
! 		if (check_log_statement(portal->stmts) || check_log_duration(msec_str, false) == 2)
! 			errdetail_params(params);
! 
! 		PG_RE_THROW();
! 	}
! 	PG_END_TRY();
  
  	/*
  	 * Now we can define the portal.
