Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-08-22 Thread Fujii Masao
On Fri, Aug 17, 2012 at 8:38 AM, Bruce Momjian br...@momjian.us wrote:
 On Thu, Mar  8, 2012 at 08:20:02AM -0500, Robert Haas wrote:
 On Sat, Jan 28, 2012 at 8:57 AM, Simon Riggs si...@2ndquadrant.com wrote:
  On Thu, Jan 26, 2012 at 5:27 AM, Fujii Masao masao.fu...@gmail.com wrote:
 
  One thing I would like to ask is why you think walreceiver is more
  appropriate for writing the XLOG_END_OF_RECOVERY record than the startup
  process. I was thinking the opposite, because if the startup process
  writes it, we might be able to skip the end-of-recovery checkpoint even
  in the file-based log-shipping case.
 
  Right now, WALReceiver has one code path/use case.
 
  Startup has so many, it's much harder to know whether we'll screw up one of
  them.
 
  If we can add it in either place then I choose the simplest, most
  relevant place. If the code is the same, we can move it around later.
 
  Let me write the code and then we can think some more.

 Are we still considering trying to do this for 9.2?  Seems it's been
 over a month without a new patch, and it's not entirely clear that we
 know what the design should be.

 Did this get completed?

No, not yet.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-08-16 Thread Bruce Momjian
On Thu, Mar  8, 2012 at 08:20:02AM -0500, Robert Haas wrote:
 On Sat, Jan 28, 2012 at 8:57 AM, Simon Riggs si...@2ndquadrant.com wrote:
  On Thu, Jan 26, 2012 at 5:27 AM, Fujii Masao masao.fu...@gmail.com wrote:
 
  One thing I would like to ask is why you think walreceiver is more
  appropriate for writing the XLOG_END_OF_RECOVERY record than the startup
  process. I was thinking the opposite, because if the startup process
  writes it, we might be able to skip the end-of-recovery checkpoint even
  in the file-based log-shipping case.
 
  Right now, WALReceiver has one code path/use case.
 
  Startup has so many, it's much harder to know whether we'll screw up one of
  them.
 
  If we can add it in either place then I choose the simplest, most
  relevant place. If the code is the same, we can move it around later.
 
  Let me write the code and then we can think some more.
 
 Are we still considering trying to do this for 9.2?  Seems it's been
 over a month without a new patch, and it's not entirely clear that we
 know what the design should be.

Did this get completed?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-03-09 Thread Simon Riggs
On Thu, Mar 8, 2012 at 1:20 PM, Robert Haas robertmh...@gmail.com wrote:

 Are we still considering trying to do this for 9.2?  Seems it's been
 over a month without a new patch, and it's not entirely clear that we
 know what the design should be.

It's important, but not ready.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-03-09 Thread Robert Haas
On Fri, Mar 9, 2012 at 3:00 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, Mar 8, 2012 at 1:20 PM, Robert Haas robertmh...@gmail.com wrote:

 Are we still considering trying to do this for 9.2?  Seems it's been
 over a month without a new patch, and it's not entirely clear that we
 know what the design should be.

 It's important, but not ready.

Thanks for the update.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-03-08 Thread Robert Haas
On Sat, Jan 28, 2012 at 8:57 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, Jan 26, 2012 at 5:27 AM, Fujii Masao masao.fu...@gmail.com wrote:

 One thing I would like to ask is why you think walreceiver is more
 appropriate for writing the XLOG_END_OF_RECOVERY record than the startup
 process. I was thinking the opposite, because if the startup process
 writes it, we might be able to skip the end-of-recovery checkpoint even
 in the file-based log-shipping case.

 Right now, WALReceiver has one code path/use case.

 Startup has so many, it's much harder to know whether we'll screw up one of
 them.

 If we can add it in either place then I choose the simplest, most
 relevant place. If the code is the same, we can move it around later.

 Let me write the code and then we can think some more.

Are we still considering trying to do this for 9.2?  Seems it's been
over a month without a new patch, and it's not entirely clear that we
know what the design should be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-01-28 Thread Simon Riggs
On Thu, Jan 26, 2012 at 5:27 AM, Fujii Masao masao.fu...@gmail.com wrote:

 One thing I would like to ask is why you think walreceiver is more
 appropriate for writing the XLOG_END_OF_RECOVERY record than the startup
 process. I was thinking the opposite, because if the startup process
 writes it, we might be able to skip the end-of-recovery checkpoint even
 in the file-based log-shipping case.

Right now, WALReceiver has one code path/use case.

Startup has so many, it's much harder to know whether we'll screw up one of them.

If we can add it in either place then I choose the simplest, most
relevant place. If the code is the same, we can move it around later.

Let me write the code and then we can think some more.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-01-25 Thread Fujii Masao
On Fri, Jan 20, 2012 at 12:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Jan 18, 2012 at 7:15 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Nov 13, 2011 at 5:13 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs si...@2ndquadrant.com wrote:

 When I say skip the shutdown checkpoint, I mean remove it from the
 critical path of required actions at the end of recovery. We can still
 have a normal checkpoint kicked off at that time, but that no longer
 needs to be on the critical path.

 Any problems foreseen? If not, looks like a quick patch.

 Patch attached for discussion/review.

 This feature is what I want, and very helpful to shorten the failover time in
 streaming replication.

 Here are the review comments. Though I've not checked enough whether
 this feature works fine in all recovery patterns yet.

 LocalSetXLogInsertAllowed() must be called before LogEndOfRecovery().
 LocalXLogInsertAllowed must be set to -1 after LogEndOfRecovery().

 XLOG_END_OF_RECOVERY record is written to the WAL file with new
 assigned timeline ID. But it must be written to the WAL file with old one.
 Otherwise, when re-entering a recovery after failover, we cannot find
 XLOG_END_OF_RECOVERY record at all.

 Before XLOG_END_OF_RECOVERY record is written,
 RmgrTable[rmid].rm_cleanup() might write WAL records. They also
 should be written to the WAL file with old timeline ID.

 When recovery target is specified, we cannot write new WAL to the file
 with old timeline, because that would mean valid WAL records in it are
 overwritten with new WAL. So when recovery target is specified,
 ISTM that we cannot skip end of recovery checkpoint. Or we might need
 to save all information about timelines in the database cluster instead
 of writing XLOG_END_OF_RECOVERY record, and use it when re-entering
 a recovery.

 LogEndOfRecovery() seems to need to call XLogFlush(). Otherwise,
 what if the server crashes after the new timeline history file is created and
 recovery.conf is removed, but before the XLOG_END_OF_RECOVERY record
 has been flushed to disk?

 During recovery, when we replay XLOG_END_OF_RECOVERY record, we
 should close the currently-opened WAL file and read the WAL file with
 the timeline which XLOG_END_OF_RECOVERY record indicates.
 Otherwise, when re-entering a recovery with old timeline, we cannot
 reach new timeline.



 OK, some bad things there, thanks for the insightful comments.



 I think you're right that we can't skip the checkpoint if xlog_cleanup
 writes WAL records, since that implies at least one and maybe more
 blocks have changed and need to be flushed. That can be improved upon,
 but not now in 9.2. Cleanup WAL is written in either the old or the new
 timeline, depending upon whether we increment it. So we don't need to
 change anything there, IMHO.

 The big problem is how we handle crash recovery after we start up
 without a checkpoint. No quick fixes there.

 So let me rethink this: The idea was that we can skip the checkpoint
 if we promote to normal running during streaming replication.

 WALReceiver has been writing to WAL files, so can write more data
 without all of the problems noted. Rather than write the
 XLOG_END_OF_RECOVERY record via XLogInsert we should write that **from
 the WALreceiver** as a dummy record by direct injection into the WAL
 stream. So the Startup process sees a WAL record that looks like it
 was written by the primary saying promote yourself, although it was
 actually written locally by WALreceiver when requested to shutdown.
 That doesn't damage anything because we know we've received all the
 WAL there is. Most importantly we don't need to change any of the
 logic in a way that endangers the other code paths at end of recovery.

 Writing the record in that way means we would need to calculate the
 new tli slightly earlier, so we can input the correct value into the
 record. That also solves the problem of how to get additional standbys
 to follow the new master. The XLOG_END_OF_RECOVERY record is simply
 the contents of the newly written tli history file.

 If we skip the checkpoint and then crash before the next checkpoint we
 just change timeline when we see XLOG_END_OF_RECOVERY. When we replay
 the XLOG_END_OF_RECOVERY we copy the contents to the appropriate tli
 file and then switch to it.

 So this solves 2 problems: having other standbys follow us when they
 don't have archiving, and avoids the checkpoint.

 Let me know what you think.

Looks good to me.

One thing I would like to ask is why you think walreceiver is more
appropriate for writing the XLOG_END_OF_RECOVERY record than the startup
process. I was thinking the opposite, because if the startup process
writes it, we might be able to skip the end-of-recovery checkpoint even
in the file-based log-shipping case.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-01-19 Thread Simon Riggs
On Wed, Jan 18, 2012 at 7:15 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Nov 13, 2011 at 5:13 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs si...@2ndquadrant.com wrote:

 When I say skip the shutdown checkpoint, I mean remove it from the
 critical path of required actions at the end of recovery. We can still
 have a normal checkpoint kicked off at that time, but that no longer
 needs to be on the critical path.

 Any problems foreseen? If not, looks like a quick patch.

 Patch attached for discussion/review.

 This feature is what I want, and very helpful to shorten the failover time in
 streaming replication.

 Here are the review comments. Though I've not checked enough whether
 this feature works fine in all recovery patterns yet.

 LocalSetXLogInsertAllowed() must be called before LogEndOfRecovery().
 LocalXLogInsertAllowed must be set to -1 after LogEndOfRecovery().

 XLOG_END_OF_RECOVERY record is written to the WAL file with new
 assigned timeline ID. But it must be written to the WAL file with old one.
 Otherwise, when re-entering a recovery after failover, we cannot find
 XLOG_END_OF_RECOVERY record at all.

 Before XLOG_END_OF_RECOVERY record is written,
 RmgrTable[rmid].rm_cleanup() might write WAL records. They also
 should be written to the WAL file with old timeline ID.

 When recovery target is specified, we cannot write new WAL to the file
 with old timeline, because that would mean valid WAL records in it are
 overwritten with new WAL. So when recovery target is specified,
 ISTM that we cannot skip end of recovery checkpoint. Or we might need
 to save all information about timelines in the database cluster instead
 of writing XLOG_END_OF_RECOVERY record, and use it when re-entering
 a recovery.

 LogEndOfRecovery() seems to need to call XLogFlush(). Otherwise,
 what if the server crashes after the new timeline history file is created and
 recovery.conf is removed, but before the XLOG_END_OF_RECOVERY record
 has been flushed to disk?

 During recovery, when we replay XLOG_END_OF_RECOVERY record, we
 should close the currently-opened WAL file and read the WAL file with
 the timeline which XLOG_END_OF_RECOVERY record indicates.
 Otherwise, when re-entering a recovery with old timeline, we cannot
 reach new timeline.



OK, some bad things there, thanks for the insightful comments.



I think you're right that we can't skip the checkpoint if xlog_cleanup
writes WAL records, since that implies at least one and maybe more
blocks have changed and need to be flushed. That can be improved upon,
but not now in 9.2. Cleanup WAL is written in either the old or the new
timeline, depending upon whether we increment it. So we don't need to
change anything there, IMHO.

The big problem is how we handle crash recovery after we start up
without a checkpoint. No quick fixes there.

So let me rethink this: The idea was that we can skip the checkpoint
if we promote to normal running during streaming replication.

WALReceiver has been writing to WAL files, so can write more data
without all of the problems noted. Rather than write the
XLOG_END_OF_RECOVERY record via XLogInsert we should write that **from
the WALreceiver** as a dummy record by direct injection into the WAL
stream. So the Startup process sees a WAL record that looks like it
was written by the primary saying promote yourself, although it was
actually written locally by WALreceiver when requested to shutdown.
That doesn't damage anything because we know we've received all the
WAL there is. Most importantly we don't need to change any of the
logic in a way that endangers the other code paths at end of recovery.

Writing the record in that way means we would need to calculate the
new tli slightly earlier, so we can input the correct value into the
record. That also solves the problem of how to get additional standbys
to follow the new master. The XLOG_END_OF_RECOVERY record is simply
the contents of the newly written tli history file.

If we skip the checkpoint and then crash before the next checkpoint we
just change timeline when we see XLOG_END_OF_RECOVERY. When we replay
the XLOG_END_OF_RECOVERY we copy the contents to the appropriate tli
file and then switch to it.

So this solves 2 problems: having other standbys follow us when they
don't have archiving, and avoids the checkpoint.

Let me know what you think.
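
The replay side of this proposal can be pictured with a toy Python model.
This is only a sketch under stated assumptions, not PostgreSQL code: the
dictionary-of-lists WAL layout and both function names are hypothetical,
and the marker record simply carries the new timeline ID. It shows the
marker being appended at the end of the old timeline, and a later replay
following it onto the new timeline.

```python
# Toy model only: the WAL layout and both functions are hypothetical.
def append_end_of_recovery(wal_by_tli, old_tli, new_tli):
    """Stand-in for the receiver adding the marker at the end of the old timeline."""
    wal_by_tli.setdefault(old_tli, []).append(("END_OF_RECOVERY", new_tli))
    wal_by_tli.setdefault(new_tli, [])


def replay(wal_by_tli, start_tli):
    """Apply records in order, following END_OF_RECOVERY to the next timeline."""
    tli, pos, applied = start_tli, 0, []
    while pos < len(wal_by_tli.get(tli, [])):
        kind, payload = wal_by_tli[tli][pos]
        if kind == "END_OF_RECOVERY":
            if payload <= tli:
                raise ValueError("timeline must increase")
            tli, pos = payload, 0     # leave the old timeline, continue on the new one
            continue
        applied.append((tli, payload))
        pos += 1
    return tli, applied


wal = {1: [("DATA", "a"), ("DATA", "b")]}
append_end_of_recovery(wal, old_tli=1, new_tli=2)
wal[2].append(("DATA", "c"))          # written after promotion, before a crash
print(replay(wal, start_tli=1))       # (2, [(1, 'a'), (1, 'b'), (2, 'c')])
```

In the toy, a crash after promotion is recovered by replaying from timeline 1
and ending on timeline 2, which is the behaviour described above.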

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-01-17 Thread Fujii Masao
On Sun, Nov 13, 2011 at 5:13 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs si...@2ndquadrant.com wrote:

 When I say skip the shutdown checkpoint, I mean remove it from the
 critical path of required actions at the end of recovery. We can still
 have a normal checkpoint kicked off at that time, but that no longer
 needs to be on the critical path.

 Any problems foreseen? If not, looks like a quick patch.

 Patch attached for discussion/review.

This feature is what I want, and very helpful to shorten the failover time in
streaming replication.

Here are the review comments. Though I've not checked enough whether
this feature works fine in all recovery patterns yet.

LocalSetXLogInsertAllowed() must be called before LogEndOfRecovery().
LocalXLogInsertAllowed must be set to -1 after LogEndOfRecovery().

XLOG_END_OF_RECOVERY record is written to the WAL file with new
assigned timeline ID. But it must be written to the WAL file with old one.
Otherwise, when re-entering a recovery after failover, we cannot find
XLOG_END_OF_RECOVERY record at all.

Before XLOG_END_OF_RECOVERY record is written,
RmgrTable[rmid].rm_cleanup() might write WAL records. They also
should be written to the WAL file with old timeline ID.

When recovery target is specified, we cannot write new WAL to the file
with old timeline, because that would mean valid WAL records in it are
overwritten with new WAL. So when recovery target is specified,
ISTM that we cannot skip end of recovery checkpoint. Or we might need
to save all information about timelines in the database cluster instead
of writing XLOG_END_OF_RECOVERY record, and use it when re-entering
a recovery.

LogEndOfRecovery() seems to need to call XLogFlush(). Otherwise,
what if the server crashes after the new timeline history file is created and
recovery.conf is removed, but before the XLOG_END_OF_RECOVERY record
has been flushed to disk?

During recovery, when we replay XLOG_END_OF_RECOVERY record, we
should close the currently-opened WAL file and read the WAL file with
the timeline which XLOG_END_OF_RECOVERY record indicates.
Otherwise, when re-entering a recovery with old timeline, we cannot
reach new timeline.
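
The ordering and flush points in this review can be illustrated with a toy
Python model. It is a sketch only: ToyXLog and log_end_of_recovery are
hypothetical stand-ins, not the PostgreSQL functions named above, and a
plain boolean replaces the real LocalXLogInsertAllowed states. It shows
inserts being allowed before the record is written, the permission being
withdrawn afterwards, and what a crash does to an unflushed record.

```python
# Toy model only: ToyXLog and log_end_of_recovery are hypothetical stand-ins,
# not the PostgreSQL functions named in the review.
class ToyXLog:
    def __init__(self):
        self.insert_allowed = False   # recovery in progress: inserts forbidden
        self.buffered = []            # written but not yet flushed
        self.durable = []             # survives a crash

    def insert(self, record):
        if not self.insert_allowed:
            raise RuntimeError("WAL insert attempted while not allowed")
        self.buffered.append(record)

    def flush(self):
        self.durable.extend(self.buffered)
        self.buffered.clear()

    def crash(self):
        self.buffered.clear()         # unflushed records are lost


def log_end_of_recovery(xlog, new_tli, flush=True):
    xlog.insert_allowed = True        # step 1: allow inserts first
    try:
        xlog.insert(("END_OF_RECOVERY", new_tli))
        if flush:
            xlog.flush()              # step 2: make the record durable
    finally:
        xlog.insert_allowed = False   # step 3: withdraw the permission again


safe = ToyXLog()
log_end_of_recovery(safe, new_tli=2)
safe.crash()
print(safe.durable)      # [('END_OF_RECOVERY', 2)]

unsafe = ToyXLog()
log_end_of_recovery(unsafe, new_tli=2, flush=False)
unsafe.crash()
print(unsafe.durable)    # []
```

Without the flush, the toy loses the marker on a crash, which corresponds to
the case where the history file exists but the record does not.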

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Avoiding shutdown checkpoint at failover

2011-11-28 Thread Greg Smith

On 11/13/2011 12:13 AM, Simon Riggs wrote:

On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs si...@2ndquadrant.com wrote:


When I say skip the shutdown checkpoint, I mean remove it from the
critical path of required actions at the end of recovery. We can still
have a normal checkpoint kicked off at that time, but that no longer
needs to be on the critical path.

Any problems foreseen? If not, looks like a quick patch.

Patch attached for discussion/review.


This one was missed for the November CF; submitted in time but not added 
to the app until just now.  Given that Tom already voiced some specific 
things to consider (detailed review of all WAL replay activities) I 
added it to the January one instead.  If anyone has been looking for a
reason to study WAL replay, by all means knock yourself out before then.





Re: [HACKERS] Avoiding shutdown checkpoint at failover

2011-11-13 Thread Simon Riggs
On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs si...@2ndquadrant.com wrote:

 When I say skip the shutdown checkpoint, I mean remove it from the
 critical path of required actions at the end of recovery. We can still
 have a normal checkpoint kicked off at that time, but that no longer
 needs to be on the critical path.

 Any problems foreseen? If not, looks like a quick patch.

Patch attached for discussion/review.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


fast_failover.v1.patch
Description: Binary data



[HACKERS] Avoiding shutdown checkpoint at failover

2011-11-01 Thread Simon Riggs
When a server fails, we need to promote a standby as quickly as possible.

Currently when we promote a standby to a primary we need to run a
shutdown checkpoint before users can begin write transactions, which
in many cases can take minutes.

The reason we run a shutdown checkpoint is to prevent needing to
re-enter recovery if we crash after promotion. When we only had file
based replication, all WAL files were reloaded from archive each time,
so the restartpoint prior to the end of recovery was not guaranteed to
be available in pg_xlog. Once we had exited archive recovery it would
be difficult to re-access the archive.

Now with streaming replication, we keep the WAL files in pg_xlog
directly, so the last restartpoint is always available if we should
crash.

So if streaming replication is active at the point we promote, then we
can skip the shutdown checkpoint. It's that simple.

To make it even simpler, I suggest we also change file de-archiving so
that it writes normal WAL files, not RECOVERYXLOG, so that way we can
avoid the checkpoint in all cases.

There are comments saying we can only increment a timeline via a
shutdown checkpoint, but if we were smart we'd have noticed the
timeline change via the WAL file numbering anyway. Best way seems to
be to have a XLOG_TIMELINE_CHANGE record written instead of the
shutdown checkpoint.

When I say skip the shutdown checkpoint, I mean remove it from the
critical path of required actions at the end of recovery. We can still
have a normal checkpoint kicked off at that time, but that no longer
needs to be on the critical path.

Any problems foreseen? If not, looks like a quick patch.
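
As a rough illustration of "off the critical path", here is a toy Python
model, not PostgreSQL code. Standby, both promote functions and the cost
units are made up for illustration; only the record name follows the
XLOG_TIMELINE_CHANGE suggestion above. It compares the work done before
writes are accepted in each approach.

```python
# Toy model only: names and costs are hypothetical, not PostgreSQL internals.
from dataclasses import dataclass, field


@dataclass
class Standby:
    dirty_buffers: int                        # pages a checkpoint would flush
    wal: list = field(default_factory=list)
    background_work: list = field(default_factory=list)
    accepting_writes: bool = False


def promote_with_shutdown_checkpoint(s: Standby) -> int:
    """Behaviour as described: checkpoint first, then accept writes."""
    critical_path_cost = s.dirty_buffers      # flush every dirty page up front
    s.dirty_buffers = 0
    s.wal.append("SHUTDOWN_CHECKPOINT")
    s.accepting_writes = True
    return critical_path_cost


def promote_with_deferred_checkpoint(s: Standby) -> int:
    """Proposed behaviour: log a small marker record, checkpoint later."""
    s.wal.append("TIMELINE_CHANGE")           # one small WAL record
    s.background_work.append("checkpoint")    # requested, not waited for
    s.accepting_writes = True
    return 1                                  # cost of writing one record


old = Standby(dirty_buffers=50_000)
new = Standby(dirty_buffers=50_000)
print(promote_with_shutdown_checkpoint(old))   # 50000
print(promote_with_deferred_checkpoint(new))   # 1
```

The dirty pages still have to be flushed in the second case; the model only
moves that work to after the point where writes are accepted.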

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Avoiding shutdown checkpoint at failover

2011-11-01 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 The reason we run a shutdown checkpoint is to prevent needing to
 re-enter recovery if we crash after promotion.

That's *a* reason, it's not necessarily the only reason.  This proposal
worries me, especially your blithe dismissal of the timeline issues;
but in any case I would not trust it without a detailed review of all
WAL replay activities, which you don't sound to have done.

regards, tom lane



Re: [HACKERS] Avoiding shutdown checkpoint at failover

2011-11-01 Thread Simon Riggs
On Tue, Nov 1, 2011 at 1:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 The reason we run a shutdown checkpoint is to prevent needing to
 re-enter recovery if we crash after promotion.

 That's *a* reason, it's not necessarily the only reason.  This proposal
 worries me, especially your blithe dismissal of the timeline issues;
 but in any case I would not trust it without a detailed review of all
 WAL replay activities, which you don't sound to have done.

What timeline issues are you thinking of? Timelines were invented to
avoid confusion with PITR. The reality is that they don't have much
reason to exist in the world of replication and could be dispensed
with in that context easily if there are issues associated with them.

I believe the solution to be simple and wish it had occurred to me earlier.

If you can think of a reason to not do this, let me know.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
