Re: [HACKERS] Remove secondary checkpoint

2017-11-07 Thread Michael Paquier
On Wed, Nov 8, 2017 at 2:23 AM, Simon Riggs  wrote:
> On 31 October 2017 at 12:01, Michael Paquier  
> wrote:
>> While the mention about a manual checkpoint happening after a timed
>> one will cause a full range of WAL segments to be recycled, it is not
>> actually true that segments of the prior's prior checkpoint are not
>> needed, because with your patch the segments of the prior checkpoint
>> are getting recycled. So it seems to me that based on that the formula
>> ought to use 1.0 instead of 2.0...
>
> I think the argument in the comment is right, in that
> CheckPointDistanceEstimate is better if we use multiple checkpoint
> cycles.

Yes, the theory behind is correct. No argument behind that.

> But the implementation of that is bogus and multiplying by 2.0
> wouldn't make it better if CheckPointDistanceEstimate is wrong.

Yes, this is wrong. My apologies if my words looked confusing. By
reading your message I can see that our thoughts are on the same page.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-11-07 Thread Simon Riggs
On 31 October 2017 at 12:01, Michael Paquier  wrote:
> On Tue, Oct 31, 2017 at 2:00 PM, Simon Riggs  wrote:
>> On 30 October 2017 at 18:58, Simon Riggs  wrote:
>>> On 30 October 2017 at 15:22, Simon Riggs  wrote:
>>>
> You forgot to update this formula in xlog.c:
> distance = (2.0 + CheckPointCompletionTarget) * 
> CheckPointDistanceEstimate;
> /* add 10% for good measure. */
> distance *= 1.10;
> You can guess easily where the mistake is.

 Doh - fixed that before posting, so I must have sent the wrong patch.

 It's the 10%, right? ;-)
>>>
>>> So, there are 2 locations that mention 2.0 in xlog.c
>>>
>>> I had already fixed one, which is why I remembered editing it.
>>>
>>> The other one you mention has a detailed comment above it explaining
>>> why it should be 2.0 rather than 1.0, so I left it.
>>>
>>> Let me know if you still think it should be removed? I can see the
>>> argument both ways.
>>> Or maybe we need another patch to account for manual checkpoints.
>>
>> Revised patch to implement this
>
> Here is the comment and the formula:
>  * The reason this calculation is done from the prior
> checkpoint, not the
>  * one that just finished, is that this behaves better if some
> checkpoint
>  * cycles are abnormally short, like if you perform a manual 
> checkpoint
>  * right after a timed one. The manual checkpoint will make
> almost a full
>  * cycle's worth of WAL segments available for recycling, because the
>  * segments from the prior's prior, fully-sized checkpoint cycle are 
> no
>  * longer needed. However, the next checkpoint will make only
> few segments
>  * available for recycling, the ones generated between the timed
>  * checkpoint and the manual one right after that. If at the manual
>  * checkpoint we only retained enough segments to get us to
> the next timed
>  * one, and removed the rest, then at the next checkpoint we would not
>  * have enough segments around for recycling, to get us to the
> checkpoint
>  * after that. Basing the calculations on the distance from
> the prior redo
>  * pointer largely fixes that problem.
>  */
> distance = (2.0 + CheckPointCompletionTarget) *
> CheckPointDistanceEstimate;
>
> While the mention about a manual checkpoint happening after a timed
> one will cause a full range of WAL segments to be recycled, it is not
> actually true that segments of the prior's prior checkpoint are not
> needed, because with your patch the segments of the prior checkpoint
> are getting recycled. So it seems to me that based on that the formula
> ought to use 1.0 instead of 2.0...

I think the argument in the comment is right, in that
CheckPointDistanceEstimate is better if we use multiple checkpoint
cycles.

But the implementation of that is bogus and multiplying by 2.0
wouldn't make it better if CheckPointDistanceEstimate is wrong.

So I will change to 1.0 as you say and rewrite the comment. Thanks for
your review.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-31 Thread Michael Paquier
On Tue, Oct 31, 2017 at 2:00 PM, Simon Riggs  wrote:
> On 30 October 2017 at 18:58, Simon Riggs  wrote:
>> On 30 October 2017 at 15:22, Simon Riggs  wrote:
>>
 You forgot to update this formula in xlog.c:
 distance = (2.0 + CheckPointCompletionTarget) * 
 CheckPointDistanceEstimate;
 /* add 10% for good measure. */
 distance *= 1.10;
 You can guess easily where the mistake is.
>>>
>>> Doh - fixed that before posting, so I must have sent the wrong patch.
>>>
>>> It's the 10%, right? ;-)
>>
>> So, there are 2 locations that mention 2.0 in xlog.c
>>
>> I had already fixed one, which is why I remembered editing it.
>>
>> The other one you mention has a detailed comment above it explaining
>> why it should be 2.0 rather than 1.0, so I left it.
>>
>> Let me know if you still think it should be removed? I can see the
>> argument both ways.
>> Or maybe we need another patch to account for manual checkpoints.
>
> Revised patch to implement this

Here is the comment and the formula:
 * The reason this calculation is done from the prior
checkpoint, not the
 * one that just finished, is that this behaves better if some
checkpoint
 * cycles are abnormally short, like if you perform a manual checkpoint
 * right after a timed one. The manual checkpoint will make
almost a full
 * cycle's worth of WAL segments available for recycling, because the
 * segments from the prior's prior, fully-sized checkpoint cycle are no
 * longer needed. However, the next checkpoint will make only
few segments
 * available for recycling, the ones generated between the timed
 * checkpoint and the manual one right after that. If at the manual
 * checkpoint we only retained enough segments to get us to
the next timed
 * one, and removed the rest, then at the next checkpoint we would not
 * have enough segments around for recycling, to get us to the
checkpoint
 * after that. Basing the calculations on the distance from
the prior redo
 * pointer largely fixes that problem.
 */
distance = (2.0 + CheckPointCompletionTarget) *
CheckPointDistanceEstimate;

While the mention about a manual checkpoint happening after a timed
one will cause a full range of WAL segments to be recycled, it is not
actually true that segments of the prior's prior checkpoint are not
needed, because with your patch the segments of the prior checkpoint
are getting recycled. So it seems to me that based on that the formula
ought to use 1.0 instead of 2.0...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-31 Thread Simon Riggs
On 30 October 2017 at 18:58, Simon Riggs  wrote:
> On 30 October 2017 at 15:22, Simon Riggs  wrote:
>
>>> You forgot to update this formula in xlog.c:
>>> distance = (2.0 + CheckPointCompletionTarget) * 
>>> CheckPointDistanceEstimate;
>>> /* add 10% for good measure. */
>>> distance *= 1.10;
>>> You can guess easily where the mistake is.
>>
>> Doh - fixed that before posting, so I must have sent the wrong patch.
>>
>> It's the 10%, right? ;-)
>
> So, there are 2 locations that mention 2.0 in xlog.c
>
> I had already fixed one, which is why I remembered editing it.
>
> The other one you mention has a detailed comment above it explaining
> why it should be 2.0 rather than 1.0, so I left it.
>
> Let me know if you still think it should be removed? I can see the
> argument both ways.
>
> Or maybe we need another patch to account for manual checkpoints.

Revised patch to implement this

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


remove_secondary_checkpoint.v5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Simon Riggs
On 30 October 2017 at 15:22, Simon Riggs  wrote:

>> You forgot to update this formula in xlog.c:
>> distance = (2.0 + CheckPointCompletionTarget) * 
>> CheckPointDistanceEstimate;
>> /* add 10% for good measure. */
>> distance *= 1.10;
>> You can guess easily where the mistake is.
>
> Doh - fixed that before posting, so I must have sent the wrong patch.
>
> It's the 10%, right? ;-)

So, there are 2 locations that mention 2.0 in xlog.c

I had already fixed one, which is why I remembered editing it.

The other one you mention has a detailed comment above it explaining
why it should be 2.0 rather than 1.0, so I left it.

Let me know if you still think it should be removed? I can see the
argument both ways.

Or maybe we need another patch to account for manual checkpoints.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Andres Freund
On 2017-10-30 10:10:19 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > I was mostly just thinking out loud, listing another option rather
> > than advocating for it.
> 
> FWIW, I just wanted the question to be debated and resolved properly.
> 
> After rereading the thread Andres pointed to, I thought of a hazard
> that I think Andres alluded to, but didn't spell out explicitly:
> if we can't read the primary checkpoint, and then back up to a
> secondary one and replay as much of WAL as we can read, we may well
> be left with an inconsistent database.

Exactly.


> I'm content now that removing the secondary checkpoint is an OK
> decision.  (This isn't a review of Simon's patch, though.)

I wonder if we shouldn't add a pg_resetxlog option that sets the
checkpoint to start from to a certain LSN. For the few cases where
there's actual data recovery needed that's a lot more useful than
randomly using checkpoint - 1. And it's an explicit expert only thing,
without costing everyone.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Simon Riggs
On 30 October 2017 at 15:31, Michael Paquier  wrote:
> On Mon, Oct 30, 2017 at 2:22 PM, Simon Riggs  wrote:
>> On 25 October 2017 at 00:17, Michael Paquier  
>> wrote:
>>> -* Delete old log files (those no longer needed even for previous
>>> -* checkpoint or the standbys in XLOG streaming).
>>> +* Delete old log files and recycle them
>>>  */
>>> Here that's more "Delete or recycle old log files". Recycling of a WAL
>>> segment is its renaming into a newer segment.
>>
>> Sometimes files are deleted if there are too many.
>
> Sure, but one segment is never deleted and then recycled, which is
> what your new comment implies as I understand it.

I guess it depends how you read it.

The function performs both deletion AND recycling

Yet one file is only ever deleted OR recycled


>>> -   checkPointLoc = ControlFile->prevCheckPoint;
>>> +   /*
>>> +* It appears to be a bug that we used to use
>>> prevCheckpoint here
>>> +*/
>>> +   checkPointLoc = ControlFile->checkPoint;
>>> Er, no. This is correct because we expect the prior checkpoint to
>>> still be present in the event of a failure detecting the latest
>>> checkpoint.
>>
>> I'm removing "prevCheckPoint", so not sure what you mean.
>
> I mean that there is no actual bug here. And changing the code as you
> do is correct, but the comment is not.

Thanks

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Michael Paquier
On Mon, Oct 30, 2017 at 2:22 PM, Simon Riggs  wrote:
> On 25 October 2017 at 00:17, Michael Paquier  
> wrote:
>> -* Delete old log files (those no longer needed even for previous
>> -* checkpoint or the standbys in XLOG streaming).
>> +* Delete old log files and recycle them
>>  */
>> Here that's more "Delete or recycle old log files". Recycling of a WAL
>> segment is its renaming into a newer segment.
>
> Sometimes files are deleted if there are too many.

Sure, but one segment is never deleted and then recycled, which is
what your new comment implies as I understand it.

>> -   checkPointLoc = ControlFile->prevCheckPoint;
>> +   /*
>> +* It appears to be a bug that we used to use
>> prevCheckpoint here
>> +*/
>> +   checkPointLoc = ControlFile->checkPoint;
>> Er, no. This is correct because we expect the prior checkpoint to
>> still be present in the event of a failure detecting the latest
>> checkpoint.
>
> I'm removing "prevCheckPoint", so not sure what you mean.

I mean that there is no actual bug here. And changing the code as you
do is correct, but the comment is not.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Simon Riggs
On 25 October 2017 at 00:17, Michael Paquier  wrote:
> On Mon, Oct 23, 2017 at 11:20 PM, Simon Riggs  wrote:
>> Remove the code that maintained two checkpoint's WAL files and all
>> associated stuff.
>>
>> Try to avoid breaking anything else
>>
>> This
>> * reduces disk space requirements on master
>> * removes a minor bug in fast failover
>> * simplifies code
>
> In short, yeah! I had a close look at the patch and noticed a couple of 
> issues.

Thanks for the detailed review

> +else
>  /*
> - * The last valid checkpoint record required for a streaming
> - * recovery exists in neither standby nor the primary.
> + * We used to attempt to go back to a secondary checkpoint
> + * record here, but only when not in standby_mode. We now
> + * just fail if we can't read the last checkpoint because
> + * this allows us to simplify processing around checkpoints.
>   */
>  ereport(PANIC,
>  (errmsg("could not locate a valid checkpoint record")));
> -}
> Using brackets in this case is more elegant style IMO. OK, there is
> one line, but the comment is long so the block becomes
> confusingly-shaped.

OK

>  /* Version identifier for this pg_control format */
> -#define PG_CONTROL_VERSION1003
> +#define PG_CONTROL_VERSION1100
> Yes, this had to happen anyway in this release cycle.
>
> backup.sgml describes the following:
> to higher segment numbers.  It's assumed that segment files whose
> contents precede the checkpoint-before-last are no longer of
> interest and can be recycled.
> However this is not true anymore with this patch.

Thanks, will fix.

> The documentation of pg_control_checkpoint() is not updated. func.sgml
> lists the tuple fields returned for this function.

Ah, OK.

> You forgot to update pg_control_checkpoint in pg_proc.h. prior_lsn is
> still listed in the list of arguments returned. And you need to update
> as well the output list of types.
>
>   * whichChkpt identifies the checkpoint (merely for reporting purposes).
> - * 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
> + * 1 for "primary", 0 for "other" (backup_label)
> + * 2 for "secondary" is no longer supported
>   */
> I would suggest to just remove the reference to "secondary".

Yeh, that was there for review.

> -* Delete old log files (those no longer needed even for previous
> -* checkpoint or the standbys in XLOG streaming).
> +* Delete old log files and recycle them
>  */
> Here that's more "Delete or recycle old log files". Recycling of a WAL
> segment is its renaming into a newer segment.

Sometimes files are deleted if there are too many.

> You forgot to update this formula in xlog.c:
> distance = (2.0 + CheckPointCompletionTarget) * 
> CheckPointDistanceEstimate;
> /* add 10% for good measure. */
> distance *= 1.10;
> You can guess easily where the mistake is.

Doh - fixed that before posting, so I must have sent the wrong patch.

It's the 10%, right? ;-)

> -   checkPointLoc = ControlFile->prevCheckPoint;
> +   /*
> +* It appears to be a bug that we used to use
> prevCheckpoint here
> +*/
> +   checkPointLoc = ControlFile->checkPoint;
> Er, no. This is correct because we expect the prior checkpoint to
> still be present in the event of a failure detecting the latest
> checkpoint.

I'm removing "prevCheckPoint", so not sure what you mean.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Tom Lane
Robert Haas  writes:
> I was mostly just thinking out loud, listing another option rather
> than advocating for it.

FWIW, I just wanted the question to be debated and resolved properly.

After rereading the thread Andres pointed to, I thought of a hazard
that I think Andres alluded to, but didn't spell out explicitly:
if we can't read the primary checkpoint, and then back up to a
secondary one and replay as much of WAL as we can read, we may well
be left with an inconsistent database.  This would happen because
some changes that post-date the part of WAL we could read may have
been flushed to disk, if the system previously thought that WAL
up through the primary checkpoint was all safely down to disk.
Therefore, backing up to the secondary checkpoint is *not* a safe
automatic recovery choice, and it's dubious that it's even a useful
approach for manual recovery.  You're really into restore-from-
backup territory at that point.

I'm content now that removing the secondary checkpoint is an OK
decision.  (This isn't a review of Simon's patch, though.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Robert Haas
On Mon, Oct 30, 2017 at 7:04 PM, Alvaro Herrera  wrote:
>> A sort of middle way would be to keep the secondary checkpoint around
>> but never try to replay from that point, or only if a specific flag is
>> provided.
>
> Why do you want to keep the secondary checkpoint?  If there is no way to
> automatically start a recovery from the prior checkpoint, is it really
> possible to do the same manually?  I think the only advantage of keeping
> it is that the WAL files are kept around for a little bit longer.  But
> is that useful?  Surely for any environment where you really care, you
> have a WAL archive somewhere, so it doesn't matter if files are removed
> from the primary's pg_xlog dir.

(apologies for the empty message)

I don't really want anything in particular here, other than for the
system to be reliable.  If we're confident that there's zero value in
the secondary checkpoint then, sure, ditch it.  Even if you have the
older WAL files around in an archive, it doesn't mean that you know
where the previous checkpoint start location is ... but actually, come
to think of it, if you did need to know that, you could just run
pg_waldump to find it.  That probably wasn't true when this code was
originally written, but it is today.

I was mostly just thinking out loud, listing another option rather
than advocating for it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Robert Haas
On Mon, Oct 30, 2017 at 7:04 PM, Alvaro Herrera  wrote:
> Robert Haas wrote:
>> On Tue, Oct 24, 2017 at 7:25 PM, Andres Freund  wrote:
>> > I think it does the contrary. The current mechanism is, in my opinion,
>> > downright dangerous:
>> > https://www.postgresql.org/message-id/20160201235854.go8...@awork2.anarazel.de
>>
>> A sort of middle way would be to keep the secondary checkpoint around
>> but never try to replay from that point, or only if a specific flag is
>> provided.
>
> Why do you want to keep the secondary checkpoint?  If there is no way to
> automatically start a recovery from the prior checkpoint, is it really
> possible to do the same manually?  I think the only advantage of keeping
> it is that the WAL files are kept around for a little bit longer.  But
> is that useful?  Surely for any environment where you really care, you
> have a WAL archive somewhere, so it doesn't matter if files are removed
> from the primary's pg_xlog dir.
>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Oct 24, 2017 at 7:25 PM, Andres Freund  wrote:
> > I think it does the contrary. The current mechanism is, in my opinion,
> > downright dangerous:
> > https://www.postgresql.org/message-id/20160201235854.go8...@awork2.anarazel.de
> 
> A sort of middle way would be to keep the secondary checkpoint around
> but never try to replay from that point, or only if a specific flag is
> provided.

Why do you want to keep the secondary checkpoint?  If there is no way to
automatically start a recovery from the prior checkpoint, is it really
possible to do the same manually?  I think the only advantage of keeping
it is that the WAL files are kept around for a little bit longer.  But
is that useful?  Surely for any environment where you really care, you
have a WAL archive somewhere, so it doesn't matter if files are removed
from the primary's pg_xlog dir.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Robert Haas
On Tue, Oct 24, 2017 at 7:25 PM, Andres Freund  wrote:
> I think it does the contrary. The current mechanism is, in my opinion,
> downright dangerous:
> https://www.postgresql.org/message-id/20160201235854.go8...@awork2.anarazel.de

A sort of middle way would be to keep the secondary checkpoint around
but never try to replay from that point, or only if a specific flag is
provided.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-26 Thread Tsunakawa, Takayuki
From: Alvaro Herrera [mailto:alvhe...@alvh.no-ip.org]
> Tsunakawa, Takayuki wrote:
> 
> > (Although unrelated to this, I've also been wondering why PostgreSQL
> > flushes WAL to disk when writing a page in the shared buffer, because
> > PostgreSQL doesn't use WAL for undo.)
> 
> The reason is that if the system crashes after writing the data page to
> disk, but before writing the WAL, the data page would be inconsistent with
> data in pages that weren't flushed, since there is no WAL to update those
> other pages.  Also, if the system crashes after partially writing the page
> (say it writes the first 4kB) then the page is downright corrupted with
> no way to fix it.
> 
> So there has to be a barrier that ensures that the WAL is flushed up to
> the last position that modified a page (i.e. that page's LSN) before actually
> writing that page to disk.  And this is why we can't use mmap() for shared
> buffers -- there is no mechanism to force the WAL down if the operation
> system has the liberty to flush pages whenever it likes.

I see.  The latter is a torn page problem, which is solved by a full page image 
WAL record.  I understood that an example of the former problem is the 
inconsistency between a table page and an index page -- if an index page is 
flushed to disk without slushing the WAL and the corresponding table page, an 
index entry would point to a wroing table record after recovery.

Thanks, my long-standing question has beenn solved.


Regards
Takayuki Tsunakawa


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-26 Thread Alvaro Herrera
Tsunakawa, Takayuki wrote:

> (Although unrelated to this, I've also been wondering why PostgreSQL
> flushes WAL to disk when writing a page in the shared buffer, because
> PostgreSQL doesn't use WAL for undo.)

The reason is that if the system crashes after writing the data page to
disk, but before writing the WAL, the data page would be inconsistent
with data in pages that weren't flushed, since there is no WAL to update
those other pages.  Also, if the system crashes after partially writing
the page (say it writes the first 4kB) then the page is downright
corrupted with no way to fix it.

So there has to be a barrier that ensures that the WAL is flushed up to
the last position that modified a page (i.e. that page's LSN) before
actually writing that page to disk.  And this is why we can't use mmap()
for shared buffers -- there is no mechanism to force the WAL down if the
operation system has the liberty to flush pages whenever it likes.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Michael Paquier
On Tue, Oct 24, 2017 at 7:24 PM, Tsunakawa, Takayuki
 wrote:
> (3)
> Should we change the default value of max_wal_size from 1 GB to a smaller 
> size?  I vote for "no" for performance.

The default has just changed in v10, so changing it again could be
confusing, so I agree with your position.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Simon Riggs
> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

I welcome this patch.  I was wondering why PostgreSQL retains the previous 
checkpoint.  (Although unrelated to this, I've also been wondering why 
PostgreSQL flushes WAL to disk when writing a page in the shared buffer, 
because PostgreSQL doesn't use WAL for undo.)

Here are my review comments.

(1)
-* a) we keep WAL for two checkpoint cycles, back to the "prev" 
checkpoint.
+* a) we keep WAL for only one checkpoint cycle (prior to PG11 we kept
+*WAL for two checkpoint cycles to allow us to recover from the
+*secondary checkpoint if the first checkpoint failed, though we
+*only did this on the master anyway, not on standby. Keeping just
+*one checkpoint simplifies processing and reduces disk space in
+*many smaller databases.)

/*
-* The last valid checkpoint record required for a 
streaming
-* recovery exists in neither standby nor the primary.
+* We used to attempt to go back to a secondary 
checkpoint
+* record here, but only when not in standby_mode. We 
now
+* just fail if we can't read the last checkpoint 
because
+* this allows us to simplify processing around 
checkpoints.
 */


if (fast_promote)
{
-   checkPointLoc = ControlFile->prevCheckPoint;
+   /*
+* It appears to be a bug that we used to use 
prevCheckpoint here
+*/
+   checkPointLoc = ControlFile->checkPoint;

-   XLByteToSeg(PriorRedoPtr, _logSegNo, wal_segment_size);
+   /* Trim from the last checkpoint, not the last - 1 */
+   XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
 
I suggest to remove references to the secondary checkpoint (the old behavior) 
from the comments.  I'm afraid those could confuse new developers.



(2)
@@ -8110,10 +8100,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, 
XLogRecPtr RecPtr,
ereport(LOG,
(errmsg("invalid primary 
checkpoint link in control file")));
break;

@@ -8135,10 +8121,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, 
XLogRecPtr RecPtr,
ereport(LOG,
(errmsg("invalid primary 
checkpoint record")));
break;

@@ -8154,10 +8136,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, 
XLogRecPtr RecPtr,
ereport(LOG,
(errmsg("invalid resource 
manager ID in primary checkpoint record")));
break;

etc, etc.

"primary checkpoint" should be just "checkpoint", now that the 
primary/secondary concept will disappear.


(3)
Should we change the default value of max_wal_size from 1 GB to a smaller size? 
 I vote for "no" for performance.

Regards
Takayuki Tsunakawa





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Michael Paquier
> On Tue, Oct 24, 2017 at 5:58 PM, Tsunakawa, Takayuki
>  wrote:
> > If the latest checkpoint record is unreadable (the WAL
> segment/block/record is corrupt?), recovery from the previous checkpoint
> would also stop at the latest checkpoint.  And we don't need to replay the
> WAL records between the previous checkpoint and the latest one, because
> their changes are already persisted when the latest checkpoint was taken.
> So, the user should just do pg_resetxlog and start the database server when
> the recovery cannot find the latest checkpoint record and PANICs?
> 
> Not necessarily. If a failure is detected when reading the last checkpoint,
> as you say recovery would begin at the checkpoint prior that and would stop
> when reading the record of last checkpoint, still one could use a
> recovery.conf with restore_command to fetch segments from a different
> source, like a static archive, as only the local segment may be corrupted.

Oh, you are right.  "If the crash recovery fails, perform recovery from backup."

Anyway, I agree that the secondary checkpoint isn't necessary.

Regards
Takayuki Tsunakawa



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Michael Paquier
On Tue, Oct 24, 2017 at 5:58 PM, Tsunakawa, Takayuki
 wrote:
> If the latest checkpoint record is unreadable (the WAL segment/block/record 
> is corrupt?), recovery from the previous checkpoint would also stop at the 
> latest checkpoint.  And we don't need to replay the WAL records between the 
> previous checkpoint and the latest one, because their changes are already 
> persisted when the latest checkpoint was taken.  So, the user should just do 
> pg_resetxlog and start the database server when the recovery cannot find the 
> latest checkpoint record and PANICs?

Not necessarily. If a failure is detected when reading the last
checkpoint, as you say recovery would begin at the checkpoint prior
that and would stop when reading the record of last checkpoint, still
one could use a recovery.conf with restore_command to fetch segments
from a different source, like a static archive, as only the local
segment may be corrupted.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Doesn't it also make crash recovery less robust?  The whole point
> of that mechanism is to be able to cope if the latest checkpoint
> record is unreadable.

If the latest checkpoint record is unreadable (the WAL segment/block/record is 
corrupt?), recovery from the previous checkpoint would also stop at the latest 
checkpoint.  And we don't need to replay the WAL records between the previous 
checkpoint and the latest one, because their changes are already persisted when 
the latest checkpoint was taken.  So, the user should just do pg_resetxlog and 
start the database server when the recovery cannot find the latest checkpoint 
record and PANICs?

Regards
Takayuki Tsunakawa






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Michael Paquier
On Mon, Oct 23, 2017 at 11:20 PM, Simon Riggs  wrote:
> Remove the code that maintained two checkpoint's WAL files and all
> associated stuff.
>
> Try to avoid breaking anything else
>
> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

In short, yeah! I had a close look at the patch and noticed a couple of issues.

+else
 /*
- * The last valid checkpoint record required for a streaming
- * recovery exists in neither standby nor the primary.
+ * We used to attempt to go back to a secondary checkpoint
+ * record here, but only when not in standby_mode. We now
+ * just fail if we can't read the last checkpoint because
+ * this allows us to simplify processing around checkpoints.
  */
 ereport(PANIC,
 (errmsg("could not locate a valid checkpoint record")));
-}
Using brackets in this case is more elegant style IMO. OK, there is
one line, but the comment is long so the block becomes
confusingly-shaped.

 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION1003
+#define PG_CONTROL_VERSION1100
Yes, this had to happen anyway in this release cycle.

backup.sgml describes the following:
to higher segment numbers.  It's assumed that segment files whose
contents precede the checkpoint-before-last are no longer of
interest and can be recycled.
However this is not true anymore with this patch.

The documentation of pg_control_checkpoint() is not updated. func.sgml
lists the tuple fields returned for this function.

You forgot to update pg_control_checkpoint in pg_proc.h. prior_lsn is
still listed in the list of arguments returned. And you need to update
as well the output list of types.

  * whichChkpt identifies the checkpoint (merely for reporting purposes).
- * 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
+ * 1 for "primary", 0 for "other" (backup_label)
+ * 2 for "secondary" is no longer supported
  */
I would suggest to just remove the reference to "secondary".

-* Delete old log files (those no longer needed even for previous
-* checkpoint or the standbys in XLOG streaming).
+* Delete old log files and recycle them
 */
Here that's more "Delete or recycle old log files". Recycling of a WAL
segment is its renaming into a newer segment.

You forgot to update this formula in xlog.c:
distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
/* add 10% for good measure. */
distance *= 1.10;
You can guess easily where the mistake is.

-   checkPointLoc = ControlFile->prevCheckPoint;
+   /*
+* It appears to be a bug that we used to use
prevCheckpoint here
+*/
+   checkPointLoc = ControlFile->checkPoint;
Er, no. This is correct because we expect the prior checkpoint to
still be present in the event of a failure detecting the latest
checkpoint.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Simon Riggs
On 24 October 2017 at 09:50, Tom Lane  wrote:
> Simon Riggs  writes:
>> Remove the code that maintained two checkpoint's WAL files and all
>> associated stuff.
>
>> Try to avoid breaking anything else
>
>> This
>> * reduces disk space requirements on master
>> * removes a minor bug in fast failover
>> * simplifies code
>
> Doesn't it also make crash recovery less robust?  The whole point
> of that mechanism is to be able to cope if the latest checkpoint
> record is unreadable.  If you want to toss that overboard, I think
> you need to make the case why we don't need it, not just post a
> patch removing it.  *Of course* the code is simpler without it.
> That's utterly irrelevant.  The code would be even simpler with
> no crash recovery at all ... but we're not going there.

Well, the mechanism has already been partially removed since we don't
maintain two checkpoints on a standby. So all I'm proposing is we
remove the other half.

I've not seen myself, nor can I find an example online where the
primary failed yet the secondary did not also fail from the same
cause.

If it is a possibility to do this, now we have pg_waldump we can
easily search for a different checkpoint and start from there instead
which is a more flexible approach. If you didn't save your WAL and
don't have any other form of backup, relying on the secondary
checkpoint is not exactly a safe bet.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Andres Freund
On 2017-10-24 09:50:12 -0400, Tom Lane wrote:
> Simon Riggs  writes:
> > Remove the code that maintained two checkpoint's WAL files and all
> > associated stuff.
> 
> > Try to avoid breaking anything else
> 
> > This
> > * reduces disk space requirements on master
> > * removes a minor bug in fast failover
> > * simplifies code
> 
> Doesn't it also make crash recovery less robust?

I think it does the contrary. The current mechanism is, in my opinion,
downright dangerous:
https://www.postgresql.org/message-id/20160201235854.go8...@awork2.anarazel.de

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Tom Lane
Simon Riggs  writes:
> Remove the code that maintained two checkpoint's WAL files and all
> associated stuff.

> Try to avoid breaking anything else

> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

Doesn't it also make crash recovery less robust?  The whole point
of that mechanism is to be able to cope if the latest checkpoint
record is unreadable.  If you want to toss that overboard, I think
you need to make the case why we don't need it, not just post a
patch removing it.  *Of course* the code is simpler without it.
That's utterly irrelevant.  The code would be even simpler with
no crash recovery at all ... but we're not going there.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers