Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-06 Thread Amit Kapila
On Tue, Feb 2, 2016 at 5:28 AM, Andres Freund  wrote:
>
> Hi,
>
> currently if, when not in standby mode, we can't read a checkpoint
> record, we automatically fall back to the previous checkpoint, and start
> replay from there.
>
> Doing so without user intervention doesn't actually seem like a good
> idea. While not super likely, it's entirely possible that doing so can
> wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
> tablespace being dropped - going back to the previous checkpoint very
> well could lead to replay not finishing, as the directory to create
> files in doesn't even exist.
>

I think there are similar hazards for deletion of relation when
relfilenode gets reused.  Basically, it can delete the data
for one of the newer relations which is created after the
last checkpoint.

> As there's, afaics, really no "legitimate" reasons for needing to go
> back to the previous checkpoint I don't think we should do so in an
> automated fashion.
>

I have tried to find out why at the first place such a mechanism has
been introduced and it seems to me that commit
4d14fe0048cf80052a3ba2053560f8aab1bb1b22 has introduced it, but
the reason is not apparent.  Then I digged through the archives
and found mail chain which I think has lead to this commit.
Refer [1][2].

If we want to do something for fallback-to-previous-checkpoint
mechanism, then I think it is worth considering whether we want
to retain xlog files from two checkpoints as that also seems to
have been introduced in the same commit.


> All the cases where I could find logs containing "using previous
> checkpoint record at" were when something else had already gone pretty
> badly wrong. Now that obviously doesn't have a very large significance,
> because in the situations where it "just worked" are unlikely to be
> reported...
>
> Am I missing a reason for doing this by default?
>

I am not sure, but may be such hazards won't exist at the time
fallback-to-previous-checkpoint mechanism has been introduced.

I think even if we want to make it non-default, it will be very
difficult for users to decide whether to turn it on or not.  Basically,
I think if such a situation occurs, what ever solution we try to
provide to user, it might not be full-proof, but OTOH we should
provide some way to allow user to start database and dump the
existing contents. Some of the options that comes to mind are
provide some way to get the last checkpoint record from WAL
or provide a way to compute max-lsn from data-pages and use
that with pg_resetxlog utility to allow user to start database.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-06 Thread Amit Kapila
On Sun, Feb 7, 2016 at 10:54 AM, Amit Kapila 
wrote:
>
> On Tue, Feb 2, 2016 at 5:28 AM, Andres Freund  wrote:
> >
> > Hi,
> >
> > currently if, when not in standby mode, we can't read a checkpoint
> > record, we automatically fall back to the previous checkpoint, and start
> > replay from there.
> >
> > Doing so without user intervention doesn't actually seem like a good
> > idea. While not super likely, it's entirely possible that doing so can
> > wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
> > tablespace being dropped - going back to the previous checkpoint very
> > well could lead to replay not finishing, as the directory to create
> > files in doesn't even exist.
> >
>
> I think there are similar hazards for deletion of relation when
> relfilenode gets reused.  Basically, it can delete the data
> for one of the newer relations which is created after the
> last checkpoint.
>
> > As there's, afaics, really no "legitimate" reasons for needing to go
> > back to the previous checkpoint I don't think we should do so in an
> > automated fashion.
> >
>
> I have tried to find out why at the first place such a mechanism has
> been introduced and it seems to me that commit
> 4d14fe0048cf80052a3ba2053560f8aab1bb1b22 has introduced it, but
> the reason is not apparent.  Then I digged through the archives
> and found mail chain which I think has lead to this commit.
> Refer [1][2].
>

oops, forgot to provide the links, providing them now.

[1] - http://www.postgresql.org/message-id/21559.983467...@sss.pgh.pa.us
[2] - http://www.postgresql.org/message-id/17254.984448...@sss.pgh.pa.us

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread Alvaro Herrera
David G. Johnston wrote:

> ​Learning by reading here...
> 
> http://www.postgresql.org/docs/current/static/wal-internals.html
> """
> ​After a checkpoint has been made and the log flushed, the checkpoint's
> position is saved in the file pg_control. Therefore, at the start of
> recovery, the server first reads pg_control and then the checkpoint record;
> then it performs the REDO operation by scanning forward from the log
> position indicated in the checkpoint record. Because the entire content of
> data pages is saved in the log on the first page modification after a
> checkpoint (assuming full_page_writes is not disabled), all pages changed
> since the checkpoint will be restored to a consistent state.
> 
> To deal with the case where pg_control is corrupt, we should support the
> possibility of scanning existing log segments in reverse order — newest to
> oldest — in order to find the latest checkpoint. This has not been
> implemented yet. pg_control is small enough (less than one disk page) that
> it is not subject to partial-write problems, and as of this writing there
> have been no reports of database failures due solely to the inability to
> read pg_control itself. So while it is theoretically a weak spot,
> pg_control does not seem to be a problem in practice.
> ​"""​
> 
> ​The above comment appears out-of-date if this post describes what
> presently happens.

I think you're misinterpreting Andres, or the docs, or both.

What Andres says is that the control file (pg_control) stores two
checkpoint locations: the latest one, and the one before that.  When
recovery occurs, it starts by looking up the latest checkpoint record;
if it cannot find that for whatever reason, it falls back to reading the
previous one.  (He further claims that falling back to the previous one
is a bad idea.)

What the 2nd para in the documentation is saying is something different:
it is talking about reading all the pg_xlog files (in reverse order),
which is not pg_control, and see what checkpoint records are there, then
figure out which one to use.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread David G. Johnston
On Thu, Feb 4, 2016 at 3:57 PM, Alvaro Herrera 
wrote:

> David G. Johnston wrote:
>
> > ​Learning by reading here...
> >
> > http://www.postgresql.org/docs/current/static/wal-internals.html
> > """
> > ​After a checkpoint has been made and the log flushed, the checkpoint's
> > position is saved in the file pg_control. Therefore, at the start of
> > recovery, the server first reads pg_control and then the checkpoint
> record;
> > then it performs the REDO operation by scanning forward from the log
> > position indicated in the checkpoint record. Because the entire content
> of
> > data pages is saved in the log on the first page modification after a
> > checkpoint (assuming full_page_writes is not disabled), all pages changed
> > since the checkpoint will be restored to a consistent state.
> >
> > To deal with the case where pg_control is corrupt, we should support the
> > possibility of scanning existing log segments in reverse order — newest
> to
> > oldest — in order to find the latest checkpoint. This has not been
> > implemented yet. pg_control is small enough (less than one disk page)
> that
> > it is not subject to partial-write problems, and as of this writing there
> > have been no reports of database failures due solely to the inability to
> > read pg_control itself. So while it is theoretically a weak spot,
> > pg_control does not seem to be a problem in practice.
> > ​"""​
> >
> > ​The above comment appears out-of-date if this post describes what
> > presently happens.
>
> I think you're misinterpreting Andres, or the docs, or both.
>
> What Andres says is that the control file (pg_control) stores two
> checkpoint locations: the latest one, and the one before that.  When
> recovery occurs, it starts by looking up the latest checkpoint record;
> if it cannot find that for whatever reason, it falls back to reading the
> previous one.  (He further claims that falling back to the previous one
> is a bad idea.)
>
> What the 2nd para in the documentation is saying is something different:
> it is talking about reading all the pg_xlog files (in reverse order),
> which is not pg_control, and see what checkpoint records are there, then
> figure out which one to use.
>

Yes, I inferred something that obviously isn't true - that the system
doesn't go hunting for a valid checkpoint to begin recovery from.  While it
does not do so in the case of a corrupted pg_control file I further assumed
it never did.  That would be because the documentation doesn't make the
point of stating that two checkpoint positions exist and that PostgreSQL
will try the second one if the first one proves unusable.  Given the topic
of this thread that omission makes the documentation out-of-date.  Maybe
its covered elsewhere but since this section addresses locating a starting
point I would expect any such description ​to be here as well.

David J.


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread Andres Freund
On 2016-02-03 09:28:24 -0500, Robert Haas wrote:
> Would we still have some way of forcing the older checkpoint record to
> be used if somebody wants to try to do that?

I think currently the best way to force an arbitrary checkpoint to be
used is creating a "custom" backup label. Not that nice.  Not sure if we
need something nice here, I don't really see a frequent need for this.

We could add another option to pg_resetxlog alternatively :/

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread Andres Freund
On February 5, 2016 2:52:20 AM GMT+03:00, Jim Nasby  
wrote:
>On 2/4/16 3:37 PM, Andres Freund wrote:
>> On 2016-02-03 09:28:24 -0500, Robert Haas wrote:
>>> Would we still have some way of forcing the older checkpoint record
>to
>>> be used if somebody wants to try to do that?
>>
>> I think currently the best way to force an arbitrary checkpoint to be
>> used is creating a "custom" backup label. Not that nice.  Not sure if
>we
>> need something nice here, I don't really see a frequent need for
>this.
>>
>> We could add another option to pg_resetxlog alternatively :/
>
>I guess you'd have to scan through WAL files by hand to find the next 
>oldest checkpoint?

Just look at pg control, it contains the precious location?

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread Jim Nasby

On 2/4/16 5:09 PM, David G. Johnston wrote:


What the 2nd para in the documentation is saying is something different:
it is talking about reading all the pg_xlog files (in reverse order),
which is not pg_control, and see what checkpoint records are there, then
figure out which one to use.


Yes, I inferred something that obviously isn't true - that the system
doesn't go hunting for a valid checkpoint to begin recovery from.  While
it does not do so in the case of a corrupted pg_control file I further
assumed it never did.  That would be because the documentation doesn't
make the point of stating that two checkpoint positions exist and that
PostgreSQL will try the second one if the first one proves unusable.
Given the topic of this thread that omission makes the documentation
out-of-date.  Maybe its covered elsewhere but since this section
addresses locating a starting point I would expect any such description
​to be here as well.


Yeah, I think we should fix the docs. Especially since I imagine that if 
you're reading that part of the docs you're probably having a really bad 
day, and bad info won't help you...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread Jim Nasby

On 2/4/16 3:37 PM, Andres Freund wrote:

On 2016-02-03 09:28:24 -0500, Robert Haas wrote:

Would we still have some way of forcing the older checkpoint record to
be used if somebody wants to try to do that?


I think currently the best way to force an arbitrary checkpoint to be
used is creating a "custom" backup label. Not that nice.  Not sure if we
need something nice here, I don't really see a frequent need for this.

We could add another option to pg_resetxlog alternatively :/


I guess you'd have to scan through WAL files by hand to find the next 
oldest checkpoint?


I'm guessing that if this is happening in the field there's a decent 
chance people aren't noticing it, so maybe the best thing for now is to 
turn off the automatic behavior bust still have a relatively easy way to 
re-enable it. In case this is more common than we think...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-03 Thread Robert Haas
On Mon, Feb 1, 2016 at 6:58 PM, Andres Freund  wrote:
> currently if, when not in standby mode, we can't read a checkpoint
> record, we automatically fall back to the previous checkpoint, and start
> replay from there.
>
> Doing so without user intervention doesn't actually seem like a good
> idea. While not super likely, it's entirely possible that doing so can
> wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
> tablespace being dropped - going back to the previous checkpoint very
> well could lead to replay not finishing, as the directory to create
> files in doesn't even exist.
>
> As there's, afaics, really no "legitimate" reasons for needing to go
> back to the previous checkpoint I don't think we should do so in an
> automated fashion.
>
> All the cases where I could find logs containing "using previous
> checkpoint record at" were when something else had already gone pretty
> badly wrong. Now that obviously doesn't have a very large significance,
> because in the situations where it "just worked" are unlikely to be
> reported...
>
> Am I missing a reason for doing this by default?

I agree: this seems like a terrible idea.  Would we still have some
way of forcing the older checkpoint record to be used if somebody
wants to try to do that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-01 Thread Andres Freund
Hi,

currently if, when not in standby mode, we can't read a checkpoint
record, we automatically fall back to the previous checkpoint, and start
replay from there.

Doing so without user intervention doesn't actually seem like a good
idea. While not super likely, it's entirely possible that doing so can
wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
tablespace being dropped - going back to the previous checkpoint very
well could lead to replay not finishing, as the directory to create
files in doesn't even exist.

As there's, afaics, really no "legitimate" reasons for needing to go
back to the previous checkpoint I don't think we should do so in an
automated fashion.

All the cases where I could find logs containing "using previous
checkpoint record at" were when something else had already gone pretty
badly wrong. Now that obviously doesn't have a very large significance,
because in the situations where it "just worked" are unlikely to be
reported...

Am I missing a reason for doing this by default?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-01 Thread Andres Freund
On 2016-02-01 17:29:39 -0700, David G. Johnston wrote:
> ​Learning by reading here...
> 
> http://www.postgresql.org/docs/current/static/wal-internals.html
> """
> ​After a checkpoint has been made and the log flushed, the checkpoint's
> position is saved in the file pg_control. Therefore, at the start of
> recovery, the server first reads pg_control and then the checkpoint record;
> then it performs the REDO operation by scanning forward from the log
> position indicated in the checkpoint record. Because the entire content of
> data pages is saved in the log on the first page modification after a
> checkpoint (assuming full_page_writes is not disabled), all pages changed
> since the checkpoint will be restored to a consistent state.

> ​The above comment appears out-of-date if this post describes what
> presently happens.

Where do you see a conflict with what I wrote about? We store both the
last and the previous checkpoint's location in pg_control. Or are you
talking about:

> To deal with the case where pg_control is corrupt, we should support the
> possibility of scanning existing log segments in reverse order — newest to
> oldest — in order to find the latest checkpoint. This has not been
> implemented yet. pg_control is small enough (less than one disk page) that
> it is not subject to partial-write problems, and as of this writing there
> have been no reports of database failures due solely to the inability to
> read pg_control itself. So while it is theoretically a weak spot,
> pg_control does not seem to be a problem in practice.

if so, no, that's not a out-of-date, as we simply store two checkpoint
locations:
 $ pg_controldata /srv/dev/pgdev-dev/|grep 'checkpoint location'
Latest checkpoint location:   B3/2A730028
Prior checkpoint location:B3/2A72FFA0

> Also, I was​ under the impression that tablespace commands resulted in
> checkpoints so that the state of the file system could be presumed
> current...

That actually doesn't really make it any better - it forces the *latest*
checkpoint, but if we can't read that, we'll start with the previous
one...

> I don't know enough internals but its seems like we'd need to distinguish
> between an interrupted checkpoint (pull the plug during checkpoint) and one
> that supposedly completed without interruption but then was somehow
> corrupted (solar flares).  The former seem legitimate for auto-skip while
> the later do not.

I don't think such a distinction is really possible (or necessary). If
pg_control is corrupted we won't even start, and if WAL is corrupted
that badly we won't finish replay...

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-01 Thread David G. Johnston
On Mon, Feb 1, 2016 at 4:58 PM, Andres Freund  wrote:

> Hi,
>
> currently if, when not in standby mode, we can't read a checkpoint
> record, we automatically fall back to the previous checkpoint, and start
> replay from there.
>
> Doing so without user intervention doesn't actually seem like a good
> idea. While not super likely, it's entirely possible that doing so can
> wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
> tablespace being dropped - going back to the previous checkpoint very
> well could lead to replay not finishing, as the directory to create
> files in doesn't even exist.


> As there's, afaics, really no "legitimate" reasons for needing to go
> back to the previous checkpoint I don't think we should do so in an
> automated fashion.
>
> All the cases where I could find logs containing "using previous
> checkpoint record at" were when something else had already gone pretty
> badly wrong. Now that obviously doesn't have a very large significance,
> because in the situations where it "just worked" are unlikely to be
> reported...
>
> Am I missing a reason for doing this by default?
>

​Learning by reading here...

http://www.postgresql.org/docs/current/static/wal-internals.html
"""
​After a checkpoint has been made and the log flushed, the checkpoint's
position is saved in the file pg_control. Therefore, at the start of
recovery, the server first reads pg_control and then the checkpoint record;
then it performs the REDO operation by scanning forward from the log
position indicated in the checkpoint record. Because the entire content of
data pages is saved in the log on the first page modification after a
checkpoint (assuming full_page_writes is not disabled), all pages changed
since the checkpoint will be restored to a consistent state.

To deal with the case where pg_control is corrupt, we should support the
possibility of scanning existing log segments in reverse order — newest to
oldest — in order to find the latest checkpoint. This has not been
implemented yet. pg_control is small enough (less than one disk page) that
it is not subject to partial-write problems, and as of this writing there
have been no reports of database failures due solely to the inability to
read pg_control itself. So while it is theoretically a weak spot,
pg_control does not seem to be a problem in practice.
​"""​

​The above comment appears out-of-date if this post describes what
presently happens.

Also, I was​ under the impression that tablespace commands resulted in
checkpoints so that the state of the file system could be presumed
current...

I don't know enough internals but its seems like we'd need to distinguish
between an interrupted checkpoint (pull the plug during checkpoint) and one
that supposedly completed without interruption but then was somehow
corrupted (solar flares).  The former seem legitimate for auto-skip while
the later do not.

David J.


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-01 Thread David G. Johnston
On Mon, Feb 1, 2016 at 5:48 PM, Andres Freund  wrote:

> On 2016-02-01 17:29:39 -0700, David G. Johnston wrote:
> > ​Learning by reading here...
> >
> > http://www.postgresql.org/docs/current/static/wal-internals.html
> > """
> > ​After a checkpoint has been made and the log flushed, the checkpoint's
> > position is saved in the file pg_control. Therefore, at the start of
> > recovery, the server first reads pg_control and then the checkpoint
> record;
> > then it performs the REDO operation by scanning forward from the log
> > position indicated in the checkpoint record. Because the entire content
> of
> > data pages is saved in the log on the first page modification after a
> > checkpoint (assuming full_page_writes is not disabled), all pages changed
> > since the checkpoint will be restored to a consistent state.
>
> > ​The above comment appears out-of-date if this post describes what
> > presently happens.
>
> Where do you see a conflict with what I wrote about? We store both the
> last and the previous checkpoint's location in pg_control. Or are you
> talking about:
>

​Mainly the following...but the word I used was "out-of-date" and not
"conflict".​  The present state seems to do the above, and then some.


> > To deal with the case where pg_control is corrupt, we should support the
> > possibility of scanning existing log segments in reverse order — newest
> to
> > oldest — in order to find the latest checkpoint. This has not been
> > implemented yet. pg_control is small enough (less than one disk page)
> that
> > it is not subject to partial-write problems, and as of this writing there
> > have been no reports of database failures due solely to the inability to
> > read pg_control itself. So while it is theoretically a weak spot,
> > pg_control does not seem to be a problem in practice.
>
> if so, no, that's not a out-of-date, as we simply store two checkpoint
> locations:
>  $ pg_controldata /srv/dev/pgdev-dev/|grep 'checkpoint location'
> Latest checkpoint location:   B3/2A730028
> Prior checkpoint location:B3/2A72FFA0
>
>
​The quote implies that only a single checkpoint​ is noted and that no
"searching" is performed - whether by scanning or by being told the
position of a previous one so that it can jump there immediately without
scanning backwards.  It isn't strictly the fact that we do not "scan"
backwards but the implications that arise in making that statement.  Maybe
this is being picky but if you cannot trust the value of "Latest checkpoint
location" then pg_control is arguably corrupt.  Corruption is not strictly
limited to "unable to be read" but does include "contains invalid data".

> Also, I was​ under the impression that tablespace commands resulted in
> > checkpoints so that the state of the file system could be presumed
> > current...
>
> That actually doesn't really make it any better - it forces the *latest*
> checkpoint, but if we can't read that, we'll start with the previous
> one...


> > I don't know enough internals but its seems like we'd need to distinguish
> > between an interrupted checkpoint (pull the plug during checkpoint) and
> one
> > that supposedly completed without interruption but then was somehow
> > corrupted (solar flares).  The former seem legitimate for auto-skip while
> > the later do not.
>
> I don't think such a distinction is really possible (or necessary). If
> pg_control is corrupted we won't even start, and if WAL is corrupted
> that badly we won't finish replay...
>

My takeaway from the above is that we should only record what we think is a
usable/readable/valid checkpoint location to "Latest checkpoint location"​
​
​ (LCL) and if the system is not able to use that information to perform a
successful recovery it should be allowed to die without using the value in
"Previous checkpoint location" - which becomes effectively ignored during
master recovery.

​David J.
​