Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-17 Thread Kyotaro HORIGUCHI
Thank you for committing.

 On 02/14/2014 10:38 AM, Kyotaro HORIGUCHI wrote:
  Finally, the attached patch differs from your last patch only in the
  styling issue mentioned above. It applies to current HEAD, and I
  confirmed that it fixes this issue, but I have not exercised the
  lastSourceFailed path; simple file removal cannot reach it.
 
 Ok, applied. Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-14 Thread Kyotaro HORIGUCHI
Hello,

Before taking up the topic..

At Thu, 13 Feb 2014 19:45:38 +0200, Heikki Linnakangas wrote
 On 02/13/2014 06:47 PM, Heikki Linnakangas wrote:
  On 02/13/2014 02:42 PM, Heikki Linnakangas wrote:
  The behavior where we prefer a segment from archive with lower TLI
  over
  a file with higher TLI in pg_xlog actually changed in commit
  a068c391ab0. Arguably changing it wasn't a good idea, but the problem
  your test script demonstrates can be fixed by not archiving the
  partial
  segment, with no change to the preference of archive/pg_xlog. As
  discussed, archiving a partial segment seems like a bad idea anyway,
  so
  let's just stop doing that.

It surely makes things simpler, and I rather like the idea. But as
long as the final, possibly partial, segment of the lower TLI is
actually created, and the recovery mechanism lets users request
operations that need such segments (recovery_target_timeline does
this), a perfect archive - one that can cover all kinds of restore
operations - may necessarily contain such duplicate segments, I
believe. Besides, I suppose that policy would make operations around
archive/restore much more difficult: DBAs would be stuck with the
tedious work of picking out of the haystack only the segments actually
needed for the recovery at hand. It sounds somewhat gloomy..

# However, I also doubt the appropriateness of stockpiling archived
# segments spanning so many timelines; two generations are enough to
# cause this issue.
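
(As an illustration of such an operation - the archive path and target
time below are only placeholders - a point-in-time recovery that
deliberately stays on the old timeline might use a recovery.conf like:

    # recovery.conf - recover along the old timeline to a past point
    restore_command = 'cp /mnt/server/archivedir/%f %p'
    recovery_target_timeline = '1'
    recovery_target_time = '2014-02-13 12:00:00'

If the target falls within the last, partial segment of timeline 1,
recovery needs that segment to be restorable from somewhere.)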

Anyway, returning to the topic,

  After some further thought, while not archiving the partial segment
  fixes your test script, it's not enough to fix all variants of the
  problem. Even if archive recovery doesn't archive the last, partial,
  segment, if the original master server is still running, it's entirely
  possible that it fills the segment and archives it. In that case,
  archive recovery will again prefer the archived segment with lower TLI
  over the segment with newer TLI in pg_xlog.

Yes, that is a generalized description of the case I mentioned.
(Though I had not reached that thought myself :)

  So I agree we should commit the patch you posted (or something to that
  effect). The change to not archive the last segment still seems like a
  good idea, but perhaps we should only do that in master.

My opinion on duplicate segments on older timelines is as
described above.

 To draw this to conclusion, barring any further insights to this, I'm
 going to commit the attached patch to master and REL9_3_STABLE. Please
 have a look at the patch, to see if I'm missing something. I modified
 the state machine to skip over XLOG_FROM_XLOG state, if reading in
 XLOG_FROM_ARCHIVE failed; otherwise you first scan the archive and
 pg_xlog together, and then pg_xlog alone, which is pointless.
 
 In master, I'm also going to remove the archive last segment on old
 timeline code.

Thank you for finishing the patch. I had not thought about the
behavior after an XLOG_FROM_ARCHIVE failure. With your change the
state machine avoids the extra, pointless round. The recovery process
can now grab the segment on the highest (expected) TLI among those
with the same segment id, regardless of their location. I think
recovery will cope with the perfect archives described above for all
types of recovery operation. The state machine loop that still allows
a fallback from archive to pg_xlog now seems somewhat more complicated
than needed, but it does no harm either.

Though, this line, which was in my original patch,

  readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
currentSource == XLOG_FROM_ARCHIVE ? 
 XLOG_FROM_ANY : currentSource);

sticks out far past the line-wrapping boundary and looks somewhat
dirty :(

Also, the conditional operator makes the meaning of XLOG_FROM_ARCHIVE
and XLOG_FROM_ANY a bit confusing, but I failed to unify them to
either side, so it is left as is..

Finally, the attached patch differs from your last patch only in the
styling issue mentioned above. It applies to current HEAD, and I
confirmed that it fixes this issue, but I have not exercised the
lastSourceFailed path; simple file removal cannot reach it.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 508970a..85a0ce9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11006,17 +11006,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	/*---
 	 * Standby mode is implemented by a state machine:
 	 *
-	 * 1. Read from archive (XLOG_FROM_ARCHIVE)
-	 * 2. Read from pg_xlog (XLOG_FROM_PG_XLOG)
-	 * 3. Check trigger file
-	 * 4. Read from primary server via walreceiver (XLOG_FROM_STREAM)
-	 * 5. Rescan timelines
-	 * 6. Sleep 5 seconds, and loop back to 1.
+	 * 1. Read from either archive or pg_xlog (XLOG_FROM_ARCHIVE), or just
+	 *pg_xlog (XLOG_FROM_XLOG)
+	 * 

Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-14 Thread Heikki Linnakangas

On 02/14/2014 10:38 AM, Kyotaro HORIGUCHI wrote:

Finally, the attached patch differs from your last patch only in the
styling issue mentioned above. It applies to current HEAD, and I
confirmed that it fixes this issue, but I have not exercised the
lastSourceFailed path; simple file removal cannot reach it.


Ok, applied. Thanks!

- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Heikki Linnakangas

On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:

Wouldn't it be better to not archive the old segment, and instead
switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?


It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.


Ok, I think we're in agreement that that's the way to go for master.


I was almost inclined to that, but thinking the idea over -
considering recovery up to a target timeline - the old segment turns
out to be necessary for that case. Without the old segment, we would
be obliged to seek to the first segment of the *next* timeline (is
there any (simple) means to predict where that is?) to complete the
task.


How did the server that created the new timeline get the old, partial, 
segment? Was it already archived? Or did the DBA copy it into pg_xlog 
manually? Or was it streamed by streaming replication? Whatever the 
mechanism, the same mechanism ought to make sure the old segment is 
available for PITR, too.


Hmm. If you have set up streaming replication and a WAL archive, and 
your master dies and you fail over to a standby, what you describe does 
happen. The partial old segment is not in the archive, so you cannot 
PITR to a point in the old timeline that falls within the partial 
segment (ie. just before the failover). However, it's not guaranteed 
that all the preceding WAL segments on the old timeline were already 
archived, anyway, so even if the partial segment is archived, it's not 
guaranteed to work.


The old master is responsible for archiving the WAL on the old timeline, 
and the new master is responsible for archiving all the WAL on the new 
timeline. That's a straightforward, easy-to-understand rule. It might be 
useful to have a mode where the standby also archives all the received 
WAL, but that would need to be a separate option.



Is it right that we kick the older one out of the archive?


If it's already in the archive, it's not going to be removed from the 
archive.


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Heikki Linnakangas

On 02/12/2014 01:24 PM, Christoph Berg wrote:

Re: Heikki Linnakangas 2014-01-13 52d3caff.3010...@vmware.com

Actually, why is the partially-filled 00010002 file
archived in the first place? Looking at the code, it's been like that
forever, but it seems like a bad idea. If the original server is still
up and running, and writing more data to that file, what will happen is
that when the original server later tries to archive it, it will fail
because the partial version of the file is already in the archive. Or
worse, the partial version overwrites a previously archived more
complete version.


Oh!  This explains some transient errors I've seen.


Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?


It would be better to zero-fill and switch segments, yes.  We should
NEVER be in a position of archiving two different versions of the same
segment.


Ok, I think we're in agreement that that's the way to go for master.

Now, what to do about back-branches? On one hand, I'd like to apply
the same fix to all stable branches, as the current behavior is
silly and always has been. On the other hand, we haven't heard any
complaints about it, so we probably shouldn't fix what ain't broken.
Perhaps we should apply it to 9.3, as that's where we have the acute
problem the OP reported. Thoughts?

In summary, I propose that we change master and REL9_3_STABLE to not
archive the partial segment from previous timeline. Older branches
will keep the current behavior.


I've seen the "can't archive file from the old timeline" problem on
9.2 and 9.3 slaves after promotion. The problem occurs in conjunction with
the proposed archive_command in the default postgresql.conf comments:

# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
remove the "test ! -f" part. (An alternative fix would be to declare
the behavior ok and adjust that example in the config.)


Hmm, the behavior is the same in 9.1 and 9.2. Did you use a different 
archive_command in 9.1, without the test?


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Kyotaro HORIGUCHI
Hello, I might have misunderstood your words.

At Thu, 13 Feb 2014 10:11:22 +0200, Heikki Linnakangas 
hlinnakan...@vmware.com wrote in 52fc7e2a.9060...@vmware.com
 On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:
  Wouldn't it be better to not archive the old segment, and instead
  switch
  to a new segment after writing the end-of-recovery checkpoint, so that
  the segment on the new timeline is archived sooner?
 
  It would be better to zero-fill and switch segments, yes. We should
  NEVER be in a position of archiving two different versions of the same
  segment.
 
  Ok, I think we're in agreement that that's the way to go for master.

Does this mean that no mechanical solution for this situation will be
provided, and that an operator should remove the older segment for the
same segment id before starting the recovery process?

  I was almost inclined to that, but thinking the idea over -
  considering recovery up to a target timeline - the old segment turns
  out to be necessary for that case. Without the old segment, we would
  be obliged to seek to the first segment of the *next* timeline (is
  there any (simple) means to predict where that is?) to complete the
  task.
 
 How did the server that created the new timeline get the old, partial,
 segment? Was it already archived? Or did the DBA copy it into pg_xlog
 manually? Or was it streamed by streaming replication? Whatever the
 mechanism, the same mechanism ought to make sure the old segment is
 available for PITR, too.

Sure.

 Hmm. If you have set up streaming replication and a WAL archive, and
 your master dies and you fail over to a standby, what you describe
 does happen. The partial old segment is not in the archive, so you
 cannot PITR to a point in the old timeline that falls within the
 partial segment (ie. just before the failover). However, it's not
 guaranteed that all the preceding WAL segments on the old timeline
 were already archived, anyway, so even if the partial segment is
 archived, it's not guaranteed to work.

Yes. Putting aside corrupt or vanished segments in the archive, I
understand that a master and standby pair (or a standby and its
cascaded standby, and so on) can share one WAL archive; otherwise, the
archived WAL segments plus all the WAL segments left unarchived in the
old master's pg_xlog should be merged into the new master's (the
promoted old standby's) WAL archive, to keep an online backup taken
from the old master usable. Even with a shared WAL archive, segments
missing from the archive would have to be filled in from pg_xlog,
though. Nevertheless, that process can be automated.

The test script at the start of this thread is for the shared-archive
case, and I had unconsciously assumed that context.


 The old master is responsible for archiving the WAL on the old
 timeline, and the new master is responsible for archiving all the WAL
 on the new timeline. That's a straightforward, easy-to-understand
 rule.

Yes, I was somewhat confused because of my assumption of a shared
archive, but everything can indeed be consolidated into a single
archive, and older versions of PostgreSQL could cope with that
situation.

  It might be useful to have a mode where the standby also
 archives all the received WAL, but that would need to be a separate
 option.

Perhaps such a mechanism is not really in demand :)

  Is it right that we kick the older one out of the archive?
 
 If it's already in the archive, it's not going to be removed from the
 archive.

I had understood the conclusion so far to be that the older segment
is not archived at promotion, but as you suggest, that story seems a
bit odd. If the conclusion here is instead that no aid is provided, as
I now understand it, could you tell me why recovery was changed to
prefer the archive over pg_xlog?

The commit abf5c5c9a4 seems to change the behavior, but I couldn't
find the reason for the change.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Heikki Linnakangas

On 02/13/2014 01:37 PM, Kyotaro HORIGUCHI wrote:

At Thu, 13 Feb 2014 10:11:22 +0200, Heikki Linnakangas hlinnakan...@vmware.com 
wrote in 52fc7e2a.9060...@vmware.com

On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:

Wouldn't it be better to not archive the old segment, and instead
switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?


It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.


Ok, I think we're in agreement that that's the way to go for master.


Does this mean that no mechanical solution for this situation will be
provided, and that an operator should remove the older segment for the
same segment id before starting the recovery process?


I don't understand. With the proposed change to never archive a partial 
segment, there is no segment with the same filename in the archive. 
There is nothing to remove. Or did you mean that you have to do that in 
9.0-9.2 back-branches, if we only back-patch this change to 9.3?



If it's already in the archive, it's not going to be removed from the
archive.


I had understood the conclusion so far to be that the older segment
is not archived at promotion, but as you suggest, that story seems a
bit odd. If the conclusion here is instead that no aid is provided, as
I now understand it, could you tell me why recovery was changed to
prefer the archive over pg_xlog?

The commit abf5c5c9a4 seems to change the behavior, but I couldn't
find the reason for the change.


Well, it didn't work too well before abf5c5c9a4 either. With your test 
script, and commit abf5c5c9a4 reverted [1], you got this:


### STAGE 3/3: 2/2 START SERVER IN ARCHIVE RECOVERY MODE 2ND RUN ###
waiting for server to start
LOG:  database system was interrupted;
last known up at 2014-02-13 14:17:53 EET

LOG:  starting archive recovery
LOG:  restored log file 0002.history from archive
/bin/cp: cannot stat 
‘/home/heikki/pgsql.93stable/test/arc/00020002’: No such 
file or directory

LOG:  restored log file 00010002 from archive
LOG:  record with zero length at 0/290
LOG:  invalid primary checkpoint record
/bin/cp: cannot stat 
‘/home/heikki/pgsql.93stable/test/arc/00020002’: No such 
file or directory

LOG:  restored log file 00010002 from archive
LOG:  using previous checkpoint record at 0/228
LOG:  consistent recovery state reached at 0/228
LOG:  record with zero length at 0/290
LOG:  redo is not required
/bin/cp: cannot stat 
‘/home/heikki/pgsql.93stable/test/arc/00020002’: No such 
file or directory

LOG:  restored log file 00010002 from archive
/bin/cp: cannot stat 
‘/home/heikki/pgsql.93stable/test/arc/0003.history’: No such file or 
directory

LOG:  selected new timeline ID: 3
LOG:  restored log file 0002.history from archive
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Note the "invalid primary checkpoint record" message. It restored the 
partial 00010002 file from the archive, and tried to 
recover using that instead of the 00020002 file from 
pg_xlog. The primary checkpoint record is not present in 
00010002, so it fell back to recover from the previous 
checkpoint.


It was the exact same issue, even though it appeared to work if you 
don't look carefully. The second recovery never replayed the WAL 
generated on timeline 2, it branched off directly from timeline 1, at 
the same point where timeline 2 did. You can see that by looking at the 
history files. Also, if you create a table and insert a row to it after 
the first recovery, just before killing the server, you'll see that the 
row is not there after the second recovery. That's wrong; the data files 
might already contain some changes from timeline 2, and you end up with 
a corrupt database. Thanks to commit abf5c5c9a4, the situation is at 
least detected, and the system refuses to start up.


The behavior where we prefer a segment from archive with lower TLI over 
a file with higher TLI in pg_xlog actually changed in commit 
a068c391ab0. Arguably changing it wasn't a good idea, but the problem 
your test script demonstrates can be fixed by not archiving the partial 
segment, with no change to the preference of archive/pg_xlog. As 
discussed, archiving a partial segment seems like a bad idea anyway, so 
let's just stop doing that.



[1] just applying commit abf5c5c9a4 in reverse doesn't apply cleanly due 
to further changes to the code, but I did this to simulate the old behavior:


diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c

index 6a4a504..352d3f7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4366,6 +4366,7 @@ readRecoveryCommandFile(void)

/* Enable fetching 

Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Christoph Berg
Re: Heikki Linnakangas 2014-02-13 52fc9468.4050...@vmware.com
 With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
 remove the test ! -f part. (An alternative fix would be to declare
 the behavior ok and adjust that example in the config.)
 
 Hmm, the behavior is the same in 9.1 and 9.2. Did you use a
 different archive_command in 9.1, without the test?

The behavior is different: 9.1 doesn't try to re-archive the last
segment from the old timeline. (I've seen it happen consistently in
training workshops, which run on mostly idle machines. Possibly 9.1
will re-archive the last segment under load.)

I've just reproduced it here with these settings (in Debian's
/etc/postgresql-common/createcluster.conf syntax):
archive_mode = on
archive_command = 'test ! -f /tmp/pg-%v/%%f && cp %%p /tmp/pg-%v/%%f'
wal_level = hot_standby
max_wal_senders = 3

... pg_basebackup ...

recovery.conf:
restore_command = 'cp /tmp/pg-9.3/%f %p'

9.1 slave log:

LOG:  consistent recovery state reached at 0/2D8
cp: cannot stat '/tmp/pg-9.1/00010003': No such file or directory
LOG:  could not open file "pg_xlog/00010003" (log file 0, segment 3): No such file or directory
LOG:  redo done at 0/2D8
LOG:  restored log file "00010002" from archive
cp: cannot stat '/tmp/pg-9.1/0002.history': No such file or directory
LOG:  selected new timeline ID: 2
cp: cannot stat '/tmp/pg-9.1/0001.history': No such file or directory
LOG:  archive recovery complete
LOG:  incomplete startup packet
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
(eof)

ls /tmp/pg-9.1
00010001  00010002.0020.backup 
00020003
00010002  00020002 0002.history

9.2 slave log:

LOG:  consistent recovery state reached at 0/2E0
cp: cannot stat '/tmp/pg-9.2/00010003': No such file or directory
LOG:  could not open file "pg_xlog/00010003" (log file 0, segment 3): No such file or directory
LOG:  redo done at 0/2E0
LOG:  restored log file "00010002" from archive
cp: cannot stat '/tmp/pg-9.2/0002.history': No such file or directory
LOG:  selected new timeline ID: 2
cp: cannot stat '/tmp/pg-9.2/0001.history': No such file or directory
LOG:  archive recovery complete
LOG:  incomplete startup packet
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
LOG:  archive command failed with exit code 1
DETAIL:  The failed archive command was: test ! -f /tmp/pg-9.2/00010002 && cp pg_xlog/00010002 /tmp/pg-9.2/00010002
LOG:  archive command failed with exit code 1
...

ls /tmp/pg-9.2
00010001  00010002 
00010002.0020.backup

The 9.3 log/ls looks the same.

Kind regards,
Christoph Berg
-- 
Senior Berater, Tel.: +49 (0)21 61 / 46 43-187
credativ GmbH, HRB Mönchengladbach 12080, USt-ID-Nummer: DE204566209
Hohenzollernstr. 133, 41061 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
pgp fingerprint: 5C48 FE61 57F4 9179 5970  87C6 4C5A 6BAB 12D2 A7AE




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Heikki Linnakangas

On 02/13/2014 02:42 PM, Christoph Berg wrote:

Re: Heikki Linnakangas 2014-02-13 52fc9468.4050...@vmware.com

With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
remove the test ! -f part. (An alternative fix would be to declare
the behavior ok and adjust that example in the config.)


Hmm, the behavior is the same in 9.1 and 9.2. Did you use a
different archive_command in 9.1, without the test?


The behavior is different: 9.1 doesn't try to re-archive the last
segment from the old timeline. (I've seen it always happen in training
workshops, so that's always mostly idle machines. Possibly 9.1 will
re-archive the last segment under load.)

I've just reproduced it here with these settings (in Debian's
/etc/postgresql-common/createcluster.conf syntax):


Ah, I think I see what's going on. 9.2 changed the archive recovery 
behavior so that the restored WAL segments are copied into pg_xlog, so 
that you can recover the standby back to the same point after restart, 
even if the archive later stops working (also needed for cascading 
standbys). In 9.1, we only attempted to archive the last segment from 
previous timeline if it wasn't restored from archive. In 9.2, it's 
always archived.


I was testing this with streaming replication; 9.1 and 9.2 behave the 
same in that scenario. But they differ when doing archive recovery.


Is this an argument for back-patching the "don't archive last segment 
from old timeline" patch to 9.2 and 9.3, but leaving 9.1 alone? You have 
the same problem with 9.1 and streaming replication, but no-one's 
complained..


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Christoph Berg
Re: Heikki Linnakangas 2014-02-13 52fcca40.3060...@vmware.com
 I was testing this with streaming replication; 9.1 and 9.2 behave
 the same in that scenario. But they differ when doing archive
 recovery.
 
 Is this an argument for back-patching the don't archive last
 segment from old timeline patch to 9.2 and 9.3, but leaving 9.1
 alone? You have the same problem with 9.1 and streaming replication,
 but no-one's complained..

I can't say what would be better from a technical viewpoint. For me,
this is literally an academic problem people run into when they try
replication during the PostgreSQL course, using the archive_command
example from postgresql.conf. I guess if no one else has complained
yet, there's no reason to touch the back branches.

Is removing the "test ! -f" part and hence overwriting files in the
archive safe, i.e. are the files the same?

Kind regards,
Christoph Berg
-- 
Senior Berater, Tel.: +49 (0)21 61 / 46 43-187
credativ GmbH, HRB Mönchengladbach 12080, USt-ID-Nummer: DE204566209
Hohenzollernstr. 133, 41061 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
pgp fingerprint: 5C48 FE61 57F4 9179 5970  87C6 4C5A 6BAB 12D2 A7AE




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Heikki Linnakangas

On 02/13/2014 03:53 PM, Christoph Berg wrote:

Is removing the test ! -f part and hence overwriting files in the
archive safe, i.e. are the files the same?


No. Not in general, anyway. If the old master keeps running, even for a 
moment, after the partial file was copied, it will have created more WAL 
records which are present in the old master's copy of the segment, but 
not in the standby's.


You could perhaps use diff as part of archive_command, to check if the 
archived version is identical to the one about to be archived.
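
(As a sketch of that idea - the script name and archive path here are
made up, not anything shipped with PostgreSQL - a wrapper could copy
the segment if it is absent, succeed silently if an identical copy is
already archived, and refuse to overwrite otherwise:

    #!/bin/sh
    # safe-archive.sh: hypothetical helper, called as 'safe-archive.sh %p %f'
    ARCHIVEDIR=/mnt/server/archivedir

    if [ ! -f "$ARCHIVEDIR/$2" ]; then
        exec cp "$1" "$ARCHIVEDIR/$2"   # not archived yet: copy it
    elif cmp -s "$1" "$ARCHIVEDIR/$2"; then
        exit 0                          # identical copy already archived
    else
        echo "$2 already archived with different contents" >&2
        exit 1                          # contents differ: refuse to overwrite
    fi

with archive_command = '/path/to/safe-archive.sh %p %f'.)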


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Christoph Berg
Re: Heikki Linnakangas 2014-02-13 52fcd02c.3060...@vmware.com
 Is removing the test ! -f part and hence overwriting files in the
 archive safe, i.e. are the files the same?
 
 No. Not in general, anyway. If the old master keeps running, even
 for a moment, after the partial file was copied, it will have
 created more WAL records which are present in the old master's copy
 of the segment, but not in the standby's.

So why does PostgreSQL then ever try to archive something from the old
timeline? Isn't overwriting files on the old timeline just a very bad
idea in general? If so, it shouldn't be the responsibility of
archive_command to prevent this, but the server shouldn't be trying
that in the first place.

 You could perhaps use diff as part of archive_command, to check if
 archived version is identical to the one about to be archived.

Well, if that makes it into the default postgresql.conf (and docs)...

OTOH, it's still not nice that this could be an archive command that
fails, i.e. requires admin intervention before archiving can proceed.

Kind regards,
Christoph Berg
-- 
Senior Berater, Tel.: +49 (0)21 61 / 46 43-187
credativ GmbH, HRB Mönchengladbach 12080, USt-ID-Nummer: DE204566209
Hohenzollernstr. 133, 41061 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
pgp fingerprint: 5C48 FE61 57F4 9179 5970  87C6 4C5A 6BAB 12D2 A7AE




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Heikki Linnakangas

On 02/13/2014 04:07 PM, Christoph Berg wrote:

Re: Heikki Linnakangas 2014-02-13 52fcd02c.3060...@vmware.com

Is removing the test ! -f part and hence overwriting files in the
archive safe, i.e. are the files the same?


No. Not in general, anyway. If the old master keeps running, even
for a moment, after the partial file was copied, it will have
created more WAL records which are present in the old master's copy
of the segment, but not in the standby's.


So why does PostgreSQL then ever try to archive something from the old
timeline? Isn't overwriting files on the old timeline just a very bad
idea in general?


Yes.


If so, it shouldn't be the responsibility of
archive_command to prevent this, but the server shouldn't be trying
that in the first place.


Totally agreed. That's why I'm proposing that we stop doing that :-).

It's been a bad idea from day one, but I'm a bit hesitant to back-patch 
until someone actually complains. There have been subtle changes to 
archive recovery behavior in each release, so I'm worried that the old 
behavior was in fact useful to someone in some corner case that we're 
missing.


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Heikki Linnakangas

On 02/13/2014 02:42 PM, Heikki Linnakangas wrote:

The behavior where we prefer a segment from archive with lower TLI over
a file with higher TLI in pg_xlog actually changed in commit
a068c391ab0. Arguably changing it wasn't a good idea, but the problem
your test script demonstrates can be fixed by not archiving the partial
segment, with no change to the preference of archive/pg_xlog. As
discussed, archiving a partial segment seems like a bad idea anyway, so
let's just stop doing that.


After some further thought, while not archiving the partial segment 
fixes your test script, it's not enough to fix all variants of the 
problem. Even if archive recovery doesn't archive the last, partial, 
segment, if the original master server is still running, it's entirely 
possible that it fills the segment and archives it. In that case, 
archive recovery will again prefer the archived segment with lower TLI 
over the segment with newer TLI in pg_xlog.


So I agree we should commit the patch you posted (or something to that 
effect). The change to not archive the last segment still seems like a 
good idea, but perhaps we should only do that in master.


Even if after that patch, you can have a problem in more complicated 
scenarios involving both an archive and streaming replication. For 
example, imagine a timeline history like this:


TLI

1 +---
  |
2 +---


Now imagine that timeline 1 has been fully archived, and there are WAL 
segments much higher than the points where the timeline switch occurred 
present in the archive. But none of the WAL segments for timeline 2 have 
been archived; they are only present in a master server. You want to 
perform recovery to timeline 2, using the archived WAL segments for 
timelines 1, and streaming replication to catch up to the tip of timeline 2.


Whether we prefer files from pg_xlog or archive will make no difference 
in this case, as there are no files in pg_xlog. But it will merrily 
apply all the WAL for timeline 1 from the archive that it can find, past 
the timeline switch point. After that, when it tries to connect to the 
server with streaming replication, it will fail.


There's not much we can do about that in 9.2 and below, but in 9.3 the 
timeline history file contains the exact timeline switch points, so we 
could be more careful and not apply any extra WAL on the old timeline 
past the switch point. We could also be more exact in which files we try 
to restore from the archive, instead of just polling every future TLI in 
the history.
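
(For reference, a timeline history file is plain text with one line
per ancestor timeline: the parent TLI, the exact switch LSN, and a
reason. A hypothetical 00000002.history - the path and values below
are invented - might read:

    $ cat /mnt/server/archivedir/00000002.history
    1       0/9000060       no recovery target specified

so 9.3's recovery code can tell exactly where timeline 2 branched off
timeline 1.)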


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-13 Thread Heikki Linnakangas

On 02/13/2014 06:47 PM, Heikki Linnakangas wrote:

On 02/13/2014 02:42 PM, Heikki Linnakangas wrote:

The behavior where we prefer a segment from archive with lower TLI over
a file with higher TLI in pg_xlog actually changed in commit
a068c391ab0. Arguably changing it wasn't a good idea, but the problem
your test script demonstrates can be fixed by not archiving the partial
segment, with no change to the preference of archive/pg_xlog. As
discussed, archiving a partial segment seems like a bad idea anyway, so
let's just stop doing that.


After some further thought, while not archiving the partial segment
fixes your test script, it's not enough to fix all variants of the
problem. Even if archive recovery doesn't archive the last, partial,
segment, if the original master server is still running, it's entirely
possible that it fills the segment and archives it. In that case,
archive recovery will again prefer the archived segment with lower TLI
over the segment with newer TLI in pg_xlog.

So I agree we should commit the patch you posted (or something to that
effect). The change to not archive the last segment still seems like a
good idea, but perhaps we should only do that in master.


To draw this to conclusion, barring any further insights to this, I'm 
going to commit the attached patch to master and REL9_3_STABLE. Please 
have a look at the patch, to see if I'm missing something. I modified 
the state machine to skip over XLOG_FROM_XLOG state, if reading in 
XLOG_FROM_ARCHIVE failed; otherwise you first scan the archive and 
pg_xlog together, and then pg_xlog alone, which is pointless.


In master, I'm also going to remove the "archive last segment on old 
timeline" code.


- Heikki
From 15b0cadb0006b97f87257f4f903db8d33411ada9 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas heikki.linnakan...@iki.fi
Date: Thu, 13 Feb 2014 19:16:00 +0200
Subject: [PATCH 1/1] Change the order that pg_xlog and WAL archive are polled
 for WAL segments.

If there is a WAL segment with same ID but different TLI present in both
the WAL archive and pg_xlog, prefer the one with higher TLI. Before this
patch, the archive was polled first, for all expected TLIs, and only if no
file was found was pg_xlog scanned. This was a change in behavior from 9.2,
which first scanned archive and pg_xlog for the highest TLI, then archive
and pg_xlog for the next highest TLI and so forth. This patch reverts the
behavior back to what it was in 9.2.

The reason for this is that if for example you try to do archive recovery
to timeline 2, which branched off timeline 1, but the WAL for timeline 2 is
not archived yet, we would replay past the timeline switch point on
timeline 1 using the archived files, before even looking timeline 2's files
in pg_xlog

Report and patch by Kyotaro Horiguchi. Backpatch to 9.3 where the behavior
was changed.
---
 src/backend/access/transam/xlog.c | 22 +-
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a4a504..b2d8e1a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9576,17 +9576,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	/*---
 	 * Standby mode is implemented by a state machine:
 	 *
-	 * 1. Read from archive (XLOG_FROM_ARCHIVE)
-	 * 2. Read from pg_xlog (XLOG_FROM_PG_XLOG)
-	 * 3. Check trigger file
-	 * 4. Read from primary server via walreceiver (XLOG_FROM_STREAM)
-	 * 5. Rescan timelines
-	 * 6. Sleep 5 seconds, and loop back to 1.
+	 * 1. Read from either archive or pg_xlog (XLOG_FROM_ARCHIVE), or just
+	 *pg_xlog (XLOG_FROM_XLOG)
+	 * 2. Check trigger file
+	 * 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)
+	 * 4. Rescan timelines
+	 * 5. Sleep 5 seconds, and loop back to 1.
 	 *
 	 * Failure to read from the current source advances the state machine to
-	 * the next state. In addition, successfully reading a file from pg_xlog
-	 * moves the state machine from state 2 back to state 1 (we always prefer
-	 * files in the archive over files in pg_xlog).
+	 * the next state.
 	 *
 	 * 'currentSource' indicates the current state. There are no currentSource
 	 * values for check trigger, rescan timelines, and sleep states,
@@ -9614,9 +9612,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			switch (currentSource)
 			{
 case XLOG_FROM_ARCHIVE:
-	currentSource = XLOG_FROM_PG_XLOG;
-	break;
-
 case XLOG_FROM_PG_XLOG:
 
 	/*
@@ -9781,7 +9776,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
  * Try to restore the file from archive, or read an existing
  * file from pg_xlog.
  */
-readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, currentSource);
+readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
+			  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY : currentSource);
 if (readFile >= 0)
 	return true;	/* success! */
 
-- 

Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-12 Thread Tomonari Katsumata

Hi Heikki,

I need a PostgreSQL 9.3 release with this problem fixed.

It didn't happen in PostgreSQL 9.2, so I agree
with your proposal to make the changes
against 93_STABLE and master.

Can you fix this in the next release (9.3.3)?


Tomonari Katsumata

(2014/01/13 20:16), Heikki Linnakangas wrote:
 On 01/09/2014 10:55 PM, Josh Berkus wrote:
 On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:
 Actually, why is the partially-filled 00010002 file
 archived in the first place? Looking at the code, it's been like that
 forever, but it seems like a bad idea. If the original server is still
 up and running, and writing more data to that file, what will happen is
 that when the original server later tries to archive it, it will fail
 because the partial version of the file is already in the archive. Or
 worse, the partial version overwrites a previously archived more
 complete version.

 Oh!  This explains some transient errors I've seen.
 Wouldn't it be better to not archive the old segment, and instead switch
 to a new segment after writing the end-of-recovery checkpoint, so that
 the segment on the new timeline is archived sooner?

 It would be better to zero-fill and switch segments, yes. We should
 NEVER be in a position of archiving two different versions of the same
 segment.

 Ok, I think we're in agreement that that's the way to go for master.

 Now, what to do about back-branches? On one hand, I'd like to apply 
the same fix to all stable branches, as the current behavior is silly 
and always has been. On the other hand, we haven't heard any complaints 
about it, so we probably shouldn't fix what ain't broken. Perhaps we 
should apply it to 9.3, as that's where we have the acute problem the OP 
reported. Thoughts?


 In summary, I propose that we change master and REL9_3_STABLE to not 
archive the partial segment from previous timeline. Older branches will 
keep the current behavior.


 - Heikki







Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-12 Thread Christoph Berg
Re: Heikki Linnakangas 2014-01-13 52d3caff.3010...@vmware.com
 Actually, why is the partially-filled 00010002 file
 archived in the first place? Looking at the code, it's been like that
 forever, but it seems like a bad idea. If the original server is still
 up and running, and writing more data to that file, what will happen is
 that when the original server later tries to archive it, it will fail
 because the partial version of the file is already in the archive. Or
 worse, the partial version overwrites a previously archived more
 complete version.
 
 Oh!  This explains some transient errors I've seen.
 
 Wouldn't it be better to not archive the old segment, and instead switch
 to a new segment after writing the end-of-recovery checkpoint, so that
 the segment on the new timeline is archived sooner?
 
 It would be better to zero-fill and switch segments, yes.  We should
 NEVER be in a position of archiving two different versions of the same
 segment.
 
 Ok, I think we're in agreement that that's the way to go for master.
 
 Now, what to do about back-branches? On one hand, I'd like to apply
 the same fix to all stable branches, as the current behavior is
 silly and always has been. On the other hand, we haven't heard any
 complaints about it, so we probably shouldn't fix what ain't broken.
 Perhaps we should apply it to 9.3, as that's where we have the acute
 problem the OP reported. Thoughts?
 
 In summary, I propose that we change master and REL9_3_STABLE to not
 archive the partial segment from previous timeline. Older branches
 will keep the current behavior.

I've seen the "can't archive file from the old timeline" problem on
9.2 and 9.3 slaves after promotion. The problem occurs in conjunction with
the proposed archive_command in the default postgresql.conf comments:

# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
remove the "test ! -f" part. (An alternative fix would be to declare
the behavior ok and adjust that example in the config.)

I've always blamed 9.2+'s cascading replication for this, but haven't
investigated deeper.

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-02-12 Thread Kyotaro HORIGUCHI
Hello,

 I need PostgreSQL9.3 which fixed this problem.
 
 It didn't happen in PostgreSQL9.2, so I agree
 with your proposal which changes are done
 against 93_STABLE and master.
 
 Can you fix this in next release(9.3.3)?

I was going to push to move this forward a bit, but...

  Wouldn't it be better to not archive the old segment, and instead
  switch
  to a new segment after writing the end-of-recovery checkpoint, so that
  the segment on the new timeline is archived sooner?
 
  It would be better to zero-fill and switch segments, yes. We should
  NEVER be in a position of archiving two different versions of the same
  segment.
 
  Ok, I think we're in agreement that that's the way to go for master.

I was almost inclined to that, but thinking the idea over -
considering recovery up to a target timeline - the old segment turns
out to be necessary for that case. Without the old segment, we would
be obliged to seek to the first segment of the *next* timeline (is
there any (simple) means to predict where that is?) to complete the
task. Is it right that we kick the older one out of the archive?

  Now, what to do about back-branches? On one hand, I'd like to apply
  the same fix to all stable branches, as the current behavior is silly
  and always has been. On the other hand, we haven't heard any
  complaints about it, so we probably shouldn't fix what ain't
  broken. Perhaps we should apply it to 9.3, as that's where we have the
  acute problem the OP reported. Thoughts?
 
  In summary, I propose that we change master and REL9_3_STABLE to not
  archive the partial segment from previous timeline. Older branches
  will keep the current behavior.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-13 Thread Heikki Linnakangas

On 01/09/2014 10:55 PM, Josh Berkus wrote:

On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:


Actually, why is the partially-filled 00010002 file
archived in the first place? Looking at the code, it's been like that
forever, but it seems like a bad idea. If the original server is still
up and running, and writing more data to that file, what will happen is
that when the original server later tries to archive it, it will fail
because the partial version of the file is already in the archive. Or
worse, the partial version overwrites a previously archived more
complete version.


Oh!  This explains some transient errors I've seen.


Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?


It would be better to zero-fill and switch segments, yes.  We should
NEVER be in a position of archiving two different versions of the same
segment.


Ok, I think we're in agreement that that's the way to go for master.

Now, what to do about back-branches? On one hand, I'd like to apply the 
same fix to all stable branches, as the current behavior is silly and 
always has been. On the other hand, we haven't heard any complaints 
about it, so we probably shouldn't fix what ain't broken. Perhaps we 
should apply it to 9.3, as that's where we have the acute problem the OP 
reported. Thoughts?


In summary, I propose that we change master and REL9_3_STABLE to not 
archive the partial segment from previous timeline. Older branches will 
keep the current behavior.


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Tomonari Katsumata
Hi,

Is somebody reading this thread?

This problem still seems to remain on REL9_3_STABLE.
Many users could face it, so we should
resolve it in the next release.

I think his patch is a reasonable fix for this problem.

Please check this again.

regards,
--
Tomonari Katsumata



2013/12/12 Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp

 Hello, we happened to see a server crash on archive recovery under
 a certain condition.

 After the TLI is incremented, there can be a case where the WAL file
 for the older timeline is archived, but the file with the same
 segment id for the newer timeline is not. Archive recovery fails in
 that case with a PANIC error like the following:

 | PANIC: record with zero length at 0/1820D40

 A replay script is attached. This issue occurred on 9.4dev and 9.3.2,
 but not on 9.2.6 and 9.1.11; the latter two search pg_xlog for the
 TLI before trying the archive for older TLIs.

 This occurs while fetching the checkpoint redo record in archive
 recovery:

  if (checkPoint.redo < RecPtr)
  {
    /* back up to find the record */
    record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

 And this is caused by the segment file for the older timeline in the
 archive directory being preferred over the one for the newer timeline
 in pg_xlog.

 Looking into pg_xlog before trying the older TLIs in the archive, as
 9.2 and earlier do, fixes this issue. The attached patch is one
 possible solution for 9.4dev.

 Attached files are:

  - recvtest.sh: replay script. Steps 1 and 2 create the condition
    and step 3 triggers the issue.

  - archrecvfix_20131212.patch: the patch that fixes the issue. With
    it, archive recovery reads pg_xlog before trying an older TLI in
    the archive, similarly to 9.1 and earlier.

 regards,

 --
 Kyotaro Horiguchi
 NTT Open Source Software Center






Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Amit Langote
On Thu, Dec 12, 2013 at 11:00 AM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Hello, we happened to see a server crash on archive recovery under
 a certain condition.

 After the TLI is incremented, there can be a case where the WAL file
 for the older timeline is archived, but the file with the same
 segment id for the newer timeline is not. Archive recovery fails in
 that case with a PANIC error like the following:

 | PANIC: record with zero length at 0/1820D40

 A replay script is attached. This issue occurred on 9.4dev and 9.3.2,
 but not on 9.2.6 and 9.1.11; the latter two search pg_xlog for the
 TLI before trying the archive for older TLIs.

 This occurs while fetching the checkpoint redo record in archive
 recovery:

  if (checkPoint.redo < RecPtr)
  {
    /* back up to find the record */
    record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

 And this is caused by the segment file for the older timeline in the
 archive directory being preferred over the one for the newer timeline
 in pg_xlog.

 Looking into pg_xlog before trying the older TLIs in the archive, as
 9.2 and earlier do, fixes this issue. The attached patch is one
 possible solution for 9.4dev.

 Attached files are:

  - recvtest.sh: replay script. Steps 1 and 2 create the condition
    and step 3 triggers the issue.

  - archrecvfix_20131212.patch: the patch that fixes the issue. With
    it, archive recovery reads pg_xlog before trying an older TLI in
    the archive, similarly to 9.1 and earlier.

 regards,


Horiguchi-san,

Wonder if the following commit in 9.2 branch was to address a
similar/same problem?
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4aed94f1660fb55bc825bf7f3135379dab28eb55

--
Amit Langote




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Heikki Linnakangas

On 12/12/2013 04:00 AM, Kyotaro HORIGUCHI wrote:

Hello, we happened to see a server crash on archive recovery under
a certain condition.

After the TLI is incremented, there can be a case where the WAL file
for the older timeline is archived, but the file with the same
segment id for the newer timeline is not. Archive recovery fails in
that case with a PANIC error like the following:

| PANIC: record with zero length at 0/1820D40

A replay script is attached. This issue occurred on 9.4dev and 9.3.2,
but not on 9.2.6 and 9.1.11; the latter two search pg_xlog for the
TLI before trying the archive for older TLIs.

This occurs while fetching the checkpoint redo record in archive
recovery:


if (checkPoint.redo < RecPtr)
{
/* back up to find the record */
record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);


Hmm. After running the script, pg_controldata says:

Latest checkpoint location:   0/290
Prior checkpoint location:0/290
Latest checkpoint's REDO location:0/290

The PANIC is actually coming from here:


/*
 * Re-fetch the last valid or last applied record, so we can identify the
 * exact endpoint of what we consider the valid portion of WAL.
 */
record = ReadRecord(xlogreader, LastRec, PANIC, false);


If recovery started from an online checkpoint, then I think you'd get a 
similar PANIC from the code you quoted.



And this is caused by the segment file for the older timeline in the
archive directory being preferred over the one for the newer timeline
in pg_xlog.


Yep.


Looking into pg_xlog before trying the older TLIs in the archive, as
9.2 and earlier do, fixes this issue. The attached patch is one
possible solution for 9.4dev.

Attached files are:

  - recvtest.sh: replay script. Steps 1 and 2 create the condition
    and step 3 triggers the issue.

  - archrecvfix_20131212.patch: the patch that fixes the issue. With
    it, archive recovery reads pg_xlog before trying an older TLI in
    the archive, similarly to 9.1 and earlier.


Hmm, that seems reasonable at a quick glance. I think it also needs a 
change to the state transition code earlier in the loop, to not move 
from XLOG_FROM_ARCHIVE to XLOG_FROM_PG_XLOG, if we've already tried 
reading pg_xlog.


Another way to look at this is that it's wrong that we immediately PANIC 
if we successfully restore a file from archive, but then fail to read 
the record we're looking for. Instead, we should advance to next state 
in the state machine, ie. try reading the same record from pg_xlog, and 
only give up if the read fails from all sources.


Yet another way to look at this is that we shouldn't even try to read 
the file with TLI 1 from the archive, when we know the checkpoint record 
is on timeline 2. It can't possibly work.


I've been thinking for some time that we should stop doing the scan of 
all possible TLIs, and only try to read the exact file that contains the 
record we're reading. We have that information now, in the timeline 
history file, and we already do that during streaming replication. I was 
afraid of changing the behavior of archive recovery in 9.4, but maybe 
that's the way to go in the long term.


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Heikki Linnakangas

On 01/09/2014 08:18 PM, Heikki Linnakangas wrote:

On 12/12/2013 04:00 AM, Kyotaro HORIGUCHI wrote:

Hello, we happened to see a server crash on archive recovery under
a certain condition.

After the TLI is incremented, there can be a case where the WAL file
for the older timeline is archived, but the file with the same
segment id for the newer timeline is not. Archive recovery fails in
that case with a PANIC error like the following:

| PANIC: record with zero length at 0/1820D40

A replay script is attached. This issue occurred on 9.4dev and 9.3.2,
but not on 9.2.6 and 9.1.11; the latter two search pg_xlog for the
TLI before trying the archive for older TLIs.

This occurs while fetching the checkpoint redo record in archive
recovery:


if (checkPoint.redo < RecPtr)
{
/* back up to find the record */
record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);


Hmm. After running the script, pg_controldata says:

Latest checkpoint location:   0/290
Prior checkpoint location:0/290
Latest checkpoint's REDO location:0/290

The PANIC is actually coming from here:


/*
 * Re-fetch the last valid or last applied record, so we can identify the
 * exact endpoint of what we consider the valid portion of WAL.
 */
record = ReadRecord(xlogreader, LastRec, PANIC, false);


If recovery started from an online checkpoint, then I think you'd get a
similar PANIC from the code you quoted.


And this is caused by the segment file for the older timeline in the
archive directory being preferred over the one for the newer timeline
in pg_xlog.


Yep.


Actually, why is the partially-filled 00010002 file 
archived in the first place? Looking at the code, it's been like that 
forever, but it seems like a bad idea. If the original server is still 
up and running, and writing more data to that file, what will happen is 
that when the original server later tries to archive it, it will fail 
because the partial version of the file is already in the archive. Or 
worse, the partial version overwrites a previously archived more 
complete version.


This is the code that does that:


/*
 * If we are establishing a new timeline, we have to copy data from the
 * last WAL segment of the old timeline to create a starting WAL segment
 * for the new timeline.
 *
 * Notify the archiver that the last WAL segment of the old timeline is
 * ready to copy to archival storage. Otherwise, it is not archived for a
 * while.
 */
if (endTLI != ThisTimeLineID)
{
	XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);

	if (XLogArchivingActive())
	{
		XLogFileName(xlogpath, endTLI, endLogSegNo);
		XLogArchiveNotify(xlogpath);
	}
}


So, the rationale is that otherwise it would take a long time until that 
segment is archived. To be precise, I don't think the segment with the 
old TLI would ever be archived without the above, but the same segment 
on the new timeline would, after it fills up.


Wouldn't it be better to not archive the old segment, and instead switch 
to a new segment after writing the end-of-recovery checkpoint, so that 
the segment on the new timeline is archived sooner?
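
To make the alternative concrete, something along these lines (glossing
over the details of StartupXLOG(); RequestXLogSwitch() is the existing
way to force a segment switch, the rest is only a sketch):

if (endTLI != ThisTimeLineID)
{
	/* Still seed the first segment of the new timeline from the old one, */
	XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);
	/* but don't hand the partial old-TLI segment to the archiver here. */
}

/* ... later, once the end-of-recovery checkpoint has been written ... */

/*
 * Force a segment switch so the first segment on the new timeline is
 * completed and archived promptly, instead of waiting for it to fill up.
 */
(void) RequestXLogSwitch();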


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 Actually, why is the partially-filled 00010002 file 
 archived in the first place? ...

 So, the rationale is that otherwise it would take a long time until that 
 segment is archived. To be precise, I don't think the segment with the 
 old TLI would ever be archived without the above, but the same segment 
 on the new timeline would, after it fills up.

 Wouldn't it be better to not archive the old segment, and instead switch 
 to a new segment after writing the end-of-recovery checkpoint, so that 
 the segment on the new timeline is archived sooner?

Don't we want to archive both?  If you want to recover to the end of the
old timeline, you're going to need that file too, no?

regards, tom lane




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Heikki Linnakangas

On 01/09/2014 10:16 PM, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

Actually, why is the partially-filled 00010002 file
archived in the first place? ...



So, the rationale is that otherwise it would take a long time until that
segment is archived. To be precise, I don't think the segment with the
old TLI would ever be archived without the above, but the same segment
on the new timeline would, after it fills up.



Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?


Don't we want to archive both?  If you want to recover to the end of the
old timeline, you're going to need that file too, no?


Hmm. It should be the responsibility of the original server to archive 
the segment on the old timeline. Mind you, partial segments are never 
archived, except for this one case, so how did the old segment find its 
way to the new server? A few possibilities come to mind: the DBA 
manually copied it from the old server to pg_xlog, it was streamed by 
streaming replication, or it was included in a base backup. The OP's 
test script resembles the base backup case.


In all of those cases, I don't think it's the new server's 
responsibility to archive it. If it was copied to pg_xlog manually, the 
administrator may also copy it to the archive if he feels like it. If it 
was streamed from a live server, the original server should take care of 
it. If it was included in a backup, well, it's included in the backup so 
it doesn't necessarily need to be archived.


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 01/09/2014 10:16 PM, Tom Lane wrote:
 Don't we want to archive both?  If you want to recover to the end of the
 old timeline, you're going to need that file too, no?

 Hmm. It should be the responsibility of the original server to archive 
 the segment on the old timeline.

Oh ... I was thinking about the case of a timeline switch within one
server, say as a result of PITR recovery.  If you're sure this same case
doesn't come up that way, then nevermind.

regards, tom lane




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Heikki Linnakangas

On 01/09/2014 10:36 PM, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

On 01/09/2014 10:16 PM, Tom Lane wrote:

Don't we want to archive both?  If you want to recover to the end of the
old timeline, you're going to need that file too, no?



Hmm. It should be the responsibility of the original server to archive
the segment on the old timeline.


Oh ... I was thinking about the case of a timeline switch within one
server, say as a result of PITR recovery.  If you're sure this same case
doesn't come up that way, then nevermind.


I'm not sure what you mean, but it does apply to same-server PITR 
recovery too. In that case, again, if you copy the partial segment to 
pg_xlog when doing the PITR recovery, clearly you already somehow have 
access to the partial segment, and if you want to do another PITR to the 
same point, you should just copy it again.


- Heikki




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Josh Berkus
On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:

 Actually, why is the partially-filled 00010002 file
 archived in the first place? Looking at the code, it's been like that
 forever, but it seems like a bad idea. If the original server is still
 up and running, and writing more data to that file, what will happen is
 that when the original server later tries to archive it, it will fail
 because the partial version of the file is already in the archive. Or
 worse, the partial version overwrites a previously archived more
 complete version.

Oh!  This explains some transient errors I've seen.

 Wouldn't it be better to not archive the old segment, and instead switch
 to a new segment after writing the end-of-recovery checkpoint, so that
 the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes.  We should
NEVER be in a position of archiving two different versions of the same
segment.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




[HACKERS] [BUG] Archive recovery failure on 9.3+.

2013-12-11 Thread Kyotaro HORIGUCHI
Hello, we happened to see a server crash on archive recovery under
some conditions.

After the TLI is incremented, it can happen that the WAL file for the
older timeline is archived while the file for the same segment on the
newer timeline is not. Archive recovery then fails in that case with a
PANIC error like the following:

| PANIC: record with zero length at 0/1820D40

A replay script is attached. This issue occurred on 9.4dev and 9.3.2,
but not on 9.2.6 and 9.1.11; the latter versions search pg_xlog for the
TLI before trying the archive for older TLIs.

This occurs while fetching the checkpoint redo record during archive
recovery.

 if (checkPoint.redo < RecPtr)
 {
   /* back up to find the record */
   record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

And this is caused by the segment file for the older timeline in the
archive directory being preferred over the one for the newer timeline in
pg_xlog.

Looking into pg_xlog before trying the older TLIs in the archive, as
9.2 and earlier do, fixes this issue. The attached patch is one possible
solution for 9.4dev.

Attached files are,

 - recvtest.sh: Replay script. Steps 1 and 2 create the condition
   and step 3 triggers the issue.

 - archrecvfix_20131212.patch: The patch fixes the issue. With this
   patch, archive recovery reads pg_xlog before trying an older TLI in
   the archive, similarly to 9.1 and earlier.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#!/bin/bash

ROOT=`pwd`
PGDATA=$ROOT/test/data
ARCHDIR=$ROOT/test/arc

if [ ! -d $ARCHDIR -o ! -d $PGDATA ]; then
echo "$PGDATA and/or $ARCHDIR not found"
exit
fi

echo "### EMPTY ARCHIVE DIRECTORY ###"
if [ -d $ARCHDIR ]; then rm -f $ARCHDIR/*; fi

echo "### EMPTY PGDATA DIRECTORY ###"
if [ -d $PGDATA ]; then rm -r $PGDATA/*; fi

echo "### DO INITDB ###"
initdb -D $PGDATA > /dev/null

echo "### set up postgresql.conf ###"
cat >> $PGDATA/postgresql.conf <<EOF
wal_level = archive
archive_mode = on
archive_command = '/bin/cp %p $ARCHDIR/%f'
# log_min_messages = debug5
EOF

echo "### STAGE 1/3 -- PUT XLOG ..001...001 AND ..002.HISTORY INTO ARCHIVE ###"
echo "### STAGE 1/3: 1/2 START SERVER ###"
pg_ctl start -D $PGDATA -w

echo "### STAGE 1/3: 2/2 STOP SERVER ###"
pg_ctl stop -D $PGDATA

echo "### STAGE 2/3 -- PUT XLOG ..002...001 INTO ONLY pg_xlog ###"
echo "### STAGE 2/3: 1/3 PREPARE recovery.conf ###"
cat >> $PGDATA/recovery.conf <<EOF
restore_command = '/bin/cp $ARCHDIR/%f %p'
EOF

echo "### STAGE 2/3: 2/3 START SERVER IN ARCHIVE RECOVERY MODE ###"
pg_ctl start -D $PGDATA -w

echo "### STAGE 2/3: 3/3 STOP SERVER IMMEDIATELY ###"
pg_ctl stop -m i -D $PGDATA

echo "### ls $ARCHDIR"
ls $ARCHDIR

echo "### ls $PGDATA/pg_xlog"
ls $PGDATA/pg_xlog


echo "### STAGE 3/3 - START SERVER IN ARCHIVE RECOVERY MODE AGAIN ###"
echo "### STAGE 3/3: 1/2 RESTORE recovery.conf ###"
mv $PGDATA/recovery.done $PGDATA/recovery.conf

echo "### STAGE 3/3: 2/2 START SERVER IN ARCHIVE RECOVERY MODE 2ND RUN ###"
pg_ctl start -D $PGDATA -w -t 2
if [ $? -ne 0 ]; then
  echo "### SERVER CRASHED ###"
  exit
fi
echo "### SERVER SEEMS SUCCESSFULLY UP. STOP IT. ###"
pg_ctl stop -D $PGDATA -w
echo "### SERVER DID NOT CRASH ###"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6fa5479..75be478 100755
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10935,10 +10935,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	curFileTLI = 0;
 
 /*
- * Try to restore the file from archive, or read an existing
- * file from pg_xlog.
+ * When XLOG_FROM_ARCHIVE, read xlog file with largest TLI
+ * preferring archive to pg_xlog. Or when XLOG_FROM_PG_XLOG,
+ * search only pg_xlog.
  */
-readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, currentSource);
+readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
+	currentSource == XLOG_FROM_ARCHIVE ?
+	XLOG_FROM_ANY : currentSource);
if (readFile >= 0)
 	return true;	/* success! */
 



Re: [HACKERS] bug of recovery?

2011-12-02 Thread Heikki Linnakangas

On 04.10.2011 09:43, Fujii Masao wrote:

On Mon, Oct 3, 2011 at 4:32 PM, Simon Riggs si...@2ndquadrant.com wrote:

I don't think this should use the rm_safe_restartpoint machinery. As you
said, it's not tied to any specific resource manager. And I've actually been
thinking that we will get rid of rm_safe_restartpoint altogether in the
future. The two things that still use it are the b-tree and gin, and I'd
like to change both of those to not require any post-recovery cleanup step
to finish multi-page operations, similar to what I did with GiST in 9.1.


I thought that was quite neat doing it that way, but there's no
specific reason to do it that way I guess. If you're happy to rewrite
the patch then I guess we're OK.

I certainly would like to get rid of rm_safe_restartpoint in the
longer term, hopefully sooner.


Though Heikki might be already working on that,...


Just haven't gotten around to it. It's a fair amount of work with little 
user-visible benefit.



anyway,
the attached patch is the version which doesn't use rm_safe_restartpoint
machinery.


Thanks, committed.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] bug of recovery?

2011-11-24 Thread Simon Riggs
On Tue, Oct 4, 2011 at 7:43 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Oct 3, 2011 at 4:32 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I don't think this should use the rm_safe_restartpoint machinery. As you
 said, it's not tied to any specific resource manager. And I've actually been
 thinking that we will get rid of rm_safe_restartpoint altogether in the
 future. The two things that still use it are the b-tree and gin, and I'd
 like to change both of those to not require any post-recovery cleanup step
 to finish multi-page operations, similar to what I did with GiST in 9.1.

 I thought that was quite neat doing it that way, but there's no
 specific reason to do it that way I guess. If you're happy to rewrite
 the patch then I guess we're OK.

 I certainly would like to get rid of rm_safe_restartpoint in the
 longer term, hopefully sooner.

 Though Heikki might be already working on that,... anyway,
 the attached patch is the version which doesn't use rm_safe_restartpoint
 machinery.


Heikki - I see you are down on the CF app to review this.

I'd been working on it as well, just forgot to let Greg know.

Did you start already? Should I stop?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] bug of recovery?

2011-10-04 Thread Fujii Masao
On Mon, Oct 3, 2011 at 4:32 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I don't think this should use the rm_safe_restartpoint machinery. As you
 said, it's not tied to any specific resource manager. And I've actually been
 thinking that we will get rid of rm_safe_restartpoint altogether in the
 future. The two things that still use it are the b-tree and gin, and I'd
 like to change both of those to not require any post-recovery cleanup step
 to finish multi-page operations, similar to what I did with GiST in 9.1.

 I thought that was quite neat doing it that way, but there's no
 specific reason to do it that way I guess. If you're happy to rewrite
 the patch then I guess we're OK.

 I certainly would like to get rid of rm_safe_restartpoint in the
 longer term, hopefully sooner.

Though Heikki might be already working on that,... anyway,
the attached patch is the version which doesn't use rm_safe_restartpoint
machinery.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 557,563  static TimeLineID lastPageTLI = 0;
  static XLogRecPtr minRecoveryPoint;		/* local copy of
   		 * ControlFile->minRecoveryPoint */
  static bool updateMinRecoveryPoint = true;
! static bool reachedMinRecoveryPoint = false;
  
  static bool InRedo = false;
  
--- 557,563 
  static XLogRecPtr minRecoveryPoint;		/* local copy of
   		 * ControlFile->minRecoveryPoint */
  static bool updateMinRecoveryPoint = true;
! bool reachedMinRecoveryPoint = false;
  
  static bool InRedo = false;
  
***
*** 6841,6852  StartupXLOG(void)
  		LocalXLogInsertAllowed = -1;
  
  		/*
- 		 * Check to see if the XLOG sequence contained any unresolved
- 		 * references to uninitialized pages.
- 		 */
- 		XLogCheckInvalidPages();
- 
- 		/*
  		 * Perform a checkpoint to update all our recovery activity to disk.
  		 *
  		 * Note that we write a shutdown checkpoint rather than an on-line
--- 6841,6846 
***
*** 6983,6988  CheckRecoveryConsistency(void)
--- 6977,6988 
  		XLByteLE(minRecoveryPoint, EndRecPtr) &&
  		XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  	{
+ 		/*
+ 		 * Check to see if the XLOG sequence contained any unresolved
+ 		 * references to uninitialized pages.
+ 		 */
+ 		XLogCheckInvalidPages();
+ 
  		reachedMinRecoveryPoint = true;
  		ereport(LOG,
  (errmsg("consistent recovery state reached at %X/%X",
***
*** 7974,7980  RecoveryRestartPoint(const CheckPoint *checkPoint)
  	volatile XLogCtlData *xlogctl = XLogCtl;
  
  	/*
! 	 * Is it safe to checkpoint?  We must ask each of the resource managers
  	 * whether they have any partial state information that might prevent a
  	 * correct restart from this point.  If so, we skip this opportunity, but
  	 * return at the next checkpoint record for another try.
--- 7974,7980 
  	volatile XLogCtlData *xlogctl = XLogCtl;
  
  	/*
! 	 * Is it safe to restartpoint?  We must ask each of the resource managers
  	 * whether they have any partial state information that might prevent a
  	 * correct restart from this point.  If so, we skip this opportunity, but
  	 * return at the next checkpoint record for another try.
***
*** 7994,7999  RecoveryRestartPoint(const CheckPoint *checkPoint)
--- 7994,8015 
  	}
  
  	/*
+ 	 * Is it safe to restartpoint?  We must check whether there are any
+ 	 * unresolved references to invalid pages that might prevent
+ 	 * a correct restart from this point.  If so, we skip this opportunity,
+ 	 * but return at the next checkpoint record for another try.
+ 	 */
+ 	if (have_invalid_pages())
+ 	{
+ 		elog(trace_recovery(DEBUG2),
+ 			 "could not record restart point at %X/%X because there "
+ 			 "are unresolved references to invalid pages",
+ 			 checkPoint->redo.xlogid,
+ 			 checkPoint->redo.xrecoff);
+ 		return;
+ 	}
+ 
+ 	/*
  	 * Copy the checkpoint record to shared memory, so that bgwriter can use
  	 * it the next time it wants to perform a restartpoint.
  	 */
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
***
*** 52,57  typedef struct xl_invalid_page
--- 52,73 
  static HTAB *invalid_page_tab = NULL;
  
  
+ /* Report a reference to an invalid page */
+ static void
+ report_invalid_page(int elevel, RelFileNode node, ForkNumber forkno,
+ 	BlockNumber blkno, bool present)
+ {
+ 	char	   *path = relpathperm(node, forkno);
+ 
+ 	if (present)
+ 		elog(elevel, "page %u of relation %s is uninitialized",
+ 			 blkno, path);
+ 	else
+ 		elog(elevel, "page %u of relation %s does not exist",
+ 			 blkno, path);
+ 	pfree(path);
+ }
+ 
  /* Log a reference to an invalid page */
  static void
  log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
***
*** 62,83  log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber 

Re: [HACKERS] bug of recovery?

2011-10-03 Thread Heikki Linnakangas

On 29.09.2011 14:31, Fujii Masao wrote:

On Tue, Sep 27, 2011 at 8:06 PM, Florian Pflug f...@phlo.org wrote:

Actually, why don't we use that machinery to implement this? There's currently 
no rm_safe_restartpoint callback for RM_XLOG_ID, so we'd just need to create 
one that checks whether invalid_page_tab is empty.


Okay, the attached patch prevents the creation of restartpoints by using
rm_safe_restartpoint callback if we've not reached a consistent state yet
and the invalid-page table is not empty. But the invalid-page table is not
tied to the specific resource manager, so using rm_safe_restartpoint for
that seems slightly odd. Is this OK?


I don't think this should use the rm_safe_restartpoint machinery. As you 
said, it's not tied to any specific resource manager. And I've actually 
been thinking that we will get rid of rm_safe_restartpoint altogether in 
the future. The two things that still use it are the b-tree and gin, and 
I'd like to change both of those to not require any post-recovery 
cleanup step to finish multi-page operations, similar to what I did with 
GiST in 9.1.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] bug of recovery?

2011-10-03 Thread Simon Riggs
On Mon, Oct 3, 2011 at 8:21 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 29.09.2011 14:31, Fujii Masao wrote:

 On Tue, Sep 27, 2011 at 8:06 PM, Florian Pflugf...@phlo.org  wrote:

 Actually, why don't we use that machinery to implement this? There's
 currently no rm_safe_restartpoint callback for RM_XLOG_ID, so we'd just need
 to create one that checks whether invalid_page_tab is empty.

 Okay, the attached patch prevents the creation of restartpoints by using
 rm_safe_restartpoint callback if we've not reached a consistent state yet
 and the invalid-page table is not empty. But the invalid-page table is not
 tied to the specific resource manager, so using rm_safe_restartpoint for
 that seems slightly odd. Is this OK?

 I don't think this should use the rm_safe_restartpoint machinery. As you
 said, it's not tied to any specific resource manager. And I've actually been
 thinking that we will get rid of rm_safe_restartpoint altogether in the
 future. The two things that still use it are the b-tree and gin, and I'd
 like to change both of those to not require any post-recovery cleanup step
 to finish multi-page operations, similar to what I did with GiST in 9.1.

I thought that was quite neat doing it that way, but there's no
specific reason to do it that way I guess. If you're happy to rewrite
the patch then I guess we're OK.

I certainly would like to get rid of rm_safe_restartpoint in the
longer term, hopefully sooner.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] bug of recovery?

2011-10-03 Thread Simon Riggs
On Mon, Oct 3, 2011 at 6:23 AM, Fujii Masao masao.fu...@gmail.com wrote:

 So I think that the idea should be implemented separately from
 the patch I've posted.

Agreed. I'll do a final review and commit today.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] bug of recovery?

2011-10-02 Thread Fujii Masao
On Fri, Sep 30, 2011 at 3:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I think we should issue PANIC if the source is a critical rmgr, or
 just WARNING if from a non-critical rmgr, such as indexes.

 Ideally, I think we should have a mechanism to allow indexes to be
 marked corrupt. For example, a file that if present shows that the
 index is corrupt and would be marked not valid. We can then create the
 file and send a sinval message to force the index relcache to be
 rebuilt showing valid set to false.

This seems not to be specific to the invalid-page table problem.
All error cases from a non-critical rmgr should be treated as non-PANIC.
So I think that the idea should be implemented separately from
the patch I've posted.

Anyway, what if a read-only query accesses an index marked invalid?
Just emit an ERROR?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] bug of recovery?

2011-09-30 Thread Simon Riggs
On Fri, Sep 30, 2011 at 2:09 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Sep 29, 2011 at 11:12 PM, Florian Pflug f...@phlo.org wrote:
 On Sep29, 2011, at 13:49 , Simon Riggs wrote:
 This worries me slightly now though because the patch makes us PANIC
 in a place we didn't used to and once we do that we cannot restart the
 server at all. Are we sure we want that? It's certainly a great way to
 shake down errors in other code...

 The patch only introduces a new PANIC condition during archive recovery,
 though. Crash recovery is unaffected, except that we no longer create
 restart points before we reach consistency.

 Also, if we hit an invalid page reference after reaching consistency,
 the cause is probably either a bug in our recovery code, or (quite unlikely)
 a corrupted WAL that passed the CRC check. In both cases, the likelihood
 of data-corruption seems high, so PANICing seems like the right thing to do.

 Fair enough.

 We might be able to use FATAL or ERROR instead of PANIC because they
 also cause all processes to exit when the startup process emits them.
 For example, we now use FATAL to stop the server in recovery mode
 when recovery is about to end before we've reached a consistent state.

I think we should issue PANIC if the source is a critical rmgr, or
just WARNING if from a non-critical rmgr, such as indexes.

Ideally, I think we should have a mechanism to allow indexes to be
marked corrupt. For example, a file that if present shows that the
index is corrupt and would be marked not valid. We can then create the
file and send a sinval message to force the index relcache to be
rebuilt showing valid set to false.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] bug of recovery?

2011-09-29 Thread Fujii Masao
On Tue, Sep 27, 2011 at 8:06 PM, Florian Pflug f...@phlo.org wrote:
 On Sep27, 2011, at 07:59 , Heikki Linnakangas wrote:
 On 27.09.2011 00:28, Florian Pflug wrote:
 On Sep26, 2011, at 22:39 , Tom Lane wrote:
 It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
 we (think we) have reached consistency, rather than leaving it to be
 done only when we exit recovery mode.

 I believe we also need to prevent the creation of restart points before
 we've reached consistency.

 Seems reasonable. We could still allow restartpoints when the hash table is 
 empty, though. And once we've reached consistency, we can throw an error 
 immediately in log_invalid_page(), instead of adding the entry in the hash 
 table.

 That mimics the way the rm_safe_restartpoint callbacks work, which is good.

 Actually, why don't we use that machinery to implement this? There's 
 currently no rm_safe_restartpoint callback for RM_XLOG_ID, so we'd just need 
 to create one that checks whether invalid_page_tab is empty.

Okay, the attached patch prevents the creation of restartpoints by using
rm_safe_restartpoint callback if we've not reached a consistent state yet
and the invalid-page table is not empty. But the invalid-page table is not
tied to the specific resource manager, so using rm_safe_restartpoint for
that seems slightly odd. Is this OK?

Also, according to other suggestions, the patch changes XLogCheckInvalidPages()
so that it's called as soon as we've reached a consistent state, and changes
log_invalid_page() so that it emits PANIC immediately if consistency is already
reached. These are very good changes, I think, because they enable us to
notice a serious problem that causes a PANIC error immediately. Without
these changes, you might unfortunately notice that the standby database is
corrupted only when failover happens. Though such a problem might rarely
happen, I think these changes are worth making.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***
*** 16,21 
--- 16,22 
  #include access/nbtree.h
  #include access/xact.h
  #include access/xlog_internal.h
+ #include access/xlogutils.h
  #include catalog/storage.h
  #include commands/dbcommands.h
  #include commands/sequence.h
***
*** 25,31 
  
  
  const RmgrData RmgrTable[RM_MAX_ID + 1] = {
! 	{"XLOG", xlog_redo, xlog_desc, NULL, NULL, NULL},
  	{"Transaction", xact_redo, xact_desc, NULL, NULL, NULL},
  	{"Storage", smgr_redo, smgr_desc, NULL, NULL, NULL},
  	{"CLOG", clog_redo, clog_desc, NULL, NULL, NULL},
--- 26,32 
  
  
  const RmgrData RmgrTable[RM_MAX_ID + 1] = {
! 	{"XLOG", xlog_redo, xlog_desc, NULL, NULL, xlog_safe_restartpoint},
  	{"Transaction", xact_redo, xact_desc, NULL, NULL, NULL},
  	{"Storage", smgr_redo, smgr_desc, NULL, NULL, NULL},
  	{"CLOG", clog_redo, clog_desc, NULL, NULL, NULL},
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 557,563  static TimeLineID lastPageTLI = 0;
  static XLogRecPtr minRecoveryPoint;		/* local copy of
   		 * ControlFile->minRecoveryPoint */
  static bool updateMinRecoveryPoint = true;
! static bool reachedMinRecoveryPoint = false;
  
  static bool InRedo = false;
  
--- 557,563 
  static XLogRecPtr minRecoveryPoint;		/* local copy of
   		 * ControlFile->minRecoveryPoint */
  static bool updateMinRecoveryPoint = true;
! bool reachedMinRecoveryPoint = false;
  
  static bool InRedo = false;
  
***
*** 6840,6851  StartupXLOG(void)
  		LocalXLogInsertAllowed = -1;
  
  		/*
- 		 * Check to see if the XLOG sequence contained any unresolved
- 		 * references to uninitialized pages.
- 		 */
- 		XLogCheckInvalidPages();
- 
- 		/*
  		 * Perform a checkpoint to update all our recovery activity to disk.
  		 *
  		 * Note that we write a shutdown checkpoint rather than an on-line
--- 6840,6845 
***
*** 6982,6987  CheckRecoveryConsistency(void)
--- 6976,6987 
  		XLByteLE(minRecoveryPoint, EndRecPtr) &&
  		XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  	{
+ 		/*
+ 		 * Check to see if the XLOG sequence contained any unresolved
+ 		 * references to uninitialized pages.
+ 		 */
+ 		XLogCheckInvalidPages();
+ 
  		reachedMinRecoveryPoint = true;
  		ereport(LOG,
  (errmsg("consistent recovery state reached at %X/%X",
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
***
*** 52,57  typedef struct xl_invalid_page
--- 52,73 
  static HTAB *invalid_page_tab = NULL;
  
  
+ /* Report a reference to an invalid page */
+ static void
+ report_invalid_page(int elevel, RelFileNode node, ForkNumber forkno,
+ 	BlockNumber blkno, bool present)
+ {
+ 	char	   *path = relpathperm(node, forkno);
+ 
+ 	if (present)
+ 		elog(elevel, "page %u of relation %s is uninitialized",
+ 			 blkno, path);
+ 	else
+ 		

Re: [HACKERS] bug of recovery?

2011-09-29 Thread Simon Riggs
On Thu, Sep 29, 2011 at 12:31 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Sep 27, 2011 at 8:06 PM, Florian Pflug f...@phlo.org wrote:
 On Sep27, 2011, at 07:59 , Heikki Linnakangas wrote:
 On 27.09.2011 00:28, Florian Pflug wrote:
 On Sep26, 2011, at 22:39 , Tom Lane wrote:
 It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
 we (think we) have reached consistency, rather than leaving it to be
 done only when we exit recovery mode.

 I believe we also need to prevent the creation of restart points before
 we've reached consistency.

 Seems reasonable. We could still allow restartpoints when the hash table is 
 empty, though. And once we've reached consistency, we can throw an error 
 immediately in log_invalid_page(), instead of adding the entry in the hash 
 table.

 That mimics the way the rm_safe_restartpoint callbacks work, which is good.

 Actually, why don't we use that machinery to implement this? There's 
 currently no rm_safe_restartpoint callback for RM_XLOG_ID, so we'd just need 
 to create one that checks whether invalid_page_tab is empty.

 Okay, the attached patch prevents the creation of restartpoints by using
 rm_safe_restartpoint callback if we've not reached a consistent state yet
 and the invalid-page table is not empty. But the invalid-page table is not
 tied to the specific resource manager, so using rm_safe_restartpoint for
 that seems slightly odd. Is this OK?

 Also, according to other suggestions, the patch changes XLogCheckInvalidPages()
 so that it's called as soon as we've reached a consistent state, and changes
 log_invalid_page() so that it emits PANIC immediately if consistency is already
 reached. These are very good changes, I think, because they enable us to
 notice a serious problem that causes a PANIC error immediately. Without
 these changes, you might unfortunately notice that the standby database is
 corrupted only when failover happens. Though such a problem might rarely
 happen, I think these changes are worth making.

Patch does everything we agreed it should.

Good suggestion from Florian.

This worries me slightly now though because the patch makes us PANIC
in a place we didn't used to and once we do that we cannot restart the
server at all. Are we sure we want that? It's certainly a great way to
shake down errors in other code...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] bug of recovery?

2011-09-29 Thread Florian Pflug
On Sep29, 2011, at 13:49 , Simon Riggs wrote:
 This worries me slightly now though because the patch makes us PANIC
 in a place we didn't used to and once we do that we cannot restart the
 server at all. Are we sure we want that? It's certainly a great way to
 shake down errors in other code...

The patch only introduces a new PANIC condition during archive recovery,
though. Crash recovery is unaffected, except that we no longer create
restart points before we reach consistency.

Also, if we hit an invalid page reference after reaching consistency,
the cause is probably either a bug in our recovery code, or (quite unlikely)
a corrupted WAL that passed the CRC check. In both cases, the likelihood
of data-corruption seems high, so PANICing seems like the right thing to do.

best regards,
Florian Pflug




Re: [HACKERS] bug of recovery?

2011-09-29 Thread Fujii Masao
On Thu, Sep 29, 2011 at 11:12 PM, Florian Pflug f...@phlo.org wrote:
 On Sep29, 2011, at 13:49 , Simon Riggs wrote:
 This worries me slightly now though because the patch makes us PANIC
 in a place we didn't used to and once we do that we cannot restart the
 server at all. Are we sure we want that? It's certainly a great way to
 shake down errors in other code...

 The patch only introduces a new PANIC condition during archive recovery,
 though. Crash recovery is unaffected, except that we no longer create
 restart points before we reach consistency.

 Also, if we hit an invalid page reference after reaching consistency,
 the cause is probably either a bug in our recovery code, or (quite unlikely)
 a corrupted WAL that passed the CRC check. In both cases, the likelihood
 of data-corruption seems high, so PANICing seems like the right thing to do.

Fair enough.

We might be able to use FATAL or ERROR instead of PANIC because they
also cause all processes to exit when the startup process emits them.
For example, we now use FATAL to stop the server in recovery mode
when recovery is about to end before we've reached a consistent state.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] bug of recovery?

2011-09-27 Thread Heikki Linnakangas

On 26.09.2011 21:06, Simon Riggs wrote:

On Mon, Sep 26, 2011 at 10:50 AM, Fujii Masao masao.fu...@gmail.com wrote:


Currently, if a reference to an invalid page is found during recovery,
its information
is saved in hash table invalid_page_tab. Then, if such a reference
is resolved,
its information is removed from the hash table. If there is unresolved
reference to
an invalid page in the hash table at the end of recovery, PANIC error occurs.

What I'm worried about is that the hash table is volatile. If a user restarts
the server before reaching end of recovery, any information in the
hash table is lost,
and we wrongly miss the PANIC error case because we cannot find any unresolved
reference. That is, even if database is corrupted at the end of recovery,
a user might not be able to notice that. This looks like a serious problem. No?

To prevent the above problem, we should write the contents of the hash table to
the disk for every restartpoint, I think. Then, when the server
starts recovery,
it should reload the hash table from the disk. Thought? Am I missing something?


That doesn't happen because when we stop the server it will
restart from a valid restartpoint - one where there is no in-progress
multi-phase operation.

So when it replays it will always replay both parts of the operation.


I think you're mixing this up with the multi-page page split operations 
in b-tree. This is different from that. What the invalid_page_tab is 
for is the situation where you, for example, insert to a page on table X, 
and later table X is dropped, and then you crash. On WAL replay, you 
will see the insert record, but the file for the table doesn't exist, 
because the table was dropped. In that case we skip the insert, note 
what happened in invalid_page_tab, and move on with recovery. When we 
see the later record to drop the table, we know it was OK that the file 
was missing earlier. But if we don't see it before end of recovery, we 
PANIC, because then the file should've been there.
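
In other words, the life cycle is roughly this (names as in
src/backend/access/transam/xlogutils.c, heavily abbreviated):

/* Redo sees a reference to a missing or uninitialized page: remember it. */
log_invalid_page(node, forkno, blkno, present);

/*
 * Redo later sees the file being dropped or truncated: the earlier
 * reference turns out to have been harmless, so forget it.
 */
forget_invalid_pages(node, forkno, 0);

/*
 * End of recovery: anything still remembered means the file really
 * should have existed, so we PANIC.
 */
XLogCheckInvalidPages();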


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] bug of recovery?

2011-09-27 Thread Heikki Linnakangas

On 27.09.2011 00:28, Florian Pflug wrote:

On Sep26, 2011, at 22:39 , Tom Lane wrote:

It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
we (think we) have reached consistency, rather than leaving it to be
done only when we exit recovery mode.


I believe we also need to prevent the creation of restart points before
we've reached consistency.


Seems reasonable. We could still allow restartpoints when the hash table 
is empty, though. And once we've reached consistency, we can throw an 
error immediately in log_invalid_page(), instead of adding the entry in 
the hash table.
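
For the log_invalid_page() side, a minimal sketch of that behaviour
(assuming the reachedMinRecoveryPoint flag is made visible to
xlogutils.c, and with report_invalid_page() as a small helper that only
formats the message; error levels are illustrative):

static void
log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
				 bool present)
{
	if (reachedMinRecoveryPoint)
	{
		/* Already consistent: this can only be corruption or a bug. */
		report_invalid_page(PANIC, node, forkno, blkno, present);
		return;
	}

	/* Not yet consistent: just remember the reference, as before. */
	report_invalid_page(DEBUG1, node, forkno, blkno, present);
	/* ... insert the entry into invalid_page_tab as the existing code does ... */
}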


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] bug of recovery?

2011-09-27 Thread Simon Riggs
On Tue, Sep 27, 2011 at 6:54 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 I think you're mixing this up with the multi-page page split operations in
 b-tree. This is different from that. What the invalid_page_tab is for is
 the situation where you for example, insert to a page on table X, and later
 table X is dropped, and then you crash. On WAL replay, you will see the
 insert record, but the file for the table doesn't exist, because the table
 was dropped. In that case we skip the insert, note what happened in
 invalid_page_tab, and move on with recovery. When we see the later record to
 drop the table, we know it was OK that the file was missing earlier. But if
 we don't see it before end of recovery, we PANIC, because then the file
 should've been there.

OK, yes, I see. Thanks.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] bug of recovery?

2011-09-27 Thread Florian Pflug
On Sep27, 2011, at 07:59 , Heikki Linnakangas wrote:
 On 27.09.2011 00:28, Florian Pflug wrote:
 On Sep26, 2011, at 22:39 , Tom Lane wrote:
 It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
 we (think we) have reached consistency, rather than leaving it to be
 done only when we exit recovery mode.
 
 I believe we also need to prevent the creation of restart points before
 we've reached consistency.
 
 Seems reasonable. We could still allow restartpoints when the hash table is 
 empty, though. And once we've reached consistency, we can throw an error 
 immediately in log_invalid_page(), instead of adding the entry in the hash 
 table.

That mimics the way the rm_safe_restartpoint callbacks work, which is good.

Actually, why don't we use that machinery to implement this? There's currently 
no rm_safe_restartpoint callback for RM_XLOG_ID, so we'd just need to create 
one that checks whether invalid_page_tab is empty.
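
A minimal sketch of that callback (hypothetical; have_invalid_pages()
stands for a small accessor that reports whether invalid_page_tab has
any entries):

/* rm_safe_restartpoint callback for RM_XLOG_ID */
bool
xlog_safe_restartpoint(void)
{
	/* A restartpoint is safe only if no invalid-page references are pending. */
	return !have_invalid_pages();
}

/* ... and the RM_XLOG_ID entry in RmgrTable would become: */
{"XLOG", xlog_redo, xlog_desc, NULL, NULL, xlog_safe_restartpoint},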

best regards,
Florian Pflug




[HACKERS] bug of recovery?

2011-09-26 Thread Fujii Masao
Hi,

Currently, if a reference to an invalid page is found during recovery,
its information
is saved in hash table invalid_page_tab. Then, if such a reference
is resolved,
its information is removed from the hash table. If there is unresolved
reference to
an invalid page in the hash table at the end of recovery, PANIC error occurs.

What I'm worried about is that the hash table is volatile. If a user restarts
the server before reaching end of recovery, any information in the
hash table is lost,
and we wrongly miss the PANIC error case because we cannot find any unresolved
reference. That is, even if database is corrupted at the end of recovery,
a user might not be able to notice that. This looks like a serious problem. No?

To prevent the above problem, we should write the contents of the hash table to
the disk for every restartpoint, I think. Then, when the server
starts recovery,
it should reload the hash table from the disk. Thought? Am I missing something?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] bug of recovery?

2011-09-26 Thread Florian Pflug
On Sep26, 2011, at 11:50 , Fujii Masao wrote:
 Currently, if a reference to an invalid page is found during recovery,
 its information
 is saved in hash table invalid_page_tab. Then, if such a reference
 is resolved,
 its information is removed from the hash table. If there is unresolved
 reference to
 an invalid page in the hash table at the end of recovery, PANIC error occurs.
 
 What I'm worried about is that the hash table is volatile. If a user restarts
 the server before reaching end of recovery, any information in the
 hash table is lost,
 and we wrongly miss the PANIC error case because we cannot find any unresolved
 reference. That is, even if database is corrupted at the end of recovery,
 a user might not be able to notice that. This looks like a serious problem. 
 No?
 
 To prevent the above problem, we should write the contents of the hash table 
 to
 the disk for every restartpoint, I think. Then, when the server
 starts recovery,
 it should reload the hash table from the disk. Thought? Am I missing 
 something?

Shouldn't references to invalid pages only occur before we reach a consistent
state? If so, the right fix would be to check whether all invalid page
references have been resolved after we've reached a consistent state, and
to skip creating restart points while there're unresolved page references.

best regards,
Florian Pflug




Re: [HACKERS] bug of recovery?

2011-09-26 Thread Simon Riggs
On Mon, Sep 26, 2011 at 10:50 AM, Fujii Masao masao.fu...@gmail.com wrote:

 Currently, if a reference to an invalid page is found during recovery,
 its information
 is saved in hash table invalid_page_tab. Then, if such a reference
 is resolved,
 its information is removed from the hash table. If there is unresolved
 reference to
 an invalid page in the hash table at the end of recovery, PANIC error occurs.

 What I'm worried about is that the hash table is volatile. If a user restarts
 the server before reaching end of recovery, any information in the
 hash table is lost,
 and we wrongly miss the PANIC error case because we cannot find any unresolved
 reference. That is, even if database is corrupted at the end of recovery,
 a user might not be able to notice that. This looks like a serious problem. 
 No?

 To prevent the above problem, we should write the contents of the hash table 
 to
 the disk for every restartpoint, I think. Then, when the server
 starts recovery,
 it should reload the hash table from the disk. Thought? Am I missing 
 something?

That doesn't happen because when we stop the server it will
restart from a valid restartpoint - one where there is no in-progress
multi-phase operation.

So when it replays it will always replay both parts of the operation.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] bug of recovery?

2011-09-26 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Mon, Sep 26, 2011 at 10:50 AM, Fujii Masao masao.fu...@gmail.com wrote:
 To prevent the above problem, we should write the contents of the hash table 
 to
 the disk for every restartpoint, I think. Then, when the server
 starts recovery,
 it should reload the hash table from the disk. Thought? Am I missing 
 something?

 That doesn't happen because when we stop the server it will
 restart from a valid restartpoint - one where there is no in-progress
 multi-phase operation.

Not clear that that's true.  The larger point though is that the
invalid-page table is only interesting during crash recovery --- once
you've reached a consistent state, it should be empty and remain so.
So I see no particular value in Fujii's proposal of logging the table to
disk during standby mode.

It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
we (think we) have reached consistency, rather than leaving it to be
done only when we exit recovery mode.

regards, tom lane



Re: [HACKERS] bug of recovery?

2011-09-26 Thread Florian Pflug
On Sep26, 2011, at 22:39 , Tom Lane wrote:
 It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
 we (think we) have reached consistency, rather than leaving it to be
 done only when we exit recovery mode.

I believe we also need to prevent the creation of restart points before
we've reached consistency. If we're starting from an online backup,
and a checkpoint occurred between pg_start_backup() and pg_stop_backup(),
we currently create a restart point upon replaying that checkpoint's
xlog record. At that point, however, unresolved page references are
not an error, since a truncation that happened after the checkpoint
(but before pg_stop_backup()) might or might not be reflected in the
online backup.
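
One simple way to express that rule would be a guard at the top of
RecoveryRestartPoint(), along these lines (have_invalid_pages() stands
for an accessor over invalid_page_tab; just a sketch):

if (have_invalid_pages())
{
	elog(DEBUG2,
		 "could not record restart point because there are "
		 "unresolved references to invalid pages");
	return;
}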

best regards,
Florian Pflug




Re: [HACKERS] bug of recovery?

2011-09-26 Thread Fujii Masao
On Tue, Sep 27, 2011 at 7:28 AM, Florian Pflug f...@phlo.org wrote:
 On Sep26, 2011, at 22:39 , Tom Lane wrote:
 It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
 we (think we) have reached consistency, rather than leaving it to be
 done only when we exit recovery mode.

 I believe we also need to prevent the creation of restart points before
 we've reached consistency. If we're starting from an online backup,
 and a checkpoint occurred between pg_start_backup() and pg_stop_backup(),
 we currently create a restart point upon replaying that checkpoint's
 xlog record. At that point, however, unresolved page references are
 not an error, since a truncation that happened after the checkpoint
 (but before pg_stop_backup()) might or might not be reflected in the
 online backup.

Preventing the creation of restartpoints before reaching the consistent
point sounds fragile in the case where the backup takes a very long time.
It might also take a very long time to reach the consistent point when
replaying from that backup, which also prevents the removal of WAL files
(e.g., streamed from the master server) for a long time, and then might
cause a disk-full failure.

ISTM that writing the invalid-page table to disk for every restartpoint is
a better approach. If the invalid-page table is never updated after we've
reached the consistency point, we probably should make restartpoints write
that table only after that point. And, if a reference to an invalid page
is found after the consistent point, we should emit an error and cancel
recovery.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] bug of recovery?

2011-09-26 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes:
 ISTM that writing an invalid-page table to the disk for every restartpoint is
 a better approach.

I still say that's uncalled-for overkill.  The invalid-page table is not
necessary for recovery, it's only a debugging cross-check.  You're more
likely to introduce bugs than fix any by adding a mechanism like that.

regards, tom lane



Re: [HACKERS] bug of recovery?

2011-09-26 Thread Fujii Masao
On Tue, Sep 27, 2011 at 1:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Fujii Masao masao.fu...@gmail.com writes:
 ISTM that writing an invalid-page table to the disk for every restartpoint
 is a better approach.

 I still say that's uncalled-for overkill.  The invalid-page table is not
 necessary for recovery, it's only a debugging cross-check.

If so, is there no risk even if the invalid-page table is lost and the check
is skipped unexpectedly?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
