Re: [HACKERS] exitArchiveRecovery woes

2014-12-18 Thread Robert Haas
On Wed, Dec 17, 2014 at 8:40 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 At the end of archive recovery, we copy the last segment from the old
 timeline, to initialize the first segment on the new timeline. For example,
 if the timeline switch happens in the middle of WAL segment
 00010005, the whole 00010005 segment is
 copied to become 00020005. The copying is necessary, so that
 the new segment contains valid data up to the switch point.

 However, we wouldn't really need to copy the whole segment, copying up to
 the switch point would be enough. In fact, copying the whole segment is a
 bad idea, because the copied WAL looks valid on the new timeline too.

Your proposed change makes sense to me, but why do we need the segment
to contain valid data up to the switch point?  It seems like the
switch between timelines should be crisper: replay WAL on the old
timeline only from the old segment, and from the new timeline only on
the new segment.  Anything else seems like an invitation to unending
corner-case bugs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] exitArchiveRecovery woes

2014-12-18 Thread Robert Haas
On Thu, Dec 18, 2014 at 10:49 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 12/18/2014 03:53 PM, Robert Haas wrote:
 On Wed, Dec 17, 2014 at 8:40 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:

 At the end of archive recovery, we copy the last segment from the old
 timeline, to initialize the first segment on the new timeline. For
 example,
 if the timeline switch happens in the middle of WAL segment
 00010005, the whole 00010005 segment is
 copied to become 00020005. The copying is necessary, so
 that
 the new segment contains valid data up to the switch point.

 However, we wouldn't really need to copy the whole segment, copying up to
 the switch point would be enough. In fact, copying the whole segment is a
 bad idea, because the copied WAL looks valid on the new timeline too.


 Your proposed change makes sense to me, but why do we need the segment
 to contain valid data up to the switch point?  It seems like the
 switch between timelines should be crisper: replay WAL on the old
 timeline only from the old segment, and from the new timeline only on
 the new segment.  Anything else seems like an invitation to unending
 corner-case bugs.

 True. That would require some changes to the way archive recovery works,
 though. Currently, when our recovery target timeline is, for example, 5,
 whose parents are 4 and 3, and we're currently on timeline 3, we will try to
 restore each segment first with timeline ID 5, then 4, then 3. It's a bit
 silly, because we know the timeline history and the exact points where the
 timelines changed, so we could just fetch the correct one. That would be a
 good idea, but I'm going to go ahead with just this smaller change now.

Yeah, it's a separate issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] exitArchiveRecovery woes

2014-12-17 Thread Heikki Linnakangas
At the end of archive recovery, we copy the last segment from the old 
timeline, to initialize the first segment on the new timeline. For 
example, if the timeline switch happens in the middle of WAL segment 
00010005, the whole 00010005 segment is 
copied to become 00020005. The copying is necessary, so 
that the new segment contains valid data up to the switch point.


However, we wouldn't really need to copy the whole segment, copying up 
to the switch point would be enough. In fact, copying the whole segment 
is a bad idea, because the copied WAL looks valid on the new timeline 
too. When we read the WAL at crash recovery, we rely on a number of 
things to determine if the next WAL record is valid. Most importantly, 
the checksum, and the prev-pointer. The checksum protects any random 
data from appearing valid, and the prev-pointer makes sure that a WAL 
record copied from another location in the WAL is not mistaken as valid. 
The prev-pointer is particularly important when we recycle old WAL 
segments as new, because the old segment contains valid WAL records with 
checksums and all. When we copy a WAL segment with the same segment 
number, the prev pointer doesn't protect us, as there can be WAL records 
at the exact same locations in both segments. There is a timeline ID on 
the page header, but we could still be mistaken within the page. Also, 
we are lenient with the TLI at start of WAL recovery, when we read the 
first WAL record after the checkpoint. There are further safeguards, 
like the fact that when writing WAL, we always write full blocks. But 
the write could still be torn at the OS or disk level, if you crash 
after writing the WAL, but before fsyncing it.


This is largely academic, but I was able to craft a test case where WAL 
recovery mistakenly starts to replay the WAL copied from the old 
timeline, as if it was on the new timeline. Attached is a shell script I 
used. It's very sensitive to the lengths of the WAL records, so probably 
only works on a similar platform as mine (x86_64 Linux). Running 
pitr-test.sh ends with this:


S LOG:  database system was interrupted; last known up at 2014-12-17 
15:15:42 EET
S LOG:  database system was not properly shut down; automatic recovery 
in progress

S LOG:  redo starts at 0/50A2018
S PANIC:  heap_insert_redo: invalid max offset number
S CONTEXT:  xlog redo Heap/INSERT: off 28
S LOG:  startup process (PID 10640) was terminated by signal 6: Aborted
S LOG:  aborting startup due to startup process failure

That PANIC happens because it tries to apply WAL from different 
timeline, and it doesn't work because it missed an earlier change to the 
same page it modifies. (If you were unlucky, you could get silent 
corruption instead, if the WAL record happens to apply without an error)


A simple way to avoid this is to copy the old WAL segment only up to the 
point of the timeline switch, and zero the rest.



Another thing I noticed is that we copy the last old WAL segment on the 
new timeline, even if the timeline switch happens at a segment boundary. 
In that case, the copied WAL segment is 100% identical to the old 
segment; it contains no records belonging to the new timeline. I guess 
that's not wrong per se, but it seems pointless and confusing.


Attached is a patch that addresses both of those issues. This doesn't 
seem worth the risk to back-patch, but let's fix these in master.



PS. The if (endTLI != ThisTimeLineID) test in exitArchiveRecovery was 
always true, because we always switch to a new timeline after archive 
recovery. I turned that into an Assert.


- Heikki


pitr-test.sh
Description: Bourne shell script
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 2923,2934  XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
   * srcTLI, srclog, srcseg: identify segment to be copied (could be from
   *		a different timeline)
   *
   * Currently this is only used during recovery, and so there are no locking
   * considerations.  But we should be just as tense as XLogFileInit to avoid
   * emplacing a bogus file.
   */
  static void
! XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno)
  {
  	char		path[MAXPGPATH];
  	char		tmppath[MAXPGPATH];
--- 2923,2937 
   * srcTLI, srclog, srcseg: identify segment to be copied (could be from
   *		a different timeline)
   *
+  * upto: how much of the source file to copy? (the rest is filled with zeros)
+  *
   * Currently this is only used during recovery, and so there are no locking
   * considerations.  But we should be just as tense as XLogFileInit to avoid
   * emplacing a bogus file.
   */
  static void
! XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
! 			 int upto)
  {
  	char		path[MAXPGPATH];
  	char		tmppath[MAXPGPATH];
***
*** 2967,2982  XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo