Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 12.11.2012 22:53, Tom Lane wrote:
> Here's an updated patch that fixes the GIST replay functions as well as
> the other minor issues that were mentioned. Barring objections, I'll
> set about back-patching this as far as 9.0.

Ok. It won't help all that much on 9.0, though.

> One thing that could use verification is my fix for
> gistRedoPageSplitRecord. AFAICS, the first page listed in the WAL
> record is always the original page, and the ones following it are pages
> that were split off from it, and can (as yet) only be reached by
> following right-links from the original page. As such, it should be
> okay to release locks on the non-first pages as soon as we've written
> them. We have to hold lock on the original page though to avoid letting
> readers follow dangling right-links. Also, the update of
> NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically
> with all this, so that has to be done before releasing the
> original-page lock as well. Does that sound right?

Yep.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Heikki Linnakangas hlinnakan...@vmware.com writes:
> On 12.11.2012 22:53, Tom Lane wrote:
>> Here's an updated patch that fixes the GIST replay functions as well
>> as the other minor issues that were mentioned. Barring objections,
>> I'll set about back-patching this as far as 9.0.

> Ok. It won't help all that much on 9.0, though.

Well, it won't help GIST much, but the actually-reported-from-the-field
case is in btree, and it does fix that.

It occurs to me that if we're sufficiently scared of this case, we could
probably hack the planner (in 9.0 only) to refuse to use GIST indexes in
hot-standby queries. That cure might be worse than the disease though.

			regards, tom lane
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>> Ok. It won't help all that much on 9.0, though.
>
> Well, it won't help GIST much, but the actually-reported-from-the-field
> case is in btree, and it does fix that.
>
> It occurs to me that if we're sufficiently scared of this case, we
> could probably hack the planner (in 9.0 only) to refuse to use GIST
> indexes in hot-standby queries. That cure might be worse than the
> disease though.

if anything, it should be documented. if you do this kind of thing
people will stop installing bugfix releases.

merlin
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 14/11/12 04:32, Merlin Moncure wrote:
> On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>>> Ok. It won't help all that much on 9.0, though.
>>
>> Well, it won't help GIST much, but the
>> actually-reported-from-the-field case is in btree, and it does fix
>> that.
>>
>> It occurs to me that if we're sufficiently scared of this case, we
>> could probably hack the planner (in 9.0 only) to refuse to use GIST
>> indexes in hot-standby queries. That cure might be worse than the
>> disease though.
>
> if anything, it should be documented. if you do this kind of thing
> people will stop installing bugfix releases.

How about displaying a warning when people try to use the 'feature', as
well as documenting it?

Cheers,
Gavin
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On Tue, Nov 13, 2012 at 10:32 AM, Merlin Moncure mmonc...@gmail.com wrote:
> On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>>> Ok. It won't help all that much on 9.0, though.
>>
>> Well, it won't help GIST much, but the
>> actually-reported-from-the-field case is in btree, and it does fix
>> that.
>>
>> It occurs to me that if we're sufficiently scared of this case, we
>> could probably hack the planner (in 9.0 only) to refuse to use GIST
>> indexes in hot-standby queries. That cure might be worse than the
>> disease though.
>
> if anything, it should be documented. if you do this kind of thing
> people will stop installing bugfix releases.

Agreed. I think doing that in a back-branch release would be extremely
user-hostile.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 12.11.2012 01:25, Tom Lane wrote:
> Worse than that, GIST WAL replay seems to be a total disaster from this
> standpoint, and I'm not convinced that it's even fixable without
> incompatible WAL changes. There are several cases where index insertion
> operations generate more than one WAL record while holding exclusive
> lock on buffer(s). If the lock continuity is actually necessary to
> prevent concurrent queries from seeing inconsistent index states, then
> we have a big problem, because WAL replay isn't designed to hold any
> buffer lock for more than one WAL action.

Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0),
the GiST index is always in a consistent state. Until the downlink for a
newly split page has been inserted, its left sibling is flagged so that
a search knows to follow the right-link to find it. In normal operation,
the lock continuity is needed to prevent another backend from seeing the
incomplete split and trying to fix it, but in hot standby, we never try
to fix incomplete splits anyway. So I think we're good on >= 9.1.

The 9.0 code is broken, however. In 9.0, when a child page is split, the
parent and new children are kept locked until the downlinks are
inserted/updated. If a concurrent scan comes along and sees that
incomplete state, it will miss tuples on the new right siblings. We rely
on an rm_cleanup operation at the end of WAL replay to fix that
situation if the downlink insertion record is not there. I don't see any
easy way to fix that, unfortunately. Perhaps we could backpatch the 9.1
rewrite, now that it's gotten some real-world testing, but it was a big
change, so I don't feel very comfortable doing that.

> Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
> anything that emits XLOG_GIST_PAGE_DELETE.

Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was
removed.

- Heikki
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> I already pointed out the inconsistency in heap_xlog_freeze about
> whether a cleanup lock is needed. As is, this patch uses a cleanup
> lock, but I suspect that a regular lock is sufficient --- comments?

Well, according to storage/buffer/README:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive content lock. To examine the commit status (XIDs and status
bits) of a tuple in a shared buffer, one must likewise hold a pin and
either shared or exclusive lock.

That does indeed make it sound like an x-lock is enough.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Robert Haas robertmh...@gmail.com writes:
> On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> I already pointed out the inconsistency in heap_xlog_freeze about
>> whether a cleanup lock is needed. As is, this patch uses a cleanup
>> lock, but I suspect that a regular lock is sufficient --- comments?

> Well, according to storage/buffer/README:
>
> 1. To scan a page for tuples, one must hold a pin and either shared or
> exclusive content lock. To examine the commit status (XIDs and status
> bits) of a tuple in a shared buffer, one must likewise hold a pin and
> either shared or exclusive lock.
>
> That does indeed make it sound like an x-lock is enough.

Yeah. AFAICS, the only possible downside is that somebody might be using
the tuple (while holding just a buffer pin), and that its xmin might
change while that's happening. So for instance a nestloop join might
read out different xmin values for the same row while the join proceeds.
But that could happen anyway if a different plan had been chosen (viz,
this table on the inside not the outside of the nestloop). The xmin
update ought to be physically atomic, so you shouldn't be able to get a
garbage result, just the old value or the new. The cleanup lock is
needed for cases where we'd like to remove or move a tuple, but that's
not required here.

			regards, tom lane
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Heikki Linnakangas hlinnakan...@vmware.com writes:
> On 12.11.2012 01:25, Tom Lane wrote:
>> Worse than that, GIST WAL replay seems to be a total disaster from
>> this standpoint, and I'm not convinced that it's even fixable without
>> incompatible WAL changes.

> Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0),
> the GiST index is always in a consistent state. Until the downlink for
> a newly split page has been inserted, its left sibling is flagged so
> that a search knows to follow the right-link to find it. In normal
> operation, the lock continuity is needed to prevent another backend
> from seeing the incomplete split and trying to fix it, but in hot
> standby, we never try to fix incomplete splits anyway. So I think we're
> good on >= 9.1.

Okay. I'll update the patch to make sure that the individual WAL replay
functions hold all locks, but not worry about holding locks across
actions.

> The 9.0 code is broken, however. In 9.0, when a child page is split,
> the parent and new children are kept locked until the downlinks are
> inserted/updated. If a concurrent scan comes along and sees that
> incomplete state, it will miss tuples on the new right siblings. We
> rely on an rm_cleanup operation at the end of WAL replay to fix that
> situation if the downlink insertion record is not there. I don't see
> any easy way to fix that, unfortunately. Perhaps we could backpatch the
> 9.1 rewrite, now that it's gotten some real-world testing, but it was a
> big change so I don't feel very comfortable doing that.

Me either. Given the lack of field complaints, I think we're better
advised to just leave it unfixed in 9.0. It'd not be a step forward if
we broke something trying to make this work.

>> Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
>> anything that emits XLOG_GIST_PAGE_DELETE.

> Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL
> was removed.

Okay. I see we bumped XLOG_PAGE_MAGIC in 9.0, so there's no longer any
way that 9.0 or later versions could see this WAL record type. I'll
delete that replay function rather than bothering to fix it.

			regards, tom lane
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Andres Freund and...@2ndquadrant.com writes:
> On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
>> If any of this stuff were getting used by external modules, changing
>> it would be problematic ... but fortunately, it isn't, because we lack
>> support for plug-in rmgrs. So I'm not worried about back-patching the
>> change, and would prefer to keep the 9.x branches in sync.

> XLR_BKP_BLOCK_* might be used by things like pg_lesslog, and it's
> surely used by xlogdump. Not sure if either are worth that much
> attention, but it seems worth noticing that such a change will break
> stuff.

Hm. Okay, how about we leave the old macros in place in the back
branches?

			regards, tom lane
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 2012-11-12 10:19:09 -0500, Tom Lane wrote:
> Andres Freund and...@2ndquadrant.com writes:
>>> If any of this stuff were getting used by external modules, changing
>>> it would be problematic ... but fortunately, it isn't, because we
>>> lack support for plug-in rmgrs. So I'm not worried about
>>> back-patching the change, and would prefer to keep the 9.x branches
>>> in sync.
>
>> XLR_BKP_BLOCK_* might be used by things like pg_lesslog, and it's
>> surely used by xlogdump. Not sure if either are worth that much
>> attention, but it seems worth noticing that such a change will break
>> stuff.
>
> Hm. Okay, how about we leave the old macros in place in the back
> branches?

Sounds good to me. The RestoreBkpBlocks change seems unproblematic to
me. If anything, it's good that it has been renamed.

Thanks,

Andres
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 11 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:
> Practically all WAL record types that touch multiple pages have some
> bug of this type. In addition to btree_xlog_split, I found that
> heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
> spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
> locks as required to make their updates safe for concurrent queries.
> (I'm not totally sure about ginRedoDeletePage, but the original action
> definitely locks the pages simultaneously, and it's not clear that it's
> safe not to.) Most of these are okay in cases without any full-page
> images, but could fail if the wrong subset of the pages-to-be-touched
> were processed by RestoreBkpBlocks. Some had bugs even without that :-(

Hmm, not good. Thanks for spotting.

Do these changes do anything to actions that occur across multiple
records? I assume not and think those are OK, agreed?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 12 November 2012 14:51, Tom Lane t...@sss.pgh.pa.us wrote:
>> The 9.0 code is broken, however. In 9.0, when a child page is split,
>> the parent and new children are kept locked until the downlinks are
>> inserted/updated. If a concurrent scan comes along and sees that
>> incomplete state, it will miss tuples on the new right siblings. We
>> rely on an rm_cleanup operation at the end of WAL replay to fix that
>> situation if the downlink insertion record is not there. I don't see
>> any easy way to fix that, unfortunately. Perhaps we could backpatch
>> the 9.1 rewrite, now that it's gotten some real-world testing, but it
>> was a big change so I don't feel very comfortable doing that.
>
> Me either. Given the lack of field complaints, I think we're better
> advised to just leave it unfixed in 9.0. It'd not be a step forward if
> we broke something trying to make this work.

Agreed. Most people running 9.0 with GIST indexes have already upgraded.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Simon Riggs si...@2ndquadrant.com writes:
> On 11 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:
>> Practically all WAL record types that touch multiple pages have some
>> bug of this type. In addition to btree_xlog_split, I found that
>> heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
>> spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
>> locks as required to make their updates safe for concurrent queries.
>> (I'm not totally sure about ginRedoDeletePage, but the original action
>> definitely locks the pages simultaneously, and it's not clear that
>> it's safe not to.) Most of these are okay in cases without any
>> full-page images, but could fail if the wrong subset of the
>> pages-to-be-touched were processed by RestoreBkpBlocks. Some had bugs
>> even without that :-(

> Hmm, not good. Thanks for spotting.

> Do these changes do anything to actions that occur across multiple
> records? I assume not and think those are OK, agreed?

Right, we were and still are assuming that any state that exists between
WAL records is consistent and safe to expose to hot-standby queries. The
important thing here is that these WAL replay functions were failing to
ensure that their own actions appear atomic to onlookers. This is
basically a hangover from pre-hot-standby coding conventions, when no
such concern existed.

			regards, tom lane
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Here's an updated patch that fixes the GIST replay functions as well as
the other minor issues that were mentioned. Barring objections, I'll set
about back-patching this as far as 9.0.

One thing that could use verification is my fix for
gistRedoPageSplitRecord. AFAICS, the first page listed in the WAL record
is always the original page, and the ones following it are pages that
were split off from it, and can (as yet) only be reached by following
right-links from the original page. As such, it should be okay to
release locks on the non-first pages as soon as we've written them. We
have to hold lock on the original page though to avoid letting readers
follow dangling right-links. Also, the update of NSN/FOLLOW_RIGHT on the
child page (if any) has to be done atomically with all this, so that has
to be done before releasing the original-page lock as well. Does that
sound right?

			regards, tom lane

Attachment: restore-backup-blocks-individually-3.patch.gz
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Attached is a complete draft patch against HEAD for this issue. I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some bug
of this type. In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.) Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks. Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes. There are several cases where index insertion
operations generate more than one WAL record while holding exclusive
lock on buffer(s). If the lock continuity is actually necessary to
prevent concurrent queries from seeing inconsistent index states, then
we have a big problem, because WAL replay isn't designed to hold any
buffer lock for more than one WAL action. This needs to be looked at
closer by somebody who's more up on the GIST code than I am. The
attached patch doesn't try very hard to make the GIST functions safe; it
just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already. This is an index corruption bug, not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's merge-lists case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed. As is, this patch uses a cleanup lock,
but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
anything that emits XLOG_GIST_PAGE_DELETE.

			regards, tom lane

Attachment: restore-backup-blocks-individually-2.patch.gz
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
[ this time *with* the patch ]

Attached is a complete draft patch against HEAD for this issue. I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some bug
of this type. In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.) Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks. Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes. There are several cases where index insertion
operations generate more than one WAL record while holding exclusive
lock on buffer(s). If the lock continuity is actually necessary to
prevent concurrent queries from seeing inconsistent index states, then
we have a big problem, because WAL replay isn't designed to hold any
buffer lock for more than one WAL action. This needs to be looked at
closer by somebody who's more up on the GIST code than I am. The
attached patch doesn't try very hard to make the GIST functions safe; it
just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already. This is an index corruption bug, not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's merge-lists case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed. As is, this patch uses a cleanup lock,
but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
anything that emits XLOG_GIST_PAGE_DELETE.

			regards, tom lane

Attachment: binVAAMKoREwG.bin (restore-backup-blocks-individually-2.patch.gz)
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
> Simon Riggs si...@2ndquadrant.com writes:
>>> One thing that seems a bit annoying is the use of zero-based backup
>>> block indexes in RestoreBackupBlock, when most (not all) of the
>>> callers are referencing macros with one-based names (XLR_BKP_BLOCK_1
>>> etc). That's a bug waiting to happen. We could address it by changing
>>> RestoreBackupBlock to accept a one-based argument, but what I'm
>>> thinking of doing instead is getting rid of the one-based convention
>>> entirely; that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and
>>> instead have all code use the XLR_SET_BKP_BLOCK() macro, which is
>>> zero-based. One advantage of doing that is that it'll force
>>> inspection of all code related to this.
>
>> I wouldn't do that in a back branch, but I can see why it's a good
>> idea.
>
> If any of this stuff were getting used by external modules, changing it
> would be problematic ... but fortunately, it isn't, because we lack
> support for plug-in rmgrs. So I'm not worried about back-patching the
> change, and would prefer to keep the 9.x branches in sync.

XLR_BKP_BLOCK_* might be used by things like pg_lesslog, and it's surely
used by xlogdump. Not sure if either are worth that much attention, but
it seems worth noticing that such a change will break stuff.

Greetings,

Andres Freund

--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 9 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:
> During normal running, operations such as btree page splits are
> extremely careful about the order in which they acquire and release
> buffer locks, if they're doing something that concurrently modifies
> multiple pages. During WAL replay, that all goes out the window. Even
> if an individual WAL-record replay function does things in the right
> order for standard cases, RestoreBkpBlocks has no idea what it's doing.
> So if one or more of the referenced pages gets treated as a full-page
> image, we are left with no guarantee whatsoever about what order the
> pages are restored in. That never mattered when the code was originally
> designed, but it sure matters during Hot Standby when other queries
> might be able to see the intermediate states.
>
> I can't prove that this is the cause of bug #7648, but it's fairly easy
> to see that it could explain the symptom. You only need to assume that
> the page-being-split had been handled as a full-page image, and that
> the new right-hand page had gotten allocated by extending the relation.
> Then there will be an interval just after RestoreBkpBlocks does its
> thing where the updated left-hand sibling is in the index and is not
> locked in any way, but its right-link points off the end of the index.
> If a few indexscans come along before the replay process gets to
> continue, you'd get exactly the reported errors.
>
> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately). That's not going to be a
> small change unfortunately :-(

No, but it looks like a clear bug scenario and a clear resolution also.
I'll start looking at it.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Simon Riggs si...@2ndquadrant.com writes:
> On 9 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:
>> I'm inclined to think that we need to fix this by getting rid of
>> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
>> routines dictate when each full-page image is restored (and whether or
>> not to release the buffer lock immediately). That's not going to be a
>> small change unfortunately :-(

> No, but it looks like a clear bug scenario and a clear resolution also.
> I'll start looking at it.

I'm on it already.

			regards, tom lane
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
I wrote:
> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately). That's not going to be a
> small change unfortunately :-(

Here's a WIP patch that attacks it in this way. I've only gone as far as
fixing btree_xlog_split() for the moment, since the main point is to see
whether the coding change seems reasonable. At least in this function,
it seems that locking considerations are handled correctly as long as no
full-page image is used, so it's pretty straightforward to tweak it to
handle the case with full-page image(s) as well. I've not looked through
any other replay functions yet --- some of them may need more invasive
hacking. But so far this looks pretty good.

Note that this patch isn't correct yet even for btree_xlog_split(),
because I've not removed the RestoreBkpBlocks() call in btree_redo().
All the btree replay routines will have to get fixed before it's
testable at all.

One thing that seems a bit annoying is the use of zero-based backup
block indexes in RestoreBackupBlock, when most (not all) of the callers
are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
That's a bug waiting to happen. We could address it by changing
RestoreBackupBlock to accept a one-based argument, but what I'm thinking
of doing instead is getting rid of the one-based convention entirely;
that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One
advantage of doing that is that it'll force inspection of all code
related to this.

Comments, opinions?
			regards, tom lane

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72ea1719e7d2c3ca230c8f755261b9a439ec2a99..ee7e15fddcfca7dfe7ccba6ab021397937b1a96d 100644
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
*** btree_xlog_split(bool onleft, bool isroo
*** 323,329 ****
  		datalen -= newitemsz;
  	}

! 	/* Reconstruct right (new) sibling from scratch */
  	rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
  	Assert(BufferIsValid(rbuf));
  	rpage = (Page) BufferGetPage(rbuf);
--- 323,329 ----
  		datalen -= newitemsz;
  	}

! 	/* Reconstruct right (new) sibling page from scratch */
  	rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
  	Assert(BufferIsValid(rbuf));
  	rpage = (Page) BufferGetPage(rbuf);
*** 357,374 ****
  	/* don't release the buffer yet; we touch right page's first item below */

! 	/*
! 	 * Reconstruct left (original) sibling if needed.  Note that this code
! 	 * ensures that the items remaining on the left page are in the correct
! 	 * item number order, but it does not reproduce the physical order they
! 	 * would have had.  Is this worth changing?  See also _bt_restore_page().
! 	 */
! 	if (!(record->xl_info & XLR_BKP_BLOCK_1))
  	{
  		Buffer		lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);

  		if (BufferIsValid(lbuf))
  		{
  			Page		lpage = (Page) BufferGetPage(lbuf);
  			BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
--- 357,377 ----
  	/* don't release the buffer yet; we touch right page's first item below */

! 	/* Now reconstruct left (original) sibling page */
! 	if (record->xl_info & XLR_BKP_BLOCK_1)
! 		(void) RestoreBackupBlock(lsn, record, 0, false, false);
! 	else
  	{
  		Buffer		lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);

  		if (BufferIsValid(lbuf))
  		{
+ 			/*
+ 			 * Note that this code ensures that the items remaining on the
+ 			 * left page are in the correct item number order, but it does not
+ 			 * reproduce the physical order they would have had.  Is this
+ 			 * worth changing?  See also _bt_restore_page().
+ 			 */
  			Page		lpage = (Page) BufferGetPage(lbuf);
  			BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
*** 432,439 ****
  	/* We no longer need the right buffer */
  	UnlockReleaseBuffer(rbuf);

! 	/* Fix left-link of the page to the right of the new right sibling */
! 	if (xlrec->rnext != P_NONE && !(record->xl_info & XLR_BKP_BLOCK_2))
  	{
  		Buffer		buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
--- 435,451 ----
  	/* We no longer need the right buffer */
  	UnlockReleaseBuffer(rbuf);

! 	/*
! 	 * Fix left-link of the page to the right of the new right sibling.
! 	 *
! 	 * Note: in normal operation, we do this while still holding lock on the
! 	 * two split pages.  However, that's not necessary for correctness in WAL
! 	 * replay, because no other index update can be in progress, and readers
! 	 * will cope properly when following an obsolete left-link.
! 	 */
! 	if (record->xl_info
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 10 November 2012 18:16, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: I'm inclined to think that we need to fix this by getting rid of RestoreBkpBlocks per se, and instead having the per-WAL-record restore routines dictate when each full-page image is restored (and whether or not to release the buffer lock immediately). That's not going to be a small change unfortunately :-( Here's a WIP patch that attacks it in this way. I've only gone as far as fixing btree_xlog_split() for the moment, since the main point is to see whether the coding change seems reasonable. At least in this function, it seems that locking considerations are handled correctly as long as no full-page image is used, so it's pretty straightforward to tweak it to handle the case with full-page image(s) as well. I've not looked through any other replay functions yet --- some of them may need more invasive hacking. But so far this looks pretty good. Looks fine, but we need a top-level comment in btree code explaining why/how it follows the very well explained rules in the README. Note that this patch isn't correct yet even for btree_xlog_split(), because I've not removed the RestoreBkpBlocks() call in btree_redo(). All the btree replay routines will have to get fixed before it's testable at all. I was just about to write you back you missed that... One thing that seems a bit annoying is the use of zero-based backup block indexes in RestoreBackupBlock, when most (not all) of the callers are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc). That's a bug waiting to happen. We could address it by changing RestoreBackupBlock to accept a one-based argument, but what I'm thinking of doing instead is getting rid of the one-based convention entirely; that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One advantage of doing that is that it'll force inspection of all code related to this. 
I wouldn't do that in a back branch, but I can see why it's a good idea. Thanks for fixing. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Simon Riggs si...@2ndquadrant.com writes: On 10 November 2012 18:16, Tom Lane t...@sss.pgh.pa.us wrote: Here's a WIP patch that attacks it in this way. Looks fine, but we need a top-level comment in btree code explaining why/how it follows the very well explained rules in the README. Not sure what you'd want it to say exactly? Really these issues are per WAL-replay function, not global. I've now been through all of the btree functions, and AFAICS btree_xlog_split is the only one of them that actually has a bug; the others can be a bit lax about the order in which things get done. One thing that seems a bit annoying is the use of zero-based backup block indexes in RestoreBackupBlock, when most (not all) of the callers are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc). That's a bug waiting to happen. We could address it by changing RestoreBackupBlock to accept a one-based argument, but what I'm thinking of doing instead is getting rid of the one-based convention entirely; that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One advantage of doing that is that it'll force inspection of all code related to this. I wouldn't do that in a back branch, but I can see why its a good idea. If any of this stuff were getting used by external modules, changing it would be problematic ... but fortunately, it isn't, because we lack support for plug-in rmgrs. So I'm not worried about back-patching the change, and would prefer to keep the 9.x branches in sync. Attached is an updated patch in which nbtxlog.c has been fully converted, but I've not touched other rmgrs yet. I've tested this to the extent of having a replication slave follow a master that's running the regression tests, and it seems to work. It's difficult to prove much by testing about concurrent behavior, of course. 
I found that there's another benefit of managing backup-block replay this way, which is that we can de-contort the handling of conflict resolution by moving it into the per-record-type subroutines, instead of needing an extra switch before the RestoreBkpBlocks call. So that gives me more confidence that this is a good way to go.

regards, tom lane

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72ea1719e7d2c3ca230c8f755261b9a439ec2a99..8f534803a7cea75105869de6eaf2946e03d5437c 100644
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
*** btree_xlog_insert(bool isleaf, bool isme
*** 218,227 ****
  		datalen -= sizeof(xl_btree_metadata);
  	}

! 	if ((record->xl_info & XLR_BKP_BLOCK_1) && !ismeta && isleaf)
! 		return;					/* nothing to do */
!
! 	if (!(record->xl_info & XLR_BKP_BLOCK_1))
  	{
  		buffer = XLogReadBuffer(xlrec->target.node,
  							 ItemPointerGetBlockNumber(&(xlrec->target.tid)),
--- 218,226 ----
  		datalen -= sizeof(xl_btree_metadata);
  	}

! 	if (record->xl_info & XLR_BKP_BLOCK(0))
! 		(void) RestoreBackupBlock(lsn, record, 0, false, false);
! 	else
  	{
  		buffer = XLogReadBuffer(xlrec->target.node,
  							 ItemPointerGetBlockNumber(&(xlrec->target.tid)),
*** btree_xlog_insert(bool isleaf, bool isme
*** 249,254 ****
--- 248,260 ----
  		}
  	}

+ 	/*
+ 	 * Note: in normal operation, we'd update the metapage while still holding
+ 	 * lock on the page we inserted into.  But during replay it's not
+ 	 * necessary to hold that lock, since no other index updates can be
+ 	 * happening concurrently, and readers will cope fine with following an
+ 	 * obsolete link from the metapage.
+ 	 */
  	if (ismeta)
  		_bt_restore_meta(xlrec->target.node, lsn,
  						 md.root, md.level,
*** btree_xlog_split(bool onleft, bool isroo
*** 290,296 ****
  		forget_matching_split(xlrec->node, downlink, false);

  	/* Extract left hikey and its size (still assuming 16-bit alignment) */
! 	if (!(record->xl_info & XLR_BKP_BLOCK_1))
  	{
  		/* We assume 16-bit alignment is enough for IndexTupleSize */
  		left_hikey = (Item) datapos;
--- 296,302 ----
  		forget_matching_split(xlrec->node, downlink, false);

  	/* Extract left hikey and its size (still assuming 16-bit alignment) */
! 	if (!(record->xl_info & XLR_BKP_BLOCK(0)))
  	{
  		/* We assume 16-bit alignment is enough for IndexTupleSize */
  		left_hikey = (Item) datapos;
*** 310,316 ****
  		datalen -= sizeof(OffsetNumber);
  	}

! 	if (onleft && !(record->xl_info & XLR_BKP_BLOCK_1))
  	{
  		/*
  		 * We assume that 16-bit alignment is enough to apply IndexTupleSize
--- 316,322 ----
  		datalen -= sizeof(OffsetNumber);
  	}

! 	if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
  	{
  		/*
  		 * We assume that 16-bit alignment is enough to apply IndexTupleSize
*** btree_xlog_split(bool onleft, bool isroo
*** 323,329 ****
  		datalen -= newitemsz;
  	}

! 	/*
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
While I'm looking at this: there seems to be a bug/inconsistency in heap_xlog_freeze(). It uses a cleanup lock in the normal code path, but it doesn't tell RestoreBkpBlocks to use a cleanup lock. Which choice is correct? regards, tom lane
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 10 November 2012 21:24, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 10 November 2012 18:16, Tom Lane t...@sss.pgh.pa.us wrote: Here's a WIP patch that attacks it in this way. Looks fine, but we need a top-level comment in btree code explaining why/how it follows the very well explained rules in the README. Not sure what you'd want it to say exactly? Really these issues are per WAL-replay function, not global. I've now been through all of the btree functions, and AFAICS btree_xlog_split is the only one of them that actually has a bug; the others can be a bit lax about the order in which things get done. One thing that seems a bit annoying is the use of zero-based backup block indexes in RestoreBackupBlock, when most (not all) of the callers are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc). That's a bug waiting to happen. We could address it by changing RestoreBackupBlock to accept a one-based argument, but what I'm thinking of doing instead is getting rid of the one-based convention entirely; that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One advantage of doing that is that it'll force inspection of all code related to this. I wouldn't do that in a back branch, but I can see why its a good idea. If any of this stuff were getting used by external modules, changing it would be problematic ... but fortunately, it isn't, because we lack support for plug-in rmgrs. So I'm not worried about back-patching the change, and would prefer to keep the 9.x branches in sync. Attached is an updated patch in which nbtxlog.c has been fully converted, but I've not touched other rmgrs yet. I've tested this to the extent of having a replication slave follow a master that's running the regression tests, and it seems to work. It's difficult to prove much by testing about concurrent behavior, of course. 
I found that there's another benefit of managing backup-block replay this way, which is that we can de-contort the handling of conflict resolution by moving it into the per-record-type subroutines, instead of needing an extra switch before the RestoreBkpBlocks call. So that gives me more confidence that this is a good way to go. It's a good way to go, I'm just glad you're handling it for the back branches ;-) -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Inadequate thought about buffer locking during hot standby replay
During normal running, operations such as btree page splits are extremely careful about the order in which they acquire and release buffer locks, if they're doing something that concurrently modifies multiple pages. During WAL replay, that all goes out the window. Even if an individual WAL-record replay function does things in the right order for standard cases, RestoreBkpBlocks has no idea what it's doing. So if one or more of the referenced pages gets treated as a full-page image, we are left with no guarantee whatsoever about what order the pages are restored in. That never mattered when the code was originally designed, but it sure matters during Hot Standby when other queries might be able to see the intermediate states. I can't prove that this is the cause of bug #7648, but it's fairly easy to see that it could explain the symptom. You only need to assume that the page-being-split had been handled as a full-page image, and that the new right-hand page had gotten allocated by extending the relation. Then there will be an interval just after RestoreBkpBlocks does its thing where the updated left-hand sibling is in the index and is not locked in any way, but its right-link points off the end of the index. If a few indexscans come along before the replay process gets to continue, you'd get exactly the reported errors. I'm inclined to think that we need to fix this by getting rid of RestoreBkpBlocks per se, and instead having the per-WAL-record restore routines dictate when each full-page image is restored (and whether or not to release the buffer lock immediately). That's not going to be a small change unfortunately :-( regards, tom lane
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 2012-11-09 18:24:25 -0500, Tom Lane wrote: I can't prove that this is the cause of bug #7648, but it's fairly easy to see that it could explain the symptom. You only need to assume that the page-being-split had been handled as a full-page image, and that the new right-hand page had gotten allocated by extending the relation. Then there will be an interval just after RestoreBkpBlocks does its thing where the updated left-hand sibling is in the index and is not locked in any way, but its right-link points off the end of the index. If a few indexscans come along before the replay process gets to continue, you'd get exactly the reported errors. Sounds plausible. I'm inclined to think that we need to fix this by getting rid of RestoreBkpBlocks per se, and instead having the per-WAL-record restore routines dictate when each full-page image is restored (and whether or not to release the buffer lock immediately). That's not going to be a small change unfortunately :-( I wonder if we couldn't instead fix it by ensuring the backup blocks are in the right order at the inserting location. That would just need some care about the order of XLogRecData blocks. I am pretty unfamiliar with the nbtree locking but I seem to remember that we should be fine if we always restore from left to right? Greetings, Andres Freund
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On Fri, Nov 9, 2012 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: During normal running, operations such as btree page splits are extremely careful about the order in which they acquire and release buffer locks, if they're doing something that concurrently modifies multiple pages. During WAL replay, that all goes out the window. Even if an individual WAL-record replay function does things in the right order for standard cases, RestoreBkpBlocks has no idea what it's doing. So if one or more of the referenced pages gets treated as a full-page image, we are left with no guarantee whatsoever about what order the pages are restored in. That never mattered when the code was originally designed, but it sure matters during Hot Standby when other queries might be able to see the intermediate states. I can't prove that this is the cause of bug #7648, (I was the reporter of 7648) To lend slightly more circumstantial evidence in support of this, I also happened to note that the relfile in question was the last segment and it was about a quarter full, so the access attempt was definitely at the extreme outermost edge of the index most generally. -- fdr
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Andres Freund and...@anarazel.de writes: On 2012-11-09 18:24:25 -0500, Tom Lane wrote: I'm inclined to think that we need to fix this by getting rid of RestoreBkpBlocks per se, and instead having the per-WAL-record restore routines dictate when each full-page image is restored (and whether or not to release the buffer lock immediately). That's not going to be a small change unfortunately :-( I wonder if we couldn't instead fix it by ensuring the backup blocks are in the right order at the inserting location. That would just need some care about the order of XLogRecData blocks. I don't think that's a good way to go. In the first place, if we did that the fix would require incompatible changes in the contents of WAL streams. In the second place, there are already severe constraints on the positioning of backup blocks to ensure that WAL records can be uniquely decoded (the section of access/transam/README about WAL coding touches on this) --- I don't think it's a good plan to add still more constraints there. And in the third place, the specific problem we're positing here results from a failure to hold the buffer lock for a full-page image until after we're done restoring a *non* full-page image represented elsewhere in the same WAL record. In general, of the set of pages touched by a WAL record, any arbitrary subset of them might be converted to FPIs during XLogInsert; but the replay-time locking requirements are going to be the same regardless of that. So AFAICS, any design in which RestoreBkpBlocks acts independently of the non-full-page-image updates is just broken. regards, tom lane