Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Thu, Nov 29, 2018 at 3:44 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Fri, Aug 24, 2018 at 5:53 PM Alexander Korotkov wrote:
> > >
> > > Given I've had no feedback on this idea yet, I'll try to implement a PoC
> > > patch for that. It doesn't look difficult to do, and we'll see how it
> > > works.
>
> Unfortunately, the current version of the patch doesn't pass the tests and
> fails on initdb. But maybe you already have the PoC you were talking about,
> which will also incorporate the feedback from this thread? For now I'll move
> it to the next CF.

Finally, I managed to write a PoC. Looking at the list of problems I enumerated in [1], this PoC is aimed at 1 and 3.

> 1) Data corruption on file truncation error (explained in [1]).
> 2) Expensive scanning of the whole shared buffers before file truncation.
> 3) Cancellation of read-only queries on standby even if hot_standby_feedback
> is on, caused by replication of AccessExclusiveLock.

Issue 2 is a fairly independent problem and could be addressed later.

Basically, this patch does the following:

1. Introduces a new flag, BM_DIRTY_BARRIER, which prevents a dirty buffer from being written out.

2. Implements two-phase truncation of relation buffers. The first phase runs prior to file truncation and marks dirty buffers past the truncation point as BM_DIRTY_BARRIER. The second phase runs after file truncation and actually wipes out the buffers past the truncation point.

3. If an exception happens during file truncation, the BM_DIRTY_BARRIER flag is released from the buffers. Thus, no data corruption should happen here. If the file truncation was partially complete, the file might be extended again by a write of a dirty buffer. I'm not sure how likely that is, but such extension could lead to errors again. Still, this shouldn't cause data corruption.

4. Having too many buffers marked as BM_DIRTY_BARRIER would paralyze the buffer manager. This is why we allow no more than NBuffers/2 buffers to be marked as BM_DIRTY_BARRIER. If the limit is exceeded, dirty buffers are simply written out during the first phase.

5. lazy_truncate_heap() now takes ExclusiveLock instead of AccessExclusiveLock. This part is not really complete. At least, we need to ensure that reads past the truncation point, caused by read-only queries running concurrently with the truncation, don't lead to real errors.

Any thoughts?

1. https://www.postgresql.org/message-id/CAPpHfdtD3U2DpGZQJNe21s9s1s-Va7NRNcP1isvdCuJxzYypcg%40mail.gmail.com

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

0001-POC-fix-relation-truncation-1.patch
Description: Binary data
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Hi,

On 2018-11-29 13:45:15 +0100, Dmitry Dolgov wrote:
> > On Fri, Aug 24, 2018 at 5:53 PM Alexander Korotkov wrote:
> > >
> > > Given I've had no feedback on this idea yet, I'll try to implement a PoC
> > > patch for that. It doesn't look difficult to do, and we'll see how it
> > > works.
>
> Unfortunately, the current version of the patch doesn't pass the tests and
> fails on initdb. But maybe you already have the PoC you were talking about,
> which will also incorporate the feedback from this thread? For now I'll move
> it to the next CF.

Given that nothing has happened since, I'm going to close this CF entry as returned with feedback.

Greetings,

Andres Freund
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
> On Fri, Aug 24, 2018 at 5:53 PM Alexander Korotkov wrote:
> >
> > Given I've had no feedback on this idea yet, I'll try to implement a PoC
> > patch for that. It doesn't look difficult to do, and we'll see how it
> > works.

Unfortunately, the current version of the patch doesn't pass the tests and fails on initdb. But maybe you already have the PoC you were talking about, which will also incorporate the feedback from this thread? For now I'll move it to the next CF.
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Mon, Aug 27, 2018 at 11:38 AM, Alexander Korotkov wrote: > The aspect I'm more concerned here about is whether we miss ability > for detecting some of IO errors, if we don't distinguish new pages > from pages whose tuples were removed by vacuum. My main concern is correctness. If we ever have valid-looking buffers in shared_buffers after the corresponding data has been truncated away on disk, we've got to make sure that nobody ever confuses one of them with an actually-valid buffer. Reading over your algorithm, I can't convince myself that you have that case nailed down tightly enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Hi! Thank you for the feedback.

On Sun, Aug 26, 2018 at 4:09 AM Robert Haas wrote:
> On Tue, Aug 21, 2018 at 9:10 AM, Alexander Korotkov wrote:
> > After heap truncation using this algorithm, shared buffers may contain
> > past-EOF buffers. But those buffers are empty (no used items) and clean.
> > So, read-only queries shouldn't hint those buffers dirty because there
> > are no used items. Normally, these buffers will just be evicted from the
> > shared buffer arena. If relation extension happens shortly after heap
> > truncation, some of those buffers could be found after the relation
> > extension. I think this situation could be handled. For instance, we can
> > teach vacuum to mark a page as new once all the tuples are gone.
>
> I think this all sounds pretty dangerous and fragile, especially in
> view of the pluggable storage work. If we start to add new storage
> formats, deductions based on the specifics of the current heap's
> hint-bit behavior may turn out not to be valid. Now maybe you could
> speculate that it won't matter because perhaps truncation will work
> differently in other storage formats too, but it doesn't sound to me
> like we'd be wise to bet on it working out that way.

Hmm, I'm not especially concerned about pluggable storage here. Pluggable storage engines decide for themselves how they manage vacuum, including relation truncation if needed. They might or might not reuse the relation truncation function we have for heap. What we should provide for that relation truncation function is an understandable and predictable interface. If the relation truncation function cuts off trailing relation pages that were previously marked as new, that looks fair enough to me. The aspect I'm more concerned about here is whether we lose the ability to detect some I/O errors if we don't distinguish new pages from pages whose tuples were removed by vacuum.

> IIRC, Andres had some patches revising shared buffer management that
> allowed the ending block number to be maintained in shared memory.
> I'm waving my hands here, but with that kind of a system you can
> imagine that maybe there could also be a flag bit indicating whether a
> truncation is in progress. So, reduce the number of pages and set the
> bit; then zap all the pages above that value that are still present in
> shared_buffers; then clear the bit. Or maybe we don't even need the
> bit, but I think we do need some kind of race-free mechanism to make
> sure that we never try to read pages that either have been truncated
> away or are in the process of being truncated away.

If we had some per-relation shared memory information, then we could make this work even without a special write-barrier bit. Instead, we could place a mark on the whole relation saying "please hold off on writes past the following pending truncation point". Also, having the ending block number in shared memory could save us from trying to read blocks past EOF. So, I'm sure that once Andres's work on revising shared buffer management is complete, we will be able to solve these problems better.

But there is also the question of time. As I understand it, revising shared buffer management could not realistically get committed to PostgreSQL 12. And we have a pretty nasty set of problems here. It would be nice to do something about them during this release cycle. But for sure, we should keep in mind how this solution should be revised once we have the new shared buffer management.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Tue, Aug 21, 2018 at 9:10 AM, Alexander Korotkov wrote:
> After heap truncation using this algorithm, shared buffers may contain
> past-EOF buffers. But those buffers are empty (no used items) and clean.
> So, read-only queries shouldn't hint those buffers dirty because there
> are no used items. Normally, these buffers will just be evicted from the
> shared buffer arena. If relation extension happens shortly after heap
> truncation, some of those buffers could be found after the relation
> extension. I think this situation could be handled. For instance, we can
> teach vacuum to mark a page as new once all the tuples are gone.

I think this all sounds pretty dangerous and fragile, especially in view of the pluggable storage work. If we start to add new storage formats, deductions based on the specifics of the current heap's hint-bit behavior may turn out not to be valid. Now maybe you could speculate that it won't matter because perhaps truncation will work differently in other storage formats too, but it doesn't sound to me like we'd be wise to bet on it working out that way.

IIRC, Andres had some patches revising shared buffer management that allowed the ending block number to be maintained in shared memory. I'm waving my hands here, but with that kind of a system you can imagine that maybe there could also be a flag bit indicating whether a truncation is in progress. So, reduce the number of pages and set the bit; then zap all the pages above that value that are still present in shared_buffers; then clear the bit. Or maybe we don't even need the bit, but I think we do need some kind of race-free mechanism to make sure that we never try to read pages that either have been truncated away or are in the process of being truncated away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Tue, Aug 21, 2018 at 4:10 PM Alexander Korotkov wrote:
> After reading [1] and [2] I gather that there are at least 3 different
> issues with heap truncation:
> 1) Data corruption on file truncation error (explained in [1]).
> 2) Expensive scanning of the whole shared buffers before file truncation.
> 3) Cancellation of read-only queries on standby even if
> hot_standby_feedback is on, caused by replication of AccessExclusiveLock.
>
> It seems that fixing any of those issues requires a redesign of heap
> truncation. So, ideally, a redesign of heap truncation should fix all of
> the issues above, or at least it should be understood how the rest of the
> issues can be fixed later using the new design.
>
> I would like to share some sketchy thoughts of mine about a new heap
> truncation design. Let's imagine we introduced a dirty_barrier buffer
> flag, which prevents a dirty buffer from being written out (and
> correspondingly evicted). Then the truncation algorithm could look like
> this:
>
> 1) Acquire ExclusiveLock on the relation.
> 2) Calculate the truncation point using count_nondeletable_pages(), while
> simultaneously placing the dirty_barrier flag on dirty buffers and saving
> their numbers to an array. Assuming no writes are performed concurrently,
> no to-be-truncated-away pages should be written out from this point.
> 3) Truncate the data files.
> 4) Iterate over the past-truncation-point buffers and clear the dirty and
> dirty_barrier flags from them (using the numbers we saved to the array in
> step #2).
> 5) Release the relation lock.
> *) If an exception happens after step #2, iterate over the
> past-truncation-point buffers and clear the dirty_barrier flags from them
> (using the numbers we saved to the array in step #2).
>
> After heap truncation using this algorithm, shared buffers may contain
> past-EOF buffers. But those buffers are empty (no used items) and clean.
> So, read-only queries shouldn't hint those buffers dirty because there
> are no used items. Normally, these buffers will just be evicted from the
> shared buffer arena. If relation extension happens shortly after heap
> truncation, some of those buffers could be found after the relation
> extension. I think this situation could be handled. For instance, we can
> teach vacuum to mark a page as new once all the tuples are gone.
>
> We're taking only an exclusive lock here. And assuming we teach our scans
> to treat a page-past-EOF situation as no-visible-tuples-found, concurrent
> read-only queries will work concurrently with the heap truncation. Also,
> we don't have to scan the whole shared buffers: only past-truncation-point
> buffers are scanned in step #2. Later, flags are cleared only from the
> past-truncation-point dirty buffers. Data corruption on truncation error
> also shouldn't happen, because we don't throw away any dirty buffers
> before ensuring that the data files were successfully truncated.
>
> The problem I see with this approach so far is that placing too many
> dirty_barrier flags can affect concurrent activity. To cope with that we
> may, for instance, truncate the relation in multiple iterations when we
> find too many past-truncation-point dirty buffers.
>
> Any thoughts?

Given I've had no feedback on this idea yet, I'll try to implement a PoC patch for that. It doesn't look difficult to do, and we'll see how it works.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Fri, Aug 17, 2018 at 9:55 PM Tom Lane wrote:
> But then you are injecting bad pages into the shared buffer arena.
> In any case, depending on that behavior seems like a bad idea, because
> it's a pretty questionable kluge in itself.
>
> Another point is that the truncation code attempts to remove all
> to-be-truncated-away pages from the shared buffer arena, but that only
> works if nobody else is loading such pages into shared buffers
> concurrently. In the presence of concurrent scans, we might be left
> with valid-looking buffers for pages that have been truncated away
> on-disk. That could cause all sorts of fun later. Yeah, the buffers
> should contain only dead tuples ... but, for example, they might not
> be hinted dead. If somebody sets one of those hint bits and then
> writes the buffer back out to disk, you've got real problems.

I don't yet see how that's possible... count_nondeletable_pages() ensures that every to-be-truncated-away page doesn't contain any used item. From reading [1] I gather that we might even have live tuples if truncation failed. But if truncation didn't fail, we shouldn't get this hint-bit problem. Please correct me if I'm wrong.

After reading [1] and [2] I gather that there are at least 3 different issues with heap truncation:
1) Data corruption on file truncation error (explained in [1]).
2) Expensive scanning of the whole shared buffers before file truncation.
3) Cancellation of read-only queries on standby even if hot_standby_feedback is on, caused by replication of AccessExclusiveLock.

It seems that fixing any of those issues requires a redesign of heap truncation. So, ideally, a redesign of heap truncation should fix all of the issues above, or at least it should be understood how the rest of the issues can be fixed later using the new design.

I would like to share some sketchy thoughts of mine about a new heap truncation design. Let's imagine we introduced a dirty_barrier buffer flag, which prevents a dirty buffer from being written out (and correspondingly evicted). Then the truncation algorithm could look like this:

1) Acquire ExclusiveLock on the relation.
2) Calculate the truncation point using count_nondeletable_pages(), while simultaneously placing the dirty_barrier flag on dirty buffers and saving their numbers to an array. Assuming no writes are performed concurrently, no to-be-truncated-away pages should be written out from this point.
3) Truncate the data files.
4) Iterate over the past-truncation-point buffers and clear the dirty and dirty_barrier flags from them (using the numbers we saved to the array in step #2).
5) Release the relation lock.
*) If an exception happens after step #2, iterate over the past-truncation-point buffers and clear the dirty_barrier flags from them (using the numbers we saved to the array in step #2).

After heap truncation using this algorithm, shared buffers may contain past-EOF buffers. But those buffers are empty (no used items) and clean. So, read-only queries shouldn't hint those buffers dirty because there are no used items. Normally, these buffers will just be evicted from the shared buffer arena. If relation extension happens shortly after heap truncation, some of those buffers could be found after the relation extension. I think this situation could be handled. For instance, we can teach vacuum to mark a page as new once all the tuples are gone.

We're taking only an exclusive lock here. And assuming we teach our scans to treat a page-past-EOF situation as no-visible-tuples-found, concurrent read-only queries will work concurrently with the heap truncation. Also, we don't have to scan the whole shared buffers: only past-truncation-point buffers are scanned in step #2. Later, flags are cleared only from the past-truncation-point dirty buffers. Data corruption on truncation error also shouldn't happen, because we don't throw away any dirty buffers before ensuring that the data files were successfully truncated.

The problem I see with this approach so far is that placing too many dirty_barrier flags can affect concurrent activity. To cope with that we may, for instance, truncate the relation in multiple iterations when we find too many past-truncation-point dirty buffers.

Any thoughts?

Links.
1. https://www.postgresql.org/message-id/flat/5BBC590AE8DF4ED1A170E4D48F1B53AC%40tunaPC
2. https://www.postgresql.org/message-id/flat/CAHGQGwE5UqFqSq1%3DkV3QtTUtXphTdyHA-8rAj4A%3DY%2Be4kyp3BQ%40mail.gmail.com

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Alexander Korotkov writes: > On Fri, Aug 17, 2018 at 9:55 PM Tom Lane wrote: >> Another point is that the truncation code attempts to remove all >> to-be-truncated-away pages from the shared buffer arena, but that only >> works if nobody else is loading such pages into shared buffers >> concurrently. In the presence of concurrent scans, we might be left >> with valid-looking buffers for pages that have been truncated away >> on-disk. That could cause all sorts of fun later. Yeah, the buffers >> should contain only dead tuples ... but, for example, they might not >> be hinted dead. If somebody sets one of those hint bits and then >> writes the buffer back out to disk, you've got real problems. > Thank you for the explanation. I see that injecting past OEF pages > into shared buffers doesn't look good. I start thinking about letting > caller of ReadBuffer() (or its variation) handle past OEF situation. That'd still have the same race condition, though: between the time we start to drop the doomed pages from shared buffers, and the time we actually perform ftruncate, concurrent scans could re-load such pages into shared buffers. Could it work to ftruncate first and flush shared buffers after? Probably not, I think the write-back-dirty-hint-bits scenario breaks that one. If this were easy, we'd have fixed it years ago :-(. It'd sure be awfully nice not to need AEL during autovacuum, even transiently; but I'm not sure how we get there without adding an unpleasant amount of substitute locking in table scans. regards, tom lane
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Fri, Aug 17, 2018 at 9:55 PM Tom Lane wrote:
> Alexander Korotkov writes:
> > On Fri, Aug 17, 2018 at 8:38 PM Tom Lane wrote:
> >> Alexander Korotkov writes:
> >>> Yes, that's correct. On standby read-only queries can tolerate
> >>> concurrent heap truncation.
>
> >> Uh, what???
>
> > VACUUM truncates heap relation only after deletion of all the tuples
> > from tail pages.
>
> Right; the problem I'm worried about is possible accesses to
> already-removed pages by concurrent scans.
>
> > And in md.c we already have logic to return zeroed pages, when trying
> > to read past relation in recovery.
>
> But then you are injecting bad pages into the shared buffer arena.
> In any case, depending on that behavior seems like a bad idea, because
> it's a pretty questionable kluge in itself.
>
> Another point is that the truncation code attempts to remove all
> to-be-truncated-away pages from the shared buffer arena, but that only
> works if nobody else is loading such pages into shared buffers
> concurrently. In the presence of concurrent scans, we might be left
> with valid-looking buffers for pages that have been truncated away
> on-disk. That could cause all sorts of fun later. Yeah, the buffers
> should contain only dead tuples ... but, for example, they might not
> be hinted dead. If somebody sets one of those hint bits and then
> writes the buffer back out to disk, you've got real problems.
>
> Perhaps there's some gold to be mined by treating truncation locks
> differently from other AELs, but I don't think you can just ignore
> them on the standby, any more than they can be ignored on the master.

Thank you for the explanation. I see that injecting past-EOF pages into shared buffers doesn't look good. I'm starting to think about letting the caller of ReadBuffer() (or a variation of it) handle the past-EOF situation.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Alexander Korotkov writes: > On Fri, Aug 17, 2018 at 8:38 PM Tom Lane wrote: >> Alexander Korotkov writes: >>> Yes, that's correct. On standby read-only queries can tolerate >>> concurrent heap truncation. >> Uh, what??? > VACUUM truncates heap relation only after deletion of all the tuples > from tail pages. Right; the problem I'm worried about is possible accesses to already-removed pages by concurrent scans. > And in md.c we already have logic to return zeroed pages, when trying > to read past relation in recovery. But then you are injecting bad pages into the shared buffer arena. In any case, depending on that behavior seems like a bad idea, because it's a pretty questionable kluge in itself. Another point is that the truncation code attempts to remove all to-be-truncated-away pages from the shared buffer arena, but that only works if nobody else is loading such pages into shared buffers concurrently. In the presence of concurrent scans, we might be left with valid-looking buffers for pages that have been truncated away on-disk. That could cause all sorts of fun later. Yeah, the buffers should contain only dead tuples ... but, for example, they might not be hinted dead. If somebody sets one of those hint bits and then writes the buffer back out to disk, you've got real problems. Perhaps there's some gold to be mined by treating truncation locks differently from other AELs, but I don't think you can just ignore them on the standby, any more than they can be ignored on the master. regards, tom lane
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Fri, Aug 17, 2018 at 8:38 PM Tom Lane wrote:
> Alexander Korotkov writes:
> > On Fri, Aug 17, 2018 at 6:41 PM Andres Freund wrote:
> >> There's another patch, which I thought Alexander was referring to, that
> >> does something a bit smarter. On a super short skim it seems to
> >> introduce a separate type of AEL lock that's not replicated, by my
> >> reading?
>
> > Yes, that's correct. On standby read-only queries can tolerate
> > concurrent heap truncation.
>
> Uh, what???

VACUUM truncates the heap relation only after deleting all the tuples from the tail pages. So, on standby, the heap truncation record will be replayed only after the heap tuple deletion records are replayed. Therefore, if a query on standby should see some of those tuples, the WAL replay stop (or query cancel) should happen before the corresponding tuples are deleted, thanks to our recovery-conflict-with-snapshot logic. By the time we're going to replay the heap truncation record, no query should see tuples on the tail pages.

And in md.c we already have logic to return zeroed pages when trying to read past the end of the relation in recovery.

    /*
     * Short read: we are at or past EOF, or we read a partial block at
     * EOF.  Normally this is an error; upper levels should never try to
     * read a nonexistent block.  However, if zero_damaged_pages is ON or
     * we are InRecovery, we should instead return zeroes without
     * complaining.  This allows, for example, the case of trying to
     * update a block that was later truncated away.
     */
    if (zero_damaged_pages || InRecovery)
        MemSet(buffer, 0, BLCKSZ);
    else
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
                        blocknum, FilePathName(v->mdfd_vfd),
                        nbytes, BLCKSZ)));

But I'm concerned that FileSeek() could return an error. And I also wonder what _mdfd_getseg() would do on a truncated segment. It seems that in recovery it would automatically extend the relation, which is unacceptable for this purpose. So, sorry for the bother: this patch definitely needs to be revised.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Alexander Korotkov writes: > On Fri, Aug 17, 2018 at 6:41 PM Andres Freund wrote: >> There's another patch, which I thought Alexander was referring to, that >> does something a bit smarger. On a super short skim it seems to >> introduce a separate type of AEL lock that's not replicated, by my >> reading? > Yes, that's correct. On standby read-only queries can tolerate > concurrent heap truncation. Uh, what??? regards, tom lane
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Fri, Aug 17, 2018 at 6:41 PM Andres Freund wrote:
> On 2018-08-17 11:35:40 -0400, Tom Lane wrote:
> > Andres Freund writes:
> > > On 2018-08-17 18:00:20 +0300, Alexander Korotkov wrote:
> > >> So, do we have any objections to committing this?
>
> > > I think this needs more review by other senior hackers in the community.
>
> > TBH it sounds like a horrible hack. Disable vacuum truncation?
>
> There's another patch, which I thought Alexander was referring to, that
> does something a bit smarter. On a super short skim it seems to
> introduce a separate type of AEL lock that's not replicated, by my
> reading?

Yes, that's correct. On standby, read-only queries can tolerate concurrent heap truncation, so there is no point in replicating the AEL to the standby. Previous versions of the patch tried to achieve that by introducing a flag, but it appears that the correct way to do this is simply a new, non-replicated type of AEL lock.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On 2018-08-17 11:35:40 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2018-08-17 18:00:20 +0300, Alexander Korotkov wrote:
> >> So, do we have any objections to committing this?
>
> > I think this needs more review by other senior hackers in the community.
>
> TBH it sounds like a horrible hack. Disable vacuum truncation?

There's another patch, which I thought Alexander was referring to, that does something a bit smarter. On a super short skim it seems to introduce a separate type of AEL lock that's not replicated, by my reading?

Greetings,

Andres Freund
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Andres Freund writes: > On 2018-08-17 18:00:20 +0300, Alexander Korotkov wrote: >> So, do we have any objections to committing this? > I think this needs more review by other senior hackers in the community. TBH it sounds like a horrible hack. Disable vacuum truncation? That can't be a good idea. If it were something we were doing behind the scenes as a temporary expedient, maybe we could hold our noses and do it for awhile. But once you introduce a reloption based on this, you can't ever get rid of it. I think we need to spend more effort looking for a less invasive, less compromised solution. regards, tom lane
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Hi, On 2018-08-17 18:00:20 +0300, Alexander Korotkov wrote: > So, do we have any objections to committing this? I think this needs more review by other senior hackers in the community. Greetings, Andres Freund
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Thu, Aug 16, 2018 at 2:16 PM Alexander Korotkov wrote:
> On Tue, Aug 14, 2018 at 12:05 PM Masahiko Sawada wrote:
> >
> > On Wed, Feb 28, 2018 at 11:24 PM, Ivan Kartyshov wrote:
> > > The main goal of my changes is to let long read-only transactions run
> > > on replica if hot_standby_feedback is turned on.
> >
> > FWIW the idea proposed on the thread[1], which allows us to disable
> > the heap truncation by vacuum per table, might help this issue. Since
> > the original problem on that thread is a performance problem, an
> > alternative proposal might be better there, but I think that approach
> > would be helpful for this issue too, and it is simple. A downside of
> > that idea is that there is additional work for the user to configure
> > the reloption for each table.
>
> Thank you for pointing out this thread. It's possible to cope with this
> downside by introducing a GUC that would serve as the default when the
> reloption is not defined. But the real downside is the inability to
> truncate the heap. So, options are good if they provide a workaround for
> users, but that shouldn't stop us from fixing the issues around heap
> truncation.

So, do we have any objections to committing this?

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Hi!

On Tue, Aug 14, 2018 at 12:05 PM Masahiko Sawada wrote:
> > On Wed, Feb 28, 2018 at 11:24 PM, Ivan Kartyshov wrote:
> > > The main goal of my changes is to let long read-only transactions run
> > > on replica if hot_standby_feedback is turned on.
>
> FWIW the idea proposed on the thread[1], which allows us to disable
> the heap truncation by vacuum per table, might help this issue. Since
> the original problem on that thread is a performance problem, an
> alternative proposal might be better there, but I think that approach
> would be helpful for this issue too, and it is simple. A downside of
> that idea is that there is additional work for the user to configure
> the reloption for each table.

Thank you for pointing out this thread. It's possible to cope with this downside by introducing a GUC that would serve as the default when the reloption is not defined. But the real downside is the inability to truncate the heap. So, options are good if they provide a workaround for users, but that shouldn't stop us from fixing the issues around heap truncation.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Wed, Feb 28, 2018 at 11:24 PM, Ivan Kartyshov wrote:
> Thank you for your valuable comments. I've made a few adjustments.

Thank you for working on this!

> The main goal of my changes is to let long read-only transactions run on
> replica if hot_standby_feedback is turned on.

FWIW the idea proposed on the thread[1], which allows us to disable the heap truncation by vacuum per table, might help this issue. Since the original problem on that thread is a performance problem, an alternative proposal might be better there, but I think that approach would be helpful for this issue too, and it is simple. A downside of that idea is that there is additional work for the user to configure the reloption for each table.

[1] https://www.postgresql.org/message-id/CAHGQGwE5UqFqSq1%3DkV3QtTUtXphTdyHA-8rAj4A%3DY%2Be4kyp3BQ%40mail.gmail.com

Regards,

-- 
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Hi!

On Fri, Aug 10, 2018 at 5:07 PM Alexander Korotkov wrote:
> On Thu, Aug 9, 2018 at 11:26 PM Ivan Kartyshov wrote:
> > Alexander Korotkov wrote on 2018-06-20 20:54:
> > > Thinking about that more I found that adding a vacuum mark as an extra
> > > argument to LockAcquireExtended is also wrong. It would still be hard
> > > to determine whether we should log the lock in LogStandbySnapshot().
> > > We'd have to add that flag to the shared memory table of locks. And
> > > that looks like a kludge.
> > >
> > > Therefore, I'd like to propose another approach: introduce a new lock
> > > level. So, AccessExclusiveLock will be split into two:
> > > AccessExclusiveLocalLock and AccessExclusiveLock. Unlike
> > > AccessExclusiveLock, AccessExclusiveLocalLock will not be logged to
> > > the standby, and will be used for heap truncation.
> > >
> > > I expect some resistance to my proposal, because fixing this "small
> > > bug" doesn't deserve a new lock level. But the current behavior of
> > > hot_standby_feedback is buggy. From the user's perspective, the
> > > hot_standby_feedback option just doesn't work: it causes the master to
> > > bloat without protecting standby queries from cancellation. We need to
> > > fix that, but I don't see another proper way to do it except adding a
> > > new lock level...
> >
> > Your offer is very interesting; it made the patch smaller and more
> > beautiful. So I updated the patch and made changes in accordance with
> > the proposed concept of the special AccessExclusiveLocalLock.
> >
> > I would like to hear your opinion on this implementation.
>
> In general this patch looks good to me. It seems that the comments and
> documentation need minor enhancements. I'll make them and post the
> revised patch.

Find the revised patch attached. It contains some enhancements in
comments, formatting and documentation. BTW, I decided that we should
enumerate the ACCESS EXCLUSIVE LOCAL lock before the ACCESS EXCLUSIVE
lock, because we enumerate locks in ascending order of strength.
So, since ACCESS EXCLUSIVE is WAL-logged, it's definitely "stronger". I
think the introduction of a new lock level is a significant change and
can't be backpatched. But I think it's worth backpatching a note to the
documentation which clarifies why hot_standby_feedback might have
unexpected behavior.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

hsfeedback_locallock-2.patch
Description: Binary data

hsfeedback_backpatch_note.patch
Description: Binary data
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Hi!

On Thu, Aug 9, 2018 at 11:26 PM Ivan Kartyshov wrote:
> Alexander Korotkov wrote on 2018-06-20 20:54:
> > Thinking about that more I found that adding a vacuum mark as an extra
> > argument to LockAcquireExtended is also wrong. It would still be hard
> > to determine whether we should log the lock in LogStandbySnapshot().
> > We'd have to add that flag to the shared memory table of locks. And
> > that looks like a kludge.
> >
> > Therefore, I'd like to propose another approach: introduce a new lock
> > level. So, AccessExclusiveLock will be split into two:
> > AccessExclusiveLocalLock and AccessExclusiveLock. Unlike
> > AccessExclusiveLock, AccessExclusiveLocalLock will not be logged to
> > the standby, and will be used for heap truncation.
> >
> > I expect some resistance to my proposal, because fixing this "small
> > bug" doesn't deserve a new lock level. But the current behavior of
> > hot_standby_feedback is buggy. From the user's perspective, the
> > hot_standby_feedback option just doesn't work: it causes the master to
> > bloat without protecting standby queries from cancellation. We need to
> > fix that, but I don't see another proper way to do it except adding a
> > new lock level...
>
> Your offer is very interesting; it made the patch smaller and more
> beautiful. So I updated the patch and made changes in accordance with
> the proposed concept of the special AccessExclusiveLocalLock.
>
> I would like to hear your opinion on this implementation.

In general this patch looks good to me. It seems that the comments and
documentation need minor enhancements. I'll make them and post the
revised patch.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Hello, thank you for your review.

Alexander Korotkov wrote on 2018-06-20 20:54:
> Thinking about that more I found that adding a vacuum mark as an extra
> argument to LockAcquireExtended is also wrong. It would still be hard
> to determine whether we should log the lock in LogStandbySnapshot().
> We'd have to add that flag to the shared memory table of locks. And
> that looks like a kludge.
>
> Therefore, I'd like to propose another approach: introduce a new lock
> level. So, AccessExclusiveLock will be split into two:
> AccessExclusiveLocalLock and AccessExclusiveLock. Unlike
> AccessExclusiveLock, AccessExclusiveLocalLock will not be logged to
> the standby, and will be used for heap truncation.
>
> I expect some resistance to my proposal, because fixing this "small
> bug" doesn't deserve a new lock level. But the current behavior of
> hot_standby_feedback is buggy. From the user's perspective, the
> hot_standby_feedback option just doesn't work: it causes the master to
> bloat without protecting standby queries from cancellation. We need to
> fix that, but I don't see another proper way to do it except adding a
> new lock level...

Your offer is very interesting; it made the patch smaller and more
beautiful. So I updated the patch and made changes in accordance with the
proposed concept of the special AccessExclusiveLocalLock.

> I don't yet understand what the problem is here and why this patch
> should solve it. As I understand the idea of this patch, GetOldestXmin()
> will initialize the xmin of the walsender with the lowest xmin of the
> replication slots. But the xmin of replication slots will be taken into
> account anyway by GetSnapshotData() called by vacuum.
>
> > How to test:
> > Make async replica, turn on feedback, reduce max_standby_streaming_delay
> > and aggressive autovacuum.
>
> You forgot to mention that one should set up the replication using a
> replication slot. Naturally, if a replication slot doesn't exist, then
> the master shouldn't keep dead tuples for a disconnected standby,
> because the master doesn't know whether the standby will reconnect or
> is gone forever.
I tried to solve hot_standby_feedback without replication slots, and I
found a small window in which vacuum on the master may truncate heap
pages that the standby needs for a just-started transaction.

How to test:
Make an async replica, turn on feedback and reduce
max_standby_streaming_delay. Make autovacuum more aggressive:

autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0

Test1:
Here we put load on the master and simulate a long transaction with
repeated 1-second SEQSCANs on the replica (by calling pg_sleep with a
1-second duration every 6 seconds).

REPLICA:
  hot_standby = on
  max_standby_streaming_delay = 1s
  hot_standby_feedback = on

MASTER:
  start
  CREATE TABLE test AS (SELECT id, 1 AS value FROM generate_series(1,1) id);
  pgbench -T600 -P2 -n --file=master.sql postgres
  (master.sql: update test set value = value;)

REPLICA:
  start
  BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
  SELECT pg_sleep(value) FROM test;
  \watch 6

--- Autovacuum truncates pages at the end ---

Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version lazy_vacuum_truncation passed without fatal
errors. Only sometimes an error occurs, because this test is too
synthetic:

ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.

This is caused by ResolveRecoveryConflictWithSnapshot being raised while
redoing some visibility flags; to avoid this conflict we can run test2 or
increase max_standby_streaming_delay.
Test2:
Here we put load on the master and simulate a long transaction on the
replica (by taking a LOCK on the table).

REPLICA:
  hot_standby = on
  max_standby_streaming_delay = 1s
  hot_standby_feedback = on

MASTER:
  start
  CREATE TABLE test AS (SELECT id, 1 AS value FROM generate_series(1,1) id);
  pgbench -T600 -P2 -n --file=master.sql postgres
  (master.sql: update test set value = value;)

REPLICA:
  start
  BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
  LOCK TABLE test IN ACCESS SHARE MODE;
  select * from test;
  \watch 6

--- Autovacuum truncates pages at the end ---

Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version lazy_vacuum_truncation passed without fatal errors.

I would like to hear your opinion on this implementation.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 73934e5..73975cc 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -854,8 +854,8 @@ ERROR: could not serialize access due to read/write dependencies among transact
-        Conflicts with the ACCESS EXCLUSIVE lock
-        mode only.
+        Conflicts with the ACCESS EXCLUSIVE and
+
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Thu, Mar 1, 2018 at 3:24 AM, Ivan Kartyshov wrote:
> Could you give me your ideas on these patches.

Hi Ivan,

Not sure if this is expected at this stage or not, but just in case you
didn't know... the tests under src/test/subscription seem to be broken:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/409358734

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Hi!

Thank you for your work on this issue.

On Wed, Feb 28, 2018 at 5:25 PM Ivan Kartyshov wrote:
> Thank you for your valuable comments. I've made a few adjustments.
>
> The main goal of my changes is to let long read-only transactions run on
> replica if hot_standby_feedback is turned on.
>
> Patch1 - hsfeedback_av_truncate.patch is made to stop
> ResolveRecoveryConflictWithLock occurring on the replica after autovacuum
> lazily truncates the heap on the master, cutting some pages at the end.
> When hot_standby_feedback is on, we know that autovacuum does not remove
> anything superfluous that could be needed on the standby, so there is no
> need to raise any ResolveRecoveryConflict*.
>
> 1) Add to xl_standby_locks and xl_smgr_truncate an isautovacuum flag,
> which tells us that autovacuum generated them.
>
> 2) When autovacuum decides to trim the table (using lazy_truncate_heap),
> it takes AccessExclusiveLock and sends this lock to the replica, but the
> replica should ignore AccessExclusiveLock if hot_standby_feedback=on.
>
> 3) When the autovacuum truncate WAL message is replayed on a replica, it
> takes ExclusiveLock on the table, so as not to interfere with read-only
> requests.

I see that you use the IsAutoVacuumWorkerProcess() function to determine
whether this access exclusive lock is caused by vacuum. And I see at
least two issues with that.

1) That wouldn't work for manually run vacuums. I understand that query
cancellation caused by autovacuum is probably the most annoying case, but
an unnecessarily partial bug fix is undesirable.

2) Access exclusive locks are logged not only immediately by the process
that took them: all such locks are also logged in LogStandbySnapshot().
LogStandbySnapshot() is called by the checkpointer, bgwriter and others.
So IsAutoVacuumWorkerProcess() wouldn't help in such cases. I understand
that heap truncation is typically fast, and the probability that some
process will dump all the access exclusive locks while truncation is in
progress is low.
But nevertheless we need to handle that properly. Based on this, I think
we should give up using IsAutoVacuumWorkerProcess() to determine locks
caused by vacuum.

Thinking about that more I found that adding a vacuum mark as an extra
argument to LockAcquireExtended is also wrong. It would still be hard to
determine whether we should log the lock in LogStandbySnapshot(). We'd
have to add that flag to the shared memory table of locks. And that looks
like a kludge.

Therefore, I'd like to propose another approach: introduce a new lock
level. So, AccessExclusiveLock will be split into two:
AccessExclusiveLocalLock and AccessExclusiveLock. Unlike
AccessExclusiveLock, AccessExclusiveLocalLock will not be logged to the
standby, and will be used for heap truncation.

I expect some resistance to my proposal, because fixing this "small bug"
doesn't deserve a new lock level. But the current behavior of
hot_standby_feedback is buggy. From the user's perspective, the
hot_standby_feedback option just doesn't work: it causes the master to
bloat without protecting standby queries from cancellation. We need to
fix that, but I don't see another proper way to do it except adding a new
lock level...

> Patch2 - hsfeedback_noninvalide_xmin.patch
> When walsender is initialized, its xmin in PROCARRAY is set to
> GetOldestXmin() in order to prevent autovacuum running on the master
> from truncating a relation and removing some pages that are required by
> the replica. This might happen if the master's autovacuum and the
> replica's query started simultaneously, and the replica has not yet
> reported its xmin value.

I don't yet understand what the problem is here and why this patch should
solve it. As I understand the idea of this patch, GetOldestXmin() will
initialize the xmin of the walsender with the lowest xmin of the
replication slots. But the xmin of replication slots will be taken into
account anyway by GetSnapshotData() called by vacuum.

> How to test:
> Make async replica, turn on feedback, reduce max_standby_streaming_delay
> and make autovacuum aggressive.
You forgot to mention that one should set up the replication using a
replication slot. Naturally, if a replication slot doesn't exist, then
the master shouldn't keep dead tuples for a disconnected standby, because
the master doesn't know whether the standby will reconnect or is gone
forever.

> autovacuum = on
> autovacuum_max_workers = 1
> autovacuum_naptime = 1s
> autovacuum_vacuum_threshold = 1
> autovacuum_vacuum_cost_delay = 0
>
> Test:
> Here we start the replica and begin a repeatable read transaction on the
> table, then we stop the replica's postmaster to prevent the walreceiver
> worker from starting (on master startup) and sending the master its
> transaction xmin over the hot_standby_feedback message.
>
> MASTER                                    REPLICA
> start
> CREATE TABLE test AS (SELECT id, 1 AS value FROM
>   generate_series(1,1000) id);
>                                           stop
>                                           start

I guess you meant the start of a standby session here, not the
postmaster. Otherwise, I don't understand how table test would reach the
standby.

> BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
> SELECT
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Thank you for your valuable comments. I've made a few adjustments.

The main goal of my changes is to let long read-only transactions run on
the replica if hot_standby_feedback is turned on.

Patch1 - hsfeedback_av_truncate.patch is made to stop
ResolveRecoveryConflictWithLock occurring on the replica after autovacuum
lazily truncates the heap on the master, cutting some pages at the end.
When hot_standby_feedback is on, we know that autovacuum does not remove
anything superfluous that could be needed on the standby, so there is no
need to raise any ResolveRecoveryConflict*.

1) Add to xl_standby_locks and xl_smgr_truncate an isautovacuum flag,
which tells us that autovacuum generated them.

2) When autovacuum decides to trim the table (using lazy_truncate_heap),
it takes AccessExclusiveLock and sends this lock to the replica, but the
replica should ignore AccessExclusiveLock if hot_standby_feedback=on.

3) When the autovacuum truncate WAL message is replayed on a replica, it
takes ExclusiveLock on the table, so as not to interfere with read-only
requests.

We have two cases of resolving ResolveRecoveryConflictWithLock if the
timers (max_standby_streaming_delay and max_standby_archive_delay) have
run out:
- the backend is idle in transaction (waiting for input): in this case
  the backend will be sent SIGTERM;
- the backend's transaction is running a query: in this case the running
  transaction will be aborted.

How to test:
Make an async replica, turn on feedback and reduce
max_standby_streaming_delay. Make autovacuum more aggressive:

autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0

Test1:
Here we put load on the master and simulate a long transaction with
repeated 1-second SEQSCANs on the replica (by calling pg_sleep with a
1-second duration every 6 seconds).
REPLICA:
  hot_standby = on
  max_standby_streaming_delay = 1s
  hot_standby_feedback = on

MASTER:
  start
  CREATE TABLE test AS (SELECT id, 1 AS value FROM generate_series(1,1) id);
  pgbench -T600 -P2 -n --file=master.sql postgres
  (master.sql: update test set value = value;)

REPLICA:
  start
  BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
  SELECT pg_sleep(value) FROM test;
  \watch 6

--- Autovacuum truncates pages at the end ---

Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version lazy_vacuum_truncation passed without fatal
errors. Only sometimes an error occurs, because this test is too
synthetic:

ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.

This is caused by ResolveRecoveryConflictWithSnapshot being raised while
redoing some visibility flags; to avoid this conflict we can run test2 or
increase max_standby_streaming_delay.

Test2:
Here we put load on the master and simulate a long transaction on the
replica (by taking a LOCK on the table).

REPLICA:
  hot_standby = on
  max_standby_streaming_delay = 1s
  hot_standby_feedback = on

MASTER:
  start
  CREATE TABLE test AS (SELECT id, 1 AS value FROM generate_series(1,1) id);
  pgbench -T600 -P2 -n --file=master.sql postgres
  (master.sql: update test set value = value;)

REPLICA:
  start
  BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
  LOCK TABLE test IN ACCESS SHARE MODE;
  select * from test;
  \watch 6

--- Autovacuum truncates pages at the end ---

Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version lazy_vacuum_truncation passed without fatal errors.

Test3:
Here we put load on the master and simulate a long transaction with
repeated 1-second SEQSCANs on the replica (by calling pg_sleep with a
1-second duration every 6 seconds).
REPLICA:
  hot_standby = on
  max_standby_streaming_delay = 4s
  hot_standby_feedback = on

MASTER:
  start
  CREATE TABLE test AS (SELECT id, 200 AS value FROM generate_series(1,1) id);
  pgbench -T600 -P2 -n --file=master.sql postgres
  (master.sql: update test set value = value;)

REPLICA:
  start
  BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
  SELECT pg_sleep(value) FROM test;

--- Autovacuum truncates pages at the end ---

Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version lazy_vacuum_truncation passed without fatal errors.

This way we can make transactions with SEQSCAN, INDEXSCAN or BITMAPSCAN.

Patch2 - hsfeedback_noninvalide_xmin.patch
When walsender is initialized, its xmin in PROCARRAY is set to
GetOldestXmin() in order to prevent autovacuum running on the master from
truncating a relation and removing some pages that are required by the
replica. This might happen if the master's autovacuum and the replica's
query started simultaneously, and the replica has not yet reported its
xmin value.

How to test:
Make async replica, turn on feedback, reduce max_standby_streaming_delay
and aggressive