Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2/19/17 5:27 AM, Robert Haas wrote:
> (1) a multi-batch hash join, (2) a nested loop, and (3) a merge join.
> (2) is easy to implement but will generate a ton of random I/O if the
> table is not resident in RAM. (3) is most suitable for very large
> tables but takes more work to code, and is also likely to be a lot
> slower for small tables than a hash or nestloop-based approach.

As I understand it, #3 is already in place for validate_index(). I think you'd just need a different callback that checks the heap key.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 19, 2017 at 3:52 PM, Pavan Deolasee wrote:
> This particular case of corruption results in a heap tuple getting indexed
> by a wrong key (or to be precise, indexed by its old value). So the only way
> to detect the corruption is to look at each index key and check if it
> matches with the corresponding heap tuple. We could write some kind of self
> join that can use a sequential scan and an index-only scan (David Rowley had
> suggested something of that sort internally here), but we can't guarantee
> index-only scan on a table which is being concurrently updated. So not sure
> even that will catch every possible case.

Oh, so the problem isn't index entries that are altogether missing? I guess I was confused. You can certainly guarantee an index-only scan if you write the validation code in C rather than using SQL.

I think the issue is that if the table is large enough that keeping a TID -> index value mapping in a hash table is impractical, there's not going to be a real efficient strategy for this. Ignoring the question of whether you use the main executor for this or just roll your own code, your options for a large table are (1) a multi-batch hash join, (2) a nested loop, and (3) a merge join. (2) is easy to implement but will generate a ton of random I/O if the table is not resident in RAM. (3) is most suitable for very large tables but takes more work to code, and is also likely to be a lot slower for small tables than a hash or nestloop-based approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
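[Editor's sketch] The merge-join approach (option 3 above) amounts to sorting the index entries by TID and walking the heap in TID order, so both sides are read sequentially. A toy Python model of that check -- not PostgreSQL code, all names and data structures here are hypothetical -- might look like this:

```python
# Toy model: a "heap" maps TID -> indexed key; an "index" is a list of
# (key, tid) entries.  A merge-join-style validation sorts the index
# entries by TID, then walks the heap in TID order comparing keys, so
# neither side needs random I/O or a giant hash table.

def validate_merge(heap, index_entries):
    """Yield (tid, problem) for TIDs whose index entry disagrees with the heap."""
    by_tid = sorted(index_entries, key=lambda e: e[1])  # (key, tid) sorted by tid
    i = 0
    for tid in sorted(heap):
        # advance to the index entry for this TID, if any
        while i < len(by_tid) and by_tid[i][1] < tid:
            i += 1
        if i == len(by_tid) or by_tid[i][1] != tid:
            yield (tid, 'missing')        # no index entry at all
        elif by_tid[i][0] != heap[tid]:
            yield (tid, 'stale key')      # indexed under the old value

heap = {(0, 1): 'a', (0, 2): 'b', (0, 3): 'c'}
index = [('a', (0, 1)), ('b', (0, 2)), ('old', (0, 3))]  # (0,3) indexed by old value
print(list(validate_merge(heap, index)))
# -> [((0, 3), 'stale key')]
```

The catch discussed above, of course, is that a real implementation has to contend with concurrent updates, which this static model ignores entirely.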
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 19, 2017 at 3:43 PM, Robert Haas wrote:
> On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane wrote:
> > Ah, nah, scratch that. If any post-index-build pruning had occurred on a
> > page, the evidence would be gone --- the non-matching older tuples would
> > be removed and what remained would look consistent. But it wouldn't
> > match the index. You might be able to find problems if you were willing
> > to do the expensive check on *every* HOT chain, but that seems none too
> > practical.
>
> If the goal is just to detect tuples that aren't indexed and should
> be, what about scanning each index and building a TIDBitmap of all of
> the index entries, and then scanning the heap for non-HOT tuples and
> probing the TIDBitmap for each one? If you find a heap entry for
> which you didn't find an index entry, go and recheck the index and see
> if one got added concurrently. If not, whine.

This particular case of corruption results in a heap tuple getting indexed by a wrong key (or to be precise, indexed by its old value). So the only way to detect the corruption is to look at each index key and check if it matches with the corresponding heap tuple. We could write some kind of self join that can use a sequential scan and an index-only scan (David Rowley had suggested something of that sort internally here), but we can't guarantee an index-only scan on a table which is being concurrently updated. So I am not sure even that will catch every possible case.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane wrote:
> I wrote:
>> However, you might be able to find it without so much random I/O.
>> I'm envisioning a seqscan over the table, in which you simply look for
>> HOT chains in which the indexed columns aren't all the same. When you
>> find one, you'd have to do a pretty expensive index lookup to confirm
>> whether things are OK or not, but hopefully that'd be rare enough to not
>> be a big performance sink.
>
> Ah, nah, scratch that. If any post-index-build pruning had occurred on a
> page, the evidence would be gone --- the non-matching older tuples would
> be removed and what remained would look consistent. But it wouldn't
> match the index. You might be able to find problems if you were willing
> to do the expensive check on *every* HOT chain, but that seems none too
> practical.

If the goal is just to detect tuples that aren't indexed and should be, what about scanning each index and building a TIDBitmap of all of the index entries, and then scanning the heap for non-HOT tuples and probing the TIDBitmap for each one? If you find a heap entry for which you didn't find an index entry, go and recheck the index and see if one got added concurrently. If not, whine.

One problem is that you'd need to make sure that the TIDBitmap didn't lossify, but that could be prevented either by specifying a very large maxbytes setting or by using something that serves the same function instead of a TIDBitmap per se.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
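[Editor's sketch] The TIDBitmap idea above, in miniature, with a plain Python set standing in for a non-lossy TIDBitmap (a real TIDBitmap would lossify under memory pressure, which is exactly the caveat raised at the end of the mail). Names are hypothetical, not PostgreSQL code:

```python
def find_unindexed(heap_tids, index_tids, recheck_index):
    """Probe each non-HOT heap TID against a set built from the index.

    recheck_index(tid) models going back to the index to see whether an
    entry got added concurrently; only TIDs still missing after the
    recheck are reported.
    """
    tid_set = set(index_tids)       # stand-in for a non-lossy TIDBitmap
    suspects = []
    for tid in heap_tids:
        if tid not in tid_set and not recheck_index(tid):
            suspects.append(tid)    # still missing after recheck: whine
    return suspects

heap = [(0, 1), (0, 2), (0, 3)]
index = [(0, 1), (0, 3)]
# recheck finds nothing added concurrently in this toy example
print(find_unindexed(heap, index, lambda tid: False))
# -> [(0, 2)]
```

Note this only catches missing entries; as Pavan points out in his reply, the CIC bug instead leaves entries present but indexed under a stale key, which a TID-only probe cannot see.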
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Fri, Feb 17, 2017 at 9:31 AM, Tom Lane wrote:
> This seems like it'd be quite a different tool than amcheck, though.
> Also, it would only find broken-HOT-chain corruption, which might be
> a rare enough issue to not deserve a single-purpose tool.

FWIW, my ambition for amcheck is that it will have checks for a large variety of invariants that involve the heap, and related SLRU structures such as MultiXacts. That would probably necessitate code written by other people who are subject matter experts in areas that I am not, though.

--
Peter Geoghegan
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
I wrote:
> However, you might be able to find it without so much random I/O.
> I'm envisioning a seqscan over the table, in which you simply look for
> HOT chains in which the indexed columns aren't all the same. When you
> find one, you'd have to do a pretty expensive index lookup to confirm
> whether things are OK or not, but hopefully that'd be rare enough to not
> be a big performance sink.

Ah, nah, scratch that. If any post-index-build pruning had occurred on a page, the evidence would be gone --- the non-matching older tuples would be removed and what remained would look consistent. But it wouldn't match the index. You might be able to find problems if you were willing to do the expensive check on *every* HOT chain, but that seems none too practical.

regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Peter Geoghegan writes:
> The difference with a test that could detect this variety of
> corruption is that that would need to visit the heap, which tends to
> be much larger than any one index, or even all indexes. That would
> probably need to be random I/O, too. It might be possible to mostly
> not visit the heap, though -- I'm not sure offhand. I'd have to study
> the problem in detail, which I have no time for at the moment.

Unless I misunderstand this problem, the issue is that there might be some broken HOT chains, that is, chains in which not all the heap tuples have the same values for the index columns, and in particular the live entry at the end of the chain doesn't agree with the index. It seems pretty impossible to detect that by examining the index alone.

However, you might be able to find it without so much random I/O. I'm envisioning a seqscan over the table, in which you simply look for HOT chains in which the indexed columns aren't all the same. When you find one, you'd have to do a pretty expensive index lookup to confirm whether things are OK or not, but hopefully that'd be rare enough to not be a big performance sink.

This seems like it'd be quite a different tool than amcheck, though. Also, it would only find broken-HOT-chain corruption, which might be a rare enough issue to not deserve a single-purpose tool.

regards, tom lane
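[Editor's sketch] The seqscan check proposed above can be modeled over a toy heap where each HOT chain is a list of tuple versions; only chains whose indexed column differs across versions would need the expensive index probe. (As the follow-up mail notes, pruning can destroy exactly the evidence this looks for.) A minimal Python illustration, with all names hypothetical:

```python
def suspect_hot_chains(hot_chains, indexed_col):
    """Return chains whose tuple versions disagree on the indexed column.

    hot_chains: list of chains; each chain is a list of dict "tuples",
    oldest version first.  Only the returned chains would need the
    expensive index lookup to confirm whether the index really matches.
    """
    suspects = []
    for chain in hot_chains:
        values = {t[indexed_col] for t in chain}
        if len(values) > 1:   # indexed column changed within the chain
            suspects.append(chain)
    return suspects

chains = [
    [{'id': 1, 'val': 'x'}, {'id': 1, 'val': 'x'}],      # consistent chain
    [{'id': 2, 'val': 'old'}, {'id': 2, 'val': 'new'}],  # broken-HOT-chain candidate
]
print(len(suspect_hot_chains(chains, 'val')))
# -> 1
```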
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Fri, Feb 17, 2017 at 8:23 AM, Keith Fiske wrote:
> It's not the load I'm worried about, it's the locks that are required at
> some point during the rebuild. Doing an index rebuild here and there isn't a
> big deal, but trying to do it for an entire heavily loaded, multi-terabyte
> database is hardly trivial. And I'd say doing a scan is far less invasive
> than actually rebuilding the index since little to no writing is actually
> being done.

Certainly, the checks that amcheck performs that don't require a heavier lock (just a SELECT-style AccessShareLock) were vastly less expensive than reindexing, and of course had only negligible potential to block other operations. And, REINDEX certainly is a foot-gun, lock-wise, which is why I try to avoid it.

The difference with a test that could detect this variety of corruption is that that would need to visit the heap, which tends to be much larger than any one index, or even all indexes. That would probably need to be random I/O, too. It might be possible to mostly not visit the heap, though -- I'm not sure offhand. I'd have to study the problem in detail, which I have no time for at the moment.

--
Peter Geoghegan
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Keith Fiske wrote:
> I can understand if it's simply not possible, but if it is, I think in
> any case of data corruption, having some means to check for it to be sure
> you're in the clear would be useful.

Maybe it is possible. I just didn't try, since it didn't seem very useful.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Fri, Feb 17, 2017 at 11:12 AM, Alvaro Herrera wrote:
> Keith Fiske wrote:
> > Was just curious if anyone was able to come up with any sort of method to
> > test whether an index was corrupted by this bug, other than just waiting
> > for bad query results? We've used concurrent index rebuilding quite
> > extensively over the years to remove bloat from busy systems, but
> > reindexing the entire database "just in case" is unrealistic in many of our
> > cases.
>
> As stated, if the CREATE INDEX operates on columns that are previously
> already indexed (which is normally the case when you rebuild because of
> bloat) then there is no chance of index corruption.
>
> Scanning indexes+tables is just as load-intensive as rebuilding the
> indexes anyway. You don't save any work. I suppose it can be a problem
> if you have an index big enough that it doesn't fit on your remaining
> free space (but in that case you have a pre-existing problem which you
> should solve anyway).
>
> --
> Álvaro Herrera https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

It's not the load I'm worried about, it's the locks that are required at some point during the rebuild. Doing an index rebuild here and there isn't a big deal, but trying to do it for an entire heavily loaded, multi-terabyte database is hardly trivial. And I'd say doing a scan is far less invasive than actually rebuilding the index since little to no writing is actually being done.

I can understand if it's simply not possible, but if it is, I think in any case of data corruption, having some means to check for it to be sure you're in the clear would be useful.

Keith
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Keith Fiske wrote:
> Was just curious if anyone was able to come up with any sort of method to
> test whether an index was corrupted by this bug, other than just waiting
> for bad query results? We've used concurrent index rebuilding quite
> extensively over the years to remove bloat from busy systems, but
> reindexing the entire database "just in case" is unrealistic in many of our
> cases.

As stated, if the CREATE INDEX operates on columns that are previously already indexed (which is normally the case when you rebuild because of bloat) then there is no chance of index corruption.

Scanning indexes+tables is just as load-intensive as rebuilding the indexes anyway. You don't save any work. I suppose it can be a problem if you have an index big enough that it doesn't fit on your remaining free space (but in that case you have a pre-existing problem which you should solve anyway).

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 10:17 PM, Amit Kapila wrote:
> On Mon, Feb 6, 2017 at 10:28 PM, Tom Lane wrote:
> > Amit Kapila writes:
> >> Hmm. Consider that the first time relcache invalidation occurs while
> >> computing id_attrs, so now the retry logic will compute the correct
> >> set of attrs (considering two indexes, if we take the example given by
> >> Alvaro above.). However, the attrs computed for hot_* and key_* will
> >> be using only one index, so this will lead to a situation where two of
> >> the attrs (hot_attrs and key_attrs) are computed with one index and
> >> id_attrs is computed with two indexes. I think this can lead to
> >> Assertion in HeapSatisfiesHOTandKeyUpdate().
> >
> > That seems like something we'd be best off to fix by changing the
> > assertion.
>
> The above-mentioned problem doesn't exist in your version of the patch
> (which is now committed) because the index attrs are cached after
> invalidation, and I have mentioned the same in my yesterday's followup
> mail. The guarantee of safety is that we don't handle invalidation
> between two consecutive calls to RelationGetIndexAttrBitmap() in
> heap_update, but if we do handle it for any reason which is not
> apparent to me, then I think there is some chance of hitting the
> assertion.
>
> > I doubt it's really going to be practical to guarantee that
> > those bitmapsets are always 100% consistent throughout a transaction.
>
> Agreed. As the code stands, I think it is only expected to be
> guaranteed across three consecutive calls to
> RelationGetIndexAttrBitmap() in heap_update. Now, if the assertion in
> HeapSatisfiesHOTandKeyUpdate() is useless and we can remove it, then
> probably we don't need a guarantee to be consistent across three
> consecutive calls as well.
>
> >> Okay, please find attached patch which is an extension of Tom's and
> >> Andres's patch and I think it fixes the above problem noted by me.
> >
> > I don't like this patch at all. It only fixes the above issue if the
> > relcache inval arrives while RelationGetIndexAttrBitmap is executing.
> > If the inval arrives between two such calls, you still have the problem
> > of the second one delivering a bitmapset that might be inconsistent
> > with the first one.
>
> I think it won't happen between two consecutive calls to
> RelationGetIndexAttrBitmap in heap_update. It can happen during a
> multi-row update, but that should be fine. I think this version of the
> patch only defers the creation of fresh bitmapsets wherever possible.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

Was just curious if anyone was able to come up with any sort of method to test whether an index was corrupted by this bug, other than just waiting for bad query results? We've used concurrent index rebuilding quite extensively over the years to remove bloat from busy systems, but reindexing the entire database "just in case" is unrealistic in many of our cases.
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 10:28 PM, Tom Lane wrote:
> Amit Kapila writes:
>> Hmm. Consider that the first time relcache invalidation occurs while
>> computing id_attrs, so now the retry logic will compute the correct
>> set of attrs (considering two indexes, if we take the example given by
>> Alvaro above.). However, the attrs computed for hot_* and key_* will
>> be using only one index, so this will lead to a situation where two of
>> the attrs (hot_attrs and key_attrs) are computed with one index and
>> id_attrs is computed with two indexes. I think this can lead to
>> Assertion in HeapSatisfiesHOTandKeyUpdate().
>
> That seems like something we'd be best off to fix by changing the
> assertion.

The above-mentioned problem doesn't exist in your version of the patch (which is now committed) because the index attrs are cached after invalidation, and I have mentioned the same in my yesterday's followup mail. The guarantee of safety is that we don't handle invalidation between two consecutive calls to RelationGetIndexAttrBitmap() in heap_update, but if we do handle it for any reason which is not apparent to me, then I think there is some chance of hitting the assertion.

> I doubt it's really going to be practical to guarantee that
> those bitmapsets are always 100% consistent throughout a transaction.

Agreed. As the code stands, I think it is only expected to be guaranteed across three consecutive calls to RelationGetIndexAttrBitmap() in heap_update. Now, if the assertion in HeapSatisfiesHOTandKeyUpdate() is useless and we can remove it, then probably we don't need a guarantee to be consistent across three consecutive calls as well.

>> Okay, please find attached patch which is an extension of Tom's and
>> Andres's patch and I think it fixes the above problem noted by me.
>
> I don't like this patch at all. It only fixes the above issue if the
> relcache inval arrives while RelationGetIndexAttrBitmap is executing.
> If the inval arrives between two such calls, you still have the problem
> of the second one delivering a bitmapset that might be inconsistent
> with the first one.

I think it won't happen between two consecutive calls to RelationGetIndexAttrBitmap in heap_update. It can happen during a multi-row update, but that should be fine. I think this version of the patch only defers the creation of fresh bitmapsets wherever possible.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 11:54 PM, Tom Lane wrote:
> After some discussion among the release team, we've concluded that the
> best thing to do is to push Pavan's/my patch into today's releases.
> This does not close the matter by any means: we should continue to
> study whether there are related bugs or whether there's a more principled
> way of fixing this bug. But that patch clearly makes things better,
> and we shouldn't let worries about whether there are more bugs stop
> us from providing some kind of fix to users.
>
> I've made the push, and barring negative reports from the buildfarm,
> it will be in today's releases.

Thank you for taking care of it. The buildfarm is looking green so far.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
After some discussion among the release team, we've concluded that the best thing to do is to push Pavan's/my patch into today's releases. This does not close the matter by any means: we should continue to study whether there are related bugs or whether there's a more principled way of fixing this bug. But that patch clearly makes things better, and we shouldn't let worries about whether there are more bugs stop us from providing some kind of fix to users.

I've made the push, and barring negative reports from the buildfarm, it will be in today's releases.

regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 9:42 PM, Pavan Deolasee wrote:
> On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan wrote:
>> On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote:
>> > I don't think this kind of black-and-white thinking is very helpful.
>> > Obviously, data corruption is bad. However, this bug has (from what
>> > one can tell from this thread) been with us for over a decade; it must
>> > necessarily be either low-probability or low-severity, or somebody
>> > would've found it and fixed it before now. Indeed, the discovery of
>> > this bug was driven by new feature development, not a user report. It
>> > seems pretty clear that if we try to patch this and get it wrong, the
>> > effects of our mistake could easily be a lot more serious than the
>> > original bug.
>>
>> +1. The fact that it wasn't driven by a user report convinces me that
>> this is the way to go.
>
> I'm not sure that the bug's not being reported by a user makes it
> any less critical. As Tomas pointed out down thread, the nature of the bug is
> such that the users may not discover it very easily, but that doesn't mean
> it couldn't be affecting them all the time. We can now correlate many past
> reports of index corruption to this bug, but we don't have evidence to prove
> that. Lack of any good tool or built-in checks probably makes it even
> harder.
>
> The reason why I discovered this with WARM is because it now has built-in
> recheck logic, which was discarding some tuples returned by the index scan.
> It took me lots of code review and inspection to finally conclude that this
> must be an existing bug. Even then, for lack of any utility, I could not
> detect this easily with master. That doesn't mean I wasn't able to
> reproduce it, but detection was a problem.
>
> If you look at the reproduction setup, one in every 10, if not 5, CREATE
> INDEX CONCURRENTLY ends up creating a corrupt index. That's a good 10%
> probability. And this is on a low-end laptop, with just 5-10 concurrent
> clients running. Probability of hitting the problem will be much higher on a
> bigger machine, with many users (on a decent AWS machine, I would find every
> third CIC to be corrupt). Moreover, the setup is not doing anything
> extraordinary. Just concurrent updates which change between HOT to non-HOT
> and a CIC.

Not that I am advocating that we should do a release just for this; having a fix we believe in is certainly as important a factor. But the idea that the bug's having been around a long time means it is less of an issue does seem wrong. We've certainly seen plenty of cases over the years where bugs have existed in seldom-used code paths, only to be exposed by new features or other code changes over time. In general, we should be less worried about the age of a bug vs our expectations that users are likely to hit that bug now, which does seem high based on the above numbers.

In the meantime, it's certainly worth warning users, providing help on how to determine if this is a likely problem for them, and possibly rolling a patch ahead of upstream in cases where that's feasible.

Robert Treat
play: xzilla.net
work: omniti.com
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Alvaro Herrera writes:
> Tom Lane wrote:
>> Better to fix the callers so that they don't have the assumption you
>> refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap
>> so that it returns all the sets needed by a given calling module at
>> once, which would allow us to guarantee they're consistent.
>
> Note that my "interesting attrs" patch does away with these independent
> bitmaps (which was last posted by Pavan as part of his WARM set). I
> think we should fix just this bug now, and for the future look at that
> other approach.

BTW, if there is a risk of the assertion failure that Amit posits, it seems like it should have happened in the tests that Pavan was doing originally. I'd sort of like to see a demonstration that it can actually happen before we spend any great amount of time fixing it.

regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Tom Lane wrote:
> Better to fix the callers so that they don't have the assumption you
> refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap
> so that it returns all the sets needed by a given calling module at
> once, which would allow us to guarantee they're consistent.

Note that my "interesting attrs" patch does away with these independent bitmaps (it was last posted by Pavan as part of his WARM set). I think we should fix just this bug now, and for the future look at that other approach.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Amit Kapila writes:
> Hmm. Consider that the first time relcache invalidation occurs while
> computing id_attrs, so now the retry logic will compute the correct
> set of attrs (considering two indexes, if we take the example given by
> Alvaro above.). However, the attrs computed for hot_* and key_* will
> be using only one index, so this will lead to a situation where two of
> the attrs (hot_attrs and key_attrs) are computed with one index and
> id_attrs is computed with two indexes. I think this can lead to
> Assertion in HeapSatisfiesHOTandKeyUpdate().

That seems like something we'd be best off to fix by changing the assertion. I doubt it's really going to be practical to guarantee that those bitmapsets are always 100% consistent throughout a transaction.

> Okay, please find attached patch which is an extension of Tom's and
> Andres's patch and I think it fixes the above problem noted by me.

I don't like this patch at all. It only fixes the above issue if the relcache inval arrives while RelationGetIndexAttrBitmap is executing. If the inval arrives between two such calls, you still have the problem of the second one delivering a bitmapset that might be inconsistent with the first one.

To go in this direction, I think we would have to hot-wire RelationGetIndexAttrBitmap so that once any bitmapset has been requested in a transaction, we intentionally ignore all index set updates for the remainder of the transaction. And that would very likely create more problems than it solves (consider locally initiated DDL within the transaction, for example).

Better to fix the callers so that they don't have the assumption you refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap so that it returns all the sets needed by a given calling module at once, which would allow us to guarantee they're consistent.
regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Andres Freund wrote:
> To show what I mean here's an *unpolished* and *barely tested* patch
> implementing the first of my suggestions.
>
> Alvaro, Pavan, I think this should address the issue as well?

Hmm, interesting idea. Maybe a patch along these lines is a good way to make the index-list cache less brittle going forward. However, I'm against putting it in the stable branches -- and we should definitely not stall an immediate fix in order to get this patch ready. IMO we should get Tom's patch in tree for all branches very soon, and then after the release you can finalize this one and put it in master.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Tom Lane wrote:
> Pavan Deolasee writes:
> > 2. In the second patch, we tried to recompute attribute lists if a relcache
> > flush happens in between and the index list is invalidated. We've seen problems
> > with that, especially it getting into an infinite loop with
> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
> > flushes keep happening.
>
> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
> happen whenever it possibly could. The way to deal with that without
> looping is to test whether the index set *actually* changed, not whether
> it just might have changed.

Ah, that's a nice and simple way to fix that patch! I see that Pavan confirmed that it fixes the tests we saw failing too. It seems to me that we should go with this one, and it looks unlikely that this causes other problems.

BTW, while neither Pavan nor I sent this patch to -hackers, this implementation is pretty much the same thing we did. Pavan deserves credit as co-author in this commit.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
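[Editor's sketch] Tom's "test whether the index set *actually* changed" rule can be illustrated as a loop structure. This is a hypothetical Python model, not the real relcache code in relcache.c: under CLOBBER_CACHE_ALWAYS an invalidation fires on essentially every catalog access, so a loop conditioned on "a flush happened" never terminates, while a loop conditioned on "the recomputed index list differs" does:

```python
def compute_attr_bitmaps(get_index_list, build_bitmaps):
    """Recompute until the index list comes out the same twice in a row.

    get_index_list() models reading the (possibly invalidated) relcache
    index list; build_bitmaps() models computing the attribute bitmaps
    from it.  Looping on actual change, not on possible change, is what
    makes this terminate even when invalidations arrive constantly.
    """
    index_list = get_index_list()
    while True:
        bitmaps = build_bitmaps(index_list)
        new_list = get_index_list()      # may differ if an inval arrived
        if new_list == index_list:       # actually unchanged -> done
            return bitmaps
        index_list = new_list            # really changed: retry

# Toy driver: the index list "changes" once, then stabilizes.
lists = iter([[1], [1, 2], [1, 2], [1, 2]])
result = compute_attr_bitmaps(lambda: next(lists),
                              lambda idxs: {'hot_attrs': set(idxs)})
print(result)
# -> {'hot_attrs': {1, 2}}
```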
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2/5/17 at 21:57, Tomas Vondra wrote:
> +1 to not rushing fixes into releases. While I think we now finally
> understand the mechanics of this bug, the fact that we came up with
> three different fixes in this thread, only to discover issues with each
> of them, warrants some caution.

I also agree with Robert on not rushing the patch. My point was whether we had to rush the release.

> OTOH I disagree with the notion that bugs that are not driven by user
> reports are somehow less severe. Some data corruption bugs cause quite
> visible breakage - segfaults, immediate crashes, etc. Those are pretty
> clear bugs, and are reported by users.
>
> Other data corruption bugs are much more subtle - for example this bug
> may lead to incorrect results to some queries, failing to detect values
> violating UNIQUE constraints, issues with foreign keys, etc.

Just yesterday, after sending that mail, I was recalling a logical replication setup we did on a customer's 9.3 server, which brought up data inconsistencies in the primary key of one of the tables: the table had duplicate values. As Tomas says, it's subtle and hard to find unless you constantly run index checks (query a sample of the data from the heap and from the index and check that they match). In our case, the customer was not aware of the dups until we found them.

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
> On 6 Feb 2017, at 4:57, Peter Geoghegan wrote: > > I meant that I find the fact that there were no user reports in all > these years to be a good reason to not proceed for now in this > instance. Well, we had some strange situations with indexes (see below for an example) but I couldn’t even think that CIC was the problem. And it is really difficult to diagnose problems of this kind, because 1. you may see the problem several days after the last major change in the database, and 2. you don’t even know how to start investigating the problem.

xdb314g/maildb M # show enable_indexscan ;
 enable_indexscan
------------------
 off
(1 row)

Time: 0.120 ms
xdb314g/maildb M # select * from mail.folders where uid=448300241 and fid=1;
-[ RECORD 1 ]---+----------
uid             | 448300241
fid             | 1
<...>

Time: 30398.637 ms
xdb314g/maildb M # set enable_indexscan to true;
SET
Time: 0.111 ms
xdb314g/maildb M # select * from mail.folders where uid=448300241 and fid=1;
(0 rows)

Time: 0.312 ms
xdb314g/maildb M #

The row of course hasn’t been deleted. -- May the force be with you… https://simply.name
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 9:47 AM, Pavan Deolasee wrote: > > > On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila wrote: >> >> >> >> Hmm. Consider that the first time relcache invalidation occurs while >> computing id_attrs, so now the retry logic will compute the correct >> set of attrs (considering two indexes, if we take the example given by >> Alvaro above). > > > I don't quite get that. Since rd_idattr must be already cached at that point > and we don't expect to process a relcache flush between successive calls to > RelationGetIndexAttrBitmap(), we should return a consistent copy of > rd_idattr. > Right, I was mistaken. However, I am not sure if there is a need to perform retry logic in RelationGetIndexAttrBitmap(). It is safe to do that at transaction end. I feel even though Tom's fix looks reliable with respect to the problem reported in this thread, we should take some time and evaluate different proposals and see which one is the best way to move forward. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee wrote: > >> > I like this approach. I'll run the patch on a build with > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. > While it looks certain that the fix will miss this point release, FWIW I ran this patch with CLOBBER_CACHE_ALWAYS and it seems to be working as expected. Hard to run all the tests, but a small subset of regression and test_decoding seems ok. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila wrote: > > > Hmm. Consider that the first time relcache invalidation occurs while > computing id_attrs, so now the retry logic will compute the correct > set of attrs (considering two indexes, if we take the example given by > Alvaro above). I don't quite get that. Since rd_idattr must be already cached at that point and we don't expect to process a relcache flush between successive calls to RelationGetIndexAttrBitmap(), we should return a consistent copy of rd_idattr. But maybe I am missing something. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 8:35 AM, Pavan Deolasee wrote: > > > On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila wrote: >> >> On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee >> wrote: >> > >> > >> > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote: >> >> >> >> >> >> >> >> > 2. In the second patch, we tried to recompute attribute lists if a >> >> > relcache >> >> > flush happens in between and index list is invalidated. We've seen >> >> > problems >> >> > with that, especially it getting into an infinite loop with >> >> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently >> >> > relcache >> >> > flushes keep happening. >> >> >> >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache >> >> flush >> >> happen whenever it possibly could. >> > >> > >> > Ah, ok. That explains why the retry logic as originally proposed was >> > causing >> > an infinite loop. >> > >> >> >> >> The way to deal with that without >> >> looping is to test whether the index set *actually* changed, not >> >> whether >> >> it just might have changed. >> > >> > >> > I like this approach. I'll run the patch on a build with >> > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the >> > fact that retry logic is now limited to only when index set changes, >> > which >> > fits in the situation we're dealing with. >> > >> >> Don't you think it also has the same problem as mentioned by me in >> Alvaro's patch and you also agreed on it? > > > No, I don't think so. There can't be any more relcache flushes occurring > between the first RelationGetIndexAttrBitmap() and the subsequent > RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed > approach was that we were not caching, yet returning stale information. That > implied the next call will again recompute a possibly different value. The > retry logic is trying to close a small yet important race condition > where index set invalidation happens, but we cache stale information. > Hmm. 
Consider that the first time relcache invalidation occurs while computing id_attrs, so now the retry logic will compute the correct set of attrs (considering two indexes, if we take the example given by Alvaro above). However, the attrs computed for hot_* and key_* will be using only one index, so this will lead to a situation where two of the attrs (hot_attrs and key_attrs) are computed with one index and id_attrs is computed with two indexes. I think this can lead to an Assertion failure in HeapSatisfiesHOTandKeyUpdate(). >> >> Do you see any need to fix it locally in >> RelationGetIndexAttrBitmap(), why can't we clear at transaction end? >> > > We're trying to fix it in RelationGetIndexAttrBitmap() because that's where > the race exists. I'm not saying there aren't other and better ways of > handling it, but we don't know (I haven't studied Andres's proposal yet). > Okay, please find the attached patch, which is an extension of Tom's and Andres's patches, and I think it fixes the above problem noted by me. Basically, invalidate the bitmaps at transaction end rather than in RelationGetIndexAttrBitmap(). I think it is okay for RelationGetIndexAttrBitmap() to use stale information until transaction end because all the updates in the current running transaction will be consumed by the second phase of CIC. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com invalidate_indexattr_v6.patch Description: Binary data
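The hazard Amit describes — hot_attrs computed against one index list and id_attrs against another within a single update — and the transaction-end deferral he proposes can be sketched as a toy model. This is illustrative Python, not PostgreSQL code; `IndexInfoCache` and its methods are invented for the sketch:

```python
class IndexInfoCache:
    """Toy model of cached per-relation attr sets.

    With defer_invalidation=False, an invalidation takes effect
    immediately, so attr sets computed before and after it within one
    update come from different index lists (the inconsistency Amit
    warns about).  With defer_invalidation=True, the cached snapshot is
    kept until "transaction end", so one update sees a consistent, if
    stale, view."""

    def __init__(self, defer_invalidation):
        self.defer = defer_invalidation
        self.index_list = ["idx_pkey"]   # committed index set
        self.cached = None               # cached attr snapshot
        self.pending_inval = False

    def invalidate(self, new_index_list):
        self.index_list = new_index_list
        if self.defer:
            self.pending_inval = True    # handled at transaction end
        else:
            self.cached = None           # takes effect immediately

    def get_attrs(self):
        if self.cached is None:
            self.cached = tuple(self.index_list)
        return self.cached

    def transaction_end(self):
        if self.pending_inval:
            self.cached = None
            self.pending_inval = False

def one_update(cache):
    hot_attrs = cache.get_attrs()
    # CIC commits a new index between the two lookups:
    cache.invalidate(["idx_pkey", "idx_cic"])
    id_attrs = cache.get_attrs()
    return hot_attrs == id_attrs         # the update needs these consistent

print(one_update(IndexInfoCache(defer_invalidation=False)))  # False
print(one_update(IndexInfoCache(defer_invalidation=True)))   # True
```

The stale view is tolerable for the reason Amit gives: updates made by the still-running transaction are picked up by the second phase of CIC anyway.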
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2017-02-05 22:34:34 -0500, Tom Lane wrote: > Pavan Deolasee writes: > The point is that there's a nontrivial chance of a hasty fix introducing > worse problems than we fix. > > Given the lack of consensus about exactly how to fix this, I'm feeling > like it's a good idea if whatever we come up with gets some time to age > awhile in git before we ship it. Right. And I'm not even convinced that we really know the extent of the bug; it seems fairly plausible that there are further incidences of this. There's also the issue that the mechanics in the older backbranches are different again, because of SnapshotNow. >> I'm a bit surprised by this position. What do we tell our users now that >> we know this bug exists? That we're scheduling a bugfix for the next point release. I don't think we can truthfully claim that there are no known corruption bugs in any of the releases in the last few years. Greetings, Andres Freund
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Pavan Deolasee writes: > On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund wrote: >> +1. I don't think we serve our users by putting out a nontrivial bugfix >> hastily. Nor do I think we serve them in this instance by delaying the >> release to cover this fix; there are enough other fixes in the release >> imo. > I'm a bit surprised by this position. The point is that there's a nontrivial chance of a hasty fix introducing worse problems than we fix. Given the lack of consensus about exactly how to fix this, I'm feeling like it's a good idea if whatever we come up with gets some time to age awhile in git before we ship it. Obviously, 2ndQ or EDB or any other distributor can choose to ship a patch in their own builds if they're sufficiently comfortable with the particular patch. That doesn't translate to there having to be a fix in the community's wraps this week. regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2017-02-05 21:49:57 -0500, Tom Lane wrote: > Andres Freund writes: > > I've not yet read the full thread, but I'm a bit confused so far. We > > obviously can get changing information about indexes here, but isn't > > that something we have to deal with anyway? If we guarantee that we > > don't lose knowledge that there's a pending invalidation, why do we > > have to retry? > > We don't *have to* retry; the core of the fix is to not enter stale data > into the cache after we've already received an invalidation signal. Right, the bug is that we do so without remembering that fact. > The current caller can survive with stale data; or if that's not true, > C.I.C. has got more fundamental problems than posited here. Indeed. > But it seems like a generally bad idea for relcache to return data > that it knows (or at least has reason to believe) is stale. I'm not convinced by this - this kind of staleness can only occur if changes happen during the execution of the cache building. And the information returned can be outdated by pretty much the moment you return anyway. It's also how a number of the current caches essentially already work. > Also, even if you're okay with return-stale-data-but-don't-cache-it, > I have little faith that invalidate_indexattr_v3.patch successfully > implements that. I sent a prototype clarifying what I mean. It's not really like invalidate_indexattr_v3.patch. Btw, it seems RelationGetIndexList() avoids a similar issue only due to either luck or a lot of undocumented assumptions. The only reason that setting relation->rd_indexlist / rd_indexvalid at the end doesn't cause a similar issue seems to be that no invalidations appear to be processed after the systable_beginscan(). And that in turn seems to not process relcache invals after RegisterSnapshot(GetCatalogSnapshot(relid)). Phew, that seems a bit fragile. 
Greetings, Andres Freund
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund wrote: > On 2017-02-05 12:51:09 -0500, Tom Lane wrote: > > Michael Paquier writes: > > > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote: > > >> I agree with Pavan - a release with a known important bug is not a good idea. > > > > > This bug has been around for some time, so I would recommend taking > > > the time necessary to make the best fix possible, even if it means > > > waiting for the next round of minor releases. > > > > I think the way to think about this sort of thing is, if we had found > > this bug when a release wasn't imminent, would we consider it bad enough > > to justify an unscheduled release cycle? I have to think the answer for > > this one is "probably not". > > +1. I don't think we serve our users by putting out a nontrivial bugfix > hastily. Nor do I think we serve them in this instance by delaying the > release to cover this fix; there are enough other fixes in the release > imo. > > I'm a bit surprised by this position. What do we tell our users now that we know this bug exists? They can't fully trust CIC and they don't have any mechanism to confirm if the newly built index is corruption free or not. I'm not suggesting that we should hastily refactor the code, but at the very least some sort of band-aid fix which helps the situation, yet has minimal side-effects, is warranted. I believe the proposed retry patch does exactly that. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila wrote: > On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee > wrote: > > > > > > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote: > >> > >> > >> > >> > 2. In the second patch, we tried to recompute attribute lists if a > >> > relcache > >> > flush happens in between and index list is invalidated. We've seen > >> > problems > >> > with that, especially it getting into an infinite loop with > >> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently > relcache > >> > flushes keep happening. > >> > >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache > flush > >> happen whenever it possibly could. > > > > > > Ah, ok. That explains why the retry logic as originally proposed was > causing > > an infinite loop. > > > >> > >> The way to deal with that without > >> looping is to test whether the index set *actually* changed, not whether > >> it just might have changed. > > > > > > I like this approach. I'll run the patch on a build with > > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the > > fact that retry logic is now limited to only when index set changes, > which > > fits in the situation we're dealing with. > > > > Don't you think it also has the same problem as mentioned by me in > Alvaro's patch and you also agreed on it? No, I don't think so. There can't be any more relcache flushes occurring between the first RelationGetIndexAttrBitmap() and the subsequent RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed approach was that we were not caching, yet returning stale information. That implied the next call will again recompute a possibly different value. The retry logic is trying to close a small yet important race condition where index set invalidation happens, but we cache stale information. > Do you see any need to fix > it locally in > RelationGetIndexAttrBitmap(), why can't we clear at transaction end? 
We're trying to fix it in RelationGetIndexAttrBitmap() because that's where the race exists. I'm not saying there aren't other and better ways of handling it, but we don't know (I haven't studied Andres's proposal yet). Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Andres Freund writes: > I've not yet read the full thread, but I'm a bit confused so far. We > obviously can get changing information about indexes here, but isn't > that something we have to deal with anyway? If we guarantee that we > don't lose knowledge that there's a pending invalidation, why do we > have to retry? We don't *have to* retry; the core of the fix is to not enter stale data into the cache after we've already received an invalidation signal. The current caller can survive with stale data; or if that's not true, C.I.C. has got more fundamental problems than posited here. But it seems like a generally bad idea for relcache to return data that it knows (or at least has reason to believe) is stale. Also, even if you're okay with return-stale-data-but-don't-cache-it, I have little faith that invalidate_indexattr_v3.patch successfully implements that. Consider the sequence: inval received during RelationGetIndexAttrBitmap, we clear rd_indexvalid, something asks for the relation OID list, we recalculate that and set rd_indexvalid, then we reach the end of RelationGetIndexAttrBitmap and see that rd_indexvalid is set. If that happened, we'd cache the bitmaps whether or not they were really up to date. Now admittedly, it's a bit hard to think of a reason why anything would ask for the index list of anything but a system catalog in that sequence, so as long as you assume that we don't allow C.I.C. (or more realistically, REINDEX CONCURRENTLY) on system catalogs, this failure mode is unreachable. But I much prefer having a positive verification that the index list is still what it was when we started. regards, tom lane
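Tom's scheme — retry only when the index set *actually* changed, rather than on every invalidation signal — can be sketched as follows. This is illustrative Python, not the relcache code; the two callbacks stand in for RelationGetIndexList() and the bitmap computation:

```python
def get_index_attr_bitmap(get_index_list, compute_bitmaps, max_retries=10):
    """Recompute until the index list observed before and after the
    (interruptible) bitmap computation is identical.  Unlike retrying on
    every invalidation, this terminates under CLOBBER_CACHE_ALWAYS,
    where spurious flushes never stop arriving but the index *set*
    itself is stable."""
    for _ in range(max_retries):
        before = get_index_list()
        bitmaps = compute_bitmaps(before)
        after = get_index_list()
        if before == after:     # the set really is stable: safe to cache
            return bitmaps
        # otherwise a CIC really did add/drop an index: recompute
    raise RuntimeError("index set kept changing")

# Simulate CIC committing a new index between the first and second look:
observed = iter([["i1"], ["i1", "i2"], ["i1", "i2"], ["i1", "i2"]])
result = get_index_attr_bitmap(lambda: next(observed),
                               lambda idx: {"attrs": tuple(idx)})
print(result)  # {'attrs': ('i1', 'i2')}
```

The design point is that the loop condition depends on a real state change, so a stream of no-op cache flushes (the CLOBBER_CACHE_ALWAYS failure mode of the original retry patch) cannot make it spin forever.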
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2017-02-06 08:08:01 +0530, Amit Kapila wrote: > I don't see in your patch that you are setting rd_bitmapsvalid to 0. IIRC a plain relcache rebuild should do that (note there's also no place that directly resets rd_indexattr). > Also, I think this suffers from the same problem as the patch proposed > by Alvaro which is that different attrs (hot_attrs, key_attrs and > id_attrs) will get computed based on different index lists which can > cause a problem as mentioned in my mail up thread. I'm not convinced that that's a legitimate concern. If that's an issue with CIC, then we have a lot bigger problems, because relcache entries and such will be inconsistent anyway if you have non-exclusive locks while changing catalog data. In the situations where that happens it better be harmless (otherwise CIC is broken), and the next access will recompute them. > I am not sure but I think your solution can be > extended to make it somewhat similar to what we do with rd_indexvalid > (basically, we should set rd_bitmapsvalid to 2 (bitmap is temporarily > forced) at rel cache invalidation and then clean it up transaction > end). I can try something on those lines. Not sure I understand what you mean with "clean it up at transaction end"? Greetings, Andres Freund
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 6:42 PM, Pavan Deolasee wrote: > I'm not sure that just because the bug wasn't reported by a user makes it > any less critical. As Tomas pointed out down-thread, the nature of the bug is > such that the users may not discover it very easily, but that doesn't mean > it couldn't be affecting them all the time. We can now correlate many past > reports of index corruption to this bug, but we don't have evidence to prove > that. Lack of any good tool or built-in checks probably makes it even > harder. I think that we need to make an automated checker tool a requirement for very complicated development projects in the future. We're behind here. -- Peter Geoghegan
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee wrote: > > > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote: >> >> >> >> > 2. In the second patch, we tried to recompute attribute lists if a >> > relcache >> > flush happens in between and index list is invalidated. We've seen >> > problems >> > with that, especially it getting into an infinite loop with >> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache >> > flushes keep happening. >> >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush >> happen whenever it possibly could. > > > Ah, ok. That explains why the retry logic as originally proposed was causing > an infinite loop. > >> >> The way to deal with that without >> looping is to test whether the index set *actually* changed, not whether >> it just might have changed. > > > I like this approach. I'll run the patch on a build with > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the > fact that retry logic is now limited to only when index set changes, which > fits in the situation we're dealing with. > Don't you think it also has the same problem as mentioned by me in Alvaro's patch and you also agreed on it? Do you see any need to fix it locally in RelationGetIndexAttrBitmap(), why can't we clear at transaction end? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan wrote: > On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote: > > I don't think this kind of black-and-white thinking is very helpful. > > Obviously, data corruption is bad. However, this bug has (from what > > one can tell from this thread) been with us for over a decade; it must > > necessarily be either low-probability or low-severity, or somebody > > would've found it and fixed it before now. Indeed, the discovery of > > this bug was driven by new feature development, not a user report. It > > seems pretty clear that if we try to patch this and get it wrong, the > > effects of our mistake could easily be a lot more serious than the > > original bug. > > +1. The fact that it wasn't driven by a user report convinces me that > this is the way to go. > > I'm not sure that just because the bug wasn't reported by a user makes it any less critical. As Tomas pointed out down-thread, the nature of the bug is such that the users may not discover it very easily, but that doesn't mean it couldn't be affecting them all the time. We can now correlate many past reports of index corruption to this bug, but we don't have evidence to prove that. Lack of any good tool or built-in checks probably makes it even harder. The reason why I discovered this with WARM is because it now has a built-in recheck logic, which was discarding some tuples returned by the index scan. It took me lots of code review and inspection to finally conclude that this must be an existing bug. Even then, for lack of any utility, I could not detect this easily with master. That doesn't mean I wasn't able to reproduce it, but detection was a problem. If you look at the reproduction setup, one in every 10, if not 5, CREATE INDEX CONCURRENTLY ends up creating a corrupt index. That's a good 10% probability. And this is on a low end laptop, with just 5-10 concurrent clients running. 
Probability of hitting the problem will be much higher on a bigger machine, with many users (on a decent AWS machine, I would find every third CIC to be corrupt). Moreover the setup is not doing anything extraordinary. Just concurrent updates which change between HOT to non-HOT and a CIC. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 6:27 AM, Andres Freund wrote: > Hi, > > On 2017-02-05 16:37:33 -0800, Andres Freund wrote: >> > RelationGetIndexList(Relation relation) >> > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat >> > * we can include system attributes (e.g., OID) in the bitmap >> > representation. >> > * >> > * Caller had better hold at least RowExclusiveLock on the target relation >> > - * to ensure that it has a stable set of indexes. This also makes it safe >> > - * (deadlock-free) for us to take locks on the relation's indexes. >> > + * to ensure it is safe (deadlock-free) for us to take locks on the >> > relation's >> > + * indexes. Note that since the introduction of CREATE INDEX >> > CONCURRENTLY, >> > + * that lock level doesn't guarantee a stable set of indexes, so we have >> > to >> > + * be prepared to retry here in case of a change in the set of indexes. >> >> I've not yet read the full thread, but I'm a bit confused so far. We >> obviously can get changing information about indexes here, but isn't >> that something we have to deal with anyway? If we guarantee that we >> don't lose knowledge that there's a pending invalidation, why do we >> have to retry? Pretty much by definition the knowledge in a relcache >> entry can be outdated as soon as returned unless locking prevents that >> from being possible - which is not the case here. >> >> ISTM it'd be better not to have retry logic here, but to follow the more >> general pattern of making sure that we know whether the information >> needs to be recomputed at the next access. We could either do that by >> having an additional bit of information about the validity of the >> bitmaps (so we have invalid, building, valid - and only set valid at the >> end of computing the bitmaps when still building and not invalid again), >> or we simply move the bitmap computation into the normal relcache build. 
> > To show what I mean here's an *unpolished* and *barely tested* patch > implementing the first of my suggestions. > + * Signal that the attribute bitmaps are being built. If there's any + * relcache invalidations while building them, rd_bitmapsvalid will be + * reset to 0. In that case return the bitmaps, but don't mark them as + * valid. + */ I don't see in your patch that you are setting rd_bitmapsvalid to 0. Also, I think this suffers from the same problem as the patch proposed by Alvaro which is that different attrs (hot_attrs, key_attrs and id_attrs) will get computed based on different index lists which can cause a problem as mentioned in my mail up thread. However, I think your general approach and the idea that we have to set things up for next time is spot on. I am not sure, but I think your solution can be extended to make it somewhat similar to what we do with rd_indexvalid (basically, we should set rd_bitmapsvalid to 2 (bitmap is temporarily forced) at rel cache invalidation and then clean it up at transaction end). I can try something on those lines. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
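Andres's invalid/building/valid flag, with the requirement that an invalidation arriving during the build must knock the entry back to invalid, amounts to a small state machine. An illustrative Python model (not PostgreSQL code; the real patch tracks this in rd_bitmapsvalid on the relcache entry):

```python
INVALID, BUILDING, VALID = 0, 1, 2

class Entry:
    """Toy relcache entry: caches bitmaps only if no flush interrupted
    the computation."""

    def __init__(self):
        self.state = INVALID
        self.bitmaps = None

    def on_invalidation(self):
        # Any relcache flush knocks the entry back to INVALID, even if a
        # build is in progress -- that build's result must not be marked
        # valid, since it may be based on a stale index list.
        self.state = INVALID

    def get_bitmaps(self, compute, invalidation_during_build=False):
        if self.state == VALID:
            return self.bitmaps              # quick exit: cached result
        self.state = BUILDING
        bitmaps = compute()
        if invalidation_during_build:        # simulate a concurrent flush
            self.on_invalidation()
        self.bitmaps = bitmaps
        if self.state == BUILDING:           # no flush arrived: safe to cache
            self.state = VALID
        # else: return the possibly stale result, but recompute next time
        return bitmaps

e = Entry()
e.get_bitmaps(lambda: {"a"}, invalidation_during_build=True)
print(e.state == VALID)   # False: next access recomputes
e.get_bitmaps(lambda: {"a", "b"})
print(e.state == VALID)   # True
```

This is the "return stale data but don't cache it" behavior Tom and Andres debate above: the interrupted build's result is handed back once, but validity is only recorded when the build completed without interference.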
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote: > > > > 2. In the second patch, we tried to recompute attribute lists if a > relcache > > flush happens in between and index list is invalidated. We've seen > problems > > with that, especially it getting into an infinite loop with > > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache > > flushes keep happening. > > Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush > happen whenever it possibly could. Ah, ok. That explains why the retry logic as originally proposed was causing an infinite loop. > The way to deal with that without > looping is to test whether the index set *actually* changed, not whether > it just might have changed. > I like this approach. I'll run the patch on a build with CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the fact that retry logic is now limited to only when index set changes, which fits in the situation we're dealing with. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 4:57 PM, Tomas Vondra wrote: > OTOH I disagree with the notion that bugs that are not driven by user > reports are somehow less severe. Some data corruption bugs cause quite > visible breakage - segfaults, immediate crashes, etc. Those are pretty clear > bugs, and are reported by users. I meant that I find the fact that there were no user reports in all these years to be a good reason to not proceed for now in this instance. I wrote amcheck to detect the latter variety of bug, so clearly I think that they are very serious. -- Peter Geoghegan
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Hi, On 2017-02-05 16:37:33 -0800, Andres Freund wrote: > > RelationGetIndexList(Relation relation) > > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat > > * we can include system attributes (e.g., OID) in the bitmap > > representation. > > * > > * Caller had better hold at least RowExclusiveLock on the target relation > > - * to ensure that it has a stable set of indexes. This also makes it safe > > - * (deadlock-free) for us to take locks on the relation's indexes. > > + * to ensure it is safe (deadlock-free) for us to take locks on the > > relation's > > + * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY, > > + * that lock level doesn't guarantee a stable set of indexes, so we have to > > + * be prepared to retry here in case of a change in the set of indexes. > > I've not yet read the full thread, but I'm a bit confused so far. We > obviously can get changing information about indexes here, but isn't > that something we have to deal with anyway? If we guarantee that we > don't lose knowledge that there's a pending invalidation, why do we > have to retry? Pretty much by definition the knowledge in a relcache > entry can be outdated as soon as returned unless locking prevents that > from being possible - which is not the case here. > > ISTM it'd be better not to have retry logic here, but to follow the more > general pattern of making sure that we know whether the information > needs to be recomputed at the next access. We could either do that by > having an additional bit of information about the validity of the > bitmaps (so we have invalid, building, valid - and only set valid at the > end of computing the bitmaps when still building and not invalid again), > or we simply move the bitmap computation into the normal relcache build. To show what I mean here's an *unpolished* and *barely tested* patch implementing the first of my suggestions. Alvaro, Pavan, I think this should address the issue as well? 
- Andres

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560e46..9e94495e75 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4745,9 +4745,12 @@ RelationGetIndexPredicate(Relation relation)
  * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
  * we can include system attributes (e.g., OID) in the bitmap representation.
  *
- * Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes.  This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * Caller had better hold at least RowExclusiveLock on the target relation to
+ * ensure that it has a stable set of indexes.  This also makes it safe
+ * (deadlock-free) for us to take locks on the relation's indexes.  Note that
+ * a concurrent CREATE/DROP INDEX CONCURRENTLY can lead to an outdated list
+ * being returned (will be recomputed at the next access), the CONCURRENTLY
+ * code deals with that.
  *
  * The returned result is palloc'd in the caller's memory context and should
  * be bms_free'd when not needed anymore.
@@ -4766,7 +4769,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	MemoryContext oldcxt;
 
 	/* Quick exit if we already computed the result. */
-	if (relation->rd_indexattr != NULL)
+	if (relation->rd_bitmapsvalid == 2)
 	{
 		switch (attrKind)
 		{
@@ -4788,6 +4791,14 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 		return NULL;
 
 	/*
+	 * Signal that the attribute bitmaps are being built. If there's any
+	 * relcache invalidations while building them, rd_bitmapsvalid will be
+	 * reset to 0. In that case return the bitmaps, but don't mark them as
+	 * valid.
+	 */
+	relation->rd_bitmapsvalid = 1;
+
+	/*
 	 * Get cached list of index OIDs
 	 */
 	indexoidlist = RelationGetIndexList(relation);
@@ -4892,13 +4903,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	bms_free(relation->rd_idattr);
 	relation->rd_idattr = NULL;
 
-	/*
-	 * Now save copies of the bitmaps in the relcache entry.  We intentionally
-	 * set rd_indexattr last, because that's the one that signals validity of
-	 * the values; if we run out of memory before making that copy, we won't
-	 * leave the relcache entry looking like the other ones are valid but
-	 * empty.
-	 */
+	/* now save copies of the bitmaps in the relcache entry */
 	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 	relation->rd_keyattr = bms_copy(uindexattrs);
 	relation->rd_pkattr = bms_copy(pkindexattrs);
@@ -4906,6 +4911,18 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	relation->rd_indexattr = bms_copy(indexattrs);
 	MemoryContextSwitchTo(oldcxt);
 
+	/*
+	 * If there've been no invalidations while building the entry, mark the
+	 * stored bitmaps as being valid. Need to do so after the copies above,
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 02/06/2017 01:11 AM, Peter Geoghegan wrote: On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote: I don't think this kind of black-and-white thinking is very helpful. Obviously, data corruption is bad. However, this bug has (from what one can tell from this thread) been with us for over a decade; it must necessarily be either low-probability or low-severity, or somebody would've found it and fixed it before now. Indeed, the discovery of this bug was driven by new feature development, not a user report. It seems pretty clear that if we try to patch this and get it wrong, the effects of our mistake could easily be a lot more serious than the original bug. +1. The fact that it wasn't driven by a user report convinces me that this is the way to go. +1 to not rushing fixes into releases. While I think we now finally understand the mechanics of this bug, the fact that we came up with three different fixes in this thread, only to discover issues with each of them, warrants some caution. OTOH I disagree with the notion that bugs that are not driven by user reports are somehow less severe. Some data corruption bugs cause quite visible breakage - segfaults, immediate crashes, etc. Those are pretty clear bugs, and are reported by users. Other data corruption bugs are much more subtle - for example this bug may lead to incorrect results to some queries, failing to detect values violating UNIQUE constraints, issues with foreign keys, etc. It's damn impossible to notice incorrect query results that only affect tiny subset of the rows (e.g. rows updated when the CIC was running), especially when the issue may go away after a while due to additional non-HOT updates. Regarding the other symptoms - I wonder how many strange 'duplicate value' errors were misdiagnosed, wrongly attributed to a recent power outage, etc. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2017-02-05 15:14:59 -0500, Tom Lane wrote:
> I do not like any of the other patches proposed in this thread, because
> they fail to guarantee delivering an up-to-date attribute bitmap to the
> caller. I think we need a retry loop, and I think that it needs to look
> like the attached.

That looks like a reasonable approach, although I'm not sure it's the
best one.

> (Note that the tests whether rd_pkindex and rd_replidindex changed might
> be redundant; but I'm not totally sure of that, and they're cheap.)

I don't think they can, but I'm all for re-checking.

> RelationGetIndexList(Relation relation)
> @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relat
> * we can include system attributes (e.g., OID) in the bitmap representation.
> *
> * Caller had better hold at least RowExclusiveLock on the target relation
> - * to ensure that it has a stable set of indexes. This also makes it safe
> - * (deadlock-free) for us to take locks on the relation's indexes.
> + * to ensure it is safe (deadlock-free) for us to take locks on the
> relation's
> + * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY,
> + * that lock level doesn't guarantee a stable set of indexes, so we have to
> + * be prepared to retry here in case of a change in the set of indexes.

I've not yet read the full thread, but I'm a bit confused so far. We
obviously can get changing information about indexes here, but isn't
that something we have to deal with anyway? If we guarantee that we
don't lose knowledge that there's a pending invalidation, why do we
have to retry? Pretty much by definition the knowledge in a relcache
entry can be outdated as soon as returned unless locking prevents that
from being possible - which is not the case here.

ISTM it'd be better not to have retry logic here, but to follow the more
general pattern of making sure that we know whether the information
needs to be recomputed at the next access.
We could either do that by having an additional bit of information about the validity of the bitmaps (so we have invalid, building, valid - and only set valid at the end of computing the bitmaps when still building and not invalid again), or we simply move the bitmap computation into the normal relcache build. BTW, the interactions of RelationSetIndexList with RelationGetIndexAttrBitmap() look more than a bit fragile to me, even if documented: * * We deliberately do not change rd_indexattr here: even when operating * with a temporary partial index list, HOT-update decisions must be made * correctly with respect to the full index set. It is up to the caller * to ensure that a correct rd_indexattr set has been cached before first * calling RelationSetIndexList; else a subsequent inquiry might cause a * wrong rd_indexattr set to get computed and cached. Likewise, we do not * touch rd_keyattr, rd_pkattr or rd_idattr. Greetings, Andres Freund
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2017-02-05 12:51:09 -0500, Tom Lane wrote: > Michael Paquier writes: > > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule > > wrote: > >> I agree with Pavan - a release with known important bug is not good idea. > > > This bug has been around for some time, so I would recommend taking > > the time necessary to make the best fix possible, even if it means > > waiting for the next round of minor releases. > > I think the way to think about this sort of thing is, if we had found > this bug when a release wasn't imminent, would we consider it bad enough > to justify an unscheduled release cycle? I have to think the answer for > this one is "probably not". +1. I don't think we serve our users by putting out a nontrivial bugfix hastily. Nor do I think we serve them in this instance by delaying the release to cover this fix; there's enough other fixes in the release imo. - Andres
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote: > I don't think this kind of black-and-white thinking is very helpful. > Obviously, data corruption is bad. However, this bug has (from what > one can tell from this thread) been with us for over a decade; it must > necessarily be either low-probability or low-severity, or somebody > would've found it and fixed it before now. Indeed, the discovery of > this bug was driven by new feature development, not a user report. It > seems pretty clear that if we try to patch this and get it wrong, the > effects of our mistake could easily be a lot more serious than the > original bug. +1. The fact that it wasn't driven by a user report convinces me that this is the way to go. -- Peter Geoghegan
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 1:34 PM, Martín Marqués wrote: > El 05/02/17 a las 10:03, Michael Paquier escribió: >> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule >> wrote: >>> I agree with Pavan - a release with known important bug is not good idea. >> >> This bug has been around for some time, so I would recommend taking >> the time necessary to make the best fix possible, even if it means >> waiting for the next round of minor releases. > > The fact that the bug has been around for a long time doesn't mean we > shouldn't take it seriously. > > IMO any kind of corruption (heap or index) should be prioritized. > There's also been comments about maybe this being the cause of old > reports about index corruption. > > I ask myself if it's a good idea to make a point release with a know > corruption bug in it. I don't think this kind of black-and-white thinking is very helpful. Obviously, data corruption is bad. However, this bug has (from what one can tell from this thread) been with us for over a decade; it must necessarily be either low-probability or low-severity, or somebody would've found it and fixed it before now. Indeed, the discovery of this bug was driven by new feature development, not a user report. It seems pretty clear that if we try to patch this and get it wrong, the effects of our mistake could easily be a lot more serious than the original bug. Now, I do not object to patching this tomorrow before the wrap if the various senior hackers who have studied this (Tom, Alvaro, Pavan, Amit) are all in agreement on the way forward. But if there is any significant chance that we don't fully understand this issue and that our fix might cause bigger problems, it would be more prudent to wait. I'm all in favor of releasing important bug fixes quickly, but releasing a bug fix before we're sure we've fixed the bug correctly is taking a good principle to an irrational extreme. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
[ Having now read the whole thread, I'm prepared to weigh in ... ] Pavan Deolasee writes: > This seems like a real problem to me. I don't know what the consequences > are, but definitely having various attribute lists to have different view > of the set of indexes doesn't seem right. For sure. > 2. In the second patch, we tried to recompute attribute lists if a relcache > flush happens in between and index list is invalidated. We've seen problems > with that, especially it getting into an infinite loop with > CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently relcache > flushes keep happening. Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush happen whenever it possibly could. The way to deal with that without looping is to test whether the index set *actually* changed, not whether it just might have changed. I do not like any of the other patches proposed in this thread, because they fail to guarantee delivering an up-to-date attribute bitmap to the caller. I think we need a retry loop, and I think that it needs to look like the attached. (Note that the tests whether rd_pkindex and rd_replidindex changed might be redundant; but I'm not totally sure of that, and they're cheap.) regards, tom lane diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 8a7c560..b659994 100644 *** a/src/backend/utils/cache/relcache.c --- b/src/backend/utils/cache/relcache.c *** RelationGetFKeyList(Relation relation) *** 4327,4335 * it is the pg_class OID of a unique index on OID when the relation has one, * and InvalidOid if there is no such index. * ! * In exactly the same way, we update rd_replidindex, which is the pg_class ! * OID of an index to be used as the relation's replication identity index, ! * or InvalidOid if there is no such index. 
*/ List * RelationGetIndexList(Relation relation) --- 4327,4336 * it is the pg_class OID of a unique index on OID when the relation has one, * and InvalidOid if there is no such index. * ! * In exactly the same way, we update rd_pkindex, which is the OID of the ! * relation's primary key index if any, else InvalidOid; and rd_replidindex, ! * which is the pg_class OID of an index to be used as the relation's ! * replication identity index, or InvalidOid if there is no such index. */ List * RelationGetIndexList(Relation relation) *** RelationGetIndexPredicate(Relation relat *** 4746,4753 * we can include system attributes (e.g., OID) in the bitmap representation. * * Caller had better hold at least RowExclusiveLock on the target relation ! * to ensure that it has a stable set of indexes. This also makes it safe ! * (deadlock-free) for us to take locks on the relation's indexes. * * The returned result is palloc'd in the caller's memory context and should * be bms_free'd when not needed anymore. --- 4747,4756 * we can include system attributes (e.g., OID) in the bitmap representation. * * Caller had better hold at least RowExclusiveLock on the target relation ! * to ensure it is safe (deadlock-free) for us to take locks on the relation's ! * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY, ! * that lock level doesn't guarantee a stable set of indexes, so we have to ! * be prepared to retry here in case of a change in the set of indexes. * * The returned result is palloc'd in the caller's memory context and should * be bms_free'd when not needed anymore. *** RelationGetIndexAttrBitmap(Relation rela *** 4760,4765 --- 4763,4769 Bitmapset *pkindexattrs; /* columns in the primary index */ Bitmapset *idindexattrs; /* columns in the replica identity */ List *indexoidlist; + List *newindexoidlist; Oid relpkindex; Oid relreplindex; ListCell *l; *** RelationGetIndexAttrBitmap(Relation rela *** 4788,4795 return NULL; /* ! 
* Get cached list of index OIDs */ indexoidlist = RelationGetIndexList(relation); /* Fall out if no indexes (but relhasindex was set) */ --- 4792,4800 return NULL; /* ! * Get cached list of index OIDs. If we have to start over, we do so here. */ + restart: indexoidlist = RelationGetIndexList(relation); /* Fall out if no indexes (but relhasindex was set) */ *** RelationGetIndexAttrBitmap(Relation rela *** 4800,4808 * Copy the rd_pkindex and rd_replidindex value computed by * RelationGetIndexList before proceeding. This is needed because a * relcache flush could occur inside index_open below, resetting the ! * fields managed by RelationGetIndexList. (The values we're computing ! * will still be valid, assuming that caller has a sufficient lock on ! * the relation.) */ relpkindex = relation->rd_pkindex; relreplindex = relation->rd_replidindex; --- 4805,4812
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 05/02/17 at 10:03, Michael Paquier wrote:
> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote:
>> I agree with Pavan - a release with known important bug is not good idea.
>
> This bug has been around for some time, so I would recommend taking
> the time necessary to make the best fix possible, even if it means
> waiting for the next round of minor releases.

The fact that the bug has been around for a long time doesn't mean we
shouldn't take it seriously.

IMO any kind of corruption (heap or index) should be prioritized. There
have also been comments about maybe this being the cause of old reports
about index corruption.

I ask myself if it's a good idea to make a point release with a known
corruption bug in it.

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
2017-02-05 18:51 GMT+01:00 Tom Lane :
> Michael Paquier writes:
> > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote:
> >> I agree with Pavan - a release with known important bug is not good idea.
>
> > This bug has been around for some time, so I would recommend taking
> > the time necessary to make the best fix possible, even if it means
> > waiting for the next round of minor releases.
>
> I think the way to think about this sort of thing is, if we had found
> this bug when a release wasn't imminent, would we consider it bad enough
> to justify an unscheduled release cycle? I have to think the answer for
> this one is "probably not".

There are a few questions:

1. Is it a critical bug?
2. If it is, is it reason enough for a special minor release?
3. How much time is necessary to fix this bug?

These questions may be moot if there are security fixes in the next
release, which I cannot know.

Regards

Pavel

> regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Michael Paquier writes: > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote: >> I agree with Pavan - a release with known important bug is not good idea. > This bug has been around for some time, so I would recommend taking > the time necessary to make the best fix possible, even if it means > waiting for the next round of minor releases. I think the way to think about this sort of thing is, if we had found this bug when a release wasn't imminent, would we consider it bad enough to justify an unscheduled release cycle? I have to think the answer for this one is "probably not". regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote: > I agree with Pavan - a release with known important bug is not good idea. This bug has been around for some time, so I would recommend taking the time necessary to make the best fix possible, even if it means waiting for the next round of minor releases. -- Michael
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
2017-02-05 7:54 GMT+01:00 Pavan Deolasee : > > On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane wrote: > >> >> Based on Pavan's comments, I think trying to force this into next week's >> releases would be extremely unwise. If the bug went undetected this long, >> it can wait for a fix for another three months. > > > Yes, I think bug existed ever since and went unnoticed. One reason could > be that the race happens only when the new index turns HOT updates into > non-HOT updates. Another reason could be that we don't have checks in place > to catch these kinds of corruption. Having said that, since we have > discovered the bug, at least many 2ndQuadrant customers have expressed > worry and want to know if the fix will be available in 9.6.2 and other > minor releases. Since the bug can lead to data corruption, the worry is > justified. Until we fix the bug, there will be a constant worry about using > CIC. > > If we can have some kind of band-aid fix to plug in the hole, that might > be enough as well. I tested my first patch (which will need some polishing) > and that works well AFAIK. I was worried about prepared queries and all, > but that seems ok too. RelationGetIndexList() always get called within > ExecInitModifyTable. The fix seems quite unlikely to cause any other side > effects. > > Another possible band-aid is to throw another relcache invalidation in > CIC. Like adding a dummy index_set_state_flags() within yet another > transaction. Seems ugly, but should fix the problem for now and not cause > any impact on relcache mechanism whatsoever. > > That seems better than >> risking new breakage when it's barely 48 hours to the release wrap >> deadline. We do not have time to recover from any mistakes. >> > > I'm not sure what the project policies are, but can we consider delaying > the release by a week for issues like these? Or do you think it will be > hard to come up with a proper fix for the issue and it will need some > serious refactoring? 
I agree with Pavan - a release with a known important bug is not a good
idea.

Pavel

> Thanks,
> Pavan
>
> --
> Pavan Deolasee http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane wrote:
>
> Based on Pavan's comments, I think trying to force this into next week's
> releases would be extremely unwise. If the bug went undetected this long,
> it can wait for a fix for another three months.

Yes, I think the bug has existed ever since and gone unnoticed. One
reason could be that the race happens only when the new index turns HOT
updates into non-HOT updates. Another reason could be that we don't have
checks in place to catch these kinds of corruption. Having said that,
since we have discovered the bug, at least many 2ndQuadrant customers
have expressed worry and want to know if the fix will be available in
9.6.2 and other minor releases. Since the bug can lead to data
corruption, the worry is justified. Until we fix the bug, there will be
a constant worry about using CIC.

If we can have some kind of band-aid fix to plug the hole, that might be
enough as well. I tested my first patch (which will need some polishing)
and that works well AFAIK. I was worried about prepared queries and all,
but that seems ok too. RelationGetIndexList() always gets called within
ExecInitModifyTable. The fix seems quite unlikely to cause any other
side effects.

Another possible band-aid is to throw another relcache invalidation in
CIC. Like adding a dummy index_set_state_flags() within yet another
transaction. Seems ugly, but should fix the problem for now and not
cause any impact on the relcache mechanism whatsoever.

> That seems better than
> risking new breakage when it's barely 48 hours to the release wrap
> deadline. We do not have time to recover from any mistakes.

I'm not sure what the project policies are, but can we consider delaying
the release by a week for issues like these? Or do you think it will be
hard to come up with a proper fix for the issue and it will need some
serious refactoring?

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Alvaro Herrera writes: > I intend to commit this soon to all branches, to ensure it gets into the > next set of minors. Based on Pavan's comments, I think trying to force this into next week's releases would be extremely unwise. If the bug went undetected this long, it can wait for a fix for another three months. That seems better than risking new breakage when it's barely 48 hours to the release wrap deadline. We do not have time to recover from any mistakes. regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sat, Feb 4, 2017 at 12:10 PM, Amit Kapila wrote:
>
> If we do above, then I think primary key attrs won't be returned
> because for those we are using relation copy rather than an original
> working copy of attrs. See code below:
>
> switch (attrKind)
> {
> ..
> case INDEX_ATTR_BITMAP_PRIMARY_KEY:
> return bms_copy(relation->rd_pkattr);
>

I don't think that would be a problem, since if relation->rd_indexattr
is NULL, primary key attrs will be recomputed and returned.

>
> Apart from above, I think after the proposed patch, it will be quite
> possible that in heap_update(), hot_attrs, key_attrs and id_attrs are
> computed using different index lists (consider relcache flush happens
> after computation of hot_attrs or key_attrs) which was previously not
> possible. I think in the later code we assume that all the three
> should be computed using the same indexlist. For ex. see the below
> code:
>

This seems like a real problem to me. I don't know what the consequences
are, but definitely having the various attribute lists have a different
view of the set of indexes doesn't seem right.

The problem we are trying to fix is very clear by now. Relcache flush
clears rd_indexvalid and rd_indexattr together, but because of the race,
rd_indexattr is recomputed with the old information and gets cached
again, while rd_indexvalid (and rd_indexlist) remains unset. That leads
to the index corruption in CIC and may cause other issues too that we're
not aware of right now. We want cached attribute lists to become invalid
whenever the index list is invalidated, and that's not happening right
now.

So we have tried three approaches so far.

1. Simply reset rd_indexattr whenever we detect rd_indexvalid has been
reset. This was the first patch. It's a very simple patch and has worked
for me with CLOBBER_CACHE_ALWAYS and even fixes the CIC bug. The only
downside seems to be that we're invalidating cached attribute lists
outside the normal relcache invalidation path.
It also works on the premise that RelationGetIndexList() will be called
every time before RelationGetIndexAttrBitmap() is called; otherwise we
could still end up using stale cached copies. I wonder if cached plans
can be a problem here.

2. In the second patch, we tried to recompute attribute lists if a
relcache flush happens in between and the index list is invalidated.
We've seen problems with that, especially it getting into an infinite
loop with CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently
relcache flushes keep happening.

3. We tried a third approach where we don't cache attribute lists if we
know that the index set has changed, and hope that the next caller gets
the correct info. But Amit rightly identified problems with that
approach too.

So we either need to find a 4th approach or stay with the first patch,
unless we see a problem there too (cached plans, anything else). Or
maybe fix CREATE INDEX CONCURRENTLY so that at least the first phase,
which adds the index entry to the catalog, happens with a strong lock.
This will lead to momentary blocking of write queries, but protect us
from the race. I'm assuming this is the only code that can cause the
problem, and I haven't checked other potential hazards.

Ideas/suggestions?

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sat, Feb 4, 2017 at 12:12 AM, Alvaro Herrera wrote: > Pavan Deolasee wrote: > >> Looking at the history and some past discussions, it seems Tomas reported >> somewhat similar problem and Andres proposed a patch here >> https://www.postgresql.org/message-id/20140514155204.ge23...@awork2.anarazel.de >> which got committed via b23b0f5588d2e2. Not exactly the same issue, but >> related to relcache flush happening in index_open(). >> >> I wonder if we should just add a recheck logic >> in RelationGetIndexAttrBitmap() such that if rd_indexvalid is seen as 0 at >> the end, we go back and start again from RelationGetIndexList(). Or is >> there any code path that relies on finding the old information? > > Good find on that old report. It occured to me to re-run Tomas' test > case with this second patch you proposed (the "ddl" test takes 11 > minutes in my laptop), and lo and behold -- your "recheck" code enters > an infinite loop. Not good. I tried some more tricks that came to > mind, but I didn't get any good results. Pavan and I discussed it > further and he came up with the idea of returning the bitmapset we > compute, but if an invalidation occurs in index_open, simply do not > cache the bitmapsets so that next time this is called they are all > computed again. Patch attached. > > This appears to have appeased both test_decoding's ddl test under > CACHE_CLOBBER_ALWAYS, as well as the CREATE INDEX CONCURRENTLY bug. > > I intend to commit this soon to all branches, to ensure it gets into the > next set of minors. > Thanks for detailed explanation, using steps mentioned in your mail and Pavan's previous mail steps, I could reproduce the problem. Regarding your fix: + * Now save copies of the bitmaps in the relcache entry, but only if no + * invalidation occured in the meantime. 
We intentionally set rd_indexattr + * last, because that's the one that signals validity of the values; if we + * run out of memory before making that copy, we won't leave the relcache + * entry looking like the other ones are valid but empty. */ - oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - relation->rd_keyattr = bms_copy(uindexattrs); - relation->rd_pkattr = bms_copy(pkindexattrs); - relation->rd_idattr = bms_copy(idindexattrs); - relation->rd_indexattr = bms_copy(indexattrs); - MemoryContextSwitchTo(oldcxt); + if (relation->rd_indexvalid != 0) + { + MemoryContext oldcxt; + + oldcxt = MemoryContextSwitchTo(CacheMemoryContext); + relation->rd_keyattr = bms_copy(uindexattrs); + relation->rd_pkattr = bms_copy(pkindexattrs); + relation->rd_idattr = bms_copy(idindexattrs); + relation->rd_indexattr = bms_copy(indexattrs); + MemoryContextSwitchTo(oldcxt); + } If we do above, then I think primary key attrs won't be returned because for those we are using relation copy rather than an original working copy of attrs. See code below: switch (attrKind) { .. case INDEX_ATTR_BITMAP_PRIMARY_KEY: return bms_copy(relation->rd_pkattr); Apart from above, I think after the proposed patch, it will be quite possible that in heap_update(), hot_attrs, key_attrs and id_attrs are computed using different index lists (consider relcache flush happens after computation of hot_attrs or key_attrs) which was previously not possible. I think in the later code we assume that all the three should be computed using the same indexlist. For ex. see the below code: HeapSatisfiesHOTandKeyUpdate() { .. Assert(bms_is_subset(id_attrs, key_attrs)); Assert(bms_is_subset(key_attrs, hot_attrs)); .. } Also below comments in heap_update indicate that we shouldn't worry about relcache flush between the computation of hot_attrs, key_attrs and id_attrs. heap_update() { .. * Note that we get a copy here, so we need not worry about relcache flush * happening midway through. 
*/ hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL); key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY); id_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_IDENTITY_KEY); .. } Typo. /occured/occurred -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Pavan Deolasee wrote:
> Looking at the history and some past discussions, it seems Tomas reported
> somewhat similar problem and Andres proposed a patch here
> https://www.postgresql.org/message-id/20140514155204.ge23...@awork2.anarazel.de
> which got committed via b23b0f5588d2e2. Not exactly the same issue, but
> related to relcache flush happening in index_open().
>
> I wonder if we should just add a recheck logic
> in RelationGetIndexAttrBitmap() such that if rd_indexvalid is seen as 0 at
> the end, we go back and start again from RelationGetIndexList(). Or is
> there any code path that relies on finding the old information?

Good find on that old report. It occurred to me to re-run Tomas' test
case with this second patch you proposed (the "ddl" test takes 11
minutes on my laptop), and lo and behold -- your "recheck" code enters
an infinite loop. Not good. I tried some more tricks that came to mind,
but I didn't get any good results. Pavan and I discussed it further and
he came up with the idea of returning the bitmapset we compute, but if
an invalidation occurs in index_open, simply not caching the bitmapsets,
so that the next time this is called they are all computed again. Patch
attached.

This appears to have appeased both test_decoding's ddl test under
CLOBBER_CACHE_ALWAYS, as well as the CREATE INDEX CONCURRENTLY bug.

I intend to commit this soon to all branches, to ensure it gets into the
next set of minors.

Here is a detailed reproducer for the CREATE INDEX CONCURRENTLY bug.
Line numbers are as of today's master; comments are supposed to help
locate good spots in other branches. Given the use of gdb breakpoints,
there's no good way to integrate this with regression tests, which is a
pity because this stuff looks a little brittle to me, and if it breaks
it is hard to detect.
Prep:
DROP TABLE IF EXISTS testtab;
CREATE TABLE testtab (a int unique, b int, c int);
INSERT INTO testtab VALUES (1, 100, 10);
CREATE INDEX testindx ON testtab (b);

S1:
SELECT * FROM testtab;
UPDATE testtab SET b = b + 1 WHERE a = 1;
SELECT pg_backend_pid();
attach GDB to this session.
break relcache.c:4800   # right before entering the per-index loop
cont

S2:
SELECT pg_backend_pid();
DROP INDEX testindx;
attach GDB to this session
handle SIGUSR1 nostop
break indexcmds.c:666   # right after index_create
break indexcmds.c:790   # right before the second CommitTransactionCommand
cont
CREATE INDEX CONCURRENTLY testindx ON testtab (b);
-- this blocks in breakpoint 1. Leave it there.

S1:
UPDATE testtab SET b = b + 1 WHERE a = 1;
-- this blocks in breakpoint 1. Leave it there.

S2: gdb -> cont
-- This blocks waiting for S1

S1: gdb -> cont
-- S1 finishes the update. S2 awakens and blocks again in breakpoint 2.

S1: -- Now run more UPDATEs:
UPDATE testtab SET b = b + 1 WHERE a = 1;
This produces a broken HOT chain. This can be done several times; all these updates should be non-HOT because of the index created by CIC, but they are marked HOT.

S2: "cont"
From this point onwards, updates are correct.
SELECT * FROM testtab;
SET enable_seqscan TO 0;
SET enable_bitmapscan TO 0;
SELECT * FROM testtab WHERE b between 100 and 110;

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

commit 7054c21883154f67058d42180606e0c5428d2745
Author:     Alvaro Herrera
AuthorDate: Fri Feb 3 15:40:37 2017 -0300
CommitDate: Fri Feb 3 15:41:59 2017 -0300

    Fix CREATE INDEX CONCURRENTLY relcache bug

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560..003ccfa 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4745,9 +4745,11 @@ RelationGetIndexPredicate(Relation relation)
  * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
  * we can include system attributes (e.g., OID) in the bitmap representation.
  *
- * Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes.  This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * While all callers should hold at least RowExclusiveLock on the target
+ * relation, we still can't guarantee a stable set of indexes because CREATE
+ * INDEX CONCURRENTLY and DROP INDEX CONCURRENTLY can change the set of
+ * indexes without taking any conflicting lock.  So we must be prepared to
+ * handle changes in the set of indexes and recompute bitmaps, when necessary.
  *
  * The returned result is palloc'd in the caller's memory context and should
  * be bms_free'd when not needed anymore.
@@ -4763,7 +4765,6 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	Oid			relpkindex;
 	Oid			relreplindex;
 	ListCell   *l;
-	MemoryContext oldcxt;

 	/* Quick exit if we already computed th
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Thu, Feb 2, 2017 at 10:14 PM, Alvaro Herrera wrote:
>
> I'm going to study the bug a bit more, and put in some patch before the
> upcoming minor tag on Monday.
>

Looking at the history and some past discussions, it seems Tomas reported a somewhat similar problem and Andres proposed a patch here https://www.postgresql.org/message-id/20140514155204.ge23...@awork2.anarazel.de which got committed via b23b0f5588d2e2. Not exactly the same issue, but related to a relcache flush happening in index_open().

I wonder if we should just add recheck logic in RelationGetIndexAttrBitmap() such that if rd_indexvalid is seen as 0 at the end, we go back and start again from RelationGetIndexList(). Or is there any code path that relies on finding the old information?

The following comment at the top of RelationGetIndexAttrBitmap() also seems like a problem to me. CIC works with a lower-strength lock and hence it can change the set of indexes even though the caller is holding RowExclusiveLock, no? The comment seems to suggest otherwise.

4748  * Caller had better hold at least RowExclusiveLock on the target relation
4749  * to ensure that it has a stable set of indexes.  This also makes it safe
4750  * (deadlock-free) for us to take locks on the relation's indexes.

I've attached a revised patch based on these thoughts. It looks a bit more invasive than the earlier one-liner, but looks better to me.

Thanks,
Pavan

-- 
Pavan Deolasee   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

invalidate_indexattr_v2.patch
Description: Binary data
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Thu, Feb 2, 2017 at 6:14 PM, Amit Kapila wrote: > > /* > + * If the index list was invalidated, we better also invalidate the index > + * attribute list (which should automatically invalidate other attributes > + * such as primary key and replica identity) > + */ > > + relation->rd_indexattr = NULL; > + > > I think setting directly to NULL will leak the memory. > Good catch, thanks. Will fix. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Pavan Deolasee wrote:
> I can reproduce this entire scenario using gdb sessions. This also explains
> why the patch I sent earlier helps to solve the problem.

Ouch. Great detective work there.

I think it's quite possible that this bug explains some index errors, such as primary keys (or unique indexes) that when recreated fail with duplicate values: the first value inserted would get in via an UPDATE during the CIC race window, and sometime later the second value is inserted correctly. Because the first one is missing from the index, the second one does not give rise to a dupe error upon insertion, but they are of course both detected when the index is recreated.

I'm going to study the bug a bit more, and put in some patch before the upcoming minor tag on Monday.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee wrote:
>
> Based on my investigation so far and the evidence that I've collected,
> what probably happens is that after a relcache invalidation arrives at the
> new transaction, it recomputes the rd_indexattr but based on the old,
> cached rd_indexlist. So the rd_indexattr does not include the new columns.
> Later rd_indexlist is updated, but the rd_indexattr remains what it was.
>

I was kinda puzzled this report did not get any attention, though it's clearly a data corruption issue. Anyways, now I know why this is happening and can reproduce this very easily via debugger and two sessions. The offending code is RelationGetIndexAttrBitmap(). Here is the sequence of events.

Taking the same table as in the report:

CREATE TABLE cic_tab_accounts (
    aid bigint UNIQUE,
    abalance bigint,
    aid1 bigint
);

1. Session S1 calls DROP INDEX pgb_a_aid1 and completes.

2. Session S2 is in the process of rebuilding its rd_indexattr bitmap (because of the previous relcache invalidation caused by DROP INDEX). To be precise, assume that the session has finished rebuilding its index list. Since DROP INDEX has just happened,

4793     indexoidlist = RelationGetIndexList(relation);

3. S1 then starts CREATE INDEX CONCURRENTLY pgb_a_aid1 ON cic_tab_accounts(aid1).

4. S1 creates the catalog entry for the new index, commits the phase-1 transaction, builds an MVCC snapshot for the second phase and also finishes the initial index build.

5. S2 now proceeds. Notice that S2 had computed its index list based on the old view, i.e. just one index. The following comments in relcache.c are now significant:

4799     /*
4800      * Copy the rd_pkindex and rd_replidindex value computed by
4801      * RelationGetIndexList before proceeding.  This is needed because a
4802      * relcache flush could occur inside index_open below, resetting the
4803      * fields managed by RelationGetIndexList.  (The values we're computing
4804      * will still be valid, assuming that caller has a sufficient lock on
4805      * the relation.)
4806      */

That's what precisely goes wrong.

6. When index_open is called, the relcache invalidation generated by the first transaction commit in CIC gets accepted and processed. As part of this invalidation, rd_indexvalid is set to 0, which invalidates rd_indexlist too.

7. But S2 is currently using the index list obtained at step 2 above (which has only the old index). It goes ahead and computes rd_indexattr based on just the old index.

8. While the relcache invalidation processing at step 6 cleared rd_indexattr (along with rd_indexvalid), it is again set at

4906     relation->rd_indexattr = bms_copy(indexattrs);

But what gets set at step 8 is based on the old view. There is no way rd_indexattr will be invalidated until S2 receives another relcache invalidation. That will happen when the CIC proceeds, but until then, every update done by S2 will generate broken HOT chains, leading to data corruption.

I can reproduce this entire scenario using gdb sessions. This also explains why the patch I sent earlier helps to solve the problem.

Thanks,
Pavan

-- 
Pavan Deolasee   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee wrote:
> Hello All,
>
> While stress testing WARM, I encountered a data consistency problem. It
> happens when an index is built concurrently. What I thought to be a
> WARM-induced bug took a significant amount of investigation to finally
> conclude that it's a bug in the master. In fact, we tested all the way up
> to the 9.2 release and the bug exists in all releases.
>
> In my test set up, I keep a UNIQUE index on an "aid" column and then have
> many other "aid[1-4]" columns, on which indexes are built and dropped. The
> script then updates a randomly selected row using any of the aid[1-4]
> columns. Even these columns are updated during the tests to force WARM
> updates, but within a narrow range of non-overlapping values. So we can
> always find the unique row using some kind of range condition. The
> reproduction case is much simpler and I'll explain that below.
>
> In the reproduction scenario (attached), it happens fairly quickly, maybe
> after 3-10 rounds of DROP/CREATE. You may need to adjust the -j value if
> it does not happen for you even after a few rounds. Just for the record, I
> haven't checked past reports to correlate if the bug explains any of the
> index corruption issues we might have received.
>
> To give a short summary of how CIC works with HOT, it consists of three
> phases:
> 1. Create the index entry in the catalog, mark it not ready for inserts,
> commit the transaction, start a new transaction and then wait for all
> potential lock holders on the table to end.
> 2. Take an MVCC snapshot, build the index, mark the index ready for
> inserts, commit and then once again wait for potential lock holders on
> the table to end.
> 3. Do a validation phase where we look at MVCC-visible tuples and add
> missing entries to the index. We only look at the root line pointer of
> each tuple while deciding whether a particular tuple already exists in
> the index or not.
IOW we look at HOT chains and not tuples themselves. > > The interesting case is phase 1 and 2. The phase 2 works on the premise that > every transaction started after we finished waiting for all existing > transactions will definitely see the new index. All these new transactions > then look at the columns used by the new index and consider them while > deciding HOT-safety of updates. Now for reasons which I don't fully > understand, there exists a path where a transaction starts after CIC waited > and took its snapshot at the start of the second phase, but still fails to > see the new index. Or may be it sees the index, but fails to update its > rd_indexattr list to take into account the columns of the new index. That > leads to wrong decisions and we end up with a broken HOT chain, something > the CIC algorithm is not prepared to handle. This in turn leads the new > index missing entries for the update tuples i.e. the index may have aid1 = > 10, but the value in the heap could be aid1 = 11. > > Based on my investigation so far and the evidence that I've collected, what > probably happens is that after a relcache invalidation arrives at the new > transaction, it recomputes the rd_indexattr but based on the old, cached > rd_indexlist. So the rd_indexattr does not include the new columns. Later > rd_indexlist is updated, but the rd_indexattr remains what it was. > > There is definitely an evidence of this sequence of events (collected after > sprinkling code with elog messages), but it's not yet clear to me why it > affects only a small percentage of transactions. One possibility is that it > probably depends on at what stage an invalidation is received and processed > by the backend. I find it a bit confusing that even though rd_indexattr is > tightly coupled to the rd_indexlist, we don't invalidate or recompute them > together. Shouldn't we do that? > During invalidation processing, we do destroy them together in function RelationDestroyRelation(). 
So ideally, after invalidation processing, both should be built again, but not sure why that is not happening in this case. > For example, in the attached patch we > invalidate rd_indexattr whenever we recompute rd_indexlist (i.e. if we see > that rd_indexvalid is 0). /* + * If the index list was invalidated, we better also invalidate the index + * attribute list (which should automatically invalidate other attributes + * such as primary key and replica identity) + */ + relation->rd_indexattr = NULL; + I think setting directly to NULL will leak the memory. > This patch fixes the problem for me (I've only > tried master), but I am not sure if this is a correct fix. > I think it is difficult to say that fix is correct unless there is a clear explanation of what led to this problem. Another angle to investigate is to find out when relation->rd_indexvalid is set to 0 for the particular session where you are seeing the problem. I could see few places where we are setting it to 0 and clearing rd_indexlist, but not rd_indexattr. I am not s
Re: [HACKERS] Index corruption
For the record, here are the results of our (ongoing) investigation into the index/heap corruption problems I reported a couple of weeks ago. We were able to trigger the problem with kernels 2.6.16, 2.6.17 and 2.6.18.rc1, with 2.6.16 seeming to be the most flaky. By replacing the NFS-mounted netapp with a fibre-channel SAN, we have eliminated the problem on all kernels. From this, it would seem to be an NFS bug introduced post 2.6.14, though we cannot rule out a postgres bug exposed by unusual timing issues.

Our starting systems are:
Sun v40z, 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux (and others)
kernel boot option: elevator=deadline
16 Gigs of RAM
postgresql-8.0.8-1PGDG
Bonded e1000/tg3 NICs with 8192 MTU
Slony 1.1.5
NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU

All postgres data and logs are stored on the netapp. All test results were reproduced with postgres 8.0.8.

__
Marc

On Fri, 2006-06-30 at 23:20 -0400, Tom Lane wrote:
> Marc Munro <[EMAIL PROTECTED]> writes:
> > We tried all of these suggestions and still get the problem. Nothing
> > interesting in the log file so I guess the Asserts did not fire.
>
> Not surprising, it was a long shot that any of those things were really
> broken. But worth testing.
>
> > We are going to try experimenting with different kernels now. Unless
> > anyone has any other suggestions.
>
> Right at the moment I have no better ideas :-(
>
> regards, tom lane
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes: > We tried all of these suggestions and still get the problem. Nothing > interesting in the log file so I guess the Asserts did not fire. Not surprising, it was a long shot that any of those things were really broken. But worth testing. > We are going to try experimenting with different kernels now. Unless > anyone has any other suggestions. Right at the moment I have no better ideas :-( regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote: > One easy thing that would be worth trying is to build with > --enable-cassert and see if any Asserts get provoked during the > A couple other things to try, given that you can provoke the failure > fairly easily: > . . . > 1. In studying the code, it bothers me a bit that P_NEW is the same as > InvalidBlockNumber. The intended uses of P_NEW appear to be adequately > . . . > 2. I'm also eyeing this bit of code in hio.c: > > If someone else has just extended the relation, it's possible that this > will allow a process to get to the page before the intended extender has > . . . We tried all of these suggestions and still get the problem. Nothing interesting in the log file so I guess the Asserts did not fire. We are going to try experimenting with different kernels now. Unless anyone has any other suggestions. __ Marc
Re: [HACKERS] Index corruption
On Fri, 2006-06-30 at 12:05, Jan Wieck wrote:
> On 6/30/2006 11:55 AM, Tom Lane wrote:
>
> > Jan Wieck <[EMAIL PROTECTED]> writes:
> >> On 6/30/2006 11:17 AM, Marko Kreen wrote:
> >>> If the xxid-s come from different DB-s, then there can still be problems.
> >
> >> How so? They are always part of a multi-key index having the
> >> originating node ID first.
> >
> > Really?
> >
> > create table @[EMAIL PROTECTED] (
> >     log_origin int4,
> >     log_xid @[EMAIL PROTECTED],
> >     log_tableid int4,
> >     log_actionseq int8,
> >     log_cmdtype char,
> >     log_cmddata text
> > );
> > create index sl_log_1_idx1 on @[EMAIL PROTECTED]
> >     (log_origin, log_xid @[EMAIL PROTECTED], log_actionseq);
> >
> > create index sl_log_1_idx2 on @[EMAIL PROTECTED]
> >     (log_xid @[EMAIL PROTECTED]);
>
> You're right ... forgot about that one. And yes, there can be
> transactions originating from multiple origins (masters) in the same
> log. The thing is, the index is only there because in a single origin
> situation (most installations are), the log_origin is always the same.
> The optimizer therefore sometimes didn't think using an index at all
> would be good.
>
> However, transactions from different origins are NEVER selected together
> and it wouldn't make sense to compare their xid's anyway. So the index
> might return index tuples for rows from another origin, but the
> following qualifications against the log_origin in the heap tuple will
> filter them out.

The problem was not only returning too many rows, but just as much returning too few. When too few rows are returned, some actions will simply be left out of replication and thus will be missing from the slaves.

--
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Re: [HACKERS] Index corruption
Jan Wieck <[EMAIL PROTECTED]> writes: > You're right ... forgot about that one. > However, transactions from different origins are NEVER selected together > and it wouldn't make sense to compare their xid's anyway. So the index > might return index tuples for rows from another origin, but the > following qualifications against the log_origin in the heap tuple will > filter them out. No, that's not the point here. The point here is that if the xids that are simultaneously present in the index span more than a 2G-XID range, btree *will fail* because it will be dealing with keys that do not obey the transitive law. You do have a problem --- it doesn't explain Marc's troubles, but sl_log_1_idx2 is broken for multi-master situations. All you need is masters with sufficiently different XID counters. regards, tom lane
Re: [HACKERS] Index corruption
Tom Lane wrote: > Brad Nicholson <[EMAIL PROTECTED]> writes: >> It may or may not be the same issue, but for what it's worth, we've seen >> the same sl_log_1 corruption on AIX 5.1 and 5.3 > > Hm, on what filesystem, and what PG version(s)? > > I'm not completely satisfied by the its-a-kernel-bug theory, because if > it were then ISTM extending an index would be subject to the same risks > as extending a table; but I see no evidence of index page lossage in > Marc's dump. OTOH the usage patterns are different, so maybe PG isn't > stressing the write-to-lseek path quite as hard for indexes. > > regards, tom lane jfs2 in all cases. I don't recall the PG version for 5.1, but it was greater than 7.4.8. For 5.3, it was 7.4.12. -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
Re: [HACKERS] Index corruption
Brad Nicholson <[EMAIL PROTECTED]> writes: > It may or may not be the same issue, but for what it's worth, we've seen > the same sl_log_1 corruption on AIX 5.1 and 5.3 Hm, on what filesystem, and what PG version(s)? I'm not completely satisfied by the its-a-kernel-bug theory, because if it were then ISTM extending an index would be subject to the same risks as extending a table; but I see no evidence of index page lossage in Marc's dump. OTOH the usage patterns are different, so maybe PG isn't stressing the write-to-lseek path quite as hard for indexes. regards, tom lane
Re: [HACKERS] Index corruption
Brad Nicholson wrote: > Tom Lane wrote: >> Marc Munro <[EMAIL PROTECTED]> writes: >>> I'll get back to you with kernel build information tomorrow. We'll also >>> try to talk to some kernel hackers about this. >> Some googling turned up recent discussions about race conditions in >> Linux NFS code: >> >> http://threebit.net/mail-archive/linux-kernel/msg05313.html >> http://lkml.org/lkml/2006/3/1/381 >> >> I don't know the kernel nearly well enough to guess if these are related >> ... > > It may or may not be the same issue, but for what it's worth, we've seen > the same sl_log_1 corruption on AIX 5.1 and 5.3 > On 7.4.12, I should add. -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
Re: [HACKERS] Index corruption
On 6/30/2006 11:55 AM, Tom Lane wrote: Jan Wieck <[EMAIL PROTECTED]> writes: On 6/30/2006 11:17 AM, Marko Kreen wrote: If the xxid-s come from different DB-s, then there can still be problems. How so? They are always part of a multi-key index having the originating node ID first. Really? create table @[EMAIL PROTECTED] ( log_origin int4, log_xid @[EMAIL PROTECTED], log_tableid int4, log_actionseq int8, log_cmdtype char, log_cmddata text ); create index sl_log_1_idx1 on @[EMAIL PROTECTED] (log_origin, log_xid @[EMAIL PROTECTED], log_actionseq); create index sl_log_1_idx2 on @[EMAIL PROTECTED] (log_xid @[EMAIL PROTECTED]); You're right ... forgot about that one. And yes, there can be transactions originating from multiple origins (masters) in the same log. The thing is, the index is only there because in a single origin situation (most installations are), the log_origin is always the same. The optimizer therefore sometimes didn't think using an index at all would be good. However, transactions from different origins are NEVER selected together and it wouldn't make sense to compare their xid's anyway. So the index might return index tuples for rows from another origin, but the following qualifications against the log_origin in the heap tuple will filter them out. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] #
Re: [HACKERS] Index corruption
Tom Lane wrote: > Marc Munro <[EMAIL PROTECTED]> writes: >> I'll get back to you with kernel build information tomorrow. We'll also >> try to talk to some kernel hackers about this. > > Some googling turned up recent discussions about race conditions in > Linux NFS code: > > http://threebit.net/mail-archive/linux-kernel/msg05313.html > http://lkml.org/lkml/2006/3/1/381 > > I don't know the kernel nearly well enough to guess if these are related > ... It may or may not be the same issue, but for what it's worth, we've seen the same sl_log_1 corruption on AIX 5.1 and 5.3 -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
Re: [HACKERS] Index corruption
Jan Wieck <[EMAIL PROTECTED]> writes: > On 6/30/2006 11:17 AM, Marko Kreen wrote: >> If the xxid-s come from different DB-s, then there can still be problems. > How so? They are always part of a multi-key index having the > originating node ID first. Really? create table @[EMAIL PROTECTED] ( log_origin int4, log_xid @[EMAIL PROTECTED], log_tableid int4, log_actionseq int8, log_cmdtype char, log_cmddata text ); create index sl_log_1_idx1 on @[EMAIL PROTECTED] (log_origin, log_xid @[EMAIL PROTECTED], log_actionseq); create index sl_log_1_idx2 on @[EMAIL PROTECTED] (log_xid @[EMAIL PROTECTED]); sl_log_1_idx2 doesn't seem to have any such protection. When I was poking at Marc's example, though, it seemed that the numbers going into the table were all *locally generated* XIDs, in fact the same as the XID doing the insertions. If this is only true on the master, and slaves can be inserting XIDs coming from different masters, then I think it will break. regards, tom lane
Re: [HACKERS] Index corruption
On 6/30/2006 11:17 AM, Marko Kreen wrote: On 6/30/06, Jan Wieck <[EMAIL PROTECTED]> wrote: With the final implementation of log switching, the problem of xxid wraparound will be avoided entirely. Every now and then slony will switch from one to another log table and when the old one is drained and logically empty, it is truncated, which should reinitialize all indexes. If the xxid-s come from different DB-s, then there can still be problems. How so? They are always part of a multi-key index having the originating node ID first. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] #
Re: [HACKERS] Index corruption
On 6/30/06, Jan Wieck <[EMAIL PROTECTED]> wrote: With the final implementation of log switching, the problem of xxid wraparound will be avoided entirely. Every now and then slony will switch from one to another log table and when the old one is drained and logically empty, it is truncated, which should reinitialize all indexes. If the xxid-s come from different DB-s, then there can still be problems. -- marko
Re: [HACKERS] Index corruption
On 6/30/2006 9:55 AM, Tom Lane wrote: "Marko Kreen" <[EMAIL PROTECTED]> writes: The sl_log_* tables are indexed on xid, where the relations between values are not exactly stable. When having high enough activity on one node or having nodes with XIDs on different enough positions funny things happen. Yeah, that was one of the first things I thought about, but the range of XIDs involved in these test cases isn't large enough to involve a wraparound, and anyway it's now looking like the problem is loss of heap entries, not index corruption at all. Slony's use of XID comparison semantics for indexes is definitely pretty scary though. If I were them I'd find a way to get rid of it. In theory, since the table is only supposed to contain "recent" XIDs, as long as you keep it vacuumed the index should never contain any inconsistently-comparable XIDs ... but in a big index, the boundary keys for upper-level index pages might hang around an awful long time ... With the final implementation of log switching, the problem of xxid wraparound will be avoided entirely. Every now and then slony will switch from one to another log table and when the old one is drained and logically empty, it is truncated, which should reinitialize all indexes. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] #
Re: [HACKERS] Index corruption
I trawled through the first, larger dump you sent me, and found multiple index entries pointing to quite a few heap tuples:

Occurrences   block    item
     2        43961      1
     2        43961      2
     2        43961      3
     2        43961      4
     2        43961      5
     2        43961      6
     2        43961      7
     2        43961      8
     2        43961      9
     2       119695      1
     2       119695      2
     2       119695      3
     2       126029      1
     2       126029      2
     2       126029      3
     2       166694      1
     2       166865      1
     2       166865      2
     2       166865      3
     2       166865      4
     2       166865      5
     2       166865      6
     2       166865      7
     2       206221      1
     2       247123      1
     2       327775      1
     2       327775      2
     2       327775      3
     2       327775      4
     2       327775      5
     2       327775      6
     2       327775      7
     2       327775      8
     2       327775      9
     2       327775     10
     2       327775     11

Both indexes show identical sets of duplicates, which makes it pretty hard to credit that it's a within-index problem. You mentioned that the test had been allowed to run for a good while after the first slave error was noted. So it seems there's no question that we are looking at some mechanism that allows the first few entries on a heap page to be lost and overwritten :-(, and that this happened several times over the course of the larger run.

regards, tom lane
Re: [HACKERS] Index corruption
"Marko Kreen" <[EMAIL PROTECTED]> writes: > The sl_log_* tables are indexed on xid, where the relations between > values are not exactly stable. When having high enough activity on > one node or having nodes with XIDs on different enough positions > funny things happen. Yeah, that was one of the first things I thought about, but the range of XIDs involved in these test cases isn't large enough to involve a wraparound, and anyway it's now looking like the problem is loss of heap entries, not index corruption at all. Slony's use of XID comparison semantics for indexes is definitely pretty scary though. If I were them I'd find a way to get rid of it. In theory, since the table is only supposed to contain "recent" XIDs, as long as you keep it vacuumed the index should never contain any inconsistently-comparable XIDs ... but in a big index, the boundary keys for upper-level index pages might hang around an awful long time ... regards, tom lane
Re: [HACKERS] Index corruption
On 6/30/06, Tom Lane <[EMAIL PROTECTED]> wrote:
> I don't know the kernel nearly well enough to guess if these are related ...

The sl_log_* tables are indexed on xid, where the relations between values are not exactly stable. When having high enough activity on one node or having nodes with XIDs on different enough positions, funny things happen.

Have you already excluded this as a probable reason?

-- marko
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> I'll get back to you with kernel build information tomorrow. We'll also
> try to talk to some kernel hackers about this.

Some googling turned up recent discussions about race conditions in Linux NFS code:

http://threebit.net/mail-archive/linux-kernel/msg05313.html
http://lkml.org/lkml/2006/3/1/381

I don't know the kernel nearly well enough to guess if these are related ...

regards, tom lane
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 21:59 -0400, Tom Lane wrote:
> [ back to the start of the thread... ]
>
> BTW, a couple of thoughts here:
>
> * If my theory about the low-level cause is correct, then reindexing
> sl_log_1 would make the "duplicate key" errors go away, but nonetheless
> you'd have lost data --- the overwritten rows would be gone. I suppose
> that this would result in the slave missing some rows that are present
> on the master. Have you tried comparing slave and master databases to
> see if you can find any discrepancies?

Haven't done that yet - in test we tend to restart the old subscriber as the new provider and rebuild the cluster. I'll check the logs from our production failure to figure out what to compare and see what I can discover.

> * One way that the problem could happen would be if a race condition in
> the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
> the true end-of-file (as determined by another process' write()
> extending the EOF just slightly earlier --- ie, lseek fails to note the
> effects of the just-completed write, and returns the prior EOF value).
> PG does have internal locking that should guarantee that the lseek is
> not done until after the write completes ... but could there be a bug in
> the kernel allowing stale data to be returned? The SMP hardware is
> relevant (maybe one processor sees different data than the other) and
> frankly I don't trust NFS very far at all for questions such as this.
> It'd be interesting to see if you can reproduce the problem in a
> database on local storage.

Unfortunately we haven't got any local storage that can stand the sort of loads we are putting through. With slower storage the CPUs mostly sit idle and we are very unlikely to trigger a timing-based bug, if that's what it is.

I'll get back to you with kernel build information tomorrow. We'll also try to talk to some kernel hackers about this.

Many thanks for your efforts so far.
__
Marc
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> By dike out, you mean remove? Please confirm and I'll try it.

Right, just remove (or comment out) the lines I quoted.

> We ran this system happily for nearly a year on the
> previous kernel without experiencing this problem (tcp lockups are a
> different matter).

I'm starting to think that a kernel bug is a realistic explanation, see other message. Whose kernel build is this exactly?

regards, tom lane
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
> One easy thing that would be worth trying is to build with
> --enable-cassert and see if any Asserts get provoked during the
> failure case. I don't have a lot of hope for that, but it's
> something that would require only machine time not people time.

I'll try this tomorrow.

> A couple other things to try, given that you can provoke the failure
> fairly easily:
>
> 1. In studying the code, it bothers me a bit that P_NEW is the same as
> InvalidBlockNumber. The intended uses of P_NEW appear to be adequately
> interlocked, but it's fairly easy to see how something like this could
> happen if there are any places where InvalidBlockNumber is
> unintentionally passed to ReadBuffer --- that would look like a P_NEW
> call and it *wouldn't* be interlocked. So it would be worth changing
> P_NEW to "(-2)" (this should just take a change in bufmgr.h and
> recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
> at the head of ReadBufferInternal(). Then rebuild with asserts enabled
> and see if the failure case provokes that assert.

I'll try this too.

> 2. I'm also eyeing this bit of code in hio.c:
>
>     /*
>      * If the FSM knows nothing of the rel, try the last page before
>      * we give up and extend.  This avoids one-tuple-per-page syndrome
>      * during bootstrapping or in a recently-started system.
>      */
>     if (targetBlock == InvalidBlockNumber)
>     {
>         BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
>
>         if (nblocks > 0)
>             targetBlock = nblocks - 1;
>     }
>
> If someone else has just extended the relation, it's possible that this
> will allow a process to get to the page before the intended extender has
> finished initializing it. AFAICT that's not harmful because the page
> will look like it has no free space ... but it seems a bit fragile.
> If you dike out the above-mentioned code, can you still provoke the
> failure?

By dike out, you mean remove? Please confirm and I'll try it.
> A different line of attack is to see if you can make a self-contained
> test case so other people can try to reproduce it. More eyeballs on the
> problem are always better.

Can't really see this being possible. This is clearly a very unusual problem, and without similar hardware I doubt that anyone else will trigger it. We ran this system happily for nearly a year on the previous kernel without experiencing this problem (tcp lockups are a different matter). Also, the load is provided by a bunch of servers and robots simulating rising and falling load.

> Lastly, it might be interesting to look at the WAL logs for the period
> leading up to a failure. This would give us an idea of what was
> happening concurrently with the processes that seem directly involved.

Next time we reproduce it, I'll take a copy of the WAL files too.

__
Marc
Re: [HACKERS] Index corruption
[ back to the start of the thread... ]

Marc Munro <[EMAIL PROTECTED]> writes:
> We have now experienced index corruption on two separate but identical
> slony clusters. In each case the slony subscriber failed after
> attempting to insert a duplicate record. In each case reindexing the
> sl_log_1 table on the provider fixed the problem.
>
> Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
> Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
> ...
> NetApp FAS270 OnTap 7.0.3
> Mounted with the NFS options
> rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
> Jumbo frames 8192 MTU.
> All postgres data and logs are stored on the netapp.

BTW, a couple of thoughts here:

* If my theory about the low-level cause is correct, then reindexing sl_log_1 would make the "duplicate key" errors go away, but nonetheless you'd have lost data --- the overwritten rows would be gone. I suppose that this would result in the slave missing some rows that are present on the master. Have you tried comparing slave and master databases to see if you can find any discrepancies?

* One way that the problem could happen would be if a race condition in the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than the true end-of-file (as determined by another process' write() extending the EOF just slightly earlier --- ie, lseek fails to note the effects of the just-completed write, and returns the prior EOF value). PG does have internal locking that should guarantee that the lseek is not done until after the write completes ... but could there be a bug in the kernel allowing stale data to be returned? The SMP hardware is relevant (maybe one processor sees different data than the other) and frankly I don't trust NFS very far at all for questions such as this. It'd be interesting to see if you can reproduce the problem in a database on local storage.
regards, tom lane
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> If there's anything we can do to help debug this we will. We can apply
> patches, different build options, etc.

One easy thing that would be worth trying is to build with --enable-cassert and see if any Asserts get provoked during the failure case. I don't have a lot of hope for that, but it's something that would require only machine time, not people time.

A couple other things to try, given that you can provoke the failure fairly easily:

1. In studying the code, it bothers me a bit that P_NEW is the same as InvalidBlockNumber. The intended uses of P_NEW appear to be adequately interlocked, but it's fairly easy to see how something like this could happen if there are any places where InvalidBlockNumber is unintentionally passed to ReadBuffer --- that would look like a P_NEW call and it *wouldn't* be interlocked. So it would be worth changing P_NEW to "(-2)" (this should just take a change in bufmgr.h and recompile) and adding an "Assert(blockNum != InvalidBlockNumber)" at the head of ReadBufferInternal(). Then rebuild with asserts enabled and see if the failure case provokes that assert.

2. I'm also eyeing this bit of code in hio.c:

    /*
     * If the FSM knows nothing of the rel, try the last page before
     * we give up and extend.  This avoids one-tuple-per-page syndrome
     * during bootstrapping or in a recently-started system.
     */
    if (targetBlock == InvalidBlockNumber)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

        if (nblocks > 0)
            targetBlock = nblocks - 1;
    }

If someone else has just extended the relation, it's possible that this will allow a process to get to the page before the intended extender has finished initializing it. AFAICT that's not harmful because the page will look like it has no free space ... but it seems a bit fragile. If you dike out the above-mentioned code, can you still provoke the failure?
A different line of attack is to see if you can make a self-contained test case so other people can try to reproduce it. More eyeballs on the problem are always better.

Lastly, it might be interesting to look at the WAL logs for the period leading up to a failure. This would give us an idea of what was happening concurrently with the processes that seem directly involved.

regards, tom lane
Re: [HACKERS] index corruption
On Thu, 2006-06-29 at 19:59 -0400, Tom Lane wrote:
> Ummm ... you did restart the server? "select version();" would be
> the definitive test.

Can't say I blame you for the skepticism, but I have confirmed it again.

test=# select version();
                                                version
--------------------------------------------------------------------------------------------------------
 PostgreSQL 8.0.8 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3)
(1 row)

__
Marc
Re: [HACKERS] Index corruption
I wrote:
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row. We fixed one bug like this before ...

Further study makes this look more probable. It seems that most of the activity around the trouble spot consists of transactions inserting exactly 13 rows into the table; for instance, here are traces of two successive transactions:

Block   Item   XMIN        CMIN
27800     7    600921860      3
27800     8    600921860      7
27800     9    600921860     11
27800    10    600921860     15
27800    11    600921860     19
27800    12    600921860     23
27800    13    600921860     27
27800    14    600921860     31
27800    15    600921860     35
27800    16    600921860     39
27800    17    600921860     43
27800    18    600921860     47
27800    19    600921860     51
27800    20    600921870      3
27800    21    600921870      7
27800    22    600921870     11
27800    23    600921870     15
27800    24    600921870     19
27800    25    600921870     23
27800    26    600921870     27
27800    27    600921870     31
27800    28    600921870     35
27800    29    600921870     39
27800    30    600921870     43
27800    31    600921870     47
27800    32    600921870     51

The pattern of CMIN values is the same for all these transactions. But look at the rows inserted by 600921856 and 600921858, the two transactions that seem to be involved with the problem on page 27806:

27787    50    600921856      3
27795    41    600921858      3
27795    42    600921858      7
27795    43    600921858     11
27795    44    600921858     15
27795    45    600921858     19
27795    46    600921858     23
27795    47    600921858     27
27795    48    600921858     31
27795    49    600921858     35
27795    50    600921858     39
27795    51    600921858     43
27806     1    600921858     47
27806     2    600921856     15
27806     3    600921856     19
27806     4    600921856     23
27806     5    600921856     27
27806     6    600921856     31
27806     7    600921856     35
27806     8    600921856     39
27806     9    600921856     43
27806    10    600921856     47
27806    11    600921856     51
27806    12    600921858     51

(27787,50) and (27795,51) are both the last rows on their pages.
What's evidently happened is that the two transactions filled those pages and then both seized on 27806 as the next page to insert into. I think that 600921856 tried to insert its CMIN 7 and 11 rows as items 1 and 2 on that page, and then something wiped the page, then 600921858 inserted its CMIN 47 row as item 1, and then 600921856 got control back and finished inserting its rows.

Further study of the indexes shows that there are two entries in each index pointing to each of (27806,1) and (27806,2) --- but since the xxid values are different for the two (27806,1) entries, those didn't show up as duplicates in my first look.

Given the apparent connection to vacuuming, this is looking a WHOLE lot like this bug fixed in 8.0.3:

2005-05-07 17:32  tgl

	* src/backend/: access/heap/hio.c, commands/vacuumlazy.c
	(REL7_3_STABLE), access/heap/hio.c, access/nbtree/nbtpage.c,
	access/nbtree/nbtree.c, commands/vacuumlazy.c (REL7_4_STABLE),
	access/heap/hio.c, commands/vacuumlazy.c (REL7_2_STABLE),
	access/heap/hio.c, access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
	commands/vacuumlazy.c (REL8_0_STABLE), access/heap/hio.c,
	access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
	commands/vacuumlazy.c: Repair very-low-probability race condition
	between relation extension and VACUUM: in the interval between
	adding a new page to the relation and formatting it, it was
	possible for VACUUM to come along and decide it should format the
	page too. Though not harmful in itself, this would cause data loss
	if a third transaction were able to insert tuples into the vacuumed
	page before the original extender got control back.

Are you *certain* this slave isn't running 8.0.2 or older? If you can verify that, then I guess we need to look for another mechanism that could cause the same kind of thing.

regards, tom lane
Re: [HACKERS] Index corruption
On Fri, 2006-06-30 at 00:37 +0300, Hannu Krosing wrote:
> Marc: do you have triggers on some replicated tables ?

We have a non-slony trigger on only 2 tables, neither of them involved in this transaction. We certainly have no circular trigger structures.

> I remember having some corruption in a database with weird circular
> trigger structures, some of them being slony log triggers.
>
> The thing that seemed to mess something up was when a change on a
> parent row fired a trigger that changed child table rows, and those
> rows fired another trigger that changed the same parent row again.

__
Marc
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 17:23, Tom Lane wrote:
> Marc Munro <[EMAIL PROTECTED]> writes:
> > Tom,
> > we have a newer and much smaller (35M) file showing the same thing:
>
> Thanks. Looking into this, what I find is that *both* indexes have
> duplicated entries for the same heap tuple:
> ...
> However, the two entries in idx1 contain different data!!
>
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row. We fixed one bug like this before ...

Marc: do you have triggers on some replicated tables ?

I remember having some corruption in a database with weird circular trigger structures, some of them being slony log triggers. The thing that seemed to mess something up was when a change on a parent row fired a trigger that changed child table rows, and those rows fired another trigger that changed the same parent row again.

-- 
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 16:42, Chris Browne wrote:
> [EMAIL PROTECTED] (Marc Munro) writes:
> > As you see, slony is attempting to enter one tuple
> > ('374520943','22007','0') two times.
> >
> > Each previous time we have had this problem, rebuilding the indexes on
> > the slony log table (sl_log_1) has fixed the problem. I have not
> > reindexed the table this time as I do not want to destroy any usable
> > evidence.
>
> We have seen this phenomenon on 7.4.8 several times; pulled dumps of
> sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
> led to a fix.
>
> He did, mind you, find some concurrency pattern that led, if memory
> serves, to 7.4.12's release. We had experienced cases where there was
> some worse corruption that required that we rebuild replicas from
> scratch :-(.

How well did you check the C-language triggers and special slony functions for possibly corrupting some backend/shared-mem structures ?

-- 
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> Tom,
> we have a newer and much smaller (35M) file showing the same thing:

Thanks. Looking into this, what I find is that *both* indexes have duplicated entries for the same heap tuple:

idx1:

Item 190 -- Length: 24  Offset: 3616 (0x0e20)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 24
  Has Nulls: 0  Has Varwidths: 0
  0e20: 9e6c 02001800 0100 0057d123  ...l.W.#
  0e30: 1a781200                     .x..

Item 191 -- Length: 24  Offset: 3592 (0x0e08)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 24
  Has Nulls: 0  Has Varwidths: 0
  0e08: 9e6c 02001800 0100 0057d123  ...l.W.#
  0e18: 2e781200                     .x..

idx2:

Item 127 -- Length: 16  Offset: 6144 (0x1800)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 16
  Has Nulls: 0  Has Varwidths: 0
  1800: 9e6c 02001000 0057d123  ...l.W.#

Item 128 -- Length: 16  Offset: 6128 (0x17f0)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 16
  Has Nulls: 0  Has Varwidths: 0
  17f0: 9e6c 02001000 0057d123  ...l.W.#

However, the two entries in idx1 contain different data!!

What I speculate right at the moment is that we are not looking at index corruption at all, but at heap corruption: somehow, the first insertion into ctid (27806,2) got lost and the same ctid got re-used for the next inserted row. We fixed one bug like this before ...

regards, tom lane
Re: [HACKERS] Index corruption
[EMAIL PROTECTED] (Marc Munro) writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.
>
> Each previous time we have had this problem, rebuilding the indexes on
> the slony log table (sl_log_1) has fixed the problem. I have not
> reindexed the table this time as I do not want to destroy any usable
> evidence.

We have seen this phenomenon on 7.4.8 several times; pulled dumps of sl_log_1 and index files that Jan Wieck looked at, which alas hasn't led to a fix.

He did, mind you, find some concurrency pattern that led, if memory serves, to 7.4.12's release. We had experienced cases where there was some worse corruption that required that we rebuild replicas from scratch :-(.

-- 
output = reverse("gro.mca" "@" "enworbbc")
http://cbbrowne.com/info/advocacy.html
"There is no psychiatrist in the world like a puppy licking your face."
-- Ben Williams
Re: [HACKERS] Index corruption
We have reproduced the problem again. This time it looks like vacuum is not part of the picture.

From the provider's log:

2006-06-29 14:02:41 CST DEBUG2 cleanupThread: 101.057 seconds for vacuuming

And from the subscriber's:

2006-06-29 13:00:43 PDT ERROR remoteWorkerThread_1: "insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374740387','22008','4000');

If my maths is correct and the logs are honest, the vacuum would have started at 14:01:00 CST (13:01:00 PDT), about 20 seconds after we first encountered the problem. The clocks on the two machines, though in different timezones, are currently synced.

Tom, I will create another tarball of the sl_log_1 table and indexes. Should be quite a bit smaller than the previous one.

__
Marc
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.
>
> Each previous time we have had this problem, rebuilding the indexes on
> the slony log table (sl_log_1) has fixed the problem. I have not
> reindexed the table this time as I do not want to destroy any usable
> evidence.

Good. How big is the log table --- would it be possible for you to send me the physical table and index files? (Not a pg_dump, the actual disk files.)

regards, tom lane
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 12:11 -0400, Tom Lane wrote:
> OK, so it's not an already-known problem.
>
> > We were able to corrupt the index within 90 minutes of starting up our
> > cluster. A slony-induced vacuum was under way on the provider at the
> > time the subscriber failed.
>
> You still haven't given any details. What do you mean by "corrupt the
> index" --- what actual symptoms are you seeing?

We have a two node slony cluster with load tests running against the provider node. After resyncing the subscriber and running load tests for about an hour, the slony subscriber begins to fail. From the log file:

2006-06-28 17:58:43 PDT ERROR remoteWorkerThread_1: "insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22001','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22002','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22003','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22004','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22005','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22006','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520942','22009','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520942','22010','0');
"
ERROR: duplicate key violates unique constraint "table_trans_attr_pk"

As you see, slony is attempting to enter one tuple ('374520943','22007','0') two times.

Each previous time we have had this problem, rebuilding the indexes on the slony log table (sl_log_1) has fixed the problem. I have not reindexed the table this time as I do not want to destroy any usable evidence.

__
Marc
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> On Tom Lane's advice, we upgraded to Postgres 8.0.8.

OK, so it's not an already-known problem.

> We were able to corrupt the index within 90 minutes of starting up our
> cluster. A slony-induced vacuum was under way on the provider at the
> time the subscriber failed.

You still haven't given any details. What do you mean by "corrupt the index" --- what actual symptoms are you seeing?

regards, tom lane
Re: [HACKERS] Index corruption
On Tom Lane's advice, we upgraded to Postgres 8.0.8. We also upgraded slony to 1.1.5, due to some rpm issues. Apart from that, everything is as described below.

We were able to corrupt the index within 90 minutes of starting up our cluster. A slony-induced vacuum was under way on the provider at the time the subscriber failed.

What can we do to help identify the cause of this? We have a test system that seems able to reproduce this fairly easily.

__
Marc

On Wed, 2006-06-28 at 09:28 -0700, Marc Munro wrote:
> We have now experienced index corruption on two separate but identical
> slony clusters. In each case the slony subscriber failed after
> attempting to insert a duplicate record. In each case reindexing the
> sl_log_1 table on the provider fixed the problem.
>
> The latest occurrence was on our production cluster yesterday. This has
> only happened since we performed kernel upgrades, and we are uncertain
> whether this represents a kernel bug or a postgres bug exposed by
> different timings in the new kernel.
>
> Our systems are:
>
> Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
> Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
> kernel boot option: elevator=deadline
> 16 Gigs of RAM
> postgresql-8.0.3-1PGDG
> Bonded e1000/tg3 NICs with 8192 MTU.
> Slony 1.1.0
>
> NetApp FAS270 OnTap 7.0.3
> Mounted with the NFS options
> rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
> Jumbo frames 8192 MTU.
>
> All postgres data and logs are stored on the netapp.
>
> In the latest episode, the index corruption was coincident with a
> slony-induced vacuum. I don't know if this was the case with our test
> system failures.
>
> __
> Marc
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> We have now experienced index corruption on two separate but identical
> slony clusters. In each case the slony subscriber failed after
> attempting to insert a duplicate record. In each case reindexing the
> sl_log_1 table on the provider fixed the problem.

Please be more specific about what you mean by "index corruption" ... what were the exact symptoms?

> postgresql-8.0.3-1PGDG

The *first* thing you should do is update to 8.0.8 and see if the problem is still there.

regards, tom lane
Re: [HACKERS] index corruption?
On Mon, 31 Mar 2003, Ed L. wrote:
> On Feb 13, 2003, Tom Lane wrote:
> >
> > Laurette Cisneros <[EMAIL PROTECTED]> writes:
> > > This is the error in the pgsql log:
> > > 2003-02-13 16:21:42 [8843] ERROR: Index external_signstops_pkey is
> > > not a btree
> >
> > This says that one of two fields that should never change, in fixed
> > positions in the first block of a btree index, didn't have the right
> > values. I am not aware of any PG bugs that could overwrite those
> > fields. I think the most likely bet is that you've got hardware
> > issues ... have you run memory and disk diagnostics lately?
>
> I am seeing this same problem on two separate machines, one brand new, one
> older. Not sure yet what is causing it, but seems pretty unlikely that it
> is hardware-related.

Until you've tested them, the likelihood is unimportant. If you've tested the boxes, and the memory tests good and the hard drives test good, then there is still likely to be another explanation, like a runaway kernel bug writing somewhere it shouldn't every fifth eon or two. If you haven't tested the boxes, their reliability is part of the NULL set. :-)
Re: [HACKERS] index corruption?
On Monday March 31 2003 3:54, Tom Lane wrote:
> "Ed L." <[EMAIL PROTECTED]> writes:
> >> I am seeing this same problem on two separate machines, one brand new,
> >> one older. Not sure yet what is causing it, but seems pretty unlikely
> >> that it is hardware-related.
> >
> > I am dabbling for the first time with a (crashing) C trigger, so that
> > may be the culprit here.
>
> Could well be, although past experience has been that crashes in C code
> seldom lead directly to disk corruption. (First, the bogus code has to
> overwrite a shared disk buffer. If you follow what I consider the
> better path of not making your shared buffers a large fraction of the
> address space, the odds of a wild store happening to hit a disk buffer
> aren't high. Second, once it's corrupted a shared buffer, it has to
> contrive to cause that buffer to get written out before the core dump
> occurs --- in most cases, the fact that the postmaster abandons the
> contents of shared memory after a backend crash protects us from this
> kind of failure.)
>
> When you find the problem, please take note of whether there's something
> involved that increases the chances of corruption getting to disk. We
> might want to try to do something about it ...

It is definitely due to some rogue trigger code. Not sure what exactly, but if I remove a certain code segment the problem disappears.

Ed