Re: Online checksums verification in the backend
On Tue, Nov 03, 2020 at 06:46:12PM +0900, Michael Paquier wrote: > Yep, that's clear. I'll deal with that tomorrow. That's more than a > simple rework. This part is now done as of e152506a. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Mon, Nov 2, 2020 at 2:35 PM Andres Freund wrote: > Wouldn't this be better served by having a ReadBufferExtended() flag, > preventing erroring out and zeroing the buffer? I'm not sure that > handling both the case where the buffer contents need to be valid and > the one where it doesn't will make for a good API. I'm not sure. The goal I had in mind was giving a caller a way to get a copy of a buffer even if it's one we wouldn't normally admit into shared_buffers. I think it's probably a bad idea to allow for a back door where things that fail PageIsVerified() can nevertheless escape into the buffer, but that doesn't mean a checker or recovery tool shouldn't be allowed to see them. > > If the buffer is in shared buffers, we could take a share-lock > > on the buffer and copy the contents of the page as it exists on disk, > > and then still check it. > > Don't think we need a share lock. That still allows the buffer to be > written out (and thus a torn read). What we need is to set > BM_IO_IN_PROGRESS on the buffer in question - only one backend can set > that. And then unset that again, without unsetting > BM_DIRTY/BM_JUST_DIRTIED. OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Online checksums verification in the backend
On Mon, Nov 02, 2020 at 11:34:57AM -0800, Andres Freund wrote: > On 2020-11-02 12:35:30 -0500, Robert Haas wrote: >> On Thu, Oct 29, 2020 at 2:17 PM Andres Freund wrote: >>> I think this needs to be quickly reworked or reverted. > > I think it's fairly clear by now that revert is appropriate for now. Yep, that's clear. I'll deal with that tomorrow. That's more than a simple rework. >> I don't like this patch, either. In addition to what Andres mentioned, >> CheckBuffer() is a completely-special case mechanism which can't be >> reused by anything else. In particular, the amcheck for heap stuff >> which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1) >> would really like a way to examine a buffer without risking an error >> if PageIsVerified() should happen to fail, but this patch is of >> absolutely no use in getting there, because CheckBuffer() doesn't give >> the caller any way to access the contents of the buffer. It can only >> do the checks that it knows how to do, and that's it. That doesn't >> seem like a good design. > > Wouldn't this be better served by having a ReadBufferExtended() flag, > preventing erroring out and zeroing the buffer? I'm not sure that > handling both the case where the buffer contents need to be valid and > the one where it doesn't will make for a good API. If you grep for ReadBuffer_common() is some of the emails I sent.. I was rather interested in something like that. >> I don't like the fact that CheckBuffer() silently skips dirty buffers, >> either. The comment should really say that it checks the state of a >> buffer without loading it into shared buffers, except sometimes it >> doesn't actually check it. > > Yea, I don't see a good reason for that either. There's reasons for > dirty buffers that aren't WAL logged - so if the on-disk page is broken, > a standby taken outside pg_basebackup would possibly still end up with a > corrupt on-disk page. Similar with a crash restart. Er, if you don't skip dirty buffers, wouldn't you actually report some pages as broken if attempting to run those in a standby who may have some torn pages from a previous base backup? You could still run into problems post-promotion, until the first checkpoint post-recovery happens, no? >> If the buffer is in shared buffers, we could take a share-lock >> on the buffer and copy the contents of the page as it exists on disk, >> and then still check it. > > Don't think we need a share lock. That still allows the buffer to be > written out (and thus a torn read). What we need is to set > BM_IO_IN_PROGRESS on the buffer in question - only one backend can set > that. And then unset that again, without unsetting > BM_DIRTY/BM_JUST_DIRTIED. If that can work, we could make use of some of that for base backups for a single retry of a page that initially failed. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Tue, Nov 03, 2020 at 09:31:20AM +0100, Magnus Hagander wrote: > On Mon, Nov 2, 2020 at 8:35 PM Andres Freund wrote: >> On 2020-11-02 12:35:30 -0500, Robert Haas wrote: >>> It feels really confusing to me that the user-exposed function here is >>> called pg_relation_check_pages(). How is the user supposed to >>> understand the difference between what this function does and what the >>> new verify_heapam() in amcheck does? The answer is that the latter >>> does far more extensive checks, but this isn't obvious from the SGML >>> documentation, which says only that the blocks are "verified," as if >>> an end-user can reasonably be expected to know what that means. It >>> seems likely to lead users to the belief that if this function passes, >>> they are in good shape, which is extremely far from being true. Just >>> look at what PageIsVerified() checks compared to what verify_heapam() >>> checks. The cases of verify_heapam() are much wider as they target only one AM, while this stuff should remain more general. There seems to be some overlap in terms of the basic checks done by bufmgr.c, and the fact that you may not want to be that much intrusive with the existing buffer pool as well when running the AM checks. It also seems to me that the use cases are quite different for both, the original goal of this thread is to detect physical corruptions for all AMs, while verify_heapam() looks after logical corruptions in the way heap is handled. >> Yea I had similar thoughts, it should just be called >> pg_checksum_verify_relation() or something. > > +1. I mentioned that upthread, is there really a dependency with checksums here? There are two cases where we can still apply some checks on a page, without any need of checksums: - The state of the page header. - Zeroed page if pd_upper is 0. Those pages are valid, and don't have a checksum computed. So it seems to me that when it comes to relation pages, then the check of a page should answer to the question: is this page loadable in shared buffers, or not? >>> In fact, I would argue that this functionality ought to live in >>> amcheck rather than core, though there could usefully be enabling >>> functions in core. >> >> I'm not really convinced by this though. It's not really AM >> specific - works for all types of relations with storage; don't really >> object either... > > Yeah, I'm not sure about that one either. Also what would happen > if/when we get checksums on things that aren't even relations? (though > maybe that goes for other parts of amcheck at some point as well?) I also thought about amcheck when looking at this thread, but it did not seem the right place as this applies to any AM able that could load stuff into the shared buffers. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Mon, Nov 2, 2020 at 8:35 PM Andres Freund wrote: > > Hi, > > On 2020-11-02 12:35:30 -0500, Robert Haas wrote: > > It feels really confusing to me that the user-exposed function here is > > called pg_relation_check_pages(). How is the user supposed to > > understand the difference between what this function does and what the > > new verify_heapam() in amcheck does? The answer is that the latter > > does far more extensive checks, but this isn't obvious from the SGML > > documentation, which says only that the blocks are "verified," as if > > an end-user can reasonably be expected to know what that means. It > > seems likely to lead users to the belief that if this function passes, > > they are in good shape, which is extremely far from being true. Just > > look at what PageIsVerified() checks compared to what verify_heapam() > > checks. > > Yea I had similar thoughts, it should just be called > pg_checksum_verify_relation() or something. > +1. > > In fact, I would argue that this functionality ought to live in > > amcheck rather than core, though there could usefully be enabling > > functions in core. > > I'm not really convinced by this though. It's not really AM > specific - works for all types of relations with storage; don't really > object either... Yeah, I'm not sure about that one either. Also what would happen if/when we get checksums on things that aren't even relations? (though maybe that goes for other parts of amcheck at some point as well?) -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: Online checksums verification in the backend
Hi, On 2020-11-02 12:35:30 -0500, Robert Haas wrote: > On Thu, Oct 29, 2020 at 2:17 PM Andres Freund wrote: > > I think this needs to be quickly reworked or reverted. I think it's fairly clear by now that revert is appropriate for now. > I don't like this patch, either. In addition to what Andres mentioned, > CheckBuffer() is a completely-special case mechanism which can't be > reused by anything else. In particular, the amcheck for heap stuff > which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1) > would really like a way to examine a buffer without risking an error > if PageIsVerified() should happen to fail, but this patch is of > absolutely no use in getting there, because CheckBuffer() doesn't give > the caller any way to access the contents of the buffer. It can only > do the checks that it knows how to do, and that's it. That doesn't > seem like a good design. Wouldn't this be better served by having a ReadBufferExtended() flag, preventing erroring out and zeroing the buffer? I'm not sure that handling both the case where the buffer contents need to be valid and the one where it doesn't will make for a good API. > I don't like the fact that CheckBuffer() silently skips dirty buffers, > either. The comment should really say that it checks the state of a > buffer without loading it into shared buffers, except sometimes it > doesn't actually check it. Yea, I don't see a good reason for that either. There's reasons for dirty buffers that aren't WAL logged - so if the on-disk page is broken, a standby taken outside pg_basebackup would possibly still end up with a corrupt on-disk page. Similar with a crash restart. > If the buffer is in shared buffers, we could take a share-lock > on the buffer and copy the contents of the page as it exists on disk, > and then still check it. Don't think we need a share lock. That still allows the buffer to be written out (and thus a torn read). What we need is to set BM_IO_IN_PROGRESS on the buffer in question - only one backend can set that. And then unset that again, without unsetting BM_DIRTY/BM_JUST_DIRTIED. > It feels really confusing to me that the user-exposed function here is > called pg_relation_check_pages(). How is the user supposed to > understand the difference between what this function does and what the > new verify_heapam() in amcheck does? The answer is that the latter > does far more extensive checks, but this isn't obvious from the SGML > documentation, which says only that the blocks are "verified," as if > an end-user can reasonably be expected to know what that means. It > seems likely to lead users to the belief that if this function passes, > they are in good shape, which is extremely far from being true. Just > look at what PageIsVerified() checks compared to what verify_heapam() > checks. Yea I had similar thoughts, it should just be called pg_checksum_verify_relation() or something. > In fact, I would argue that this functionality ought to live in > amcheck rather than core, though there could usefully be enabling > functions in core. I'm not really convinced by this though. It's not really AM specific - works for all types of relations with storage; don't really object either... Greetings, Andres Freund
Re: Online checksums verification in the backend
On Thu, Oct 29, 2020 at 2:17 PM Andres Freund wrote: > I think this needs to be quickly reworked or reverted. I don't like this patch, either. In addition to what Andres mentioned, CheckBuffer() is a completely-special case mechanism which can't be reused by anything else. In particular, the amcheck for heap stuff which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1) would really like a way to examine a buffer without risking an error if PageIsVerified() should happen to fail, but this patch is of absolutely no use in getting there, because CheckBuffer() doesn't give the caller any way to access the contents of the buffer. It can only do the checks that it knows how to do, and that's it. That doesn't seem like a good design. I don't like the fact that CheckBuffer() silently skips dirty buffers, either. The comment should really say that it checks the state of a buffer without loading it into shared buffers, except sometimes it doesn't actually check it. That doesn't seem like the behavior users really want, and it's not clear that there is any really good reason for it. If the buffer is in shared buffers, we could take a share-lock on the buffer and copy the contents of the page as it exists on disk, and then still check it. It feels really confusing to me that the user-exposed function here is called pg_relation_check_pages(). How is the user supposed to understand the difference between what this function does and what the new verify_heapam() in amcheck does? The answer is that the latter does far more extensive checks, but this isn't obvious from the SGML documentation, which says only that the blocks are "verified," as if an end-user can reasonably be expected to know what that means. It seems likely to lead users to the belief that if this function passes, they are in good shape, which is extremely far from being true. Just look at what PageIsVerified() checks compared to what verify_heapam() checks. In fact, I would argue that this functionality ought to live in amcheck rather than core, though there could usefully be enabling functions in core. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Online checksums verification in the backend
On Sun, Nov 01, 2020 at 05:50:06PM -0800, Andres Freund wrote: > On 2020-11-02 10:45:00 +0900, Michael Paquier wrote: > > On 2020-11-02 10:05:25 +0900, Michael Paquier wrote: > > > > There is no place doing that on HEAD. > > > > > > Err? > > > How is this not doing IO while holding a buffer mapping lock? > > > > Well, other than the one we are discussing of course :) > > I am not following? Were you just confirming that its not a thing we do? I meant that this is not done in any place other than the one introduced by c780a7a. So we have one place where it happens, and no places before c780a7a. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
Hi On 2020-11-02 10:45:00 +0900, Michael Paquier wrote: > On Sun, Nov 01, 2020 at 05:39:40PM -0800, Andres Freund wrote: > > I'm a bit limited writing - one handed for a while following an injury > > on Friday... > > Oops. Take care. Thanks! > > On 2020-11-02 10:05:25 +0900, Michael Paquier wrote: > > > There is no place doing that on HEAD. > > > > Err? > > How is this not doing IO while holding a buffer mapping lock? > > Well, other than the one we are discussing of course :) I am not following? Were you just confirming that its not a thing we do? Greetings, Andres Freund
Re: Online checksums verification in the backend
On Sun, Nov 01, 2020 at 05:39:40PM -0800, Andres Freund wrote: > I'm a bit limited writing - one handed for a while following an injury > on Friday... Oops. Take care. > On 2020-11-02 10:05:25 +0900, Michael Paquier wrote: > > There is no place doing that on HEAD. > > Err? > > /* see if the block is in the buffer pool or not */ > LWLockAcquire(partLock, LW_SHARED); > buf_id = BufTableLookup(_tag, buf_hash); > > [...] > > How is this not doing IO while holding a buffer mapping lock? Well, other than the one we are discussing of course :) > > > > This specific point was mentioned in the first message of this thread, > > 7th paragraph. That was a long thread, so it is easy to miss: > > https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5q0y+wulml5sr...@mail.gmail.com > > The code clearly doesnt implement it that way. > > > > I am wondering what you have in mind regarding the use of > > BM_IO_IN_PROGRESS or a similar flag. Wouldn't that imply some > > consequences for other existing buffers in the table, like a possible > > eviction? > > You'd need exactly one empty buffer for that - it can be reused for the > next to-be-checked buffer. > > > > I'd like to think that we should not do any manipulation of > > the buffer tables in this case. > > Why? Its the way we lock buffers - why is this so special that we need > to do differently? > > > > Hence, in order to prevent a > > concurrent activity to load in shared buffers the page currently > > checked on disk, I got to think that we would need something new here, > > like a filtering hash table that would be checked each time a backend > > tries to insert an entry into the buffer tables. > > Thats going to slow down everything a bit - the mapping already is a > bottleneck. > > > > 1) If the buffer is in shared buffers, we have the APIs to solve that > > by using a pin, unlock the partition, and then do the I/O. (Still > > that's unsafe with the smgrwrite() argument?) > > Thats why you need an appropriate relation lock... Something CheckBuffer > didnt bother to document. Its a restriction, but one we probably can > live with. > > > > 2) If the buffer is not in shared buffers, we don't have what it takes > > to solve the problem yet. > > We do. Set up enough state for the case to be otherwise the same as the > in s_b case. > > > But even if we solve this problem, we will > > never really be sure that this is entirely safe, as per the argument > > with concurrent smgrwrite() calls. Current in-core code assumes that > > this can happen only for init forks of unlogged relations which would > > not be visible yet in the backend doing a page check, still it can be > > really easy to break this assumption with any new code added by a new > > feature. > > It also happens in a few other cases than just init forks. But > visibility & relation locking can take care of that. But you need to > document that. If the locking allows concurent readers - and especially > concurrent writers, then you cant really use smgrwite for anything but > relation extension. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
Hi, I'm a bit limited writing - one handed for a while following an injury on Friday... On 2020-11-02 10:05:25 +0900, Michael Paquier wrote: > On Thu, Oct 29, 2020 at 10:08:52PM -0700, Andres Freund wrote: > > I think its pretty much *never* OK to do IO while holding a buffer > > mapping lock. You're locking a significant fraction of shared buffers > > over IO. That's just not OK. Don't think there's any place doing so > > currently either. > > There is no place doing that on HEAD. Err? /* see if the block is in the buffer pool or not */ LWLockAcquire(partLock, LW_SHARED); buf_id = BufTableLookup(_tag, buf_hash); if (buf_id >= 0) { ... /* Read the buffer from disk, with the I/O lock still held */ smgrread(smgr, forknum, blkno, buffer); LWLockRelease(BufferDescriptorGetIOLock(bufdesc)); } else { /* * Simply read the buffer. There's no risk of modification on it as * we are holding the buffer pool partition mapping lock. */ smgrread(smgr, forknum, blkno, buffer); } /* buffer lookup done, so now do its check */ LWLockRelease(partLock); How is this not doing IO while holding a buffer mapping lock? > This specific point was mentioned in the first message of this thread, > 7th paragraph. That was a long thread, so it is easy to miss: > https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5q0y+wulml5sr...@mail.gmail.com The code clearly doesnt implement it that way. > I am wondering what you have in mind regarding the use of > BM_IO_IN_PROGRESS or a similar flag. Wouldn't that imply some > consequences for other existing buffers in the table, like a possible > eviction? You'd need exactly one empty buffer for that - it can be reused for the next to-be-checked buffer. > I'd like to think that we should not do any manipulation of > the buffer tables in this case. Why? Its the way we lock buffers - why is this so special that we need to do differently? > Hence, in order to prevent a > concurrent activity to load in shared buffers the page currently > checked on disk, I got to think that we would need something new here, > like a filtering hash table that would be checked each time a backend > tries to insert an entry into the buffer tables. Thats going to slow down everything a bit - the mapping already is a bottleneck. > 1) If the buffer is in shared buffers, we have the APIs to solve that > by using a pin, unlock the partition, and then do the I/O. (Still > that's unsafe with the smgrwrite() argument?) Thats why you need an appropriate relation lock... Something CheckBuffer didnt bother to document. Its a restriction, but one we probably can live with. > 2) If the buffer is not in shared buffers, we don't have what it takes > to solve the problem yet. We do. Set up enough state for the case to be otherwise the same as the in s_b case. > But even if we solve this problem, we will > never really be sure that this is entirely safe, as per the argument > with concurrent smgrwrite() calls. Current in-core code assumes that > this can happen only for init forks of unlogged relations which would > not be visible yet in the backend doing a page check, still it can be > really easy to break this assumption with any new code added by a new > feature. It also happens in a few other cases than just init forks. But visibility & relation locking can take care of that. But you need to document that. If the locking allows concurent readers - and especially concurrent writers, then you cant really use smgrwite for anything but relation extension. Greetings, Andres Freund
Re: Online checksums verification in the backend
On Thu, Oct 29, 2020 at 10:08:52PM -0700, Andres Freund wrote: > I think its pretty much *never* OK to do IO while holding a buffer > mapping lock. You're locking a significant fraction of shared buffers > over IO. That's just not OK. Don't think there's any place doing so > currently either. There is no place doing that on HEAD. This specific point was mentioned in the first message of this thread, 7th paragraph. That was a long thread, so it is easy to miss: https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5q0y+wulml5sr...@mail.gmail.com I am wondering what you have in mind regarding the use of BM_IO_IN_PROGRESS or a similar flag. Wouldn't that imply some consequences for other existing buffers in the table, like a possible eviction? I'd like to think that we should not do any manipulation of the buffer tables in this case. Hence, in order to prevent a concurrent activity to load in shared buffers the page currently checked on disk, I got to think that we would need something new here, like a filtering hash table that would be checked each time a backend tries to insert an entry into the buffer tables. That's something I was wondering here: https://www.postgresql.org/message-id/20200316030638.ga2...@paquier.xyz I called that a preemptive lock, but you could also call that a discard filter or a virtual pin, just something to mean that a page locked this way cannot be loaded into the shared buffers. I'd like to think that this should not touch the existing buffer table, but it would impact the performance when attempting to insert an entry in the tables, as anything would need to be pre-checked. Assuming that we could make this thing work without holding the partition lock, and assuming that we only hold a share lock on the relation, we have two cases: 1) If the buffer is in shared buffers, we have the APIs to solve that by using a pin, unlock the partition, and then do the I/O. (Still that's unsafe with the smgrwrite() argument?) 2) If the buffer is not in shared buffers, we don't have what it takes to solve the problem yet. But even if we solve this problem, we will never really be sure that this is entirely safe, as per the argument with concurrent smgrwrite() calls. Current in-core code assumes that this can happen only for init forks of unlogged relations which would not be visible yet in the backend doing a page check, still it can be really easy to break this assumption with any new code added by a new feature. These arguments bring down to reduce the scope of CheckBuffer() as follows: - Use an AEL on the relation, pass down a Relation instead of SMgrRelation, and add on the way an assertion to make sure that the caller holds an AEL on the relation. I wanted to study the possiblity to use that stuff for base backups, but if you bring the concurrent smgrwrite() calls into the set of possibilities this shuts down the whole study from the start. - It is still useful to check that a page is in shared buffers IMO, so as if it is dirty we just discard it from the checks and rely on the next checkpoint to do a flush. It is also useful to check the state of the on-disk data is good or not if the page is not dirty, as the page could have gone rogue on-disk while a system was up for weeks. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On 2020-10-30 11:58:13 +0800, Julien Rouhaud wrote: > So I'm assuming that the previous optimization to avoid almost every > time doing an IO while holding a buffer mapping lock isn't an option? > In that case, I don't see any other option than reverting the patch > and discussing a new approach. I think its pretty much *never* OK to do IO while holding a buffer mapping lock. You're locking a significant fraction of shared buffers over IO. That's just not OK. Don't think there's any place doing so currently either.
Re: Online checksums verification in the backend
On Fri, Oct 30, 2020 at 10:58 AM Andres Freund wrote: > > Hi, > > On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote: > > On Fri, Oct 30, 2020 at 2:17 AM Andres Freund wrote: > > > The code does IO while holding the buffer mapping lock. That seems > > > *entirely* unacceptable to me. That basically locks 1/128 of shared > > > buffers against concurrent mapping changes, while reading data that is > > > likely not to be on disk? Seriously? > > > > The initial implementation had a different approach, reading the buffer once > > without holding the buffer mapping lock (which could lead to some false > > positive in some unlikely scenario), and only if a corruption is detected > > the > > read is done once again *while holding the buffer mapping lock* to ensure > > it's > > not a false positive. Some benchmarking showed that the performance was > > worse, > > so we dropped that optimisation. Should we go back to something like that > > or > > do you have a better way to ensure a consistent read of a buffer which > > isn't in > > shared buffers? > > I suspect that you're gonna need something quite different than what the > function is doing right now. Not because such a method will be faster in > isolation, but because there's a chance to have it correct and not have > a significant performance impact onto the rest of the system. > > I've not thought about it in detail yet. Is suspect you'll need to > ensure there is a valid entry in the buffer mapping table for the buffer > you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry > you're going to prevent concurrent IO from starting until your part is > done. So I'm assuming that the previous optimization to avoid almost every time doing an IO while holding a buffer mapping lock isn't an option? In that case, I don't see any other option than reverting the patch and discussing a new approach.
Re: Online checksums verification in the backend
Hi, On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote: > On Fri, Oct 30, 2020 at 2:17 AM Andres Freund wrote: > > The code does IO while holding the buffer mapping lock. That seems > > *entirely* unacceptable to me. That basically locks 1/128 of shared > > buffers against concurrent mapping changes, while reading data that is > > likely not to be on disk? Seriously? > > The initial implementation had a different approach, reading the buffer once > without holding the buffer mapping lock (which could lead to some false > positive in some unlikely scenario), and only if a corruption is detected the > read is done once again *while holding the buffer mapping lock* to ensure it's > not a false positive. Some benchmarking showed that the performance was > worse, > so we dropped that optimisation. Should we go back to something like that or > do you have a better way to ensure a consistent read of a buffer which isn't > in > shared buffers? I suspect that you're gonna need something quite different than what the function is doing right now. Not because such a method will be faster in isolation, but because there's a chance to have it correct and not have a significant performance impact onto the rest of the system. I've not thought about it in detail yet. Is suspect you'll need to ensure there is a valid entry in the buffer mapping table for the buffer you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry you're going to prevent concurrent IO from starting until your part is done. > > Also, why are pages without a valid tag ignored? I can follow the > > argument for skipping it in the DIRTY case, but that doesn't apply for > > BM_TAG_VALID? > > AFAICT pages that aren't BM_TAG_VALID are pages newly allocated. > Those shouldn't > be entirely initialized yet, and they'll be eventually written and flushed. When a page is being read there's a period when the buffer is without BM_TAG_VALID. It's quite possible that the locking prevents this case from being reachable - but in that case you shouldn't just accept it as something to be skipped... > > Also, uh, I don't think the locking of the buffer table provides you > > with the full guarantees CheckBuffer() seems to assume: > > > > * Check the state of a buffer without loading it into the shared buffers. > > To > > * avoid torn pages and possible false positives when reading data, a shared > > * LWLock is taken on the target buffer pool partition mapping, and we check > > * if the page is in shared buffers or not. An I/O lock is taken on the > > block > > * to prevent any concurrent activity from happening. > > > > this doesn't actually prevent all concurrent write IO, unless you hold > > an appropriate lock on the relation. There's a few places that use > > smgrwrite()/smgrextend() to write out data bypassing shared buffers. > > > > Maybe that isn't a problem for the uses of CheckBuffer() is envisioned > > for, but that'd need a pretty detailed explanation as to when it's safe > > to use CheckBuffer() for which blocks. > > AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged > relation, during creation. That may be the case right in core right now, but for one, there definitely are extensions going through smgrwrite() without using the buffer pool. Essentially, what you are saying is that the introduction of CheckBuffer() altered what smgrwrite() is allowed to be used for, without having discussed or documented that. Before this an AM/extension could just use smgrwrite() to write data not in shared buffers, as long as a locking scheme is used that prevents multiple backends from doing that at the same time (trivially: AccessExclusiveLock). > Those relations shouldn't be visible to the caller > snapshot, so it should be safe. I can add a comment for that if I'm not > mistaken. There's no comment warning that you shouldn't use CheckBuffer() to check every buffer in shared buffers, or every relfilenode on disk. The latter would be quite a reasonable thing, given it'd avoid needing to connect to every database etc. > For concurrent smgrextend(), we read the relation size at the beginning of the > function, so we shouldn't read newly allocated blocks. But you're right that > it's still possible to get the size that includes a newly allocated block > that can be concurrently written. We can avoid that be holding a > LOCKTAG_RELATION_EXTEND lock when reading the relation size. Would that be > ok? That could possibly work - but currently CheckBuffer() doesn't get a relation, nor are the comments explaining that it has to be a relation in the current database or anything. I hadn't yet looked at the caller - I just started looking at CheckBuffer() this because it caused compilation failures after rebasing my aio branch onto master (there's no IO locks anymore). Looking at the caller: - This is not a concurrency safe pattern: /* Check if relation exists. leaving if there is no such relation
Re: Online checksums verification in the backend
Hi, On Fri, Oct 30, 2020 at 2:17 AM Andres Freund wrote: > The code does IO while holding the buffer mapping lock. That seems > *entirely* unacceptable to me. That basically locks 1/128 of shared > buffers against concurrent mapping changes, while reading data that is > likely not to be on disk? Seriously? The initial implementation had a different approach, reading the buffer once without holding the buffer mapping lock (which could lead to some false positive in some unlikely scenario), and only if a corruption is detected the read is done once again *while holding the buffer mapping lock* to ensure it's not a false positive. Some benchmarking showed that the performance was worse, so we dropped that optimisation. Should we go back to something like that or do you have a better way to ensure a consistent read of a buffer which isn't in shared buffers? > a pin is cheap. Holding the partition lock is not. > The justification in the in-shared-buffers case seems to completely > mis-judge costs too: > * Found it. Now, retrieve its state to know what to do with > it, and > * release the pin immediately. We do so to limit overhead > as much as > * possible. We keep the shared LWLock on the target buffer > mapping > * partition for now, so this buffer cannot be evicted, and > we acquire > * an I/O Lock on the buffer as we may need to read its > contents from > * disk. > a pin is cheap. Holding the partition lock is not. I clearly did a poor job in that case. Will fix. > Also, using char[BLCKSZ] as a buffer isn't ok. This should use > PGAlignedBlock: I wasn't aware of it, I will fix. > > LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED); > > buf_state = LockBufHdr(bufdesc); > > UnlockBufHdr(bufdesc, buf_state); > > > > /* If the page is dirty or invalid, skip it */ > > if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) > > == 0) > > This is weird as well. What is this supposed to do? Just locking and > unlocking a buffer header doesn't do squat? There's no guarantee that > the flags haven't changed by this point, so you could just as well not > acquire the buffer header lock. This is using the same approach as e.g. WaitIO() to get the state. I agree that the state can change after the buffer header lock has been released, but I think that's something out of scope. The only guarantee that we can give is that the database (or subset of relations checked) was healthy at the time the check was started, provided that your cluster survive the checkpoint happening after the check ended. I don't see how we can do better than that. > Also, why are pages without a valid tag ignored? I can follow the > argument for skipping it in the DIRTY case, but that doesn't apply for > BM_TAG_VALID? AFAICT pages that aren't BM_TAG_VALID are pages newly allocated. Those shouldn't be entirely initialized yet, and they'll be eventually written and flushed. > Also, uh, I don't think the locking of the buffer table provides you > with the full guarantees CheckBuffer() seems to assume: > > * Check the state of a buffer without loading it into the shared buffers. To > * avoid torn pages and possible false positives when reading data, a shared > * LWLock is taken on the target buffer pool partition mapping, and we check > * if the page is in shared buffers or not. An I/O lock is taken on the block > * to prevent any concurrent activity from happening. > > this doesn't actually prevent all concurrent write IO, unless you hold > an appropriate lock on the relation. There's a few places that use > smgrwrite()/smgrextend() to write out data bypassing shared buffers. > > Maybe that isn't a problem for the uses of CheckBuffer() is envisioned > for, but that'd need a pretty detailed explanation as to when it's safe > to use CheckBuffer() for which blocks. AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged relation, during creation. Those relations shouldn't be visible to the caller snapshot, so it should be safe. I can add a comment for that if I'm not mistaken. For concurrent smgrextend(), we read the relation size at the beginning of the function, so we shouldn't read newly allocated blocks. But you're right that it's still possible to get the size that includes a newly allocated block that can be concurrently written. We can avoid that be holding a LOCKTAG_RELATION_EXTEND lock when reading the relation size. Would that be ok?
Re: Online checksums verification in the backend
Hi, On 2020-10-29 11:17:29 -0700, Andres Freund wrote: > The code does IO while holding the buffer mapping lock. That seems > *entirely* unacceptable to me. That basically locks 1/128 of shared > buffers against concurrent mapping changes, while reading data that is > likely not to be on disk? Seriously? Also, uh, I don't think the locking of the buffer table provides you with the full guarantees CheckBuffer() seems to assume: * Check the state of a buffer without loading it into the shared buffers. To * avoid torn pages and possible false positives when reading data, a shared * LWLock is taken on the target buffer pool partition mapping, and we check * if the page is in shared buffers or not. An I/O lock is taken on the block * to prevent any concurrent activity from happening. this doesn't actually prevent all concurrent write IO, unless you hold an appropriate lock on the relation. There's a few places that use smgrwrite()/smgrextend() to write out data bypassing shared buffers. Maybe that isn't a problem for the uses of CheckBuffer() is envisioned for, but that'd need a pretty detailed explanation as to when it's safe to use CheckBuffer() for which blocks. Greetings, Andres Freund
Re: Online checksums verification in the backend
Hi, On 2020-10-29 11:17:29 -0700, Andres Freund wrote: > LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED); > buf_state = LockBufHdr(bufdesc); > UnlockBufHdr(bufdesc, buf_state); > > /* If the page is dirty or invalid, skip it */ > if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) > == 0) This is weird as well. What is this supposed to do? Just locking and unlocking a buffer header doesn't do squat? There's no guarantee that the flags haven't changed by this point, so you could just as well not acquire the buffer header lock. Also, why are pages without a valid tag ignored? I can follow the argument for skipping it in the DIRTY case, but that doesn't apply for BM_TAG_VALID? Greetings, Andres Freund
Re: Online checksums verification in the backend
Hi, On 2020-10-28 14:08:52 +0900, Michael Paquier wrote: > Thanks for confirming. I have gone through the whole set today, > splitted the thing into two commits and applied them. We had > buildfarm member florican complain about a mistake in one of the > GetDatum() calls that I took care of already, and there is nothing > else on my radar. The code does IO while holding the buffer mapping lock. That seems *entirely* unacceptable to me. That basically locks 1/128 of shared buffers against concurrent mapping changes, while reading data that is likely not to be on disk? Seriously? /* see if the block is in the buffer pool or not */ LWLockAcquire(partLock, LW_SHARED); buf_id = BufTableLookup(_tag, buf_hash); if (buf_id >= 0) { uint32 buf_state; /* * Found it. Now, retrieve its state to know what to do with it, and * release the pin immediately. We do so to limit overhead as much as * possible. We keep the shared LWLock on the target buffer mapping * partition for now, so this buffer cannot be evicted, and we acquire * an I/O Lock on the buffer as we may need to read its contents from * disk. */ bufdesc = GetBufferDescriptor(buf_id); LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED); buf_state = LockBufHdr(bufdesc); UnlockBufHdr(bufdesc, buf_state); /* If the page is dirty or invalid, skip it */ if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0) { LWLockRelease(BufferDescriptorGetIOLock(bufdesc)); LWLockRelease(partLock); return true; } /* Read the buffer from disk, with the I/O lock still held */ smgrread(smgr, forknum, blkno, buffer); LWLockRelease(BufferDescriptorGetIOLock(bufdesc)); } else { /* * Simply read the buffer. There's no risk of modification on it as * we are holding the buffer pool partition mapping lock. */ smgrread(smgr, forknum, blkno, buffer); } The justification in the in-shared-buffers case seems to completely mis-judge costs too: * Found it. Now, retrieve its state to know what to do with it, and * release the pin immediately. We do so to limit overhead as much as * possible. We keep the shared LWLock on the target buffer mapping * partition for now, so this buffer cannot be evicted, and we acquire * an I/O Lock on the buffer as we may need to read its contents from * disk. a pin is cheap. Holding the partition lock is not. Also, using char[BLCKSZ] as a buffer isn't ok. This should use PGAlignedBlock: /* * Use this, not "char buf[BLCKSZ]", to declare a field or local variable * holding a page buffer, if that page might be accessed as a page and not * just a string of bytes. Otherwise the variable might be under-aligned, * causing problems on alignment-picky hardware. (In some places, we use * this to declare buffers even though we only pass them to read() and * write(), because copying to/from aligned buffers is usually faster than * using unaligned buffers.) We include both "double" and "int64" in the * union to ensure that the compiler knows the value must be MAXALIGN'ed * (cf. configure's computation of MAXIMUM_ALIGNOF). */ typedef union PGAlignedBlock I think this needs to be quickly reworked or reverted. Greetings, Andres Freund
Re: Online checksums verification in the backend
Hello, On Thu, Oct 29, 2020 at 7:52 AM Shinoda, Noriyoshi (PN Japan A Delivery) wrote: > > Hi, > > I have tested this great feature in the latest commit environment on Red Hat > Enterprise Linux 7.8. I modified a few blocks in a relation file to raise a > checksum error. When I executed the pg_relation_check_pages function, the > backend terminated abnormally. The attached file is the operation log. Thanks for the report! As far as I can see the issue is that the pfree(path) in check_relation_fork() should be outside the for loop.
RE: Online checksums verification in the backend
Hi, I have tested this great feature in the latest commit environment on Red Hat Enterprise Linux 7.8. I modified a few blocks in a relation file to raise a checksum error. When I executed the pg_relation_check_pages function, the backend terminated abnormally. The attached file is the operation log. Regards, Noriyoshi Shinoda -Original Message- From: Michael Paquier [mailto:mich...@paquier.xyz] Sent: Wednesday, October 28, 2020 2:09 PM To: Julien Rouhaud Cc: Justin Pryzby ; Masahiko Sawada ; Robert Haas ; PostgreSQL Hackers ; Masahiko Sawada Subject: Re: Online checksums verification in the backend On Tue, Oct 27, 2020 at 07:47:19PM +0800, Julien Rouhaud wrote: > I think it's also worth noting that the IOLock is now acquired just > before getting the buffer state, and released after the read (or after > finding that the buffer is dirty). This is consistent with how it's > done elsewhere, so I'm fine. Consistency is the point. This API should be safe to use by design. I have done some extra performance tests similar to what I did upthread, and this version showed similar numbers. > Other than that I'm quite happy with the changes you made, thanks a lot! Thanks for confirming. I have gone through the whole set today, splitted the thing into two commits and applied them. We had buildfarm member florican complain about a mistake in one of the GetDatum() calls that I took care of already, and there is nothing else on my radar. -- Michael $ uname -a Linux rel78-1 3.10.0-1127.el7.x86_64 #1 SMP Tue Feb 18 16:39:12 EST 2020 x86_64 x86_64 x86_64 GNU/Linux $ $ pg_config BINDIR = /usr/local/pgsql/bin DOCDIR = /usr/local/pgsql/share/doc HTMLDIR = /usr/local/pgsql/share/doc INCLUDEDIR = /usr/local/pgsql/include PKGINCLUDEDIR = /usr/local/pgsql/include INCLUDEDIR-SERVER = /usr/local/pgsql/include/server LIBDIR = /usr/local/pgsql/lib PKGLIBDIR = /usr/local/pgsql/lib LOCALEDIR = /usr/local/pgsql/share/locale MANDIR = /usr/local/pgsql/share/man SHAREDIR = /usr/local/pgsql/share SYSCONFDIR = /usr/local/pgsql/etc PGXS = /usr/local/pgsql/lib/pgxs/src/makefiles/pgxs.mk CONFIGURE = '--enable-debug' '--enable-cassert' CC = gcc -std=gnu99 CPPFLAGS = -D_GNU_SOURCE CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 CFLAGS_SL = -fPIC LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/usr/local/pgsql/lib',--enable-new-dtags LDFLAGS_EX = LDFLAGS_SL = LIBS = -lpgcommon -lpgport -lpthread -lz -lreadline -lrt -ldl -lm VERSION = PostgreSQL 14devel $ $ initdb --no-locale --encoding=utf8 -k data.1 The files belonging to this database system will be owned by user "postgres". ... Success. You can now start the database server using: pg_ctl -D data.1 -l logfile start $ $ pg_ctl -D data.1 start waiting for server to start2020-10-29 08:14:10.772 JST [106109] LOG: redirecting log output to logging collector process 2020-10-29 08:14:10.772 JST [106109] HINT: Future log output will appear in directory "log". done server started $ $ psql psql (14devel) Type "help" for help. postgres=# SELECT pg_control_system(); pg_control_system --- (1300,202010281,612156269485571,"2020-10-29 08:09:44+09") (1 row) postgres=# postgres=# CREATE TABLE data1(c1 NUMERIC, c2 VARCHAR(10)); CREATE TABLE postgres=# postgres=# INSERT INTO data1 VALUES (generate_series(1, 1), 'data1'); INSERT 0 1 postgres=# postgres=# SELECT pg_relation_filepath('data1'); pg_relation_filepath -- base/13751/16384 (1 row) postgres=# \q $ $ pg_ctl -D data.1 stop waiting for server to shut down done $ --- Modify base/13751/16384 file for checksum error $ $ pg_ctl -D data.1 start waiting for server to start2020-10-29 08:15:31.868 JST [106144] LOG: redirecting log output to logging collector process 2020-10-29 08:15:31.868 JST [106144] HINT: Future log output will appear in directory "log". done server started $ $ psql psql (14devel) Type "help" for help. postgres=# SELECT * FROM data1; WARNING: page verification failed, calculated checksum 31945 but expected 35621 ERROR: invalid page in block 0 of relation base/13751/16384 postgres=# postgres=# SELECT pg_relation_check_pages('data1', 'main'); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !?> \q $ $ cat data.1/log/postgresql-2020-10-29_081531.log 2020-10-29 08:15:31.868 JST [106144] LOG: starting PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit 2020-10-29 08:15:31.868 JST [106144] LOG: listening on IPv6
Re: Online checksums verification in the backend
On Tue, Oct 27, 2020 at 07:47:19PM +0800, Julien Rouhaud wrote: > I think it's also worth noting that the IOLock is now acquired just > before getting the buffer state, and released after the read (or after > finding that the buffer is dirty). This is consistent with how it's > done elsewhere, so I'm fine. Consistency is the point. This API should be safe to use by design. I have done some extra performance tests similar to what I did upthread, and this version showed similar numbers. > Other than that I'm quite happy with the changes you made, thanks a lot! Thanks for confirming. I have gone through the whole set today, splitted the thing into two commits and applied them. We had buildfarm member florican complain about a mistake in one of the GetDatum() calls that I took care of already, and there is nothing else on my radar. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Tue, Oct 27, 2020 at 3:07 PM Michael Paquier wrote: > > On Fri, Oct 23, 2020 at 06:06:30PM +0900, Michael Paquier wrote: > > On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote: > >> Mmm, is it really an improvement to report warnings during this > >> function execution? Note also that PageIsVerified as-is won't report > >> a warning if a page is found as PageIsNew() but isn't actually all > >> zero, while still being reported as corrupted by the SRF. > > > > Yep, joining the point of above to just have no WARNINGs at all. > > Now that we have d401c57, I got to consider more this one, and opted > for not generating a WARNING for now. Hence, PageisVerifiedExtended() > is disabled regarding that, but we still report a checksum failure in > it. Great, that's also what I had in mind. > I have spent some time reviewing the tests, and as I felt this was > bulky. In the reworked version attached, I have reduced the number of > tests by half, without reducing the coverage, mainly: > - Removed all the stderr and the return code tests, as we always > expected the commands to succeed, and safe_psql() can do the job > already. > - Merged of the queries using pg_relation_check_pages into a single > routine, with the expected output (set of broken pages returned in the > SRF) in the arguments. > - Added some prefixes to the tests, to generate unique test names. > That makes debug easier. > - The query on pg_stat_database is run once at the beginning, once at > the end with the number of checksum failures correctly updated. > - Added comments to document all the routines, and renamed some of > them mostly for consistency. > - Skipped system relations from the scan of pg_class, making the test > more costly for nothing. > - I ran some tests on Windows, just-in-case. > > I have also added a SearchSysCacheExists1() to double-check if the > relation is missing before opening it, added a > CHECK_FOR_INTERRUPTS() within the main check loop (where the error > context is really helpful), indented the code, bumped the catalogs > (mostly a self-reminder), etc. > > So, what do you think? I think it's also worth noting that the IOLock is now acquired just before getting the buffer state, and released after the read (or after finding that the buffer is dirty). This is consistent with how it's done elsewhere, so I'm fine. Other than that I'm quite happy with the changes you made, thanks a lot!
Re: Online checksums verification in the backend
On Fri, Oct 23, 2020 at 06:06:30PM +0900, Michael Paquier wrote: > On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote: >> Mmm, is it really an improvement to report warnings during this >> function execution? Note also that PageIsVerified as-is won't report >> a warning if a page is found as PageIsNew() but isn't actually all >> zero, while still being reported as corrupted by the SRF. > > Yep, joining the point of above to just have no WARNINGs at all. Now that we have d401c57, I got to consider more this one, and opted for not generating a WARNING for now. Hence, PageisVerifiedExtended() is disabled regarding that, but we still report a checksum failure in it. I have spent some time reviewing the tests, and as I felt this was bulky. In the reworked version attached, I have reduced the number of tests by half, without reducing the coverage, mainly: - Removed all the stderr and the return code tests, as we always expected the commands to succeed, and safe_psql() can do the job already. - Merged of the queries using pg_relation_check_pages into a single routine, with the expected output (set of broken pages returned in the SRF) in the arguments. - Added some prefixes to the tests, to generate unique test names. That makes debug easier. - The query on pg_stat_database is run once at the beginning, once at the end with the number of checksum failures correctly updated. - Added comments to document all the routines, and renamed some of them mostly for consistency. - Skipped system relations from the scan of pg_class, making the test more costly for nothing. - I ran some tests on Windows, just-in-case. I have also added a SearchSysCacheExists1() to double-check if the relation is missing before opening it, added a CHECK_FOR_INTERRUPTS() within the main check loop (where the error context is really helpful), indented the code, bumped the catalogs (mostly a self-reminder), etc. So, what do you think? -- Michael diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h index f44a09b0c2..e522477780 100644 --- a/src/include/catalog/catversion.h +++ b/src/include/catalog/catversion.h @@ -53,6 +53,6 @@ */ /* mmddN */ -#define CATALOG_VERSION_NO 202010201 +#define CATALOG_VERSION_NO 202010271 #endif diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index bbcac69d48..a66870bcc0 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -10958,6 +10958,13 @@ proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}', proargnames => '{tablespace,name,size,modification}', prosrc => 'pg_ls_tmpdir_1arg' }, +{ oid => '9147', descr => 'check pages of a relation', + proname => 'pg_relation_check_pages', procost => '1', prorows => '20', + proisstrict => 'f', proretset => 't', provolatile => 'v', proparallel => 'r', + prorettype => 'record', proargtypes => 'regclass text', + proallargtypes => '{regclass,text,text,int8}', proargmodes => '{i,i,o,o}', + proargnames => '{relation,fork,path,failed_block_num}', + prosrc => 'pg_relation_check_pages' }, # hash partitioning constraint function { oid => '5028', descr => 'hash partition CHECK constraint', diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index ee91b8fa26..a21cab2eaf 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -240,6 +240,9 @@ extern void AtProcExit_LocalBuffers(void); extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation); +extern bool CheckBuffer(struct SMgrRelationData *smgr, ForkNumber forknum, + BlockNumber blkno); + /* in freelist.c */ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype); extern void FreeAccessStrategy(BufferAccessStrategy strategy); diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 85cd147e21..c6dd084fbc 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1300,6 +1300,14 @@ LANGUAGE INTERNAL STRICT VOLATILE AS 'pg_create_logical_replication_slot'; +CREATE OR REPLACE FUNCTION pg_relation_check_pages( +IN relation regclass, IN fork text DEFAULT NULL, +OUT path text, OUT failed_block_num bigint) +RETURNS SETOF record +LANGUAGE internal +VOLATILE PARALLEL RESTRICTED +AS 'pg_relation_check_pages'; + CREATE OR REPLACE FUNCTION make_interval(years int4 DEFAULT 0, months int4 DEFAULT 0, weeks int4 DEFAULT 0, days int4 DEFAULT 0, hours int4 DEFAULT 0, mins int4 DEFAULT 0, @@ -1444,6 +1452,7 @@ AS 'unicode_is_normalized'; -- can later change who can access these functions, or leave them as only -- available to superuser / cluster owner, if they choose. -- +REVOKE EXECUTE ON FUNCTION pg_relation_check_pages(regclass, text) FROM public; REVOKE EXECUTE ON FUNCTION pg_start_backup(text, boolean, boolean) FROM public; REVOKE EXECUTE ON FUNCTION pg_stop_backup() FROM
Re: Online checksums verification in the backend
On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote: > I agree. However I'm assuming that this refactor is relying on the > not yet committed patch (in the nearby thread dealing with base backup > checksums check) to also refactor PageIsVerified? As all the logic > you removed was done to avoid spamming a lot of warnings when calling > the function. Yeah, it should use a refactored version, but I was as well in the mood of looking at version based on what we have now on HEAD. Even if I am not completely clear where the patch for page verification and base backups will go, I was thinking as well to do the refactoring introducing PageIsVerifiedExtended() first, before considering the next steps for this thread. It seems to me that the path where we generate no WARNINGs at all makes the whole experience more consistent for the user with this function. > Mmm, is it really an improvement to report warnings during this > function execution? Note also that PageIsVerified as-is won't report > a warning if a page is found as PageIsNew() but isn't actually all > zero, while still being reported as corrupted by the SRF. Yep, joining the point of above to just have no WARNINGs at all. > Have you also considered that it's possible to execute > pg_relation_check_pages with ignore_checksum_failure = on? That's > evidently a bad idea, but doing so would report some of the data > corruption as warnings while still not reporting anything in the SRF. Yeah, I thought about that as well, but I did not see a strong argument against preventing this behavior either, even if it sounds a bit strange. We could always tune that later depending on the feedback. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Fri, Oct 23, 2020 at 3:28 PM Michael Paquier wrote: > > On Mon, Oct 19, 2020 at 04:52:48PM +0900, Michael Paquier wrote: > > No issues with reporting the block number and the fork type in the SRF > > at least of course as this is helpful to detect the position of the > > broken blocks. For the checksum found in the header and the one > > calculated (named expected in the patch), I am less sure how to put a > > clear definition on top of that but we could always consider that > > later and extend the SRF as needed. Once the user knows that both do > > not match, he/she knows that there is a problem. > > So, I have reviewed your patch set, and heavily reworked the logic to > be more consistent on many thinks, resulting in a largely simplified > patch without sacrificing its usefulness: Thanks! > - Removal of the dependency with checksums for this feature. While > simplifying the code, I have noticed that this feature can also be > beneficial for clusters that do not have have data checksums, as > PageIsVerified() is perfectly able to run some page header checks and > the zero-page case. That's of course less useful than having the > checksums, but there is no need to add a dependency here. The file > for the SQL functions is renamed from checksumfuncs.c to pagefuncs.c. I agree. However I'm assuming that this refactor is relying on the not yet committed patch (in the nearby thread dealing with base backup checksums check) to also refactor PageIsVerified? As all the logic you removed was done to avoid spamming a lot of warnings when calling the function. > - The function is changed to return no tuples if the relkind is not > supported, and the same applies for temporary relations. That's more > consistent with other system functions like the ones in charge of > partition information, and this makes full scans of pg_class much > easier to work with. Temporary tables were not handled correctly > anyway as these are in local buffers, but the use case of this > function in this case is not really obvious to me. Agreed > - Having the forknum in the SRF is kind of confusing, as the user > would need to map a number with the physical on-disk name. Instead, I > have made the choice to return the *path* of the corrupted file with a > block number. This way, an operator can know immediately where a > problem comes from. The patch does not append the segment number, and > I am not sure if we should actually do that, but adding it is > straight-forward as we have the block number. There is a dependency > with table AMs here as well, as this goes down to fd.c, explaining why > I have not added it and just. That's a clear improvement, thanks! > - I really don't know what kind of default ACL should apply for such > functions, but I am sure that SCAN_TABLES is not what we are looking > for here, and there is nothing preventing us from having a safe > default from the start of times, so I moved the function to be > superuser-only by default, and GRANT can be used to allow its > execution to other roles. We could relax that in the future, of > course, this can be discussed separately. I don't have a strong opinion here, SCAN_TABLES was maybe not ideal. No objections. > - The WARNING report for each block found as corrupted gains an error > context, as a result of a switch to PageIsVerified(), giving a user > all the information needed in the logs on top of the result in the > SRF. That's useful as well if combined with CHECK_FOR_INTERRUPTS(), > and I got to wonder if we should have some progress report for this > stuff, though that's a separate discussion. Mmm, is it really an improvement to report warnings during this function execution? Note also that PageIsVerified as-is won't report a warning if a page is found as PageIsNew() but isn't actually all zero, while still being reported as corrupted by the SRF. Have you also considered that it's possible to execute pg_relation_check_pages with ignore_checksum_failure = on? That's evidently a bad idea, but doing so would report some of the data corruption as warnings while still not reporting anything in the SRF. Having some progress report would be nice to have, but +1 to have a separate discussion for that. > - The function is renamed to something less generic, > pg_relation_check_pages(), and I have reduced the number of functions > from two to one, where the user can specify the fork name with a new > option. The default of NULL means that all the forks of a relation > are checked. Ok. > - The TAP tests are rather bulky. I have moved all the non-corruption > test cases into a new SQL test file. That's useful for people willing > to do some basic sanity checks with a non-default table AM. At least > it provides a minimum coverage. Agreed
Re: Online checksums verification in the backend
On Mon, Oct 19, 2020 at 04:52:48PM +0900, Michael Paquier wrote: > No issues with reporting the block number and the fork type in the SRF > at least of course as this is helpful to detect the position of the > broken blocks. For the checksum found in the header and the one > calculated (named expected in the patch), I am less sure how to put a > clear definition on top of that but we could always consider that > later and extend the SRF as needed. Once the user knows that both do > not match, he/she knows that there is a problem. So, I have reviewed your patch set, and heavily reworked the logic to be more consistent on many thinks, resulting in a largely simplified patch without sacrificing its usefulness: - The logic of the core routine of bufmgr.c is unchanged. I have simplified it a bit though by merging the subroutines that were part of the patch. SMgrRelation is used as argument instead of a Relation. That's more consistent with the surroundings. The initial read of a page without locks is still on the table as an extra optimization, though I am not completely sure if this should be part of CheckBuffer() or not. I also thought about the previous benchmarks and I think that not using the most-optimized improved performance, because it reduced the successive runes of the SQL functions, reducing the back-pressure on the partition locks (we held on of them at the same time for a longer period, meaning that the other 127 ran free for a longer time). Please note that this part still referred to a "module", which was incorrect. - Removal of the expected and found checksums from the SRF of the function. Based on my recent business with the page checks for base backups, I have arrived at the conclusion that the SRF should return data that we can absolutely trust, and the minimum I think we have to trust here is if a given page is thought as safe or not, considering all the sanity checks done by PageIsVerified() as the main entry point for everything. This has led to a bit of confusion with the addition of NoComputedChecksum for a page that was empty as of the initial of the patch, so it happens that we don't need it anymore. - Removal of the dependency with checksums for this feature. While simplifying the code, I have noticed that this feature can also be beneficial for clusters that do not have have data checksums, as PageIsVerified() is perfectly able to run some page header checks and the zero-page case. That's of course less useful than having the checksums, but there is no need to add a dependency here. The file for the SQL functions is renamed from checksumfuncs.c to pagefuncs.c. - The function is changed to return no tuples if the relkind is not supported, and the same applies for temporary relations. That's more consistent with other system functions like the ones in charge of partition information, and this makes full scans of pg_class much easier to work with. Temporary tables were not handled correctly anyway as these are in local buffers, but the use case of this function in this case is not really obvious to me. - Having the forknum in the SRF is kind of confusing, as the user would need to map a number with the physical on-disk name. Instead, I have made the choice to return the *path* of the corrupted file with a block number. This way, an operator can know immediately where a problem comes from. The patch does not append the segment number, and I am not sure if we should actually do that, but adding it is straight-forward as we have the block number. There is a dependency with table AMs here as well, as this goes down to fd.c, explaining why I have not added it and just. - I really don't know what kind of default ACL should apply for such functions, but I am sure that SCAN_TABLES is not what we are looking for here, and there is nothing preventing us from having a safe default from the start of times, so I moved the function to be superuser-only by default, and GRANT can be used to allow its execution to other roles. We could relax that in the future, of course, this can be discussed separately. - The WARNING report for each block found as corrupted gains an error context, as a result of a switch to PageIsVerified(), giving a user all the information needed in the logs on top of the result in the SRF. That's useful as well if combined with CHECK_FOR_INTERRUPTS(), and I got to wonder if we should have some progress report for this stuff, though that's a separate discussion. - The function is renamed to something less generic, pg_relation_check_pages(), and I have reduced the number of functions from two to one, where the user can specify the fork name with a new option. The default of NULL means that all the forks of a relation are checked. - The TAP tests are rather bulky. I have moved all the non-corruption test cases into a new SQL test file. That's useful for people willing to do some basic sanity checks with a non-default table AM. At least it provides a minimum
Re: Online checksums verification in the backend
On Mon, Oct 19, 2020 at 11:16:38AM +0800, Julien Rouhaud wrote: > On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier wrote: >> Somewhat related to the first point, NoComputedChecksum exists >> because, as the current patch is shaped, we need to report an existing >> checksum to the user even for the zero-only case. > > I'm not sure that I understand your point. The current patch only > returns something to users when there's a corruption. If by > "zero-only case" you mean "page corrupted in a way that PageIsNew() > returns true while not being all zero", then it's a corrupted page and > then obviously yes it needs to be returned to users. Sorry for the confusion, this previous paragraph was confusing. I meant that the reason why NoComputedChecksum exists is that we give up on attempting to calculate the checksum if we detect that the page is new, but failed the zero-only test, and that we want the users to know about this special case by setting this expected checksum to NULL for the SRF. >> So, wouldn't it be better to just rely on PageIsVerified() >> (well it's rather-soon-to-be extended flavor) for the checksum check, >> the header sanity check and the zero-only check? My point is to keep >> a single entry point for all the page sanity checks, so as base >> backups, your patch, and the buffer manager apply the same things. >> Base backups got that partially wrong because the base backup code >> wants to keep control of the number of failures and the error >> reports. > > I'm fine with it. Thanks. >> Your patch actually wishes to report a failure, but you want >> to add more context with the fork name and such. Another option we >> could use here is to add an error context so as PageIsVerified() >> reports the WARNING, but the SQL function provides more context with >> the block number and the relation involved in the check. > > Also, returning actual data rather than a bunch of warnings is way > easier to process for client code. And as mentioned previously having > an API that returns a list of corrupted blocks could be useful for a > single-page recovery feature. No issues with reporting the block number and the fork type in the SRF at least of course as this is helpful to detect the position of the broken blocks. For the checksum found in the header and the one calculated (named expected in the patch), I am less sure how to put a clear definition on top of that but we could always consider that later and extend the SRF as needed. Once the user knows that both do not match, he/she knows that there is a problem. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier wrote: > > On Fri, Oct 16, 2020 at 09:22:02AM +0800, Julien Rouhaud wrote: > > And Michael just told me that I also missed adding one of the C files > > while splitting the patch into two. > > + if (PageIsNew(page)) > + { > + /* > +* Check if the page is really new or if there's corruption that > +* affected PageIsNew detection. Note that PageIsVerified won't try > to > +* detect checksum corruption in this case, so there's no risk of > +* duplicated corruption report. > +*/ > + if (PageIsVerified(page, blkno)) > + { > + /* No corruption. */ > + return true; > + } > Please note that this part of your patch overlaps with a proposal for > a bug fix related to zero-only pages with the checksum verification of > base backups: > https://www.postgresql.org/message-id/608f3476-0598-2514-2c03-e05c7d2b0...@postgrespro.ru > > Your patch is trying to adapt itself to the existing logic we have in > PageIsVerified() so as you don't get a duplicated report, as does the > base backup part. Note that I cannot find in the wild any open code > making use of PageIsVerified(), but I'd like to believe that it is > rather handy to use for table AMs at least (?), so if we can avoid any > useless ABI breakage, it may be better to have a new > PageIsVerifiedExtended() able to take additional options, one to > report to pgstat and one to generate this elog(WARNING). And then > this patch could just make use of it? Indeed, that would be great. > + /* > +* There's corruption, but since this affects PageIsNew, we > +* can't compute a checksum, so set NoComputedChecksum for the > +* expected checksum. > +*/ > + *chk_expected = NoComputedChecksum; > + *chk_found = hdr->pd_checksum; > + return false; > [...] > + /* > +* This can happen if corruption makes the block appears as > +* PageIsNew() but isn't a new page. > +*/ > + if (chk_expected == NoComputedChecksum) > + nulls[i++] = true; > + else > + values[i++] = UInt16GetDatum(chk_expected); > Somewhat related to the first point, NoComputedChecksum exists > because, as the current patch is shaped, we need to report an existing > checksum to the user even for the zero-only case. I'm not sure that I understand your point. The current patch only returns something to users when there's a corruption. If by "zero-only case" you mean "page corrupted in a way that PageIsNew() returns true while not being all zero", then it's a corrupted page and then obviously yes it needs to be returned to users. > PageIsVerified() is > not that flexible so we could change it to report a status depending > on the error faced (checksum, header or zero-only) on top of getting a > checksum. Now, I am not completely sure either that it is worth the > complication to return in the SRF of the check function the expected > checksum. It seemed to me that it could be something useful to get with this kind of tool. You may be able to recover a corrupted page from backup/WAL if the checksum itself wasn't corrupted so that you know what to look for. There would be a lot of caveats and low level work, but if you're desperate enough that may save you a bit of time. > So, wouldn't it be better to just rely on PageIsVerified() > (well it's rather-soon-to-be extended flavor) for the checksum check, > the header sanity check and the zero-only check? My point is to keep > a single entry point for all the page sanity checks, so as base > backups, your patch, and the buffer manager apply the same things. > Base backups got that partially wrong because the base backup code > wants to keep control of the number of failures and the error > reports. I'm fine with it. > Your patch actually wishes to report a failure, but you want > to add more context with the fork name and such. Another option we > could use here is to add an error context so as PageIsVerified() > reports the WARNING, but the SQL function provides more context with > the block number and the relation involved in the check. Also, returning actual data rather than a bunch of warnings is way easier to process for client code. And as mentioned previously having an API that returns a list of corrupted blocks could be useful for a single-page recovery feature.
Re: Online checksums verification in the backend
On Fri, Oct 16, 2020 at 09:22:02AM +0800, Julien Rouhaud wrote: > And Michael just told me that I also missed adding one of the C files > while splitting the patch into two. + if (PageIsNew(page)) + { + /* +* Check if the page is really new or if there's corruption that +* affected PageIsNew detection. Note that PageIsVerified won't try to +* detect checksum corruption in this case, so there's no risk of +* duplicated corruption report. +*/ + if (PageIsVerified(page, blkno)) + { + /* No corruption. */ + return true; + } Please note that this part of your patch overlaps with a proposal for a bug fix related to zero-only pages with the checksum verification of base backups: https://www.postgresql.org/message-id/608f3476-0598-2514-2c03-e05c7d2b0...@postgrespro.ru Your patch is trying to adapt itself to the existing logic we have in PageIsVerified() so as you don't get a duplicated report, as does the base backup part. Note that I cannot find in the wild any open code making use of PageIsVerified(), but I'd like to believe that it is rather handy to use for table AMs at least (?), so if we can avoid any useless ABI breakage, it may be better to have a new PageIsVerifiedExtended() able to take additional options, one to report to pgstat and one to generate this elog(WARNING). And then this patch could just make use of it? + /* +* There's corruption, but since this affects PageIsNew, we +* can't compute a checksum, so set NoComputedChecksum for the +* expected checksum. +*/ + *chk_expected = NoComputedChecksum; + *chk_found = hdr->pd_checksum; + return false; [...] + /* +* This can happen if corruption makes the block appears as +* PageIsNew() but isn't a new page. +*/ + if (chk_expected == NoComputedChecksum) + nulls[i++] = true; + else + values[i++] = UInt16GetDatum(chk_expected); Somewhat related to the first point, NoComputedChecksum exists because, as the current patch is shaped, we need to report an existing checksum to the user even for the zero-only case. PageIsVerified() is not that flexible so we could change it to report a status depending on the error faced (checksum, header or zero-only) on top of getting a checksum. Now, I am not completely sure either that it is worth the complication to return in the SRF of the check function the expected checksum. So, wouldn't it be better to just rely on PageIsVerified() (well it's rather-soon-to-be extended flavor) for the checksum check, the header sanity check and the zero-only check? My point is to keep a single entry point for all the page sanity checks, so as base backups, your patch, and the buffer manager apply the same things. Base backups got that partially wrong because the base backup code wants to keep control of the number of failures and the error reports. Your patch actually wishes to report a failure, but you want to add more context with the fork name and such. Another option we could use here is to add an error context so as PageIsVerified() reports the WARNING, but the SQL function provides more context with the block number and the relation involved in the check. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Fri, Oct 16, 2020 at 8:59 AM Julien Rouhaud wrote: > > On Thu, Oct 15, 2020 at 3:59 PM Julien Rouhaud wrote: > > > > On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud wrote: > > > > > > On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier > > > wrote: > > > > > > > > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote: > > > > > Thanks a lot for the tests! I'm not surprised that forcing the lock > > > > > will slow down the pg_check_relation() execution, but I'm a bit > > > > > surprised that holding the buffer mapping lock longer in a workload > > > > > that has a lot of evictions actually makes things faster. Do you have > > > > > any idea why that's the case? > > > > > > > > That's still a bit unclear to me, but I have not spent much time > > > > thinking about this particular point either. > > > > > > > > > I'm assuming that you prefer to remove both the optimization and the > > > > > throttling part? I'll do that with the next version unless there's > > > > > objections. > > > > > > > > Yeah, any tests I have done tends to show that. It would be good to > > > > also check some perf profiles here, at least for the process running > > > > the relation check in a loop. > > > > > > > > > I agree that putting the code nearby ReadBuffer_common() would be a > > > > > good idea. However, that means that I can't move all the code to > > > > > contrib/ I'm wondering what you'd like to see going there. I can see > > > > > some values in also having the SQL functions available in core rather > > > > > than contrib, e.g. if you need to quickly check a relation on a > > > > > standby, so without requiring to create the extension on the primary > > > > > node first. > > > > > > > > Good point. This could make the user experience worse. > > > > > > > > > Then, I'm a bit worried about adding this code in ReadBuffer_common. > > > > > What this code does is quite different, and I'm afraid that it'll make > > > > > ReadBuffer_common more complex than needed, which is maybe not a good > > > > > idea for something as critical as this function. > > > > > > > > > > What do you think? > > > > > > > > Yeah, I have been looking at ReadBuffer_common() and it is true that > > > > it is complicated enough so we may not really need an extra mode or > > > > more options, for a final logic that is actually different than what a > > > > buffer read does: we just want to know if a page has a valid checksum > > > > or not. An idea that I got here would be to add a new, separate > > > > function to do the page check directly in bufmgr.c, but that's what > > > > you mean. Now only the prefetch routine and ReadBuffer_common use > > > > partition locks, but getting that done in the same file looks like a > > > > good compromise to me. It would be also possible to keep the BLCKSZ > > > > buffer used to check the page directly in this routine, so as any > > > > caller willing to do a check don't need to worry about any > > > > allocation. > > > > > > I made all the suggested modifications in attached v14: > > > > > > - moved the C code in bufmgr.c nearby ReadBuffer > > > - removed the GUC and throttling options > > > - removed the dubious optimization > > > > > > All documentation and comments are updated to reflect those changes. > > > I also split the commit in two, one for the backend infrastructure and > > > one for the SQL wrappers. > > > > And I did miss a reference in the sgml documentation, fixed in v15. > > I forgot to add the modified file in the previous attachment, sorry > for the noise. And Michael just told me that I also missed adding one of the C files while splitting the patch into two. v17-0001-Add-backend-infrastructure-to-check-the-validity.patch Description: Binary data v17-0002-Add-a-pg_check_relation-SQL-function.patch Description: Binary data
Re: Online checksums verification in the backend
On Thu, Oct 15, 2020 at 3:59 PM Julien Rouhaud wrote: > > On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud wrote: > > > > On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier wrote: > > > > > > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote: > > > > Thanks a lot for the tests! I'm not surprised that forcing the lock > > > > will slow down the pg_check_relation() execution, but I'm a bit > > > > surprised that holding the buffer mapping lock longer in a workload > > > > that has a lot of evictions actually makes things faster. Do you have > > > > any idea why that's the case? > > > > > > That's still a bit unclear to me, but I have not spent much time > > > thinking about this particular point either. > > > > > > > I'm assuming that you prefer to remove both the optimization and the > > > > throttling part? I'll do that with the next version unless there's > > > > objections. > > > > > > Yeah, any tests I have done tends to show that. It would be good to > > > also check some perf profiles here, at least for the process running > > > the relation check in a loop. > > > > > > > I agree that putting the code nearby ReadBuffer_common() would be a > > > > good idea. However, that means that I can't move all the code to > > > > contrib/ I'm wondering what you'd like to see going there. I can see > > > > some values in also having the SQL functions available in core rather > > > > than contrib, e.g. if you need to quickly check a relation on a > > > > standby, so without requiring to create the extension on the primary > > > > node first. > > > > > > Good point. This could make the user experience worse. > > > > > > > Then, I'm a bit worried about adding this code in ReadBuffer_common. > > > > What this code does is quite different, and I'm afraid that it'll make > > > > ReadBuffer_common more complex than needed, which is maybe not a good > > > > idea for something as critical as this function. > > > > > > > > What do you think? > > > > > > Yeah, I have been looking at ReadBuffer_common() and it is true that > > > it is complicated enough so we may not really need an extra mode or > > > more options, for a final logic that is actually different than what a > > > buffer read does: we just want to know if a page has a valid checksum > > > or not. An idea that I got here would be to add a new, separate > > > function to do the page check directly in bufmgr.c, but that's what > > > you mean. Now only the prefetch routine and ReadBuffer_common use > > > partition locks, but getting that done in the same file looks like a > > > good compromise to me. It would be also possible to keep the BLCKSZ > > > buffer used to check the page directly in this routine, so as any > > > caller willing to do a check don't need to worry about any > > > allocation. > > > > I made all the suggested modifications in attached v14: > > > > - moved the C code in bufmgr.c nearby ReadBuffer > > - removed the GUC and throttling options > > - removed the dubious optimization > > > > All documentation and comments are updated to reflect those changes. > > I also split the commit in two, one for the backend infrastructure and > > one for the SQL wrappers. > > And I did miss a reference in the sgml documentation, fixed in v15. I forgot to add the modified file in the previous attachment, sorry for the noise. v16-0001-Add-backend-infrastructure-to-check-the-validity.patch Description: Binary data v16-0002-Add-a-pg_check_relation-SQL-function.patch Description: Binary data
Re: Online checksums verification in the backend
On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud wrote: > > On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier wrote: > > > > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote: > > > Thanks a lot for the tests! I'm not surprised that forcing the lock > > > will slow down the pg_check_relation() execution, but I'm a bit > > > surprised that holding the buffer mapping lock longer in a workload > > > that has a lot of evictions actually makes things faster. Do you have > > > any idea why that's the case? > > > > That's still a bit unclear to me, but I have not spent much time > > thinking about this particular point either. > > > > > I'm assuming that you prefer to remove both the optimization and the > > > throttling part? I'll do that with the next version unless there's > > > objections. > > > > Yeah, any tests I have done tends to show that. It would be good to > > also check some perf profiles here, at least for the process running > > the relation check in a loop. > > > > > I agree that putting the code nearby ReadBuffer_common() would be a > > > good idea. However, that means that I can't move all the code to > > > contrib/ I'm wondering what you'd like to see going there. I can see > > > some values in also having the SQL functions available in core rather > > > than contrib, e.g. if you need to quickly check a relation on a > > > standby, so without requiring to create the extension on the primary > > > node first. > > > > Good point. This could make the user experience worse. > > > > > Then, I'm a bit worried about adding this code in ReadBuffer_common. > > > What this code does is quite different, and I'm afraid that it'll make > > > ReadBuffer_common more complex than needed, which is maybe not a good > > > idea for something as critical as this function. > > > > > > What do you think? > > > > Yeah, I have been looking at ReadBuffer_common() and it is true that > > it is complicated enough so we may not really need an extra mode or > > more options, for a final logic that is actually different than what a > > buffer read does: we just want to know if a page has a valid checksum > > or not. An idea that I got here would be to add a new, separate > > function to do the page check directly in bufmgr.c, but that's what > > you mean. Now only the prefetch routine and ReadBuffer_common use > > partition locks, but getting that done in the same file looks like a > > good compromise to me. It would be also possible to keep the BLCKSZ > > buffer used to check the page directly in this routine, so as any > > caller willing to do a check don't need to worry about any > > allocation. > > I made all the suggested modifications in attached v14: > > - moved the C code in bufmgr.c nearby ReadBuffer > - removed the GUC and throttling options > - removed the dubious optimization > > All documentation and comments are updated to reflect those changes. > I also split the commit in two, one for the backend infrastructure and > one for the SQL wrappers. And I did miss a reference in the sgml documentation, fixed in v15. v15-0002-Add-a-pg_check_relation-SQL-function.patch Description: Binary data v15-0001-Add-backend-infrastructure-to-check-the-validity.patch Description: Binary data
Re: Online checksums verification in the backend
On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier wrote: > > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote: > > Thanks a lot for the tests! I'm not surprised that forcing the lock > > will slow down the pg_check_relation() execution, but I'm a bit > > surprised that holding the buffer mapping lock longer in a workload > > that has a lot of evictions actually makes things faster. Do you have > > any idea why that's the case? > > That's still a bit unclear to me, but I have not spent much time > thinking about this particular point either. > > > I'm assuming that you prefer to remove both the optimization and the > > throttling part? I'll do that with the next version unless there's > > objections. > > Yeah, any tests I have done tends to show that. It would be good to > also check some perf profiles here, at least for the process running > the relation check in a loop. > > > I agree that putting the code nearby ReadBuffer_common() would be a > > good idea. However, that means that I can't move all the code to > > contrib/ I'm wondering what you'd like to see going there. I can see > > some values in also having the SQL functions available in core rather > > than contrib, e.g. if you need to quickly check a relation on a > > standby, so without requiring to create the extension on the primary > > node first. > > Good point. This could make the user experience worse. > > > Then, I'm a bit worried about adding this code in ReadBuffer_common. > > What this code does is quite different, and I'm afraid that it'll make > > ReadBuffer_common more complex than needed, which is maybe not a good > > idea for something as critical as this function. > > > > What do you think? > > Yeah, I have been looking at ReadBuffer_common() and it is true that > it is complicated enough so we may not really need an extra mode or > more options, for a final logic that is actually different than what a > buffer read does: we just want to know if a page has a valid checksum > or not. An idea that I got here would be to add a new, separate > function to do the page check directly in bufmgr.c, but that's what > you mean. Now only the prefetch routine and ReadBuffer_common use > partition locks, but getting that done in the same file looks like a > good compromise to me. It would be also possible to keep the BLCKSZ > buffer used to check the page directly in this routine, so as any > caller willing to do a check don't need to worry about any > allocation. I made all the suggested modifications in attached v14: - moved the C code in bufmgr.c nearby ReadBuffer - removed the GUC and throttling options - removed the dubious optimization All documentation and comments are updated to reflect those changes. I also split the commit in two, one for the backend infrastructure and one for the SQL wrappers. v14-0001-Add-backend-infrastructure-to-check-the-validity.patch Description: Binary data v14-0002-Add-a-pg_check_relation-SQL-function.patch Description: Binary data
Re: Online checksums verification in the backend
On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote: > Thanks a lot for the tests! I'm not surprised that forcing the lock > will slow down the pg_check_relation() execution, but I'm a bit > surprised that holding the buffer mapping lock longer in a workload > that has a lot of evictions actually makes things faster. Do you have > any idea why that's the case? That's still a bit unclear to me, but I have not spent much time thinking about this particular point either. > I'm assuming that you prefer to remove both the optimization and the > throttling part? I'll do that with the next version unless there's > objections. Yeah, any tests I have done tends to show that. It would be good to also check some perf profiles here, at least for the process running the relation check in a loop. > I agree that putting the code nearby ReadBuffer_common() would be a > good idea. However, that means that I can't move all the code to > contrib/ I'm wondering what you'd like to see going there. I can see > some values in also having the SQL functions available in core rather > than contrib, e.g. if you need to quickly check a relation on a > standby, so without requiring to create the extension on the primary > node first. Good point. This could make the user experience worse. > Then, I'm a bit worried about adding this code in ReadBuffer_common. > What this code does is quite different, and I'm afraid that it'll make > ReadBuffer_common more complex than needed, which is maybe not a good > idea for something as critical as this function. > > What do you think? Yeah, I have been looking at ReadBuffer_common() and it is true that it is complicated enough so we may not really need an extra mode or more options, for a final logic that is actually different than what a buffer read does: we just want to know if a page has a valid checksum or not. An idea that I got here would be to add a new, separate function to do the page check directly in bufmgr.c, but that's what you mean. Now only the prefetch routine and ReadBuffer_common use partition locks, but getting that done in the same file looks like a good compromise to me. It would be also possible to keep the BLCKSZ buffer used to check the page directly in this routine, so as any caller willing to do a check don't need to worry about any allocation. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Wed, Sep 16, 2020 at 11:46 AM Michael Paquier wrote: > > On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote: > > Thanks! > > I got some numbers out of my pocket, using the following base > configuration: > [...] > > Within parenthesis is the amount of time taken by pg_relation_check() > for a single check. This is not completely exact and I saw some > variations, just to give an impression: > Conns 64 128 256 > force_lock=true 60590 (7~8s) 55652 (10.2~10.5s) 46812 (9~10s) > force_lock=false 58637 (5s)54131 (6~7s) 37091 (1.1~1.2s) > > For connections higher than 128, I was kind of surprised to see > pg_relation_check being more aggressive without the optimization, with > much less contention on the buffer mapping LWLock actually, but that's > an impression from looking at pg_stat_activity. > > Looking at the wait events for each run, I saw much more hiccups with > the buffer mapping LWLock when forcing the lock rather than not, still > I was able to see some contention when also not forcing the lock. Not > surprising as this rejects a bunch of pages from shared buffers. > > > I used all default settings, but by default checksum_cost_delay is 0 > > so there shouldn't be any throttling. > > Thanks, so did I. From what I can see, there could be as well > benefits in not using the optimization by default so as the relation > check applies some natural throttling by making the checks actually > slower (there is a link between the individual runtime of > pg_relation_time and the TPS). Thanks a lot for the tests! I'm not surprised that forcing the lock will slow down the pg_check_relation() execution, but I'm a bit surprised that holding the buffer mapping lock longer in a workload that has a lot of evictions actually makes things faster. Do you have any idea why that's the case? > So it also seems to me that the > throttling parameters would be beneficial, but it looks to me that > there is as well a point to not include any throttling in a first > version if the optimization to go full speed is not around. Using > three new GUCs for those function calls is still too much IMO I'm assuming that you prefer to remove both the optimization and the throttling part? I'll do that with the next version unless there's objections. >, so > there is also the argument to move all this stuff into a new contrib/ > module, and have a bgworker implementation as part of it as it would > need shared_preload_libraries anyway. > > Also, I have been putting some thoughts into the APIs able to fetch a > buffer without going through the shared buffers. And neither > checksum.c, because it should be a place that those APIs depends on > and include only the most-internal logic for checksum algorithm and > computation, nor checksumfuncs.c, because we need to think about the > case of a background worker as well (that could spawn a set of dynamic > workers connecting to different databases able to do checksum > verifications?). It would be good to keep the handling of the buffer > mapping lock as well as the calls to smgrread() into a single place. > ReadBuffer_common() is a natural place for that, though it means for > our use case the addition of three new options: > - Being able to pass down directly a buffer pointer to save the page > contents. > - Being able to not verify directly a page, leaving the verification > to the caller upthread. > - Addition of a new mode, that I am calling here RBM_PRIVATE, where we > actually read the page from disk if not yet in shared buffers, except > that we fill in the contents of the page using the pointer given by > the caller. That's different than the use of local buffers, as we > don't need this much amount of complications like temporary tables of > course for per-page checks. > > Another idea would be to actually just let ReadBuffer_common just do > the check by itself, with a different mode like a kind of > RBM_VALIDATE, where we just return a verification state of the page > that can be consumed by callers. > > This also comes with some more advantages: > - Tracking of reads from disk with track_io_timing. > - Addition of some private stats dedicated to this private mode, with > new fields in pgBufferUsage, all in a single place I agree that putting the code nearby ReadBuffer_common() would be a good idea. However, that means that I can't move all the code to contrib/ I'm wondering what you'd like to see going there. I can see some values in also having the SQL functions available in core rather than contrib, e.g. if you need to quickly check a relation on a standby, so without requiring to create the extension on the primary node first. Then, I'm a bit worried about adding this code in ReadBuffer_common. What this code does is quite different, and I'm afraid that it'll make ReadBuffer_common more complex than needed, which is maybe not a good idea for something as critical as this
Re: Online checksums verification in the backend
On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote: > Thanks! I got some numbers out of my pocket, using the following base configuration: wal_level = minimal fsync = off shared_buffers = '300MB' # also tested with 30MB and 60MB checksum_cost_delay = 0 # default in patch And for the test I have used pgbench initialized at a scale of 250, to have close to 3.5GB of data, so as it gives us a sort of 90% buffer eviction, with all the data cached in the OS (we may want to look as well at the case where the OS cache does not hold all the relation pages). I have also run some tests with 30MB and 60MB of shared buffers, for similar results. I also applied some prewarming on the cluster: create extension pg_prewarm select pg_prewarm(oid) from pg_class where oid > 16000; Then, I have done five runs using that: pgbench -S -M prepared -c 64/128/256 -n -T 60 In parallel of that, I got this stuff running in parallel all the time: select pg_check_relation('pgbench_accounts'); \watch 0.1 Here are some TPS numbers with the execution time of pg_check_relation. In the five runs, I removed the highest and lowest ones, then took an average of the remaining three. I have also tested two modes: with and without the optimization, that requires a one-liner in checksum.c as per your latest patch: --- a/src/backend/storage/page/checksum.c +++ b/src/backend/storage/page/checksum.c @@ -84,7 +84,7 @@ check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno, uint16 *chk_expected, uint16 *chk_found) { charbuffer[BLCKSZ]; -boolforce_lock = false; +boolforce_lock = true; boolfound_in_sb; Within parenthesis is the amount of time taken by pg_relation_check() for a single check. This is not completely exact and I saw some variations, just to give an impression: Conns 64 128 256 force_lock=true 60590 (7~8s) 55652 (10.2~10.5s) 46812 (9~10s) force_lock=false 58637 (5s)54131 (6~7s) 37091 (1.1~1.2s) For connections higher than 128, I was kind of surprised to see pg_relation_check being more aggressive without the optimization, with much less contention on the buffer mapping LWLock actually, but that's an impression from looking at pg_stat_activity. Looking at the wait events for each run, I saw much more hiccups with the buffer mapping LWLock when forcing the lock rather than not, still I was able to see some contention when also not forcing the lock. Not surprising as this rejects a bunch of pages from shared buffers. > I used all default settings, but by default checksum_cost_delay is 0 > so there shouldn't be any throttling. Thanks, so did I. From what I can see, there could be as well benefits in not using the optimization by default so as the relation check applies some natural throttling by making the checks actually slower (there is a link between the individual runtime of pg_relation_time and the TPS). So it also seems to me that the throttling parameters would be beneficial, but it looks to me that there is as well a point to not include any throttling in a first version if the optimization to go full speed is not around. Using three new GUCs for those function calls is still too much IMO, so there is also the argument to move all this stuff into a new contrib/ module, and have a bgworker implementation as part of it as it would need shared_preload_libraries anyway. Also, I have been putting some thoughts into the APIs able to fetch a buffer without going through the shared buffers. And neither checksum.c, because it should be a place that those APIs depends on and include only the most-internal logic for checksum algorithm and computation, nor checksumfuncs.c, because we need to think about the case of a background worker as well (that could spawn a set of dynamic workers connecting to different databases able to do checksum verifications?). It would be good to keep the handling of the buffer mapping lock as well as the calls to smgrread() into a single place. ReadBuffer_common() is a natural place for that, though it means for our use case the addition of three new options: - Being able to pass down directly a buffer pointer to save the page contents. - Being able to not verify directly a page, leaving the verification to the caller upthread. - Addition of a new mode, that I am calling here RBM_PRIVATE, where we actually read the page from disk if not yet in shared buffers, except that we fill in the contents of the page using the pointer given by the caller. That's different than the use of local buffers, as we don't need this much amount of complications like temporary tables of course for per-page checks. Another idea would be to actually just let ReadBuffer_common just do the check by itself, with a different mode like a kind of RBM_VALIDATE, where we just return a verification state of the page that can be consumed by callers. This also comes
Re: Online checksums verification in the backend
On Fri, Sep 11, 2020 at 9:34 AM Michael Paquier wrote: > > On Thu, Sep 10, 2020 at 08:06:10PM +0200, Julien Rouhaud wrote: > > The TPS is obviously overall extremely bad, but I can see that the submitted > > version added an overhead of ~3.9% (average of 5 runs), while the version > > without the optimisation added an overhead of ~6.57%. > > Okay, so that stands as a difference. I am planning to run some > benchmarks on my end as well, and see if I can see a clear > difference. Thanks! > > This is supposed to be a relatively fair benchmark as all the data are > > cached > > on the OS side, so IO done while holding the bufmapping lock aren't too > > long, > > but we can see that we already get a non negligible benefit from this > > optimisation. Should I do additional benchmarking, like dropping the OS > > cache > > and/or adding some write activity? This would probably only make the > > unoptimized version perform even worse. > > It would be also interesting to see the case where the pages are not > in the OS cache and see how bad it can get. For the read-write case, > I am not sure as we may have some different overhead that hide the > noise. Also, did you run your tests with the functions scanning at > full speed, with (ChecksumCostDelay < 0) so as there is no throttling? I used all default settings, but by default checksum_cost_delay is 0 so there shouldn't be any throttling.
Re: Online checksums verification in the backend
On Thu, Sep 10, 2020 at 08:06:10PM +0200, Julien Rouhaud wrote: > The TPS is obviously overall extremely bad, but I can see that the submitted > version added an overhead of ~3.9% (average of 5 runs), while the version > without the optimisation added an overhead of ~6.57%. Okay, so that stands as a difference. I am planning to run some benchmarks on my end as well, and see if I can see a clear difference. > This is supposed to be a relatively fair benchmark as all the data are cached > on the OS side, so IO done while holding the bufmapping lock aren't too long, > but we can see that we already get a non negligible benefit from this > optimisation. Should I do additional benchmarking, like dropping the OS cache > and/or adding some write activity? This would probably only make the > unoptimized version perform even worse. It would be also interesting to see the case where the pages are not in the OS cache and see how bad it can get. For the read-write case, I am not sure as we may have some different overhead that hide the noise. Also, did you run your tests with the functions scanning at full speed, with (ChecksumCostDelay < 0) so as there is no throttling? -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Thu, Sep 10, 2020 at 09:47:23AM +0200, Julien Rouhaud wrote: > On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote: > > On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote: > > > > > > I'll do some becnhmarking and see if I can get some figures, but it'll > > > probably > > > take some time. In the meantime I'm attaching v11 of the patch that > > > should > > > address all other comments. > > > > I just realized that I forgot to update one of the Makefile when moving the > > TAP > > test folder. v12 attached should fix that. > > > And the cfbot just reported a new error for Windows build. Attached v13 > should > fix that. I did some benchmarking using the following environnment: - 16MB shared buffers - 490MB table (10M rows) - synchronized_seqscan to off - table in OS cache I don't have a big machine so I went with a very small shared_buffers and a small table, to make sure that all data is in OS cache but the table more than an order bigger than the shared_buffers, to simulate some plausible environment. I used a simple read only query that performs a sequential scan of the table (a simple SELECT * FROM table), run using 10 concurrent connections, 5 runs of 700 seconds. I did that without any other activity, with a \watch of the original pg_check_relation function using \watch .1, and a modified version of that function without the optimisation, still with a \watch .1 The TPS is obviously overall extremely bad, but I can see that the submitted version added an overhead of ~3.9% (average of 5 runs), while the version without the optimisation added an overhead of ~6.57%. This is supposed to be a relatively fair benchmark as all the data are cached on the OS side, so IO done while holding the bufmapping lock aren't too long, but we can see that we already get a non negligible benefit from this optimisation. Should I do additional benchmarking, like dropping the OS cache and/or adding some write activity? This would probably only make the unoptimized version perform even worse.
Re: Online checksums verification in the backend
On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote: > On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote: > > > > I'll do some becnhmarking and see if I can get some figures, but it'll > > probably > > take some time. In the meantime I'm attaching v11 of the patch that should > > address all other comments. > > I just realized that I forgot to update one of the Makefile when moving the > TAP > test folder. v12 attached should fix that. And the cfbot just reported a new error for Windows build. Attached v13 should fix that. >From 2e08641e25ccd21bf6e4870085eb8b6b741027c1 Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v13] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/config.sgml | 85 + doc/src/sgml/func.sgml| 51 +++ src/backend/postmaster/pgstat.c | 3 + src/backend/storage/page/checksum.c | 322 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 218 src/backend/utils/init/globals.c | 7 + src/backend/utils/misc/guc.c | 33 ++ src/backend/utils/misc/postgresql.conf.sample | 6 + src/include/catalog/pg_proc.dat | 16 + src/include/miscadmin.h | 7 + src/include/pgstat.h | 3 +- src/include/utils/checksumfuncs.h | 31 ++ src/include/utils/guc_tables.h| 1 + src/test/modules/Makefile | 1 + src/test/modules/check_relation/.gitignore| 2 + src/test/modules/check_relation/Makefile | 14 + src/test/modules/check_relation/README| 23 ++ .../check_relation/t/001_checksums_check.pl | 276 +++ src/tools/msvc/Mkvcbuild.pm | 3 +- 20 files changed, 1098 insertions(+), 5 deletions(-) create mode 100644 src/backend/utils/adt/checksumfuncs.c create mode 100644 src/include/utils/checksumfuncs.h create mode 100644 src/test/modules/check_relation/.gitignore create mode 100644 src/test/modules/check_relation/Makefile create mode 100644 src/test/modules/check_relation/README create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c4ba49ffaf..b7629fde60 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2137,6 +2137,91 @@ include_dir 'conf.d' + + Cost-based Checksum Verification Delay + + + During the execution of + function, the system maintains an internal counter that keeps track of + the estimated cost of the various I/O operations that are performed. + When the accumulated cost reaches a limit (specified by + checksum_cost_limit), the process performing the + operation will sleep for a short period of time, as specified by + checksum_cost_delay. Then it will reset the counter + and continue execution. + + + + This feature is disabled by default. To enable it, set the + checksum_cost_delay variable to a nonzero + value. + + + + + checksum_cost_delay (floating point) + +checksum_cost_delay configuration parameter + + + + + The amount of time that the process will sleep + when the cost limit has been exceeded. + If this value is specified without units, it is taken as milliseconds. + The default value is zero, which disables the cost-based checksum + verification delay feature. Positive values enable cost-based + checksum verification. + + + + + + checksum_cost_page (integer) + +checksum_cost_page configuration parameter + + + + + The estimated cost for verifying a buffer, whether it's found in the + shared buffer cache or not. It represents the cost to lock the buffer + pool, lookup the shared hash table, read the content of the page from + disk and compute its checksum. The default value is 10. + + + + + + checksum_cost_limit (integer) + +checksum_cost_limit configuration parameter + + + + + The accumulated cost that will cause the verification process to sleep. + The default value is 200. + + + + + +
Re: Online checksums verification in the backend
On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote: > > I'll do some becnhmarking and see if I can get some figures, but it'll > probably > take some time. In the meantime I'm attaching v11 of the patch that should > address all other comments. I just realized that I forgot to update one of the Makefile when moving the TAP test folder. v12 attached should fix that. >From 84d0a4c5ec5d6b6f832fad02d421c204f1bee98b Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v12] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/config.sgml | 85 + doc/src/sgml/func.sgml| 51 +++ src/backend/postmaster/pgstat.c | 3 + src/backend/storage/page/checksum.c | 322 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 218 src/backend/utils/init/globals.c | 7 + src/backend/utils/misc/guc.c | 33 ++ src/backend/utils/misc/postgresql.conf.sample | 6 + src/include/catalog/pg_proc.dat | 16 + src/include/miscadmin.h | 7 + src/include/pgstat.h | 3 +- src/include/utils/checksumfuncs.h | 31 ++ src/include/utils/guc_tables.h| 1 + src/test/modules/Makefile | 1 + src/test/modules/check_relation/.gitignore| 2 + src/test/modules/check_relation/Makefile | 14 + src/test/modules/check_relation/README| 23 ++ .../check_relation/t/001_checksums_check.pl | 276 +++ 19 files changed, 1096 insertions(+), 4 deletions(-) create mode 100644 src/backend/utils/adt/checksumfuncs.c create mode 100644 src/include/utils/checksumfuncs.h create mode 100644 src/test/modules/check_relation/.gitignore create mode 100644 src/test/modules/check_relation/Makefile create mode 100644 src/test/modules/check_relation/README create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c4ba49ffaf..b7629fde60 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2137,6 +2137,91 @@ include_dir 'conf.d' + + Cost-based Checksum Verification Delay + + + During the execution of + function, the system maintains an internal counter that keeps track of + the estimated cost of the various I/O operations that are performed. + When the accumulated cost reaches a limit (specified by + checksum_cost_limit), the process performing the + operation will sleep for a short period of time, as specified by + checksum_cost_delay. Then it will reset the counter + and continue execution. + + + + This feature is disabled by default. To enable it, set the + checksum_cost_delay variable to a nonzero + value. + + + + + checksum_cost_delay (floating point) + +checksum_cost_delay configuration parameter + + + + + The amount of time that the process will sleep + when the cost limit has been exceeded. + If this value is specified without units, it is taken as milliseconds. + The default value is zero, which disables the cost-based checksum + verification delay feature. Positive values enable cost-based + checksum verification. + + + + + + checksum_cost_page (integer) + +checksum_cost_page configuration parameter + + + + + The estimated cost for verifying a buffer, whether it's found in the + shared buffer cache or not. It represents the cost to lock the buffer + pool, lookup the shared hash table, read the content of the page from + disk and compute its checksum. The default value is 10. + + + + + + checksum_cost_limit (integer) + +checksum_cost_limit configuration parameter + + + + + The accumulated cost that will cause the verification process to sleep. + The default value is 200. + + + + + + + + There are certain operations that hold critical locks and should + therefore complete as quickly as possible. Cost-based checksum + verification delays do not occur during such operations. Therefore
Re: Online checksums verification in the backend
On Wed, Sep 9, 2020 at 2:37 PM Michael Paquier wrote: > > Another thing that was itching me is the introduction of 3 GUCs with > one new category for the sake of two SQL functions. For VACUUM we > have many things relying on the GUC delays, with autovacuum and manual > vacuum. Perhaps it would make sense to have these if we have some day > a background worker doing checksum verifications, still that could > perfectly be in contrib/, and that would be kind of hard to tune as > well. The patch enabling checksums on-the-fly could also be a reason > good enough. Another thing we could consider is to pass down those > parameters as function arguments, at the cost of not being able to > reload them. I'm not terribly happy with adding that for now, but it's quite clear that there'll eventually be a lot of new stuff added that will benefit from either the category or the GUC. FTR once we reach an agreement on how to do this check (I'm wondering if it'll stay an SQL function or become a plain backend command, in which case GUCs would be mandatory), I'll also be happy to work on a background worker to help people running the check regularly. So in my opinion it's better to add them now so we won't have to change the sql function definition later when other facilities will rely on them.
Re: Online checksums verification in the backend
On Wed, Sep 09, 2020 at 11:25:24AM +0200, Julien Rouhaud wrote: > I assumed that the odds of having to check the buffer twice were so low, and > avoiding to keep a bufmapping lock while doing some IO was an uncontroversial > enough optimisation, but maybe that's only wishful thinking. Perhaps it is worth it, so it would be good to make sure of it and see if that's actually worth the extra complication. This also depends if the page is in the OS cache if the page is not in shared buffers, meaning that smgrread() is needed to fetch the page to check. I would be more curious to see if there is an actual difference if the page is the OS cache. > I'll do some becnhmarking and see if I can get some figures, but it'll > probably > take some time. In the meantime I'm attaching v11 of the patch that should > address all other comments. Thanks. Another thing that was itching me is the introduction of 3 GUCs with one new category for the sake of two SQL functions. For VACUUM we have many things relying on the GUC delays, with autovacuum and manual vacuum. Perhaps it would make sense to have these if we have some day a background worker doing checksum verifications, still that could perfectly be in contrib/, and that would be kind of hard to tune as well. The patch enabling checksums on-the-fly could also be a reason good enough. Another thing we could consider is to pass down those parameters as function arguments, at the cost of not being able to reload them. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Mon, Sep 07, 2020 at 05:50:38PM +0900, Michael Paquier wrote: > On Mon, Sep 07, 2020 at 09:38:30AM +0200, Julien Rouhaud wrote: > > Did you mean creating a new checksumfuncs.c file? I don't find any > > such file in the current tree. > > Your patch adds checksumfuncs.c, so the subroutines grabbing a given > block could just be moved there. > Sorry, I was in the middle of a rebase for another patch and missed the new files added in this one. I added a new checksumfuncs.h for the required include that should not be seen by client code. I kept checksumfuncs.c and checksums.c so that the SQL visible declaration are separated from the rest of the implementation as this is what we already do elsewhere I think. If that's a problem I'll change and put everything in checksumfuncs.[ch]. I also moved the tap tests in src/test/modules and renamed the file with a 3 digits. For the record I initially copied src/test/modules/brin, and this is apparently the only subdir that has a 2 digits pattern. I also added a new WAIT_EVENT_CHECK_DELAY wait event. > > I'm not sure I understand. Unless I missed something this approach > > *cannot* raise a false positive. What it does is force a 2nd check > > with stronger lock *to make sure it's actually a corruption*, so we > > don't raise false positive. The only report that can happen in this > > 1st loop is if smgread raises an error, which AFAICT can only happen > > (at least with mdread) if the whole block couldn't be read, which is a > > sign of a very bad problem. This should clearly be reported, as this > > cannot be caused by the locking heuristics used here. > > We don't know how much this optimization matters though? Could it be > possible to get an idea of that? For example, take the case of one > relation with a fixed size in a read-only workload and a read-write > workload (as long as autovacuum and updates make the number of > relation blocks rather constant for the read-write case), doing a > checksum verification in parallel of multiple clients working on the > relation concurrently. Assuming that the relation is fully in the OS > cache, we could get an idea of the impact with multiple > (shared_buffers / relation size) rates to make the eviction more > aggressive? The buffer partition locks, knowing that > NUM_BUFFER_PARTITIONS caps that, should be the bottleneck, still it > seems to me that it would be good to see if we have a difference. > What do you think? I assumed that the odds of having to check the buffer twice were so low, and avoiding to keep a bufmapping lock while doing some IO was an uncontroversial enough optimisation, but maybe that's only wishful thinking. I'll do some becnhmarking and see if I can get some figures, but it'll probably take some time. In the meantime I'm attaching v11 of the patch that should address all other comments. >From 08b28331e11160ab842b7b72a61c4478e7b3561a Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v11] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/config.sgml | 85 + doc/src/sgml/func.sgml| 51 +++ src/backend/postmaster/pgstat.c | 3 + src/backend/storage/page/checksum.c | 322 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 218 src/backend/utils/init/globals.c | 7 + src/backend/utils/misc/guc.c | 33 ++ src/backend/utils/misc/postgresql.conf.sample | 6 + src/include/catalog/pg_proc.dat | 16 + src/include/miscadmin.h | 7 + src/include/pgstat.h | 3 +- src/include/utils/checksumfuncs.h | 31 ++ src/include/utils/guc_tables.h| 1 + src/test/Makefile | 3 +- src/test/modules/Makefile | 1 + src/test/modules/check_relation/.gitignore| 2 + src/test/modules/check_relation/Makefile | 14 + src/test/modules/check_relation/README| 23 ++ .../check_relation/t/001_checksums_check.pl | 276 +++ 20 files changed, 1098 insertions(+), 5 deletions(-) create mode 100644 src/backend/utils/adt/checksumfuncs.c create mode 100644 src/include/utils/checksumfuncs.h create mode 100644 src/test/modules/check_relation/.gitignore create mode 100644 src/test/modules/check_relation/Makefile create mode 100644 src/test/modules/check_relation/README create mode
Re: Online checksums verification in the backend
On Tue, Sep 08, 2020 at 11:36:45AM +0900, Masahiko Sawada wrote: > On Mon, 7 Sep 2020 at 15:59, Michael Paquier wrote: >> In this case it could be better to unregister from the >> CF app for this entry. > > Well, I sent review comments on this patch and Julien fixed all > comments. So I’d marked this as Ready for Committer since I didn't > have further comments at that time, and I was waiting for the > committer review. I'll look at this patch again but should I remove my > name from the reviewer after that if no comments? Ah, sorry, I somewhat missed the previous status of the patch. Perhaps that's an overdose of CF. Keeping your name as reviewer is fine I guess. I have begun looking at the patch and spotted some issues, so let's see where we do from here. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Mon, 7 Sep 2020 at 15:59, Michael Paquier wrote: > > On Tue, Jul 14, 2020 at 11:08:08AM +0200, Julien Rouhaud wrote: > > On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote: > >> Small language fixes in comments and user-facing docs. > > > > Thanks a lot! I just fixed a small issue (see below), PFA updated v10. > > Sawada-san, you are registered as a reviewer of this patch. Are you > planning to look at it? If you are busy lately, that's fine as well > (congrats!). Thanks! > In this case it could be better to unregister from the > CF app for this entry. Well, I sent review comments on this patch and Julien fixed all comments. So I’d marked this as Ready for Committer since I didn't have further comments at that time, and I was waiting for the committer review. I'll look at this patch again but should I remove my name from the reviewer after that if no comments? Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Online checksums verification in the backend
On Mon, Sep 07, 2020 at 09:38:30AM +0200, Julien Rouhaud wrote: > Did you mean creating a new checksumfuncs.c file? I don't find any > such file in the current tree. Your patch adds checksumfuncs.c, so the subroutines grabbing a given block could just be moved there. > I'm not sure I understand. Unless I missed something this approach > *cannot* raise a false positive. What it does is force a 2nd check > with stronger lock *to make sure it's actually a corruption*, so we > don't raise false positive. The only report that can happen in this > 1st loop is if smgread raises an error, which AFAICT can only happen > (at least with mdread) if the whole block couldn't be read, which is a > sign of a very bad problem. This should clearly be reported, as this > cannot be caused by the locking heuristics used here. We don't know how much this optimization matters though? Could it be possible to get an idea of that? For example, take the case of one relation with a fixed size in a read-only workload and a read-write workload (as long as autovacuum and updates make the number of relation blocks rather constant for the read-write case), doing a checksum verification in parallel of multiple clients working on the relation concurrently. Assuming that the relation is fully in the OS cache, we could get an idea of the impact with multiple (shared_buffers / relation size) rates to make the eviction more aggressive? The buffer partition locks, knowing that NUM_BUFFER_PARTITIONS caps that, should be the bottleneck, still it seems to me that it would be good to see if we have a difference. What do you think? -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Mon, Sep 7, 2020 at 8:59 AM Michael Paquier wrote: > > +#include "postgres.h" > + > +#include "access/tupdesc.h" > +#include "common/relpath.h" > #include "storage/block.h" > +#include "utils/relcache.h" > +#include "utils/tuplestore.h" > [...] > +extern bool check_one_block(Relation relation, ForkNumber forknum, > + BlockNumber blkno, uint16 *chk_expected, > + uint16 *chk_found); > I don't think that it is a good idea to add this much to checksum.h > as these are includes coming mainly from the backend. Note that > pg_checksum_page() is a function designed to be also available for > frontend tools, with checksum.h something that can be included in > frontends. This would mean that we could move all the buffer lookup > APIs directly to checksumfuncs.c, or move that into a separate file > closer to the location. Did you mean creating a new checksumfuncs.c file? I don't find any such file in the current tree. > + * A zero checksum can never be computed, see pg_checksum_page() */ > +#define NoComputedChecksum 0 > Wouldn't it be better to rename that something like > InvalidPageChecksum, and make use of it in pg_checksum_page()? It > would be more consistent with the naming of transaction IDs, OIDs or > even XLogRecPtr. And that could be a separate patch. It seems quite ambiguous, as checksum validity usually has a different meaning. And in the code added here, the meaning isn't that the ckecksum is invalid but that there's no checsum as it cannot be computed due to PageIsNew(). > +++ b/src/test/check_relation/t/01_checksums_check.pl > @@ -0,0 +1,276 @@ > +use strict; > +use warnings; > It could be better to move that to src/test/modules/, so as it could > be picked more easily by MSVC scripts in the future. Note that if we > apply the normal naming convention here this should be named > 001_checksum_check.pl. > > +subdir = src/test/check_relation > +top_builddir = ../../.. > +include $(top_builddir)/src/Makefile.global > Let's use a Makefile shaped in a way similar to modules/test_misc that > makes use of TAP_TESTS = 1. There is the infra, let's rely on it for > the regression tests. Will fix. > + pg_usleep(msec * 1000L); > Could it be possible to add a wait event here? It would be nice to be > able to monitor that in pg_stat_activity. Sure, I missed that as this was first implemented as an extension. > +if (exists $ENV{MY_PG_REGRESS}) > +{ > + $ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS}; > +} > What is MY_PG_REGRESS for? A remnant from an external makefile > perhaps? Indeed. > + /* > +* If we get a failure and the buffer wasn't found in shared buffers, > +* reread the buffer with suitable lock to avoid false positive. See > +* check_get_buffer for more details. > +*/ > + if (!found_in_sb && !force_lock) > + { > + force_lock = true; > + goto Retry; > + } > As designed, we have a risk of false positives with a torn page in the > first loop when trying to look for a given buffer as we would try to > use smgrread() without a partition lock. This stresses me a bit, and > false positives could scare users easily. Could we consider first a > safer approach where we don't do that, and just read the page while > holding the partition lock? OK, that would be more expensive, but at > least that's safe in any case. My memory of this patch is a bit > fuzzy, but this is itching me and this is the heart of the problem > dealt with here :) I'm not sure I understand. Unless I missed something this approach *cannot* raise a false positive. What it does is force a 2nd check with stronger lock *to make sure it's actually a corruption*, so we don't raise false positive. The only report that can happen in this 1st loop is if smgread raises an error, which AFAICT can only happen (at least with mdread) if the whole block couldn't be read, which is a sign of a very bad problem. This should clearly be reported, as this cannot be caused by the locking heuristics used here.
Re: Online checksums verification in the backend
On Tue, Jul 14, 2020 at 11:08:08AM +0200, Julien Rouhaud wrote: > On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote: >> Small language fixes in comments and user-facing docs. > > Thanks a lot! I just fixed a small issue (see below), PFA updated v10. Sawada-san, you are registered as a reviewer of this patch. Are you planning to look at it? If you are busy lately, that's fine as well (congrats!). In this case it could be better to unregister from the CF app for this entry. I am refreshing my mind here, but here are some high-level comments for now... +#include "postgres.h" + +#include "access/tupdesc.h" +#include "common/relpath.h" #include "storage/block.h" +#include "utils/relcache.h" +#include "utils/tuplestore.h" [...] +extern bool check_one_block(Relation relation, ForkNumber forknum, + BlockNumber blkno, uint16 *chk_expected, + uint16 *chk_found); I don't think that it is a good idea to add this much to checksum.h as these are includes coming mainly from the backend. Note that pg_checksum_page() is a function designed to be also available for frontend tools, with checksum.h something that can be included in frontends. This would mean that we could move all the buffer lookup APIs directly to checksumfuncs.c, or move that into a separate file closer to the location. + * A zero checksum can never be computed, see pg_checksum_page() */ +#define NoComputedChecksum 0 Wouldn't it be better to rename that something like InvalidPageChecksum, and make use of it in pg_checksum_page()? It would be more consistent with the naming of transaction IDs, OIDs or even XLogRecPtr. And that could be a separate patch. +++ b/src/test/check_relation/t/01_checksums_check.pl @@ -0,0 +1,276 @@ +use strict; +use warnings; It could be better to move that to src/test/modules/, so as it could be picked more easily by MSVC scripts in the future. Note that if we apply the normal naming convention here this should be named 001_checksum_check.pl. +subdir = src/test/check_relation +top_builddir = ../../.. +include $(top_builddir)/src/Makefile.global Let's use a Makefile shaped in a way similar to modules/test_misc that makes use of TAP_TESTS = 1. There is the infra, let's rely on it for the regression tests. + pg_usleep(msec * 1000L); Could it be possible to add a wait event here? It would be nice to be able to monitor that in pg_stat_activity. +if (exists $ENV{MY_PG_REGRESS}) +{ + $ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS}; +} What is MY_PG_REGRESS for? A remnant from an external makefile perhaps? + /* +* If we get a failure and the buffer wasn't found in shared buffers, +* reread the buffer with suitable lock to avoid false positive. See +* check_get_buffer for more details. +*/ + if (!found_in_sb && !force_lock) + { + force_lock = true; + goto Retry; + } As designed, we have a risk of false positives with a torn page in the first loop when trying to look for a given buffer as we would try to use smgrread() without a partition lock. This stresses me a bit, and false positives could scare users easily. Could we consider first a safer approach where we don't do that, and just read the page while holding the partition lock? OK, that would be more expensive, but at least that's safe in any case. My memory of this patch is a bit fuzzy, but this is itching me and this is the heart of the problem dealt with here :) -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote: > Small language fixes in comments and user-facing docs. Thanks a lot! I just fixed a small issue (see below), PFA updated v10. > > diff --git a/src/backend/storage/page/checksum.c > b/src/backend/storage/page/checksum.c > index eb2c919c34..17cd95ec95 100644 > --- a/src/backend/storage/page/checksum.c > +++ b/src/backend/storage/page/checksum.c > @@ -36,7 +36,7 @@ > * actual storage, you have to discard the operating system cache before > * running those functions. > * > - * To avoid torn page and possible false positive when reading data, and > + * To avoid torn pages and possible false positives when reading data, and to > * keeping overhead as low as possible, the following heuristics are used: > * Changed for "to keep". >From ff8b384d3acdee2986543742c22af9de0335ff4f Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v10] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/config.sgml | 85 + doc/src/sgml/func.sgml| 51 +++ src/backend/storage/page/checksum.c | 318 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 217 src/backend/utils/init/globals.c | 7 + src/backend/utils/misc/guc.c | 33 ++ src/backend/utils/misc/postgresql.conf.sample | 6 + src/include/catalog/pg_proc.dat | 16 + src/include/miscadmin.h | 7 + src/include/storage/checksum.h| 13 + src/include/utils/guc_tables.h| 1 + src/test/Makefile | 3 +- src/test/check_relation/.gitignore| 2 + src/test/check_relation/Makefile | 23 ++ src/test/check_relation/README| 23 ++ .../check_relation/t/01_checksums_check.pl| 276 +++ 17 files changed, 1078 insertions(+), 4 deletions(-) create mode 100644 src/backend/utils/adt/checksumfuncs.c create mode 100644 src/test/check_relation/.gitignore create mode 100644 src/test/check_relation/Makefile create mode 100644 src/test/check_relation/README create mode 100644 src/test/check_relation/t/01_checksums_check.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index b353c61683..f6a03ff931 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2038,6 +2038,91 @@ include_dir 'conf.d' + + Cost-based Checksum Verification Delay + + + During the execution of + function, the system maintains an internal counter that keeps track of + the estimated cost of the various I/O operations that are performed. + When the accumulated cost reaches a limit (specified by + checksum_cost_limit), the process performing the + operation will sleep for a short period of time, as specified by + checksum_cost_delay. Then it will reset the counter + and continue execution. + + + + This feature is disabled by default. To enable it, set the + checksum_cost_delay variable to a nonzero + value. + + + + + checksum_cost_delay (floating point) + +checksum_cost_delay configuration parameter + + + + + The amount of time that the process will sleep + when the cost limit has been exceeded. + If this value is specified without units, it is taken as milliseconds. + The default value is zero, which disables the cost-based checksum + verification delay feature. Positive values enable cost-based + checksum verification. + + + + + + checksum_cost_page (integer) + +checksum_cost_page configuration parameter + + + + + The estimated cost for verifying a buffer, whether it's found in the + shared buffer cache or not. It represents the cost to lock the buffer + pool, lookup the shared hash table, read the content of the page from + disk and compute its checksum. The default value is 10. + + + + + + checksum_cost_limit (integer) + +checksum_cost_limit configuration parameter + + + + + The accumulated cost that will cause the verification process to sleep. + The default value is 200. + + + + + +
Re: Online checksums verification in the backend
Small language fixes in comments and user-facing docs. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 88efb38556..39596db193 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -26162,7 +26162,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8'); The functions shown in -provide means to check for health of data file in a cluster. +provide a means to check for health of a data file in a cluster. @@ -26179,8 +26179,8 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8'); pg_check_relation(relation regclass [, fork text]) setof record - Validate the checksums for all blocks of all or the given fork of - a given relation. + Validate the checksum for all blocks of a relation. + @@ -26190,15 +26190,15 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8'); pg_check_relation -pg_check_relation iterates over all the blocks of a -given relation and verify their checksum. If provided, -fork should be 'main' for the +pg_check_relation iterates over all blocks of a +given relation and verifies their checksums. If passed, +fork specifies that only checksums of the given +fork are to be verified. Fork should be 'main' for the main data fork, 'fsm' for the free space map, 'vm' for the visibility map, or -'init' for the initialization fork, and only this -specific fork will be verifies, otherwise all forks will. The function -returns the list of blocks for which the found checksum doesn't match the -expected one. See 'init' for the initialization fork. +The function returns a list of blocks for which the computed and stored +checksums don't match. See for information on how to configure cost-based verification delay. You must be a member of the pg_read_all_stats role to use this diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c index eb2c919c34..17cd95ec95 100644 --- a/src/backend/storage/page/checksum.c +++ b/src/backend/storage/page/checksum.c @@ -36,7 +36,7 @@ * actual storage, you have to discard the operating system cache before * running those functions. * - * To avoid torn page and possible false positive when reading data, and + * To avoid torn pages and possible false positives when reading data, and to * keeping overhead as low as possible, the following heuristics are used: * * - a shared LWLock is taken on the target buffer pool partition mapping, and @@ -92,8 +92,8 @@ check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno, *chk_expected = *chk_found = NoComputedChecksum; /* -* To avoid too much overhead, the buffer will be first read without -* the locks that would guarantee the lack of false positive, as such +* To avoid excessive overhead, the buffer will be first read without +* the locks that would prevent false positives, as such * events should be quite rare. */ Retry: @@ -120,10 +120,10 @@ Retry: } /* - * Perform a checksum check on the passed page. Returns whether the page is + * Perform a checksum check on the passed page. Return True iff the page is * valid or not, and assign the expected and found checksum in chk_expected and * chk_found, respectively. Note that a page can look like new but could be - * the result of a corruption. We still check for this case, but we can't + * the result of corruption. We still check for this case, but we can't * compute its checksum as pg_checksum_page() is explicitly checking for * non-new pages, so NoComputedChecksum will be set in chk_found. */ @@ -139,7 +139,7 @@ check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected, if (PageIsNew(page)) { /* -* Check if the page is really new or if there's a corruption that +* Check if the page is really new or if there's corruption that * affected PageIsNew detection. Note that PageIsVerified won't try to * detect checksum corruption in this case, so there's no risk of * duplicated corruption report. @@ -151,7 +151,7 @@ check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected, } /* -* There's a corruption, but since this affect PageIsNew, we +* There's corruption, but since this affects PageIsNew, we * can't compute a checksum, so set NoComputedChecksum for the * expected checksum. */ @@ -218,8 +218,8 @@ check_delay_point(void) * held. Reading with this lock is to avoid the unlikely but possible case * that a buffer wasn't present in shared buffers when we checked but it then * alloc'ed in shared_buffers, modified and
Re: Online checksums verification in the backend
> On 5 Apr 2020, at 13:17, Julien Rouhaud wrote: > On Sun, Apr 05, 2020 at 08:01:36PM +0900, Masahiko Sawada wrote: >> Thank you for updating the patch! The patch looks good to me. >> >> I've marked this patch as Ready for Committer. I hope this patch will >> get committed to PG13. > Thanks a lot! This patch has been through quite thorough review, and skimming the thread all concerns raised have been addressed. It still applies and tests gree in the CF Patchtester. The feature in itself certainly gets my +1 for inclusion, it seems a good addition. Is any committer who has taken part in the thread (or anyone else for that matter) interested in seeing this to some form of closure in this CF? cheers ./daniel
Re: Online checksums verification in the backend
On Sun, Apr 05, 2020 at 08:01:36PM +0900, Masahiko Sawada wrote: > On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud wrote: > > > > On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote: > > > > > > Why do we need two rows in the doc? For instance, replication slot > > > functions have some optional arguments but there is only one row in > > > the doc. So I think we don't need to change the doc from the previous > > > version patch. > > > > > > > I thought that if we document the function as pg_check_relation(regclass [, > > fork]) users could think that the 2nd argument is optional, so that > > pg_check_relation('something', NULL) could be a valid alias for the > > 1-argument > > form, which it isn't. After checking, I see that e.g. current_setting has > > the > > same semantics and is documented the way you suggest, so fixed back to > > previous > > version. > > > > > And I think these are not necessary as we already defined in > > > include/catalog/pg_proc.dat: > > > > > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > > + IN relation regclass, > > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > > + OUT expected_checksum integer, OUT found_checksum integer) > > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS > > > 'pg_check_relation' > > > + PARALLEL RESTRICTED; > > > + > > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > > + IN relation regclass, IN fork text, > > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > > + OUT expected_checksum integer, OUT found_checksum integer) > > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal > > > + AS 'pg_check_relation_fork' > > > + PARALLEL RESTRICTED; > > > > > > > Oh right this isn't required since there's no default value anymore, fixed. > > > > v9 attached. > > Thank you for updating the patch! The patch looks good to me. > > I've marked this patch as Ready for Committer. I hope this patch will > get committed to PG13. > Thanks a lot!
Re: Online checksums verification in the backend
On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud wrote: > > On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote: > > > > Why do we need two rows in the doc? For instance, replication slot > > functions have some optional arguments but there is only one row in > > the doc. So I think we don't need to change the doc from the previous > > version patch. > > > > I thought that if we document the function as pg_check_relation(regclass [, > fork]) users could think that the 2nd argument is optional, so that > pg_check_relation('something', NULL) could be a valid alias for the 1-argument > form, which it isn't. After checking, I see that e.g. current_setting has the > same semantics and is documented the way you suggest, so fixed back to > previous > version. > > > And I think these are not necessary as we already defined in > > include/catalog/pg_proc.dat: > > > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > + IN relation regclass, > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > + OUT expected_checksum integer, OUT found_checksum integer) > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS > > 'pg_check_relation' > > + PARALLEL RESTRICTED; > > + > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > + IN relation regclass, IN fork text, > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > + OUT expected_checksum integer, OUT found_checksum integer) > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal > > + AS 'pg_check_relation_fork' > > + PARALLEL RESTRICTED; > > > > Oh right this isn't required since there's no default value anymore, fixed. > > v9 attached. Thank you for updating the patch! The patch looks good to me. I've marked this patch as Ready for Committer. I hope this patch will get committed to PG13. Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Online checksums verification in the backend
On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote: > > Why do we need two rows in the doc? For instance, replication slot > functions have some optional arguments but there is only one row in > the doc. So I think we don't need to change the doc from the previous > version patch. > I thought that if we document the function as pg_check_relation(regclass [, fork]) users could think that the 2nd argument is optional, so that pg_check_relation('something', NULL) could be a valid alias for the 1-argument form, which it isn't. After checking, I see that e.g. current_setting has the same semantics and is documented the way you suggest, so fixed back to previous version. > And I think these are not necessary as we already defined in > include/catalog/pg_proc.dat: > > +CREATE OR REPLACE FUNCTION pg_check_relation( > + IN relation regclass, > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > + OUT expected_checksum integer, OUT found_checksum integer) > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS > 'pg_check_relation' > + PARALLEL RESTRICTED; > + > +CREATE OR REPLACE FUNCTION pg_check_relation( > + IN relation regclass, IN fork text, > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > + OUT expected_checksum integer, OUT found_checksum integer) > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal > + AS 'pg_check_relation_fork' > + PARALLEL RESTRICTED; > Oh right this isn't required since there's no default value anymore, fixed. v9 attached. >From 66e0ed1c3b12c42c4d52b2b27e79e16e82730b4b Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v9] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/config.sgml | 85 + doc/src/sgml/func.sgml| 51 +++ src/backend/storage/page/checksum.c | 318 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 217 src/backend/utils/init/globals.c | 7 + src/backend/utils/misc/guc.c | 33 ++ src/backend/utils/misc/postgresql.conf.sample | 6 + src/include/catalog/pg_proc.dat | 16 + src/include/miscadmin.h | 7 + src/include/storage/checksum.h| 13 + src/include/utils/guc_tables.h| 1 + src/test/Makefile | 3 +- src/test/check_relation/.gitignore| 2 + src/test/check_relation/Makefile | 23 ++ src/test/check_relation/README| 23 ++ .../check_relation/t/01_checksums_check.pl| 276 +++ 17 files changed, 1078 insertions(+), 4 deletions(-) create mode 100644 src/backend/utils/adt/checksumfuncs.c create mode 100644 src/test/check_relation/.gitignore create mode 100644 src/test/check_relation/Makefile create mode 100644 src/test/check_relation/README create mode 100644 src/test/check_relation/t/01_checksums_check.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 114db38116..a20501bb85 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2038,6 +2038,91 @@ include_dir 'conf.d' + + Cost-based Checksum Verification Delay + + + During the execution of + function, the system maintains an internal counter that keeps track of + the estimated cost of the various I/O operations that are performed. + When the accumulated cost reaches a limit (specified by + checksum_cost_limit), the process performing the + operation will sleep for a short period of time, as specified by + checksum_cost_delay. Then it will reset the counter + and continue execution. + + + + This feature is disabled by default. To enable it, set the + checksum_cost_delay variable to a nonzero + value. + + + + + checksum_cost_delay (floating point) + +checksum_cost_delay configuration parameter + + + + + The amount of time that the process will sleep + when the cost limit has been exceeded. + If this value is specified without units, it is taken as milliseconds. + The default value is zero, which disables the cost-based checksum + verification delay feature. Positive values enable cost-based + checksum verification. + + + + + + checksum_cost_page (integer) + +
Re: Online checksums verification in the backend
On Sun, 5 Apr 2020 at 17:44, Julien Rouhaud wrote: > > On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote: > > > > Thank you for updating the patch! The patch looks good to me. Here are > > some random comments mostly about cosmetic changes. > > > > Thanks a lot for the review! Thank you for updating the patch. > > > > > 1. > > I think we can have two separate SQL functions: > > pg_check_relation(regclass, text) and pg_check_relation(regclass), > > instead of setting NULL by default to the second argument. > > > > I'm fine with it, so implemented this way with the required documentation > changes. Why do we need two rows in the doc? For instance, replication slot functions have some optional arguments but there is only one row in the doc. So I think we don't need to change the doc from the previous version patch. And I think these are not necessary as we already defined in include/catalog/pg_proc.dat: +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation' + PARALLEL RESTRICTED; + +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass, IN fork text, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal + AS 'pg_check_relation_fork' + PARALLEL RESTRICTED; > > > > > 2. > > + * Check data sanity for a specific block in the given fork of the given > > + * relation, always retrieved locally with smgrred even if a version > > exists in > > > > s/smgrred/smgrread/ > > Fixed. > > > > > 3. > > + /* The buffer will have to check checked. */ > > + Assert(checkit); > > > > Should it be "The buffer will have to be checked"? > > > > Oops indeed, fixed. > > > > > 4. > > + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) > > + ereport(ERROR, > > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > > +errmsg("only superuser or a member of the > > pg_read_server_files role may use this function"))); > > > > Looking at the definition of pg_stat_read_server_files role, this role > > seems to be for operations that could read non-database files such as > > csv files. Therefore, currently this role is used by file_fdw and COPY > > command. I personally think pg_stat_scan_tables would be more > > appropriate for this function but I'm not sure. > > > > That's a very good point, especially since the documentation of this default > role is quite relevant for those functions: > > "Execute monitoring functions that may take ACCESS SHARE locks on tables, > potentially for a long time." > > So changed! > > > > > 5. > > + /* Set cost-based vacuum delay */ > > + ChecksumCostActive = (ChecksumCostDelay > 0); > > + ChecksumCostBalance = 0; > > > > s/vacuum/checksum verification/ > > > > Fixed. > > > > > 6. > > + ereport(WARNING, > > + (errcode(ERRCODE_DATA_CORRUPTED), > > +errmsg("invalid page in block %u of relation %s", > > + blkno, > > + relpath(relation->rd_smgr->smgr_rnode, forknum; > > > > I think it's better to show the relation name instead of the relation path > > here. > > > > I'm here using the same pattern as what ReadBuffer_common() would display if a > corrupted block is read. I think it's better to keep the format for both, so > any existing log analyzer will keep working with those new functions. Ok, I agree with you. Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Online checksums verification in the backend
On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote: > > Thank you for updating the patch! The patch looks good to me. Here are > some random comments mostly about cosmetic changes. > Thanks a lot for the review! > > 1. > I think we can have two separate SQL functions: > pg_check_relation(regclass, text) and pg_check_relation(regclass), > instead of setting NULL by default to the second argument. > I'm fine with it, so implemented this way with the required documentation changes. > > 2. > + * Check data sanity for a specific block in the given fork of the given > + * relation, always retrieved locally with smgrred even if a version exists > in > > s/smgrred/smgrread/ Fixed. > > 3. > + /* The buffer will have to check checked. */ > + Assert(checkit); > > Should it be "The buffer will have to be checked"? > Oops indeed, fixed. > > 4. > + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > +errmsg("only superuser or a member of the > pg_read_server_files role may use this function"))); > > Looking at the definition of pg_stat_read_server_files role, this role > seems to be for operations that could read non-database files such as > csv files. Therefore, currently this role is used by file_fdw and COPY > command. I personally think pg_stat_scan_tables would be more > appropriate for this function but I'm not sure. > That's a very good point, especially since the documentation of this default role is quite relevant for those functions: "Execute monitoring functions that may take ACCESS SHARE locks on tables, potentially for a long time." So changed! > > 5. > + /* Set cost-based vacuum delay */ > + ChecksumCostActive = (ChecksumCostDelay > 0); > + ChecksumCostBalance = 0; > > s/vacuum/checksum verification/ > Fixed. > > 6. > + ereport(WARNING, > + (errcode(ERRCODE_DATA_CORRUPTED), > +errmsg("invalid page in block %u of relation %s", > + blkno, > + relpath(relation->rd_smgr->smgr_rnode, forknum; > > I think it's better to show the relation name instead of the relation path > here. > I'm here using the same pattern as what ReadBuffer_common() would display if a corrupted block is read. I think it's better to keep the format for both, so any existing log analyzer will keep working with those new functions. > > 7. > + ereport(ERROR, > + (errcode(ERRCODE_WRONG_OBJECT_TYPE), > +errmsg("relation \"%s\" does not have storage to be checked", > +quote_qualified_identifier( > +get_namespace_name(get_rel_namespace(relid)), > +get_rel_name(relid); > > Looking at other similar error messages we don't show qualified > relation name but the relation name gotten by > RelationGetRelationName(relation). Can we do that here as well for > consistency? > Indeed, fixed. > > 8. > + if (!(rsinfo->allowedModes & SFRM_Materialize)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > +errmsg("materialize mode required, but it is not " \ > + "allowed in this context"))); > > I think it's better to have this error message in one line for easy grepping. Fixed. I also fixed missing leading tab in the perl TAP tests >From a9634bf02c0e7e1f9b0e8cf4e1ad79a72ea80ec8 Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v8] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/config.sgml | 85 + doc/src/sgml/func.sgml| 60 src/backend/catalog/system_views.sql | 15 + src/backend/storage/page/checksum.c | 318 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 217 src/backend/utils/init/globals.c | 7 + src/backend/utils/misc/guc.c | 33 ++ src/backend/utils/misc/postgresql.conf.sample | 6 + src/include/catalog/pg_proc.dat | 16 + src/include/miscadmin.h | 7 + src/include/storage/checksum.h| 13 + src/include/utils/guc_tables.h| 1 + src/test/Makefile | 3 +- src/test/check_relation/.gitignore| 2 + src/test/check_relation/Makefile | 23 ++
Re: Online checksums verification in the backend
On Sat, 4 Apr 2020 at 18:04, Julien Rouhaud wrote: > > On Fri, Apr 03, 2020 at 11:39:11AM +0200, Julien Rouhaud wrote: > > On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote: > > > > > > check_relation_fork() seems to quite depends on pg_check_relation() > > > because the returned tuplestore is specified by pg_check_relation(). > > > It's just an idea but to improve reusability, how about moving > > > check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c > > > while iterating all blocks we call a new function in checksum.c, say > > > check_one_block() function, which has the following part and is > > > responsible for getting, checking the specified block and returning a > > > boolean indicating whether the block has corruption or not, along with > > > chk_found and chk_expected: > > > > > > /* > > > * To avoid too much overhead, the buffer will be first read > > > without > > > * the locks that would guarantee the lack of false positive, as > > > such > > > * events should be quite rare. > > > */ > > > Retry: > > > if (!check_get_buffer(relation, forknum, blkno, buffer, > > > force_lock, > > > _in_sb)) > > > continue; > > > > > > if (check_buffer(buffer, blkno, _expected, _found)) > > > continue; > > > > > > /* > > > * If we get a failure and the buffer wasn't found in shared > > > buffers, > > > * reread the buffer with suitable lock to avoid false positive. > > > See > > > * check_get_buffer for more details. > > > */ > > > if (!found_in_sb && !force_lock) > > > { > > > force_lock = true; > > > goto Retry; > > > } > > > > > > A new function in checksumfuncs.c or pg_check_relation will be > > > responsible for storing the result to the tuplestore. That way, > > > check_one_block() will be useful for other use when we want to check > > > if the particular block has corruption with low overhead. > > > > > > Yes, I agree that passing the tuplestore isn't an ideal approach and some > > refactoring should probably happen. One thing is that this wouldn't be > > "check_one_block()" but "check_one_block_on_disk()" (which could also be > > from > > the OS cache). I'm not sure how useful it's in itself. It also raises some > > concerns about the throttling. I didn't change that for now, but I hope > > there'll be some other feedback about it. > > > > > I had some time this morning, so I did the suggested refactoring as it seems > like a way cleaner interface. I also kept the suggested check_one_block(). Thank you for updating the patch! The patch looks good to me. Here are some random comments mostly about cosmetic changes. 1. I think we can have two separate SQL functions: pg_check_relation(regclass, text) and pg_check_relation(regclass), instead of setting NULL by default to the second argument. 2. + * Check data sanity for a specific block in the given fork of the given + * relation, always retrieved locally with smgrred even if a version exists in s/smgrred/smgrread/ 3. + /* The buffer will have to check checked. */ + Assert(checkit); Should it be "The buffer will have to be checked"? 4. + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), +errmsg("only superuser or a member of the pg_read_server_files role may use this function"))); Looking at the definition of pg_stat_read_server_files role, this role seems to be for operations that could read non-database files such as csv files. Therefore, currently this role is used by file_fdw and COPY command. I personally think pg_stat_scan_tables would be more appropriate for this function but I'm not sure. 5. + /* Set cost-based vacuum delay */ + ChecksumCostActive = (ChecksumCostDelay > 0); + ChecksumCostBalance = 0; s/vacuum/checksum verification/ 6. + ereport(WARNING, + (errcode(ERRCODE_DATA_CORRUPTED), +errmsg("invalid page in block %u of relation %s", + blkno, + relpath(relation->rd_smgr->smgr_rnode, forknum; I think it's better to show the relation name instead of the relation path here. 7. + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), +errmsg("relation \"%s\" does not have storage to be checked", +quote_qualified_identifier( +get_namespace_name(get_rel_namespace(relid)), +get_rel_name(relid); Looking at other similar error messages we don't show qualified relation name but the relation name gotten by RelationGetRelationName(relation). Can we do that here as well for consistency? 8. + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, +
Re: Online checksums verification in the backend
On Fri, Apr 03, 2020 at 11:39:11AM +0200, Julien Rouhaud wrote: > On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote: > > > > check_relation_fork() seems to quite depends on pg_check_relation() > > because the returned tuplestore is specified by pg_check_relation(). > > It's just an idea but to improve reusability, how about moving > > check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c > > while iterating all blocks we call a new function in checksum.c, say > > check_one_block() function, which has the following part and is > > responsible for getting, checking the specified block and returning a > > boolean indicating whether the block has corruption or not, along with > > chk_found and chk_expected: > > > > /* > > * To avoid too much overhead, the buffer will be first read without > > * the locks that would guarantee the lack of false positive, as > > such > > * events should be quite rare. > > */ > > Retry: > > if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock, > > _in_sb)) > > continue; > > > > if (check_buffer(buffer, blkno, _expected, _found)) > > continue; > > > > /* > > * If we get a failure and the buffer wasn't found in shared > > buffers, > > * reread the buffer with suitable lock to avoid false positive. > > See > > * check_get_buffer for more details. > > */ > > if (!found_in_sb && !force_lock) > > { > > force_lock = true; > > goto Retry; > > } > > > > A new function in checksumfuncs.c or pg_check_relation will be > > responsible for storing the result to the tuplestore. That way, > > check_one_block() will be useful for other use when we want to check > > if the particular block has corruption with low overhead. > > > Yes, I agree that passing the tuplestore isn't an ideal approach and some > refactoring should probably happen. One thing is that this wouldn't be > "check_one_block()" but "check_one_block_on_disk()" (which could also be from > the OS cache). I'm not sure how useful it's in itself. It also raises some > concerns about the throttling. I didn't change that for now, but I hope > there'll be some other feedback about it. > I had some time this morning, so I did the suggested refactoring as it seems like a way cleaner interface. I also kept the suggested check_one_block(). >From fc961d5fc8e220ddaa60306bd24300fe857a6d4b Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v7] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/config.sgml | 85 + doc/src/sgml/func.sgml| 51 +++ src/backend/catalog/system_views.sql | 7 + src/backend/storage/page/checksum.c | 318 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 218 src/backend/utils/init/globals.c | 7 + src/backend/utils/misc/guc.c | 33 ++ src/backend/utils/misc/postgresql.conf.sample | 6 + src/include/catalog/pg_proc.dat | 8 + src/include/miscadmin.h | 7 + src/include/storage/checksum.h| 13 + src/include/utils/guc_tables.h| 1 + src/test/Makefile | 3 +- src/test/check_relation/.gitignore| 2 + src/test/check_relation/Makefile | 23 ++ src/test/check_relation/README| 23 ++ .../check_relation/t/01_checksums_check.pl| 276 +++ 18 files changed, 1078 insertions(+), 4 deletions(-) create mode 100644 src/backend/utils/adt/checksumfuncs.c create mode 100644 src/test/check_relation/.gitignore create mode 100644 src/test/check_relation/Makefile create mode 100644 src/test/check_relation/README create mode 100644 src/test/check_relation/t/01_checksums_check.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c4d6ed4bbc..259793650f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2038,6 +2038,91 @@ include_dir 'conf.d' + + Cost-based Checksum Verification Delay + + + During the execution of + function, the system maintains an internal counter that keeps track of + the estimated cost of the various I/O operations that are performed. + When the accumulated cost reaches a
Re: Online checksums verification in the backend
On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote: > On Sat, 28 Mar 2020 at 21:19, Julien Rouhaud wrote: > > > The current patch still checks SearchSysCacheExists1() before > relation_open. Why do we need to call SearchSysCacheExists1() here? I > think if the given relation doesn't exist, relation_open() will raise > an error "could not open relation with OID %u". > > + /* Open the relation if it exists. */ > + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) > + { > + relation = relation_open(relid, AccessShareLock); > + } Oops yes sorry about that. Fixed. > > > 8. > > > + if (PageIsVerified(buffer, blkno)) > > > + { > > > + /* > > > +* If the page is really new, there won't by any checksum to > > > be > > > +* computed or expected. > > > +*/ > > > + *chk_expected = *chk_found = NoComputedChecksum; > > > + return true; > > > + } > > > + else > > > + { > > > + /* > > > +* There's a corruption, but since this affect PageIsNew, we > > > +* can't compute a checksum, so set NoComputedChecksum for the > > > +* expected checksum. > > > +*/ > > > + *chk_expected = NoComputedChecksum; > > > + *chk_found = hdr->pd_checksum; > > > + } > > > + return false; > > > > > > * I think the 'else' is not necessary here. > > > > AFAICT it's, see below. > > > > > * Setting *chk_expected and *chk_found seems useless when we return > > > true. The caller doesn't use them. > > > > Indeed, fixed. > > The patch still sets values to both? > > + if (PageIsVerified(buffer, blkno)) > + { > + /* No corruption. */ > + *chk_expected = *chk_found = NoComputedChecksum; > + return true; > + } Sorry again, fixed. > > > * Should we forcibly overwrite ignore_checksum_failure to off in > > > pg_check_relation()? Otherwise, this logic seems not work fine. > > > > > > * I don't understand why we cannot compute a checksum in case where a > > > page looks like a new page but is actually corrupted. Could you please > > > elaborate on that? > > > > PageIsVerified has a different behavior depending on whether the page looks > > new > > or not. If the page looks like new, it only checks that it's indeed a new > > page, and otherwise try to verify the checksum. > > > > Also, pg_check_page() has an assert to make sure that the page isn't (or > > don't > > look like) new. > > > > So it seems to me that the 'else' is required to properly detect a real or > > fake > > PageIsNew, and try to compute checksums only when required. > > Thank you for your explanation! I understand. > > I thought we can arrange the code to something like: > > if (PageIsNew(hdr)) > { > if (PageIsVerified(hdr)) > { > *chk_expected = *chk_found = NoComputedChecksum; > return true; > } > > *chk_expected = NoComputedChecksum; > *chk_found = hdr->pd_checksum; > return false; > } > > But since it's not a critical problem you can ignore it. I like it, so done! > > > 8. > > > + { > > > + {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY, > > > + gettext_noop("Checksum cost for a page found in the buffer > > > cache."), > > > + NULL > > > + }, > > > + , > > > + 1, 0, 1, > > > + NULL, NULL, NULL > > > + }, > > > > > > * There is no description about the newly added four GUC parameters in > > > the doc. > > > > > > * We need to put new GUC parameters into postgresql.conf.sample as well. > > > > Fixed both. > > > > > * The patch doesn't use checksum_cost_page_hit at all. > > > > Indeed, I also realized that while working on previous issues. I removed it > > and renamed checksum_cost_page_miss to checksum_cost_page. > > Perhaps we can use checksum_cost_page_hit when we found the page in > the shared buffer but it's marked as dirty? The thing is that when the buffer is dirty, we won't do any additional check, thus not adding any overhead. What may be needed here is to account for the locking overhead (in all cases), so that if all (or almost all) the buffers are dirty and in shared buffers the execution can be throttled. I don't know how much an issue it can be, but if that's something to be fixes then page_hit doesn't look like the right answer for that. > I've read the latest patch and here is random comments: > > 1. > + /* > +* Add a page miss cost, as we're always reading outside the shared > +* buffers. > +*/ > + /* Add a page cost. */ > + ChecksumCostBalance += ChecksumCostPage; > > There are duplicate comments. Fixed. > 2. > + /* Dirty pages are ignored as they'll be flushed soon. */ > + if (buf_state & BM_DIRTY) > + checkit = false; > > Should we check the buffer if it has BM_TAG_VALID as well here? I >
Re: Online checksums verification in the backend
On Sat, 28 Mar 2020 at 21:19, Julien Rouhaud wrote: > > On Sat, Mar 28, 2020 at 12:28:27PM +0900, Masahiko Sawada wrote: > > On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud wrote: > > > > > > v5 attached > > > > Thank you for updating the patch. I have some comments: > > Thanks a lot for the review! Thank you for updating the patch! > > 4. > > + /* Check if the relation (still) exists */ > > + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) > > + { > > + /* > > +* We use a standard relation_open() to acquire the initial lock. > > It > > +* means that this will block until the lock is acquired, or will > > +* raise an ERROR if lock_timeout has been set. If caller wants to > > +* check multiple tables while relying on a maximum wait time, it > > +* should process tables one by one instead of relying on a global > > +* processing with the main SRF. > > +*/ > > + relation = relation_open(relid, AccessShareLock); > > + } > > > > IIUC the above was necessary because we used to have > > check_all_relations() which iterates all relations on the database to > > do checksum checks. But now that we don't have that function and > > pg_check_relation processes one relation. Can we just do > > relation_open() here? > > Ah yes I missed that comment. I think only the comment needed to be updated > to > remove any part related to NULL-relation call. I ended up removign the whole > comment since locking and lock_timeout behavior is inherent to relation_open > and there's no need to document that any further now that we always only check > one relation at a time. The current patch still checks SearchSysCacheExists1() before relation_open. Why do we need to call SearchSysCacheExists1() here? I think if the given relation doesn't exist, relation_open() will raise an error "could not open relation with OID %u". + /* Open the relation if it exists. */ + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) + { + relation = relation_open(relid, AccessShareLock); + } > > 8. > > + if (PageIsVerified(buffer, blkno)) > > + { > > + /* > > +* If the page is really new, there won't by any checksum to be > > +* computed or expected. > > +*/ > > + *chk_expected = *chk_found = NoComputedChecksum; > > + return true; > > + } > > + else > > + { > > + /* > > +* There's a corruption, but since this affect PageIsNew, we > > +* can't compute a checksum, so set NoComputedChecksum for the > > +* expected checksum. > > +*/ > > + *chk_expected = NoComputedChecksum; > > + *chk_found = hdr->pd_checksum; > > + } > > + return false; > > > > * I think the 'else' is not necessary here. > > AFAICT it's, see below. > > > * Setting *chk_expected and *chk_found seems useless when we return > > true. The caller doesn't use them. > > Indeed, fixed. The patch still sets values to both? + if (PageIsVerified(buffer, blkno)) + { + /* No corruption. */ + *chk_expected = *chk_found = NoComputedChecksum; + return true; + } > > > * Should we forcibly overwrite ignore_checksum_failure to off in > > pg_check_relation()? Otherwise, this logic seems not work fine. > > > > * I don't understand why we cannot compute a checksum in case where a > > page looks like a new page but is actually corrupted. Could you please > > elaborate on that? > > PageIsVerified has a different behavior depending on whether the page looks > new > or not. If the page looks like new, it only checks that it's indeed a new > page, and otherwise try to verify the checksum. > > Also, pg_check_page() has an assert to make sure that the page isn't (or don't > look like) new. > > So it seems to me that the 'else' is required to properly detect a real or > fake > PageIsNew, and try to compute checksums only when required. Thank you for your explanation! I understand. I thought we can arrange the code to something like: if (PageIsNew(hdr)) { if (PageIsVerified(hdr)) { *chk_expected = *chk_found = NoComputedChecksum; return true; } *chk_expected = NoComputedChecksum; *chk_found = hdr->pd_checksum; return false; } But since it's not a critical problem you can ignore it. > > > 8. > > + { > > + {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY, > > + gettext_noop("Checksum cost for a page found in the buffer > > cache."), > > + NULL > > + }, > > + , > > + 1, 0, 1, > > + NULL, NULL, NULL > > + }, > > > > * There is no description about the newly added four GUC parameters in the > > doc. > > > > * We need to put new GUC parameters into postgresql.conf.sample as well. > > Fixed both. > > > * The patch doesn't use checksum_cost_page_hit at
Re: Online checksums verification in the backend
On Sat, Mar 28, 2020 at 12:28:27PM +0900, Masahiko Sawada wrote: > On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud wrote: > > > > v5 attached > > Thank you for updating the patch. I have some comments: Thanks a lot for the review! > 1. > + > +pg_check_relation(relation > oid, fork > text) > + > > Looking at the declaration of pg_check_relation, 'relation' and 'fork' > are optional arguments. So I think the above is not correct. But as I > commented below, 'relation' should not be optional, so maybe the above > line could be: > > +pg_check_relation(relation > oid[, fork > text]) Yes I missed that when making relation mandatory. Fixed. > 2. > + > +pg_check_relation > + > + > +pg_check_relation iterates over all the blocks of > all > +or the specified fork of a given relation and verify their checksum. It > +returns the list of blocks for which the found checksum doesn't match the > +expected one. You must be a member of the > +pg_read_all_stats role to use this function. It can > +only be used if data checksums are enabled. See +linkend="app-initdb-data-checksums"/> for more information. > + > > * I think we need a description about possible values for 'fork' > (i.g., 'main', 'vm', 'fsm' and 'init'), and the behavior when 'fork' > is omitted. Done. > * Do we need to explain about checksum cost-based delay here? It's probably better in config.sgml, nearby vacuum cost-based delay, so done this way with a link to reference that part. > 3. > +CREATE OR REPLACE FUNCTION pg_check_relation( > + IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT > NULL::text, > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > + OUT expected_checksum integer, OUT found_checksum integer) > + RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation' > + PARALLEL RESTRICTED; > > Now that pg_check_relation doesn't accept NULL as 'relation', I think > we need to make 'relation' a mandatory argument. Correct, fixed. > 4. > + /* Check if the relation (still) exists */ > + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) > + { > + /* > +* We use a standard relation_open() to acquire the initial lock. It > +* means that this will block until the lock is acquired, or will > +* raise an ERROR if lock_timeout has been set. If caller wants to > +* check multiple tables while relying on a maximum wait time, it > +* should process tables one by one instead of relying on a global > +* processing with the main SRF. > +*/ > + relation = relation_open(relid, AccessShareLock); > + } > > IIUC the above was necessary because we used to have > check_all_relations() which iterates all relations on the database to > do checksum checks. But now that we don't have that function and > pg_check_relation processes one relation. Can we just do > relation_open() here? Ah yes I missed that comment. I think only the comment needed to be updated to remove any part related to NULL-relation call. I ended up removign the whole comment since locking and lock_timeout behavior is inherent to relation_open and there's no need to document that any further now that we always only check one relation at a time. > 5. > I think we need to check if the relation is a temp relation. I'm not > sure it's worth to check checksums of temp relations but at least we > need not to check other's temp relations. Good point. I think it's still worthwhile to check the backend's temp relation, although typical usage should be a bgworker/cron job doing that check so there shouldn't be any. > 6. > +/* > + * Safely read the wanted buffer from disk, dealing with possible concurrency > + * issue. Note that if a buffer is found dirty in shared_buffers, no read > will > + * be performed and the caller will be informed that no check should be done. > + * We can safely ignore such buffers as they'll be written before next > + * checkpoint's completion.. > [...] > + */ > > I think the above comment also needs some "/*---" at the beginning and > end. Fixed. > 7. > +static void > +check_get_buffer(Relation relation, ForkNumber forknum, > +BlockNumber blkno, char *buffer, bool needlock, bool > *checkit, > +bool *found_in_sb) > +{ > > Maybe we can make check_get_buffer() return a bool indicating we found > a buffer to check, instead of having '*checkit'. That way, we can > simplify the following code: > > + check_get_buffer(relation, forknum, blkno, buffer, force_lock, > +, _in_sb); > + > + if (!checkit) > + continue; > > to something like: > > + if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock, > + _in_sb)) > + continue; Changed. > 8. > + if (PageIsVerified(buffer, blkno)) > + { > + /* > +* If
Re: Online checksums verification in the backend
On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud wrote: > > On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote: > > On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote: > > > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > > > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > > > >> With a large amount of > > > >> shared buffer eviction you actually increase the risk of torn page > > > >> reads. Instead of a logic relying on partition mapping locks, which > > > >> could be unwise on performance grounds, did you consider different > > > >> approaches? For example a kind of pre-emptive lock on the page in > > > >> storage to prevent any shared buffer operation to happen while the > > > >> block is read from storage, that would act like a barrier. > > > > > > > > Even with a workload having a large shared_buffers eviction pattern, I > > > > don't > > > > think that there's a high probability of hitting a torn page. Unless > > > > I'm > > > > mistaken it can only happen if all those steps happen concurrently to > > > > doing the > > > > block read just after releasing the LWLock: > > > > > > > > - postgres read the same block in shared_buffers (including all the > > > > locking) > > > > - dirties it > > > > - writes part of the page > > > > > > > > It's certainly possible, but it seems so unlikely that the optimistic > > > > lock-less > > > > approach seems like a very good tradeoff. > > > > > > Having false reports in this area could be very confusing for the > > > user. That's for example possible now with checksum verification and > > > base backups. > > > > > > I agree, however this shouldn't be the case here, as the block will be > > rechecked while holding proper lock the 2nd time in case of possible false > > positive before being reported as corrupted. So the only downside is to > > check > > twice a corrupted block that's not found in shared buffers (or concurrently > > loaded/modified/half flushed). As the number of corrupted or concurrently > > loaded/modified/half flushed blocks should usually be close to zero, it > > seems > > worthwhile to have a lockless check first for performance reason. > > > I just noticed some dumb mistakes while adding the new GUCs. v5 attached to > fix that, no other changes. Thank you for updating the patch. I have some comments: 1. + +pg_check_relation(relation oid, fork text) + Looking at the declaration of pg_check_relation, 'relation' and 'fork' are optional arguments. So I think the above is not correct. But as I commented below, 'relation' should not be optional, so maybe the above line could be: +pg_check_relation(relation oid[, fork text]) 2. + +pg_check_relation + + +pg_check_relation iterates over all the blocks of all +or the specified fork of a given relation and verify their checksum. It +returns the list of blocks for which the found checksum doesn't match the +expected one. You must be a member of the +pg_read_all_stats role to use this function. It can +only be used if data checksums are enabled. See for more information. + * I think we need a description about possible values for 'fork' (i.g., 'main', 'vm', 'fsm' and 'init'), and the behavior when 'fork' is omitted. * Do we need to explain about checksum cost-based delay here? 3. +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation' + PARALLEL RESTRICTED; Now that pg_check_relation doesn't accept NULL as 'relation', I think we need to make 'relation' a mandatory argument. 4. + /* Check if the relation (still) exists */ + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) + { + /* +* We use a standard relation_open() to acquire the initial lock. It +* means that this will block until the lock is acquired, or will +* raise an ERROR if lock_timeout has been set. If caller wants to +* check multiple tables while relying on a maximum wait time, it +* should process tables one by one instead of relying on a global +* processing with the main SRF. +*/ + relation = relation_open(relid, AccessShareLock); + } IIUC the above was necessary because we used to have check_all_relations() which iterates all relations on the database to do checksum checks. But now that we don't have that function and pg_check_relation processes one relation. Can we just do relation_open() here? 5. I think we need to check if the relation is a temp relation. I'm not sure it's worth to check checksums of temp relations but at least we need not to check other's temp relations. 6. +/* + * Safely read the wanted buffer from
Re: Online checksums verification in the backend
On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote: > On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote: > > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > > >> With a large amount of > > >> shared buffer eviction you actually increase the risk of torn page > > >> reads. Instead of a logic relying on partition mapping locks, which > > >> could be unwise on performance grounds, did you consider different > > >> approaches? For example a kind of pre-emptive lock on the page in > > >> storage to prevent any shared buffer operation to happen while the > > >> block is read from storage, that would act like a barrier. > > > > > > Even with a workload having a large shared_buffers eviction pattern, I > > > don't > > > think that there's a high probability of hitting a torn page. Unless I'm > > > mistaken it can only happen if all those steps happen concurrently to > > > doing the > > > block read just after releasing the LWLock: > > > > > > - postgres read the same block in shared_buffers (including all the > > > locking) > > > - dirties it > > > - writes part of the page > > > > > > It's certainly possible, but it seems so unlikely that the optimistic > > > lock-less > > > approach seems like a very good tradeoff. > > > > Having false reports in this area could be very confusing for the > > user. That's for example possible now with checksum verification and > > base backups. > > > I agree, however this shouldn't be the case here, as the block will be > rechecked while holding proper lock the 2nd time in case of possible false > positive before being reported as corrupted. So the only downside is to check > twice a corrupted block that's not found in shared buffers (or concurrently > loaded/modified/half flushed). As the number of corrupted or concurrently > loaded/modified/half flushed blocks should usually be close to zero, it seems > worthwhile to have a lockless check first for performance reason. I just noticed some dumb mistakes while adding the new GUCs. v5 attached to fix that, no other changes. >From 6bea507c5dbd7bff862c75b93fc023f0f33aba99 Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v5] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/func.sgml| 44 ++ src/backend/catalog/system_views.sql | 7 + src/backend/storage/page/checksum.c | 427 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 96 src/backend/utils/init/globals.c | 8 + src/backend/utils/misc/guc.c | 43 ++ src/include/catalog/pg_proc.dat | 8 + src/include/miscadmin.h | 8 + src/include/storage/checksum.h| 7 + src/include/utils/guc_tables.h| 1 + src/test/Makefile | 3 +- src/test/check_relation/.gitignore| 2 + src/test/check_relation/Makefile | 23 + src/test/check_relation/README| 23 + .../check_relation/t/01_checksums_check.pl| 276 +++ 16 files changed, 973 insertions(+), 4 deletions(-) create mode 100644 src/backend/utils/adt/checksumfuncs.c create mode 100644 src/test/check_relation/.gitignore create mode 100644 src/test/check_relation/Makefile create mode 100644 src/test/check_relation/README create mode 100644 src/test/check_relation/t/01_checksums_check.pl diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index fc4d7f0f78..4baa2bb5e9 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -21819,6 +21819,50 @@ SELECT (pg_stat_file('filename')).modification; + + Data Sanity Functions + + +The functions shown in +provide means to check for health of data file in a cluster. + + + +Data Sanity Functions + + + Name Return Type Description + + + + + + +pg_check_relation(relation oid, fork text) + + setof record + Validate the checksums for all blocks of a given relation, and + optionally the given fork. + + + + + + +pg_check_relation + + +pg_check_relation iterates over all the blocks of all +or the specified fork of a given relation and verify their checksum. It +returns the list of blocks for which the found checksum doesn't match the +expected one.
Re: Online checksums verification in the backend
On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote: > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > >> Based on the feedback gathered on this thread, I guess that you should > >> have a SRF returning the list of broken blocks, as well as NOTICE > >> messages. > > > > The current patch has an SRF and a WARNING message, do you want an > > additional > > NOTICE message or downgrade the existing one? > > Right, not sure which one is better, for zero_damaged_pages a WARNING > is used. Sorry forgot to answer that. IMHO a WARNING is better here, as we're talking about data corruption. Also, a WARNING will be reported to both the client and server logs, which sounds like a good thing.
Re: Online checksums verification in the backend
On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote: > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > >> With a large amount of > >> shared buffer eviction you actually increase the risk of torn page > >> reads. Instead of a logic relying on partition mapping locks, which > >> could be unwise on performance grounds, did you consider different > >> approaches? For example a kind of pre-emptive lock on the page in > >> storage to prevent any shared buffer operation to happen while the > >> block is read from storage, that would act like a barrier. > > > > Even with a workload having a large shared_buffers eviction pattern, I don't > > think that there's a high probability of hitting a torn page. Unless I'm > > mistaken it can only happen if all those steps happen concurrently to doing > > the > > block read just after releasing the LWLock: > > > > - postgres read the same block in shared_buffers (including all the locking) > > - dirties it > > - writes part of the page > > > > It's certainly possible, but it seems so unlikely that the optimistic > > lock-less > > approach seems like a very good tradeoff. > > Having false reports in this area could be very confusing for the > user. That's for example possible now with checksum verification and > base backups. I agree, however this shouldn't be the case here, as the block will be rechecked while holding proper lock the 2nd time in case of possible false positive before being reported as corrupted. So the only downside is to check twice a corrupted block that's not found in shared buffers (or concurrently loaded/modified/half flushed). As the number of corrupted or concurrently loaded/modified/half flushed blocks should usually be close to zero, it seems worthwhile to have a lockless check first for performance reason. > > For the record when I first tested that feature I did try to check dirty > > blocks, and it seemed that dirty blocks of shared relation were sometimes > > wrongly reported as corrupted. I didn't try to investigate more though. > > Hmm. It would be good to look at that, correct verification of shared > relations matter. I'll try to investigate, but non-dirty shared relation blocks can be checked and work as intended. > > >> + * - if a block is dirty in shared_buffers, it's ignored as it'll be > >> flushed to > >> + * disk either before the end of the next checkpoint or during recovery > >> in > >> + * case of unsafe shutdown > >> Not sure that the indentation is going to react well on that part of > >> the patch, perhaps it would be better to add some "/*---" at the > >> beginning and end of the comment block to tell pgindent to ignore this > >> part? > > > > Ok. Although I think only the beginning comment is needed? > > From src/tools/pgindent/README: > "pgindent will reflow any comment block that's not at the left margin. > If this messes up manual formatting that ought to be preserved, > protect the comment block with some dashes:" > > /*-- >* Text here will not be touched by pgindent. > *-- > */ For instance the block comment in gen_partprune_steps_internal() disagrees. Anyway I added both as all the nearby codes does that for overall function comments.
Re: Online checksums verification in the backend
On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: >> With a large amount of >> shared buffer eviction you actually increase the risk of torn page >> reads. Instead of a logic relying on partition mapping locks, which >> could be unwise on performance grounds, did you consider different >> approaches? For example a kind of pre-emptive lock on the page in >> storage to prevent any shared buffer operation to happen while the >> block is read from storage, that would act like a barrier. > > Even with a workload having a large shared_buffers eviction pattern, I don't > think that there's a high probability of hitting a torn page. Unless I'm > mistaken it can only happen if all those steps happen concurrently to doing > the > block read just after releasing the LWLock: > > - postgres read the same block in shared_buffers (including all the locking) > - dirties it > - writes part of the page > > It's certainly possible, but it seems so unlikely that the optimistic > lock-less > approach seems like a very good tradeoff. Having false reports in this area could be very confusing for the user. That's for example possible now with checksum verification and base backups. >> I guess that this leads to the fact that this function may be better as >> a contrib module, with the addition of some better-suited APIs in core >> (see paragraph above). > > Below? Above. This thought more precisely: >> For example a kind of pre-emptive lock on the page in >> storage to prevent any shared buffer operation to happen while the >> block is read from storage, that would act like a barrier. > For the record when I first tested that feature I did try to check dirty > blocks, and it seemed that dirty blocks of shared relation were sometimes > wrongly reported as corrupted. I didn't try to investigate more though. Hmm. It would be good to look at that, correct verification of shared relations matter. >> + * - if a block is dirty in shared_buffers, it's ignored as it'll be >> flushed to >> + * disk either before the end of the next checkpoint or during recovery in >> + * case of unsafe shutdown >> Not sure that the indentation is going to react well on that part of >> the patch, perhaps it would be better to add some "/*---" at the >> beginning and end of the comment block to tell pgindent to ignore this >> part? > > Ok. Although I think only the beginning comment is needed? From src/tools/pgindent/README: "pgindent will reflow any comment block that's not at the left margin. If this messes up manual formatting that ought to be preserved, protect the comment block with some dashes:" /*-- * Text here will not be touched by pgindent. *-- */ >> Based on the feedback gathered on this thread, I guess that you should >> have a SRF returning the list of broken blocks, as well as NOTICE >> messages. > > The current patch has an SRF and a WARNING message, do you want an additional > NOTICE message or downgrade the existing one? Right, not sure which one is better, for zero_damaged_pages a WARNING is used. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Mon, Mar 16, 2020 at 02:15:27PM +0100, Julien Rouhaud wrote: > On Mon, Mar 16, 2020 at 09:42:39AM +0100, Julien Rouhaud wrote: > > On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote: > > > > > > In addition to comments from Michael-san, here are my comments: > > Thanks both for the reviews. I'm attaching a v3 with all comments addressed, > except: > > > It seems to me that this test would be a good fit for > > src/test/modules/test_misc/. > > > AFAICT this is explicitly documented as tests for various extensions, and for > now it's a core function, so I didn't move it. > > > > +Run > > +make check > > +or > > +make installcheck > > Why is installcheck mentioned here? > > > This is actually already used in multiple other test readme. Sorry I forgot to update the regression tests. v4 attached. >From e71af7998600a491efa1435d3c218de6e55bbc64 Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v4] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/func.sgml| 44 ++ src/backend/catalog/system_views.sql | 7 + src/backend/storage/page/checksum.c | 427 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 96 src/backend/utils/init/globals.c | 8 + src/backend/utils/misc/guc.c | 43 ++ src/include/catalog/pg_proc.dat | 8 + src/include/miscadmin.h | 8 + src/include/storage/checksum.h| 7 + src/include/utils/guc_tables.h| 1 + src/test/Makefile | 3 +- src/test/check_relation/.gitignore| 2 + src/test/check_relation/Makefile | 23 + src/test/check_relation/README| 23 + .../check_relation/t/01_checksums_check.pl| 276 +++ 16 files changed, 973 insertions(+), 4 deletions(-) create mode 100644 src/backend/utils/adt/checksumfuncs.c create mode 100644 src/test/check_relation/.gitignore create mode 100644 src/test/check_relation/Makefile create mode 100644 src/test/check_relation/README create mode 100644 src/test/check_relation/t/01_checksums_check.pl diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 323366feb6..5847d7f655 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -21815,6 +21815,50 @@ SELECT (pg_stat_file('filename')).modification; + + Data Sanity Functions + + +The functions shown in +provide means to check for health of data file in a cluster. + + + +Data Sanity Functions + + + Name Return Type Description + + + + + + +pg_check_relation(relation oid, fork text) + + setof record + Validate the checksums for all blocks of a given relation, and + optionally the given fork. + + + + + + +pg_check_relation + + +pg_check_relation iterates over all the blocks of all +or the specified fork of a given relation and verify their checksum. It +returns the list of blocks for which the found checksum doesn't match the +expected one. You must be a member of the +pg_read_all_stats role to use this function. It can +only be used if data checksums are enabled. See for more information. + + + + diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index b8a3f46912..e266292b03 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1386,6 +1386,13 @@ LANGUAGE INTERNAL STRICT STABLE PARALLEL SAFE AS 'jsonb_path_query_first_tz'; +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation' + PARALLEL RESTRICTED; + -- -- The default permissions for functions mean that anyone can execute them. -- A number of functions shouldn't be executable by just anyone, but rather diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c index e010691c9f..aec5e2ad2d 100644 --- a/src/backend/storage/page/checksum.c +++ b/src/backend/storage/page/checksum.c @@ -15,8 +15,429 @@ #include "storage/checksum.h" /* - * The actual code is in
Re: Online checksums verification in the backend
On Mon, Mar 16, 2020 at 09:42:39AM +0100, Julien Rouhaud wrote: > On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote: > > > > In addition to comments from Michael-san, here are my comments: Thanks both for the reviews. I'm attaching a v3 with all comments addressed, except: > It seems to me that this test would be a good fit for > src/test/modules/test_misc/. AFAICT this is explicitly documented as tests for various extensions, and for now it's a core function, so I didn't move it. > +Run > +make check > +or > +make installcheck > Why is installcheck mentioned here? This is actually already used in multiple other test readme. >From d96c67e2e591cecca63fdf8c9d808bb6a1b72866 Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v3] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/func.sgml| 44 ++ src/backend/catalog/system_views.sql | 7 + src/backend/storage/page/checksum.c | 427 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 96 src/backend/utils/init/globals.c | 8 + src/backend/utils/misc/guc.c | 43 ++ src/include/catalog/pg_proc.dat | 8 + src/include/miscadmin.h | 8 + src/include/storage/checksum.h| 7 + src/include/utils/guc_tables.h| 1 + src/test/Makefile | 3 +- src/test/check_relation/.gitignore| 2 + src/test/check_relation/Makefile | 23 + src/test/check_relation/README| 23 + .../check_relation/t/01_checksums_check.pl| 274 +++ 16 files changed, 971 insertions(+), 4 deletions(-) create mode 100644 src/backend/utils/adt/checksumfuncs.c create mode 100644 src/test/check_relation/.gitignore create mode 100644 src/test/check_relation/Makefile create mode 100644 src/test/check_relation/README create mode 100644 src/test/check_relation/t/01_checksums_check.pl diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 323366feb6..5847d7f655 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -21815,6 +21815,50 @@ SELECT (pg_stat_file('filename')).modification; + + Data Sanity Functions + + +The functions shown in +provide means to check for health of data file in a cluster. + + + +Data Sanity Functions + + + Name Return Type Description + + + + + + +pg_check_relation(relation oid, fork text) + + setof record + Validate the checksums for all blocks of a given relation, and + optionally the given fork. + + + + + + +pg_check_relation + + +pg_check_relation iterates over all the blocks of all +or the specified fork of a given relation and verify their checksum. It +returns the list of blocks for which the found checksum doesn't match the +expected one. You must be a member of the +pg_read_all_stats role to use this function. It can +only be used if data checksums are enabled. See for more information. + + + + diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index b8a3f46912..e266292b03 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1386,6 +1386,13 @@ LANGUAGE INTERNAL STRICT STABLE PARALLEL SAFE AS 'jsonb_path_query_first_tz'; +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation' + PARALLEL RESTRICTED; + -- -- The default permissions for functions mean that anyone can execute them. -- A number of functions shouldn't be executable by just anyone, but rather diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c index e010691c9f..aec5e2ad2d 100644 --- a/src/backend/storage/page/checksum.c +++ b/src/backend/storage/page/checksum.c @@ -15,8 +15,429 @@ #include "storage/checksum.h" /* - * The actual code is in storage/checksum_impl.h. This is done so that - * external programs can incorporate the checksum code by #include'ing - * that file from the exported Postgres headers. (Compare our CRC
Re: Online checksums verification in the backend
On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote: > > In addition to comments from Michael-san, here are my comments: > > 1. > + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > +errmsg("only superuser or a member of the > pg_read_server_files role may use this function"))); Good point! I'll fix it. > + > + if (!DataChecksumsEnabled()) > + elog(ERROR, "Data checksums are not enabled"); > > I think it's better to reverse the order of the above checks. Indeed. > > 2. > +#define CRF_COLS 5 /* Number of output arguments in the SRF */ > > Should it be SRF_COLS? Oops, will fix. > > 3. > +static void > +check_delay_point(void) > +{ > + /* Always check for interrupts */ > + CHECK_FOR_INTERRUPTS(); > + > + /* Nap if appropriate */ > + if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit) > + { > + int msec; > + > + msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit; > + if (msec > VacuumCostDelay * 4) > + msec = VacuumCostDelay * 4; > + > + pg_usleep(msec * 1000L); > + > + VacuumCostBalance = 0; > + > + /* Might have gotten an interrupt while sleeping */ > + CHECK_FOR_INTERRUPTS(); > + } > +} > > Even if we use vacuum delay for this function, I think we need to set > VacuumDelayActive and return if it's false, or it's better to just > return if VacuumCostDelay == 0. Good point, I'll fix that. > > 4. > +static void > +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore, > + ForkNumber forknum) > > I also agree with Michael-san to remove this function. Instead we can > check all relations by: > > select pg_check_relation(oid) from pg_class; Sure, but ideally we should do that in a client program (eg. pg_checksums) that wouldn't maintain a transaction active for the whole execution. > 6. > Other typos > > s/dirted/dirtied/ > s/explictly/explicitly/ Will fix, thanks!
Re: Online checksums verification in the backend
Thanks for the review Michael! On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote: > > The cfbot reported a build failure, so here's a rebased v2 which also > > contains > > the pg_stat_report_failure() call and extra log info. > > + * - if a block is not found in shared_buffers, the LWLock is relased and the > + * block is read from disk without taking any lock. If an error is > detected, > + * the read block will be discarded and retrieved again while holding the > + * LWLock. This is because an error due to concurrent write is possible > but > + * very unlikely, so it's better to have an optimistic approach to limit > + * locking overhead > This can be risky with false positives, no? Do you mean high probability of false positive in the 1st iteration, so running frequently the recheck that can't have false positive, not that the 2nd check can lead to false positive? > With a large amount of > shared buffer eviction you actually increase the risk of torn page > reads. Instead of a logic relying on partition mapping locks, which > could be unwise on performance grounds, did you consider different > approaches? For example a kind of pre-emptive lock on the page in > storage to prevent any shared buffer operation to happen while the > block is read from storage, that would act like a barrier. Even with a workload having a large shared_buffers eviction pattern, I don't think that there's a high probability of hitting a torn page. Unless I'm mistaken it can only happen if all those steps happen concurrently to doing the block read just after releasing the LWLock: - postgres read the same block in shared_buffers (including all the locking) - dirties it - writes part of the page It's certainly possible, but it seems so unlikely that the optimistic lock-less approach seems like a very good tradeoff. > > + * Vacuum's GUCs are used to avoid consuming too much resources while running > + * this tool. > Shouldn't this involve separate GUCs instead of the VACUUM ones? We could but the access pattern looked so similar that it looked like a good idea to avoid adding 2 new GUC for that to keep configuration simple. Unless there are objections I'll add them in the next version. > I guess that this leads to the fact that this function may be better as > a contrib module, with the addition of some better-suited APIs in core > (see paragraph above). Below? > > +Run > +make check > +or > +make installcheck > Why is installcheck mentioned here? Oups, copy/pasto error from the original contrib module this stuff was initially implemented as, will fix. > > I don't think that it is appropriate to place the SQL-callable part in > the existing checksum.c. I would suggest instead a new file, say > checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions > for checksums. Agreed. > > -SUBDIRS = perl regress isolation modules authentication recovery > subscription > +SUBDIRS = perl regress isolation modules authentication check_relation \ > + recovery subscription > It seems to me that this test would be a good fit for > src/test/modules/test_misc/. WFM. > > +static void > +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore, > + ForkNumber forknum) > Per the argument of bloat, I think that I would remove > check_all_relation() as this function could take a very long time to > run, and just make the SQL function strict. No objection. > > + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed > to > + * disk either before the end of the next checkpoint or during recovery in > + * case of unsafe shutdown > Wouldn't it actually be a good thing to check that the page on storage > is fine in this case? This depends on the system settings and the > checkpoint frequency, but checkpoint_timeout can be extended up to 1 > day. And plenty of things could happen to the storage in one day, > including a base backup that includes a corrupted page on storage, > that this function would not be able to detect. How could that lead to data corruption? If postgres crashes before the checkpoint completion, the block will be overwritten during recovery, and if a base backup is taken the block will also be overwritten while replaying all the required WALs. Detecting a corrupted blocks in those cases would have the merit of possibly warning about possibly broken hardware sooner, but it would also make the check more expensive as the odds to prevent postgres from evicting a dirty block is way higher. Maybe an additional GUC for that? For the record when I first tested that feature I did try to check dirty blocks, and it seemed that dirty blocks of shared relation were sometimes wrongly reported as corrupted. I didn't try to investigate more though. > + * we detect if a block is in shared_buffers or not. See get_buffer() > + *
Re: Online checksums verification in the backend
On Wed, 11 Mar 2020 at 16:18, Julien Rouhaud wrote: > > On Tue, Dec 10, 2019 at 11:12:34AM +0100, Julien Rouhaud wrote: > > On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier wrote: > > > > > > On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote: > > > > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas > > > > wrote: > > > >> Some people might prefer notices, because you can get those while the > > > >> thing is still running, rather than a result set, which you will only > > > >> see when the query finishes. Other people might prefer an SRF, because > > > >> they want to have the data in structured form so that they can > > > >> postprocess it. Not sure what you mean by "more globally." > > > > > > > > I meant having the results available system-wide, not only to the > > > > caller. I think that emitting a log/notice level should always be > > > > done on top on whatever other communication facility we're using. > > > > > > The problem of notice and logs is that they tend to be ignored. Now I > > > don't see no problems either in adding something into the logs which > > > can be found later on for parsing on top of a SRF returned by the > > > caller which includes all the corruption details, say with pgbadger > > > or your friendly neighborhood grep. I think that any backend function > > > should also make sure to call pgstat_report_checksum_failure() to > > > report a report visible at database-level in the catalogs, so as it is > > > possible to use that as a cheap high-level warning. The details of > > > the failures could always be dug from the logs or the result of the > > > function itself after finding out that something is wrong in > > > pg_stat_database. > > > > I agree that adding extra information in the logs and calling > > pgstat_report_checksum_failure is a must do, and I changed that > > locally. However, I doubt that the logs is the right place to find > > the details of corrupted blocks. There's no guarantee that the file > > will be accessible to the DBA, nor that the content won't get > > truncated by the time it's needed. I really think that corruption is > > important enough to justify more specific location. > > > The cfbot reported a build failure, so here's a rebased v2 which also contains > the pg_stat_report_failure() call and extra log info. In addition to comments from Michael-san, here are my comments: 1. + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), +errmsg("only superuser or a member of the pg_read_server_files role may use this function"))); + + if (!DataChecksumsEnabled()) + elog(ERROR, "Data checksums are not enabled"); I think it's better to reverse the order of the above checks. 2. +#define CRF_COLS 5 /* Number of output arguments in the SRF */ Should it be SRF_COLS? 3. +static void +check_delay_point(void) +{ + /* Always check for interrupts */ + CHECK_FOR_INTERRUPTS(); + + /* Nap if appropriate */ + if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit) + { + int msec; + + msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit; + if (msec > VacuumCostDelay * 4) + msec = VacuumCostDelay * 4; + + pg_usleep(msec * 1000L); + + VacuumCostBalance = 0; + + /* Might have gotten an interrupt while sleeping */ + CHECK_FOR_INTERRUPTS(); + } +} Even if we use vacuum delay for this function, I think we need to set VacuumDelayActive and return if it's false, or it's better to just return if VacuumCostDelay == 0. 4. +static void +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore, + ForkNumber forknum) I also agree with Michael-san to remove this function. Instead we can check all relations by: select pg_check_relation(oid) from pg_class; 6. Other typos s/dirted/dirtied/ s/explictly/explicitly/ Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Online checksums verification in the backend
On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote: > The cfbot reported a build failure, so here's a rebased v2 which also contains > the pg_stat_report_failure() call and extra log info. + * - if a block is not found in shared_buffers, the LWLock is relased and the + * block is read from disk without taking any lock. If an error is detected, + * the read block will be discarded and retrieved again while holding the + * LWLock. This is because an error due to concurrent write is possible but + * very unlikely, so it's better to have an optimistic approach to limit + * locking overhead This can be risky with false positives, no? With a large amount of shared buffer eviction you actually increase the risk of torn page reads. Instead of a logic relying on partition mapping locks, which could be unwise on performance grounds, did you consider different approaches? For example a kind of pre-emptive lock on the page in storage to prevent any shared buffer operation to happen while the block is read from storage, that would act like a barrier. + * Vacuum's GUCs are used to avoid consuming too much resources while running + * this tool. Shouldn't this involve separate GUCs instead of the VACUUM ones? I guess that this leads to the fact that this function may be better as a contrib module, with the addition of some better-suited APIs in core (see paragraph above). +Run +make check +or +make installcheck Why is installcheck mentioned here? I don't think that it is appropriate to place the SQL-callable part in the existing checksum.c. I would suggest instead a new file, say checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions for checksums. -SUBDIRS = perl regress isolation modules authentication recovery subscription +SUBDIRS = perl regress isolation modules authentication check_relation \ + recovery subscription It seems to me that this test would be a good fit for src/test/modules/test_misc/. +static void +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore, + ForkNumber forknum) Per the argument of bloat, I think that I would remove check_all_relation() as this function could take a very long time to run, and just make the SQL function strict. + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to + * disk either before the end of the next checkpoint or during recovery in + * case of unsafe shutdown Wouldn't it actually be a good thing to check that the page on storage is fine in this case? This depends on the system settings and the checkpoint frequency, but checkpoint_timeout can be extended up to 1 day. And plenty of things could happen to the storage in one day, including a base backup that includes a corrupted page on storage, that this function would not be able to detect. + * - if a block is otherwise found in shared_buffers, an IO lock is taken on + * the block and the block is then read from storage, ignoring the block in + * shared_buffers Yeah, I think that you are right here to check the page on storage anyway. + * we detect if a block is in shared_buffers or not. See get_buffer() + * comments for more details about the locking strategy. get_buffer() does not exist in your patch, check_get_buffer() does. + * - if a block is not found in shared_buffers, the LWLock is relased and the [...] + * To avoid torn page and possible false postives when reading data, and Typos. + if (!DataChecksumsEnabled()) + elog(ERROR, "Data checksums are not enabled"); Note that elog() is for the class of errors which are never expected, and here a caller of pg_check_relation() with checksums disabled can trigger that. So you need to call ereport() with ERRCODE_FEATURE_NOT_SUPPORTED. + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to + * disk either before the end of the next checkpoint or during recovery in + * case of unsafe shutdown Not sure that the indentation is going to react well on that part of the patch, perhaps it would be better to add some "/*---" at the beginning and end of the comment block to tell pgindent to ignore this part? Based on the feedback gathered on this thread, I guess that you should have a SRF returning the list of broken blocks, as well as NOTICE messages. Another thing to consider is the addition of a range argument to only check a certain portion of the blocks, say one segment file at a time, etc. Fine by me to not include in the first flavor of the patch. The patch needs documentation. -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Tue, Dec 10, 2019 at 11:12:34AM +0100, Julien Rouhaud wrote: > On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier wrote: > > > > On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote: > > > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas wrote: > > >> Some people might prefer notices, because you can get those while the > > >> thing is still running, rather than a result set, which you will only > > >> see when the query finishes. Other people might prefer an SRF, because > > >> they want to have the data in structured form so that they can > > >> postprocess it. Not sure what you mean by "more globally." > > > > > > I meant having the results available system-wide, not only to the > > > caller. I think that emitting a log/notice level should always be > > > done on top on whatever other communication facility we're using. > > > > The problem of notice and logs is that they tend to be ignored. Now I > > don't see no problems either in adding something into the logs which > > can be found later on for parsing on top of a SRF returned by the > > caller which includes all the corruption details, say with pgbadger > > or your friendly neighborhood grep. I think that any backend function > > should also make sure to call pgstat_report_checksum_failure() to > > report a report visible at database-level in the catalogs, so as it is > > possible to use that as a cheap high-level warning. The details of > > the failures could always be dug from the logs or the result of the > > function itself after finding out that something is wrong in > > pg_stat_database. > > I agree that adding extra information in the logs and calling > pgstat_report_checksum_failure is a must do, and I changed that > locally. However, I doubt that the logs is the right place to find > the details of corrupted blocks. There's no guarantee that the file > will be accessible to the DBA, nor that the content won't get > truncated by the time it's needed. I really think that corruption is > important enough to justify more specific location. The cfbot reported a build failure, so here's a rebased v2 which also contains the pg_stat_report_failure() call and extra log info. diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index b8a3f46912..e266292b03 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1386,6 +1386,13 @@ LANGUAGE INTERNAL STRICT STABLE PARALLEL SAFE AS 'jsonb_path_query_first_tz'; +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation' + PARALLEL RESTRICTED; + -- -- The default permissions for functions mean that anyone can execute them. -- A number of functions shouldn't be executable by just anyone, but rather diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c index e010691c9f..ed1a0c9b30 100644 --- a/src/backend/storage/page/checksum.c +++ b/src/backend/storage/page/checksum.c @@ -15,8 +15,541 @@ #include "storage/checksum.h" /* - * The actual code is in storage/checksum_impl.h. This is done so that - * external programs can incorporate the checksum code by #include'ing - * that file from the exported Postgres headers. (Compare our CRC code.) + * The actual checksum computation code is in storage/checksum_impl.h. This + * is done so that external programs can incorporate the checksum code by + * #include'ing that file from the exported Postgres headers. (Compare our + * CRC code.) */ #include "storage/checksum_impl.h" + +#include "access/heapam.h" +#include "access/htup_details.h" +#include "catalog/pg_authid_d.h" +#include "commands/dbcommands.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "storage/buf_internals.h" +#include "storage/checksum.h" +#include "storage/lockdefs.h" +#include "utils/acl.h" +#include "utils/builtins.h" +#include "utils/guc.h" +#include "utils/lsyscache.h" +#include "utils/rel.h" +#include "utils/syscache.h" + +/* + * A zero checksum can never be computed, see pg_checksum_page() */ +#define NoComputedChecksum 0 + +/* The rest of this module provides a set of functions that can be used to + * safely check all checksums on a running cluster. + * + * Please note that those only perform standard buffered reads, and don't try + * to bypass or discard the operating system cache. If you want to check the + * actual storage, you have to discard the operating system cache before + * running those functions. + * + * To avoid torn page and possible false postives when reading data, and + * keeping overhead as low as possible, the following heuristics are used: + * + * - a shared LWLock is taken on the target buffer pool partition mapping, and + * we
Re: Online checksums verification in the backend
On Tue, 24 Dec 2019 at 16:09, Julien Rouhaud wrote: > > On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada wrote: > > > > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud wrote: > > > > > > This brings the second consideration: how to report the list corrupted > > > blocks to end users. As I said this is for now returned via the SRF, > > > but this is clearly not ideal and should rather be made available more > > > globally. One usage of this information could be block level > > > recovery. I'm Cc-ing Sawada-san, as I know he's working on this and > > > mentioned me that he had ideas on passing the list of corrupted blocks > > > using the stat collector. > > > > Yes it's necessary the list of corrupted pages for single page > > recovery. Apart from single page recovery I think it's helpful for DBA > > if they can find the corrupted blocks in the server logs and on a > > system view. > > > > I've also tried to report corrupted pages to the stats collector > > during I researching single page recovery in PostgreSQL but one > > problem is that the statistics in the stats collector is cleared when > > crash recovery. I want the information of block corruption to survive > > even when the server down. > > Yes, having the list of corrupted blocks surviving a crash-and-restart > cycle, and also available after a clean shutdown is definitely > important. > > > And we might want to add checksums to the > > permanent file having information of database corruption. The > > correctness of these information would be important because we can fix > > a database by restoring some tables from a logical backup or by doing > > reindex etc as long as we have a non-broken information of database > > corruption. > > Agreed > > > > Finally, the read and locking considerations. I tried to cover that > > > extensively in the comments, but here are some details on how I tried > > > to make the check safe while trying to keep the overhead as low as > > > possible. First thing is that this is only doing buffered reads, > > > without any attempt to discard OS cache. Therefore, any discrepancy > > > between the OS cache and the disk cannot be detected unless you do > > > other actions, such as sync / drop_caches on GNU/Linux. > > > > > > An access share lock on the currently checked relation is held, > > > meaning that it can't get deleted/truncated. The total number of > > > blocks for the given fork is retrieved first, so any new block will be > > > ignored. Such new blocks are considered out of scope as being written > > > after the start of the check. > > > > > > Each time a buffer is being checked, the target buffer mapping > > > partition lock is acquired in shared mode, to prevent concurrent > > > eviction. If the buffer is found in shared buffers, it's pinned and > > > released immediately, just to get the state. > > > > I wonder if there is possibility that blocks on disk can be corrupted > > even if these are loaded to the shared buffer. ISTM the above method > > cannot detect such corruption. Reading and checking blocks fast is > > attractive but I thought it's also important to check blocks precisely > > without overlooking. > > It can definitely happen, and it's the usual doomsday scenario: > database is working fine for months, then postgres is restarted say > for a minor version upgrade and then boom the most populars blocks > that are constantly used in read only were corrupted on disk but never > evicted from shared buffers, and you have a major outage. I have > witnessed that unfortunately too many times. This is especially bad > as in this kind of scenario, you typically discover the corruption > once all backup only contains the corrupted blocks. > > Note that in the approach I'm suggesting, I do verify blocks that are > loaded in shared buffers, I only ignore the dirty blocks, as they'll > be written by the checkpointer or recovery process in case of unclean > shutdown. A bufferpin isn't necessary to avoid torn page read, an IO > lock also guarantees that and causes less overhead. The included TAP > test should also detect the corruption of a > present-in-shared-buffers-non-dirty block. It could however be > improved eg. by calling pg_prewarm to make sure that it's indeed in > shared_buffers, and also do the same test after a clean restart to > make sure that it's hitting the not-in-shared-buffers case. It reads blocks from disk even if they are loaded in shared buffer. Now I understand. Thanks! Regards, -- Masahiko Sawadahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Online checksums verification in the backend
On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada wrote: > > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud wrote: > > > > This brings the second consideration: how to report the list corrupted > > blocks to end users. As I said this is for now returned via the SRF, > > but this is clearly not ideal and should rather be made available more > > globally. One usage of this information could be block level > > recovery. I'm Cc-ing Sawada-san, as I know he's working on this and > > mentioned me that he had ideas on passing the list of corrupted blocks > > using the stat collector. > > Yes it's necessary the list of corrupted pages for single page > recovery. Apart from single page recovery I think it's helpful for DBA > if they can find the corrupted blocks in the server logs and on a > system view. > > I've also tried to report corrupted pages to the stats collector > during I researching single page recovery in PostgreSQL but one > problem is that the statistics in the stats collector is cleared when > crash recovery. I want the information of block corruption to survive > even when the server down. Yes, having the list of corrupted blocks surviving a crash-and-restart cycle, and also available after a clean shutdown is definitely important. > And we might want to add checksums to the > permanent file having information of database corruption. The > correctness of these information would be important because we can fix > a database by restoring some tables from a logical backup or by doing > reindex etc as long as we have a non-broken information of database > corruption. Agreed > > Finally, the read and locking considerations. I tried to cover that > > extensively in the comments, but here are some details on how I tried > > to make the check safe while trying to keep the overhead as low as > > possible. First thing is that this is only doing buffered reads, > > without any attempt to discard OS cache. Therefore, any discrepancy > > between the OS cache and the disk cannot be detected unless you do > > other actions, such as sync / drop_caches on GNU/Linux. > > > > An access share lock on the currently checked relation is held, > > meaning that it can't get deleted/truncated. The total number of > > blocks for the given fork is retrieved first, so any new block will be > > ignored. Such new blocks are considered out of scope as being written > > after the start of the check. > > > > Each time a buffer is being checked, the target buffer mapping > > partition lock is acquired in shared mode, to prevent concurrent > > eviction. If the buffer is found in shared buffers, it's pinned and > > released immediately, just to get the state. > > I wonder if there is possibility that blocks on disk can be corrupted > even if these are loaded to the shared buffer. ISTM the above method > cannot detect such corruption. Reading and checking blocks fast is > attractive but I thought it's also important to check blocks precisely > without overlooking. It can definitely happen, and it's the usual doomsday scenario: database is working fine for months, then postgres is restarted say for a minor version upgrade and then boom the most populars blocks that are constantly used in read only were corrupted on disk but never evicted from shared buffers, and you have a major outage. I have witnessed that unfortunately too many times. This is especially bad as in this kind of scenario, you typically discover the corruption once all backup only contains the corrupted blocks. Note that in the approach I'm suggesting, I do verify blocks that are loaded in shared buffers, I only ignore the dirty blocks, as they'll be written by the checkpointer or recovery process in case of unclean shutdown. A bufferpin isn't necessary to avoid torn page read, an IO lock also guarantees that and causes less overhead. The included TAP test should also detect the corruption of a present-in-shared-buffers-non-dirty block. It could however be improved eg. by calling pg_prewarm to make sure that it's indeed in shared_buffers, and also do the same test after a clean restart to make sure that it's hitting the not-in-shared-buffers case.
Re: Online checksums verification in the backend
On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud wrote: > > Hi, > > This topic was discussed several times, with the most recent > discussions found at [1] and [2]. Based on those discussions, my > understanding is that the current approach in BASE_BACKUP has too many > drawbacks and we should instead do this check in the backend. I've > been working using such approach at VMware, and I'm submitting it here > to discuss the approach and rationales, and hopefully have such a > feature integrated. Thank you for working on this! > > First, this was originally developed as an extension. It means that > the check is performed using an SRF. That's maybe not the best > approach, as a transaction has be kept for the total processing time. > It can be leveraged by checking each relation independently, but > that's still not ideal. Maybe using some utility commands (as part of > VACUUM or a new CHECK command for instance) would be a better > approach. > > This brings the second consideration: how to report the list corrupted > blocks to end users. As I said this is for now returned via the SRF, > but this is clearly not ideal and should rather be made available more > globally. One usage of this information could be block level > recovery. I'm Cc-ing Sawada-san, as I know he's working on this and > mentioned me that he had ideas on passing the list of corrupted blocks > using the stat collector. Yes it's necessary the list of corrupted pages for single page recovery. Apart from single page recovery I think it's helpful for DBA if they can find the corrupted blocks in the server logs and on a system view. I've also tried to report corrupted pages to the stats collector during I researching single page recovery in PostgreSQL but one problem is that the statistics in the stats collector is cleared when crash recovery. I want the information of block corruption to survive even when the server down. And we might want to add checksums to the permanent file having information of database corruption. The correctness of these information would be important because we can fix a database by restoring some tables from a logical backup or by doing reindex etc as long as we have a non-broken information of database corruption. > > Finally, the read and locking considerations. I tried to cover that > extensively in the comments, but here are some details on how I tried > to make the check safe while trying to keep the overhead as low as > possible. First thing is that this is only doing buffered reads, > without any attempt to discard OS cache. Therefore, any discrepancy > between the OS cache and the disk cannot be detected unless you do > other actions, such as sync / drop_caches on GNU/Linux. > > An access share lock on the currently checked relation is held, > meaning that it can't get deleted/truncated. The total number of > blocks for the given fork is retrieved first, so any new block will be > ignored. Such new blocks are considered out of scope as being written > after the start of the check. > > Each time a buffer is being checked, the target buffer mapping > partition lock is acquired in shared mode, to prevent concurrent > eviction. If the buffer is found in shared buffers, it's pinned and > released immediately, just to get the state. I wonder if there is possibility that blocks on disk can be corrupted even if these are loaded to the shared buffer. ISTM the above method cannot detect such corruption. Reading and checking blocks fast is attractive but I thought it's also important to check blocks precisely without overlooking. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Online checksums verification in the backend
On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier wrote: > > On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote: > > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas wrote: > >> Some people might prefer notices, because you can get those while the > >> thing is still running, rather than a result set, which you will only > >> see when the query finishes. Other people might prefer an SRF, because > >> they want to have the data in structured form so that they can > >> postprocess it. Not sure what you mean by "more globally." > > > > I meant having the results available system-wide, not only to the > > caller. I think that emitting a log/notice level should always be > > done on top on whatever other communication facility we're using. > > The problem of notice and logs is that they tend to be ignored. Now I > don't see no problems either in adding something into the logs which > can be found later on for parsing on top of a SRF returned by the > caller which includes all the corruption details, say with pgbadger > or your friendly neighborhood grep. I think that any backend function > should also make sure to call pgstat_report_checksum_failure() to > report a report visible at database-level in the catalogs, so as it is > possible to use that as a cheap high-level warning. The details of > the failures could always be dug from the logs or the result of the > function itself after finding out that something is wrong in > pg_stat_database. I agree that adding extra information in the logs and calling pgstat_report_checksum_failure is a must do, and I changed that locally. However, I doubt that the logs is the right place to find the details of corrupted blocks. There's no guarantee that the file will be accessible to the DBA, nor that the content won't get truncated by the time it's needed. I really think that corruption is important enough to justify more specific location. > >> I guess one > >> idea would be to provide a way to kick this off in the background via > >> a background worker or similar and then have it put the results in a > >> table. But that might fail if there are checksum errors in the > >> catalogs themselves. > > > > Yes that's a concern. We could maintain a list in (dynamic) shared > > memory with a simple SQL wrapper to read the data, but that would be > > lost with a crash/restart. Or use > > pgstat_report_checksum_failures_in_db(), modifying it to get an > > relfilenode, bocknum and forknum and append that to some flat files, > > hoping that it won't get corrupted either. > > If a lot of blocks are corrupted, that could bloat things. Hence some > retention policies would be necessary, and that's tricky to define and > configure properly. I'd tend to be in the school of just logging the > information and be done with it, because that's simple and because you > won't need to worry about any more configuration. If the number of corrupted blocks becomes high enough to excessively bloat things, it's likely that the instance is doomed anyway, so I'm not especially concerned about it.
Re: Online checksums verification in the backend
On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote: > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas wrote: >> Some people might prefer notices, because you can get those while the >> thing is still running, rather than a result set, which you will only >> see when the query finishes. Other people might prefer an SRF, because >> they want to have the data in structured form so that they can >> postprocess it. Not sure what you mean by "more globally." > > I meant having the results available system-wide, not only to the > caller. I think that emitting a log/notice level should always be > done on top on whatever other communication facility we're using. The problem of notice and logs is that they tend to be ignored. Now I don't see no problems either in adding something into the logs which can be found later on for parsing on top of a SRF returned by the caller which includes all the corruption details, say with pgbadger or your friendly neighborhood grep. I think that any backend function should also make sure to call pgstat_report_checksum_failure() to report a report visible at database-level in the catalogs, so as it is possible to use that as a cheap high-level warning. The details of the failures could always be dug from the logs or the result of the function itself after finding out that something is wrong in pg_stat_database. >> I guess one >> idea would be to provide a way to kick this off in the background via >> a background worker or similar and then have it put the results in a >> table. But that might fail if there are checksum errors in the >> catalogs themselves. > > Yes that's a concern. We could maintain a list in (dynamic) shared > memory with a simple SQL wrapper to read the data, but that would be > lost with a crash/restart. Or use > pgstat_report_checksum_failures_in_db(), modifying it to get an > relfilenode, bocknum and forknum and append that to some flat files, > hoping that it won't get corrupted either. If a lot of blocks are corrupted, that could bloat things. Hence some retention policies would be necessary, and that's tricky to define and configure properly. I'd tend to be in the school of just logging the information and be done with it, because that's simple and because you won't need to worry about any more configuration. Doing the work in the background is still separate than a SQL-callable function though, no? In this case you need a connection to a database to allow the checksum verification to happen on a relfilenode based on the relation to check, also because you want the thing to be safe concurrently (a background work here is a combo with a bgworker triggering dynamic children working on one database, not necessarily something that needs to be in core). -- Michael signature.asc Description: PGP signature
Re: Online checksums verification in the backend
On Mon, Dec 9, 2019 at 5:21 PM Robert Haas wrote: > > On Fri, Dec 6, 2019 at 9:51 AM Julien Rouhaud wrote: > > > This brings the second consideration: how to report the list corrupted > > blocks to end users. As I said this is for now returned via the SRF, > > but this is clearly not ideal and should rather be made available more > > globally. > > Some people might prefer notices, because you can get those while the > thing is still running, rather than a result set, which you will only > see when the query finishes. Other people might prefer an SRF, because > they want to have the data in structured form so that they can > postprocess it. Not sure what you mean by "more globally." I meant having the results available system-wide, not only to the caller. I think that emitting a log/notice level should always be done on top on whatever other communication facility we're using. > I guess one > idea would be to provide a way to kick this off in the background via > a background worker or similar and then have it put the results in a > table. But that might fail if there are checksum errors in the > catalogs themselves. Yes that's a concern. We could maintain a list in (dynamic) shared memory with a simple SQL wrapper to read the data, but that would be lost with a crash/restart. Or use pgstat_report_checksum_failures_in_db(), modifying it to get an relfilenode, bocknum and forknum and append that to some flat files, hoping that it won't get corrupted either.
Re: Online checksums verification in the backend
On Fri, Dec 6, 2019 at 9:51 AM Julien Rouhaud wrote: > This topic was discussed several times, with the most recent > discussions found at [1] and [2]. Based on those discussions, my > understanding is that the current approach in BASE_BACKUP has too many > drawbacks and we should instead do this check in the backend. Good idea. > This brings the second consideration: how to report the list corrupted > blocks to end users. As I said this is for now returned via the SRF, > but this is clearly not ideal and should rather be made available more > globally. Some people might prefer notices, because you can get those while the thing is still running, rather than a result set, which you will only see when the query finishes. Other people might prefer an SRF, because they want to have the data in structured form so that they can postprocess it. Not sure what you mean by "more globally." I guess one idea would be to provide a way to kick this off in the background via a background worker or similar and then have it put the results in a table. But that might fail if there are checksum errors in the catalogs themselves. I don't really know what's best. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Online checksums verification in the backend
Hi, This topic was discussed several times, with the most recent discussions found at [1] and [2]. Based on those discussions, my understanding is that the current approach in BASE_BACKUP has too many drawbacks and we should instead do this check in the backend. I've been working using such approach at VMware, and I'm submitting it here to discuss the approach and rationales, and hopefully have such a feature integrated. First, this was originally developed as an extension. It means that the check is performed using an SRF. That's maybe not the best approach, as a transaction has be kept for the total processing time. It can be leveraged by checking each relation independently, but that's still not ideal. Maybe using some utility commands (as part of VACUUM or a new CHECK command for instance) would be a better approach. This brings the second consideration: how to report the list corrupted blocks to end users. As I said this is for now returned via the SRF, but this is clearly not ideal and should rather be made available more globally. One usage of this information could be block level recovery. I'm Cc-ing Sawada-san, as I know he's working on this and mentioned me that he had ideas on passing the list of corrupted blocks using the stat collector. Finally, the read and locking considerations. I tried to cover that extensively in the comments, but here are some details on how I tried to make the check safe while trying to keep the overhead as low as possible. First thing is that this is only doing buffered reads, without any attempt to discard OS cache. Therefore, any discrepancy between the OS cache and the disk cannot be detected unless you do other actions, such as sync / drop_caches on GNU/Linux. An access share lock on the currently checked relation is held, meaning that it can't get deleted/truncated. The total number of blocks for the given fork is retrieved first, so any new block will be ignored. Such new blocks are considered out of scope as being written after the start of the check. Each time a buffer is being checked, the target buffer mapping partition lock is acquired in shared mode, to prevent concurrent eviction. If the buffer is found in shared buffers, it's pinned and released immediately, just to get the state. If the buffer is found dirty, no check is performed as it'll be written to disk by the checkpointer, or during recovery in case of unclean shutdown. Otherwise, an IO lock is held while the the buffer is being read in a private buffer. IO Lock and buffer mapping lock are released and then the check is performed. If the buffer is not found in shared buffers, the buffer mapping partition lock is released immediately and the block is read from disk. It's therefore possible to get a false positive here, as the block could be concurrently read, modified and partially written to disk. So, if an error is detected in this case, the check is restarted from scratch and if the buffer is still not found in shared buffers, the read will be done while still holding the buffer mapping partition lock to make sure that it can't get concurrently loaded and modified. This is an optimistic approach to avoid performance overhead, assuming that there shouldn't be a lot of positive, and false positive possibility is very narrow. The check consists of simply comparing the stored and computed checksum, with an additional check that the page is really new (using PageIsVerified) if it's found as PageIsNew(). Since this is done after releasing all locks, we could definitely add more checks without causing much overhead, like pd_lower/pd_upper sanity. I prefer to keep the check simple for now and rather focus on the general approach. Finally, I also reused vacuum costing GUC (for simplicity) and approach to add some throttling. I'm attaching a patch that adds a new pg_check_relation() sql function to perform a check of one or all relations, and some simple regression tests. [1] https://www.postgresql.org/message-id/flat/1532606373.3422.5.camel%40credativ.de [2] https://www.postgresql.org/message-id/flat/20190326170820.6sylklg7eh6uhabd%40alap3.anarazel.de online_checksum_verification-v1.diff Description: Binary data