Re: Online verification of checksums

2021-09-13 Thread Daniel Gustafsson
> On 2 Sep 2021, at 13:18, Daniel Gustafsson  wrote:
> 
>> On 9 Jul 2021, at 22:00, Ibrar Ahmed  wrote:
> 
>> I am changing the status to "Waiting on Author" based on the latest comments
>> of @David Steele, and because the patch does not apply cleanly.
> 
> This patch hasn’t moved since being marked as WoA in the last CF and still doesn’t
> apply. Unless there is a new version brewing, it seems apt to close this as RwF
> and await a new entry in a future CF.

As there has been no movement, I've marked this patch as RwF.

--
Daniel Gustafsson   https://vmware.com/





Re: Online verification of checksums

2021-09-02 Thread Daniel Gustafsson
> On 9 Jul 2021, at 22:00, Ibrar Ahmed  wrote:

> I am changing the status to "Waiting on Author" based on the latest comments
> of @David Steele, and because the patch does not apply cleanly.

This patch hasn’t moved since being marked as WoA in the last CF and still doesn’t
apply. Unless there is a new version brewing, it seems apt to close this as RwF
and await a new entry in a future CF.

--
Daniel Gustafsson   https://vmware.com/





Re: Online verification of checksums

2021-07-09 Thread Ibrar Ahmed
On Tue, Mar 9, 2021 at 10:43 PM David Steele  wrote:

> On 11/30/20 6:38 PM, David Steele wrote:
> > On 11/30/20 9:27 AM, Stephen Frost wrote:
> >> * Michael Paquier (mich...@paquier.xyz) wrote:
> >>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
>  * Magnus Hagander (mag...@hagander.net) wrote:
> > On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier wrote:
> >> But here the checksum is broken, so while the offset is something we
> >> can rely on how do you make sure that the LSN is fine?  A broken
> >> checksum could perfectly mean that the LSN is actually *not* fine if
> >> the page header got corrupted.
> 
>  Of course that could be the case, but it gets to be a smaller and
>  smaller chance by checking that the LSN read falls within reasonable
>  bounds.
> >>>
> >>> FWIW, I find that scary.
> >>
> >> There's ultimately different levels of 'scary' and the risk here that
> >> something is actually wrong following these checks strikes me as being
> >> on the same order as random bits being flipped in the page and still
> >> getting a valid checksum (which is entirely possible with our current
> >> checksum...), or maybe even less.
> >
> > I would say a lot less. First you'd need to corrupt one of the eight
> > bytes that make up the LSN (pretty likely since corruption will probably
> > affect the entire block) and then it would need to be updated to a value
> > that falls within the current backup range, a 1 in 16 million chance if
> > a terabyte of WAL is generated during the backup. Plus, the corruption
> > needs to happen during the backup since we are going to check for that,
> > and the corrupted LSN needs to be ascending, and the LSN originally read
> > needs to be within the backup range (another 1 in 16 million chance)
> > since pages written before the start backup checkpoint should not be
> torn.
> >
> > So as far as I can see there are more likely to be false negatives from
> > the checksum itself.
> >
> > It would also be easy to add a few rounds of checks, i.e. test if the
> > LSN ascends but stays in the backup LSN range N times.
> >
> > Honestly, I'm much more worried about corruption zeroing the entire
> > page. I don't know how likely that is, but I know none of our proposed
> > solutions would catch it.
> >
> > Andres, since you brought this issue up originally perhaps you'd like to
> > weigh in?
>
> I had another look at this patch and though I think my suggestions above
> would improve the patch, I have no objections to going forward as is (if
> that is the consensus) since this seems an improvement over what we have
> now.
>
> It comes down to whether you prefer false negatives or false positives.
> With the LSN checking Stephen and I advocate it is theoretically
> possible to have a false negative but the chances of the LSN ascending N
> times but staying within the backup LSN range due to corruption seems to
> be approaching zero.
>
> I think Michael's method is unlikely to throw false positives, but it
> seems at least possible that a block would be hot enough to appear
> torn N times in a row. Torn pages themselves are really easy to reproduce.
>
> If we do go forward with this method I would likely propose the
> LSN-based approach as a future improvement.
>
> Regards,
> --
> -David
> da...@pgmasters.net
>
>
>
I am changing the status to "Waiting on Author" based on the latest
comments of @David Steele, and because the patch does not apply cleanly.

http://cfbot.cputube.org/patch_33_2719.log



-- 
Ibrar Ahmed


Re: Online verification of checksums

2021-03-09 Thread David Steele

On 11/30/20 6:38 PM, David Steele wrote:

On 11/30/20 9:27 AM, Stephen Frost wrote:

* Michael Paquier (mich...@paquier.xyz) wrote:

On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:

* Magnus Hagander (mag...@hagander.net) wrote:
On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier wrote:

But here the checksum is broken, so while the offset is something we
can rely on, how do you make sure that the LSN is fine?  A broken
checksum could perfectly well mean that the LSN is actually *not* fine if
the page header got corrupted.


Of course that could be the case, but it gets to be a smaller and
smaller chance by checking that the LSN read falls within reasonable
bounds.


FWIW, I find that scary.


There's ultimately different levels of 'scary' and the risk here that
something is actually wrong following these checks strikes me as being
on the same order as random bits being flipped in the page and still
getting a valid checksum (which is entirely possible with our current
checksum...), or maybe even less. 


I would say a lot less. First you'd need to corrupt one of the eight 
bytes that make up the LSN (pretty likely since corruption will probably 
affect the entire block) and then it would need to be updated to a value 
that falls within the current backup range, a 1 in 16 million chance if 
a terabyte of WAL is generated during the backup. Plus, the corruption 
needs to happen during the backup since we are going to check for that, 
and the corrupted LSN needs to be ascending, and the LSN originally read 
needs to be within the backup range (another 1 in 16 million chance) 
since pages written before the start backup checkpoint should not be torn.


So as far as I can see there are more likely to be false negatives from 
the checksum itself.


It would also be easy to add a few rounds of checks, i.e. test if the 
LSN ascends but stays in the backup LSN range N times.


Honestly, I'm much more worried about corruption zeroing the entire 
page. I don't know how likely that is, but I know none of our proposed 
solutions would catch it.


Andres, since you brought this issue up originally perhaps you'd like to 
weigh in?


I had another look at this patch and though I think my suggestions above 
would improve the patch, I have no objections to going forward as is (if 
that is the consensus) since this seems an improvement over what we have 
now.


It comes down to whether you prefer false negatives or false positives.
With the LSN checking Stephen and I advocate, it is theoretically
possible to have a false negative, but the chances of the LSN ascending N
times while staying within the backup LSN range due to corruption seem to
be approaching zero.


I think Michael's method is unlikely to throw false positives, but it
seems at least possible that a block would be hot enough to appear
torn N times in a row. Torn pages themselves are really easy to reproduce.


If we do go forward with this method I would likely propose the 
LSN-based approach as a future improvement.


Regards,
--
-David
da...@pgmasters.net




Re: Online verification of checksums

2020-11-30 Thread David Steele

On 11/30/20 9:27 AM, Stephen Frost wrote:

Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:

On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:

* Magnus Hagander (mag...@hagander.net) wrote:

On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier  wrote:

But here the checksum is broken, so while the offset is something we
can rely on, how do you make sure that the LSN is fine?  A broken
checksum could perfectly well mean that the LSN is actually *not* fine if
the page header got corrupted.


Of course that could be the case, but it gets to be a smaller and
smaller chance by checking that the LSN read falls within reasonable
bounds.


FWIW, I find that scary.


There's ultimately different levels of 'scary' and the risk here that
something is actually wrong following these checks strikes me as being
on the same order as random bits being flipped in the page and still
getting a valid checksum (which is entirely possible with our current
checksum...), or maybe even less.  


I would say a lot less. First you'd need to corrupt one of the eight 
bytes that make up the LSN (pretty likely since corruption will probably 
affect the entire block) and then it would need to be updated to a value 
that falls within the current backup range, a 1 in 16 million chance if 
a terabyte of WAL is generated during the backup. Plus, the corruption 
needs to happen during the backup since we are going to check for that, 
and the corrupted LSN needs to be ascending, and the LSN originally read 
needs to be within the backup range (another 1 in 16 million chance) 
since pages written before the start backup checkpoint should not be torn.
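
A back-of-the-envelope check of the "1 in 16 million" figure (a sketch in
Python, assuming a corrupted pd_lsn behaves like a uniformly random 64-bit
byte position and ~1TB of WAL is generated during the backup):

    wal_bytes = 2 ** 40            # ~1TB of WAL generated during the backup
    lsn_space = 2 ** 64            # LSNs are 64-bit byte positions
    print(f"1 in {lsn_space // wal_bytes:,}")   # -> 1 in 16,777,216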


So as far as I can see there are more likely to be false negatives from 
the checksum itself.


It would also be easy to add a few rounds of checks, i.e. test if the 
LSN ascends but stays in the backup LSN range N times.


Honestly, I'm much more worried about corruption zeroing the entire 
page. I don't know how likely that is, but I know none of our proposed 
solutions would catch it.


Andres, since you brought this issue up originally perhaps you'd like to 
weigh in?


Regards,
--
-David
da...@pgmasters.net




Re: Online verification of checksums

2020-11-30 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
> > * Magnus Hagander (mag...@hagander.net) wrote:
> >> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier wrote:
> >>> But here the checksum is broken, so while the offset is something we
> >>> can rely on how do you make sure that the LSN is fine?  A broken
> >>> checksum could perfectly mean that the LSN is actually *not* fine if
> >>> the page header got corrupted.
> >
> > Of course that could be the case, but it gets to be a smaller and
> > smaller chance by checking that the LSN read falls within reasonable
> > bounds.
> 
> FWIW, I find that scary.

There's ultimately different levels of 'scary' and the risk here that
something is actually wrong following these checks strikes me as being
on the same order as random bits being flipped in the page and still
getting a valid checksum (which is entirely possible with our current
checksum...), or maybe even less.  Both cases would result in a false
negative, which is surely bad, though that strikes me as better than a
false positive, where we say there's corruption when there isn't.
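
For intuition on that false-negative floor (a rough sketch in Python,
assuming corruption randomizes the 16-bit pd_checksum field uniformly):

    print(f"~1 in {2 ** 16:,} corrupted pages still pass")   # -> ~1 in 65,536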

> And this bit is interesting, because that would give the guarantees
> you are looking for with a page retry (just grep for BM_IO_IN_PROGRESS
> on the thread):
> https://www.postgresql.org/message-id/20201102193457.fc2hoen7ahth4...@alap3.anarazel.de

There's no guarantee that the page is still in shared buffers or that we
have a buffer descriptor still for it by the time we're doing this, as I
said up-thread.  This approach requires that we reach into PG, acquire
at least a buffer descriptor and set BM_IO_IN_PROGRESS on it and then
read the page again and checksum it again before finally looking at the
now-'trusted' LSN (even though it might have had some bits flipped in
it and we wouldn't know..) and seeing if it's higher than the start of the
backup, and maybe less than the current LSN.  Maybe we can avoid
actually pulling the page into shared buffers (reading it into our own
memory instead) and just have the buffer descriptor but none of this
seems like it's going to be very unobtrusive in either code or the
running system, and it isn't going to give us an actual guarantee that
there's been no corruption.  The amount that it improves on the checks
that I outline above seems to be exceedingly small and the question is
if it's worth it for, most likely, exclusively pg_basebackup (unless
we're going to figure out a way to expose this via SQL, which seems
unlikely).

> > One idea that has occurred
> > to me which hasn't been discussed is checking the file's mtime to see if
> > it's changed since the backup started.  In that case, I would think it'd
> > be something like:
> > 
> > - Checksum is invalid
> > - LSN is within range
> > - Close file
> > - Stat file
> > - If mtime is from before the backup then signal possible corruption
> 
> I suspect that relying on mtime may cause problems.  One case coming
> to my mind is NFS.

I agree that it might not be perfect but it also seems like something
which could be reasonably cheaply checked, and the window (between when
the backup started and the time we hit this torn page) is very likely to
be large enough that the mtime will have been updated and be different
from (and later than) what it was at the time the backup
started.  It's also something that incremental backups may be looking
at, so if there are serious problems with it then there's a good chance
you've got bigger issues.

> > In general, however, I don't like the idea of reaching into PG and
> > asking PG for this page.
> 
> It seems to me that if we don't ask to PG what it thinks about a page,
> we will never have a fully bullet-proof design either.

None of this is bullet-proof, it's all trade-offs.

Thanks,

Stephen




Re: Online verification of checksums

2020-11-28 Thread Michael Paquier
On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
> * Magnus Hagander (mag...@hagander.net) wrote:
>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier  wrote:
>>> But here the checksum is broken, so while the offset is something we
>>> can rely on how do you make sure that the LSN is fine?  A broken
>>> checksum could perfectly mean that the LSN is actually *not* fine if
>>> the page header got corrupted.
>
> Of course that could be the case, but it gets to be a smaller and
> smaller chance by checking that the LSN read falls within reasonable
> bounds.

FWIW, I find that scary.

>> We cannot rely on the LSN itself. But it's a lot more likely that we
>> can rely on the LSN changing, and on the LSN changing in a "correct
>> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
>> *doesn't change* we know it's corrupt. But if the LSN *increases* AND
>> the new page now has a correct checksum, it's very likely to be
>> correct. You could perhaps even put a cap on it saying "if the LSN
>> increased, but by less than <X>", where <X> is a sufficiently high number
>> that it's entirely unreasonable to advance that far between the
>> reading of two blocks. But it has to have a very high margin in that
>> case.
> 
> This is, in fact, included in what was proposed- the "max increase"
> would be "the ending LSN of the backup".  I don't think we can make it
> any tighter than that though without risking false positives, which is
> surely worse than a false negative in this particular case- we already
> risk false negatives due to the fact that our checksum isn't perfect, so
> even a perfect check to make sure that the page will, in fact, be
> replayed over during crash recovery doesn't guarantee that there's no
> corruption.
> 
> If there's an easy and cheap way to see if there was concurrent i/o
> happening for the page, then let's hear it.

This has been discussed for a couple of months now.  I would recommend
going through this thread:
https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5q0y+wulml5sr...@mail.gmail.com

And this bit is interesting, because that would give the guarantees
you are looking for with a page retry (just grep for BM_IO_IN_PROGRESS
on the thread):
https://www.postgresql.org/message-id/20201102193457.fc2hoen7ahth4...@alap3.anarazel.de

> One idea that has occurred
> to me which hasn't been discussed is checking the file's mtime to see if
> it's changed since the backup started.  In that case, I would think it'd
> be something like:
> 
> - Checksum is invalid
> - LSN is within range
> - Close file
> - Stat file
> - If mtime is from before the backup then signal possible corruption

I suspect that relying on mtime may cause problems.  One case coming
to my mind is NFS.

> In general, however, I don't like the idea of reaching into PG and
> asking PG for this page.

It seems to me that if we don't ask PG what it thinks about a page,
we will never have a fully bullet-proof design either.
--
Michael




Re: Online verification of checksums

2020-11-27 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier  wrote:
> > On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > > We are not just looking at one LSN value. Here are the steps we are
> > > proposing (I'll skip checks for zero pages here):
> > >
> > > 1) Test the page checksum. If it passes the page is OK.
> > > 2) If the checksum does not pass then record the page offset and LSN and
> > > continue.
> >
> > But here the checksum is broken, so while the offset is something we
> > can rely on how do you make sure that the LSN is fine?  A broken
> > checksum could perfectly mean that the LSN is actually *not* fine if
> > the page header got corrupted.

Of course that could be the case, but it gets to be a smaller and
smaller chance by checking that the LSN read falls within reasonable
bounds.

> > > 3) After the file is copied, reopen and reread the file, seeking to 
> > > offsets
> > > where possible invalid pages were recorded in the first pass.
> > > a) If the page is now valid then it is OK.
> > > b) If the page is not valid but the LSN has increased from the LSN
> >
> > Per se the previous point about the LSN value that we cannot rely on.
> 
> We cannot rely on the LSN itself. But it's a lot more likely that we
> can rely on the LSN changing, and on the LSN changing in a "correct
> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
> *doesn't change* we know it's corrupt. But if the LSN *increases* AND
> the new page now has a correct checksum, it's very likely to be
> correct. You could perhaps even put a cap on it saying "if the LSN
> increased, but by less than <X>", where <X> is a sufficiently high number
> that it's entirely unreasonable to advance that far between the
> reading of two blocks. But it has to have a very high margin in that
> case.

This is, in fact, included in what was proposed- the "max increase"
would be "the ending LSN of the backup".  I don't think we can make it
any tighter than that though without risking false positives, which is
surely worse than a false negative in this particular case- we already
risk false negatives due to the fact that our checksum isn't perfect, so
even a perfect check to make sure that the page will, in fact, be
replayed over during crash recovery doesn't guarantee that there's no
corruption.

> > > A malicious attacker could easily trick these checks, but as Stephen 
> > > pointed
> > > out elsewhere they would likely make the checksums valid which would 
> > > escape
> > > detection anyway.
> > >
> > > We believe that the chances of random storage corruption passing all these
> > > checks is incredibly small, but eventually we'll also check against the 
> > > WAL
> > > to be completely sure.
> >
> > The lack of check for any concurrent I/O on the follow-up retries is
> > disturbing.  How do you guarantee that on the second retry what you
> > have is a torn page and not something corrupted?  Init forks for
> > example are made of up to 2 blocks, so the window would get short for
> > at least those.  There are many instances with tables that have few
> > pages as well.

If there's an easy and cheap way to see if there was concurrent i/o
happening for the page, then let's hear it.  One idea that has occurred
to me which hasn't been discussed is checking the file's mtime to see if
it's changed since the backup started.  In that case, I would think it'd
be something like:

- Checksum is invalid
- LSN is within range
- Close file
- Stat file
- If mtime is from before the backup then signal possible corruption

If the checksum is invalid and the LSN isn't in range, then signal
corruption.
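
A minimal sketch of that sequence in Python (my code, not pgBackRest's;
classify_suspect() and its arguments are illustrative names, and the NFS
and clock caveats discussed below apply):

    import os

    def classify_suspect(path, lsn_in_range, backup_start_time):
        # Called for a page whose checksum still fails on re-read.
        if not lsn_in_range:
            return "corruption"            # invalid checksum, implausible LSN
        mtime = os.stat(path).st_mtime     # the file was closed before stat'ing
        if mtime < backup_start_time:
            return "possible corruption"   # untouched since the backup began, so
                                           # a concurrent write can't explain it
        return "probably torn"             # modified during backup: in-flight I/O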

In general, however, I don't like the idea of reaching into PG and
asking PG for this page.

> Here I was more worried that the window might get *too long* if tables
> are large :)

I'm not sure that there's really a 'too long' possibility here.

> The risk is certainly that you get a torn page *again* on the second
> read. It could be the same torn page (if it hasn't changed), but you
> can detect that (by the fact that it hasn't actually changed) and
> possibly do a short delay before trying again if it gets that far.

I'm really not a fan of introducing these delays in the hopes that
they'll work..

> That could happen if the process is too quick. It could also be that
> you are unlucky and that you hit a *new* write, and you were so
> unlucky that both times it happened to hit exactly when you were
> reading the page the next time. I'm not sure the chance of that
> happening is even big enough we have to care about it, though?

If there's actually a new write, surely the LSN would be new?  At the
least, it wouldn't be the same LSN as the first read that picked up a
torn page.

In general though, I agree, we are getting to the point here where the
chances of missing something with this approach seem extremely slim.  I
do still like the idea of doing better by actually 

Re: Online verification of checksums

2020-11-26 Thread Magnus Hagander
On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier  wrote:
>
> On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > We are not just looking at one LSN value. Here are the steps we are
> > proposing (I'll skip checks for zero pages here):
> >
> > 1) Test the page checksum. If it passes the page is OK.
> > 2) If the checksum does not pass then record the page offset and LSN and
> > continue.
>
> But here the checksum is broken, so while the offset is something we
> can rely on how do you make sure that the LSN is fine?  A broken
> checksum could perfectly mean that the LSN is actually *not* fine if
> the page header got corrupted.
>
> > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > where possible invalid pages were recorded in the first pass.
> > a) If the page is now valid then it is OK.
> > b) If the page is not valid but the LSN has increased from the LSN
>
> Per se the previous point about the LSN value that we cannot rely on.

We cannot rely on the LSN itself. But it's a lot more likely that we
can rely on the LSN changing, and on the LSN changing in a "correct
way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
*doesn't change* we know it's corrupt. But if the LSN *increases* AND
the new page now has a correct checksum, it's very likely to be
correct. You could perhaps even put a cap on it saying "if the LSN
increased, but by less than <X>", where <X> is a sufficiently high number
that it's entirely unreasonable to advance that far between the
reading of two blocks. But it has to have a very high margin in that
case.
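
A compact sketch of that rule in Python (my reading of it; backup_end_lsn
stands in for the cap <X> left open above):

    def classify_reread(old_lsn, new_lsn, checksum_ok, backup_end_lsn):
        if checksum_ok:
            return "ok"             # page settled into a valid state
        if new_lsn <= old_lsn:
            return "corrupt"        # LSN decreased or did not change
        if new_lsn > backup_end_lsn:
            return "corrupt"        # "ascended" implausibly far
        return "likely torn"        # ascending within bounds: concurrent write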


> > A malicious attacker could easily trick these checks, but as Stephen pointed
> > out elsewhere they would likely make the checksums valid which would escape
> > detection anyway.
> >
> > We believe that the chances of random storage corruption passing all these
> > checks is incredibly small, but eventually we'll also check against the WAL
> > to be completely sure.
>
> The lack of check for any concurrent I/O on the follow-up retries is
> disturbing.  How do you guarantee that on the second retry what you
> have is a torn page and not something corrupted?  Init forks for
> example are made of up to 2 blocks, so the window would get short for
> at least those.  There are many instances with tables that have few
> pages as well.

Here I was more worried that the window might get *too long* if tables
are large :)

The risk is certainly that you get a torn page *again* on the second
read. It could be the same torn page (if it hasn't changed), but you
can detect that (by the fact that it hasn't actually changed) and
possibly do a short delay before trying again if it gets that far.
That could happen if the process is too quick. It could also be that
you are unlucky and that you hit a *new* write, and you were so
unlucky that both times it happened to hit exactly when you were
reading the page the next time. I'm not sure the chance of that
happening is even big enough that we have to care about it, though?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Online verification of checksums

2020-11-25 Thread Michael Paquier
On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> We are not just looking at one LSN value. Here are the steps we are
> proposing (I'll skip checks for zero pages here):
> 
> 1) Test the page checksum. If it passes the page is OK.
> 2) If the checksum does not pass then record the page offset and LSN and
> continue.

But here the checksum is broken, so while the offset is something we
can rely on, how do you make sure that the LSN is fine?  A broken
checksum could perfectly well mean that the LSN is actually *not* fine if
the page header got corrupted.

> 3) After the file is copied, reopen and reread the file, seeking to offsets
> where possible invalid pages were recorded in the first pass.
> a) If the page is now valid then it is OK.
> b) If the page is not valid but the LSN has increased from the LSN

As per the previous point, the LSN value is not something we can rely on.

> A malicious attacker could easily trick these checks, but as Stephen pointed
> out elsewhere they would likely make the checksums valid which would escape
> detection anyway.
> 
> We believe that the chances of random storage corruption passing all these
> checks is incredibly small, but eventually we'll also check against the WAL
> to be completely sure.

The lack of any check for concurrent I/O on the follow-up retries is
disturbing.  How do you guarantee that on the second retry what you
have is a torn page and not something corrupted?  Init forks for
example are made of up to 2 blocks, so the window would get short for
at least those.  There are many instances with tables that have few
pages as well.
--
Michael




Re: Online verification of checksums

2020-11-24 Thread David Steele

Hi Michael,

On 11/23/20 8:10 PM, Michael Paquier wrote:

On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:


Also- what is the point of reading the page from shared buffers
anyway..?  All we need to do is prove that the page will be rewritten
during WAL replay.  If we can prove that, we don't actually care what
the contents of the page are.  We certainly can't calculate the
checksum on a page we plucked out of shared buffers since we only
calculate the checksum when we go to write the page out.


A LSN-based check makes the thing tricky.  How do you make sure that
pd_lsn is not itself broken?  It is perfectly possible that some
random on-disk corruption makes pd_lsn appear to have a correct value,
while the rest of the page is borked.


We are not just looking at one LSN value. Here are the steps we are 
proposing (I'll skip checks for zero pages here):


1) Test the page checksum. If it passes the page is OK.
2) If the checksum does not pass then record the page offset and LSN and 
continue.
3) After the file is copied, reopen and reread the file, seeking to 
offsets where possible invalid pages were recorded in the first pass.

a) If the page is now valid then it is OK.
b) If the page is not valid but the LSN has increased from the LSN 
recorded in the previous pass then it is OK. We can infer this because 
the LSN has been updated in a way that is not consistent with storage 
corruption.


This is what we are planning for the first round of improving our page 
checksum validation. We believe that doing the retry in a second pass 
will be faster and more reliable because some time will have passed 
since the first read without having to build in a delay for each page error.


A further improvement is to check the ascending LSNs found in 3b against 
PostgreSQL to be completely sure they are valid. We are planning this 
for our second round of improvements.


Reopening the file for the second pass does require some additional logic:

1) The file may have been deleted by PG since the first pass and in that 
case we won't report any page errors.
2) The file may have been truncated by PG since the first pass so we 
won't report any errors past the point of truncation.
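
Putting the steps and caveats above together, a minimal sketch in Python (my
code, not pgBackRest's; verify_checksum() is a hypothetical stand-in for the
real page validation, and page_lsn() assumes a native-endian pd_lsn in the
first 8 bytes of the page header):

    import struct

    BLCKSZ = 8192

    def page_lsn(page):
        # pd_lsn is stored as two uint32s (xlogid, then xrecoff).
        xlogid, xrecoff = struct.unpack_from("=II", page, 0)
        return (xlogid << 32) | xrecoff

    def first_pass(f):
        suspects = []                       # (offset, lsn) of failing pages
        offset = 0
        while (page := f.read(BLCKSZ)):
            if len(page) == BLCKSZ and not verify_checksum(page):
                suspects.append((offset, page_lsn(page)))
            offset += BLCKSZ
        return suspects

    def second_pass(path, suspects):
        errors = []
        try:
            f = open(path, "rb")
        except FileNotFoundError:           # 1) file deleted by PG: no errors
            return errors
        with f:
            for offset, old_lsn in suspects:
                f.seek(offset)
                page = f.read(BLCKSZ)
                if len(page) < BLCKSZ:      # 2) file truncated by PG
                    break
                if verify_checksum(page):   # 3a) page is now valid
                    continue
                if page_lsn(page) > old_lsn:  # 3b) LSN ascended: not corruption
                    continue
                errors.append(offset)       # still invalid, LSN not ascending
        return errors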


A malicious attacker could easily trick these checks, but as Stephen 
pointed out elsewhere they would likely make the checksums valid which 
would escape detection anyway.


We believe that the chances of random storage corruption passing all
these checks are incredibly small, but eventually we'll also check
against the WAL to be completely sure.


Regards,
--
-David
da...@pgmasters.net




Re: Online verification of checksums

2020-11-23 Thread Stephen Frost
Greetings,

On Mon, Nov 23, 2020 at 20:28 Michael Paquier  wrote:

> On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> > * Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> >> Yes and this is a tricky part. Until you have explained it in your
> latest
> >> message, I wasn't sure how we can distinct concurrent update from a page
> >> header corruption. Now I agree that if page LSN updated and increased
> >> between rereads, it is safe enough to conclude that we have some
> concurrent
> >> load.
> >
> > Even in this case, it's almost free to compare the LSN to the starting
> > backup LSN, and to the current LSN position, and make sure it's
> > somewhere between the two.  While that doesn't entirely eliminate the
> > possibility that the page happened to get corrupted *and* return a
> > different result on subsequent reads *and* that it was corrupted in such
> > a way that the LSN ended up falling between the starting backup LSN and
> > the current LSN, it's certainly reducing the chances of a false negative
> > a fair bit.
>
> FWIW, I am not much a fan of designs that are not bullet-proof by
> design.  This reduces the odds of problems, sure, still it does not
> discard the possibility of incorrect results, confusing users as well
> as people looking at such reports.


Let’s be clear about this- our checksums are, themselves, far from
bulletproof, regardless of all of our other efforts.  They are not
foolproof against any corruption, and certainly not even close to being
sufficient for guarantees you’d expect in, say, encryption integrity.  We
cannot say with certainty that a page which passes checksum validation
isn’t corrupted in some way.  A page which doesn’t pass checksum validation
may be corrupted or may be torn, and we aren’t 100% sure of that either, but we
can work to try and make a sensible call about which it is.

> >> To sum up, I agree with your proposal to reread the page and rely on
> >> ascending LSNs. Can you submit a patch?
> >
> > Probably would make sense to give Michael an opportunity to comment and
> > get his thoughts on this, and for him to update the patch if he agrees.
>
> I think that a LSN check would be a safe thing to do iff pd_checksum
> is already checked first to make sure that the page contents are fine
> to use.   Still, what's the point in doing a LSN check anyway if we
> know that the checksum is valid?  Then on a retry if the first attempt
> failed you also need the guarantee that there is zero concurrent I/O
> activity while a page is rechecked (no need to do that unless the
> initial page check doing a checksum match failed).  So the retry needs
> to do some s_b interactions, but then comes the much trickier point of
> concurrent smgrwrite() calls bypassing the shared buffers.


I agree that the LSN check isn’t interesting if the page passes the
checksum validation.  I do think we can look at the LSN and make reasonable
inferences based off of it even if the checksum doesn’t validate- in
particular, in my experience at least, the result of a read, without any
intervening write, is very likely to be the same if performed multiple
times quickly even if there is latent storage corruption- due to cache’ing,
if nothing else.  What’s interesting about the LSN check is that we are
specifically looking to see if it *changed* in a reasonable and predictable
manner, and that it was replaced with a new yet reasonable value. The
chances of that happening due to latent storage corruption is vanishingly
small.

> > As it relates to pgbackrest, we're currently contemplating having a
> > higher level loop which, upon detecting any page with an invalid
> > checksum, continues to scan to the end of that file and perform the
> > compression, encryption, et al, but then loops back after we've
> > completed that file and skips through the file again, re-reading those
> > pages which didn't have a valid checksum the first time to see if their
> > LSN has changed and is within the range of the backup.  This will
> > certainly give more opportunity for the kernel to 'catch up', if needed,
> > and give us an updated page without a random 100ms delay, and will also
> > make it easier for us to, eventually, check and make sure the page was
> > in the WAL that was produced as part of the backup, to give us a
> > complete guarantee that the contents of this page don't matter and that
> > the failed checksum isn't a sign of latent storage corruption.
>
> That would reduce the likelihood of facing torn pages, still you
> cannot fully discard the problem either as a same page may get changed
> again once you loop over, no?  And what if a corruption has updated
> pd_lsn on-disk?  Unlikely so, still possible.


We surely don’t care about a page which has been changed multiple times by
PG during the backup, since all those changes will be, by definition, in
the WAL, no?  Therefore, one loop to see that the value of the LSN
*changed*, meaning something wrote something new there, 

Re: Online verification of checksums

2020-11-23 Thread Michael Paquier
On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
>> Yes and this is a tricky part. Until you have explained it in your latest
>> message, I wasn't sure how we can distinct concurrent update from a page
>> header corruption. Now I agree that if page LSN updated and increased
>> between rereads, it is safe enough to conclude that we have some concurrent
>> load.
> 
> Even in this case, it's almost free to compare the LSN to the starting
> backup LSN, and to the current LSN position, and make sure it's
> somewhere between the two.  While that doesn't entirely eliminate the
> possibility that the page happened to get corrupted *and* return a
> different result on subsequent reads *and* that it was corrupted in such
> a way that the LSN ended up falling between the starting backup LSN and
> the current LSN, it's certainly reducing the chances of a false negative
> a fair bit.

FWIW, I am not much of a fan of designs that are not bullet-proof by
design.  This reduces the odds of problems, sure, but it still does not
rule out the possibility of incorrect results, confusing users as well
as people looking at such reports.

>> To sum up, I agree with your proposal to reread the page and rely on
>> ascending LSNs. Can you submit a patch?
> 
> Probably would make sense to give Michael an opportunity to comment and
> get his thoughts on this, and for him to update the patch if he agrees.

I think that a LSN check would be a safe thing to do iff pd_checksum
is already checked first to make sure that the page contents are fine
to use.   Still, what's the point in doing a LSN check anyway if we
know that the checksum is valid?  Then on a retry if the first attempt
failed you also need the guarantee that there is zero concurrent I/O
activity while a page is rechecked (no need to do that unless the
initial page check doing a checksum match failed).  So the retry needs
to do some s_b interactions, but then comes the much trickier point of
concurrent smgrwrite() calls bypassing the shared buffers.

> As it relates to pgbackrest, we're currently contemplating having a
> higher level loop which, upon detecting any page with an invalid
> checksum, continues to scan to the end of that file and perform the
> compression, encryption, et al, but then loops back after we've
> completed that file and skips through the file again, re-reading those
> pages which didn't have a valid checksum the first time to see if their
> LSN has changed and is within the range of the backup.  This will
> certainly give more opportunity for the kernel to 'catch up', if needed,
> and give us an updated page without a random 100ms delay, and will also
> make it easier for us to, eventually, check and make sure the page was
> in the WAL that was produced as part of the backup, to give us a
> complete guarantee that the contents of this page don't matter and that
> the failed checksum isn't a sign of latent storage corruption.

That would reduce the likelihood of facing torn pages, but you still
cannot fully discard the problem either, as the same page may get changed
again once you loop over, no?  And what if corruption has updated
pd_lsn on-disk?  Unlikely, but still possible.
--
Michael




Re: Online verification of checksums

2020-11-23 Thread Michael Paquier
On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
>> It seems reasonable to me to rely on checksums only.
>> 
>> As for retry, I think that API for concurrent I/O will be complicated.
>> Instead, we can introduce a function to read the page directly from shared
>> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
>> solution to me. Do you see any possible problems with it?

It seems to me that you are missing the point here.  It is not
necessary to read a page from shared buffers.  What is necessary is to
make sure that there is zero concurrent I/O activity in shared buffers
while a page is getting checked on disk, giving the assurance that
there is zero risk of having a torn page for any check on data that
goes through shared buffers.  You could do that only on a retry if we
found a page where there was a checksum mismatch, meaning that the
page was either torn or corrupted, and needs an extra verification
anyway.

> We might end up reading pages back in that have been evicted, for one
> thing, which doesn't seem great, and this also seems likely to be
> awkward for cases which aren't using the replication protocol, unless
> every process maintains a connection to PG the entire time, which also
> doesn't seem great.

I don't quite see a problem in checking pages that have just been
evicted if we are able to detect faster that a page is corrupted.
The initial check may fail because a page was torn, meaning that it
was in the middle of an eviction, but the page could also be
corrupted, meaning that it was *not* torn, and it would then fail a
retry where we make sure that there is no concurrent s_b activity.
So in the worst case, you make the detection of a corrupted page
faster.

Please note that Andres also mentioned the potential need to
worry about table AMs that call smgrwrite() directly, bypassing shared
buffers.  The only cases in-core where it is used are related to init
forks when an unlogged relation gets created, where it would not
matter if you are doing a page check while holding a database
transaction as the newly-created relation would not be visible yet,
but it would matter in the case of base backups doing direct page
lookups.  Fun.

> Also- what is the point of reading the page from shared buffers
> anyway..?  All we need to do is prove that the page will be rewritten
> during WAL replay.  If we can prove that, we don't actually care what
> the contents of the page are.  We certainly can't calculate the
> checksum on a page we plucked out of shared buffers since we only
> calculate the checksum when we go to write the page out.

A LSN-based check makes the thing tricky.  How do you make sure that
pd_lsn is not itself broken?  It is perfectly possible that some
random on-disk corruption makes pd_lsn appear to have a correct value,
while the rest of the page is borked.
--
Michael




Re: Online verification of checksums

2020-11-23 Thread Stephen Frost
Greetings,

* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> On 23.11.2020 18:35, Stephen Frost wrote:
> >* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> >>On 21.11.2020 04:30, Michael Paquier wrote:
> >>>The only method I can think as being really
> >>>reliable is based on two facts:
> >>>- Do a check only on pd_checksums, as that validates the full contents
> >>>of the page.
> >>>- When doing a retry, make sure that there is no concurrent I/O
> >>>activity in the shared buffers.  This requires an API we don't have
> >>>yet.
> >>It seems reasonable to me to rely on checksums only.
> >>
> >>As for retry, I think that API for concurrent I/O will be complicated.
> >>Instead, we can introduce a function to read the page directly from shared
> >>buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> >>solution to me. Do you see any possible problems with it?
> >We might end up reading pages back in that have been evicted, for one
> >thing, which doesn't seem great,
> TBH, I think it is highly unlikely that the page that was just updated will
> be evicted.

Is it though..?  Consider that the page was being written out
specifically to free a buffer for use by another backend-
while perhaps that doesn't happen all the time, it certainly happens
enough on very busy systems.

> >and this also seems likely to be
> >awkward for cases which aren't using the replication protocol, unless
> >every process maintains a connection to PG the entire time, which also
> >doesn't seem great.
> Have I missed something? Now pg_basebackup has only one process + one child
> process for streaming. Anyway, I totally agree with your argument. The need
> to maintain connection(s) to PG is the most unpleasant part of the proposed
> approach.

I was thinking beyond pg_basebackup, yes; apologies for that not being
clear but that's what I was meaning when I said "aren't using the
replication protocol".

> >Also- what is the point of reading the page from shared buffers
> >anyway..?
> Well... Reading a page from shared buffers is a reliable way to get a
> correct page from postgres under any concurrent load. So it just seems
> natural to me.

Yes, that's true, but if a dirty page was just written out by a backend
in order to be able to evict it, so that the backend can then pull in a
new page, then having pg_basebackup pull that page back in really isn't
great.

> >All we need to do is prove that the page will be rewritten
> >during WAL replay.
> Yes and this is a tricky part. Until you have explained it in your latest
> message, I wasn't sure how we can distinct concurrent update from a page
> header corruption. Now I agree that if page LSN updated and increased
> between rereads, it is safe enough to conclude that we have some concurrent
> load.

Even in this case, it's almost free to compare the LSN to the starting
backup LSN, and to the current LSN position, and make sure it's
somewhere between the two.  While that doesn't entirely eliminate the
possibility that the page happened to get corrupted *and* return a
different result on subsequent reads *and* that it was corrupted in such
a way that the LSN ended up falling between the starting backup LSN and
the current LSN, it's certainly reducing the chances of a false negative
a fair bit.

A concern here, however, is- can we be 100% sure that we'll get a
different result from the two subsequent reads?  For my part, at least,
I've been doubtful that it's possible but it'd be nice to hear it from
someone who has really looked at the kernel side.  To try and clarify,
let me illustrate:

pg_basebackup (the backend that's sending data to it anyway) starts
reading an 8K page, but gets interrupted halfway through, meaning that
it's read 4K and is now paused.

PG writes that same 8K page, and is able to successfully write the
entire block.

pg_basebackup then wakes up, reads the second half, computes a checksum
and gets a checksum failure.

At this point the question is: if pg_basebackup loops, seeks and
re-reads the same 8K block again, is it possible that pg_basebackup will
get the "old" starting 4K and the "new" ending 4K again?  I'd like to
think that the answer is 'no' and that the kernel will guarantee that if
we managed to read a "new" ending 4K block then the following read of
the full 8K block would be guaranteed to give us the "new" starting 4K.
If that is truly guaranteed then we could be much more confident that
the idea here of simply checking for an ascending LSN, which falls
between the starting LSN of the backup and the current LSN (or perhaps
the final LSN for the backup) would be sufficient to detect this case.

I would also think that, if we can trust that, then there really isn't
any need for the delay in performing the re-read, which I have to admit
that I don't particularly care for.

> >  If we can prove that, we don't actually care what
> >the contents of the page are.  We certainly can't calculate 

Re: Online verification of checksums

2020-11-23 Thread Anastasia Lubennikova

On 23.11.2020 18:35, Stephen Frost wrote:

Greetings,

* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:

On 21.11.2020 04:30, Michael Paquier wrote:

The only method I can think as being really
reliable is based on two facts:
- Do a check only on pd_checksums, as that validates the full contents
of the page.
- When doing a retry, make sure that there is no concurrent I/O
activity in the shared buffers.  This requires an API we don't have
yet.

It seems reasonable to me to rely on checksums only.

As for retry, I think that API for concurrent I/O will be complicated.
Instead, we can introduce a function to read the page directly from shared
buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
solution to me. Do you see any possible problems with it?

We might end up reading pages back in that have been evicted, for one
thing, which doesn't seem great,
TBH, I think it is highly unlikely that the page that was just updated 
will be evicted.

and this also seems likely to be
awkward for cases which aren't using the replication protocol, unless
every process maintains a connection to PG the entire time, which also
doesn't seem great.
Have I missed something? Now pg_basebackup has only one process + one 
child process for streaming. Anyway, I totally agree with your argument. 
The need to maintain connection(s) to PG is the most unpleasant part of 
the proposed approach.



Also- what is the point of reading the page from shared buffers
anyway..?
Well... Reading a page from shared buffers is a reliable way to get a 
correct page from postgres under any concurrent load. So it just seems 
natural to me.

All we need to do is prove that the page will be rewritten
during WAL replay.
Yes and this is a tricky part. Until you explained it in your
latest message, I wasn't sure how we can distinguish a concurrent update from
a page header corruption. Now I agree that if the page LSN was updated and
increased between rereads, it is safe enough to conclude that we have
some concurrent load.



  If we can prove that, we don't actually care what
the contents of the page are.  We certainly can't calculate the
checksum on a page we plucked out of shared buffers since we only
calculate the checksum when we go to write the page out.
Good point. I was thinking that we could recalculate the checksum. Or even
save the page without it, as we have checked the LSN and know for sure that it
will be rewritten by WAL replay.



To sum up, I agree with your proposal to reread the page and rely on 
ascending LSNs. Can you submit a patch?

You can write it on top of the latest attachment in this thread:
v8-master-0001-Fix-page-verifications-in-base-backups.patch
from this message
https://www.postgresql.org/message-id/20201030023028.gc1...@paquier.xyz


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Online verification of checksums

2020-11-23 Thread Stephen Frost
Greetings,

* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> On 21.11.2020 04:30, Michael Paquier wrote:
> >The only method I can think as being really
> >reliable is based on two facts:
> >- Do a check only on pd_checksums, as that validates the full contents
> >of the page.
> >- When doing a retry, make sure that there is no concurrent I/O
> >activity in the shared buffers.  This requires an API we don't have
> >yet.
> 
> It seems reasonable to me to rely on checksums only.
> 
> As for retry, I think that API for concurrent I/O will be complicated.
> Instead, we can introduce a function to read the page directly from shared
> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> solution to me. Do you see any possible problems with it?

We might end up reading pages back in that have been evicted, for one
thing, which doesn't seem great, and this also seems likely to be
awkward for cases which aren't using the replication protocol, unless
every process maintains a connection to PG the entire time, which also
doesn't seem great.

Also- what is the point of reading the page from shared buffers
anyway..?  All we need to do is prove that the page will be rewritten
during WAL replay.  If we can prove that, we don't actually care what
the contents of the page are.  We certainly can't calculate the
checksum on a page we plucked out of shared buffers since we only
calculate the checksum when we go to write the page out.

Thanks,

Stephen




Re: Online verification of checksums

2020-11-23 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> > David Steele (da...@pgmasters.net) wrote:
> >> Our current plan for pgBackRest:
> >> 
> >> 1) Remove the LSN check as you have done in your patch and when rechecking
> >> see if the page has become valid *or* the LSN is ascending.
> >> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> >> is valid.
> > 
> > Yup, that's my recollection also as to our plans for how to improve
> > things here.
> >
> >> These do completely rule out any type of corruption, but they certainly
> >> narrows the possibility by a lot.
> > 
> > *don't :)
> 
> Have you considered the possibility of only using pd_checksums for the
> validation?  This is the only source of truth in the page header we
> can rely on to validate the full contents of the page, so if the logic
> relies on anything but the checksum then you expose the logic to risks
> of reporting pages as corrupted while they were just torn, or just
> miss corrupted pages, which is what we should avoid for such things.
> Both are bad.

There's no doubt that you'll get checksum failures from time to time,
and that it's an entirely valid case if the page is being concurrently
written, so we have to decide if we should be reporting those failures,
retrying, or what.

It's not at all clear what you're suggesting here as to how you can use
'only' the checksum.

> >> As for your patch, it mostly looks good but my objection is that a page may
> >> be reported as invalid after 5 retries when in fact it may just be very 
> >> hot.
> > 
> > Yeah.. while unlikely that it'd actually get written out that much, it
> > does seem at least possible.
> > 
> >> Maybe checking for an ascending LSN is a good idea there as well? At least
> >> in that case we could issue a different warning, instead of "checksum
> >> verification failed" perhaps "checksum verification skipped due to
> >> concurrent modifications".
> > 
> > +1.
> 
> I don't quite understand how you can make sure that the page is not
> corrupted here?  It could be possible that the last 4kB of a 8kB page
> got corrupted, where the header had valid data but failing the
> checksum verification.

Not sure that the proposed approach was really understood here.
Specifically what we're talking about is:

- read(), save the LSN seen
- calculate the checksum - get a failure
- re-read(), compare LSN to prior LSN, maybe also re-check checksum

If checksum fails again AND the LSN has changed and increased (and
perhaps otherwise seems reasonable) then we have at least a bit more
confidence that the failing checksum is due to the page being rewritten
concurrently and not due to latent storage corruption, which is the
specific distinction that we're trying to discern here.

> So if you are not careful you could have at
> hand a corrupted page discarded because of it failed the retry
> multiple times in a row.

The point of checking for an ascending LSN is to see if the page is
being concurrently modified.  If it is, then we actually don't care if
the page is corrupted because it's going to be rewritten during WAL
replay as part of the restore process.

> The only method I can think as being really
> reliable is based on two facts:
> - Do a check only on pd_checksums, as that validates the full contents
> of the page.
> - When doing a retry, make sure that there is no concurrent I/O
> activity in the shared buffers.  This requires an API we don't have
> yet.

I don't think we actually want the backup process to start locking
pages, which it seems like is what you're suggesting here..?  Trying to
do a check without a lock and without having PG end up reading the page
back in if it had been evicted due to pressure seems likely to be hard
to do reliably and without race conditions complicating things.

The other 100% reliable approach, as David discussed before, is to be
scanning the WAL at the same time and to ignore any checksum failures
for pages that we know are in the WAL with FPIs.  Unfortunately, reading
WAL for all different versions of PG is a fair bit of work and we
haven't quite gotten to biting that off yet (though it's on the
roadmap), and the core code certainly doesn't help us in that regard
since any given version only supports the current major version WAL (an
issue pg_basebackup would also have to deal with, were it to be
modified to use such an approach and to continue working with older
versions of PG..).  In a similar vein to what we do (in pgbackrest) with
pg_control, we expect to develop our own library basically vendorizing
WAL reading code from all the major versions of PG which we support in
order to track FPIs, restore points, all the kinds of potential recovery
targets, and other useful information.
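
At a high level, that WAL cross-check might look like the following sketch in
Python (wal_fpi_blocks() is a hypothetical reader yielding (relfilenode,
blockno) pairs that have full-page images in the backup's WAL):

    def confirmed_failures(checksum_failures, wal_segments):
        # Collect every block that WAL replay will overwrite with an FPI.
        fpi = set()
        for seg in wal_segments:
            fpi.update(wal_fpi_blocks(seg))
        # A checksum failure is harmless if restore rewrites the page anyway.
        return [(rel, blk) for rel, blk in checksum_failures
                if (rel, blk) not in fpi]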

Thanks,

Stephen




Re: Online verification of checksums

2020-11-23 Thread Anastasia Lubennikova

On 21.11.2020 04:30, Michael Paquier wrote:

The only method I can think as being really
reliable is based on two facts:
- Do a check only on pd_checksums, as that validates the full contents
of the page.
- When doing a retry, make sure that there is no concurrent I/O
activity in the shared buffers.  This requires an API we don't have
yet.


It seems reasonable to me to rely on checksums only.

As for retry, I think that API for concurrent I/O will be complicated. 
Instead, we can introduce a function to read the page directly from 
shared buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a 
bullet-proof solution to me. Do you see any possible problems with it?


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Online verification of checksums

2020-11-20 Thread Michael Paquier
On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> David Steele (da...@pgmasters.net) wrote:
>> Our current plan for pgBackRest:
>> 
>> 1) Remove the LSN check as you have done in your patch and when rechecking
>> see if the page has become valid *or* the LSN is ascending.
>> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
>> is valid.
> 
> Yup, that's my recollection also as to our plans for how to improve
> things here.
>
>> These do completely rule out any type of corruption, but they certainly
>> narrow the possibility by a lot.
> 
> *don't :)

Have you considered the possibility of only using pd_checksum for the
validation?  This is the only source of truth in the page header we
can rely on to validate the full contents of the page, so if the logic
relies on anything but the checksum then you expose the logic to the
risk of reporting pages as corrupted when they were just torn, or of
missing corrupted pages, which is what we should avoid for such things.
Both are bad.

>> As for your patch, it mostly looks good but my objection is that a page may
>> be reported as invalid after 5 retries when in fact it may just be very hot.
> 
> Yeah.. while unlikely that it'd actually get written out that much, it
> does seem at least possible.
> 
>> Maybe checking for an ascending LSN is a good idea there as well? At least
>> in that case we could issue a different warning, instead of "checksum
>> verification failed" perhaps "checksum verification skipped due to
>> concurrent modifications".
> 
> +1.

I don't quite understand how you can make sure that the page is not
corrupted here?  It could be possible that the last 4kB of an 8kB page
got corrupted, where the header had valid data but the checksum
verification failed.  So if you are not careful you could end up
discarding a corrupted page because it failed the retry multiple times
in a row.  The only method I can think of as being really reliable is
based on two facts:
- Do a check only on pd_checksum, as that validates the full contents
of the page.
- When doing a retry, make sure that there is no concurrent I/O
activity in the shared buffers.  This requires an API we don't have
yet.
--
Michael




Re: Online verification of checksums

2020-11-20 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 11/20/20 2:28 AM, Michael Paquier wrote:
> >On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> >>I was referring to the latest patch on the thread. But as I said, I have
> >>not read up on all the different issues raised in the thread, so take it
> >>with a big grain of salt.
> >>
> >>And I would also echo the previous comment that this code was adapted from
> >>what the pgbackrest folks do. As such, it would be good to get a comment
> >>from for example David on that -- I don't see any of them having commented
> >>after that was mentioned?
> >
> >Agreed.  I am adding Stephen as well in CC.  From the code of
> >backrest, the same logic happens in src/command/backup/pageChecksum.c
> >(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> >happen before verifying the checksum.  So, if the page header finishes
> >with random junk because of some kind of corruption, even corrupted
> >pages would be incorrectly considered as correct if the random data
> >passes the pd_upper and pd_lsn checks :/
> 
> Indeed, this is not good, as Andres pointed out some time ago. My apologies
> for not getting to this sooner.

Yeah, it's been on our backlog to improve this.

> Our current plan for pgBackRest:
> 
> 1) Remove the LSN check as you have done in your patch and when rechecking
> see if the page has become valid *or* the LSN is ascending.
> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> is valid.

Yup, that's my recollection also as to our plans for how to improve
things here.

> These do completely rule out any type of corruption, but they certainly
> narrow the possibility by a lot.

*don't :)

> In the future we would also like to scan the WAL to verify that the page is
> definitely being written to.

Yeah, that'd certainly be nice to do too.

> As for your patch, it mostly looks good but my objection is that a page may
> be reported as invalid after 5 retries when in fact it may just be very hot.

Yeah.. while unlikely that it'd actually get written out that much, it
does seem at least possible.

> Maybe checking for an ascending LSN is a good idea there as well? At least
> in that case we could issue a different warning, instead of "checksum
> verification failed" perhaps "checksum verification skipped due to
> concurrent modifications".

+1.

Thanks,

Stephen




Re: Online verification of checksums

2020-11-20 Thread David Steele

Hi Michael,

On 11/20/20 2:28 AM, Michael Paquier wrote:

On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:

I was referring to the latest patch on the thread. But as I said, I have
not read up on all the different issues raised in the thread, so take it
with a big grain of salt.

And I would also echo the previous comment that this code was adapted from
what the pgbackrest folks do. As such, it would be good to get a comment
from for example David on that -- I don't see any of them having commented
after that was mentioned?


Agreed.  I am adding Stephen as well in CC.  From the code of
backrest, the same logic happens in src/command/backup/pageChecksum.c
(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
happen before verifying the checksum.  So, if the page header finishes
with random junk because of some kind of corruption, even corrupted
pages would be incorrectly considered as correct if the random data
passes the pd_upper and pd_lsn checks :/


Indeed, this is not good, as Andres pointed out some time ago. My 
apologies for not getting to this sooner.


Our current plan for pgBackRest:

1) Remove the LSN check as you have done in your patch and when 
rechecking see if the page has become valid *or* the LSN is ascending.
2) Check the LSN against the max LSN reported by PostgreSQL to make sure 
it is valid (a rough sketch of this check is below).
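
A rough sketch of check (2), with hypothetical names (max_lsn would be
obtained from the server, e.g. via pg_current_wal_insert_lsn() on a
primary):

	/*
	 * A page LSN outside [backup start, current insert position] is
	 * implausible and points at header corruption rather than at a
	 * concurrent write.
	 */
	if (page_lsn < backup_start_lsn || page_lsn > max_lsn)
		report_invalid_page(file_name, blkno);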


These do completely rule out any type of corruption, but they certainly 
narrow the possibility by a lot.


In the future we would also like to scan the WAL to verify that the page 
is definitely being written to.


As for your patch, it mostly looks good but my objection is that a page 
may be reported as invalid after 5 retries when in fact it may just be 
very hot.


Maybe checking for an ascending LSN is a good idea there as well? At 
least in that case we could issue a different warning, instead of 
"checksum verification failed" perhaps "checksum verification skipped 
due to concurrent modifications".


Regards,
--
-David
da...@pgmasters.net




Re: Online verification of checksums

2020-11-19 Thread Michael Paquier
On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> I was referring to the latest patch on the thread. But as I said, I have
> not read up on all the different issues raised in the thread, so take it
> with a big grain of salt.
> 
> And I would also echo the previous comment that this code was adapted from
> what the pgbackrest folks do. As such, it would be good to get a comment
> from for example David on that -- I don't see any of them having commented
> after that was mentioned?

Agreed.  I am adding Stephen as well in CC.  From the code of
backrest, the same logic happens in src/command/backup/pageChecksum.c
(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
happen before verifying the checksum.  So, if the page header finishes
with random junk because of some kind of corruption, even corrupted
pages would be incorrectly considered as correct if the random data
passes the pd_upper and pd_lsn checks :/
--
Michael




Re: Online verification of checksums

2020-11-16 Thread Magnus Hagander
On Mon, Nov 16, 2020 at 1:23 AM Michael Paquier  wrote:

> On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
> > On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier 
> wrote:
> >> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> >>> I was referring to the patch I sent on this thread that fixes the
> >>> detection of a corruption for the zero-only case and where pd_lsn
> >>> and/or pd_upper are trashed by a corruption of the page header.  Both
> >>> cases allow a base backup to complete on HEAD, while sending pages
> >>> that could be corrupted, which is wrong.  Once you make the page
> >>> verification rely only on pd_checksum, as the patch does because the
> >>> checksum is the only source of truth in the page header, corrupted
> >>> pages are correctly detected, causing pg_basebackup to complain as it
> >>> should.  However, it also has the risk of causing pg_basebackup to fail
> >>> *and* to report as broken pages that are in the process of being
> >>> written, depending on how slowly a disk is able to finish an 8kB write.
> >>> That's a different kind of wrongness, and users have two more reasons
> >>> to be pissed.  Note that if a page is found to be torn we have a
> >>> consistent page header, meaning that on HEAD the PageIsNew() and
> >>> PageGetLSN() checks would pass, but the checksum verification would
> >>> fail as the contents at the end of the page do not match the checksum.
> >>
> >> Magnus, as the original committer of 4eb77d5, do you have an opinion
> >> to share?
> >>
> >
> > I admit that I at some point lost track of the overlapping threads around
> > this, and just figured there was enough different
> checksum-involved-people
> > on those threads to handle it :) Meaning the short answer is "no, I don't
> > really have one at this point".
> >
> > Slightly longer comment is that it does seem reasonable, but I have not
> > read up on all the different issues discussed over the whole thread, so
> > take that as a weak-certainty comment.
>
> Which part are you considering as reasonable?  The removal-feature
> part on a stable branch or perhaps something else?
>

I was referring to the latest patch on the thread. But as I said, I have
not read up on all the different issues raised in the thread, so take it
with a big grain of salt.

And I would also echo the previous comment that this code was adapted from
what the pgbackrest folks do. As such, it would be good to get a comment
from for example David on that -- I don't see any of them having commented
after that was mentioned?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Online verification of checksums

2020-11-15 Thread Michael Paquier
On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
> On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier  wrote:
>> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>>> I was referring to the patch I sent on this thread that fixes the
>>> detection of a corruption for the zero-only case and where pd_lsn
>>> and/or pd_upper are trashed by a corruption of the page header.  Both
>>> cases allow a base backup to complete on HEAD, while sending pages
>>> that could be corrupted, which is wrong.  Once you make the page
>>> verification rely only on pd_checksum, as the patch does because the
>>> checksum is the only source of truth in the page header, corrupted
>>> pages are correctly detected, causing pg_basebackup to complain as it
>>> should.  However, it also has the risk of causing pg_basebackup to fail
>>> *and* to report as broken pages that are in the process of being
>>> written, depending on how slowly a disk is able to finish an 8kB write.
>>> That's a different kind of wrongness, and users have two more reasons
>>> to be pissed.  Note that if a page is found to be torn we have a
>>> consistent page header, meaning that on HEAD the PageIsNew() and
>>> PageGetLSN() checks would pass, but the checksum verification would
>>> fail as the contents at the end of the page do not match the checksum.
>>
>> Magnus, as the original committer of 4eb77d5, do you have an opinion
>> to share?
>>
> 
> I admit that I at some point lost track of the overlapping threads around
> this, and just figured there was enough different checksum-involved-people
> on those threads to handle it :) Meaning the short answer is "no, I don't
> really have one at this point".
> 
> Slightly longer comment is that it does seem reasonable, but I have not
> read up on all the different issues discussed over the whole thread, so
> take that as a weak-certainty comment.

Which part are you considering as reasonable?  The removal-feature
part on a stable branch or perhaps something else?
--
Michael




Re: Online verification of checksums

2020-11-15 Thread Magnus Hagander
On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier  wrote:

> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> > I was referring to the patch I sent on this thread that fixes the
> > detection of a corruption for the zero-only case and where pd_lsn
> > and/or pd_upper are trashed by a corruption of the page header.  Both
> > cases allow a base backup to complete on HEAD, while sending pages
> > that could be corrupted, which is wrong.  Once you make the page
> > verification rely only on pd_checksum, as the patch does because the
> > checksum is the only source of truth in the page header, corrupted
> > pages are correctly detected, causing pg_basebackup to complain as it
> > should.  However, it also has the risk of causing pg_basebackup to fail
> > *and* to report as broken pages that are in the process of being
> > written, depending on how slowly a disk is able to finish an 8kB write.
> > That's a different kind of wrongness, and users have two more reasons
> > to be pissed.  Note that if a page is found to be torn we have a
> > consistent page header, meaning that on HEAD the PageIsNew() and
> > PageGetLSN() checks would pass, but the checksum verification would
> > fail as the contents at the end of the page do not match the checksum.
>
> Magnus, as the original committer of 4eb77d5, do you have an opinion
> to share?
>

I admit that I at some point lost track of the overlapping threads around
this, and just figured there was enough different checksum-involved-people
on those threads to handle it :) Meaning the short answer is "no, I don't
really have one at this point".

Slightly longer comment is that it does seem reasonable, but I have not
read up on all the different issues discussed over the whole thread, so
take that as a weak-certainty comment.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Online verification of checksums

2020-11-09 Thread Michael Paquier
On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> I was referring to the patch I sent on this thread that fixes the
> detection of a corruption for the zero-only case and where pd_lsn
> and/or pd_upper are trashed by a corruption of the page header.  Both
> cases allow a base backup to complete on HEAD, while sending pages
> that could be corrupted, which is wrong.  Once you make the page
> verification rely only on pd_checksum, as the patch does because the
> checksum is the only source of truth in the page header, corrupted
> pages are correctly detected, causing pg_basebackup to complain as it
> should.  However, it also has the risk of causing pg_basebackup to fail
> *and* to report as broken pages that are in the process of being
> written, depending on how slowly a disk is able to finish an 8kB write.
> That's a different kind of wrongness, and users have two more reasons
> to be pissed.  Note that if a page is found to be torn we have a
> consistent page header, meaning that on HEAD the PageIsNew() and
> PageGetLSN() checks would pass, but the checksum verification would
> fail as the contents at the end of the page do not match the checksum.

Magnus, as the original committer of 4eb77d5, do you have an opinion
to share?
--
Michael




Re: Online verification of checksums

2020-11-04 Thread Michael Paquier
On Wed, Nov 04, 2020 at 05:41:39PM +0100, Michael Banck wrote:
> Am Mittwoch, den 04.11.2020, 17:48 +0900 schrieb Michael Paquier:
>> So, I have done much more testing of this patch using an instance with
>> a small shared buffer pool and pgbench running in parallel for having
>> a large eviction rate, and I cannot convince myself to do that.  My
>> laptop got easily constrained on I/O, and within a total of 2000 base
>> backups or so, I have seen some 5 backup failures with a correct
>> detection logic.  
> 
> I don't quite understand what you mean here: how do the base backups
> fail,

As of basebackup.c, on HEAD:
    if (total_checksum_failures)
    {
        if (total_checksum_failures > 1)
            ereport(WARNING,
                    (errmsg_plural("%lld total checksum verification failure",
                                   "%lld total checksum verification failures",
                                   total_checksum_failures,
                                   total_checksum_failures)));

        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("checksum verification failure during base backup")));
    }

This means that when at least one page verification fails,
pg_basebackup would fail.

> and what exactly is "correct detection logic"?

I was referring to the patch I sent on this thread that fixes the
detection of a corruption for the zero-only case and where pd_lsn
and/or pd_upper are trashed by a corruption of the page header.  Both
cases allow a base backup to complete on HEAD, while sending pages
that could be corrupted, which is wrong.  Once you make the page
verification rely only on pd_checksum, as the patch does because the
checksum is the only source of truth in the page header, corrupted
pages are correctly detected, causing pg_basebackup to complain as it
should.  However, it also has the risk of causing pg_basebackup to fail
*and* to report as broken pages that are in the process of being
written, depending on how slowly a disk is able to finish an 8kB write.
That's a different kind of wrongness, and users have two more reasons
to be pissed.  Note that if a page is found to be torn we have a
consistent page header, meaning that on HEAD the PageIsNew() and
PageGetLSN() checks would pass, but the checksum verification would
fail as the contents at the end of the page do not match the checksum.
--
Michael




Re: Online verification of checksums

2020-11-04 Thread Michael Banck
Hi,

Am Mittwoch, den 04.11.2020, 17:48 +0900 schrieb Michael Paquier:
> On Fri, Oct 30, 2020 at 11:30:28AM +0900, Michael Paquier wrote:
> > Playing with dd and generating random pages, this detects random
> > corruptions, making use of a wait/retry loop if a failure is detected.
> > As mentioned upthread, this is a double-edged sword: increasing the
> > number of retries reduces the chances of false positives, at the cost
> > of making regression tests longer.  This stuff uses up to 5 retries
> > with 100ms of sleep for each page.  (I am aware of the fact that the
> > commit message of the main patch is not written yet).
> 
> So, I have done much more testing of this patch using an instance with
> a small shared buffer pool and pgbench running in parallel for having
> a large eviction rate, and I cannot convince myself to do that.  My
> laptop got easily constrained on I/O, and within a total of 2000 base
> backups or so, I have seen some 5 backup failures with a correct
> detection logic.  

I don't quite understand what you mean here: how do the base backups
fail, and what exactly is "correct detection logic"?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de






Re: Online verification of checksums

2020-11-04 Thread Michael Paquier
On Fri, Oct 30, 2020 at 11:30:28AM +0900, Michael Paquier wrote:
> Playing with dd and generating random pages, this detects random
> corruptions, making use of a wait/retry loop if a failure is detected.
> As mentioned upthread, this is a double-edged sword: increasing the
> number of retries reduces the chances of false positives, at the cost
> of making regression tests longer.  This stuff uses up to 5 retries
> with 100ms of sleep for each page.  (I am aware of the fact that the
> commit message of the main patch is not written yet).

So, I have done much more testing of this patch using an instance with
a small shared buffer pool and pgbench running in parallel for having
a large eviction rate, and I cannot convince myself to do that.  My
laptop got easily constrained on I/O, and within a total of 2000 base
backups or so, I have seen some 5 backup failures with a correct
detection logic.  The rate is low here, but that could be annoying for
users even at 1~2%.  Couldn't we take a different approach and remove
this feature instead?  This still requires the grammar to be present
in back-branches, but as things stand, we have a feature that fails
its promise, and that also needlessly eats resources for each base
backup taken :/
--
Michael




Re: Online verification of checksums

2020-10-29 Thread Michael Paquier
On Thu, Oct 22, 2020 at 10:41:53AM +0900, Michael Paquier wrote:
> We cannot trust the fields of the page header because these may
> have been messed up with some random corruption, so what really
> matters is that the checksums don't match, and that we can just rely
> on that.  The zero-only case of a page is different because these
> don't have a checksum set, so I would finish with something like the
> attached to make the detection more robust.  This does not make the
> detection perfect as there is no locking assurance (we really need to
> do that but v13 has been released already), but with a sufficient
> number of retries this can make things much more reliable than what's
> present.
> 
> Are there any comments?  Anybody?

So, hearing nothing, attached is a set of patches that I would like to
apply to 11~ to address the set of issues of this thread.  This comes
with two parts:
- Some refactoring of PageIsVerified(), similar to d401c57 on HEAD
except that this keeps ABI compatibility.
- The actual patch, with tweaks for each stable branch.

Playing with dd and generating random pages, this detects random
corruptions, making use of a wait/retry loop if a failure is detected.
As mentioned upthread, this is a double-edged sword: increasing the
number of retries reduces the chances of false positives, at the cost
of making regression tests longer.  This stuff uses up to 5 retries
with 100ms of sleep for each page.  (I am aware of the fact that the
commit message of the main patch is not written yet).
--
Michael
From 5413343de74a77f0257619d8599523dc7f0a01be Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 30 Oct 2020 10:18:34 +0900
Subject: [PATCH v8] Fix page verifications in base backups

---
 src/backend/replication/basebackup.c | 148 ++-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |  20 ++-
 2 files changed, 96 insertions(+), 72 deletions(-)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..f8d0fc041d 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1569,21 +1569,24 @@ sendFile(const char *readfilename, const char *tarfilename,
 {
 	int			fd;
 	BlockNumber blkno = 0;
-	bool		block_retry = false;
+	int			block_attempts = 0;
 	char		buf[TAR_SEND_SIZE];
-	uint16		checksum;
 	int			checksum_failures = 0;
 	off_t		cnt;
 	int			i;
 	pgoff_t		len = 0;
 	char	   *page;
 	size_t		pad;
-	PageHeader	phdr;
 	int			segmentno = 0;
 	char	   *segmentpath;
 	bool		verify_checksum = false;
 	pg_checksum_context checksum_ctx;
 
+	/* Maximum number of checksum failures reported for this file */
+#define CHECKSUM_REPORT_THRESHOLD		5
+	/* maximum number of retries done for a single page */
+#define PAGE_RETRY_THRESHOLD			5
+
	pg_checksum_init(&checksum_ctx, manifest->checksum_type);
 
 	fd = OpenTransientFile(readfilename, O_RDONLY | PG_BINARY);
@@ -1661,9 +1664,7 @@ sendFile(const char *readfilename, const char *tarfilename,
 		if (verify_checksum && (cnt % BLCKSZ != 0))
 		{
 			ereport(WARNING,
-	(errmsg("could not verify checksum in file \"%s\", block "
-			"%d: read buffer size %d and page size %d "
-			"differ",
+	(errmsg("could not verify checksum in file \"%s\", block %u: read buffer size %d and page size %d differ",
 			readfilename, blkno, (int) cnt, BLCKSZ)));
 			verify_checksum = false;
 		}
@@ -1674,79 +1675,84 @@ sendFile(const char *readfilename, const char *tarfilename,
 			{
 page = buf + BLCKSZ * i;
 
+elog(DEBUG1, "checking block %u of file %s, attempt %d",
+	 blkno, readfilename, block_attempts);
+
 /*
- * Only check pages which have not been modified since the
- * start of the base backup. Otherwise, they might have been
- * written only halfway and the checksum would not be valid.
- * However, replaying WAL would reinstate the correct page in
- * this case. We also skip completely new pages, since they
- * don't have a checksum yet.
+ * Check on-disk pages with the same set of verification
+ * conditions used before loading pages into shared buffers.
+ * Note that PageIsVerifiedExtended() considers pages filled
+ * only with zeros as valid, since they don't have a checksum
+ * yet.  Failures are not reported to pgstat yet, as these are
+ * reported in a single batch once a file is completed.  It
+ * could be possible, that a page is written halfway while
+ * doing this check, causing a false positive.  If that
+ * happens, a page is retried multiple times, with an error
+ * reported if the second attempt also fails.
  */
-if (!PageIsNew(page) && PageGetLSN(page) < startptr)
+if (PageIsVerifiedExtended(page, blkno, 0))
 {
-	checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
-	phdr = (PageHeader) page;
-	if (phdr->pd_checksum != checksum)
+	/* success, move on to the next block */
+	blkno++;
+	

Re: Online verification of checksums

2020-10-21 Thread Michael Paquier
On Wed, Oct 21, 2020 at 07:10:34PM +0900, Michael Paquier wrote:
> My guess is that we should be able to make use of that for base
> backups as well, but I also think that I'd rather let v13 go with more
> retries without depending on a new API layer, removing the LSN
> check altogether.  Thinking of it, that's actually roughly what I
> posted here, but without the PageGetLSN() bit in the refactored code.
> So I see a pretty good argument to address the stable branches with
> that, and study for the future a better API to govern them all:
> https://www.postgresql.org/message-id/20201020062432.ga30...@paquier.xyz

So, I was sleeping on this one, and I could not find a reason why we
should not address both the zero case and the random data case at the
same time, as mentioned here:
https://www.postgresql.org/message-id/20201022012519.gf1...@paquier.xyz

We cannot trust the fields of the page header because these may
have been messed up with some random corruption, so what really
matters is that the checksums don't match, and that we can just rely
on that.  The zero-only case of a page is different because these
don't have a checksum set, so I would finish with something like the
attached to make the detection more robust.  This does not make the
detection perfect as there is no locking assurance (we really need to
do that but v13 has been released already), but with a sufficient
number of retries this can make things much more reliable than what's
present.

Are there any comments?  Anybody?
--
Michael
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 51b8f994ac..1a28ee1130 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -404,26 +404,33 @@ do { \
  *		extern declarations
  * 
  */
+
+/* flags for PageAddItemExtended() */
 #define PAI_OVERWRITE			(1 << 0)
 #define PAI_IS_HEAP             (1 << 1)
 
+/* flags for PageIsVerifiedExtended() */
+#define PIV_LOG_WARNING			(1 << 0)
+#define PIV_REPORT_STAT			(1 << 1)
+
 #define PageAddItem(page, item, size, offsetNumber, overwrite, is_heap) \
 	PageAddItemExtended(page, item, size, offsetNumber, \
 		((overwrite) ? PAI_OVERWRITE : 0) | \
 		((is_heap) ? PAI_IS_HEAP : 0))
 
 /*
- * Check that BLCKSZ is a multiple of sizeof(size_t).  In PageIsVerified(),
- * it is much faster to check if a page is full of zeroes using the native
- * word size.  Note that this assertion is kept within a header to make
- * sure that StaticAssertDecl() works across various combinations of
- * platforms and compilers.
+ * Check that BLCKSZ is a multiple of sizeof(size_t).  In
+ * PageIsVerifiedExtended(), it is much faster to check if a page is
+ * full of zeroes using the native word size.  Note that this assertion
+ * is kept within a header to make sure that StaticAssertDecl() works
+ * across various combinations of platforms and compilers.
  */
 StaticAssertDecl(BLCKSZ == ((BLCKSZ / sizeof(size_t)) * sizeof(size_t)),
  "BLCKSZ has to be a multiple of sizeof(size_t)");
 
 extern void PageInit(Page page, Size pageSize, Size specialSize);
 extern bool PageIsVerified(Page page, BlockNumber blkno);
+extern bool PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags);
 extern OffsetNumber PageAddItemExtended(Page page, Item item, Size size,
 		OffsetNumber offsetNumber, int flags);
 extern Page PageGetTempPage(Page page);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index dbbd3aa31f..d538f25726 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -443,7 +443,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
 		smgrread(src, forkNum, blkno, buf.data);
 
-		if (!PageIsVerified(page, blkno))
+		if (!PageIsVerifiedExtended(page, blkno,
+	PIV_LOG_WARNING | PIV_REPORT_STAT))
 			ereport(ERROR,
 	(errcode(ERRCODE_DATA_CORRUPTED),
 	 errmsg("invalid page in block %u of relation %s",
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..6f717c8956 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1569,21 +1569,24 @@ sendFile(const char *readfilename, const char *tarfilename,
 {
 	int			fd;
 	BlockNumber blkno = 0;
-	bool		block_retry = false;
+	int			block_attempts = 0;
 	char		buf[TAR_SEND_SIZE];
-	uint16		checksum;
 	int			checksum_failures = 0;
 	off_t		cnt;
 	int			i;
 	pgoff_t		len = 0;
 	char	   *page;
 	size_t		pad;
-	PageHeader	phdr;
 	int			segmentno = 0;
 	char	   *segmentpath;
 	bool		verify_checksum = false;
 	pg_checksum_context checksum_ctx;
 
+	/* Maximum number of checksum failures reported for this file */
+#define CHECKSUM_REPORT_THRESHOLD		5
+	/* maximum number of retries done for a single page */
+#define PAGE_RETRY_THRESHOLD			5
+
	pg_checksum_init(&checksum_ctx, manifest->checksum_type);
 
 	fd = OpenTransientFile(readfilename, O_RDONLY | 

Re: Online verification of checksums

2020-10-21 Thread Michael Paquier
On Wed, Oct 21, 2020 at 12:00:23PM +0200, Michael Banck wrote:
> The check was ported (or the concept of it adapted) from pgBackRest if I
> remember correctly.

Okay, I did not know that.

>> As things stand, I'd like to think that it would be much more useful
>> to remove this check and to have one or two extra retries (the current
>> code only has one).  I don't like much the possibility of false
>> positives for such critical checks, but as we need to live with what
>> has been released, that looks like a good move for stable branches.
> 
> Sounds good to me. I think some were advocating for locking the page
> before re-reading. When I looked at it, the level of abstraction that
> pg_basebackup has (just a list of files chopped up into blocks, no
> notion of relations I think) made that non-trivial, but maybe still
> possible for v14 and beyond.

That's an API layer I was looking at here:
https://www.postgresql.org/message-id/flat/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5q0y+wulml5sr...@mail.gmail.com

My guess is that we should be able to make use of that for base
backups as well, but I also think that I'd rather let v13 go with more
retries without depending on a new API layer, removing the LSN
check altogether.  Thinking of it, that's actually roughly what I
posted here, but without the PageGetLSN() bit in the refactored code.
So I see a pretty good argument to address the stable branches with
that, and study for the future a better API to govern them all:
https://www.postgresql.org/message-id/20201020062432.ga30...@paquier.xyz
--
Michael




Re: Online verification of checksums

2020-10-21 Thread Michael Banck
Hi,

Am Dienstag, den 20.10.2020, 18:11 +0900 schrieb Michael Paquier:
> On Mon, Apr 06, 2020 at 04:45:44PM -0400, Tom Lane wrote:
> > Actually, after thinking about that a bit more: why is there an LSN-based
> > special condition at all?  It seems like it'd be far more useful to
> > checksum everything, and on failure try to re-read and re-verify the page
> > once or twice, so as to handle the corner case where we examine a page
> > that's in process of being overwritten.
> 
> I was reviewing this area today, and that actually matches my
> impression.  Why do we need a LSN-based check at all?  As said
> upthread, that's of course weak with random data as we would miss most
> of the real checksum failures, with the odds getting better as the
> current LSN of the cluster moves forward.  However, it seems to me
> that we would have an extra advantage in removing this check
> altogether: it would be possible to check pages even if these
> are more recent than the start LSN of the backup, and that could be a
> lot of pages that could be checked on a large cluster.  So by keeping
> this check we also delay the detection of real problems.

The check was ported (or the concept of it adapted) from pgBackRest if I
remember correctly.

> As things stand, I'd like to think that it would be much more useful
> to remove this check and to have one or two extra retries (the current
> code only has one).  I don't like much the possibility of false
> positives for such critical checks, but as we need to live with what
> has been released, that looks like a good move for stable branches.

Sounds good to me. I think some were advocating for locking the page
before re-reading. When I looked at it, the level of abstraction that
pg_basebackup has (just a list of files chopped up into blocks, no
notion of relations I think) made that non-trivial, but maybe still
possible for v14 and beyond.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de






Re: Online verification of checksums

2020-10-20 Thread Michael Paquier
On Mon, Apr 06, 2020 at 04:45:44PM -0400, Tom Lane wrote:
> Actually, after thinking about that a bit more: why is there an LSN-based
> special condition at all?  It seems like it'd be far more useful to
> checksum everything, and on failure try to re-read and re-verify the page
> once or twice, so as to handle the corner case where we examine a page
> that's in process of being overwritten.

I was reviewing this area today, and that actually matches my
impression.  Why do we need a LSN-based check at all?  As said
upthread, that's of course weak with random data as we would miss most
of the real checksum failures, with the odds getting better as the
current LSN of the cluster moves forward.  However, it seems to me
that we would have an extra advantage in removing this check
altogether: it would be possible to check pages even if these
are more recent than the start LSN of the backup, and that could be a
lot of pages that could be checked on a large cluster.  So by keeping
this check we also delay the detection of real problems.  As things
stand, I'd like to think that it would be much more useful to remove
this check and to have one or two extra retries (the current code only
has one).  I don't like much the possibility of false positives for
such critical checks, but as we need to live with what has been
released, that looks like a good move for stable branches.
--
Michael




Re: Online verification of checksums

2020-07-30 Thread Daniel Gustafsson
> On 5 Jul 2020, at 13:52, Daniel Gustafsson  wrote:
> 
>> On 6 Apr 2020, at 23:15, Michael Banck  wrote:
> 
>> Probably we need to take a step back;
> 
> This patch has been Waiting on Author since the last commitfest (and no longer
> applies as well), and by the sounds of the thread there are some open issues
> with it.  Should it be Returned with Feedback to be re-opened with a fresh take
> on it?

Marked as Returned with Feedback, please open a new entry in case there is a
renewed interest with a new patch.

cheers ./daniel



Re: Online verification of checksums

2020-07-05 Thread Daniel Gustafsson
> On 6 Apr 2020, at 23:15, Michael Banck  wrote:

> Probably we need to take a step back;

This patch has been Waiting on Author since the last commitfest (and no longer
applies as well), and by the sounds of the thread there are some open issues
with it.  Should it be Returned with Feedback to be re-opened with a fresh take
on it?

cheers ./daniel



Re: Online verification of checksums

2020-04-06 Thread Michael Banck
Hi,

Am Montag, den 06.04.2020, 16:45 -0400 schrieb Tom Lane:
> I wrote:
> > Another thing that's bothering me is that the patch compares page LSN
> > against GetInsertRecPtr(); but that function says
> > ...
> > I'm not convinced that an approximation is good enough here.  It seems
> > like a page that's just now been updated could have an LSN beyond the
> > current XLOG page start, potentially leading to a false checksum
> > complaint.  Maybe we could address that by adding one xlog page to
> > the GetInsertRecPtr result?  Kind of a hack, but ...

I was about to write that it sounds like a pragmatic solution to me,
but...

> Actually, after thinking about that a bit more: why is there an LSN-based
> special condition at all?  It seems like it'd be far more useful to
> checksum everything, and on failure try to re-read and re-verify the page
> once or twice, so as to handle the corner case where we examine a page
> that's in process of being overwritten.

Andres outlined something about a year ago which on re-reading sounds
similar to what you suggest above in
20190326170820.6sylklg7eh6uh...@alap3.anarazel.de but never posted a
full patch. He seems to have had a few additional checks from PageIsVerified() 
in mind, though.

The original check against the checkpoint LSN wasn't suggested by me;
I've submitted this patch with the InsertRecPtr as an upper bound as a
(presumably) minimally invasive patch which could be back-patched (when
nothing came of the above thread for a while), but the issue seems to be
quite a bit nuanced.

Probably we need to take a step back; the question is whether something
like what Andres suggested should/could be coded up for v13 still
(before the feature freeze) and if so, by whom (I won't have the time),
or whether it would still qualify as a back-patchable bug-fix and/or
whether your suggestion above would.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de






Re: Online verification of checksums

2020-04-06 Thread Tom Lane
I wrote:
> Another thing that's bothering me is that the patch compares page LSN
> against GetInsertRecPtr(); but that function says
> ...
> I'm not convinced that an approximation is good enough here.  It seems
> like a page that's just now been updated could have an LSN beyond the
> current XLOG page start, potentially leading to a false checksum
> complaint.  Maybe we could address that by adding one xlog page to
> the GetInsertRecPtr result?  Kind of a hack, but ...

Actually, after thinking about that a bit more: why is there an LSN-based
special condition at all?  It seems like it'd be far more useful to
checksum everything, and on failure try to re-read and re-verify the page
once or twice, so as to handle the corner case where we examine a page
that's in process of being overwritten.
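
A bare-bones version of that loop might look as follows (sketch only;
read_and_verify_page() is a hypothetical helper, and the retry count
and sleep time are arbitrary):

	bool		ok = false;

	/* initial attempt plus up to two re-reads */
	for (int attempt = 0; attempt < 3; attempt++)
	{
		if (read_and_verify_page(fd, blkno))
		{
			ok = true;
			break;
		}
		pg_usleep(100 * 1000L);	/* let a concurrent write finish */
	}

	if (!ok)
		report_checksum_failure(readfilename, blkno);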

regards, tom lane




Re: Online verification of checksums

2020-04-06 Thread Tom Lane
Michael Banck  writes:
> [ 0001-Fix-checksum-verification-in-base-backups-for-random_V3.patch ]

I noticed that the cfbot wasn't testing this because of a minor merge
conflict.  I rebased it over that, and also readjusted things a little bit
to avoid unnecessarily reindenting existing code, in hopes of making the
patch easier to review.  Doing that reveals that the patch actually
removes a chunk of code, namely a special case for EOF.  Was that
intentional, or a result of a faulty merge earlier?  It certainly isn't
mentioned in your proposed commit message.

Another thing that's bothering me is that the patch compares page LSN
against GetInsertRecPtr(); but that function says

 * NOTE: The value *actually* returned is the position of the last full
 * xlog page. It lags behind the real insert position by at most 1 page.
 * For that, we don't need to scan through WAL insertion locks, and an
 * approximation is enough for the current usage of this function.

I'm not convinced that an approximation is good enough here.  It seems
like a page that's just now been updated could have an LSN beyond the
current XLOG page start, potentially leading to a false checksum
complaint.  Maybe we could address that by adding one xlog page to
the GetInsertRecPtr result?  Kind of a hack, but ...
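
In code form, the hack would amount to something like this (sketch
only; GetInsertRecPtr() and XLOG_BLCKSZ exist, the surrounding logic is
abbreviated):

	/*
	 * Allow for the up-to-one-page lag of GetInsertRecPtr() before
	 * deeming a page LSN bogus.
	 */
	if (PageGetLSN(page) < startptr ||
		PageGetLSN(page) > GetInsertRecPtr() + XLOG_BLCKSZ)
	{
		/* LSN implausible: verify the checksum instead of skipping */
	}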

regards, tom lane

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 5d94b9c..c7ff9a8 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -2028,15 +2028,47 @@ sendFile(const char *readfilename, const char *tarfilename,
 page = buf + BLCKSZ * i;
 
 /*
- * Only check pages which have not been modified since the
- * start of the base backup. Otherwise, they might have been
- * written only halfway and the checksum would not be valid.
- * However, replaying WAL would reinstate the correct page in
- * this case. We also skip completely new pages, since they
- * don't have a checksum yet.
+ * We skip completely new pages after checking they are
+ * all-zero, since they don't have a checksum yet.
  */
-if (!PageIsNew(page) && PageGetLSN(page) < startptr)
+if (PageIsNew(page))
 {
+	if (!PageIsZero(page))
+	{
+		/*
+		 * pd_upper is zero, but the page is not all zero.  We
+		 * cannot run pg_checksum_page() on the page as it
+		 * would throw an assertion failure.  Consider this a
+		 * checksum failure.
+		 */
+		checksum_failures++;
+
+		if (checksum_failures <= 5)
+			ereport(WARNING,
+	(errmsg("checksum verification failed in "
+			"file \"%s\", block %d: pd_upper "
+			"is zero but page is not all-zero",
+			readfilename, blkno)));
+		if (checksum_failures == 5)
+			ereport(WARNING,
+	(errmsg("further checksum verification "
+			"failures in file \"%s\" will not "
+			"be reported", readfilename)));
+	}
+}
+else if (PageGetLSN(page) < startptr ||
+		 PageGetLSN(page) > GetInsertRecPtr())
+{
+	/*
+	 * Only check pages which have not been modified since the
+	 * start of the base backup. Otherwise, they might have
+	 * been written only halfway and the checksum would not be
+	 * valid. However, replaying WAL would reinstate the
+	 * correct page in this case. If the page LSN is larger
+	 * than the current insert pointer then we assume a bogus
+	 * LSN due to random page header corruption and do verify
+	 * the checksum.
+	 */
 	checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
 	phdr = (PageHeader) page;
 	if (phdr->pd_checksum != checksum)
@@ -2064,20 +2096,6 @@ sendFile(const char *readfilename, const char *tarfilename,
 
 			if (fread(buf + BLCKSZ * i, 1, BLCKSZ, fp) != BLCKSZ)
 			{
-/*
- * If we hit end-of-file, a concurrent
- * truncation must have occurred, so break out
- * of this loop just as if the initial fread()
- * returned 0. We'll drop through to the same
- * code that handles that case. (We must fix
- * up cnt first, though.)
- */
-if (feof(fp))
-{
-	cnt = BLCKSZ * i;
-	break;
-}
-
 ereport(ERROR,
 		(errcode_for_file_access(),
 		 errmsg("could not reread block %d of file \"%s\": %m",
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index d708117..2dc8322 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -82,11 +82,8 @@ bool
 PageIsVerified(Page page, BlockNumber blkno)
 {
 	PageHeader	p = (PageHeader) page;
-	size_t	   *pagebytes;
-	int			i;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
-	bool		all_zeroes = false;
 	uint16		checksum = 0;
 
 	/*
@@ -120,18 +117,7 @@ PageIsVerified(Page page, BlockNumber blkno)
 	}
 
 	/* Check 

Re: Online verification of checksums

2020-03-12 Thread Michael Banck
Hi,

thanks for reviewing this patch!

Am Donnerstag, den 27.02.2020, 10:57 + schrieb Asif Rehman:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:not tested
> 
> The patch applies cleanly and works as expected. Just a few minor 
> observations:
> 
> - I would suggest refactoring the PageIsZero function by getting rid of the
> all_zeroes variable and simply returning false when a non-zero byte is
> found, rather than setting all_zeroes to false and breaking the for loop.
> The function should simply return true at the end otherwise.


Good point, I have done so.

> - Remove the empty line:
> +* would throw an assertion failure.  Consider this a
> +* checksum failure.
> +*/
> +
> +   checksum_failures++;

Done

> - Code needs to run through pgindent.

Done.

> Also, I'd suggest making "5" a define within the current file/function,
> perhaps something like "MAX_CHECKSUM_FAILURES".  You could move the second
> warning outside the conditional statement as it appears in both "if" and
> "else" blocks.

Well, I think you have a valid point, but that would be a different (non
bug-fix) patch, as this part is not changed by this patch; the code is at
most moved around, isn't it?

New version attached.


Best regards,

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

From 084a2fe968a3ee9c0bb4f18fa9fbc8a582f38b3c Mon Sep 17 00:00:00 2001
From: Michael Banck 
Date: Thu, 12 Mar 2020 07:31:22 +0100
Subject: [PATCH] Fix checksum verification in base backups for random or zero page headers

We currently do not checksum a page if it is considered new by PageIsNew() or
if its pd_lsn page header field is larger than the LSN at the beginning of the
base backup. However, this means that if the whole page header is zero,
PageIsNew() will consider that page new due to the pd_upper field being zero
and no subsequent checksum verification is done. Also, a random value in the
pd_lsn field will very likely be larger than the LSN at backup start, again
resulting in the page not getting checksummed.

To fix, factor out the all-zeroes check from PageIsVerified() into a new method
PageIsZero() and call that for pages considered new by PageIsNew(). If a page
is all zero, consider that a checksum failure. Also check the pd_lsn field
against both the backup start pointer and the current pointer from
GetInsertRecPtr().

Add two more tests to the pg_basebackup TAP tests to check those errors.

Reported-By: Andres Freund
Reviewed-By: Asif Rehman
Discussion: https://postgr.es/m/20190319200050.ncuxejradurja...@alap3.anarazel.de
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 806d013108..6e4a5a1365 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1526,87 +1526,27 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
 page = buf + BLCKSZ * i;
 
 /*
- * Only check pages which have not been modified since the
- * start of the base backup. Otherwise, they might have been
- * written only halfway and the checksum would not be valid.
- * However, replaying WAL would reinstate the correct page in
- * this case. We also skip completely new pages, since they
- * don't have a checksum yet.
+ * We skip completely new pages after checking they are
+ * all-zero, since they don't have a checksum yet.
  */
-if (!PageIsNew(page) && PageGetLSN(page) < startptr)
+if (PageIsNew(page))
 {
-	checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
-	phdr = (PageHeader) page;
-	if (phdr->pd_checksum != checksum)
+	if (!PageIsZero(page))
 	{
 		/*
-		 * Retry the block on the first failure.  It's
-		 * possible that we read the first 4K page of the
-		 * block just before postgres updated the entire block
-		 * so it ends up looking torn to us.  We only need to
-		 * retry once because the LSN should be updated to
-		 * something we can ignore on the next pass.  If the
-		 * error happens again then it is a true validation
-		 * failure.
+		 * pd_upper is zero, but the page is not all zero.  We
+		 * cannot run pg_checksum_page() on the page as it
+		 * would throw an assertion failure.  Consider this a
+		 * checksum failure.
 		 */
-		

Re: Online verification of checksums

2020-02-27 Thread Asif Rehman
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

The patch applies cleanly and works as expected. Just a few minor observations:

- I would suggest refactoring the PageIsZero function by getting rid of the
all_zeroes variable and simply returning false when a non-zero byte is
found, rather than setting all_zeroes to false and breaking the for loop.
The function should simply return true at the end otherwise.
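
Something along these lines, i.e. a minimal sketch mirroring the
word-at-a-time zero check already used by PageIsVerified():

	static bool
	PageIsZero(Page page)
	{
		size_t	   *pagebytes = (size_t *) page;

		for (int i = 0; i < (int) (BLCKSZ / sizeof(size_t)); i++)
		{
			if (pagebytes[i] != 0)
				return false;	/* bail out on the first non-zero word */
		}

		return true;
	}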

- Remove the empty line:
+* would throw an assertion failure.  Consider this a
+* checksum failure.
+*/
+
+   checksum_failures++;


- Code needs to run through pgindent.

Also, I'd suggest making "5" a define within the current file/function,
perhaps something like "MAX_CHECKSUM_FAILURES".  You could move the second
warning outside the conditional statement as it appears in both "if" and
"else" blocks.


Regards,
--Asif

The new status of this patch is: Waiting on Author


Re: Online verification of checksums

2019-03-30 Thread Andres Freund
Hi,

On 2019-03-30 12:56:21 +0100, Magnus Hagander wrote:
> > ISTM that the fact that we had to teach it about different segment files
> > for checksum verification by splitting up the filename at "." implies
> > that it is not the correct level of abstraction (but maybe it could get
> > schooled some more about Postgres internals, e.g. by passing it a
> > RelFileNode struct and not a filename).
> >
> 
> But that has to be fixed in pg_basebackup *regardless*, doesn't it? And if
> we fix it there, we only have to fix it once...

I'm not understanding the problem here. We already need to know all of
this? sendFile() determines whether the file is checksummed, and
computes the segment number:

    if (is_checksummed_file(readfilename, filename))
    {
        verify_checksum = true;
        ...
        checksum = pg_checksum_page((char *) page,
                                    blkno + segmentno * RELSEG_SIZE);
        phdr = (PageHeader) page;

I agree that the way checksumming works is a bit of a layering
violation.  In my opinion it belongs at the smgr level, not bufmgr.c etc,
so different storage methods can store it differently.  But that seems
fairly independent of this problem.

Greetings,

Andres Freund




Re: Online verification of checksums

2019-03-30 Thread Magnus Hagander
On Fri, Mar 29, 2019 at 10:08 PM Michael Banck 
wrote:

> Hi,
>
> Am Freitag, den 29.03.2019, 16:52 +0100 schrieb Magnus Hagander:
> > On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost 
> wrote:
> > > * Magnus Hagander (mag...@hagander.net) wrote:
> > > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <
> tomas.von...@2ndquadrant.com>
> > > > wrote:
> > > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > > >> I agree that the current patch might have some corner-cases
> where it
> > > > > >> does not guarantee 100% accuracy in online mode, but I hope the
> current
> > > > > >> version at least has no more false negatives.
> > > > > >
> > > > > >False positives are *bad*. We shouldn't integrate code that has
> them.
> > > > >
> > > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the
> online mode
> > > > > communicate with the server, which would presumably address these
> issues.
> > > > > Can someone explain why not to do that?
> > > >
> > > > I agree that this effort seems better spent on fixing those issues
> there
> > > > (of which many are the same), and then re-use that.
> > >
> > > This really seems like it depends on which of the options we're talking
> > > about..   Connecting to the server and asking what the current insert
> > > point is, so we can check that the LSN isn't completely insane, seems
> > > reasonable, but at least one option being discussed was to have
> > > pg_basebackup actually *lock the page* (even if just for I/O..) and
> then
> > > re-read it, and having an external tool doing that instead of the
> > > backend seems like a whole different level to me.  That would involve
> > > having an SQL function for "lock this page against I/O" and then
> another
> > > for "unlock this page", wouldn't it?
> >
> > Right.
> >
> > But what if we just added a flag to the BASE_BACKUP command in the
> > replication protocol that said "meh, I really just want to verify the
> > checksums, so please send the data to devnull and only feed me regular
> > status updates on this connection"?
>
> I don't know whether BASE_BACKUP is the best interface for that (at
> least right now) - backend/replication/basebackup.c's sendFile() gets
> only an absolute filename to send, which is not adequate for more in-
> depth server-based things like locking a particular page in a particular
> relation of some particular tablespace.


> ISTM that the fact that we had to teach it about different segment files
> for checksum verification by splitting up the filename at "." implies
> that it is not the correct level of abstraction (but maybe it could get
> schooled some more about Postgres internals, e.g. by passing it a
> RelFileNode struct and not a filename).
>

But that has to be fixed in pg_basebackup *regardless*, doesn't it? And if
we fix it there, we only have to fix it once...


//Magnus


Re: Online verification of checksums

2019-03-29 Thread Michael Banck
Hi,

Am Freitag, den 29.03.2019, 16:52 +0100 schrieb Magnus Hagander:
> On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost  wrote:
> > * Magnus Hagander (mag...@hagander.net) wrote:
> > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra 
> > > 
> > > wrote:
> > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > >> I agree that the current patch might have some corner-cases where it
> > > > >> does not guarantee 100% accuracy in online mode, but I hope the 
> > > > >> current
> > > > >> version at least has no more false negatives.
> > > > >
> > > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > >
> > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online 
> > > > mode
> > > > communicate with the server, which would presumably address these 
> > > > issues.
> > > > Can someone explain why not to do that?
> > > 
> > > I agree that this effort seems better spent on fixing those issues there
> > > (of which many are the same), and then re-use that.
> > 
> > This really seems like it depends on which of the options we're talking
> > about..   Connecting to the server and asking what the current insert
> > point is, so we can check that the LSN isn't completely insane, seems
> > reasonable, but at least one option being discussed was to have
> > pg_basebackup actually *lock the page* (even if just for I/O..) and then
> > re-read it, and having an external tool doing that instead of the
> > backend seems like a whole different level to me.  That would involve
> > having an SQL function for "lock this page against I/O" and then another
> > for "unlock this page", wouldn't it?
> 
> Right.
> 
> But what if we just added a flag to the BASE_BACKUP command in the
> replication protocol that said "meh, I really just want to verify the
> checksums, so please send the data to devnull and only feed me regular
> status updates on this connection"?

I don't know whether BASE_BACKUP is the best interface for that (at
least right now) - backend/replication/basebackup.c's sendFile() gets
only an absolute filename to send, which is not adequate for more in-
depth server-based things like locking a particular page in a particular
relation of some particular tablespace.

ISTM that the fact that we had to teach it about different segment files
for checksum verification by splitting up the filename at "." implies
that it is not the correct level of abstraction (but maybe it could get
schooled some more about Postgres internals, e.g. by passing it a
RelFileNode struct and not a filename).


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID no.: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz




Re: Online verification of checksums

2019-03-29 Thread Magnus Hagander
On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost  wrote:

> Greetings,
>
> * Magnus Hagander (mag...@hagander.net) wrote:
> > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <
> tomas.von...@2ndquadrant.com>
> > wrote:
> >
> > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > >Hi,
> > > >
> > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > >> I agree that the current patch might have some corner-cases where it
> > > >> does not guarantee 100% accuracy in online mode, but I hope the
> current
> > > >> version at least has no more false negatives.
> > > >
> > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > >
> > >
> > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online
> mode
> > > communicate with the server, which would presumably address these
> issues.
> > > Can someone explain why not to do that?
> >
> > I agree that this effort seems better spent on fixing those issues there
> > (of which many are the same), and then re-use that.
>
> This really seems like it depends on which of the options we're talking
> about..   Connecting to the server and asking what the current insert
> point is, so we can check that the LSN isn't completely insane, seems
> reasonable, but at least one option being discussed was to have
> pg_basebackup actually *lock the page* (even if just for I/O..) and then
> re-read it, and having an external tool doing that instead of the
> backend seems like a whole different level to me.  That would involve
> having an SQL function for "lock this page against I/O" and then another
> for "unlock this page", wouldn't it?
>

Right.

But what if we just added a flag to the BASE_BACKUP command in the
replication protocol that said "meh, I really just want to verify the
checksums, so please send the data to devnull and only feed me regular
status updates on this connection"?
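
For illustration only: such a flag would be issued over an ordinary
physical replication connection, the same kind pg_basebackup opens. A
hedged sketch, where the VERIFY_ONLY option is invented for this example
and does not exist in the replication grammar:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* physical replication connection, as pg_basebackup makes one */
	PGconn	   *conn = PQconnectdb("replication=true dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		return 1;
	}

	/*
	 * VERIFY_ONLY is hypothetical: check checksums server-side, discard
	 * the data.  Assumes the flag would make the server return plain
	 * status rows instead of COPY data.
	 */
	res = PQexec(conn, "BASE_BACKUP LABEL 'verify' VERIFY_ONLY");
	if (PQresultStatus(res) != PGRES_COMMAND_OK &&
		PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "%s", PQerrorMessage(conn));

	PQclear(res);
	PQfinish(conn);
	return 0;
}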

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Online verification of checksums

2019-03-29 Thread Andres Freund
Hi,

On 2019-03-29 11:38:02 -0400, Stephen Frost wrote:
> The server-side function would essentially lock the page against i/o,
> re-read it off disk into an independent location, unlock the page, then
> calculate the checksum and report back?

Right. I think there's a few minor variations of how this could be done,
but that'd be the basic approach.


> That seems like it would be reasonable to me.  Wouldn't it make sense to
> then have pg_basebackup use that same function..?

Yea, probably. Or at least reuse the majority of it, I can imagine the
error reporting would be a bit different (sqlstates et al are needed for
the basebackup.c case, but not the pg_checksum case).

Greetings,

Andres Freund




Re: Online verification of checksums

2019-03-29 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-03-29 11:30:15 -0400, Stephen Frost wrote:
> > * Magnus Hagander (mag...@hagander.net) wrote:
> > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra 
> > > 
> > > wrote:
> > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > >Hi,
> > > > >
> > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > >> I agree that the current patch might have some corner-cases where it
> > > > >> does not guarantee 100% accuracy in online mode, but I hope the 
> > > > >> current
> > > > >> version at least has no more false negatives.
> > > > >
> > > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > > >
> > > >
> > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online 
> > > > mode
> > > > communicate with the server, which would presumably address these 
> > > > issues.
> > > > Can someone explain why not to do that?
> > > 
> > > I agree that this effort seems better spent on fixing those issues there
> > > (of which many are the same), and then re-use that.
> > 
> > This really seems like it depends on which of the options we're talking
> > about..   Connecting to the server and asking what the current insert
> > point is, so we can check that the LSN isn't completely insane, seems
> > reasonable, but at least one option being discussed was to have
> > pg_basebackup actually *lock the page* (even if just for I/O..) and then
> > re-read it, and having an external tool doing that instead of the
> > backend seems like a whole different level to me.  That would involve
> > having an SQL function for "lock this page against I/O" and then another
> > for "unlock this page", wouldn't it?
> 
> No, I don't think so. And we obviously couldn't have a SQL level
> function hold an LWLock after it has finished, that'd make undetected
> deadlocks triggerable by users.  The way I'd imagine that being done is
> to just perform the checksum test in the commandline tool, and whenever
> there's a checksum failure that could plausibly be a torn read, call a
> server side function that re-tests the page after locking it. Which then
> would just return the error message in a string.

The server-side function would essentially lock the page against i/o,
re-read it off disk into an independent location, unlock the page, then
calculate the checksum and report back?

That seems like it would be reasonable to me.  Wouldn't it make sense to
then have pg_basebackup use that same function..?

Thanks,

Stephen




Re: Online verification of checksums

2019-03-29 Thread Andres Freund
Hi,

On 2019-03-29 11:30:15 -0400, Stephen Frost wrote:
> * Magnus Hagander (mag...@hagander.net) wrote:
> > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra 
> > wrote:
> > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > >Hi,
> > > >
> > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > >> I agree that the current patch might have some corner-cases where it
> > > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > > >> version at least has no more false negatives.
> > > >
> > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > >
> > >
> > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > > communicate with the server, which would presumably address these issues.
> > > Can someone explain why not to do that?
> > 
> > I agree that this effort seems better spent on fixing those issues there
> > (of which many are the same), and then re-use that.
> 
> This really seems like it depends on which of the options we're talking
> about..   Connecting to the server and asking what the current insert
> point is, so we can check that the LSN isn't completely insane, seems
> reasonable, but at least one option being discussed was to have
> pg_basebackup actually *lock the page* (even if just for I/O..) and then
> re-read it, and having an external tool doing that instead of the
> backend seems like a whole different level to me.  That would involve
> having an SQL function for "lock this page against I/O" and then another
> for "unlock this page", wouldn't it?

No, I don't think so. And we obviously couldn't have a SQL level
function hold an LWLock after it has finished, that'd make undetected
deadlocks triggerable by users.  The way I'd imagine that being done is
to just perform the checksum test in the commandline tool, and whenever
there's a checksum failure that could plausibly be a torn read, call a
server side function that re-tests the page after locking it. Which then
would just return the error message in a string.

Greetings,

Andres Freund
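
A minimal sketch of the client half of that scheme, assuming a
hypothetical server-side function pg_recheck_block(path, blkno) that
re-tests the page under the I/O lock and returns an error string, or
NULL if the page verifies after the locked re-read (no such function
exists in core; all names here are illustrative):

#include <stdbool.h>
#include <stdio.h>
#include <libpq-fe.h>

/* Ask the server to re-test a block that failed our local checksum check. */
static bool
recheck_block_on_server(PGconn *conn, const char *relpath, unsigned int blkno)
{
	char		blkstr[32];
	const char *params[2];
	PGresult   *res;
	bool		ok = false;

	snprintf(blkstr, sizeof(blkstr), "%u", blkno);
	params[0] = relpath;
	params[1] = blkstr;

	res = PQexecParams(conn, "SELECT pg_recheck_block($1, $2)",
					   2, NULL, params, NULL, NULL, 0);
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
	{
		/* NULL result means the page verified fine on the server side */
		ok = PQgetisnull(res, 0, 0);
		if (!ok)
			fprintf(stderr, "%s, block %u: %s\n",
					relpath, blkno, PQgetvalue(res, 0, 0));
	}
	PQclear(res);
	return ok;
}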




Re: Online verification of checksums

2019-03-29 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra 
> wrote:
> 
> > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > >Hi,
> > >
> > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > >> I agree that the current patch might have some corner-cases where it
> > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > >> version at least has no more false negatives.
> > >
> > >False positives are *bad*. We shouldn't integrate code that has them.
> > >
> >
> > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > communicate with the server, which would presumably address these issues.
> > Can someone explain why not to do that?
> 
> I agree that this effort seems better spent on fixing those issues there
> (of which many are the same), and then re-use that.

This really seems like it depends on which of the options we're talking
about..   Connecting to the server and asking what the current insert
point is, so we can check that the LSN isn't completely insane, seems
reasonable, but at least one option being discussed was to have
pg_basebackup actually *lock the page* (even if just for I/O..) and then
re-read it, and having an external tool doing that instead of the
backend seems like a whole different level to me.  That would involve
having an SQL function for "lock this page against I/O" and then another
for "unlock this page", wouldn't it?

> > FWIW I've initially argued against that, believing that we can address
> > those issues in some other way, and I'd love if that was possible. But
> > considering we're still trying to make that work reliably I think the
> > reasonable conclusion is that Andres was right that communicating with
> > the server is necessary.

As part of a backup, you could cross-check against the pages written out
into the WAL and be confident that at least everything which was backed
up had been checked.  That doesn't cover things like unlogged tables
though.

For my part, at least, adding additional checks around the LSN seems
like a good solution (though we can't allow those checks to turn into
false positives...) and would seriously reduce the risk that we have
false negatives (we can *not* completely eliminate false negatives
entirely..  we could possibly get to a point where at least we don't
have any more false negatives than PG itself has but it looks like an
awful lot of work and ends up adding its own risks...).

As I've said before, I'd certainly support a background worker which
performs ongoing checksum validation of pages and that would be able to
use the same approach as what we do with pg_basebackup, but having an
external tool locking pages seems really unlikely to be reasonable.

Thanks!

Stephen




Re: Online verification of checksums

2019-03-29 Thread Magnus Hagander
On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra 
wrote:

> On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> >Hi,
> >
> >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> >> I agree that the current patch might have some corner-cases where it
> >> does not guarantee 100% accuracy in online mode, but I hope the current
> >> version at least has no more false negatives.
> >
> >False positives are *bad*. We shouldn't integrate code that has them.
> >
>
> Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> communicate with the server, which would presumably address these issues.
> Can someone explain why not to do that?
>

I agree that this effort seems better spent on fixing those issues there
(of which many are the same), and then re-use that.


FWIW I've initially argued against that, believing that we can address
> those issues in some other way, and I'd love if that was possible. But
> considering we're still trying to make that work reliably I think the
> reasonable conclusion is that Andres was right that communicating with
> the server is necessary.
>
> Of course, I definitely appreciate people are working on this, otherwise
> we wouldn't be having this discussion ...
>

+1.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Online verification of checksums

2019-03-28 Thread Tomas Vondra

On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:

Hi,

On 2019-03-28 21:09:22 +0100, Michael Banck wrote:

I agree that the current patch might have some corner-cases where it
does not guarantee 100% accuracy in online mode, but I hope the current
version at least has no more false negatives.


False positives are *bad*. We shouldn't integrate code that has them.



Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
communicate with the server, which would presumably address these issues.
Can someone explain why not to do that?

FWIW I've initially argued against that, believing that we can address
those issues in some other way, and I'd love if that was possible. But
considering we're still trying to make that work reliably I think the
reasonable conclusion is that Andres was right that communicating with
the server is necessary.

Of course, I definitely appreciate people are working on this, otherwise
we wouldn't be having this discussion ...

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Online verification of checksums

2019-03-28 Thread Andres Freund
Hi,

On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> I agree that the current patch might have some corner-cases where it
> does not guarantee 100% accuracy in online mode, but I hope the current
> version at least has no more false negatives.

False positives are *bad*. We shouldn't integrate code that has them.

Greetings,

Andres Freund




Re: Online verification of checksums

2019-03-28 Thread Michael Banck
Hi,

On Thu, 2019-03-28 at 18:19 +0100, Tomas Vondra wrote:
> On Thu, Mar 28, 2019 at 05:08:33PM +0100, Michael Banck wrote:
> > I also fixed the two issues Andres reported, namely a zeroed-out
> > page header and a random LSN. The first is caught by checking for an
> > all-zero page in the way PageIsVerified() does. The second is caught by
> > comparing the upper 32 bits of the LSN as well and demanding that they
> > are equal. If the LSN is corrupted, the upper 32 bits should be wildly
> > different to the current checkpoint LSN.
> > 
> > Well, at least that is a stab at a fix; there is a window where the
> > upper 32 bits could legitimately be different. In order to make that as
> > small as possible, I update the checkpoint LSN every once in a while.

I decided it makes more sense to just re-read the checkpoint LSN from
the control file when we encounter a wrong checksum on re-read of a page,
as that is when it counts, instead of doing it only every once in a
while.

> Doesn't that mean we'll report a false positive?

A false positive would be pg_checksums claiming a block has a wrong
checksum while in fact it does not (after it is correctly written out
and synced to disk), right?

If pg_checksums reads a current first part and a stale second part twice
in a row (we re-read the block), then the LSN of the first part would
presumably(?) be higher than the latest checkpoint LSN. If there was a
wraparound in the lower part of the LSN so that the upper part is now
different to the latest checkpoint LSN, then pg_checksums would report
this as a false positive I believe. 

We could add some additional heuristics, like checking that the upper
part of the LSN has advanced by at most one, but that does not seem to
make it 100% robust either, does it?

If pg_checksums reads a current second part and a stale first part
twice, then the page header LSN would presumably be lower than the
checkpoint LSN and again a false positive would be reported.

At least in my testing I haven't seen the second case, and the first
(disregarding the wraparound issue for now) only extremely rarely, if at
all (usually the torn page is gone on re-read). The first case, requiring
a wraparound since the latest checkpoint LSN update, also seems quite
narrow compared to the issue of random data being written due to
corruption. So I think it is more important to make sure random data
won't be a false negative than to worry about this being a false positive.

Maybe we can just issue a warning in online mode that some checksum
failures could be false positives and advise the user to recheck those
files (using the -r switch) again? I have added this in the attached new
version:

+  printf(_("%s ran against an online cluster and found some bad 
checksums.\n"), progname);
+  printf(_("It could be that those are false positives due concurrently 
updated blocks,\n"));
+  printf(_("checking the offending files again with the -r option is 
advised.\n"));

It was not mentioned on this thread, but I want to stress again that you
cannot run the current pg_checksums on a base backup, because its control
file claims the cluster is still online. This makes the current program
pretty useless for production setups right now in my opinion, as few
people have the luxury of regular maintenance downtimes during which
pg_checksums could run, and running it against base backups is quite
cumbersome.

Maybe we can improve things by checking for the postmaster.pid as well
and going ahead (only for --check of course) if it is missing, but that
hasn't been implemented yet.
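
A sketch of that heuristic (with the caveat that a crashed server can
leave postmaster.pid behind, so this is only an approximation):

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Crude liveness hint: does DataDir/postmaster.pid exist? */
static bool
postmaster_pid_present(const char *datadir)
{
	char		path[4096];
	struct stat st;

	snprintf(path, sizeof(path), "%s/postmaster.pid", datadir);
	return stat(path, &st) == 0;
}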

I agree that the current patch might have some corner-cases where it
does not guarantee 100% accuracy in online mode, but I hope the current
version at least has no more false negatives.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID no.: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 1f4d4ab8b4..5027809a59 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -37,12 +37,10 @@ PostgreSQL documentation
   Description
   
pg_checksums checks, enables or disables data
-   checksums in a PostgreSQL cluster.  The server
-   must be shut down cleanly before running
-   pg_checksums. The exit status is zero if there
-   are no checksum errors when checking them, and nonzero if at least one
-   checksum failure is detected. If enabling or disabling checksums, the
-   exit status is nonzero if the operation failed.
+   checksums in a PostgreSQL cluster. The exit
+   status is zero if there are no checksum errors when checking them, and
+   nonzero if at least one checksum failure is detected. If enabling or
+   disabling checksums, the exit status is nonzero if the operation failed.

Re: Online verification of checksums

2019-03-28 Thread Tomas Vondra

On Thu, Mar 28, 2019 at 05:08:33PM +0100, Michael Banck wrote:

Hi,

I have rebased this patch now.

I also fixed the two issues Andres reported, namely a zeroed-out
page header and a random LSN. The first is caught by checking for an
all-zero page in the way PageIsVerified() does. The second is caught by
comparing the upper 32 bits of the LSN as well and demanding that they
are equal. If the LSN is corrupted, the upper 32 bits should be wildly
different to the current checkpoint LSN.

Well, at least that is a stab at a fix; there is a window where the
upper 32 bits could legitimately be different. In order to make that as
small as possible, I update the checkpoint LSN every once in a while.



Doesn't that mean we'll report a false positive?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Online verification of checksums

2019-03-28 Thread Michael Banck
Hi,

I have rebased this patch now.

I also fixed the two issues Andres reported, namely a zeroed-out
page header and a random LSN. The first is caught by checking for an
all-zero page in the way PageIsVerified() does. The second is caught by
comparing the upper 32 bits of the LSN as well and demanding that they
are equal. If the LSN is corrupted, the upper 32 bits should be wildly
different to the current checkpoint LSN.

Well, at least that is a stab at a fix; there is a window where the
upper 32 bits could legitimately be different. In order to make that as
small as possible, I update the checkpoint LSN every once in a while.
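
In code, the comparison described above might look roughly like the
following sketch (assuming XLogRecPtr is a plain 64-bit integer, as it
is in the server; the actual patch may differ in detail):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* matches the server's definition */

/*
 * Excuse a failing page as possibly torn only if its pd_lsn is newer
 * than the last checkpoint *and* shares the checkpoint's upper 32 bits,
 * so that random garbage in the LSN field is not silently accepted.
 */
static bool
lsn_plausibly_after_checkpoint(XLogRecPtr pd_lsn, XLogRecPtr ckpt_lsn)
{
	return (uint32_t) (pd_lsn >> 32) == (uint32_t) (ckpt_lsn >> 32) &&
		pd_lsn > ckpt_lsn;
}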

On Mon, 2019-03-18 at 21:15 +0100, Michael Banck wrote:
> I have also added a paragraph to the documentation about possibly
> skipping new or recently updated pages:
> 
> +   If the cluster is online, pages that have been (re-)written since the last
> +   checkpoint will not count as checksum failures if they cannot be read or
> +   verified correctly.

I have removed that for now as it seems to be more confusing than
helpful.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID no.: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 1f4d4ab8b4..5027809a59 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -37,12 +37,10 @@ PostgreSQL documentation
   Description
   
pg_checksums checks, enables or disables data
-   checksums in a PostgreSQL cluster.  The server
-   must be shut down cleanly before running
-   pg_checksums. The exit status is zero if there
-   are no checksum errors when checking them, and nonzero if at least one
-   checksum failure is detected. If enabling or disabling checksums, the
-   exit status is nonzero if the operation failed.
+   checksums in a PostgreSQL cluster. The exit
+   status is zero if there are no checksum errors when checking them, and
+   nonzero if at least one checksum failure is detected. If enabling or
+   disabling checksums, the exit status is nonzero if the operation failed.
   
 
   
@@ -50,6 +48,12 @@ PostgreSQL documentation
the cluster, disabling checksums will only update the file
pg_control.
   
+
+  
+   If the cluster is online, pages that have been (re-)written since the last
+   checkpoint will not count as checksum failures if they cannot be verified
+   correctly.
+  
  
 
  
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 5c185ebed8..f62fdc90a4 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -1,8 +1,7 @@
 /*-
  *
  * pg_checksums.c
- *	  Checks, enables or disables page level checksums for an offline
- *	  cluster
+ *	  Checks, enables or disables page level checksums for a cluster
  *
  * Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -30,13 +29,20 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
+static int64 blocks_last_lsn_update = 0;
 
 static char *only_relfilenode = NULL;
 static bool do_sync = true;
 static bool verbose = false;
+static bool online = false;
+
+static char *DataDir = NULL;
 
 typedef enum
 {
@@ -97,6 +103,23 @@ static const char *const skip[] = {
 	NULL,
 };
 
+static void
+update_checkpoint_lsn(void)
+{
+	bool		crc_ok;
+
	ControlFile = get_controlfile(DataDir, progname, &crc_ok);
+	if (!crc_ok)
+	{
+		fprintf(stderr, _("%s: pg_control CRC value is incorrect\n"), progname);
+		exit(1);
+	}
+
+	/* Update checkpointLSN and blocks_last_lsn_update with the current value */
+	checkpointLSN = ControlFile->checkPoint;
+	blocks_last_lsn_update = blocks;
+}
+
 static bool
 skipfile(const char *fn)
 {
@@ -116,6 +139,10 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	int			block_loc = 0;
+	bool		block_retry = false;
+	bool		all_zeroes;
+	size_t	   *pagebytes;
 	int			flags;
 
 	Assert(mode == PG_MODE_ENABLE ||
@@ -126,6 +153,12 @@ scan_file(const char *fn, BlockNumber segmentno)
 
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 progname, fn, strerror(errno));
 		exit(1);
@@ -136,27 +169,129 @@ scan_file(const char *fn, BlockNumber segmentno)
 	for (blockno = 0;; blockno++)
 	{
 		

Re: [Patch] Base backups and random or zero pageheaders (was: Online verification of checksums)

2019-03-26 Thread Michael Banck
Hi,

On Tue, 2019-03-26 at 10:30 -0700, Andres Freund wrote:
> On 2019-03-26 18:22:55 +0100, Michael Banck wrote:
> > On Tue, 2019-03-19 at 13:00 -0700, Andres Freund wrote:
> > > CREATE TABLE corruptme AS SELECT g.i::text AS data FROM 
> > > generate_series(1, 100) g(i);
> > > SELECT pg_relation_size('corruptme');
> > > postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || 
> > > pg_relation_filepath('corruptme');
> > > ┌─────────────────────────────────────┐
> > > │              ?column?               │
> > > ├─────────────────────────────────────┤
> > > │ /srv/dev/pgdev-dev/base/13390/16384 │
> > > └─────────────────────────────────────┘
> > > (1 row)
> > > dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 
> > > conv=notrunc
> > > 
> > > Try a basebackup and see how many times it'll detect the corrupt
> > > data. In the vast majority of cases you're going to see checksum
> > > failures when reading the data for normal operation, but not when using
> > > basebackup (or this new tool).
> > > 
> > > At the very very least this would need to do
> > > 
> > > a) checks that the page is all zeroes if PageIsNew() (like
> > >PageIsVerified() does for the backend). That avoids missing cases
> > >where corruption just zeroed out the header, but not the whole page.
> > > b) Check that pd_lsn is between startlsn and the insertion pointer. That
> > >avoids accepting just about all random data.
> > > 
> > > And that'd *still* be less strenuous than what normal backends
> > > check. And that's already not great (due to not noticing zeroed out
> > > data).
> > 
> > I've done the above in the attached patch now. Well, literally like an
> > hour ago, then went jogging and came back to see you had outlined
> > fixing this differently in a separate thread. Still might be helpful for
> > the TAP test changes at least.
> 
> Sorry, I just hadn't seen much movement on this, and I'm a bit concerned
> about such a critical issue not being addressed.

Sure, I was working on this a bit on and off for a few days but I had
random corruption issues which I finally tracked down earlier to reusing
"for (i=0 [...]" via copy, d'oh.  That's why I renamed the `i'
variable to `page_in_buf' cause it's a pretty long loop so should have a
useful variable name IMO.

> > /*
> > -* Only check pages which have not been 
> > modified since the
> > -* start of the base backup. Otherwise, they 
> > might have been
> > -* written only halfway and the checksum would 
> > not be valid.
> > -* However, replaying WAL would reinstate the 
> > correct page in
> > -* this case. We also skip completely new 
> > pages, since they
> > -* don't have a checksum yet.
> > +* We skip completely new pages after checking 
> > they are
> > +* all-zero, since they don't have a checksum 
> > yet.
> >  */
> > -   if (!PageIsNew(page) && PageGetLSN(page) < 
> > startptr)
> > +   if (PageIsNew(page))
> > {
> > -   checksum = pg_checksum_page((char *) 
> > page, blkno + segmentno * RELSEG_SIZE);
> > -   phdr = (PageHeader) page;
> > -   if (phdr->pd_checksum != checksum)
> > +   all_zeroes = true;
> > +   pagebytes = (size_t *) page;
> > +   for (int i = 0; i < (BLCKSZ / 
> > sizeof(size_t)); i++)
> 
> Can we please abstract the zeroness check into a separate function to
> be used both by PageIsVerified() and this?

Ok, done so as PageIsZero further down in bufpage.c.

> > +   if (!all_zeroes)
> > +   {
> > +   /*
> > +* pd_upper is zero, but the 
> > page is not all zero.  We
> > +* cannot run 
> > pg_checksum_page() on the page as it
> > +* would throw an assertion 
> > failure.  Consider this a
> > +* checksum failure.
> > +*/
> 
> I don't think the assertion failure is the relevant bit here, it's that
> the page is corrupted, no?

Well, relevant in the sense that the reader might wonder why we don't
just call pg_checksum_page() and have a consistent error message with
the other codepath.

We could maybe run pg_checksum_block() on it and reverse the rest of the
permutations from pg_checksum_page() but that might be overly
complicated for 

Re: [Patch] Base backups and random or zero pageheaders (was: Online verification of checksums)

2019-03-26 Thread Andres Freund
On 2019-03-26 18:22:55 +0100, Michael Banck wrote:
> Hi,
> 
> On Tue, 2019-03-19 at 13:00 -0700, Andres Freund wrote:
> > CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 
> > 100) g(i);
> > SELECT pg_relation_size('corruptme');
> > postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || 
> > pg_relation_filepath('corruptme');
> > ┌─────────────────────────────────────┐
> > │              ?column?               │
> > ├─────────────────────────────────────┤
> > │ /srv/dev/pgdev-dev/base/13390/16384 │
> > └─────────────────────────────────────┘
> > (1 row)
> > dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 
> > conv=notrunc
> > 
> > Try a basebackup and see how many times it'll detect the corrupt
> > data. In the vast majority of cases you're going to see checksum
> > failures when reading the data for normal operation, but not when using
> > basebackup (or this new tool).
> > 
> > At the very very least this would need to do
> > 
> > a) checks that the page is all zeroes if PageIsNew() (like
> >PageIsVerified() does for the backend). That avoids missing cases
> >where corruption just zeroed out the header, but not the whole page.
> > b) Check that pd_lsn is between startlsn and the insertion pointer. That
> >avoids accepting just about all random data.
> > 
> > And that'd *still* be less strenuous than what normal backends
> > check. And that's already not great (due to not noticing zeroed out
> > data).
> 
> I've done the above in the attached patch now. Well, literally like an
> hour ago, then went jogging and came back to see you had outlined
> fixing this differently in a separate thread. Still might be helpful for
> the TAP test changes at least.

Sorry, I just hadn't seen much movement on this, and I'm a bit concerned
about such a critical issue not being addressed.


>   /*
> -  * Only check pages which have not been 
> modified since the
> -  * start of the base backup. Otherwise, they 
> might have been
> -  * written only halfway and the checksum would 
> not be valid.
> -  * However, replaying WAL would reinstate the 
> correct page in
> -  * this case. We also skip completely new 
> pages, since they
> -  * don't have a checksum yet.
> +  * We skip completely new pages after checking 
> they are
> +  * all-zero, since they don't have a checksum 
> yet.
>*/
> - if (!PageIsNew(page) && PageGetLSN(page) < 
> startptr)
> + if (PageIsNew(page))
>   {
> - checksum = pg_checksum_page((char *) 
> page, blkno + segmentno * RELSEG_SIZE);
> - phdr = (PageHeader) page;
> - if (phdr->pd_checksum != checksum)
> + all_zeroes = true;
> + pagebytes = (size_t *) page;
> + for (int i = 0; i < (BLCKSZ / 
> sizeof(size_t)); i++)

Can we please abstract the zeroness check into a separate function to
be used both by PageIsVerified() and this?


> + if (!all_zeroes)
> + {
> + /*
> +  * pd_upper is zero, but the 
> page is not all zero.  We
> +  * cannot run 
> pg_checksum_page() on the page as it
> +  * would throw an assertion 
> failure.  Consider this a
> +  * checksum failure.
> +  */

I don't think the assertion failure is the relevant bit here, it's that
the page is corrupted, no?


> + /*
> +  * Only check pages which have not been 
> modified since the
> +  * start of the base backup. Otherwise, 
> they might have been
> +  * written only halfway and the 
> checksum would not be valid.
> +  * However, replaying WAL would 
> reinstate the correct page in
> +  * this case. If the page LSN is larger 
> than the current redo
> +  * pointer then we assume a bogus LSN 
> due to random page header
> +  * corruption and do verify the 
> checksum.
> +  */
> + if 

[Patch] Base backups and random or zero pageheaders (was: Online verification of checksums)

2019-03-26 Thread Michael Banck
Hi,

On Tue, 2019-03-19 at 13:00 -0700, Andres Freund wrote:
> CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 
> 100) g(i);
> SELECT pg_relation_size('corruptme');
> postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || 
> pg_relation_filepath('corruptme');
> ┌─────────────────────────────────────┐
> │              ?column?               │
> ├─────────────────────────────────────┤
> │ /srv/dev/pgdev-dev/base/13390/16384 │
> └─────────────────────────────────────┘
> (1 row)
> dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 
> conv=notrunc
> 
> Try a basebackup and see how many times it'll detect the corrupt
> data. In the vast majority of cases you're going to see checksum
> failures when reading the data for normal operation, but not when using
> basebackup (or this new tool).
> 
> At the very very least this would need to do
> 
> a) checks that the page is all zeroes if PageIsNew() (like
>PageIsVerified() does for the backend). That avoids missing cases
>where corruption just zeroed out the header, but not the whole page.
> b) Check that pd_lsn is between startlsn and the insertion pointer. That
>avoids accepting just about all random data.
> 
> And that'd *still* be less strenuous than what normal backends
> check. And that's already not great (due to not noticing zeroed out
> data).

I've done the above in the attached patch now. Well, literally like an
hour ago, then went jogging and came back to see you had outlined
fixing this differently in a separate thread. Still might be helpful for
the TAP test changes at least.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID no.: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 537f09e342..70a41b6998 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1369,16 +1369,18 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
 		 bool missing_ok, Oid dboid)
 {
 	FILE	   *fp;
+	bool		all_zeroes = false;
 	BlockNumber blkno = 0;
 	bool		block_retry = false;
 	char		buf[TAR_SEND_SIZE];
 	uint16		checksum;
 	int			checksum_failures = 0;
 	off_t		cnt;
-	int			i;
+	int			page_in_buf;
 	pgoff_t		len = 0;
 	char	   *page;
 	size_t		pad;
+	size_t *pagebytes;
 	PageHeader	phdr;
 	int			segmentno = 0;
 	char	   *segmentpath;
@@ -1449,78 +1451,42 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
 
 		if (verify_checksum)
 		{
-			for (i = 0; i < cnt / BLCKSZ; i++)
+			for (page_in_buf = 0; page_in_buf < cnt / BLCKSZ; page_in_buf++)
 			{
-page = buf + BLCKSZ * i;
+page = buf + BLCKSZ * page_in_buf;
 
 /*
- * Only check pages which have not been modified since the
- * start of the base backup. Otherwise, they might have been
- * written only halfway and the checksum would not be valid.
- * However, replaying WAL would reinstate the correct page in
- * this case. We also skip completely new pages, since they
- * don't have a checksum yet.
+ * We skip completely new pages after checking they are
+ * all-zero, since they don't have a checksum yet.
  */
-if (!PageIsNew(page) && PageGetLSN(page) < startptr)
+if (PageIsNew(page))
 {
-	checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
-	phdr = (PageHeader) page;
-	if (phdr->pd_checksum != checksum)
+	all_zeroes = true;
+	pagebytes = (size_t *) page;
+	for (int i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
 	{
-		/*
-		 * Retry the block on the first failure.  It's
-		 * possible that we read the first 4K page of the
-		 * block just before postgres updated the entire block
-		 * so it ends up looking torn to us.  We only need to
-		 * retry once because the LSN should be updated to
-		 * something we can ignore on the next pass.  If the
-		 * error happens again then it is a true validation
-		 * failure.
-		 */
-		if (block_retry == false)
+		if (pagebytes[i] != 0)
 		{
-			/* Reread the failed block */
-			if (fseek(fp, -(cnt - BLCKSZ * i), SEEK_CUR) == -1)
-			{
-ereport(ERROR,
-		(errcode_for_file_access(),
-		 errmsg("could not fseek in file \"%s\": %m",
-readfilename)));
-			}
-
-			if (fread(buf + BLCKSZ * i, 1, BLCKSZ, fp) != BLCKSZ)
-			{
-ereport(ERROR,
-		(errcode_for_file_access(),
-		 errmsg("could not reread block %d of file \"%s\": %m",
-blkno, 

Re: Online verification of checksums

2019-03-19 Thread Michael Paquier
On Tue, Mar 19, 2019 at 02:44:52PM -0700, Andres Freund wrote:
> That's *PRECISELY* my point. I think it's a bad idea to do online
> checksumming from outside the backend. It needs to be inside the
> backend, and if there's any verification failures on a block, it needs
> to acquire the IO lock on the page, and reread from disk.

Yeah, FWIW, Julien Rouhaud was mentioning to me that we could use
mdread() and loop over the blocks so that we don't end up loading
corrupted blocks into shared buffers, checking along the way whether
the block is already in shared buffers or not.
--
Michael




Re: Online verification of checksums

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-19 22:39:16 +0100, Michael Banck wrote:
> On Tue, 2019-03-19 at 13:00 -0700, Andres Freund wrote:
> > a) checks that the page is all zeroes if PageIsNew() (like
> >PageIsVerified() does for the backend). That avoids missing cases
> >where corruption just zeroed out the header, but not the whole page.
> 
> We can't run pg_checksum_page() on those afterwards though as it would
> fire an assertion:
> 
> |pg_checksums: [...]/../src/include/storage/checksum_impl.h:194:
> |pg_checksum_page: Assertion `!(((PageHeader) (&cpage->phdr))->pd_upper
> |== 0)' failed.
> 
> But we should count it as a checksum error and generate an appropriate
> error message in that case.

All I'm saying is that if PageIsNew() you need to run the same checks
that PageIsVerified() runs in that case. Namely verifying that the page
is all-zeroes, rather than just the pd_upper field.  That's separate
from running pg_checksum_page().


> > b) Check that pd_lsn is between startlsn and the insertion pointer. That
> >avoids accepting just about all random data.
> 
> However, since pg_checksums is a stand-alone application it can't just
> access the insertion pointer, can it? We could maybe set a threshold
> from the last checkpoint after which we consider the pd_lsn bogus. But
> what's a good threshold here?

That's *PRECISELY* my point. I think it's a bad idea to do online
checksumming from outside the backend. It needs to be inside the
backend, and if there's any verification failures on a block, it needs
to acquire the IO lock on the page, and reread from disk.

Greetings,

Andres Freund



Re: Online verification of checksums

2019-03-19 Thread Michael Banck
Hi,

On Tue, 2019-03-19 at 13:00 -0700, Andres Freund wrote:
> On 2019-03-20 03:27:55 +0800, Stephen Frost wrote:
> > On Tue, Mar 19, 2019 at 23:59 Andres Freund  wrote:
> > > On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > > > On Tue, 2019-03-19 at 11:22 -0400, Robert Haas wrote:
> > > > > It's torn pages that I am concerned about - the server is writing and
> > > > > we are reading, and we get a mix of old and new content.  We have been
> > > > > quite diligent about protecting ourselves from such risks elsewhere,
> > > > > and checksum verification should not be held to any lesser standard.
> > > > 
> > > > If we see a checksum failure on an otherwise correctly read block in
> > > > online mode, we retry the block on the theory that we might have read a
> > > > torn page.  If the checksum verification still fails, we compare its LSN
> > > to the LSN of the current checkpoint and don't mind if it's newer.  This
> > > way, a torn page should not cause a false positive either way, I
> > > think?
> > > 
> > > False positives, no. But there's plenty potential for false
> > negatives. In plenty of clusters a large fraction of the pages is going to
> > > be touched in most checkpoints.
> > 
> > 
> > How is it a false negative?  The page was in the middle of being
> > written,
> 
> You don't actually know that. It could just be random gunk in the LSN,
> and this type of logic just ignores such failures as long as the random
> gunk is above the system's LSN.

Right, I think this needs to be taken into account. For pg_basebackup,
that'd be an additional check for GetRedoRecPtr() or something 
in the below check:

[...]

> Well, I don't know what to tell you. But:
> 
>   /*
>* Only check pages which have not been 
> modified since the
>* start of the base backup. Otherwise, they 
> might have been
>* written only halfway and the checksum would 
> not be valid.
>* However, replaying WAL would reinstate the 
> correct page in
>* this case. We also skip completely new 
> pages, since they
>* don't have a checksum yet.
>*/
>   if (!PageIsNew(page) && PageGetLSN(page) < 
> startptr)
>   {
> 
> doesn't consider plenty of scenarios, as pointed out above.  It'd be one
> thing if the concerns I point out above were actually commented upon and
> weighed as not substantial enough (not that I know how). But...
> 

> > Do you have any example cases where the code in pg_basebackup has resulted
> > in either a false positive or a false negative?  Any case which can be
> > shown to result in either?
> 
> CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 
> 100) g(i);
> SELECT pg_relation_size('corruptme');
> postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || 
> pg_relation_filepath('corruptme');
> ┌─────────────────────────────────────┐
> │              ?column?               │
> ├─────────────────────────────────────┤
> │ /srv/dev/pgdev-dev/base/13390/16384 │
> └─────────────────────────────────────┘
> (1 row)
> dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 
> conv=notrunc
> 
> Try a basebackup and see how many times it'll detect the corrupt
> data. In the vast majority of cases you're going to see checksum
> failures when reading the data for normal operation, but not when using
> basebackup (or this new tool).

Right, see above.

> At the very very least this would need to do
> 
> a) checks that the page is all zeroes if PageIsNew() (like
>PageIsVerified() does for the backend). That avoids missing cases
>where corruption just zeroed out the header, but not the whole page.

We can't run pg_checksum_page() on those afterwards though as it would
fire an assertion:

|pg_checksums: [...]/../src/include/storage/checksum_impl.h:194:
|pg_checksum_page: Assertion `!(((PageHeader) (&cpage->phdr))->pd_upper
|== 0)' failed.

But we should count it as a checksum error and generate an appropriate
error message in that case.

> b) Check that pd_lsn is between startlsn and the insertion pointer. That
>avoids accepting just about all random data.

However, since pg_checksums is a stand-alone application it can't just
access the insertion pointer, can it? We could maybe set a threshold
from the last checkpoint after which we consider the pd_lsn bogus. But
what's a good threshold here?
 
And/or we could port the other sanity checks from PageIsVerified:

|if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
|   p->pd_lower <= p->pd_upper &&
|   p->pd_upper <= p->pd_special &&
|   p->pd_special <= BLCKSZ &&
|   p->pd_special == MAXALIGN(p->pd_special))

Re: Online verification of checksums

2019-03-19 Thread Robert Haas
On Tue, Mar 19, 2019 at 4:49 PM Andres Freund  wrote:
> To demonstrate that I ran a loop that verified that a) a normal backend
> query using the table detects the corruption, b) pg_basebackup doesn't.
>
> i=0;
> while true; do
> i=$(($i+1));
> echo attempt $i;
> dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 
> conv=notrunc 2>/dev/null;
> psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break;
> ~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X 
> fetch -F t -D - -c fast > /dev/null || break;
> done
>
> (excuse the crappy one-off sh)
>
> had, during ~12k iterations, always detected the corruption in the
> backend, and never via pg_basebackup. Given the likely LSNs in a
> cluster, that's not too surprising.

Wow.  So we shipped a checksum-verification feature (in pg_basebackup)
that reliably fails to detect blatantly corrupt pages.  That's pretty
awful.  Your chances get better the more WAL you've ever generated,
but you have to generate 163 petabytes of WAL to have a 1% chance of
detecting a page of random garbage, so realistically they never get
very good.
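
A back-of-the-envelope check of that figure: a page of random garbage is
excused whenever its random 64-bit pd_lsn happens to exceed the backup
start LSN, so it is detected with probability roughly
$P \approx \text{startptr} / 2^{64}$. A 1% detection rate therefore requires

  $0.01 \times 2^{64}\ \text{bytes} \approx 1.8 \times 10^{17}\ \text{bytes} \approx 163\ \text{PiB}$

of WAL ever having been generated by the cluster.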

It's probably fair to point out that flipping a couple of random bytes
on the page is a more likely error than replacing the entire page with
garbage, and the check as designed will detect that fairly reliably --
unless those bytes are very near the beginning of the page.  Still,
that leaves a lot of kinds of corruption that this will not catch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online verification of checksums

2019-03-19 Thread Andres Freund
On 2019-03-19 13:00:50 -0700, Andres Freund wrote:
> As it stands, the logic seems to give more false confidence than
> anything else.

To demonstrate that I ran a loop that verified that a) a normal backend
query using the table detects the corruption, b) pg_basebackup doesn't.

i=0;
while true; do
i=$(($i+1));
echo attempt $i;
dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 
conv=notrunc 2>/dev/null;
psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break;
~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X 
fetch -F t -D - -c fast > /dev/null || break;
done

(excuse the crappy one-off sh)

had, during ~12k iterations, always detected the corruption in the
backend, and never via pg_basebackup. Given the likely LSNs in a
cluster, that's not too surprising.

Greetings,

Andres Freund



Re: Online verification of checksums

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-20 03:27:55 +0800, Stephen Frost wrote:
> On Tue, Mar 19, 2019 at 23:59 Andres Freund  wrote:
> > On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > > On Tue, 2019-03-19 at 11:22 -0400, Robert Haas wrote:
> > > > It's torn pages that I am concerned about - the server is writing and
> > > > we are reading, and we get a mix of old and new content.  We have been
> > > > quite diligent about protecting ourselves from such risks elsewhere,
> > > > and checksum verification should not be held to any lesser standard.
> > >
> > > If we see a checksum failure on an otherwise correctly read block in
> > > online mode, we retry the block on the theory that we might have read a
> > > torn page.  If the checksum verification still fails, we compare its LSN
> > to the LSN of the current checkpoint and don't mind if it's newer.  This
> > way, a torn page should not cause a false positive either way, I
> > think?
> >
> > False positives, no. But there's plenty potential for false
> negatives. In plenty of clusters a large fraction of the pages is going to
> > be touched in most checkpoints.
> 
> 
> How is it a false negative?  The page was in the middle of being
> written,

You don't actually know that. It could just be random gunk in the LSN,
and this type of logic just ignores such failures as long as the random
gunk is above the system's LSN.

And the basebackup logic doesn't only ignore pages where both the
checksum failed and the LSN is between startptr and the current
insertion pointer; it does so for *any* page that has a pd_upper != 0
and a pd_lsn > startptr. Given typical startptr values (skewing heavily
towards lower int64s), that means that random data is more likely than
not to pass this test.

As it stands, the logic seems to give more false confidence than
anything else.


> > The basebackup checksum verification works in the same way.
> >
> > Shouldn't have been merged that way.
> 
> 
> I have a hard time not finding this offensive.  These issues were
> considered, discussed, and well thought out, with the result being
> committed after agreement.

Well, I don't know what to tell you. But:

/*
 * Only check pages which have not been 
modified since the
 * start of the base backup. Otherwise, they 
might have been
 * written only halfway and the checksum would 
not be valid.
 * However, replaying WAL would reinstate the 
correct page in
 * this case. We also skip completely new 
pages, since they
 * don't have a checksum yet.
 */
if (!PageIsNew(page) && PageGetLSN(page) < 
startptr)
{

doesn't consider plenty of scenarios, as pointed out above.  It'd be one
thing if the concerns I point out above were actually commented upon and
weighed as not substantial enough (not that I know how). But...




> Do you have any example cases where the code in pg_basebackup has resulted
> in either a false positive or a false negative?  Any case which can be
> shown to result in either?

CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 
100) g(i);
SELECT pg_relation_size('corruptme');
postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || 
pg_relation_filepath('corruptme');
┌─┐
│  ?column?   │
├─┤
│ /srv/dev/pgdev-dev/base/13390/16384 │
└─┘
(1 row)
dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 
conv=notrunc

Try a basebackup and see how many times it'll detect the corrupt
data. In the vast majority of cases you're going to see checksum
failures when reading the data for normal operation, but not when using
basebackup (or this new tool).

At the very very least this would need to do

a) checks that the page is all zeroes if PageIsNew() (like
   PageIsVerified() does for the backend). That avoids missing cases
   where corruption just zeroed out the header, but not the whole page.
b) Check that pd_lsn is between startlsn and the insertion pointer. That
   avoids accepting just about all random data.

And that'd *still* be less strenuous than what normal backends
check. And that's already not great (due to not noticing zeroed out
data).

I fail to see how it's offensive to describe this as "shouldn't have
been merged that way".

Greetings,

Andres Freund
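
Check (a) amounts to the word-at-a-time all-zeroes scan that
PageIsVerified() performs; a self-contained sketch (BLCKSZ hardcoded to
the default 8 kB page size for illustration):

#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192				/* assumption: default page size */

/*
 * Word-at-a-time scan, as in PageIsVerified()'s all-zeroes case.
 * Assumes the page buffer is suitably aligned for size_t access.
 */
static bool
page_is_all_zeroes(const char *page)
{
	const size_t *pagewords = (const size_t *) page;

	for (size_t i = 0; i < BLCKSZ / sizeof(size_t); i++)
	{
		if (pagewords[i] != 0)
			return false;
	}
	return true;
}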



Re: Online verification of checksums

2019-03-19 Thread Stephen Frost
Greetings,

On Tue, Mar 19, 2019 at 23:59 Andres Freund  wrote:

> Hi,
>
> On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > On Tue, 2019-03-19 at 11:22 -0400, Robert Haas wrote:
> > > It's torn pages that I am concerned about - the server is writing and
> > > we are reading, and we get a mix of old and new content.  We have been
> > > quite diligent about protecting ourselves from such risks elsewhere,
> > > and checksum verification should not be held to any lesser standard.
> >
> > If we see a checksum failure on an otherwise correctly read block in
> > online mode, we retry the block on the theory that we might have read a
> > torn page.  If the checksum verification still fails, we compare its LSN
> > to the LSN of the current checkpoint and don't mind if it's newer.  This
> > way, a torn page should not cause a false positive either way, I
> > think?
>
> False positives, no. But there's plenty potential for false
> negatives. In plenty of clusters a large fraction of the pages is going to
> be touched in most checkpoints.


How is it a false negative?  The page was in the middle of being written:
if we crash, the page won’t be used because it’ll be replayed over from the
checkpoint, and if we don’t crash then it also won’t be used until it’s been
written out completely.  I don’t agree that this is in any way a false
negative: it’s simply a page that happens to be in the middle of a file
that we can skip because it isn’t going to be used. It’s not like there’s
going to be a checksum failure if the backend reads it.

Not only that, but checksum failures and the like are much more likely to
happen on long-dormant data, not on data that’s actively being written out
and therefore is still in the Linux FS cache and hasn’t even hit actual
storage yet anyway.

>  If it is a genuine storage failure we will see it in the next
> > pg_checksums run as its LSN will be older than the checkpoint.
>
> Well, but also, by that time it might be too late to recover things. Or
> it might be a backup that you just made, that you later want to recover
> from, ...


If it’s a backup you just made then that page is going to be in the WAL and
the torn page on disk isn’t going to be used, so how is this an issue?
This is why we have WAL: to deal with torn pages.

> The basebackup checksum verification works in the same way.
>
> Shouldn't have been merged that way.


I have a hard time not finding this offensive.  These issues were
considered, discussed, and well thought out, with the result being
committed after agreement.

Do you have any example cases where the code in pg_basebackup has resulted
in either a false positive or a false negative?  Any case which can be
shown to result in either?

If not then I think we need to stop this, because if we can’t trust that a
torn page won’t be actually used in that torn state then it seems likely
that our entire WAL system is broken and we can’t trust the way we do
backups either and have to rewrite all of that to take precautions to lock
pages while doing a backup.

Thanks!

Stephen

>


Re: Online verification of checksums

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> On Tue, 2019-03-19 at 11:22 -0400, Robert Haas wrote:
> > It's torn pages that I am concerned about - the server is writing and
> > we are reading, and we get a mix of old and new content.  We have been
> > quite diligent about protecting ourselves from such risks elsewhere,
> > and checksum verification should not be held to any lesser standard.
> 
> If we see a checksum failure on an otherwise correctly read block in
> online mode, we retry the block on the theory that we might have read a
> torn page.  If the checksum verification still fails, we compare its LSN
> to the LSN of the current checkpoint and don't mind if it's newer.  This
> way, a torn page should not cause a false positive either way, I think.

False positives, no. But there's plenty of potential for false
negatives. In plenty of clusters a large fraction of the pages is going
to be touched in most checkpoints.


>  If it is a genuine storage failure we will see it in the next
> pg_checksums run as its LSN will be older than the checkpoint.

Well, but also, by that time it might be too late to recover things. Or
it might be a backup that you just made, that you later want to recover
from, ...


> The basebackup checksum verification works in the same way.

Shouldn't have been merged that way.

Greetings,

Andres Freund



Re: Online verification of checksums

2019-03-19 Thread Michael Banck
Hi,

Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> It's torn pages that I am concerned about - the server is writing and
> we are reading, and we get a mix of old and new content.  We have been
> quite diligent about protecting ourselves from such risks elsewhere,
> and checksum verification should not be held to any lesser standard.

If we see a checksum failure on an otherwise correctly read block in
online mode, we retry the block on the theory that we might have read a
torn page.  If the checksum verification still fails, we compare its LSN
to the LSN of the current checkpoint and don't mind if it's newer.  This
way, a torn page should not cause a false positive either way, I think.
If it is a genuine storage failure we will see it in the next
pg_checksums run as its LSN will be older than the checkpoint.  The
basebackup checksum verification works in the same way.

I am happy to look into further options for how to make things better,
but I am not sure what the actual problem might be that you mention
above. I will see whether I can stress-test the patch a bit more, but
I've already taxed the SSD on my company notebook quite a bit during the
development of this, so I will see whether I can get some real server
hardware somewhere.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Online verification of checksums

2019-03-19 Thread Robert Haas
On Mon, Mar 18, 2019 at 2:38 AM Stephen Frost  wrote:
> Sure the backend has those facilities since it needs to, but these
> frontend tools *don't* need that to *never* have any false positives, so
> why are we complicating things by saying that this frontend tool and the
> backend have to coordinate?
>
> If there's an explanation of why we can't avoid having false positives
> in the frontend tool, I've yet to see it.  I definitely understand that
> we can get partial reads, but a partial read isn't a failure, and
> shouldn't be reported as such.

I think there's some confusion between 'partial read' and 'torn page',
as Michael also said.

It's torn pages that I am concerned about - the server is writing and
we are reading, and we get a mix of old and new content.  We have been
quite diligent about protecting ourselves from such risks elsewhere,
and checksum verification should not be held to any lesser standard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online verification of checksums

2019-03-18 Thread Stephen Frost
Greetings,

On Tue, Mar 19, 2019 at 04:15 Michael Banck  wrote:

> Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost:
> > On Mon, Mar 18, 2019 at 15:52 Michael Banck 
> wrote:
> > > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > > > Thanks for that.  Reading through the code though, I don't entirely
> > > > understand why we're making things complicated for ourselves by
> trying
> > > > to seek and re-read the entire block, specifically this:
> > >
> > > [...]
> > >
> > > > I would think that we could just do:
> > > >
> > > >   insert_location = 0;
> > > >   r = read(BLCKSIZE - insert_location);
> > > >   if (r < 0) error();
> > > >   if (r == 0) EOF detected, move to next
> > > >   if (r < (BLCKSIZE - insert_location)) {
> > > > insert_location += r;
> > > > continue;
> > > >   }
> > > >
> > > > At this point, we should have a full block, do our checks...
> > >
> > > Well, we need to read() into some buffer which you have omitted.
> >
> > Surely there’s a buffer the read in the existing code is passing in,
> > you just need to offset by the current pointer, sorry for not being
> > clear.
> >
> > In other words the read would look more like:
> >
> > read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
> >
> > And then you have to reset insert_ptr once you have a full block.
>
> Ok, thanks for clearing that up.
>
> I've tried to do that now in the attached, does that suit you?


Yes, that’s what I was thinking.  I’m honestly not entirely convinced that
the lseek() efforts still need to be put in- I would have thought it’d be
fine to simply check the LSN on a checksum failure and mark it as skipped
if the LSN is past the current checkpoint.  That seems like it would make
things much simpler, but I’m also not against keeping that logic now that
it’s in, provided it doesn’t cause issues.

> > Yes, the question was more like this: have you ever seen a read return
> > a partial result when you know you’re in the middle somewhere of an
> > existing file and the length of the file hasn’t been changed by
> > something else..?
>
> I don't think I've seen that, but that wouldn't turn up in regular
> testing anyway, I guess, but only in pathological cases?  I guess we are
> probably dealing with this in the current version of the patch, but I
> can't say for certain as it sounds pretty difficult to test.


Yeah, a lot of things in this area are unfortunately difficult to test.
I’m glad to hear that it doesn’t sound like you’ve seen it though.

> I have also added a paragraph to the documentation about possibly
> skipping new or recently updated pages:
>
> +   If the cluster is online, pages that have been (re-)written since the last
> +   checkpoint will not count as checksum failures if they cannot be read or
> +   verified correctly.


I would flip this around:

——-
In an online cluster, pages are being concurrently written to the files
while the check is being run, leading to possible torn pages or partial
reads.  When the tool detects a concurrently written page, indicated by the
page’s LSN being beyond the checkpoint the tool started at, that page will
be reported as skipped.  Note that in a crash scenario, any pages written
since the last checkpoint will be replayed from the WAL.
——-

Now here’s the $64 question- have you tested this latest version under
load..?  If not, could you?  And when you do, can you report back what the
results are?  Do you still see any actual checksum failures?  Does the number
of skipped pages seem reasonable in your tests or is there a concern there?

If you still see actual checksum failures which aren’t because the LSN is
higher than the checkpoint, or because of a short read, then we need to
investigate further but hopefully that isn’t happening now.  I think a lot
of the concerns raised on this thread about wanting to avoid false
positives are because the torn page (with higher LSN than current
checkpoint) and short read cases were previously reported as failures when
they really are expected.  Let’s test this as much as we can and make sure
we aren’t seeing false positives anymore.

Thanks!

Stephen


Re: Online verification of checksums

2019-03-18 Thread Michael Banck
Hi,

Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost:
> On Mon, Mar 18, 2019 at 15:52 Michael Banck  wrote:
> > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > > Thanks for that.  Reading through the code though, I don't entirely
> > > understand why we're making things complicated for ourselves by trying
> > > to seek and re-read the entire block, specifically this:
> > 
> > [...]
> > 
> > > I would think that we could just do:
> > > 
> > >   insert_location = 0;
> > >   r = read(BLCKSIZE - insert_location);
> > >   if (r < 0) error();
> > >   if (r == 0) EOF detected, move to next
> > >   if (r < (BLCKSIZE - insert_location)) {
> > >     insert_location += r;
> > >     continue;
> > >   }
> > > 
> > > At this point, we should have a full block, do our checks...
> > 
Well, we need to read() into some buffer which you have omitted.
> 
> Surely there’s a buffer the read in the existing code is passing in,
> you just need to offset by the current pointer, sorry for not being
> clear.
> 
> In other words the read would look more like:
> 
> read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
> 
> And then you have to reset insert_ptr once you have a full block.

Ok, thanks for clearing that up.

I've tried to do that now in the attached, does that suit you?

> Yes, the question was more like this: have you ever seen a read return
> a partial result when you know you’re in the middle somewhere of an
> existing file and the length of the file hasn’t been changed by
> something else..?

I don't think I've seen that, but that wouldn't turn up in regular
testing anyway, I guess, but only in pathological cases?  I guess we are
probably dealing with this in the current version of the patch, but I
can't say for certain as it sounds pretty difficult to test.

I have also added a paragraph to the documentation about possibly
skipping new or recently updated pages:

+   If the cluster is online, pages that have been (re-)written since the last
+   checkpoint will not count as checksum failures if they cannot be read or
+   verified correctly.

Wording improvements welcome.


Michael


-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..c1fa167508 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -37,10 +37,15 @@ PostgreSQL documentation
   Description
   
pg_checksums verifies data checksums in a
-   PostgreSQL cluster.  The server must be shut
-   down cleanly before running pg_checksums.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   PostgreSQL cluster.  The exit status is zero if
+   there are no checksum errors, otherwise nonzero.  
   
+
+  
+   If the cluster is online, pages that have been (re-)written since the last
+   checkpoint will not count as checksum failures if they cannot be read or
+   verified correctly.
+  
  
 
  
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b7ebc11017..a82ce7d7b4 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -1,7 +1,7 @@
 /*-
  *
  * pg_checksums.c
- *	  Verifies page level checksums in an offline cluster.
+ *	  Verifies page level checksums in a cluster.
  *
  * Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -28,12 +28,16 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool online = false;
 
 static const char *progname;
 
@@ -90,10 +94,18 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	int			block_loc = 0;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 progname, fn, strerror(errno));
 		exit(1);
@@ -104,30 +116,112 @@ scan_file(const char *fn, BlockNumber segmentno)
 	for (blockno = 0;; blockno++)
 	{
 		uint16		csum;
-		int			r = read(f, buf.data, BLCKSZ);
+		int			r = read(f, buf.data + block_loc, BLCKSZ - block_loc);
 
 		if (r == 0)
+		{
+			/*
+			 * We had a short read and got an 

Re: Online verification of checksums

2019-03-18 Thread Robert Haas
On Mon, Mar 18, 2019 at 2:06 AM Michael Paquier  wrote:
> The mentions on this thread that the server has all the facility in
> place to properly lock a buffer and make sure that a partial read
> *never* happens and that we *never* have any kind of false positives,
> directly preventing the set of issues we are trying to implement
> workarounds for in a frontend tool are rather good arguments in my
> opinion (you can grep for BufferDescriptorGetIOLock() on this thread
> for example).

Yeah, exactly.  It may be that there is a good way to avoid those
issues without interacting with the server and that would be nice, but
... as far as I can see, nobody's figured out a way that's reliable
yet, and all of the solutions proposed so far basically amount to
"let's ignore things that might be serious problems because they might
be transient" and/or "let's retry and see if the problem goes away."
I'm more sanguine about a retry-based solution than an
ignore-possible-problems solution, but what's been proposed so far
seems quite prone to retrying so fast that it makes no difference, and
it's not clear how much code complexity we'd have to add to do better
or how reliable it would be even then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online verification of checksums

2019-03-18 Thread Stephen Frost
Greetings,

On Mon, Mar 18, 2019 at 15:52 Michael Banck  wrote:

> Hi.
>
> Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > * Michael Banck (michael.ba...@credativ.de) wrote:
> > > Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > > > * Michael Paquier (mich...@paquier.xyz) wrote:
> > > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > > > To be clear, I agree completely that we don't want to be
> > > > > > reporting false positives or "this might mean corruption!" to
> > > > > > users running the tool, but I haven't seen a good explanation of
> > > > > > why this needs to involve the server to avoid that happening.  If
> > > > > > someone would like to point that out to me, I'd be happy to go
> > > > > > read about it and try to understand.
> > > > >
> > > > > The mentions on this thread that the server has all the facility in
> > > > > place to properly lock a buffer and make sure that a partial read
> > > > > *never* happens and that we *never* have any kind of false
> > > > > positives,
> > > >
> > > > Uh, we are, of course, going to have partial reads- we just need to
> > > > handle them appropriately, and that's not hard to do in a way that we
> > > > never have false positives.
> > >
> > > I think the current patch (V13 from
> > > https://www.postgresql.org/message-id/1552045881.4947.43.ca...@credativ.de)
> > > does that, modulo possible bugs.
> >
> > I think the question here is- do you ever see false positives with this
> > latest version..?  If you are, then that's an issue and we should
> > discuss and try to figure out what's happening.  If you aren't seeing
> > false positives, then it seems like we're done here, right?
>
> What do you mean by false positives here? I've never seen a bogus
> checksum failure, i.e. pg_checksums claiming some checksum is wrong
> because it only read half of a block or a torn page.
>
> I do see sporadic partial reads and they get treated by the re-check
> logic and (if that is not enough) get tallied up as a skipped block in
> the end.  Is that a false positive in your book?


No, that’s clearly not a false positive.

[...]
>
> > > I have now rebased that patch on top of the pg_verify_checksums ->
> > > pg_checksums renaming, see attached.
> >
> > Thanks for that.  Reading through the code though, I don't entirely
> > understand why we're making things complicated for ourselves by trying
> > to seek and re-read the entire block, specifically this:
>
> [...]
>
> > I would think that we could just do:
> >
> >   insert_location = 0;
> >   r = read(BLCKSIZE - insert_location);
> >   if (r < 0) error();
> >   if (r == 0) EOF detected, move to next
> >   if (r < (BLCKSIZE - insert_location)) {
> > insert_location += r;
> > continue;
> >   }
> >
> > At this point, we should have a full block, do our checks...
>
> Well, we need to read() into some buffer which you have omitted.


Surely there’s a buffer the read in the existing code is passing in, you
just need to offset by the current pointer, sorry for not being clear.

In other words the read would look more like:

read(fd,buf + insert_ptr, BUFSZ - insert_ptr)

And then you have to reset insert_ptr once you have a full block.

> So if we had a short read, and then read the rest of the block via
> (BLCKSIZE - insert_location) wouldn't we have to read that in a second
> buffer and then join the two in order to compute the checksum?  That
> does not sound simpler to me than just re-reading the block entirely.


No, just read into your existing buffer at the point where the prior
partial read left off...
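
As a concrete sketch of that loop (illustrative only - verify_block() and
the error handling are placeholders, not code from the patch):

#include <unistd.h>

#define BLCKSZ 8192

extern void verify_block(const char *block);    /* placeholder checksum check */

static void
scan_fd(int fd)
{
    char    buf[BLCKSZ];
    int     insert_ptr = 0;

    for (;;)
    {
        ssize_t r = read(fd, buf + insert_ptr, BLCKSZ - insert_ptr);

        if (r < 0)
            return;             /* hard I/O error: report it and bail */
        if (r == 0)
            break;              /* EOF; insert_ptr > 0 here would mean a
                                 * trailing partial block, i.e. the file is
                                 * being extended concurrently */
        insert_ptr += r;
        if (insert_ptr < BLCKSZ)
            continue;           /* short read: keep accumulating */

        verify_block(buf);      /* full block assembled: run the checks */
        insert_ptr = 0;         /* reset for the next block */
    }
}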

> > Have you seen cases where the kernel will actually return a partial read
> > for something that isn't at the end of the file, and where you could
> > actually lseek past that point and read the next block?  I'd be really
> > curious to see that if you can reproduce it...  I've definitely seen
> > empty pages come back with a claim that the full amount was read, but
> > that's a very different thing.
>
> Well, I've seen partial reads and I have seen very rarely that it will
> continue to read another block afterwards.  If the relation is being
> extended while we check it, it sounds plausible that another block could
> be written before we get to read EOF on the next read() after a partial
> read() so that does not sound like a bug to me either.


Right, absolutely you can have a partial read during a relation extension
and then come back around and do another read and discover more data,
that’s entirely reasonable and I’ve seen it happen too.

I might be misunderstanding your question though?


Yes, the question was more like this: have you ever seen a read return a
partial result when you know you’re in the middle somewhere of an existing
file and the length of the file hasn’t been changed by something else..?  I
can’t say that I have, when reading from regular files, even in
kernel-error type of conditions due to hardware issues, but I’m open to
being told I’m 

Re: Online verification of checksums

2019-03-18 Thread Michael Banck
Hi.

Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> * Michael Banck (michael.ba...@credativ.de) wrote:
> > Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > > * Michael Paquier (mich...@paquier.xyz) wrote:
> > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > > To be clear, I agree completely that we don't want to be reporting
> > > > > false positives or "this might mean corruption!" to users running
> > > > > the tool, but I haven't seen a good explanation of why this needs
> > > > > to involve the server to avoid that happening.  If someone would
> > > > > like to point that out to me, I'd be happy to go read about it and
> > > > > try to understand.
> > > > 
> > > > The mentions on this thread that the server has all the facility in
> > > > place to properly lock a buffer and make sure that a partial read
> > > > *never* happens and that we *never* have any kind of false positives,
> > > 
> > > Uh, we are, of course, going to have partial reads- we just need to
> > > handle them appropriately, and that's not hard to do in a way that we
> > > never have false positives.
> > 
> > I think the current patch (V13 from
> > https://www.postgresql.org/message-id/1552045881.4947.43.ca...@credativ.de)
> > does that, modulo possible bugs.
> 
> I think the question here is- do you ever see false positives with this
> latest version..?  If you are, then that's an issue and we should
> discuss and try to figure out what's happening.  If you aren't seeing
> false positives, then it seems like we're done here, right?

What do you mean by false positives here? I've never seen a bogus
checksum failure, i.e. pg_checksums claiming some checksum is wrong
because it only read half of a block or a torn page.

I do see sporadic partial reads and they get treated by the re-check
logic and (if that is not enough) get tallied up as a skipped block in
the end.  Is that a false positive in your book?

[...]

> > I have now rebased that patch on top of the pg_verify_checksums ->
> > pg_checksums renaming, see attached.
> 
> Thanks for that.  Reading through the code though, I don't entirely
> understand why we're making things complicated for ourselves by trying
> to seek and re-read the entire block, specifically this:

[...]

> I would think that we could just do:
> 
>   insert_location = 0;
>   r = read(BLCKSIZE - insert_location);
>   if (r < 0) error();
>   if (r == 0) EOF detected, move to next
>   if (r < (BLCKSIZE - insert_location)) {
> insert_location += r;
> continue;
>   }
> 
> At this point, we should have a full block, do our checks...

Well, we need to read() into some buffer which you have omitted.

So if we had a short read, and then read the rest of the block via
(BLCKSIZE - insert_location) wouldn't we have to read that in a second
buffer and then join the two in order to compute the checksum?  That
does not sound simpler to me than just re-reading the block entirely.

> Have you seen cases where the kernel will actually return a partial read
> for something that isn't at the end of the file, and where you could
> actually lseek past that point and read the next block?  I'd be really
> curious to see that if you can reproduce it...  I've definitely seen
> empty pages come back with a claim that the full amount was read, but
> that's a very different thing.

Well, I've seen partial reads and I have seen very rarely that it will
continue to read another block afterwards.  If the relation is being
extended while we check it, it sounds plausible that another block could
be written before we get to read EOF on the next read() after a partial
read() so that does not sound like a bug to me either.

I might be misunderstanding your question though?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Online verification of checksums

2019-03-18 Thread Michael Banck
Hi,

Am Montag, den 18.03.2019, 08:18 +0100 schrieb Michael Banck:
> I have now rebased that patch on top of the pg_verify_checksums ->
> pg_checksums renaming, see attached.

Sorry, I had missed some hunks in the TAP tests, fixed-up patch
attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..124475f057 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -37,9 +37,8 @@ PostgreSQL documentation
   Description
   
pg_checksums verifies data checksums in a
-   PostgreSQL cluster.  The server must be shut
-   down cleanly before running pg_checksums.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   PostgreSQL cluster.  The exit status is zero if
+   there are no checksum errors, otherwise nonzero.  
   
  
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b7ebc11017..0ed065f7e9 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -1,7 +1,7 @@
 /*-
  *
  * pg_checksums.c
- *	  Verifies page level checksums in an offline cluster.
+ *	  Verifies page level checksums in a cluster.
  *
  * Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -28,12 +28,16 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool online = false;
 
 static const char *progname;
 
@@ -90,10 +94,17 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 progname, fn, strerror(errno));
 		exit(1);
@@ -108,26 +119,130 @@ scan_file(const char *fn, BlockNumber segmentno)
 
 		if (r == 0)
 			break;
+		if (r < 0)
+		{
+			skippedfiles++;
+			fprintf(stderr, _("%s: could not read block %u in file \"%s\": %s\n"),
+	progname, blockno, fn, strerror(errno));
+			return;
+		}
 		if (r != BLCKSZ)
 		{
-			fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
-	progname, blockno, fn, r, BLCKSZ);
-			exit(1);
+			if (online)
+			{
+if (block_retry)
+{
+	/* We already tried once to reread the block, skip to the next block */
+	skippedblocks++;
+	if (lseek(f, BLCKSZ-r, SEEK_CUR) == -1)
+	{
+		skippedfiles++;
+		fprintf(stderr, _("%s: could not lseek to next block in file \"%s\": %m\n"),
+progname, fn);
+		return;
+	}
+	continue;
+}
+
+/*
+ * Retry the block. It's possible that we read the block while it
+ * was extended or shrunk, so it ends up looking torn to us.
+ */
+
+/*
+ * Seek back by the amount of bytes we read to the beginning of
+ * the failed block.
+ */
+if (lseek(f, -r, SEEK_CUR) == -1)
+{
+	skippedfiles++;
+	fprintf(stderr, _("%s: could not lseek in file \"%s\": %m\n"),
+			progname, fn);
+	return;
+}
+
+/* Set flag so we know a retry was attempted */
+block_retry = true;
+
+/* Reset loop to validate the block again */
+blockno--;
+
+continue;
+			}
+			else
+			{
+skippedfiles++;
+fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
+		progname, blockno, fn, r, BLCKSZ);
+return;
+			}
 		}
-		blocks++;
 
 		/* New pages have no checksum yet */
 		if (PageIsNew(header))
+		{
+			skippedblocks++;
 			continue;
+		}
+
+		blocks++;
 
 		csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
 		if (csum != header->pd_checksum)
 		{
+			if (online)
+			{
+/*
+ * Retry the block on the first failure if online.  If the
+ * verification is done while the instance is online, it is
+ * possible that we read the first 4K page of the block just
+ * before postgres updated the entire block so it ends up
+ * looking torn to us.  We only need to retry once because the
+ * LSN should be updated to something we can ignore on the next
+ * pass.  If the error happens again then it is a true
+ 

Re: Online verification of checksums

2019-03-18 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Mon, Mar 18, 2019 at 02:38:10AM -0400, Stephen Frost wrote:
> > Uh, we are, of course, going to have partial reads- we just need to
> > handle them appropriately, and that's not hard to do in a way that we
> > never have false positives.
> 
> Ere, my apologies here.  I meant the read of a torn page, not a

In the case of a torn page, we should be able to check the LSN, as
discussed extensively previously, and if the LSN is from after the
checkpoint we started at then we should be fine to skip the page.
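
As a sketch in the style of the patch's scan_file() loop (illustrative
only, not the final code; checkpointLSN is the value read from pg_control):

    csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
    if (csum != header->pd_checksum)
    {
        /* A page (re)written after our starting checkpoint is covered by
         * WAL replay, so count it as skipped rather than corrupt. */
        if (online && PageGetLSN(buf.data) > checkpointLSN)
            skippedblocks++;
        else
            badblocks++;
    }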

> partial read (when extending the relation file we have locks
> preventing from a partial read as well by the way).

Yes, we do, in the backend...  We don't have (nor do we need) to get
involved in those locks for these tools though..

Thanks!

Stephen




Re: Online verification of checksums

2019-03-18 Thread Stephen Frost
Greetings,

* Michael Banck (michael.ba...@credativ.de) wrote:
> Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > * Michael Paquier (mich...@paquier.xyz) wrote:
> > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > To be clear, I agree completely that we don't want to be reporting false
> > > > positives or "this might mean corruption!" to users running the tool,
> > > > but I haven't seen a good explanation of why this needs to involve the
> > > > server to avoid that happening.  If someone would like to point that out
> > > > to me, I'd be happy to go read about it and try to understand.
> > > 
> > > The mentions on this thread that the server has all the facility in
> > > place to properly lock a buffer and make sure that a partial read
> > > *never* happens and that we *never* have any kind of false positives,
> > 
> > Uh, we are, of course, going to have partial reads- we just need to
> > handle them appropriately, and that's not hard to do in a way that we
> > never have false positives.
> 
> I think the current patch (V13 from https://www.postgresql.org/message-i
> d/1552045881.4947.43.ca...@credativ.de) does that, modulo possible bugs.

I think the question here is- do you ever see false positives with this
latest version..?  If you are, then that's an issue and we should
discuss and try to figure out what's happening.  If you aren't seeing
false positives, then it seems like we're done here, right?

> > I do not understand, at all, the whole sub-thread argument that we have
> > to avoid partial reads.  We certainly don't worry about that when doing
> > backups, and I don't see why we need to avoid it here.  We are going to
> > have partial reads- and that's ok, as long as it's because we're at the
> > end of the file, and that's easy enough to check by just doing another
> > read to see if we get back zero bytes, which indicates we're at the end
> > of the file, and then we move on, no need to coordinate anything with
> > the backend for this.
> 
> Well, I agree with you, but we don't seem to have consensus on that.

I feel like everyone is concerned that we'd report an acceptable partial
read as a failure, hence it would be a false positive, and I agree
entirely that we don't want false positives, but the answer to that
seems to be that we shouldn't report partial reads as failures, solving
the problem in a simple way that doesn't involve the server and doesn't
materially reduce the check that's being performed.

> > > directly preventing the set of issues we are trying to implement
> > > workarounds for in a frontend tool are rather good arguments in my
> > > opinion (you can grep for BufferDescriptorGetIOLock() on this thread 
> > > for example).
> > 
> > Sure the backend has those facilities since it needs to, but these
> > frontend tools *don't* need that to *never* have any false positives, so
> > why are we complicating things by saying that this frontend tool and the
> > backend have to coordinate?
> > 
> > If there's an explanation of why we can't avoid having false positives
> > in the frontend tool, I've yet to see it.  I definitely understand that
> > we can get partial reads, but a partial read isn't a failure, and
> > shouldn't be reported as such.
> 
> It is not in the current patch; it should just get reported as a skipped
> block in the end.  That is, if the cluster is online; if it is offline,
> we do consider it a failure.

Ok, that sounds fine- and do we ever see false positives now?

> I have now rebased that patch on top of the pg_verify_checksums ->
> pg_checksums renaming, see attached.

Thanks for that.  Reading through the code though, I don't entirely
understand why we're making things complicated for ourselves by trying
to seek and re-read the entire block, specifically this:

>   if (r != BLCKSZ)
>   {
> - fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
> - progname, blockno, fn, r, BLCKSZ);
> - exit(1);
> + if (online)
> + {
> + if (block_retry)
> + {
> + /* We already tried once to reread the block, skip to the next block */
> + skippedblocks++;
> + if (lseek(f, BLCKSZ-r, SEEK_CUR) == -1)
> + {
> + skippedfiles++;
> + fprintf(stderr, _("%s: could not lseek to next block in file \"%s\": %m\n"),
> + progname, fn);
> + return;
> + }
> + continue;
> + }
> +
> + 

Re: Online verification of checksums

2019-03-18 Thread Michael Paquier
On Mon, Mar 18, 2019 at 02:38:10AM -0400, Stephen Frost wrote:
> Uh, we are, of course, going to have partial reads- we just need to
> handle them appropriately, and that's not hard to do in a way that we
> never have false positives.

Ere, my apologies here.  I meant the read of a torn page, not a
partial read (when extending the relation file we have locks
preventing from a partial read as well by the way).
--
Michael




Re: Online verification of checksums

2019-03-18 Thread Michael Banck
Hi,

Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> * Michael Paquier (mich...@paquier.xyz) wrote:
> > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > To be clear, I agree completely that we don't want to be reporting false
> > > positives or "this might mean corruption!" to users running the tool,
> > > but I haven't seen a good explanation of why this needs to involve the
> > > server to avoid that happening.  If someone would like to point that out
> > > to me, I'd be happy to go read about it and try to understand.
> > 
> > The mentions on this thread that the server has all the facility in
> > place to properly lock a buffer and make sure that a partial read
> > *never* happens and that we *never* have any kind of false positives,
> 
> Uh, we are, of course, going to have partial reads- we just need to
> handle them appropriately, and that's not hard to do in a way that we
> never have false positives.

I think the current patch (V13 from
https://www.postgresql.org/message-id/1552045881.4947.43.ca...@credativ.de)
does that, modulo possible bugs.

> I do not understand, at all, the whole sub-thread argument that we have
> to avoid partial reads.  We certainly don't worry about that when doing
> backups, and I don't see why we need to avoid it here.  We are going to
> have partial reads- and that's ok, as long as it's because we're at the
> end of the file, and that's easy enough to check by just doing another
> read to see if we get back zero bytes, which indicates we're at the end
> of the file, and then we move on, no need to coordinate anything with
> the backend for this.

Well, I agree with you, but we don't seem to have consensus on that.

> > directly preventing the set of issues we are trying to implement
> > workarounds for in a frontend tool are rather good arguments in my
> > opinion (you can grep for BufferDescriptorGetIOLock() on this thread 
> > for example).
> 
> Sure the backend has those facilities since it needs to, but these
> frontend tools *don't* need that to *never* have any false positives, so
> why are we complicating things by saying that this frontend tool and the
> backend have to coordinate?
> 
> If there's an explanation of why we can't avoid having false positives
> in the frontend tool, I've yet to see it.  I definitely understand that
> we can get partial reads, but a partial read isn't a failure, and
> shouldn't be reported as such.

It is not in the current patch; it should just get reported as a skipped
block in the end.  That is, if the cluster is online; if it is offline,
we do consider it a failure.

I have now rebased that patch on top of the pg_verify_checksums ->
pg_checksums renaming, see attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..124475f057 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -37,9 +37,8 @@ PostgreSQL documentation
   Description
   
pg_checksums verifies data checksums in a
-   PostgreSQL cluster.  The server must be shut
-   down cleanly before running pg_checksums.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   PostgreSQL cluster.  The exit status is zero if
+   there are no checksum errors, otherwise nonzero.  
   
  
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b7ebc11017..0ed065f7e9 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -1,7 +1,7 @@
 /*-
  *
  * pg_checksums.c
- *	  Verifies page level checksums in an offline cluster.
+ *	  Verifies page level checksums in a cluster.
  *
  * Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -28,12 +28,16 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool online = false;
 
 static const char *progname;
 
@@ -90,10 +94,17 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, 

Re: Online verification of checksums

2019-03-18 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > To be clear, I agree completely that we don't want to be reporting false
> > positives or "this might mean corruption!" to users running the tool,
> > but I haven't seen a good explanation of why this needs to involve the
> > server to avoid that happening.  If someone would like to point that out
> > to me, I'd be happy to go read about it and try to understand.
> 
> The mentions on this thread that the server has all the facility in
> place to properly lock a buffer and make sure that a partial read
> *never* happens and that we *never* have any kind of false positives,

Uh, we are, of course, going to have partial reads- we just need to
handle them appropriately, and that's not hard to do in a way that we
never have false positives.

I do not understand, at all, the whole sub-thread argument that we have
to avoid partial reads.  We certainly don't worry about that when doing
backups, and I don't see why we need to avoid it here.  We are going to
have partial reads- and that's ok, as long as it's because we're at the
end of the file, and that's easy enough to check by just doing another
read to see if we get back zero bytes, which indicates we're at the end
of the file, and then we move on, no need to coordinate anything with
the backend for this.

> directly preventing the set of issues we are trying to implement
> workarounds for in a frontend tool are rather good arguments in my
> opinion (you can grep for BufferDescriptorGetIOLock() on this thread 
> for example).

Sure the backend has those facilities since it needs to, but these
frontend tools *don't* need that to *never* have any false positives, so
why are we complicating things by saying that this frontend tool and the
backend have to coordinate?

If there's an explanation of why we can't avoid having false positives
in the frontend tool, I've yet to see it.  I definitely understand that
we can get partial reads, but a partial read isn't a failure, and
shouldn't be reported as such.

Thanks!

Stephen




Re: Online verification of checksums

2019-03-18 Thread Michael Paquier
On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> To be clear, I agree completely that we don't want to be reporting false
> positives or "this might mean corruption!" to users running the tool,
> but I haven't seen a good explanation of why this needs to involve the
> server to avoid that happening.  If someone would like to point that out
> to me, I'd be happy to go read about it and try to understand.

The mentions on this thread that the server has all the facility in
place to properly lock a buffer and make sure that a partial read
*never* happens and that we *never* have any kind of false positives,
directly preventing the set of issues we are trying to implement
workarounds for in a frontend tool are rather good arguments in my
opinion (you can grep for BufferDescriptorGetIOLock() on this thread 
for example).
--
Michael




Re: Online verification of checksums

2019-03-17 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> If we want to run it from the server itself, then I guess a background
> worker would be a better solution. Incidentally, that's something I've
> been toying with some time ago, see [1].

So, I'm a big fan of this idea of having a background worker that's
running and (slowly, maybe configurably) scanning through the data
directory checking for corrupted pages.  I'd certainly prefer it if that
background worker didn't fault those pages into shared buffers though,
and I don't really think it should need to even check if a given page is
currently being written out or is presently in shared buffers.
Basically, I'd think it would work just fine to have it essentially do
what I am imagining pg_checksums to do, but as a background worker.

Thanks!

Stephen




Re: Online verification of checksums

2019-03-17 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 3/2/19 12:03 AM, Robert Haas wrote:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck  wrote:
> >> I have added a retry for this as well now, without a pg_sleep() as well.
> >> This catches around 80% of the half-reads, but a few slip through. At
> >> that point we bail out with exit(1), and the user can try again, which I
> >> think is fine?
> > 
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
> 
> FWIW I don't think this qualifies as torn page - i.e. it's not a full
> read with a mix of old and new data. This is a partial write, most likely
> because we read the blocks one by one, and when we hit the last page
> while the table is being extended, we may only see the first 4kB. And if
> we retry very fast, we may still see only the first 4kB.

I really still am not following why this is such an issue- we do a read,
get back 4KB, do another read, check if it's zero, and if so then we
should be able to conclude that we're at the end of the file, no?  If
we're at the end of the file and we don't have a final complete block to
run a checksum check on then it seems clear to me that the file was
being extended and it's ok to skip that block.  We could also stat the
file and keep track of where we are, to detect such an extension of the
file happening, if we wanted an additional cross-check, couldn't we?  If
we do a read and get 4KB back and then do another and get 4KB back, then
we just treat it like we would an 8KB block.  Really, as long as a
subsequent read is returning bytes then we keep going, and if it returns
zero then it's EOF.  I could maybe see a "one final read" option, but I
don't think it makes sense to have some kind of time-based delay around
this where we keep trying to read.
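
(As a rough illustration of that stat() cross-check - nothing like this is
in the current patch; it reuses the patch's counters and assumes
<sys/stat.h> is included:)

    struct stat st;

    /* Remember the apparent size when we opened the file. */
    if (fstat(f, &st) < 0)
        st.st_size = 0;             /* no size info: fall back to EOF logic */

    /* ... later, inside the read loop, after a short read of r bytes: */
    if ((off_t) blockno * BLCKSZ + r >= st.st_size)
        skippedblocks++;            /* trailing partial block past the size
                                     * we started with: the file grew under
                                     * us, so this is extension, not
                                     * corruption */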

All of this about hacking up a way to connect to PG and lock pages in
shared buffers so that we can perform a checksum check seems really
rather ridiculous for either the extension case or the regular mid-file
torn-page case.

To be clear, I agree completely that we don't want to be reporting false
positives or "this might mean corruption!" to users running the tool,
but I haven't seen a good explanation of why this needs to involve the
server to avoid that happening.  If someone would like to point that out
to me, I'd be happy to go read about it and try to understand.

Thanks!

Stephen




Re: Online verification of checksums

2019-03-08 Thread Julien Rouhaud
On Fri, Mar 8, 2019 at 6:50 PM Tomas Vondra  wrote:
>
> On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> > On Thu, Mar 7, 2019 at 7:00 PM Andres Freund  wrote:
> >>
> >> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >>>
> >>> But then again, we could just
> >>> hack a special version of ReadBuffer_common() which would just
> >>
> >>> (a) check if a page is in shared buffers, and if it is then consider the
> >>> checksum correct (because in memory it may be stale, and it was read
> >>> successfully so it was OK at that moment)
> >>>
> >>> (b) if it's not in shared buffers already, try reading it and verify the
> >>> checksum, and then just evict it right away (not to spoil sb)
> >>
> >> This'd also make sense and make the whole process more efficient. OTOH,
> >> it might actually be worthwhile to check the on-disk page even if
> >> there's in-memory state. Unless IO is in progress the on-disk page
> >> always should be valid.
> >
> > Definitely.  I already saw servers with all-frozen-read-only blocks
> > popular enough to never get evicted in months, and then a minor
> > upgrade / restart having catastrophic consequences.
> >
>
> Do I understand correctly that the "catastrophic consequences" here are due
> to data corruption / broken checksums on those on-disk pages?

Ah, yes, sorry, I should have been clearer.  Indeed, there was silent
data corruption (no checksums though) that was revealed by the
restart.  So a routine minor update resulted in a massive outage.
Such a scenario can't be avoided if we always bypass the checksum check
for pages already in shared_buffers.



Re: Online verification of checksums

2019-03-08 Thread Tomas Vondra
On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> On Thu, Mar 7, 2019 at 7:00 PM Andres Freund  wrote:
>>
>> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
>>>
>>> But then again, we could just
>>> hack a special version of ReadBuffer_common() which would just
>>
>>> (a) check if a page is in shared buffers, and if it is then consider the
>>> checksum correct (because in memory it may be stale, and it was read
>>> successfully so it was OK at that moment)
>>>
>>> (b) if it's not in shared buffers already, try reading it and verify the
>>> checksum, and then just evict it right away (not to spoil sb)
>>
>> This'd also make sense and make the whole process more efficient. OTOH,
>> it might actually be worthwhile to check the on-disk page even if
>> there's in-memory state. Unless IO is in progress the on-disk page
>> always should be valid.
> 
> Definitely.  I already saw servers with all-frozen-read-only blocks
> popular enough to never get evicted in months, and then a minor
> upgrade / restart having catastrophic consequences.
> 

Do I understand correctly that the "catastrophic consequences" here are due
to data corruption / broken checksums on those on-disk pages?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-08 Thread Julien Rouhaud
On Thu, Mar 7, 2019 at 7:00 PM Andres Freund  wrote:
>
> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >
> > But then again, we could just
> > hack a special version of ReadBuffer_common() which would just
>
> > (a) check if a page is in shared buffers, and if it is then consider the
> > checksum correct (because in memory it may be stale, and it was read
> > successfully so it was OK at that moment)
> >
> > (b) if it's not in shared buffers already, try reading it and verify the
> > checksum, and then just evict it right away (not to spoil sb)
>
> This'd also make sense and make the whole process more efficient. OTOH,
> it might actually be worthwhile to check the on-disk page even if
> there's in-memory state. Unless IO is in progress the on-disk page
> always should be valid.

Definitely.  I already saw servers with all-frozen-read-only blocks
popular enough to never get evicted in months, and then a minor
upgrade / restart having catastrophic consequences.



Re: Online verification of checksums

2019-03-08 Thread Michael Banck
Hi,

Am Sonntag, den 03.03.2019, 11:51 +0100 schrieb Michael Banck:
> Am Samstag, den 02.03.2019, 11:08 -0500 schrieb Stephen Frost:
> > I'm not necessairly against skipping to the next file, to be clear,
> > but I think I'd be happier if we kept reading the file until we
> > actually get EOF.
> 
> So if we read half a block twice we should seek() to the next block and
> continue till EOF, ok. I think in most cases those pages will be new
> anyway and there will be no checksum check, but it sounds like a cleaner
> approach. I've seen one or two examples where we did successfully verify
> the checksum of a page after a half-read, so it might be worth it.

I've done that now, i.e. it seeks to the next block and continues to
read there (possibly getting an EOF).

I don't issue a warning for this skipped block anymore as it is somewhat
to be expected that we see some half-reads. If the seek fails for some
reason, that is still a warning.

> I still think that an external checksum verification tool has some
> merit, given that basebackup does it and the current offline requirement
is really not useful in practice.

I've read the rest of the thread, and it seems several people prefer a
solution that interacts with the server. I won't be able to work on that
for v12 and I guess it would be too late in the cycle anyway.

I thought about I/O throttling in online mode, but it seems to be most
easily tied in with the progress reporting (that already keeps track of
everything or most of what we'd need), so I will work on it in that
context.



Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_verify_checksums.sgml b/doc/src/sgml/ref/pg_verify_checksums.sgml
index 905b8f1222..4ad6edcde6 100644
--- a/doc/src/sgml/ref/pg_verify_checksums.sgml
+++ b/doc/src/sgml/ref/pg_verify_checksums.sgml
@@ -37,9 +37,8 @@ PostgreSQL documentation
   Description
   
pg_verify_checksums verifies data checksums in a
-   PostgreSQL cluster.  The server must be shut
-   down cleanly before running pg_verify_checksums.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   PostgreSQL cluster.  The exit status is zero if
+   there are no checksum errors, otherwise nonzero.  
   
  
 
diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 511262ab5f..c61bd19bf9 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -1,7 +1,7 @@
 /*
  * pg_verify_checksums
  *
- * Verifies page level checksums in an offline cluster
+ * Verifies page level checksums in a cluster
  *
  *	Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -24,12 +24,16 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool online = false;
 
 static const char *progname;
 
@@ -86,10 +90,17 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 progname, fn, strerror(errno));
 		exit(1);
@@ -104,26 +115,130 @@ scan_file(const char *fn, BlockNumber segmentno)
 
 		if (r == 0)
 			break;
+		if (r < 0)
+		{
+			skippedfiles++;
+			fprintf(stderr, _("%s: could not read block %u in file \"%s\": %s\n"),
+	progname, blockno, fn, strerror(errno));
+			return;
+		}
 		if (r != BLCKSZ)
 		{
-			fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
-	progname, blockno, fn, r, BLCKSZ);
-			exit(1);
+			if (online)
+			{
+if (block_retry)
+{
+	/* We already tried once to reread the block, skip to the next block */
+	skippedblocks++;
+	if (lseek(f, BLCKSZ-r, SEEK_CUR) == -1)
+	{
+		skippedfiles++;
+		fprintf(stderr, _("%s: could not lseek to next block in file \"%s\": %m\n"),
+progname, fn);
+		return;
+	}
+	continue;
+}
+
+/*
+ * Retry the block. It's possible that we read the block while it
+ * was extended or shrunk, so it ends up looking torn to us.
+ */
+
+/*
+ * Seek back by the amount of bytes we read to the 

Re: Online verification of checksums

2019-03-07 Thread Andres Freund
Hi,

On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> On 3/6/19 6:42 PM, Andres Freund wrote:
> >
> > ...
> >
> > To me the right way seems to be to IO lock the page via PG after such a
> > failure, and then retry. Which should be relatively easily doable for
> > the basebackup case, but obviously harder for the pg_verify_checksums
> > case.
> > 
> 
> Actually, what do you mean by "IO lock the page"? Just waiting for the
> current IO to complete (essentially BM_IO_IN_PROGRESS)? Or essentially
> acquiring a lock and holding it for the duration of the check?

The latter. And with IO lock I meant BufferDescriptorGetIOLock(), in
contrast to a buffer's content lock. That way we wouldn't block
modifications to the in-memory page.


> The former does not really help, because there might be another I/O request
> initiated right after, interfering with the retry.
> 
> The latter might work, assuming the check is fast (which it probably is). I
> wonder if this might cause issues due to loading possibly corrupted data
> (with invalid checksums) into shared buffers.

Oh, I was basically thinking that we'd just reread from disk outside of
postgres in that case, while preventing postgres related IO by holding
the IO lock.

But:

> But then again, we could just
> hack a special version of ReadBuffer_common() which would just

> (a) check if a page is in shared buffers, and if it is then consider the
> checksum correct (because in memory it may be stale, and it was read
> successfully so it was OK at that moment)
> 
> (b) if it's not in shared buffers already, try reading it and verify the
> checksum, and then just evict it right away (not to spoil sb)

This'd also make sense and make the whole process more efficient. OTOH,
it might actually be worthwhile to check the on-disk page even if
there's in-memory state. Unless IO is in progress the on-disk page
always should be valid.

Greetings,

Andres Freund
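
For concreteness, a hedged sketch of that reread-under-I/O-lock idea; the
lock and reread helpers are illustrative stand-ins for the real
buffer-manager plumbing (BufferDescriptorGetIOLock() and friends in the
code of that era), not actual PostgreSQL APIs:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

extern void acquire_page_io_lock(BlockNumber blkno);				/* assumed */
extern void release_page_io_lock(BlockNumber blkno);				/* assumed */
extern bool reread_page_from_disk(BlockNumber blkno, char *page);	/* assumed */
extern bool checksum_matches(const char *page, BlockNumber blkno);	/* assumed */

static bool
recheck_under_io_lock(BlockNumber blkno, char *page)
{
	bool		ok;

	acquire_page_io_lock(blkno);

	/*
	 * No buffer I/O can start on this page while we hold the I/O lock,
	 * but in-memory modifications under the content lock stay possible,
	 * which is exactly the point: we only stabilize the on-disk copy.
	 */
	ok = reread_page_from_disk(blkno, page) && checksum_matches(page, blkno);

	release_page_io_lock(blkno);
	return ok;
}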



Re: Online verification of checksums

2019-03-07 Thread Tomas Vondra

On 3/6/19 6:42 PM, Andres Freund wrote:
>
> ...
>
> To me the right way seems to be to IO lock the page via PG after such a
> failure, and then retry. Which should be relatively easily doable for
> the basebackup case, but obviously harder for the pg_verify_checksums
> case.



Actually, what do you mean by "IO lock the page"? Just waiting for the 
current IO to complete (essentially BM_IO_IN_PROGRESS)? Or essentially 
acquiring a lock and holding it for the duration of the check?


The former does not really help, because there might be another I/O 
request initiated right after, interfering with the retry.


The latter might work, assuming the check is fast (which it probably 
is). I wonder if this might cause issues due to loading possibly 
corrupted data (with invalid checksums) into shared buffers. But then 
again, we could just hack a special version of ReadBuffer_common() which 
would just


(a) check if a page is in shared buffers, and if it is then consider the 
checksum correct (because in memory it may be stale, and it was read 
successfully so it was OK at that moment)


(b) if it's not in shared buffers already, try reading it and verify the 
checksum, and then just evict it right away (not to spoil sb)


Or did you have something else in mind?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
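
A hedged sketch of the (a)/(b) logic above; page_in_shared_buffers(),
read_page_from_disk() and page_stored_checksum() are illustrative
stand-ins, while pg_checksum_page() mirrors the routine declared in
PostgreSQL's storage/checksum.h.  Reading straight from disk instead of
through shared buffers makes the eviction step in (b) unnecessary:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

extern bool page_in_shared_buffers(BlockNumber blkno);			/* assumed */
extern bool read_page_from_disk(BlockNumber blkno, char *page);	/* assumed */
extern uint16_t page_stored_checksum(const char *page);			/* assumed */
extern uint16_t pg_checksum_page(char *page, BlockNumber blkno);

static bool
verify_block(BlockNumber blkno, char *page)
{
	/* (a) cached: the on-disk copy may be stale, so trust the cached page */
	if (page_in_shared_buffers(blkno))
		return true;

	/* (b) not cached: read from disk and verify the checksum */
	if (!read_page_from_disk(blkno, page))
		return false;

	return pg_checksum_page(page, blkno) == page_stored_checksum(page);
}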



Re: Online verification of checksums

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 08:53:57PM +0100, Tomas Vondra wrote:
> Not sure. AFAICS that would require a single transaction, and if we
> happen to add some sort of throttling (which is a feature request I'd
> expect pretty soon to make it usable on live clusters) that might be
> quite long-running. So, not great.
> 
> If we want to run it from the server itself, then I guess a background
> worker would be a better solution. Incidentally, that's something I've
> been toying with some time ago, see [1].

It does not prevent having a SQL function which acts as a wrapper on
top of the whole routine logic, does it?  I think that it would be
nice to have the possibility to target a specific relation and a
specific page, as well as being able to check fully a relation at
once.  It gets easier to check for page ranges this way, and the
throttling can be part of the function doing a full-relation check.
--
Michael


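
A sketch of what such a throttled page-range check could look like;
check_one_page() is an assumed helper, and the batch size and delay are
arbitrary illustration values:

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

typedef uint32_t BlockNumber;

extern bool check_one_page(int fd, BlockNumber blkno);	/* assumed */

/*
 * Verify blocks [first, last], sleeping briefly between batches so the
 * check can run against a live cluster without saturating I/O.
 */
static uint64_t
check_page_range(int fd, BlockNumber first, BlockNumber last)
{
	uint64_t	bad = 0;

	for (BlockNumber blkno = first; blkno <= last; blkno++)
	{
		if (!check_one_page(fd, blkno))
			bad++;
		if ((blkno - first + 1) % 128 == 0)
			usleep(10 * 1000);	/* throttle: 10 ms pause every 128 blocks */
	}
	return bad;
}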


Re: Online verification of checksums

2019-03-06 Thread Tomas Vondra
On 3/6/19 8:41 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote:
>> Not sure how to integrate it into the CLI tool, though. Perhaps it
>> could require connection info so that it can execute a function when
>> executed in online mode?
> 
> To me the right fix would be to simply have this run as part of the
> cluster / in a function. I don't see much point in running this outside
> of the cluster.
> 

Not sure. AFAICS that would require a single transaction, and if we
happen to add some sort of throttling (which is a feature request I'd
expect pretty soon to make it usable on live clusters) that might be
quite long-running. So, not great.

If we want to run it from the server itself, then I guess a background
worker would be a better solution. Incidentally, that's something I've
been toying with some time ago, see [1].


[1] https://github.com/tvondra/scrub

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote:
> Not sure how to integrate it into the CLI tool, though. Perhaps it
> could require connection info so that it can execute a function when
> executed in online mode?

To me the right fix would be to simply have this run as part of the
cluster / in a function. I don't see much point in running this outside
of the cluster.

Greetings,

Andres Freund



Re: Online verification of checksums

2019-03-06 Thread Tomas Vondra



On 3/6/19 6:42 PM, Andres Freund wrote:
> On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
>> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck  
>> wrote:
>>> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
 On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
  wrote:
> I have added a retry for this as well now, without a pg_sleep() as well.
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?

 Maybe I'm confused here, but catching 80% of torn pages doesn't sound
 robust at all.
>>>
>>> The chance that pg_verify_checksums hits a torn page (at least in my
>>> tests, see below) is already pretty low, a couple of times per 1000
>>> runs. Maybe 4 out 5 times, the page is read fine on retry and we march
>>> on. Otherwise, we now just issue a warning and skip the file (or so was
>>> the idea, see below), do you think that is not acceptable?
>>
>> Yeah.  Consider a paranoid customer with 100 clusters who runs this
>> every day on every cluster.  They're going to see failures every day
>> or three and go ballistic.
> 
> +1
> 
> 
>> I suspect that better retry logic might help here.  I mean, I would
>> guess that 10 retries at 1 second intervals or something of that sort
>> would be enough to virtually eliminate false positives while still
>> allowing us to report persistent -- and thus real -- problems.  But if
>> even that is going to produce false positives with any measurable
>> probability different from zero, then I think we have a problem,
>> because I neither like a verification tool that ignores possible signs
>> of trouble nor one that "cries wolf" when things are fine.
> 
> To me the right way seems to be to IO lock the page via PG after such a
> failure, and then retry. Which should be relatively easily doable for
> the basebackup case, but obviously harder for the pg_verify_checksums
> case.
> 

Yes, if we could ensure the retry happens after completing the current
I/O on the page (without actually initiating a read into shared buffers)
that would work I think - both for partial reads and torn pages.

Not sure how to integrate it into the CLI tool, though. Perhaps it
could require connection info so that it can execute a function when
executed in online mode?

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-06 Thread Tomas Vondra
On 3/6/19 6:26 PM, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra
>  wrote:
>> FWIW I don't think this qualifies as torn page - i.e. it's not a full
>> read with a mix of old and new data. This is partial write, most likely
>> because we read the blocks one by one, and when we hit the last page
> while the table is being extended, we may only see the first 4kB. And if
>> we retry very fast, we may still see only the first 4kB.
> 
> I see the distinction you're making, and you're right.  The problem
> is, whether in this case or whether for a real torn page, we don't
> seem to have a way to distinguish between a state that occurs
> transiently due to lack of synchronization and a situation that is
> permanent and means that we have corruption.  And that worries me,
> because it means we'll either report bogus complaints that will scare
> easily-panicked users (and anybody who is running this tool has a good
> chance of being in the "easily-panicked" category ...), or else we'll
> skip reporting real problems.  Neither is good.
> 

Sure, I'd also prefer having a tool that reliably detects all cases of
data corruption, and I certainly do share your concerns about false
positives and false negatives.

But maybe we shouldn't expect a tool meant to verify checksums to detect
various other issues.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-06 Thread Andres Freund
On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck  
> wrote:
> > Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > >  wrote:
> > > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > > This catches around 80% of the half-reads, but a few slip through. At
> > > > that point we bail out with exit(1), and the user can try again, which I
> > > > think is fine?
> > >
> > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > > robust at all.
> >
> > The chance that pg_verify_checksums hits a torn page (at least in my
> > tests, see below) is already pretty low, a couple of times per 1000
> > runs. Maybe 4 out 5 times, the page is read fine on retry and we march
> > on. Otherwise, we now just issue a warning and skip the file (or so was
> > the idea, see below), do you think that is not acceptable?
> 
> Yeah.  Consider a paranoid customer with 100 clusters who runs this
> every day on every cluster.  They're going to see failures every day
> or three and go ballistic.

+1


> I suspect that better retry logic might help here.  I mean, I would
> guess that 10 retries at 1 second intervals or something of that sort
> would be enough to virtually eliminate false positives while still
> allowing us to report persistent -- and thus real -- problems.  But if
> even that is going to produce false positives with any measurable
> probability different from zero, then I think we have a problem,
> because I neither like a verification tool that ignores possible signs
> of trouble nor one that "cries wolf" when things are fine.

To me the right way seems to be to IO lock the page via PG after such a
failure, and then retry. Which should be relatively easily doable for
the basebackup case, but obviously harder for the pg_verify_checksums
case.

Greetings,

Andres Freund



Re: Online verification of checksums

2019-03-06 Thread Robert Haas
On Sat, Mar 2, 2019 at 5:45 AM Michael Banck  wrote:
> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> >  wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> >
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
>
> The chance that pg_verify_checksums hits a torn page (at least in my
> tests, see below) is already pretty low, a couple of times per 1000
> runs. Maybe 4 out 5 times, the page is read fine on retry and we march
> on. Otherwise, we now just issue a warning and skip the file (or so was
> the idea, see below), do you think that is not acceptable?

Yeah.  Consider a paranoid customer with 100 clusters who runs this
every day on every cluster.  They're going to see failures every day
or three and go ballistic.

I suspect that better retry logic might help here.  I mean, I would
guess that 10 retries at 1 second intervals or something of that sort
would be enough to virtually eliminate false positives while still
allowing us to report persistent -- and thus real -- problems.  But if
even that is going to produce false positives with any measurable
probability different from zero, then I think we have a problem,
because I neither like a verification tool that ignores possible signs
of trouble nor one that "cries wolf" when things are fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
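
A hedged sketch of that retry scheme, with the 10 attempts and 1-second
interval suggested above; read_block() and block_is_valid() stand in for
the real I/O and checksum verification (pg_checksum_page() in PostgreSQL):

#include <stdbool.h>
#include <unistd.h>

#define MAX_RETRIES 10

extern bool read_block(int fd, unsigned blockno, char *buf);	/* assumed */
extern bool block_is_valid(const char *buf, unsigned blockno);	/* assumed */

static bool
recheck_block(int fd, unsigned blockno, char *buf)
{
	for (int attempt = 0; attempt < MAX_RETRIES; attempt++)
	{
		if (attempt > 0)
			sleep(1);			/* wait out concurrent I/O between attempts */
		if (read_block(fd, blockno, buf) && block_is_valid(buf, blockno))
			return true;		/* the mismatch was transient */
	}
	return false;				/* persistent, so report it as corruption */
}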



Re: Online verification of checksums

2019-03-06 Thread Robert Haas
On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra
 wrote:
> FWIW I don't think this qualifies as torn page - i.e. it's not a full
> read with a mix of old and new data. This is partial write, most likely
> because we read the blocks one by one, and when we hit the last page
> while the table is being extended, we may only see the first 4kB. And if
> we retry very fast, we may still see only the first 4kB.

I see the distinction you're making, and you're right.  The problem
is, whether in this case or whether for a real torn page, we don't
seem to have a way to distinguish between a state that occurs
transiently due to lack of synchronization and a situation that is
permanent and means that we have corruption.  And that worries me,
because it means we'll either report bogus complaints that will scare
easily-panicked users (and anybody who is running this tool has a good
chance of being in the "easily-panicked" category ...), or else we'll
skip reporting real problems.  Neither is good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online verification of checksums

2019-03-05 Thread Stephen Frost
Greetings,

On Tue, Mar 5, 2019 at 18:36 Michael Paquier  wrote:

> On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> > Based on quickly skimming that thread the main issue seems to be
> > deciding which files in the data directory are expected to have
> > checksums. Which is a valid issue, of course, but I was expecting
> > something about partial read/writes etc.
>
> I remember complaining about partial write handling as well for the
> base backup checks...  There should be an email about it on the list,
> cannot find it now ;p
>
> > My understanding is that:
> >
> > (a) The checksum verification should not generate false positives (same
> > as for basebackup).
> >
> > (b) The partial reads do emit warnings, which might be considered false
> > positives I guess. Which is why I'm arguing for changing it to do the
> > same thing basebackup does, i.e. ignore this.
>
> Well, at least that's consistent...  Argh, I really think that we
> ought to make the reported failures harder (errors rather than
> warnings) because that's easier to detect within a tool and some
> deployments set log_min_messages > WARNING so checksum failures
> would just be lost.
> don't care much about that as files are just blindly copied so they
> could have torn pages, which is fine as that's fixed at replay.  Now
> we are talking about a set of tools which could have reliable
> detection mechanisms for those problems.


I’m traveling but will try to comment more in the coming days but in
general I agree with Tomas on these items. Also, pg_basebackup has to
handle torn pages when it comes to checksums just like the verify tool
does, and having them be consistent (along with external tools) would
really be for the best, imv.  I still feel like a retry of a short read
(try reading more to get the whole page..) would be alright and reading
until we hit eof and then moving on. I’m not sure it’s possible but I do
worry a bit that we might get a short read from a network file system or
something that isn’t actually at eof and then we would skip a significant
remaining portion of the file...   another thought might be to stat the
file after we have opened it to see its length...

Just a few thoughts since I’m on my phone.  Will try to write up something
more in a day or two.

Thanks!

Stephen
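
Both ideas fit in a short POSIX sketch (BLCKSZ = 8192 assumed): keep
reading until the page is complete or EOF is confirmed, and use fstat()
on the already-open file to double-check an apparent EOF:

#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

static bool
read_full_block(int fd, off_t offset, char *buf)
{
	size_t		got = 0;

	while (got < BLCKSZ)
	{
		ssize_t		r = pread(fd, buf + got, BLCKSZ - got, offset + (off_t) got);

		if (r < 0)
			return false;		/* hard error */
		if (r == 0)
		{
			struct stat st;

			/*
			 * Apparent EOF: confirm against the file's current length, so
			 * a short read from e.g. a network filesystem does not make us
			 * skip the remainder of the file.
			 */
			if (fstat(fd, &st) == 0 && st.st_size > offset + (off_t) got)
				continue;		/* more data should exist, retry */
			return false;		/* genuine EOF */
		}
		got += (size_t) r;
	}
	return true;
}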


Re: Online verification of checksums

2019-03-05 Thread Michael Paquier
On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> Based on quickly skimming that thread the main issue seems to be
> deciding which files in the data directory are expected to have
> checksums. Which is a valid issue, of course, but I was expecting
> something about partial read/writes etc.

I remember complaining about partial write handling as well for the
base backup checks...  There should be an email about it on the list,
cannot find it now ;p

> My understanding is that:
> 
> (a) The checksum verification should not generate false positives (same
> as for basebackup).
> 
> (b) The partial reads do emit warnings, which might be considered false
> positives I guess. Which is why I'm arguing for changing it to do the
> same thing basebackup does, i.e. ignore this.

Well, at least that's consistent...  Argh, I really think that we
ought to make the reported failures harder (errors rather than
warnings) because that's easier to detect within a tool and some
deployments set log_min_messages > WARNING so checksum failures
would just be lost.
don't care much about that as files are just blindly copied so they
could have torn pages, which is fine as that's fixed at replay.  Now
we are talking about a set of tools which could have reliable
detection mechanisms for those problems.
--
Michael




Re: Online verification of checksums

2019-03-05 Thread Tomas Vondra
On 3/5/19 4:12 AM, Michael Paquier wrote:
> On Mon, Mar 04, 2019 at 03:08:09PM +0100, Tomas Vondra wrote:
>> I still don't understand what issue you see in how basebackup verifies
>> checksums. Can you point me to the explanation you've sent after 11 was
>> released?
> 
> The history is mostly on this thread:
> https://www.postgresql.org/message-id/20181020044248.gd2...@paquier.xyz
> 

Thanks, will look.

Based on quickly skimming that thread the main issue seems to be
deciding which files in the data directory are expected to have
checksums. Which is a valid issue, of course, but I was expecting
something about partial read/writes etc.

>> So you have a workload/configuration that actually results in data
>> corruption yet we fail to detect that? Or we generate false positives?
>> Or what do you mean by "100% safe" here?
> 
> What's proposed on this thread could generate false positives.  Checks
> which have deterministic properties and clean failure handling are
> reliable when it comes to reports.

My understanding is that:

(a) The checksum verification should not generate false positives (same
as for basebackup).

(b) The partial reads do emit warnings, which might be considered false
positives I guess. Which is why I'm arguing for changing it to do the
same thing basebackup does, i.e. ignore this.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


