Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-12 Thread Andrey Borodin



> On 12 Jan 2021, at 13:49, Noah Misch wrote:
> 
> What do you think of abandoning slru-truncate-t-insurance entirely?  As of
> https://postgr.es/m/20200330052809.gb2324...@rfd.leadboat.com I liked the idea
> behind it, despite its complicating the system for hackers and DBAs.  The
> TruncateMultiXact() interaction rendered it less appealing.  In v14+, commit
> cd5e822 mitigates the kind of bugs that slru-truncate-t-insurance mitigates,
> further reducing the latter's value.  slru-truncate-t-insurance does mitigate
> larger trespasses into unlink-eligible space, though.
It seems to me that not committing the insurance patch would not be a mistake.
Let's abandon slru-truncate-t-insurance for now.

>> Fix  is certainly worth backpatching.
> 
> I'll push it on Saturday, probably.
Thanks!

Best regards, Andrey Borodin.



Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-12 Thread Noah Misch
On Mon, Jan 11, 2021 at 11:22:05AM +0500, Andrey Borodin wrote:
> I'm marking the patch as ready for committer.

Thanks.

> I can't tell whether we should backpatch the insurance patch or not: it
> potentially fixes unknown bugs, and potentially contains unknown bugs. I
> can't reason under such uncertainty. I've tried to look for any potential
> problem, and so far I see none. Chances are  is making the
> code less error-prone.

What do you think of abandoning slru-truncate-t-insurance entirely?  As of
https://postgr.es/m/20200330052809.gb2324...@rfd.leadboat.com I liked the idea
behind it, despite its complicating the system for hackers and DBAs.  The
TruncateMultiXact() interaction rendered it less appealing.  In v14+, commit
cd5e822 mitigates the kind of bugs that slru-truncate-t-insurance mitigates,
further reducing the latter's value.  slru-truncate-t-insurance does mitigate
larger trespasses into unlink-eligible space, though.

> Fix  is certainly worth backpatching.

I'll push it on Saturday, probably.




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-10 Thread Andrey Borodin



> On 10 Jan 2021, at 14:43, Noah Misch wrote:
> 
>> Can you please send revised patches with fixes?
> 
> Attached.
> 
> 

I'm marking the patch as ready for committer.

I can't tell whether we should backpatch the insurance patch or not: it
potentially fixes unknown bugs, and potentially contains unknown bugs. I can't
reason under such uncertainty. I've tried to look for any potential problem,
and so far I see none. Chances are  is making the code less
error-prone.

Fix  is certainly worth backpatching.

Thanks!

Best regards, Andrey Borodin.



Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-10 Thread Noah Misch
On Sun, Jan 10, 2021 at 11:44:14AM +0500, Andrey Borodin wrote:
> > On 10 Jan 2021, at 03:15, Noah Misch wrote:
> > 
> > No; it deletes the most recent ~1B and leaves the older segments.  An
> > exception is multixact, as described in the commit message and the patch's
> > change to a comment in TruncateMultiXact().
> 
> Thanks for clarification.
> One more thing: the retention point at 3/4 of the overall space (half of the
> wraparound distance) seems more or less arbitrary to me. Why not 5/8 or 9/16?

No reason for that exact value.  The purpose of that patch is to mitigate bugs
that cause the server to write data into a region of the SLRU that we permit
truncation to unlink.  If the patch instead tested "diff > INT_MIN * .99", the
new behavior would get little testing, because xidWarnLimit would start first.
Also, the new behavior wouldn't mitigate bugs that trespass >~20M XIDs into
unlink-eligible space.  If the patch tested "diff > INT_MIN * .01", more sites
would see disk consumption grow.  I think reasonable multipliers range from
0.5 (in the patch today) to 0.9, but it's a judgment call.
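
To make the multiplier trade-off concrete, here is a small Python model of the insurance rule (illustrative names, not PostgreSQL source; it assumes CLOG at the default BLCKSZ=8192 build, so 32768 XIDs per page):

```python
INT_MIN = -(2 ** 31)
XACTS_PER_PAGE = 32768          # CLOG_XACTS_PER_PAGE at default BLCKSZ=8192
M = 2 ** 32                     # size of the XID space

def page_diff(page1, page2):
    """Signed 32-bit distance, in XIDs, from page2 back to page1."""
    d = ((page1 - page2) * XACTS_PER_PAGE) % M
    return d - M if d >= 2 ** 31 else d

def may_unlink(diff, multiplier=0.5):
    """Unlink only pages within the nearest `multiplier` fraction of the past."""
    return int(INT_MIN * multiplier) < diff < 0
```

With multiplier=0.5, a page one page behind the cutoff is unlink-eligible, while a page exactly or more than half the XID space behind is preserved; raising the multiplier toward 0.9 widens the unlink-eligible region, as discussed above.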

> Can you please send revised patches with fixes?

Attached.
Author: Noah Misch 
Commit: Noah Misch 

Prevent excess SimpleLruTruncate() deletion.

Every core SLRU wraps around.  With the exception of pg_notify, the wrap
point can fall in the middle of a page.  Account for this in the
PagePrecedes callback specification and in SimpleLruTruncate()'s use of
said callback.  Update each callback implementation to fit the new
specification.  This changes SerialPagePrecedesLogically() from the
style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes().
(Whereas pg_clog and pg_serial share a key space, pg_serial is nothing
like pg_notify.)  The bug fixed here has the same symptoms and user
followup steps as 592a589a04bd456410b853d86bd05faa9432cbbb.  Back-patch
to 9.5 (all supported versions).

Reviewed by Andrey Borodin and Tom Lane.

Discussion: https://postgr.es/m/20190202083822.gc32...@gust.leadboat.com

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 69a81f3..410d02a 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -694,6 +694,7 @@ CLOGShmemInit(void)
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
  SYNC_HANDLER_CLOG);
+   SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
 }
 
 /*
@@ -912,13 +913,22 @@ TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid)
 
 
 /*
- * Decide which of two CLOG page numbers is "older" for truncation purposes.
+ * Decide whether a CLOG page number is "older" for truncation purposes.
  *
  * We need to use comparison of TransactionIds here in order to do the right
- * thing with wraparound XID arithmetic.  However, if we are asked about
- * page number zero, we don't want to hand InvalidTransactionId to
- * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
- * offset both xids by FirstNormalTransactionId to avoid that.
+ * thing with wraparound XID arithmetic.  However, TransactionIdPrecedes()
+ * would get weird about permanent xact IDs.  So, offset both such that xid1,
+ * xid2, and xid2 + CLOG_XACTS_PER_PAGE - 1 are all normal XIDs; this offset
+ * is relevant to page 0 and to the page preceding page 0.
+ *
+ * The page containing oldestXact-2^31 is the important edge case.  The
+ * portion of that page equaling or following oldestXact-2^31 is expendable,
+ * but the portion preceding oldestXact-2^31 is not.  When oldestXact-2^31 is
+ * the first XID of a page and segment, the entire page and segment is
+ * expendable, and we could truncate the segment.  Recognizing that case would
+ * require making oldestXact, not just the page containing oldestXact,
+ * available to this callback.  The benefit would be rare and small, so we
+ * don't optimize that edge case.
  */
 static bool
 CLOGPagePrecedes(int page1, int page2)
@@ -927,11 +937,12 @@ CLOGPagePrecedes(int page1, int page2)
TransactionId xid2;
 
xid1 = ((TransactionId) page1) * CLOG_XACTS_PER_PAGE;
-   xid1 += FirstNormalTransactionId;
+   xid1 += FirstNormalTransactionId + 1;
xid2 = ((TransactionId) page2) * CLOG_XACTS_PER_PAGE;
-   xid2 += FirstNormalTransactionId;
+   xid2 += FirstNormalTransactionId + 1;
 
-   return TransactionIdPrecedes(xid1, xid2);
+   return (TransactionIdPrecedes(xid1, xid2) &&
+   TransactionIdPrecedes(xid1, xid2 + CLOG_XACTS_PER_PAGE - 1));
 }
 
 
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index b786eef..9f42461 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -557,6 +557,7 @@ CommitTsShmemInit(void)

Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-09 Thread Andrey Borodin



> On 10 Jan 2021, at 03:15, Noah Misch wrote:
> 
> No; it deletes the most recent ~1B and leaves the older segments.  An
> exception is multixact, as described in the commit message and the patch's
> change to a comment in TruncateMultiXact().

Thanks for clarification.
One more thing: the retention point at 3/4 of the overall space (half of the
wraparound distance) seems more or less arbitrary to me. Why not 5/8 or 9/16?

Can you please send revised patches with fixes?

Thanks!

Best regards, Andrey Borodin.



Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-09 Thread Noah Misch
On Sat, Jan 09, 2021 at 08:25:39PM +0500, Andrey Borodin wrote:
> > On 9 Jan 2021, at 15:17, Noah Misch wrote:

> >> I'm a little bit afraid that this kind of patch can hide bugs (while
> >> potentially saving some users' data). Besides that, this patch seems like
> >> a useful precaution. Maybe we could emit scary warnings if SLRU segments
> >> do not form a contiguous range?
> > 
> > Scary warnings are good for an observation that implies a bug, but the
> > slru-truncate-t-insurance patch causes such an outcome in non-bug cases where
> > it doesn't happen today.  In other words, discontinuous ranges of SLRU
> > segments would be even more common after that patch.  For example, it would
> > happen anytime oldestXID advances by more than ~1B at a time.
> 
> Uhm, I thought that if there are going to be more than ~1B XIDs, we are going
> to keep all segments forever and the range will still be contiguous. Or am I
> missing something?

No; it deletes the most recent ~1B and leaves the older segments.  An
exception is multixact, as described in the commit message and the patch's
change to a comment in TruncateMultiXact().




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-09 Thread Andrey Borodin



> On 9 Jan 2021, at 15:17, Noah Misch wrote:
> 
>> This
>> int diff_max = ((QUEUE_MAX_PAGE + 1) / 2) - 1,
>> seems to be functional equivalent of
>> int diff_max = ((QUEUE_MAX_PAGE - 1) / 2),
> 
> Do you think one conveys the concept better than the other?
I see now that later comments mention "(QUEUE_MAX_PAGE+1)/2", so I think there
is no need to change anything in the patch here.

>> I'm a little bit afraid that this kind of patch can hide bugs (while
>> potentially saving some users' data). Besides that, this patch seems like a
>> useful precaution. Maybe we could emit scary warnings if SLRU segments do
>> not form a contiguous range?
> 
> Scary warnings are good for an observation that implies a bug, but the
> slru-truncate-t-insurance patch causes such an outcome in non-bug cases where
> it doesn't happen today.  In other words, discontinuous ranges of SLRU
> segments would be even more common after that patch.  For example, it would
> happen anytime oldestXID advances by more than ~1B at a time.

Uhm, I thought that if there are going to be more than ~1B XIDs, we are going
to keep all segments forever and the range will still be contiguous. Or am I
missing something?

Thanks!

Best regards, Andrey Borodin.





Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-09 Thread Noah Misch
On Wed, Jan 06, 2021 at 11:28:36AM +0500, Andrey Borodin wrote:
> The first patch fixes a bug where SLRU truncation around the wrapping point
> could happen too early.
> The basic idea of the patch is "if we want to delete a range, we must be
> eligible to delete both its beginning and its end".
> So to test whether a page is deletable, it tests that its first and last XIDs
> (or other SLRU units) are of no interest. To test whether a segment is
> deletable, it tests whether its first and last pages can be deleted.

Yes.

> I'm a little suspicious of the implementation of the *PagePrecedes(int page1,
> int page2) functions. Consider the following example from the patch:
> 
> static bool
> CommitTsPagePrecedes(int page1, int page2)
> {
> TransactionId xid1;
> TransactionId xid2;
> 
> xid1 = ((TransactionId) page1) * COMMIT_TS_XACTS_PER_PAGE;
> xid1 += FirstNormalTransactionId + 1;
> xid2 = ((TransactionId) page2) * COMMIT_TS_XACTS_PER_PAGE;
> xid2 += FirstNormalTransactionId + 1;
> 
> return (TransactionIdPrecedes(xid1, xid2) &&
> TransactionIdPrecedes(xid1, xid2 + COMMIT_TS_XACTS_PER_PAGE - 1));
> }
> 
> We are adding FirstNormalTransactionId to the XIDs to avoid them being
> special XIDs. We add COMMIT_TS_XACTS_PER_PAGE - 1 to xid2 to shift it to the
> end of the page. But due to += FirstNormalTransactionId we shift slightly
> past the page boundary and risk that xid2 + COMMIT_TS_XACTS_PER_PAGE - 1 can
> become FrozenTransactionId (FirstNormalTransactionId - 1). Thus we add +1 to
> all values in scope. While the logic is correct, the coding is difficult.
> Maybe we could just use

Right.  The overall objective is to compare the first XID of page1 to the
first and last XIDs of page2.  The FirstNormalTransactionId+1 addend operates
at a lower level.  It just makes TransactionIdPrecedes() behave like
NormalTransactionIdPrecedes() without the latter's assertion.

>   page1_first_normal_xid = ((TransactionId) page1) * CLOG_XACTS_PER_PAGE + FirstNormalTransactionId;
>   page2_first_normal_xid = ((TransactionId) page2) * CLOG_XACTS_PER_PAGE + FirstNormalTransactionId;
>   page2_last_xid = ((TransactionId) page2) * CLOG_XACTS_PER_PAGE + CLOG_XACTS_PER_PAGE - 1;
> But I'm not insisting on this.

I see your point, but I doubt using different addends on different operands
makes this easier to understand.  If anything, I'd lean toward adding more
explicit abstraction between "the XID we intend to test" and "the XID we're
using to fool some general-purpose API".
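
For concreteness, here is a Python model (not PostgreSQL source) of the patched comparison, assuming a default BLCKSZ=8192 build where CLOG_XACTS_PER_PAGE = 32768; normal_xid_precedes() stands in for NormalTransactionIdPrecedes():

```python
CLOG_XACTS_PER_PAGE = 32768     # 8192 bytes * 4 transaction statuses per byte
FIRST_NORMAL_XID = 3            # FirstNormalTransactionId
M = 2 ** 32                     # size of the XID space

def normal_xid_precedes(xid1, xid2):
    """Modular comparison: the signed 32-bit difference xid1 - xid2 is < 0."""
    return (xid1 - xid2) % M >= 2 ** 31

def clog_page_precedes(page1, page2):
    """Patched rule: page1's first XID precedes BOTH ends of page2."""
    xid1 = (page1 * CLOG_XACTS_PER_PAGE + FIRST_NORMAL_XID + 1) % M
    xid2 = (page2 * CLOG_XACTS_PER_PAGE + FIRST_NORMAL_XID + 1) % M
    return (normal_xid_precedes(xid1, xid2) and
            normal_xid_precedes(xid1, (xid2 + CLOG_XACTS_PER_PAGE - 1) % M))
```

Page 65536 sits exactly 2^31 XIDs ahead of page 0, so the second test fails and that page is not truncation-eligible, whereas the unpatched single-test version would have deemed it eligible.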

> The following comment is not correct for 1 kB and 4 kB pages:
> + * At every supported BLCKSZ, (1 << 31) % COMMIT_TS_XACTS_PER_PAGE == 128.

Fixed, thanks.

> All of the above notes are not blocking the fix; I just wanted to let the
> committer know about them. I think that it's very important to have this bug
> fixed.
> 
> The second patch converts the binary *PagePrecedes() functions to
> *PageDiff()s and adds logic to avoid deleting pages in suspicious cases. This
> logic depends on the scale of the value returned by the diff: it expects that
> overflow happens between INT_MIN and INT_MAX. Thus it prevents page deletion
> if page_diff <= INT_MIN / 2 (too far from the current cleaning point; and in
> normal cases, of course).
> 
> This must be a comparison here, not an equality test.
> -   ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
> +   ctl->PageDiff(segpage, trunc->earliestExistingPage))

That's bad.  Fixed, thanks.

> This
> int diff_max = ((QUEUE_MAX_PAGE + 1) / 2) - 1,
> seems to be functional equivalent of
> int diff_max = ((QUEUE_MAX_PAGE - 1) / 2),

Do you think one conveys the concept better than the other?

> AFAICS the overall purpose of the 2nd patch is to help clusters corrupted by
> other bugs avoid deleting SLRU segments.

Yes.

> I'm a little bit afraid that this kind of patch can hide bugs (while
> potentially saving some users' data). Besides that, this patch seems like a
> useful precaution. Maybe we could emit scary warnings if SLRU segments do not
> form a contiguous range?

Scary warnings are good for an observation that implies a bug, but the
slru-truncate-t-insurance patch causes such an outcome in non-bug cases where
it doesn't happen today.  In other words, discontinuous ranges of SLRU
segments would be even more common after that patch.  For example, it would
happen anytime oldestXID advances by more than ~1B at a time.

Thanks,
nm




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-05 Thread Andrey Borodin



> On 1 Jan 2021, at 23:05, Andrey Borodin wrote:
> 
> I've found this thread in CF looking for something to review.

We discussed the patches with Noah off-list; I'm resending a summary to the list.

There are two patches:
1. slru-truncate-modulo-v5.patch
2. slru-truncate-t-insurance-v4.patch

It would be a bit easier to review if the patches were versioned together (both
v5), because the 2nd patch applies on top of the 1st. Also, the 2nd patch has a
problem when applied with git (in async.c).

The first patch fixes a bug where SLRU truncation around the wrapping point
could happen too early.
The basic idea of the patch is "if we want to delete a range, we must be
eligible to delete both its beginning and its end".
So to test whether a page is deletable, it tests that its first and last XIDs
(or other SLRU units) are of no interest. To test whether a segment is
deletable, it tests whether its first and last pages can be deleted.
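
The "both ends must be deletable" rule above can be sketched in Python (hypothetical names, not PostgreSQL source; the simplified page_precedes() below ignores the mid-page wrap subtlety that the real patch also handles):

```python
# Model an SLRU page space of 2^32 / 32768 = 131072 CLOG pages, grouped
# into 32-page segments as in PostgreSQL.  A segment may be unlinked only
# when both its first and its last page precede the cutoff page.

PAGES_PER_WRAP = 131072
SLRU_PAGES_PER_SEGMENT = 32

def page_precedes(page1, page2):
    """True when page1 lies in the half of the modular space behind page2."""
    return (page1 - page2) % PAGES_PER_WRAP >= PAGES_PER_WRAP // 2

def may_delete_segment(segpage, cutoff_page):
    """Deletable only if the segment's first and last pages both precede."""
    seg_last = segpage + SLRU_PAGES_PER_SEGMENT - 1
    return (page_precedes(segpage, cutoff_page) and
            page_precedes(seg_last, cutoff_page))
```

A segment that straddles the cutoff, e.g. the segment starting at page 96 when the cutoff is page 100, fails the second test and is kept.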
The patch adds tests in an unusual manner: they are implemented as assert
functions. The tests are fast; they only check basic and edge cases. But the
tests will not run if Postgres is built without asserts.
I'm a little suspicious of the implementation of the *PagePrecedes(int page1,
int page2) functions. Consider the following example from the patch:

static bool
CommitTsPagePrecedes(int page1, int page2)
{
TransactionId xid1;
TransactionId xid2;

xid1 = ((TransactionId) page1) * COMMIT_TS_XACTS_PER_PAGE;
xid1 += FirstNormalTransactionId + 1;
xid2 = ((TransactionId) page2) * COMMIT_TS_XACTS_PER_PAGE;
xid2 += FirstNormalTransactionId + 1;

return (TransactionIdPrecedes(xid1, xid2) &&
TransactionIdPrecedes(xid1, xid2 + COMMIT_TS_XACTS_PER_PAGE - 1));
}

We are adding FirstNormalTransactionId to the XIDs to avoid them being special
XIDs. We add COMMIT_TS_XACTS_PER_PAGE - 1 to xid2 to shift it to the end of
the page. But due to += FirstNormalTransactionId we shift slightly past the
page boundary and risk that xid2 + COMMIT_TS_XACTS_PER_PAGE - 1 can become
FrozenTransactionId (FirstNormalTransactionId - 1). Thus we add +1 to all
values in scope. While the logic is correct, the coding is difficult. Maybe we
could just use
page1_first_normal_xid = ((TransactionId) page1) * CLOG_XACTS_PER_PAGE + FirstNormalTransactionId;
page2_first_normal_xid = ((TransactionId) page2) * CLOG_XACTS_PER_PAGE + FirstNormalTransactionId;
page2_last_xid = ((TransactionId) page2) * CLOG_XACTS_PER_PAGE + CLOG_XACTS_PER_PAGE - 1;
But I'm not insisting on this.
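
A quick Python check confirms the hazard the extra +1 avoids (assumptions: CLOG-style pages of 32768 XIDs at default BLCKSZ, where 2^32 divides evenly into pages; end_of_page_probe() is a hypothetical helper, not PostgreSQL source): without the +1, the "last XID of page2" probe for the page just before the wrap point collapses onto FrozenTransactionId.

```python
# Probe XID used as "last XID of page2" after the FirstNormal offset.
# extra_offset=0 models the old "+ FirstNormalTransactionId" coding;
# extra_offset=1 models the patch's "+ FirstNormalTransactionId + 1".

XACTS_PER_PAGE = 32768          # CLOG_XACTS_PER_PAGE at default BLCKSZ=8192
FIRST_NORMAL_XID = 3
FROZEN_XID = 2                  # FirstNormalTransactionId - 1
M = 2 ** 32

def end_of_page_probe(page2, extra_offset):
    """The XID that would be handed to TransactionIdPrecedes()."""
    return (page2 * XACTS_PER_PAGE + FIRST_NORMAL_XID + extra_offset
            + XACTS_PER_PAGE - 1) % M

# The page just before the wrap point exposes the hazard.
LAST_PAGE = M // XACTS_PER_PAGE - 1    # 131071
```

With extra_offset=0 the probe for LAST_PAGE lands on XID 2 (FrozenTransactionId); with extra_offset=1 it lands on XID 3, the first normal XID, as intended.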


The following comment is not correct for 1 kB and 4 kB pages:
+ * At every supported BLCKSZ, (1 << 31) % COMMIT_TS_XACTS_PER_PAGE == 128.

All of the above notes are not blocking the fix; I just wanted to let the
committer know about them. I think that it's very important to have this bug
fixed.

The second patch converts the binary *PagePrecedes() functions to *PageDiff()s
and adds logic to avoid deleting pages in suspicious cases. This logic depends
on the scale of the value returned by the diff: it expects that overflow
happens between INT_MIN and INT_MAX. Thus it prevents page deletion if
page_diff <= INT_MIN / 2 (too far from the current cleaning point; and in
normal cases, of course).

This must be a comparison here, not an equality test.
-   ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
+   ctl->PageDiff(segpage, trunc->earliestExistingPage))

This
int diff_max = ((QUEUE_MAX_PAGE + 1) / 2) - 1,
seems to be functional equivalent of
int diff_max = ((QUEUE_MAX_PAGE - 1) / 2),

What I like about the patch is that it removes all of the trickery described
above around + FirstNormalTransactionId + 1.

AFAICS the overall purpose of the 2nd patch is to help clusters corrupted by
other bugs avoid deleting SLRU segments.
I'm a little bit afraid that this kind of patch can hide bugs (while
potentially saving some users' data). Besides that, this patch seems like a
useful precaution. Maybe we could emit scary warnings if SLRU segments do not
form a contiguous range?


Thanks!

Best regards, Andrey Borodin.



Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-02 Thread Noah Misch
On Sat, Jan 02, 2021 at 12:31:45PM +0500, Andrey Borodin wrote:
> Do I understand correctly that this is a bugfix that needs to be back-patched?

The slru-truncate-modulo patch fixes a bug.  The slru-truncate-t-insurance
patch does not.  Neither _needs_ to be back-patched, though I'm proposing to
back-patch both.  I welcome opinions about that.

> Thus we should not refactor the 4 identical *PagePrecedes(int page1, int
> page2) functions into 1 generic function?

I agree with not refactoring that way, in this case.

> Since the functions are not symmetric anymore, maybe we should have better
> names for the arguments than "page1" and "page2"? At least in the dev branch.

That works for me.  What names would you suggest?

> Is it common practice to embed tests into assert checking like in 
> SlruPagePrecedesUnitTests()?

No; it's neither common practice nor a policy breach.




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-01 Thread Andrey Borodin



> On 2 Jan 2021, at 01:35, Noah Misch wrote:
> There's no
> other connection to this thread, and one can review patches on this thread
> without studying commit c732c3f. 

OK, thanks!

Do I understand correctly that this is a bugfix that needs to be back-patched?
Thus we should not refactor the 4 identical *PagePrecedes(int page1, int page2)
functions into 1 generic function?
Since the functions are not symmetric anymore, maybe we should have better
names for the arguments than "page1" and "page2"? At least in the dev branch.

Is it common practice to embed tests into assert checking like in 
SlruPagePrecedesUnitTests()?

SLRU seems nowhere near simple, BTW. The only simple part is the naive caching
algorithm. I remember there was a thread about converting SLRUs to relations.

Best regards, Andrey Borodin.



Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-01 Thread Noah Misch
On Fri, Jan 01, 2021 at 11:05:29PM +0500, Andrey Borodin wrote:
> I've found this thread in CF looking for something to review.

Thanks for taking a look.

> > On 9 Nov 2020, at 09:53, Noah Misch wrote:
> > 
> > Rebased both patches, necessitated by commit c732c3f (a repair of commit
> > dee663f).  As I mentioned on another branch of the thread, I'd be content if
> > someone reviews the slru-truncate-modulo patch and disclaims knowledge of 
> > the
> > slru-truncate-insurance patch; I would then abandon the latter patch.
> > 
> 
> Commit c732c3f adds some SYNC_FORGET_REQUESTs.
> slru-truncate-modulo-v5.patch fixes an off-by-one error in functions like
> *PagePrecedes(int page1, int page2).
> slru-truncate-t-insurance-v4.patch ensures that off-by-one errors do not 
> inflict data loss.
> 
> While I agree that fixing an error is better than hiding it, I could not
> figure out how c732c3f is connected to these patches.
> Can you please give me a few pointers on how to understand this connection?

Commit c732c3f is the last commit that caused a merge conflict.  There's no
other connection to this thread, and one can review patches on this thread
without studying commit c732c3f.  Specifically, this thread's
slru-truncate-modulo patch and commit c732c3f modify adjacent lines in
SlruScanDirCbDeleteCutoff(); here's the diff after merge conflict resolution:

--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@@ -1525,4 -1406,4 +1517,4 @@@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, 
  
 -  if (ctl->PagePrecedes(segpage, cutoffPage))
 +  if (SlruMayDeleteSegment(ctl, segpage, cutoffPage))
-   SlruInternalDeleteSegment(ctl, filename);
+   SlruInternalDeleteSegment(ctl, segpage / SLRU_PAGES_PER_SEGMENT);
  




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-01 Thread Andrey Borodin
Hi Noah!

I've found this thread in CF looking for something to review.

> On 9 Nov 2020, at 09:53, Noah Misch wrote:
> 
> Rebased both patches, necessitated by commit c732c3f (a repair of commit
> dee663f).  As I mentioned on another branch of the thread, I'd be content if
> someone reviews the slru-truncate-modulo patch and disclaims knowledge of the
> slru-truncate-insurance patch; I would then abandon the latter patch.
> 

Commit c732c3f adds some SYNC_FORGET_REQUESTs.
slru-truncate-modulo-v5.patch fixes an off-by-one error in functions like
*PagePrecedes(int page1, int page2).
slru-truncate-t-insurance-v4.patch ensures that off-by-one errors do not 
inflict data loss.

While I agree that fixing an error is better than hiding it, I could not
figure out how c732c3f is connected to these patches.
Can you please give me a few pointers on how to understand this connection?

Best regards, Andrey Borodin.



Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-11-08 Thread Noah Misch
On Wed, Oct 28, 2020 at 09:01:59PM -0700, Noah Misch wrote:
> On Sat, Aug 29, 2020 at 10:34:33PM -0700, Noah Misch wrote:
> > On Mon, May 25, 2020 at 12:00:33AM -0700, Noah Misch wrote:
> > > [last non-rebase change]
> > 
> > Rebased the second patch.  The first patch did not need a rebase.
> 
> Rebased both patches, necessitated by commit dee663f changing many of the same
> spots.

Rebased both patches, necessitated by commit c732c3f (a repair of commit
dee663f).  As I mentioned on another branch of the thread, I'd be content if
someone reviews the slru-truncate-modulo patch and disclaims knowledge of the
slru-truncate-insurance patch; I would then abandon the latter patch.
Author: Noah Misch 
Commit: Noah Misch 

Prevent excess SimpleLruTruncate() deletion.

Every core SLRU wraps around.  With the exception of pg_notify, the wrap
point can fall in the middle of a page.  Account for this in the
PagePrecedes callback specification and in SimpleLruTruncate()'s use of
said callback.  Update each callback implementation to fit the new
specification.  This changes SerialPagePrecedesLogically() from the
style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes().
(Whereas pg_clog and pg_serial share a key space, pg_serial is nothing
like pg_notify.)  The bug fixed here has the same symptoms and user
followup steps as 592a589a04bd456410b853d86bd05faa9432cbbb.  Back-patch
to 9.5 (all supported versions).

Reviewed by Tom Lane.

Discussion: https://postgr.es/m/20190202083822.gc32...@gust.leadboat.com

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 034349a..55bdac4 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -694,6 +694,7 @@ CLOGShmemInit(void)
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
  SYNC_HANDLER_CLOG);
+   SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
 }
 
 /*
@@ -912,13 +913,22 @@ TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid)
 
 
 /*
- * Decide which of two CLOG page numbers is "older" for truncation purposes.
+ * Decide whether a CLOG page number is "older" for truncation purposes.
  *
  * We need to use comparison of TransactionIds here in order to do the right
- * thing with wraparound XID arithmetic.  However, if we are asked about
- * page number zero, we don't want to hand InvalidTransactionId to
- * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
- * offset both xids by FirstNormalTransactionId to avoid that.
+ * thing with wraparound XID arithmetic.  However, TransactionIdPrecedes()
+ * would get weird about permanent xact IDs.  So, offset both such that xid1,
+ * xid2, and xid2 + CLOG_XACTS_PER_PAGE - 1 are all normal XIDs; this offset is
+ * relevant to page 0 and to the page preceding page 0.
+ *
+ * The page containing oldestXact-2^31 is the important edge case.  The
+ * portion of that page equaling or following oldestXact-2^31 is expendable,
+ * but the portion preceding oldestXact-2^31 is not.  When oldestXact-2^31 is
+ * the first XID of a page and segment, the entire page and segment is
+ * expendable, and we could truncate the segment.  Recognizing that case would
+ * require making oldestXact, not just the page containing oldestXact,
+ * available to this callback.  The benefit would be rare and small, so we
+ * don't optimize that edge case.
  */
 static bool
 CLOGPagePrecedes(int page1, int page2)
@@ -927,11 +937,12 @@ CLOGPagePrecedes(int page1, int page2)
TransactionId xid2;
 
xid1 = ((TransactionId) page1) * CLOG_XACTS_PER_PAGE;
-   xid1 += FirstNormalTransactionId;
+   xid1 += FirstNormalTransactionId + 1;
xid2 = ((TransactionId) page2) * CLOG_XACTS_PER_PAGE;
-   xid2 += FirstNormalTransactionId;
+   xid2 += FirstNormalTransactionId + 1;
 
-   return TransactionIdPrecedes(xid1, xid2);
+   return (TransactionIdPrecedes(xid1, xid2) &&
+   TransactionIdPrecedes(xid1, xid2 + CLOG_XACTS_PER_PAGE - 1));
 }
 
 
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 2fe551f..ae45777 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -557,6 +557,7 @@ CommitTsShmemInit(void)
  CommitTsSLRULock, "pg_commit_ts",
  LWTRANCHE_COMMITTS_BUFFER,
  SYNC_HANDLER_COMMIT_TS);
+   SlruPagePrecedesUnitTests(CommitTsCtl, COMMIT_TS_XACTS_PER_PAGE);
 
commitTsShared = ShmemInitStruct("CommitTs shared",
 sizeof(CommitTimestampShared),
@@ -927,14 +928,27 @@ AdvanceOldestCommitTsXid(TransactionId oldestXact)

Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-10-28 Thread Noah Misch
On Sat, Aug 29, 2020 at 10:34:33PM -0700, Noah Misch wrote:
> On Mon, May 25, 2020 at 12:00:33AM -0700, Noah Misch wrote:
> > On Mon, Apr 06, 2020 at 09:18:47PM -0700, Noah Misch wrote:
> > > On Mon, Apr 06, 2020 at 02:46:09PM -0400, Tom Lane wrote:
> > > > Noah Misch  writes:
> > > > > On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
> > > > >> So I think what we're actually trying to accomplish here is to
> > > > >> ensure that instead of deleting up to half of the SLRU space
> > > > >> before the cutoff, we delete up to half-less-one-segment.
> > > > >> Maybe it should be half-less-two-segments, just to provide some
> > > > >> cushion against edge cases.  Reading the first comment in
> > > > >> SetTransactionIdLimit makes one not want to trust too much in
> > > > >> arguments based on the exact value of xidWrapLimit, while for
> > > > >> the other SLRUs it was already unclear whether the edge cases
> > > > >> were exactly right.
> > > > 
> > > > > That could be interesting insurance.  While it would be sad for us to
> > > > > miss an edge case and print "must be vacuumed within 2 transactions"
> > > > > when wrap has already happened, reaching that message implies the DBA
> > > > > burned ~1M XIDs, all in single-user mode.  More plausible is
> > > > > FreezeMultiXactId() overrunning the limit by tens of segments.  Hence,
> > > > > if we do buy this insurance, let's skip far more segments.  For
> > > > > example, instead of unlinking segments representing up to 2^31 past
> > > > > XIDs, we could divide that into an upper half that we unlink and a
> > > > > lower half.  The lower half will stay in place; eventually, XID
> > > > > consumption will overwrite it.  Truncation behavior won't change until
> > > > > the region of CLOG for pre-oldestXact XIDs exceeds 256 MiB.  Beyond
> > > > > that threshold, vac_truncate_clog() will unlink the upper 256 MiB and
> > > > > leave the rest.  CLOG maximum would rise from 512 MiB to 768 MiB.
> > > > > Would that be worthwhile?
> > 
> > > > Temporarily wasting some disk
> > > > space is a lot more palatable than corrupting data, and these code
> > > > paths are necessarily not terribly well tested.  So +1 for more
> > > > insurance.
> > > 
> > > Okay, I'll give that a try.  I expect this will replace the PagePrecedes
> > > callback with a PageDiff callback such that PageDiff(a, b) < 0 iff
> > > PagePrecedes(a, b).  PageDiff callbacks shall distribute return values
> > > uniformly in [INT_MIN,INT_MAX].  SimpleLruTruncate() will unlink segments
> > > where INT_MIN/2 < PageDiff(candidate, cutoff) < 0.
> > 
> > While doing so, I found that slru-truncate-modulo-v2.patch did get edge
> > cases wrong, as you feared.  In particular, if the newest XID reached
> > xidStopLimit and was in the first page of a segment, TruncateCLOG() would
> > delete its segment.  Attached slru-truncate-modulo-v3.patch fixes that; as
> > restitution, I added unit tests covering that and other scenarios.
> > Reaching the bug via XIDs was hard, requiring one to burn
> > 1000k-CLOG_XACTS_PER_PAGE=967k XIDs in single-user mode.  I expect the bug
> > was easier to reach via pg_multixact.
> > 
> > The insurance patch stacks on top of the bug fix patch.  It does have a
> > negative effect on TruncateMultiXact(), which uses SlruScanDirCbFindEarliest
> > to skip truncation in corrupted clusters.  SlruScanDirCbFindEarliest() gives
> > nonsense answers if "future" segments exist.  That can happen today, but the
> > patch creates new ways to make it happen.  The symptom is wasting yet more
> > space in pg_multixact.  I am okay with this, since it arises only after one
> > fills pg_multixact 50% full.  There are alternatives.  We could weaken the
> > corruption defense in TruncateMultiXact() or look for another implementation
> > of equivalent defense.  We could unlink, say, 75% or 95% of the "past" 
> > instead
> > of 50% (this patch) or >99.99% (today's behavior).
> 
> Rebased the second patch.  The first patch did not need a rebase.

Rebased both patches, necessitated by commit dee663f changing many of the same
spots.  I've updated one of the log messages for 592a589 having landed.

I've also changed a patch name stem from slru-truncate-insurance to
slru-truncate-t-insurance, so it sorts after the other patch.  Perhaps that
will trick http://cfbot.cputube.org/noah-misch.html into applying the patches
in the right order.  If not, continue to ignore cfbot.
Author: Noah Misch 
Commit: Noah Misch 

Prevent excess SimpleLruTruncate() deletion.

Every core SLRU wraps around.  With the exception of pg_notify, the wrap
point can fall in the middle of a page.  Account for this in the
PagePrecedes callback specification and in SimpleLruTruncate()'s use of
said callback.  Update each callback implementation to fit the new
specification.  This changes SerialPagePrecedesLogically() from the
style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes().
(Whereas pg_clog and pg_serial share a key space, pg_serial is nothing
like pg_notify.)

This closes a rare opportunity for data loss, which manifested as
"apparent wraparound" or "could not access status of transaction"
errors.  This is more likely to affect pg_multixact, due to the thin
space between multiStopLimit and multiWrapLimit.  If a user's physical
replication primary logged ":  apparent wraparound" messages, the user
should rebuild standbys of that primary regardless of symptoms.  At
less risk is a cluster having emitted "not accepting commands" errors or
"must be vacuumed" warnings at some point.  One can test a cluster for
this data loss by running VACUUM FREEZE in every database.  Back-patch
to 9.5 (all supported versions).

Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-09-30 Thread Noah Misch
On Wed, Sep 30, 2020 at 03:06:40PM +0900, Michael Paquier wrote:
> On Sun, Sep 06, 2020 at 08:14:21PM -0700, Noah Misch wrote:
> > http://cfbot.cputube.org/patch_29_2026.log applies the two patches in the
> > wrong order.  For this CF entry, ignore it.
> 
> OK, thanks.  This is a bug fix, so I have moved that to the next CF
> for now.  Noah, would you prefer more reviews or are you confident
> enough to move on with this issue?

The former.  I plan to wait until a review puts this in Ready for Committer.

I'd be content if someone reviews the slru-truncate-modulo patch and disclaims
knowledge of the slru-truncate-insurance patch; I would then abandon the
latter patch.  I'm not fond of how the latter turned out, particularly the
unintended consequence in TruncateMultiXact().  (See the commit message and/or
the edit to the comment in TruncateMultiXact().)  The subtle interaction with
SerialAdd() is not great, either.




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-09-30 Thread Michael Paquier
On Sun, Sep 06, 2020 at 08:14:21PM -0700, Noah Misch wrote:
> http://cfbot.cputube.org/patch_29_2026.log applies the two patches in the
> wrong order.  For this CF entry, ignore it.

OK, thanks.  This is a bug fix, so I have moved that to the next CF
for now.  Noah, would you prefer more reviews or are you confident
enough to move on with this issue?
--
Michael




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-09-06 Thread Noah Misch
On Mon, Sep 07, 2020 at 11:06:12AM +0900, Michael Paquier wrote:
> On Sat, Aug 29, 2020 at 10:34:33PM -0700, Noah Misch wrote:
> > Rebased the second patch.  The first patch did not need a rebase.
> 
> It looks like a new rebase is needed, the CF bot is complaining here.

http://cfbot.cputube.org/patch_29_2026.log applies the two patches in the
wrong order.  For this CF entry, ignore it.




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-09-06 Thread Michael Paquier
On Sat, Aug 29, 2020 at 10:34:33PM -0700, Noah Misch wrote:
> Rebased the second patch.  The first patch did not need a rebase.

It looks like a new rebase is needed, the CF bot is complaining here.
--
Michael




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-08-29 Thread Noah Misch
On Mon, May 25, 2020 at 12:00:33AM -0700, Noah Misch wrote:
> On Mon, Apr 06, 2020 at 09:18:47PM -0700, Noah Misch wrote:
> > On Mon, Apr 06, 2020 at 02:46:09PM -0400, Tom Lane wrote:
> > > Noah Misch  writes:
> > > > On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
> > > >> So I think what we're actually trying to accomplish here is to
> > > >> ensure that instead of deleting up to half of the SLRU space
> > > >> before the cutoff, we delete up to half-less-one-segment.
> > > >> Maybe it should be half-less-two-segments, just to provide some
> > > >> cushion against edge cases.  Reading the first comment in
> > > >> SetTransactionIdLimit makes one not want to trust too much in
> > > >> arguments based on the exact value of xidWrapLimit, while for
> > > >> the other SLRUs it was already unclear whether the edge cases
> > > >> were exactly right.
> > > 
> > > > That could be interesting insurance.  While it would be sad for us to 
> > > > miss an
> > > > edge case and print "must be vacuumed within 2 transactions" when wrap 
> > > > has
> > > > already happened, reaching that message implies the DBA burned ~1M 
> > > > XIDs, all
> > > > in single-user mode.  More plausible is FreezeMultiXactId() overrunning 
> > > > the
> > > > limit by tens of segments.  Hence, if we do buy this insurance, let's 
> > > > skip far
> > > > more segments.  For example, instead of unlinking segments representing 
> > > > up to
> > > > 2^31 past XIDs, we could divide that into an upper half that we unlink 
> > > > and a
> > > > lower half.  The lower half will stay in place; eventually, XID 
> > > > consumption
> > > > will overwrite it.  Truncation behavior won't change until the region 
> > > > of CLOG
> > > > for pre-oldestXact XIDs exceeds 256 MiB.  Beyond that threshold,
> > > > vac_truncate_clog() will unlink the upper 256 MiB and leave the rest.  
> > > > CLOG
> > > > maximum would rise from 512 MiB to 768 MiB.  Would that be worthwhile?
> 
> > > Temporarily wasting some disk
> > > space is a lot more palatable than corrupting data, and these code
> > > paths are necessarily not terribly well tested.  So +1 for more
> > > insurance.
> > 
> > Okay, I'll give that a try.  I expect this will replace the PagePrecedes
> > callback with a PageDiff callback such that PageDiff(a, b) < 0 iff
> > PagePrecedes(a, b).  PageDiff callbacks shall distribute return values
> > uniformly in [INT_MIN,INT_MAX].  SimpleLruTruncate() will unlink segments
> > where INT_MIN/2 < PageDiff(candidate, cutoff) < 0.
> 
> While doing so, I found that slru-truncate-modulo-v2.patch did get edge cases
> wrong, as you feared.  In particular, if the newest XID reached xidStopLimit
> and was in the first page of a segment, TruncateCLOG() would delete its
> segment.  Attached slru-truncate-modulo-v3.patch fixes that; as restitution, I
> added unit tests covering that and other scenarios.  Reaching the bug via XIDs
> was hard, requiring one to burn 1000k-CLOG_XACTS_PER_PAGE=967k XIDs in
> single-user mode.  I expect the bug was easier to reach via pg_multixact.
> 
> The insurance patch stacks on top of the bug fix patch.  It does have a
> negative effect on TruncateMultiXact(), which uses SlruScanDirCbFindEarliest
> to skip truncation in corrupted clusters.  SlruScanDirCbFindEarliest() gives
> nonsense answers if "future" segments exist.  That can happen today, but the
> patch creates new ways to make it happen.  The symptom is wasting yet more
> space in pg_multixact.  I am okay with this, since it arises only after one
> fills pg_multixact 50% full.  There are alternatives.  We could weaken the
> corruption defense in TruncateMultiXact() or look for another implementation
> of equivalent defense.  We could unlink, say, 75% or 95% of the "past" instead
> of 50% (this patch) or >99.99% (today's behavior).

Rebased the second patch.  The first patch did not need a rebase.
Author: Noah Misch 
Commit: Noah Misch 

Prevent excess SimpleLruTruncate() deletion.

Every core SLRU wraps around.  With the exception of pg_notify, the wrap
point can fall in the middle of a page.  Account for this in the
PagePrecedes callback specification and in SimpleLruTruncate()'s use of
said callback.  Update each callback implementation to fit the new
specification.  This changes SerialPagePrecedesLogically() from the
style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes().
(Whereas pg_clog and pg_serial share a key space, pg_serial is nothing
like pg_notify.)

This closes a rare opportunity for data loss, which manifested as
"apparent wraparound" or "could not access status of transaction"
errors.  This is more likely to affect pg_multixact, due to the thin
space between multiStopLimit and multiWrapLimit.  If a user's physical
replication primary logged ":  apparent wraparound" messages, the user
should rebuild standbys of that primary regardless of symptoms.  At
less risk is a cluster having emitted "not accepting commands" errors or
"must be vacuumed" warnings at some point.  One can test a cluster for
this data loss by running VACUUM FREEZE in every database.  Back-patch
to 9.5 (all supported versions).

Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-05-25 Thread Noah Misch
On Mon, Apr 06, 2020 at 09:18:47PM -0700, Noah Misch wrote:
> On Mon, Apr 06, 2020 at 02:46:09PM -0400, Tom Lane wrote:
> > Noah Misch  writes:
> > > On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
> > >> So I think what we're actually trying to accomplish here is to
> > >> ensure that instead of deleting up to half of the SLRU space
> > >> before the cutoff, we delete up to half-less-one-segment.
> > >> Maybe it should be half-less-two-segments, just to provide some
> > >> cushion against edge cases.  Reading the first comment in
> > >> SetTransactionIdLimit makes one not want to trust too much in
> > >> arguments based on the exact value of xidWrapLimit, while for
> > >> the other SLRUs it was already unclear whether the edge cases
> > >> were exactly right.
> > 
> > > That could be interesting insurance.  While it would be sad for us to 
> > > miss an
> > > edge case and print "must be vacuumed within 2 transactions" when wrap has
> > > already happened, reaching that message implies the DBA burned ~1M XIDs, 
> > > all
> > > in single-user mode.  More plausible is FreezeMultiXactId() overrunning 
> > > the
> > > limit by tens of segments.  Hence, if we do buy this insurance, let's 
> > > skip far
> > > more segments.  For example, instead of unlinking segments representing 
> > > up to
> > > 2^31 past XIDs, we could divide that into an upper half that we unlink 
> > > and a
> > > lower half.  The lower half will stay in place; eventually, XID 
> > > consumption
> > > will overwrite it.  Truncation behavior won't change until the region of 
> > > CLOG
> > > for pre-oldestXact XIDs exceeds 256 MiB.  Beyond that threshold,
> > > vac_truncate_clog() will unlink the upper 256 MiB and leave the rest.  
> > > CLOG
> > > maximum would rise from 512 MiB to 768 MiB.  Would that be worthwhile?

> > Temporarily wasting some disk
> > space is a lot more palatable than corrupting data, and these code
> > paths are necessarily not terribly well tested.  So +1 for more
> > insurance.
> 
> Okay, I'll give that a try.  I expect this will replace the PagePrecedes
> callback with a PageDiff callback such that PageDiff(a, b) < 0 iff
> PagePrecedes(a, b).  PageDiff callbacks shall distribute return values
> uniformly in [INT_MIN,INT_MAX].  SimpleLruTruncate() will unlink segments
> where INT_MIN/2 < PageDiff(candidate, cutoff) < 0.

While doing so, I found that slru-truncate-modulo-v2.patch did get edge cases
wrong, as you feared.  In particular, if the newest XID reached xidStopLimit
and was in the first page of a segment, TruncateCLOG() would delete its
segment.  Attached slru-truncate-modulo-v3.patch fixes that; as restitution, I
added unit tests covering that and other scenarios.  Reaching the bug via XIDs
was hard, requiring one to burn 1000k-CLOG_XACTS_PER_PAGE=967k XIDs in
single-user mode.  I expect the bug was easier to reach via pg_multixact.

The insurance patch stacks on top of the bug fix patch.  It does have a
negative effect on TruncateMultiXact(), which uses SlruScanDirCbFindEarliest
to skip truncation in corrupted clusters.  SlruScanDirCbFindEarliest() gives
nonsense answers if "future" segments exist.  That can happen today, but the
patch creates new ways to make it happen.  The symptom is wasting yet more
space in pg_multixact.  I am okay with this, since it arises only after one
fills pg_multixact 50% full.  There are alternatives.  We could weaken the
corruption defense in TruncateMultiXact() or look for another implementation
of equivalent defense.  We could unlink, say, 75% or 95% of the "past" instead
of 50% (this patch) or >99.99% (today's behavior).

Thanks,
nm
Author: Noah Misch 
Commit: Noah Misch 

Prevent excess SimpleLruTruncate() deletion.

Every core SLRU wraps around.  With the exception of pg_notify, the wrap
point can fall in the middle of a page.  Account for this in the
PagePrecedes callback specification and in SimpleLruTruncate()'s use of
said callback.  Update each callback implementation to fit the new
specification.  This changes SerialPagePrecedesLogically() from the
style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes().
(Whereas pg_clog and pg_serial share a key space, pg_serial is nothing
like pg_notify.)

This closes a rare opportunity for data loss, which manifested as
"apparent wraparound" or "could not access status of transaction"
errors.  This is more likely to affect pg_multixact, due to the thin
space between multiStopLimit and multiWrapLimit.  If a user's physical
replication primary logged ":  apparent wraparound" messages, the user
should rebuild standbys of that primary regardless of symptoms.  At
less risk is a cluster having emitted "not accepting commands" errors or
"must be vacuumed" warnings at some point.  One can test a cluster for
this data loss by running VACUUM FREEZE in every database.  Back-patch
to 9.5 (all supported versions).

  

Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-04-06 Thread Noah Misch
On Mon, Apr 06, 2020 at 02:46:09PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
> >> So I think what we're actually trying to accomplish here is to
> >> ensure that instead of deleting up to half of the SLRU space
> >> before the cutoff, we delete up to half-less-one-segment.
> >> Maybe it should be half-less-two-segments, just to provide some
> >> cushion against edge cases.  Reading the first comment in
> >> SetTransactionIdLimit makes one not want to trust too much in
> >> arguments based on the exact value of xidWrapLimit, while for
> >> the other SLRUs it was already unclear whether the edge cases
> >> were exactly right.
> 
> > That could be interesting insurance.  While it would be sad for us to miss 
> > an
> > edge case and print "must be vacuumed within 2 transactions" when wrap has
> > already happened, reaching that message implies the DBA burned ~1M XIDs, all
> > in single-user mode.  More plausible is FreezeMultiXactId() overrunning the
> > limit by tens of segments.  Hence, if we do buy this insurance, let's skip 
> > far
> > more segments.  For example, instead of unlinking segments representing up 
> > to
> > 2^31 past XIDs, we could divide that into an upper half that we unlink and a
> > lower half.  The lower half will stay in place; eventually, XID consumption
> > will overwrite it.  Truncation behavior won't change until the region of 
> > CLOG
> > for pre-oldestXact XIDs exceeds 256 MiB.  Beyond that threshold,
> > vac_truncate_clog() will unlink the upper 256 MiB and leave the rest.  CLOG
> > maximum would rise from 512 MiB to 768 MiB.  Would that be worthwhile?
> 
> Hmm.  I'm not particularly concerned about the disk-space-consumption
> angle, but I do wonder about whether we'd be sacrificing the ability to
> recover cleanly from a situation that the code does let you get into.
> However, as long as we're sure that the system will ultimately reuse/
> recycle a not-deleted old segment file without complaint, it's hard to
> find much fault with your proposal.

That is the trade-off.  By distancing ourselves from the wraparound edge
cases, we'll get more segment recycling (heretofore an edge case).
Fortunately, recycling doesn't change behavior as you approach some limit; it
works or it doesn't.

> Temporarily wasting some disk
> space is a lot more palatable than corrupting data, and these code
> paths are necessarily not terribly well tested.  So +1 for more
> insurance.

Okay, I'll give that a try.  I expect this will replace the PagePrecedes
callback with a PageDiff callback such that PageDiff(a, b) < 0 iff
PagePrecedes(a, b).  PageDiff callbacks shall distribute return values
uniformly in [INT_MIN,INT_MAX].  SimpleLruTruncate() will unlink segments
where INT_MIN/2 < PageDiff(candidate, cutoff) < 0.
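A minimal standalone sketch of that rule, with hypothetical names (PageDiff, PagePrecedes, UnlinkEligible are illustration only, not the patch's code) and page numbers mapped onto the full unsigned 32-bit space:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: PageDiff() maps the modular distance between two
 * page numbers onto [INT_MIN, INT_MAX], so PageDiff(a, b) < 0 exactly
 * when a precedes b in the wrapping page space.
 */
static int32_t
PageDiff(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b);	/* unsigned subtraction wraps modulo 2^32 */
}

static bool
PagePrecedes(uint32_t a, uint32_t b)
{
	return PageDiff(a, b) < 0;
}

/*
 * Truncation unlinks only the nearer half of the past: candidates more
 * than a quarter of the full cycle behind the cutoff stay in place for
 * later overwriting.
 */
static bool
UnlinkEligible(uint32_t candidate, uint32_t cutoff)
{
	int32_t		diff = PageDiff(candidate, cutoff);

	return INT_MIN / 2 < diff && diff < 0;
}
```

Under this rule a page exactly halfway around from the cutoff (diff == INT_MIN) is never unlink-eligible, which is the kind of edge the bug-fix patch closes.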




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-04-06 Thread Tom Lane
Noah Misch  writes:
> On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
>> So I think what we're actually trying to accomplish here is to
>> ensure that instead of deleting up to half of the SLRU space
>> before the cutoff, we delete up to half-less-one-segment.
>> Maybe it should be half-less-two-segments, just to provide some
>> cushion against edge cases.  Reading the first comment in
>> SetTransactionIdLimit makes one not want to trust too much in
>> arguments based on the exact value of xidWrapLimit, while for
>> the other SLRUs it was already unclear whether the edge cases
>> were exactly right.

> That could be interesting insurance.  While it would be sad for us to miss an
> edge case and print "must be vacuumed within 2 transactions" when wrap has
> already happened, reaching that message implies the DBA burned ~1M XIDs, all
> in single-user mode.  More plausible is FreezeMultiXactId() overrunning the
> limit by tens of segments.  Hence, if we do buy this insurance, let's skip far
> more segments.  For example, instead of unlinking segments representing up to
> 2^31 past XIDs, we could divide that into an upper half that we unlink and a
> lower half.  The lower half will stay in place; eventually, XID consumption
> will overwrite it.  Truncation behavior won't change until the region of CLOG
> for pre-oldestXact XIDs exceeds 256 MiB.  Beyond that threshold,
> vac_truncate_clog() will unlink the upper 256 MiB and leave the rest.  CLOG
> maximum would rise from 512 MiB to 768 MiB.  Would that be worthwhile?

Hmm.  I'm not particularly concerned about the disk-space-consumption
angle, but I do wonder about whether we'd be sacrificing the ability to
recover cleanly from a situation that the code does let you get into.
However, as long as we're sure that the system will ultimately reuse/
recycle a not-deleted old segment file without complaint, it's hard to
find much fault with your proposal.  Temporarily wasting some disk
space is a lot more palatable than corrupting data, and these code
paths are necessarily not terribly well tested.  So +1 for more
insurance.

regards, tom lane




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-03-29 Thread Noah Misch
On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Thu, Mar 19, 2020 at 06:04:52PM -0400, Tom Lane wrote:
> >> 3. It feels like the proposed test of cutoff position against both
> >> ends of a segment is a roundabout way of fixing the problem.  I
> >> wonder whether we ought not pass *both* the cutoff and the current
> >> endpoint (latest_page_number) down to the truncation logic, and
> >> have it compare against both of those values.
> 
> > Since latest_page_number can keep changing throughout SlruScanDirectory()
> > execution, that would give a false impression of control.  Better to
> > demonstrate that the xidWrapLimit machinery keeps latest_page_number within
> > acceptable constraints than to ascribe significance to a comparison with a
> > stale latest_page_number.
> 
> Perhaps.  I'm prepared to accept that line of argument so far as the clog
> SLRU goes, but I'm not convinced that the other SLRUs have equally robust
> defenses against advancing too far.  So on the whole I'd rather that the
> SLRU logic handled this issue strictly on the basis of what it knows,
> without assumptions about what calling code may be doing.  Still, maybe
> we only really care about the risk for the clog SLRU?

PerformOffsetsTruncation() is the most at-risk, since a single VACUUM could
burn millions of multixacts via FreezeMultiXactId() calls.  (To make that
happen in single-user mode, I suspect one could use prepared transactions as
active lockers and/or in-progress updaters.)  I'm not concerned about other
SLRUs.  TruncateCommitTs() moves in lockstep with TruncateCLOG().  The other
SimpleLruTruncate() callers handle data that becomes obsolete at every
postmaster restart.

> > Exactly.
> > https://docs.google.com/drawings/d/1xRTbQ4DVyP5wI1Ujm_gmmY-cC8KKCjahEtsU_o0fC7I
> > uses your octagon to show the behaviors before and after this patch.
> 
> Cool, thanks for drafting that up.  (My original sketch was not of
> publishable quality ;-).)  To clarify, the upper annotations probably
> ought to read "nextXid <= xidWrapLimit"?

It diagrams the scenario of nextXid reaching xidWrapLimit, so the green dot
represents both values.

> And "cutoffPage" ought
> to be affixed to the orange dot at lower right of the center image?

No; oldestXact and cutoffPage have the same position in that diagram, because
the patch causes the cutoffPage variable to denote the page that contains
oldestXact.  I've now added an orange dot to show that.

> I agree that this diagram depicts why we have a problem right now,
> and the right-hand image shows what we want to have happen.
> What's a little less clear is whether the proposed patch achieves
> that effect.
> 
> In particular, after studying this awhile, it seems like removal
> of the initial "cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT"
> adjustment isn't really affecting anything.

True.  The set of unlink() calls needs to be the same for oldestXact in the
first page of a segment, in the last page, or in some interior page.  Removing
the rounding neither helps nor hurts correctness.

> So I think what we're actually trying to accomplish here is to
> ensure that instead of deleting up to half of the SLRU space
> before the cutoff, we delete up to half-less-one-segment.
> Maybe it should be half-less-two-segments, just to provide some
> cushion against edge cases.  Reading the first comment in
> SetTransactionIdLimit makes one not want to trust too much in
> arguments based on the exact value of xidWrapLimit, while for
> the other SLRUs it was already unclear whether the edge cases
> were exactly right.

That could be interesting insurance.  While it would be sad for us to miss an
edge case and print "must be vacuumed within 2 transactions" when wrap has
already happened, reaching that message implies the DBA burned ~1M XIDs, all
in single-user mode.  More plausible is FreezeMultiXactId() overrunning the
limit by tens of segments.  Hence, if we do buy this insurance, let's skip far
more segments.  For example, instead of unlinking segments representing up to
2^31 past XIDs, we could divide that into an upper half that we unlink and a
lower half.  The lower half will stay in place; eventually, XID consumption
will overwrite it.  Truncation behavior won't change until the region of CLOG
for pre-oldestXact XIDs exceeds 256 MiB.  Beyond that threshold,
vac_truncate_clog() will unlink the upper 256 MiB and leave the rest.  CLOG
maximum would rise from 512 MiB to 768 MiB.  Would that be worthwhile?
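Those sizes follow from CLOG's two status bits per XID; a quick arithmetic check (plain C, not PostgreSQL code):

```c
#include <assert.h>
#include <stdint.h>

#define XID_SPACE (UINT64_C(1) << 32)	/* 2^32 transaction IDs */
#define CLOG_BITS_PER_XACT UINT64_C(2)	/* status bits stored per XID */
#define MIB (UINT64_C(1) << 20)

/* bytes of CLOG needed to cover a given number of XIDs */
static uint64_t
clog_bytes(uint64_t xids)
{
	return xids * CLOG_BITS_PER_XACT / 8;
}
```

The whole XID space maps to 1 GiB of CLOG; half the space (today's effective maximum) is 512 MiB, and keeping the lower half of the past adds another 256 MiB, giving the 768 MiB figure above.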




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-03-25 Thread Tom Lane
Noah Misch  writes:
> On Thu, Mar 19, 2020 at 06:04:52PM -0400, Tom Lane wrote:
>> 1. It seems like this discussion is conflating two different issues.

> When the newest XID and the oldest XID fall in "opposite" segments in the XID
> space, we must not unlink the segment containing the newest XID.  That is the
> chief goal at present.

Got it.  Thanks for clarifying the scope of the patch.

>> 3. It feels like the proposed test of cutoff position against both
>> ends of a segment is a roundabout way of fixing the problem.  I
>> wonder whether we ought not pass *both* the cutoff and the current
>> endpoint (latest_page_number) down to the truncation logic, and
>> have it compare against both of those values.

> Since latest_page_number can keep changing throughout SlruScanDirectory()
> execution, that would give a false impression of control.  Better to
> demonstrate that the xidWrapLimit machinery keeps latest_page_number within
> acceptable constraints than to ascribe significance to a comparison with a
> stale latest_page_number.

Perhaps.  I'm prepared to accept that line of argument so far as the clog
SLRU goes, but I'm not convinced that the other SLRUs have equally robust
defenses against advancing too far.  So on the whole I'd rather that the
SLRU logic handled this issue strictly on the basis of what it knows,
without assumptions about what calling code may be doing.  Still, maybe
we only really care about the risk for the clog SLRU?

>> To try to clarify this in my head, I thought about an image of
>> the modular XID space as an octagon, where each side would correspond to
>> a segment file if we chose numbers such that there were only 8 possible
>> segment files.

> Exactly.
> https://docs.google.com/drawings/d/1xRTbQ4DVyP5wI1Ujm_gmmY-cC8KKCjahEtsU_o0fC7I
> uses your octagon to show the behaviors before and after this patch.

Cool, thanks for drafting that up.  (My original sketch was not of
publishable quality ;-).)  To clarify, the upper annotations probably
ought to read "nextXid <= xidWrapLimit"?  And "cutoffPage" ought
to be affixed to the orange dot at lower right of the center image?

I agree that this diagram depicts why we have a problem right now,
and the right-hand image shows what we want to have happen.
What's a little less clear is whether the proposed patch achieves
that effect.

In particular, after studying this awhile, it seems like removal
of the initial "cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT"
adjustment isn't really affecting anything.  It's already the case
that just by allowing oldestXact to get rounded back to an SLRU page
boundary, we've created some daylight between oldestXact and the
cutoff point.  Rounding back further within the same SLRU segment
changes nothing.  (For example, suppose that oldestXact is already
within the oldest page of its SLRU segment.  Then either rounding
rule has the same effect.  But there's still a little bit of room for
xidWrapLimit to be in the opposite SLRU segment, allowing trouble.)

So I think what we're actually trying to accomplish here is to
ensure that instead of deleting up to half of the SLRU space
before the cutoff, we delete up to half-less-one-segment.
Maybe it should be half-less-two-segments, just to provide some
cushion against edge cases.  Reading the first comment in
SetTransactionIdLimit makes one not want to trust too much in
arguments based on the exact value of xidWrapLimit, while for
the other SLRUs it was already unclear whether the edge cases
were exactly right.

In any case, it feels like the specific solution you have here,
of testing both ends of the segment, is a roundabout way of
providing that one-segment slop; and it doesn't help if we decide
we need two-segment slop.  Can we write the test in a way that
explicitly provides N segments of slop?

regards, tom lane
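One hypothetical way to write such an N-segments-of-slop test (names invented here; the committed fix instead tightened the PagePrecedes specification): compare segment numbers in modular space and refuse to delete anything within n_slop segments of the halfway point.  The helper assumes seg_space is a power of two dividing 2^32.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: with seg_space possible segment numbers (wrapping),
 * delete a candidate only if it trails the cutoff segment, and by less
 * than half the cycle minus n_slop segments of cushion.
 */
static bool
SegmentDeletable(uint32_t seg, uint32_t cutoff_seg,
				 uint32_t seg_space, uint32_t n_slop)
{
	/* how far seg trails cutoff_seg; valid when seg_space divides 2^32 */
	uint32_t	behind = (cutoff_seg - seg) % seg_space;

	return behind != 0 && behind < seg_space / 2 - n_slop;
}
```

On the eight-segment octagon discussed later in the thread, with the cutoff in segment 0: segment 4 (exactly opposite) is never deletable, and two segments of slop additionally preserve segments 5 and 6.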




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-03-22 Thread Noah Misch
On Thu, Mar 19, 2020 at 06:04:52PM -0400, Tom Lane wrote:
> Yeah, this patch has been waiting in the queue for way too long :-(.

Thanks for reviewing.

> I spent some time studying this, and I have a few comments/questions:
> 
> 1. It seems like this discussion is conflating two different issues.
> The original complaint was about a seemingly-bogus log message "could
> not truncate directory "pg_xact": apparent wraparound".  Your theory
> about that, IIUC, is that SimpleLruTruncate's initial round-back of
> cutoffPage to a segment boundary moved us from a state where
> shared->latest_page_number doesn't logically precede the cutoffPage to
> one where it does, triggering the message.  So removing the initial
> round-back, and then doing whatever's needful to compensate for that in
> the later processing, is a reasonable fix to prevent the bogus warning.
> However, you're also discussing whether or not an SLRU segment file that
> is close to the wraparound boundary should get removed or not.  As far

When the newest XID and the oldest XID fall in "opposite" segments in the XID
space, we must not unlink the segment containing the newest XID.  That is the
chief goal at present.

> as I can see that's 100% independent of issuance of the log message, no?

Perhaps confusing is that the first message of the thread and the subject line
contain wrong claims, which I corrected in the 2019-02-13 message[1].  Due to
point (2) in [1], it's essential to make the "apparent wraparound" message a
can't-happen event.  Hence, I'm not looking to improve the message or its
firing conditions.  I want to fix the excess segment deletion, at which point
the message will become unreachable except under single-user mode.

> 2. I wonder whether we have an issue even with rounding back to the
> SLRU page boundary, as is done by each caller before we ever get to
> SimpleLruTruncate.  I'm pretty sure that the actual anti-wraparound
> protections are all precise to the XID level, so that there's some
> daylight between what SimpleLruTruncate is told and the range of
> data that the higher-level code thinks is of interest.  Maybe we
> should restructure things so that we keep the original cutoff XID
> (or equivalent) all the way through, and compare the start/end
> positions of a segment file directly to that.

Currently, slru.c knows nothing about the division of pages into records.
Hmm.  To keep oldestXact all the way through, I suppose the PagePrecedes
callback could become one or more record-oriented (e.g. XID-oriented)
callbacks.  The current scheme just relies on TransactionIdToPage() in
TruncateCLOG().  If TransactionIdToPage() had a bug, all sorts of CLOG lookups
would do wrong things.  Hence, I think today's scheme is tougher to get wrong.
Do you see it differently?
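For reference, TransactionIdToPage() is plain integer division; the constants below match clog.c with the default 8 KiB block size (reproduced standalone for illustration):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* 2 status bits per XID -> 4 XIDs per byte -> 32768 XIDs per 8 KiB page */
#define BLCKSZ 8192
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
```

A bug in this mapping would misdirect every CLOG status lookup, not just truncation, which is the sense in which the current scheme is hard to get wrong silently.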

> 3. It feels like the proposed test of cutoff position against both
> ends of a segment is a roundabout way of fixing the problem.  I
> wonder whether we ought not pass *both* the cutoff and the current
> endpoint (latest_page_number) down to the truncation logic, and
> have it compare against both of those values.

Since latest_page_number can keep changing throughout SlruScanDirectory()
execution, that would give a false impression of control.  Better to
demonstrate that the xidWrapLimit machinery keeps latest_page_number within
acceptable constraints than to ascribe significance to a comparison with a
stale latest_page_number.

> To try to clarify this in my head, I thought about an image of
> the modular XID space as an octagon, where each side would correspond to
> a segment file if we chose numbers such that there were only 8 possible
> segment files.  Let's say that nextXID is somewhere in the bottommost
> octagon segment.  The oldest possible value for the truncation cutoff
> XID is a bit less than "halfway around" from nextXID; so it could be
> in the topmost octagon segment, if the minimum permitted daylight-
> till-wraparound is less than the SLRU segment size (which it is).
> Then, if we round the cutoff XID "back" to a segment boundary, most
> of the current (bottommost) segment is now less than halfway around
> from that cutoff point, and in particular the current segment's
> starting page is exactly halfway around.  Because of the way that
> TransactionIdPrecedes works, the current segment will be considered to
> precede that cutoff point (the int32 difference comes out as exactly
> 2^31 which is negative).  Boom, data loss, because we'll decide the
> current segment is removable.

Exactly.
https://docs.google.com/drawings/d/1xRTbQ4DVyP5wI1Ujm_gmmY-cC8KKCjahEtsU_o0fC7I
uses your octagon to show the behaviors before and after this patch.

> I think that your proposed patch does fix this, but I'm not quite
> sold that the corner cases (where the cutoff XID is itself exactly
> at a page boundary) are right.

That's a good thing to worry about.  More specifically, I think the edge case
to check is when oldestXact is the last XID of a _segment_.  That 

Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-03-19 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jan 05, 2020 at 01:33:55AM +0100, Tomas Vondra wrote:
>> It's a bit unfortunate that a patch for a data corruption / loss issue
>> (even if a low-probability one) fell through multiple commitfests.

> Thanks for investing in steps to fix that.

Yeah, this patch has been waiting in the queue for way too long :-(.

I spent some time studying this, and I have a few comments/questions:

1. It seems like this discussion is conflating two different issues.
The original complaint was about a seemingly-bogus log message "could
not truncate directory "pg_xact": apparent wraparound".  Your theory
about that, IIUC, is that SimpleLruTruncate's initial round-back of
cutoffPage to a segment boundary moved us from a state where
shared->latest_page_number doesn't logically precede the cutoffPage to
one where it does, triggering the message.  So removing the initial
round-back, and then doing whatever's needful to compensate for that in
the later processing, is a reasonable fix to prevent the bogus warning.
However, you're also discussing whether or not an SLRU segment file that
is close to the wraparound boundary should get removed or not.  As far
as I can see that's 100% independent of issuance of the log message, no?
This might not affect the code substance of the patch at all; but it
seems like we need to be clear about it in our discussion, and maybe the
comments need to change too.

2. I wonder whether we have an issue even with rounding back to the
SLRU page boundary, as is done by each caller before we ever get to
SimpleLruTruncate.  I'm pretty sure that the actual anti-wraparound
protections are all precise to the XID level, so that there's some
daylight between what SimpleLruTruncate is told and the range of
data that the higher-level code thinks is of interest.  Maybe we
should restructure things so that we keep the original cutoff XID
(or equivalent) all the way through, and compare the start/end
positions of a segment file directly to that.

3. It feels like the proposed test of cutoff position against both
ends of a segment is a roundabout way of fixing the problem.  I
wonder whether we ought not pass *both* the cutoff and the current
endpoint (latest_page_number) down to the truncation logic, and
have it compare against both of those values.

To try to clarify this in my head, I thought about an image of
the modular XID space as an octagon, where each side would correspond to
a segment file if we chose numbers such that there were only 8 possible
segment files.  Let's say that nextXID is somewhere in the bottommost
octagon segment.  The oldest possible value for the truncation cutoff
XID is a bit less than "halfway around" from nextXID; so it could be
in the topmost octagon segment, if the minimum permitted daylight-
till-wraparound is less than the SLRU segment size (which it is).
Then, if we round the cutoff XID "back" to a segment boundary, most
of the current (bottommost) segment is now less than halfway around
from that cutoff point, and in particular the current segment's
starting page is exactly halfway around.  Because of the way that
TransactionIdPrecedes works, the current segment will be considered to
precede that cutoff point (the int32 difference comes out as exactly
2^31 which is negative).  Boom, data loss, because we'll decide the
current segment is removable.

I think that your proposed patch does fix this, but I'm not quite
sold that the corner cases (where the cutoff XID is itself exactly
at a page boundary) are right.  In any case, I think it'd be more
robust to be comparing explicitly against a notion of the latest
in-use page number, instead of backing into it from an assumption
that the cutoff XID itself is less than halfway around.

I wonder if we ought to dodge the problem by having a higher minimum
value of the required daylight-before-wraparound, so that the cutoff
point couldn't be in the diametrically-opposite-current segment but
would have to be at least one segment before that.  In the end,
I believe that all of this logic was written under an assumption
that we should never get into a situation where we are so close
to the wraparound threshold that considerations like these would
manifest.  Maybe we can get it right, but I don't have huge
faith in it.

It also bothers me that some of the callers of SimpleLruTruncate
have explicit one-count backoffs of the cutoff point and others
do not.  There's no obvious reason for the difference, so I wonder
if that isn't something we should have across-the-board, or else
adjust SimpleLruTruncate to do the equivalent thing internally.

I haven't thought much yet about your second point about race
conditions arising from nextXID possibly moving before we
finish the deletion scan.  Maybe we could integrate a fix for
that issue, along the lines of (1) see an SLRU segment file,
(2) determine that it appears to precede the cutoff XID, if so
(3) acquire the control lock and fetch latest_page_number,
compare against 

Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-01-04 Thread Noah Misch
On Sun, Jan 05, 2020 at 01:33:55AM +0100, Tomas Vondra wrote:
> On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> >a. Fix the rounding in SimpleLruTruncate().  (The patch I posted upthread is
> >  wrong; I will correct it in a separate message.)
> >
> >b. Arrange so only one backend runs vac_truncate_clog() at a time.  Other than
> >  AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
> >  checkpoint, or in the startup process.  Hence, also arrange for only one
> >  backend to call SimpleLruTruncate(AsyncCtl) at a time.
> >
> >c. Test "apparent wraparound" before writing WAL, and don't WAL-log
> >  truncations that "apparent wraparound" forces us to skip.
> >
> >d. Hold the ControlLock for the entirety of SimpleLruTruncate().  This removes
> >  the TOCTOU race condition, but TransactionIdDidCommit() and other key
> >  operations would be waiting behind filesystem I/O.

> >With both (a) and (b), the only way I'd know to reach the "apparent
> >wraparound" message is to restart in single-user mode and burn XIDs to the
> >point of bona fide wraparound.  Hence, I propose to back-patch (a) and (b),
> >and I propose (c) for HEAD only.  I don't want (d), which threatens
> >performance too much.  I would rather not have (e), because I expect it's more
> >complex than (b) and fixes strictly less than (b) fixes.

> Seems reasonable, although I wonder how much more expensive would just
> doing (d) be. It seems by far the least complex solution, and it moves
> "just" the SlruScanDirectory() call before the lock. It's true it adds
> I/O requests, OTOH it's just unlink() without fsync() and I'd expect the
> number of files to be relatively low. Plus we already do SimpleLruWaitIO
> and SlruInternalWritePage in the loop.

Trivial read-only transactions often need CLogControlLock to check tuple
visibility.  If an unlink() takes 1s, stalling read-only transactions for that
1s is a big problem.  SimpleLruWaitIO() and SlruInternalWritePage() release
the control lock during I/O, then re-acquire it.  (Moreover,
SimpleLruTruncate() rarely calls them.  Calling them implies a page old enough
to truncate was modified after the most recent checkpoint.)

> BTW isn't that an issue that SlruInternalDeleteSegment does not do any
> fsync calls after unlinking the segments? If the system crashes/reboots
> before this becomes persistent (i.e. some of the segments reappear,
> won't that cause a problem)?

I think not; we could turn SlruInternalDeleteSegment() into a no-op, and the
only SQL-visible consequence would be extra disk usage.  CheckPoint fields
tell the server what region of slru data is meaningful, and segments outside
that range merely waste disk space.  (If that's not true and can't be made
true, we'd also need to stop ignoring the unlink() return value.)

> It's a bit unfortunate that a patch for a data corruption / loss issue
> (even if a low-probability one) fell through multiple commitfests.

Thanks for investing in steps to fix that.




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-01-04 Thread Tomas Vondra

On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:

On Sat, Feb 02, 2019 at 03:38:22AM -0500, Noah Misch wrote:

The main consequence is the false alarm.


That conclusion was incorrect.  On further study, I was able to reproduce data
loss via either of two weaknesses in the "apparent wraparound" test:

1. The result of the test is valid only until we release the SLRU ControlLock,
  which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
  segments for deletion.  Once we release that lock, latest_page_number can
  advance.  This creates a TOCTOU race condition, allowing excess deletion:

  [local] test=# table trunc_clog_concurrency ;
  ERROR:  could not access status of transaction 2149484247
  DETAIL:  Could not open file "pg_xact/0801": No such file or directory.

2. By the time the "apparent wraparound" test fires, we've already WAL-logged
  the truncation.  clog_redo() suppresses the "apparent wraparound" test,
  then deletes too much.  Startup then fails:

   881997 2019-02-10 02:53:32.105 GMT FATAL:  could not access status of transaction 708112327
   881997 2019-02-10 02:53:32.105 GMT DETAIL:  Could not open file "pg_xact/02A3": No such file or directory.
   881855 2019-02-10 02:53:32.107 GMT LOG:  startup process (PID 881997) exited with exit code 1


Fixes are available:

a. Fix the rounding in SimpleLruTruncate().  (The patch I posted upthread is
  wrong; I will correct it in a separate message.)

b. Arrange so only one backend runs vac_truncate_clog() at a time.  Other than
  AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
  checkpoint, or in the startup process.  Hence, also arrange for only one
  backend to call SimpleLruTruncate(AsyncCtl) at a time.

c. Test "apparent wraparound" before writing WAL, and don't WAL-log
  truncations that "apparent wraparound" forces us to skip.

d. Hold the ControlLock for the entirety of SimpleLruTruncate().  This removes
  the TOCTOU race condition, but TransactionIdDidCommit() and other key
  operations would be waiting behind filesystem I/O.

e. Have the SLRU track a "low cutoff" for an ongoing truncation.  Initially,
  the low cutoff is the page furthest in the past relative to cutoffPage (the
  "high cutoff").  If SimpleLruZeroPage() wishes to use a page in the
  truncation range, it would acquire an LWLock and increment the low cutoff.
  Before unlinking any segment, SlruScanDirCbDeleteCutoff() would take the
  same LWLock and recheck the segment against the latest low cutoff.

With both (a) and (b), the only way I'd know to reach the "apparent
wraparound" message is to restart in single-user mode and burn XIDs to the
point of bona fide wraparound.  Hence, I propose to back-patch (a) and (b),
and I propose (c) for HEAD only.  I don't want (d), which threatens
performance too much.  I would rather not have (e), because I expect it's more
complex than (b) and fixes strictly less than (b) fixes.

Can you see a way to improve on that plan?  Can you see other bugs of this
nature that this plan does not fix?



Seems reasonable, although I wonder how much more expensive would just
doing (d) be. It seems by far the least complex solution, and it moves
"just" the SlruScanDirectory() call before the lock. It's true it adds
I/O requests, OTOH it's just unlink() without fsync() and I'd expect the
number of files to be relatively low. Plus we already do SimpleLruWaitIO
and SlruInternalWritePage in the loop.

BTW isn't that an issue that SlruInternalDeleteSegment does not do any
fsync calls after unlinking the segments? If the system crashes/reboots
before this becomes persistent (i.e. some of the segments reappear,
won't that cause a problem)?


It's a bit unfortunate that a patch for a data corruption / loss issue
(even if a low-probability one) fell through multiple commitfests.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2019-07-24 Thread Noah Misch
On Wed, Jul 24, 2019 at 05:27:18PM +0900, Kyotaro Horiguchi wrote:
> Sorry in advance for link-breaking message forced by gmail..

Using the archives page "Resend email" link avoids that.

> https://www.postgresql.org/message-id/flat/20190202083822.gc32...@gust.leadboat.com
> 
> > 1. The result of the test is valid only until we release the SLRU 
> > ControlLock,
> >which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to 
> > evaluate
> >segments for deletion.  Once we release that lock, latest_page_number can
> >advance.  This creates a TOCTOU race condition, allowing excess deletion:
> >
> >
> >[local] test=# table trunc_clog_concurrency ;
> >ERROR:  could not access status of transaction 2149484247
> >DETAIL:  Could not open file "pg_xact/0801": No such file or directory.
> 
> It seems like some other vacuum process saw larger cutoff page?

No, just one VACUUM suffices.

> If I'm
> not missing something, the missing page is no longer the
> "recently-populated" page at the time (As I understand it as the last
> page that holds valid data). Couldn't we just ignore ENOENT there?

The server reported this error while attempting to read CLOG to determine
whether a tuple's xmin committed or aborted.  That ENOENT means the relevant
CLOG page is not available.  To ignore that ENOENT, the server would need to
guess whether to consider the xmin committed or consider it aborted.  So, no,
we can't just ignore the ENOENT.




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2019-07-24 Thread Kyotaro Horiguchi
Sorry in advance for link-breaking message forced by gmail..

https://www.postgresql.org/message-id/flat/20190202083822.gc32...@gust.leadboat.com

> 1. The result of the test is valid only until we release the SLRU ControlLock,
>which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
>segments for deletion.  Once we release that lock, latest_page_number can
>advance.  This creates a TOCTOU race condition, allowing excess deletion:
>
>
>[local] test=# table trunc_clog_concurrency ;
>ERROR:  could not access status of transaction 2149484247
>DETAIL:  Could not open file "pg_xact/0801": No such file or directory.

It seems like some other vacuum process saw larger cutoff page? If I'm
not missing something, the missing page is no longer the
"recently-populated" page at the time (As I understand it as the last
page that holds valid data). Couldn't we just ignore ENOENT there?

> 2. By the time the "apparent wraparound" test fires, we've already WAL-logged
>the truncation.  clog_redo() suppresses the "apparent wraparound" test,
>then deletes too much.  Startup then fails:

I agree that if truncation is skipped after issuing log, it will
lead to data loss at the next recovery. But the following log...:

>881997 2019-02-10 02:53:32.105 GMT FATAL:  could not access status of transaction 708112327
>881997 2019-02-10 02:53:32.105 GMT DETAIL:  Could not open file "pg_xact/02A3": No such file or directory.
>881855 2019-02-10 02:53:32.107 GMT LOG:  startup process (PID 881997) exited with exit code 1

If it came from the same reason as 1, the log is simply ignorable, so
recovery stopping by the error is unacceptable, but the ENOENT is just
ignorable for the same reason.

As the result, I agree to (a) (fix rounding), and (c) (test
wrap-around before writing WAL) but I'm not sure for others. And
additional fix for ignorable ENOENT is needed.

What do you think about this?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2019-02-16 Thread Noah Misch
On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> On further study, I was able to reproduce data loss

> Fixes are available:
> 
> a. Fix the rounding in SimpleLruTruncate().  (The patch I posted upthread is
>wrong; I will correct it in a separate message.)

Here's a corrected version.  I now delete a segment only if both its first
page and its last page are considered to precede the cutoff; see the new
comment at SlruMayDeleteSegment().
commit 09393a1 (HEAD)
Author: Noah Misch 
AuthorDate: Sat Feb 16 20:02:51 2019 -0800
Commit: Noah Misch 
CommitDate: Sat Feb 16 20:02:51 2019 -0800

Don't round SimpleLruTruncate() cutoffPage values.

Every core SLRU wraps around.  The rounding did not account for that; in
rare cases, it permitted deletion of the most recently-populated page of
SLRU data.  This closes a rare opportunity for data loss, which
manifested as "could not access status of transaction" errors.  If a
user's physical replication primary logged ": apparent wraparound"
messages, the user should rebuild that primary's standbys regardless of
symptoms.  At less risk is a cluster having emitted "not accepting
commands" errors or "must be vacuumed" warnings at some point.  One can
test a cluster for this data loss by running VACUUM FREEZE in every
database.  Back-patch to 9.4 (all supported versions).

Discussion: https://postgr.es/m/20190202083822.gc32...@gust.leadboat.com

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352..71e29b9 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1171,11 +1171,6 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
int slotno;
 
/*
-* The cutoff point is the start of the segment containing cutoffPage.
-*/
-   cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT;
-
-   /*
 * Scan shared memory and remove any pages preceding the cutoff page, to
 * ensure we won't rewrite them later.  (Since this is normally called in
 * or just after a checkpoint, any dirty pages should have been flushed
@@ -1221,8 +1216,11 @@ restart:;
 * Hmm, we have (or may have) I/O operations acting on the page, so
 * we've got to wait for them to finish and then start again. This is
 * the same logic as in SlruSelectLRUPage.  (XXX if page is dirty,
-* wouldn't it be OK to just discard it without writing it?  For now,
-* keep the logic the same as it was.)
+* wouldn't it be OK to just discard it without writing it?
+* SlruMayDeleteSegment() uses a stricter qualification, so we might
+* not delete this page in the end; even if we don't delete it, we
+* won't have cause to read its data again.  For now, keep the logic
+* the same as it was.)
 */
if (shared->page_status[slotno] == SLRU_PAGE_VALID)
SlruInternalWritePage(ctl, slotno, NULL);
@@ -1313,18 +1311,40 @@ restart:
 }
 
 /*
+ * Determine whether a segment is okay to delete.
+ *
+ * segpage is the first page of the segment, and cutoffPage is the oldest (in
+ * PagePrecedes order) page in the SLRU containing still-useful data.  Since
+ * every core PagePrecedes callback implements "wrap around", check the
+ * segment's first and last pages:
+ *
+ * first<cutoff  && last<cutoff:  yes
+ * first<cutoff  && last>=cutoff: no; cutoff falls inside this segment
+ * first>=cutoff && last<cutoff:  no; wrapped
+ * first>=cutoff && last>=cutoff: no; every page of this segment is too young
+ */
+static bool
+SlruMayDeleteSegment(SlruCtl ctl, int segpage, int cutoffPage)
+{
+   int seg_last_page = segpage + SLRU_PAGES_PER_SEGMENT - 1;
+
+   Assert(segpage % SLRU_PAGES_PER_SEGMENT == 0);
+
+   return (ctl->PagePrecedes(segpage, cutoffPage) &&
+   ctl->PagePrecedes(seg_last_page, cutoffPage));
+}
+
+/*
  * SlruScanDirectory callback
- * This callback reports true if there's any segment prior to the 
one
- * containing the page passed as "data".
+ * This callback reports true if there's any segment wholly prior 
to the
+ * one containing the page passed as "data".
  */
 bool
 SlruScanDirCbReportPresence(SlruCtl ctl, char *filename, int segpage, void *data)
 {
int cutoffPage = *(int *) data;
 
-   cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT;
-
-   if (ctl->PagePrecedes(segpage, cutoffPage))
+   if (SlruMayDeleteSegment(ctl, segpage, cutoffPage))
return true;/* found one; don't iterate any more */
 
return false;   /* keep going */
@@ -1339,7 +1359,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 {
int cutoffPage = 

Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2019-02-13 Thread Noah Misch
On Sat, Feb 02, 2019 at 03:38:22AM -0500, Noah Misch wrote:
> The main consequence is the false alarm.

That conclusion was incorrect.  On further study, I was able to reproduce data
loss via either of two weaknesses in the "apparent wraparound" test:

1. The result of the test is valid only until we release the SLRU ControlLock,
   which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
   segments for deletion.  Once we release that lock, latest_page_number can
   advance.  This creates a TOCTOU race condition, allowing excess deletion:

   [local] test=# table trunc_clog_concurrency ;
   ERROR:  could not access status of transaction 2149484247
   DETAIL:  Could not open file "pg_xact/0801": No such file or directory.

2. By the time the "apparent wraparound" test fires, we've already WAL-logged
   the truncation.  clog_redo() suppresses the "apparent wraparound" test,
   then deletes too much.  Startup then fails:

   881997 2019-02-10 02:53:32.105 GMT FATAL:  could not access status of transaction 708112327
   881997 2019-02-10 02:53:32.105 GMT DETAIL:  Could not open file "pg_xact/02A3": No such file or directory.
   881855 2019-02-10 02:53:32.107 GMT LOG:  startup process (PID 881997) exited with exit code 1


Fixes are available:

a. Fix the rounding in SimpleLruTruncate().  (The patch I posted upthread is
   wrong; I will correct it in a separate message.)

b. Arrange so only one backend runs vac_truncate_clog() at a time.  Other than
   AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
   checkpoint, or in the startup process.  Hence, also arrange for only one
   backend to call SimpleLruTruncate(AsyncCtl) at a time.

c. Test "apparent wraparound" before writing WAL, and don't WAL-log
   truncations that "apparent wraparound" forces us to skip.

d. Hold the ControlLock for the entirety of SimpleLruTruncate().  This removes
   the TOCTOU race condition, but TransactionIdDidCommit() and other key
   operations would be waiting behind filesystem I/O.

e. Have the SLRU track a "low cutoff" for an ongoing truncation.  Initially,
   the low cutoff is the page furthest in the past relative to cutoffPage (the
   "high cutoff").  If SimpleLruZeroPage() wishes to use a page in the
   truncation range, it would acquire an LWLock and increment the low cutoff.
   Before unlinking any segment, SlruScanDirCbDeleteCutoff() would take the
   same LWLock and recheck the segment against the latest low cutoff.

With both (a) and (b), the only way I'd know to reach the "apparent
wraparound" message is to restart in single-user mode and burn XIDs to the
point of bona fide wraparound.  Hence, I propose to back-patch (a) and (b),
and I propose (c) for HEAD only.  I don't want (d), which threatens
performance too much.  I would rather not have (e), because I expect it's more
complex than (b) and fixes strictly less than (b) fixes.

Can you see a way to improve on that plan?  Can you see other bugs of this
nature that this plan does not fix?

Thanks,
nm



Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2019-02-10 Thread Noah Misch
On Sat, Feb 02, 2019 at 03:38:22AM -0500, Noah Misch wrote:
> The main consequence is the false alarm.  A prudent DBA will want to react to
> true wraparound, but no such wraparound has occurred.  Also, we temporarily
> waste disk space in pg_xact.  This feels like a recipe for future bugs.  The
> fix I have in mind, attached, is to change instances of
> ctl->PagePrecedes(FIRST_PAGE_OF_SEGMENT, ROUNDED_cutoffPage) to
> ctl->PagePrecedes(LAST_PAGE_OF_SEGMENT, cutoffPage).  I'm inclined not to
> back-patch this; does anyone favor back-patching?

To avoid wasting more of anyone's time: that patch is bad; I'll update this
thread when I have something better.



Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2019-02-02 Thread Noah Misch
While testing an xidStopLimit corner case, I got this:

3656710 2019-01-05 00:05:13.910 GMT LOG:  automatic aggressive vacuum to prevent wraparound of table "test.pg_toast.pg_toast_826": index scans: 0
3656710 2019-01-05 00:05:16.912 GMT LOG:  could not truncate directory "pg_xact": apparent wraparound
3656710 2019-01-05 00:05:16.912 GMT DEBUG:  transaction ID wrap limit is 4294486400, limited by database with OID 1
3656710 2019-01-05 00:05:16.912 GMT WARNING:  database "template1" must be vacuumed within 481499 transactions
3656710 2019-01-05 00:05:16.912 GMT HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.

I think the WARNING was correct about having 481499 XIDs left before
xidWrapLimit, and the spurious "apparent wraparound" arose from this
rounding-down in SimpleLruTruncate():

cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT;
...
/*
 * While we are holding the lock, make an important safety check: the
 * planned cutoff point must be <= the current endpoint page. Otherwise we
 * have already wrapped around, and proceeding with the truncation would
 * risk removing the current segment.
 */
if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
{
LWLockRelease(shared->ControlLock);
ereport(LOG,
(errmsg("could not truncate directory \"%s\": apparent wraparound",
ctl->Dir)));

We round "cutoffPage" to make ctl->PagePrecedes(segpage, cutoffPage) return
false for the segment containing the cutoff page.  CLOGPagePrecedes() (and
most SLRU PagePrecedes methods) implements a circular address space.  Hence,
the rounding also causes ctl->PagePrecedes(segpage, cutoffPage) to return true
for the segment furthest in the future relative to the unrounded cutoffPage
(if it exists).  That's bad.  Such a segment rarely exists, because
xidStopLimit protects 1,000,000 XIDs, and the rounding moves truncation by no
more than (BLCKSZ * CLOG_XACTS_PER_BYTE * SLRU_PAGES_PER_SEGMENT - 1) =
1048575 XIDs.  Thus, I expect to see this problem at 4.9% of xidStopLimit
values.  I expect this is easier to see with multiStopLimit, which protects
only 100 mxid.

The main consequence is the false alarm.  A prudent DBA will want to react to
true wraparound, but no such wraparound has occurred.  Also, we temporarily
waste disk space in pg_xact.  This feels like a recipe for future bugs.  The
fix I have in mind, attached, is to change instances of
ctl->PagePrecedes(FIRST_PAGE_OF_SEGMENT, ROUNDED_cutoffPage) to
ctl->PagePrecedes(LAST_PAGE_OF_SEGMENT, cutoffPage).  I'm inclined not to
back-patch this; does anyone favor back-patching?

Thanks,
nm
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352..843486a 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1171,11 +1171,6 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
 	int			slotno;
 
 	/*
-	 * The cutoff point is the start of the segment containing cutoffPage.
-	 */
-	cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT;
-
-	/*
 	 * Scan shared memory and remove any pages preceding the cutoff page, to
 	 * ensure we won't rewrite them later.  (Since this is normally called in
 	 * or just after a checkpoint, any dirty pages should have been flushed
@@ -1320,11 +1315,10 @@ restart:
 bool
 SlruScanDirCbReportPresence(SlruCtl ctl, char *filename, int segpage, void *data)
 {
+	int			seg_last_page = segpage + SLRU_PAGES_PER_SEGMENT - 1;
 	int			cutoffPage = *(int *) data;
 
-	cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT;
-
-	if (ctl->PagePrecedes(segpage, cutoffPage))
+	if (ctl->PagePrecedes(seg_last_page, cutoffPage))
 		return true;			/* found one; don't iterate any more */
 
 	return false;/* keep going */
@@ -1337,9 +1331,10 @@ SlruScanDirCbReportPresence(SlruCtl ctl, char *filename, int segpage, void *data
 static bool
 SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 {
+	int			seg_last_page = segpage + SLRU_PAGES_PER_SEGMENT - 1;
 	int			cutoffPage = *(int *) data;
 
-	if (ctl->PagePrecedes(segpage, cutoffPage))
+	if (ctl->PagePrecedes(seg_last_page, cutoffPage))
 		SlruInternalDeleteSegment(ctl, filename);
 
 	return false;/* keep going */