Re: [HACKERS] WAL logging problem in 9.4.3?

2020-04-06 Thread Noah Misch
On Mon, Apr 06, 2020 at 09:46:31AM +0900, Kyotaro Horiguchi wrote:
> At Sat, 4 Apr 2020 15:32:12 -0700, Noah Misch  wrote in 
> > On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> > > Shouldn't the CF entry get closed?
> > 
> > Once the buildfarm is clean for a day, sure.  The buildfarm has already
> > revealed a missing perl2host call.
> 
> Thank you for (re-) committing this and the following fix. I hope this
> doesn't bring in another failure.

I have closed the CF entry.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-04-05%2000%3A00%3A27
happened, but I doubt it is related.  A wait_for_catchup that usually takes
<1s instead timed out after 397s.  I can't reproduce it.  In the past, another
animal on the same machine had the same failure:

  sysname  │      snapshot       │ branch │ bfurl
───────────┼─────────────────────┼────────┼──────────────────────────────────────────────────
 bowerbird │ 2019-11-17 15:22:42 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-11-17%2015%3A22%3A42
 bowerbird │ 2020-01-10 17:30:49 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2020-01-10%2017%3A30%3A49




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-04-05 Thread Kyotaro Horiguchi
At Sat, 4 Apr 2020 15:32:12 -0700, Noah Misch  wrote in 
> On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> > Shouldn't the CF entry get closed?
> 
> Once the buildfarm is clean for a day, sure.  The buildfarm has already
> revealed a missing perl2host call.

Thank you for (re-) committing this and the following fix. I hope this
doesn't bring in another failure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-04-04 Thread Noah Misch
On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> Shouldn't the CF entry get closed?

Once the buildfarm is clean for a day, sure.  The buildfarm has already
revealed a missing perl2host call.




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-04-04 Thread Tom Lane
Robert Haas  writes:
> On Wed, Apr 1, 2020 at 11:51 PM Noah Misch  wrote:
>> I've translated the non-vote comments into estimated votes of -0.3, -0.6,
>> -0.4, +0.5, and -0.3.  Hence, I revoke the plan to back-patch.

> FWIW, I also think that it would be better not to back-patch.

FWIW, I also concur with not back-patching; the risk/reward ratio
does not look favorable.  Maybe later.

> Last but not least, I would like to join with others in expressing my
> thanks to you for your hard work on this problem.

+1 on that, too.

Shouldn't the CF entry get closed?

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-04-02 Thread Robert Haas
On Wed, Apr 1, 2020 at 11:51 PM Noah Misch  wrote:
> I've translated the non-vote comments into estimated votes of -0.3, -0.6,
> -0.4, +0.5, and -0.3.  Hence, I revoke the plan to back-patch.

FWIW, I also think that it would be better not to back-patch. The risk
of back-patching is that this will break things, whereas the risk of
not back-patching is that we will harm people who are affected by this
bug for a longer period of time than would otherwise be the case.
Because this patch is complex, the risk of breaking things seems
higher than normal. On the other hand, the number of users adversely
affected by the bug appears to be relatively low. Taken together,
these factors persuade me that we should not back-patch at this time.

It is possible that in the future things may look different. In the
happy event that this patch causes no more problems following commit,
while at the same time we have more complaints about the underlying
problem, we can make a decision to back-patch at a later time. This
brings me to another point: because this patch changes the WAL format,
a straight revert will be impossible once a release has occurred.
Therefore, if we hold off on back-patching for now and later decide
that we erred, we can proceed at that time and it will probably not be
much harder than it would be to do it now. On the other hand, if we
decide to back-patch now and later decide that we have erred, we will
have additional engineering work to do to cater to people who have
already installed the version containing the back-patched fix and now
need to upgrade again. Perhaps the WAL format changes are simple
enough that this isn't likely to be a huge issue even if it happens,
but it does seem better to avoid the chance that it might. A further
factor is that releases which break WAL compatibility are undesirable,
and should only be undertaken when necessary.

Last but not least, I would like to join with others in expressing my
thanks to you for your hard work on this problem. While the process of
developing a fix has not been without bumps, few people would have had
the time, patience, diligence, and skill to take this effort as far as
you have. Kyotaro Horiguchi and others likewise deserve credit for all
of the many hours that they have put into this work. The entire
PostgreSQL community owes all of you a debt of gratitude, and you have
my thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-04-01 Thread Noah Misch
On Mon, Mar 30, 2020 at 11:37:57PM -0700, Andres Freund wrote:
> On 2020-03-30 23:28:54 -0700, Noah Misch wrote:
> > On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> > > On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > > > I think attached v41nm is ready for commit.  Would anyone like to vote 
> > > > against
> > > > back-patching this?  It's hard to justify lack of back-patch for a 
> > > > data-loss
> > > > bug, but this is atypically invasive.  (I'm repeating the question, 
> > > > since some
> > > > folks missed my 2020-02-18 question.)  Otherwise, I'll push this on 
> > > > Saturday.
> > > 
> > > The invasiveness of the patch is a concern.  Have you considered a
> > > different strategy?  For example, we are soon going to be in beta for
> > > 13, so you could consider committing the patch only on HEAD first.
> > > If there are issues to take care of, you can then leverage the beta
> > > testing to address any issues found.  Finally, once some dust has
> > > settled on the concept and we have gained enough confidence, we could
> > > consider a back-patch.
> > 
> > No.  Does anyone favor this proposal more than back-patching normally?
> 
> I have not reviewed the patch, so I don't have a good feeling for its
> riskiness. But it does sound fairly invasive. Given that we've lived
> with this issue for many years by now, and that the rate of incidents
> seems to have been fairly low, I think living with the issue for a bit
> longer to gain confidence might be a good choice.  But I'd not push back
> if you, being much more informed, think the risk/reward balance favors
> immediate backpatching.

I've translated the non-vote comments into estimated votes of -0.3, -0.6,
-0.4, +0.5, and -0.3.  Hence, I revoke the plan to back-patch.
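
(Summing those estimates: -0.3 - 0.6 - 0.4 + 0.5 - 0.3 = -1.1, a clear net negative.)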




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-31 Thread Andres Freund
Hi,

On 2020-03-30 23:28:54 -0700, Noah Misch wrote:
> On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> > On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > > I think attached v41nm is ready for commit.  Would anyone like to vote 
> > > against
> > > back-patching this?  It's hard to justify lack of back-patch for a 
> > > data-loss
> > > bug, but this is atypically invasive.  (I'm repeating the question, since 
> > > some
> > > folks missed my 2020-02-18 question.)  Otherwise, I'll push this on 
> > > Saturday.
> > 
> > The invasiveness of the patch is a concern.  Have you considered a
> > different strategy?  For example, we are soon going to be in beta for
> > 13, so you could consider committing the patch only on HEAD first.
> > If there are issues to take care of, you can then leverage the beta
> > testing to address any issues found.  Finally, once some dust has
> > settled on the concept and we have gained enough confidence, we could
> > consider a back-patch.
> 
> No.  Does anyone favor this proposal more than back-patching normally?

I have not reviewed the patch, so I don't have a good feeling for its
riskiness. But it does sound fairly invasive. Given that we've lived
with this issue for many years by now, and that the rate of incidents
seems to have been fairly low, I think living with the issue for a bit
longer to gain confidence might be a good choice.  But I'd not push back
if you, being much more informed, think the risk/reward balance favors
immediate backpatching.

Greetings,

Andres Freund




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-31 Thread Noah Misch
On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > I think attached v41nm is ready for commit.  Would anyone like to vote 
> > against
> > back-patching this?  It's hard to justify lack of back-patch for a data-loss
> > bug, but this is atypically invasive.  (I'm repeating the question, since 
> > some
> > folks missed my 2020-02-18 question.)  Otherwise, I'll push this on 
> > Saturday.
> 
> The invasiveness of the patch is a concern.  Have you considered a
> different strategy?  For example, we are soon going to be in beta for
> 13, so you could consider committing the patch only on HEAD first.
> If there are issues to take care of, you can then leverage the beta
> testing to address any issues found.  Finally, once some dust has
> settled on the concept and we have gained enough confidence, we could
> consider a back-patch.

No.  Does anyone favor this proposal more than back-patching normally?




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-30 Thread Michael Paquier
On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> I think attached v41nm is ready for commit.  Would anyone like to vote against
> back-patching this?  It's hard to justify lack of back-patch for a data-loss
> bug, but this is atypically invasive.  (I'm repeating the question, since some
> folks missed my 2020-02-18 question.)  Otherwise, I'll push this on Saturday.

The invasiveness of the patch is a concern.  Have you considered a
different strategy?  For example, we are soon going to be in beta for
13, so you could consider committing the patch only on HEAD first.
If there are issues to take care of, you can then leverage the beta
testing to address any issues found.  Finally, once some dust has
settled on the concept and we have gained enough confidence, we could
consider a back-patch.  In short, my point is just that even if this
stuff has been discussed for years, I see no urgency in back-patching given
the lack of complaints we have in -bugs or such.
--
Michael




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-30 Thread Kyotaro Horiguchi
At Sun, 29 Mar 2020 23:08:27 -0700, Noah Misch  wrote in 
> On Mon, Mar 30, 2020 at 02:56:11PM +0900, Kyotaro Horiguchi wrote:
> > At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch  wrote in 
> > > Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed 
> > > some
> > > XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.
> > 
> > Sounds reasonable. In AddPendingSync, don't we put
> > Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
> > The former guarantees the relationship between XLogIsNeeded() and
> > pendingSyncHash, and the existing latter assertion looks redundant as
> > it is placed just after "if (pendingSyncHash)".
> 
> The "Assert(pendingSyncHash == NULL)" is indeed useless; I will remove it.  I
> am not inclined to replace it with Assert(!XLogIsNeeded()).  This static
> function is not likely to get more callers, so the chance of accidentally
> calling it under XLogIsNeeded() is too low.

Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-30 Thread Noah Misch
On Mon, Mar 30, 2020 at 02:56:11PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch  wrote in 
> > Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed 
> > some
> > XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.
> 
> Sounds reasonable. In AddPendingSync, don't we put
> Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
> The former guarantees the relationship between XLogIsNeeded() and
> pendingSyncHash, and the existing latter assertion looks redundant as
> it is placed just after "if (pendingSyncHash)".

The "Assert(pendingSyncHash == NULL)" is indeed useless; I will remove it.  I
am not inclined to replace it with Assert(!XLogIsNeeded()).  This static
function is not likely to get more callers, so the chance of accidentally
calling it under XLogIsNeeded() is too low.
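
For reference, the function under discussion is roughly shaped like this (a
sketch following the thread; the committed text may differ in details):

static void
AddPendingSync(const RelFileNode *rnode)
{
	PendingRelSync *pending;
	bool		found;

	/*
	 * Create the hash on first use.  This only ever happens under
	 * wal_level=minimal, which is why the extra Assert(pendingSyncHash ==
	 * NULL) discussed above adds little on top of the test just below.
	 */
	if (!pendingSyncHash)
	{
		HASHCTL		ctl;

		ctl.keysize = sizeof(RelFileNode);
		ctl.entrysize = sizeof(PendingRelSync);
		ctl.hcxt = TopTransactionContext;
		pendingSyncHash = hash_create("pending sync hash", 16, &ctl,
									  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
	}

	pending = hash_search(pendingSyncHash, rnode, HASH_ENTER, &found);
	Assert(!found);
	pending->is_truncated = false;
}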




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-29 Thread Kyotaro Horiguchi
At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch  wrote in 
> I think attached v41nm is ready for commit.  Would anyone like to vote against
> back-patching this?  It's hard to justify lack of back-patch for a data-loss
> bug, but this is atypically invasive.  (I'm repeating the question, since some
> folks missed my 2020-02-18 question.)  Otherwise, I'll push this on Saturday.
> 
> On Mon, Mar 23, 2020 at 05:20:27PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch  wrote in 
> > > The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> > > MarkBufferDirtyHint().  MarkBufferDirtyHint() runs in parallel workers, 
> > > but
> > > parallel workers have zeroes for pendingSyncHash and rd_*Subid.
> 
> > > Kyotaro, can you look through the affected code and propose a strategy for
> > > good coexistence of parallel query with the WAL skipping mechanism?
> > 
> > Bi-directional communication between leader and workers is too much.
> > It wouldn't be acceptable to inhibit the problematic operations on
> > workers, such as heap pruning or btree pin removal.  If we do pending
> > syncs just before worker start, it won't fix the issue.
> > 
> > The attached patch passes a list of pending-sync relfilenodes at
> > worker start.
> 
> If you were to issue pending syncs and also cease skipping WAL for affected
> relations, that would fix the issue.  Your design is better, though.  I made
> two notable changes:
>
> - The patch was issuing syncs or FPIs every time a parallel worker exited.  I
>   changed it to skip most of smgrDoPendingSyncs() in parallel workers, like
>   AtEOXact_RelationMap() does.

Exactly. Thank you for fixing it.
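
For readers following along, a rough sketch of that short-circuit (argument
and variable names follow the thread's patch and are not guaranteed to match
the committed code):

void
smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
{
	if (!pendingSyncHash)
		return;					/* no relation needs sync */

	/* Abort -- just throw away all pending syncs */
	if (!isCommit)
	{
		pendingSyncHash = NULL;
		return;
	}

	/*
	 * Parallel worker -- don't issue syncs or FPIs here; the leader performs
	 * the pending syncs at its own commit, so just forget the hash.
	 */
	if (isParallelWorker)
	{
		pendingSyncHash = NULL;
		return;
	}

	/* ... leader path: sync or WAL-log each pending relfilenode ... */
}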

> - PARALLEL_KEY_PENDING_SYNCS is most similar to PARALLEL_KEY_REINDEX_STATE and
>   PARALLEL_KEY_COMBO_CID.  parallel.c, not execParallel.c, owns those.  I
>   moved PARALLEL_KEY_PENDING_SYNCS to parallel.c, which also called for style
>   changes in the associated storage.c functions.

That sounds better.

Moving the responsibility for creating the pending-syncs array reduces
copying.  RestorePendingSyncs (and AddPendingSync()) looks better.

> Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
> XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.

Sounds reasonable. In AddPendingSync, don't we put
Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
The former guarantees the relationship between XLogIsNeeded() and
pendingSyncHash, and the existing latter assertion looks redundant as
it is placed just after "if (pendingSyncHash)".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-23 Thread Kyotaro Horiguchi
Thanks for the labour on this.

At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch  wrote in 
> On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> > Pushed, after adding a missing "break" to gist_identify() and tweaking two
..
> The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> MarkBufferDirtyHint().  MarkBufferDirtyHint() runs in parallel workers, but
> parallel workers have zeroes for pendingSyncHash and rd_*Subid.  I hacked up
> the attached patch to understand the scope of the problem (not to commit).  It
> logs a message whenever a parallel worker uses pendingSyncHash or
> RelationNeedsWAL().  Some of the cases happen often enough to make logs huge,
> so the patch suppresses logging for them.  You can see the lower-volume calls
> like this:
> 
>   printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 
> 'max_wal_senders = 0' 'force_parallel_mode = regress' 
> >/tmp/minimal_parallel.conf
>   make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
>   find . -name log | xargs grep -rl 'nm0 invalid'
> 
> Not all are actual bugs.  For example, get_relation_info() behaves fine:
> 
>   /* Temporary and unlogged relations are inaccessible during recovery. */
>   if (!RelationNeedsWAL(relation) && RecoveryInProgress())

But the relcache entry shows wrong information about the new-ness of its
storage, and that is the root cause of all the other problems.
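
To illustrate, the macro in the patch is roughly the following (simplified
sketch, not necessarily the exact committed text):

#define RelationNeedsWAL(relation)										\
	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&	\
	 (XLogIsNeeded() ||													\
	  ((relation)->rd_createSubid == InvalidSubTransactionId &&		\
	   (relation)->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))

With rd_createSubid and rd_firstRelfilenodeSubid both zero in a parallel
worker, this reports that WAL is needed even for a relation whose new
relfilenode the leader is skipping WAL for.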

> Kyotaro, can you look through the affected code and propose a strategy for
> good coexistence of parallel query with the WAL skipping mechanism?

Bi-directional communication between leader and workers is too much.
It wouldn't be acceptable to inhibit the problematic operations on
workers, such as heap pruning or btree pin removal.  If we do pending
syncs just before worker start, it won't fix the issue.

The attached patch passes a list of pending-sync relfilenodes at
worker start.  Workers create an (immature) pending-sync hash from the
list and create relcache entries using the hash.  Given that parallel
workers don't perform transactional or DDL operations, workers need
only the list of relfilenodes.  The list might be long, but I don't
think it is realistic that so many tables are truncated or created and
then scanned in parallel within a transaction while wal_level =
minimal.
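
A rough sketch of the worker-side restore step (the function name and the
array layout are taken from this patch's approach and are illustrative only):
the leader serializes the pending-sync relfilenodes into the parallel DSM,
and each worker rebuilds its own pendingSyncHash from that array.

void
RestorePendingSyncs(char *startAddress)
{
	RelFileNode *rnode;

	/* array of RelFileNodes, terminated by an entry with relNode == InvalidOid */
	for (rnode = (RelFileNode *) startAddress; rnode->relNode != InvalidOid; rnode++)
		AddPendingSync(rnode);
}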

> Since I don't expect one strategy to win clearly and quickly, I plan to revert
> the main patch around 2020-03-22 17:30 UTC.  That will give the patch about
> twenty-four hours in the buildfarm, so more animals can report in.  I will
> leave the three smaller patches in place.

Thank you for your trouble and the check code. Sorry for not
responding in time.

> > fairywren failed differently on 9.5; I have not yet studied it:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10
> 
> This did not remain specific to 9.5.  On platforms where SIZEOF_SIZE_T==4 or
> SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB.  A simple s/1TB/1GB/ in
> the test should fix this.

Oops. I felt that 2TB looked too large but didn't take it
seriously.  1GB is 1048576 kB, which is less than the said limit of
2097151, so the attached second patch does that.
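
(Worked out in kB, the GUC's unit: the ceiling on such platforms is
presumably INT_MAX/1024 = 2097151 kB; 1TB = 1073741824 kB exceeds it, while
1GB = 1048576 kB does not.)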

The attached is a proposed fix for the issue, on top of the reverted
commit.

- v36-0001-Skip-WAL-for-new-relfilenodes-under-wal_level-mi.patch
 The reverted patch.

- v36-0002-Fix-GUC-value-in-TAP-test.patch
 Change wal_skip_threshold from 2TB to 2GB in the TAP test.

- v36-0003-Fix-the-name-of-struct-pendingSyncs.patch
 I found that the struct for a pending-sync hash entry is named
 differently from the pending-delete hash entry.  Change it so that the
 two follow a similar naming convention.

- v36-0004-Propagage-pending-sync-information-to-parallel-w.patch

 The proposed fix for the parallel-worker problem.

The make check-world above didn't fail with this patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 7aa0ba8fe7bccda5e24a8d25af925372aae8402f Mon Sep 17 00:00:00 2001
From: Noah Misch 
Date: Sat, 21 Mar 2020 09:38:26 -0700
Subject: [PATCH v36 1/4] Skip WAL for new relfilenodes, under
 wal_level=minimal.

Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT 
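
In other words, the commit-time choice is roughly the following (an
illustrative sketch; total_blocks, rel and srel are placeholders, not the
committed code):

/* For each relfilenode whose WAL was skipped in this transaction: */
if ((uint64) total_blocks * BLCKSZ / 1024 < wal_skip_threshold)
	log_newpage_range(rel, MAIN_FORKNUM, 0, total_blocks, false);	/* copy contents into WAL */
else
	smgrimmedsync(srel, MAIN_FORKNUM);								/* fsync the data file instead */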

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-21 Thread Noah Misch
On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments.  However, a diverse minority of buildfarm members are failing
> like this, in most branches:
> 
> Mar 21 13:16:37 #   Failed test 'wal_level = minimal, SET TABLESPACE, hint 
> bit'
> Mar 21 13:16:37 #   at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 #  got: '1'
> Mar 21 13:16:37 # expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl  
>   -- 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
> 
> Since I run two of the failing animals, I expect to reproduce this soon.

force_parallel_mode = regress was the setting needed to reproduce this:

  printf '%s\n%s\n%s\n' 'log_statement = all' 'force_parallel_mode = regress' >/tmp/force_parallel.conf
  make -C src/test/recovery check PROVE_TESTS=t/018_wal_optimize.pl TEMP_CONFIG=/tmp/force_parallel.conf

The proximate cause is the RelFileNodeSkippingWAL() call that we added to
MarkBufferDirtyHint().  MarkBufferDirtyHint() runs in parallel workers, but
parallel workers have zeroes for pendingSyncHash and rd_*Subid.  I hacked up
the attached patch to understand the scope of the problem (not to commit).  It
logs a message whenever a parallel worker uses pendingSyncHash or
RelationNeedsWAL().  Some of the cases happen often enough to make logs huge,
so the patch suppresses logging for them.  You can see the lower-volume calls
like this:

  printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode = regress' >/tmp/minimal_parallel.conf
  make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
  find . -name log | xargs grep -rl 'nm0 invalid'

Not all are actual bugs.  For example, get_relation_info() behaves fine:

/* Temporary and unlogged relations are inaccessible during recovery. */
if (!RelationNeedsWAL(relation) && RecoveryInProgress())

Kyotaro, can you look through the affected code and propose a strategy for
good coexistence of parallel query with the WAL skipping mechanism?

Since I don't expect one strategy to win clearly and quickly, I plan to revert
the main patch around 2020-03-22 17:30 UTC.  That will give the patch about
twenty-four hours in the buildfarm, so more animals can report in.  I will
leave the three smaller patches in place.

> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

This did not remain specific to 9.5.  On platforms where SIZEOF_SIZE_T==4 or
SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB.  A simple s/1TB/1GB/ in
the test should fix this.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a25d539..885049d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7183,7 +7183,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr  recptr;
 
/* Caller should not call me on a non-WAL-logged relation */
-   Assert(RelationNeedsWAL(reln));
+   Assert(RelationNeedsWALKnownProblem(reln));
 
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 1794cfd..5be469d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -258,7 +258,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 */
-   if (RelationNeedsWAL(relation))
+   if (RelationNeedsWALKnownProblem(relation))
{
XLogRecPtr  recptr;
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8ff49ce..cfa72ce 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -67,7 +67,7 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);
 
if (IsMVCCSnapshot(scan->xs_snapshot) &&
-   RelationNeedsWAL(scan->indexRelation) &&
+   RelationNeedsWALKnownProblem(scan->indexRelation) &&
!scan->xs_want_itup)
{
ReleaseBuffer(sp->buf);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0ed7c64..14fb36f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -93,6 +93,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
BackendId   backend;
boolneeds_wal;
 
+   PreventParallelWorker(); /* can't update pendingSyncHash */
+
switch (relpersistence)
 

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-21 Thread Bruce Momjian


Wow, this thread started in 2015.  :-O

Date: Fri, 3 Jul 2015 00:05:24 +0200

---

On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> On Sun, Mar 15, 2020 at 08:46:47PM -0700, Noah Misch wrote:
> > On Wed, Mar 04, 2020 at 04:29:19PM +0900, Kyotaro Horiguchi wrote:
> > > The attached is back-patches from 9.5 through master.
> > 
> > Thanks.  I've made some edits.  I'll plan to push the attached patches on
> > Friday or Saturday.
> 
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments.  However, a diverse minority of buildfarm members are failing
> like this, in most branches:
> 
> Mar 21 13:16:37 #   Failed test 'wal_level = minimal, SET TABLESPACE, hint 
> bit'
> Mar 21 13:16:37 #   at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 #  got: '1'
> Mar 21 13:16:37 # expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl  
>   -- 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
> 
> Since I run two of the failing animals, I expect to reproduce this soon.
> 
> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10
> 
> > > It lacks a part of TAP infrastructure nowadays we have, but I
> > > want to have the test (and it actually found a bug I made during this
> > > work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
> > > and RecursiveCopy.pm along with 018_wal_optimize.pl.
> > > (0004-Add-TAP-test-for-WAL-skipping-feature.patch)
> > 
> > That is a good idea.  Rather than make it specific to this test, I would 
> > like
> > to back-patch all applicable test files from 9.6 src/test/recovery.  I'll 
> > plan
> > to push that one part on Thursday.
> 
> That push did not cause failures.
> 
> 

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-21 Thread Noah Misch
On Sun, Mar 15, 2020 at 08:46:47PM -0700, Noah Misch wrote:
> On Wed, Mar 04, 2020 at 04:29:19PM +0900, Kyotaro Horiguchi wrote:
> > The attached is back-patches from 9.5 through master.
> 
> Thanks.  I've made some edits.  I'll plan to push the attached patches on
> Friday or Saturday.

Pushed, after adding a missing "break" to gist_identify() and tweaking two
more comments.  However, a diverse minority of buildfarm members are failing
like this, in most branches:

Mar 21 13:16:37 #   Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
Mar 21 13:16:37 #   at t/018_wal_optimize.pl line 231.
Mar 21 13:16:37 #  got: '1'
Mar 21 13:16:37 # expected: '2'
Mar 21 13:16:46 # Looks like you failed 1 test of 34.
Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl  
  -- 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05

Since I run two of the failing animals, I expect to reproduce this soon.

fairywren failed differently on 9.5; I have not yet studied it:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

> > It lacks a part of TAP infrastructure nowadays we have, but I
> > want to have the test (and it actually found a bug I made during this
> > work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
> > and RecursiveCopy.pm along with 018_wal_optimize.pl.
> > (0004-Add-TAP-test-for-WAL-skipping-feature.patch)
> 
> That is a good idea.  Rather than make it specific to this test, I would like
> to back-patch all applicable test files from 9.6 src/test/recovery.  I'll plan
> to push that one part on Thursday.

That push did not cause failures.




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-03 Thread Kyotaro Horiguchi
Some fixes..

At Wed, 04 Mar 2020 16:29:19 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At first I fixed several issues in 018_wal_optimize.pl:
> 
> - TRUNCATE INSERT, TRUNCATE INSERT PREPARE
> 
>  It wrongly passes if we finally see only the value from the first
>  INSERT. I changed it so that it checks the value, not the number of
>  values.

Finally, it checks both the number of values and the largest value.

...
> log_newpage_range has been introduced in PG12.  Fortunately the
> required infrastructure was introduced in PG9.5, so what I need to do
> for PG95-PG11 is back-patching the function and its counterpart in
- xlog_redo. It doen't WAL format itself but XLOG_FPI gets to have 2 or
+ xlog_redo. It doesn't change the WAL format itself, but XLOG_FPI gets to have 2 or
> more backup pages so the compatibility is forward only. That is, newer
> minor versions read WAL from older minor versions, but not vice
> versa.  I'm not sure it is back-patchable so in the attached the
> end-of-xact WAL feature is separated for PG9.5-PG11.
> (000x-Add-end-of-xact-WAL-feature-of-WAL-skipping.patch)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-01 Thread Kyotaro Horiguchi
At Sun, 1 Mar 2020 11:56:32 -0800, Noah Misch  wrote in 
> On Thu, Feb 27, 2020 at 04:00:24PM +0900, Kyotaro Horiguchi wrote:
> > It sounds somewhat obscure.
> 
> I see.  I won't use that.

Thanks.

> > Couldn't we enumerate examples? And if we
> > could use pg_relation_filenode, I think we can use just
> > "filenode". (Though the word is used in the documentation, it is not
> > defined anywhere..)
> 
> func.sgml does define the term.  Nonetheless, I'm not using it.

Ah, "The filenode is the base component oif the file name(s) used for
the relation".. So it's very similar to "on-disk file" in a sense.

> > 
> > In minimal level, no information is logged for
> > permanent relations for the remainder of a transaction that creates
> > them or changes their filenode. For example, CREATE
> > TABLE, CLUSTER or REFRESH MATERIALIZED VIEW are the command of that
> > category.
> > 
> > 
> > # sorry for bothering you..
> 
> Including examples is fine.  Attached v36nm has just comment and doc changes.
> Would you translate this into back-patch versions for v9.5 through v12?

The explicit list of commands that initiate the WAL-skipping mode
works for me. I'm going to work on the translation right now.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-01 Thread Noah Misch
On Thu, Feb 27, 2020 at 04:00:24PM +0900, Kyotaro Horiguchi wrote:
> At Tue, 25 Feb 2020 21:36:12 -0800, Noah Misch  wrote in 
> > On Tue, Feb 25, 2020 at 10:01:51AM +0900, Kyotaro Horiguchi wrote:
> > > At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch  wrote 
> > > in 
> > > > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> > > If we decide to keep the consistency there, I would like to describe
> > > the code is there for consistency, not for the benefit of a specific
> > > assertion.
> > > 
> > > (cluster.c:1116)
> > > -* new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > > -* benefit of AssertPendingSyncs_RelationCache().
> > > +* new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > > +* consistency of the fields. It is checked later by
> > > +* AssertPendingSyncs_RelationCache().
> > 
> > I think the word "consistency" is too vague for "consistency of the fields" 
> > to
> > convey information.  May I just remove the last sentence of the comment
> > (everything after "* new.")?
> 
> I'm fine with that:)
> 
> > > I agree that relation works as the generic name of table-like
> > > objects. Addition to that, doesn't using the word "storage file" make
> > > it more clearly?  I'm not confident on the wording itself, but it will
> > > look like the following.
> > 
> > The docs rarely use "storage file" or "on-disk file" as terms.  I hesitate 
> > to
> > put more emphasis on files, because they are part of the implementation, not
> > part of the user interface.  The term "rewrites"/"rewriting" has the same
> > problem, though.  Yet another alternative would be to talk about operations
> > that change the pg_relation_filenode() return value:
> > 
> >   In minimal level, no information is logged for 
> > permanent
> >   relations for the remainder of a transaction that creates them or changes
> >   what pg_relation_filenode returns for them.
> > 
> > What do you think?
> 
> It sounds somewhat obscure.

I see.  I won't use that.

> Couldn't we enumerate examples? And if we
> could use pg_relation_filenode, I think we can use just
> "filenode". (Though the word is used in the documentation, it is not
> defined anywhere..)

func.sgml does define the term.  Nonetheless, I'm not using it.

> 
> In minimal level, no information is logged for
> permanent relations for the remainder of a transaction that creates
> them or changes their filenode. For example, CREATE
> TABLE, CLUSTER or REFRESH MATERIALIZED VIEW are the command of that
> category.
> 
> 
> # sorry for bothering you..

Including examples is fine.  Attached v36nm has just comment and doc changes.
Would you translate this into back-patch versions for v9.5 through v12?
Author: Noah Misch 
Commit: Noah Misch 

Fix cosmetic blemishes involving rd_createSubid.

Remove an obsolete comment from AtEOXact_cleanup().  Restore formatting
of a comment in struct RelationData, mangled by the pgindent run in
commit 9af4159fce6654aa0e081b00d02bca40b978745c.  Back-patch to 9.5 (all
supported versions), because another fix stacks on this.

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 50f8912..44fa843 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3029,10 +3029,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
 *
 * During commit, reset the flag to zero, since we are now out of the
 * creating transaction.  During abort, simply delete the relcache entry
-* --- it isn't interesting any longer.  (NOTE: if we have forgotten the
-* new-ness of a new relation due to a forced cache flush, the entry 
will
-* get deleted anyway by shared-cache-inval processing of the aborted
-* pg_class insertion.)
+* --- it isn't interesting any longer.
 */
if (relation->rd_createSubid != InvalidSubTransactionId)
{
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 31d8a1a..bf7a7df 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,7 +63,7 @@ typedef struct RelationData
 * 
rd_replidindex) */
boolrd_statvalid;   /* is rd_statlist valid? */
 
-   /*
+   /*--
 * rd_createSubid is the ID of the highest subtransaction the rel has
 * survived into; or zero if the rel was not created in the current top
 * transaction.  This can be now be relied on, whereas previously it 
could
@@ -73,8 +73,13 @@ typedef struct RelationData
 * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
 * when a relation has multiple new relfilenodes within a single
 * transaction, with one of them occurring in a subsequently aborted
-* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
-* 

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-26 Thread Kyotaro Horiguchi
At Tue, 25 Feb 2020 21:36:12 -0800, Noah Misch  wrote in 
> On Tue, Feb 25, 2020 at 10:01:51AM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch  wrote in 
> > > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> > I agree that the new #ifdef can invite a Heisenbug. I thought that
> > you didn't want that because it doesn't make substantial difference.
> 
> v35nm added swap_relation_files() code so AssertPendingSyncs_RelationCache()
> could check rd_droppedSubid relations.  v30nm, which did not have
> rd_droppedSubid, removed swap_relation_files() code that wasn't making a
> difference.

Ok, I understand that it meant that the additional code still makes a
difference in an --enable-cassert build.

> > If we decide to keep the consistency there, I would like to describe
> > the code is there for consistency, not for the benefit of a specific
> > assertion.
> > 
> > (cluster.c:1116)
> > -* new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > -* benefit of AssertPendingSyncs_RelationCache().
> > +* new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > +* consistency of the fields. It is checked later by
> > +* AssertPendingSyncs_RelationCache().
> 
> I think the word "consistency" is too vague for "consistency of the fields" to
> convey information.  May I just remove the last sentence of the comment
> (everything after "* new.")?

I'm fine with that:)

> > I agree that relation works as the generic name of table-like
> > objects. Addition to that, doesn't using the word "storage file" make
> > it more clearly?  I'm not confident on the wording itself, but it will
> > look like the following.
> 
> The docs rarely use "storage file" or "on-disk file" as terms.  I hesitate to
> put more emphasis on files, because they are part of the implementation, not
> part of the user interface.  The term "rewrites"/"rewriting" has the same
> problem, though.  Yet another alternative would be to talk about operations
> that change the pg_relation_filenode() return value:
> 
>   In minimal level, no information is logged for permanent
>   relations for the remainder of a transaction that creates them or changes
>   what pg_relation_filenode returns for them.
> 
> What do you think?

It sounds somewhat obscure. Couldn't we enumerate examples? And if we
could use pg_relation_filenode, I think we can use just
"filenode". (Though the word is used in the documentation, it is not
defined anywhere..)


In minimal level, no information is logged for
permanent relations for the remainder of a transaction that creates
them or changes their filenode. For example, CREATE
TABLE, CLUSTER or REFRESH MATERIALIZED VIEW are the command of that
category.


# sorry for bothering you..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-25 Thread Noah Misch
On Tue, Feb 25, 2020 at 10:01:51AM +0900, Kyotaro Horiguchi wrote:
> At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch  wrote in 
> > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> > > At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi 
> > >  wrote in 
> > > > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch  
> > > > wrote in 

> > > In swap_relation_files, we can remove rel2-related code when #ifndef
> > > USE_ASSERT_CHECKING.
> > 
> > When state is visible to many compilation units, we should avoid making that
> > state depend on --enable-cassert.  That would be a recipe for a Heisenbug.  
> > In
> > a hot code path, it might be worth the risk.
> 
> I agree that the new #ifdef can invite a Heisenbug. I thought that
> you didn't want that because it doesn't make substantial difference.

v35nm added swap_relation_files() code so AssertPendingSyncs_RelationCache()
could check rd_droppedSubid relations.  v30nm, which did not have
rd_droppedSubid, removed swap_relation_files() code that wasn't making a
difference.

> If we decide to keep the consistency there, I would like to describe
> the code is there for consistency, not for the benefit of a specific
> assertion.
> 
> (cluster.c:1116)
> -* new. The next step for rel2 is deletion, but copy rd_*Subid for the
> -* benefit of AssertPendingSyncs_RelationCache().
> +* new. The next step for rel2 is deletion, but copy rd_*Subid for the
> +* consistency of the fields. It is checked later by
> +* AssertPendingSyncs_RelationCache().

I think the word "consistency" is too vague for "consistency of the fields" to
convey information.  May I just remove the last sentence of the comment
(everything after "* new.")?

> > > config.sgml:
> > > +When wal_level is minimal 
> > > and a
> > > +transaction commits after creating or rewriting a permanent 
> > > table,
> > > +materialized view, or index, this setting determines how to 
> > > persist
> > > 
> > > "creating or truncation" a permanent table?  and maybe "refreshing
> > > matview and reindex". I'm not sure that they can be merged that way.
> ...
> > I like mentioning truncation, but I dislike how this implies that CREATE
> > INDEX, CREATE MATERIALIZED VIEW, and ALTER INDEX SET TABLESPACE aren't in
> > scope.  While I usually avoid the word "relation" in documentation, I can
> > justify it here to make the sentence less complex.  How about the following?
> > 
> > --- a/doc/src/sgml/config.sgml
> > +++ b/doc/src/sgml/config.sgml
> > @@ -2484,9 +2484,9 @@ include_dir 'conf.d'
> >  In minimal level, no information is logged for
> > -tables or indexes for the remainder of a transaction that creates 
> > or
> > -truncates them.  This can make bulk operations much faster (see
> > -).  But minimal WAL does not contain
> > -enough information to reconstruct the data from a base backup and 
> > the
> > -WAL logs, so replica or higher must be used to
> > -enable WAL archiving () and
> > -streaming replication.
> > +permanent relations for the remainder of a transaction that 
> > creates,
> > +rewrites, or truncates them.  This can make bulk operations much
> > +faster (see ).  But minimal WAL does
> > +not contain enough information to reconstruct the data from a base
> > +backup and the WAL logs, so replica or higher 
> > must
> > +be used to enable WAL archiving ( > linkend="guc-archive-mode"/>)
> > +and streaming replication.
> > 
> > @@ -2891,9 +2891,9 @@ include_dir 'conf.d'
> >  When wal_level is minimal 
> > and a
> > -transaction commits after creating or rewriting a permanent table,
> > -materialized view, or index, this setting determines how to persist
> > -the new data.  If the data is smaller than this setting, write it 
> > to
> > -the WAL log; otherwise, use an fsync of the data file.  Depending 
> > on
> > -the properties of your storage, raising or lowering this value 
> > might
> > -help if such commits are slowing concurrent transactions.  The 
> > default
> > -is two megabytes (2MB).
> > +transaction commits after creating, rewriting, or truncating a
> > +permanent relation, this setting determines how to persist the new
> > +data.  If the data is smaller than this setting, write it to the 
> > WAL
> > +log; otherwise, use an fsync of the data file.  Depending on the
> > +properties of your storage, raising or lowering this value might 
> > help
> > +if such commits are slowing concurrent transactions.  The default 
> > is
> > +two megabytes (2MB).
> > 
> 
> I agree that relation works as the generic name of table-like
> objects. Addition to that, doesn't using the word "storage file" make
> it more clearly?  I'm not confident on the wording itself, but it will
> look like the 

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-24 Thread Kyotaro Horiguchi
At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch  wrote in 
> On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> > At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch  wrote 
> > > in 
> > > > - When reusing an index build, instead of storing the dropped relid in 
> > > > the
> > > >   IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), 
> > > > store
> > > >   the subid fields in the IndexStmt.  This is less code, and I felt
> > > >   RelationIdGetRelationCache() invited misuse.
> > > 
> > > Hmm. I'm not sure that index_create having the new subid parameters is
> > > good. And the last if(OidIsValid) clause handles storage persistence
> > > so I did that there. But I don't strongly against it.
> 
> Agreed.  My choice there was not a clear improvement.
> 
> > The change on alter_table.sql and create_table.sql is expecting to
> > cause assertion failure.  Don't we need that kind of explanation in
> > the comment? 
> 
> Test comments generally describe the feature unique to that test, not how the
> test might break.  Some tests list bug numbers, but that doesn't apply here.

Agreed. 

> > In swap_relation_files, we can remove rel2-related code when #ifndef
> > USE_ASSERT_CHECKING.
> 
> When state is visible to many compilation units, we should avoid making that
> state depend on --enable-cassert.  That would be a recipe for a Heisenbug.  In
> a hot code path, it might be worth the risk.

I agree that the new #ifdef can invite a Heisenbug. I thought that
you didn't want that because it doesn't make substantial difference.
If we decide to keep the consistency there, I would like to describe
the code is there for consistency, not for the benefit of a specific
assertion.

(cluster.c:1116)
-* new. The next step for rel2 is deletion, but copy rd_*Subid for the
-* benefit of AssertPendingSyncs_RelationCache().
+* new. The next step for rel2 is deletion, but copy rd_*Subid for the
+* consistency of the fields. It is checked later by
+* AssertPendingSyncs_RelationCache().

> > The patch adds the test for createSubid to pg_visibility.out. It
> > doesn't fail without CLOBBER_CACHE_ALWAYS during the regression test, but
> > CLOBBER_CACHE_ALWAYS causes initdb to fail and the added check won't be
> > reached. I'm not sure it is useful.
> 
> I agree it's not clearly useful, but tests don't need to meet a "clearly
> useful" standard.  When a fast test is not clearly redundant with another
> test, we generally accept it.  In the earlier patch version that inspired this
> test, RELCACHE_FORCE_RELEASE sufficed to make it fail.
> 
> > config.sgml:
> > +When wal_level is minimal 
> > and a
> > +transaction commits after creating or rewriting a permanent table,
> > +materialized view, or index, this setting determines how to persist
> > 
> > "creating or truncation" a permanent table?  and maybe "refreshing
> > matview and reindex". I'm not sure that they can be merged that way.
...
> I like mentioning truncation, but I dislike how this implies that CREATE
> INDEX, CREATE MATERIALIZED VIEW, and ALTER INDEX SET TABLESPACE aren't in
> scope.  While I usually avoid the word "relation" in documentation, I can
> justify it here to make the sentence less complex.  How about the following?
> 
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -2484,9 +2484,9 @@ include_dir 'conf.d'
>  In minimal level, no information is logged for
> -tables or indexes for the remainder of a transaction that creates or
> -truncates them.  This can make bulk operations much faster (see
> -).  But minimal WAL does not contain
> -enough information to reconstruct the data from a base backup and the
> -WAL logs, so replica or higher must be used to
> -enable WAL archiving () and
> -streaming replication.
> +permanent relations for the remainder of a transaction that creates,
> +rewrites, or truncates them.  This can make bulk operations much
> +faster (see ).  But minimal WAL does
> +not contain enough information to reconstruct the data from a base
> +backup and the WAL logs, so replica or higher must
> +be used to enable WAL archiving ()
> +and streaming replication.
> 
> @@ -2891,9 +2891,9 @@ include_dir 'conf.d'
>  When wal_level is minimal and a
> -transaction commits after creating or rewriting a permanent table,
> -materialized view, or index, this setting determines how to persist
> -the new data.  If the data is smaller than this setting, write it to
> -the WAL log; otherwise, use an fsync of the data file.  Depending on
> -the properties of your storage, raising or lowering this value might
> -help if such commits are slowing concurrent transactions.  The 
> default
> -is two megabytes (2MB).
> 

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-22 Thread Noah Misch
On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch  wrote in 
> > > - When reusing an index build, instead of storing the dropped relid in the
> > >   IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), 
> > > store
> > >   the subid fields in the IndexStmt.  This is less code, and I felt
> > >   RelationIdGetRelationCache() invited misuse.
> > 
> > Hmm. I'm not sure that index_create having the new subid parameters is
> > good. And the last if(OidIsValid) clause handles storage persistence
> > so I did that there. But I don't strongly against it.

Agreed.  My choice there was not a clear improvement.

> The change on alter_table.sql and create_table.sql is expecting to
> cause assertion failure.  Don't we need that kind of explanation in
> the comment? 

Test comments generally describe the feature unique to that test, not how the
test might break.  Some tests list bug numbers, but that doesn't apply here.

> In swap_relation_files, we can remove rel2-related code when #ifndef
> USE_ASSERT_CHECKING.

When state is visible to many compilation units, we should avoid making that
state depend on --enable-cassert.  That would be a recipe for a Heisenbug.  In
a hot code path, it might be worth the risk.

> The patch adds the test for createSubid to pg_visibility.out. It
> doesn't fail without CLOBBER_CACHE_ALWAYS during the regression test, but
> CLOBBER_CACHE_ALWAYS causes initdb to fail and the added check won't be
> reached. I'm not sure it is useful.

I agree it's not clearly useful, but tests don't need to meet a "clearly
useful" standard.  When a fast test is not clearly redundant with another
test, we generally accept it.  In the earlier patch version that inspired this
test, RELCACHE_FORCE_RELEASE sufficed to make it fail.

> config.sgml:
> +When wal_level is minimal and a
> +transaction commits after creating or rewriting a permanent table,
> +materialized view, or index, this setting determines how to persist
> 
> "creating or truncation" a permanent table?  and maybe "refreshing
> matview and reindex". I'm not sure that they can be merged that way.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -2889,13 +2889,13 @@ include_dir 'conf.d'
>
> 
>  When wal_level is minimal and a
> -transaction commits after creating or rewriting a permanent table,
> -materialized view, or index, this setting determines how to persist
> -the new data.  If the data is smaller than this setting, write it to
> -the WAL log; otherwise, use an fsync of the data file.  Depending on
> -the properties of your storage, raising or lowering this value might
> -help if such commits are slowing concurrent transactions.  The 
> default
> -is two megabytes (2MB).
> +transaction commits after creating or truncating a permanent table,
> +refreshing a materialized view, or reindexing, this setting 
> determines
> +how to persist the new data.  If the data is smaller than this
> +setting, write it to the WAL log; otherwise, use an fsync of the data
> +file.  Depending on the properties of your storage, raising or
> +lowering this value might help if such commits are slowing concurrent
> +transactions.  The default is two megabytes (2MB).
> 

I like mentioning truncation, but I dislike how this implies that CREATE
INDEX, CREATE MATERIALIZED VIEW, and ALTER INDEX SET TABLESPACE aren't in
scope.  While I usually avoid the word "relation" in documentation, I can
justify it here to make the sentence less complex.  How about the following?

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2484,9 +2484,9 @@ include_dir 'conf.d'
 In minimal level, no information is logged for
-tables or indexes for the remainder of a transaction that creates or
-truncates them.  This can make bulk operations much faster (see
-).  But minimal WAL does not contain
-enough information to reconstruct the data from a base backup and the
-WAL logs, so replica or higher must be used to
-enable WAL archiving () and
-streaming replication.
+permanent relations for the remainder of a transaction that creates,
+rewrites, or truncates them.  This can make bulk operations much
+faster (see ).  But minimal WAL does
+not contain enough information to reconstruct the data from a base
+backup and the WAL logs, so replica or higher must
+be used to enable WAL archiving ()
+and streaming replication.

@@ -2891,9 +2891,9 @@ include_dir 'conf.d'
 When wal_level is minimal and a
-transaction commits after creating or rewriting a permanent table,
-materialized view, or index, this 

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-20 Thread Kyotaro Horiguchi
Hello. I looked through the latest patch.

At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch  wrote in 
> > - When reusing an index build, instead of storing the dropped relid in the
> >   IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), 
> > store
> >   the subid fields in the IndexStmt.  This is less code, and I felt
> >   RelationIdGetRelationCache() invited misuse.
> 
> Hmm. I'm not sure that index_create having the new subid parameters is
> good. And the last if(OidIsValid) clause handles storage persistence
> so I did that there. But I don't strongly against it.
> 
> Please give a bit more time to look it.


The change on alter_table.sql and create_table.sql is expecting to
cause assertion failure.  Don't we need that kind of explanation in
the comment? 

In swap_relation_files, we can remove rel2-related code when #ifndef
USE_ASSERT_CHECKING.

The patch adds the test for createSubid to pg_visibility.out. It
doesn't fail without CLOBBER_CACHE_ALWAYS during the regression test, but
CLOBBER_CACHE_ALWAYS causes initdb to fail and the added check won't be
reached. I'm not sure it is useful.

config.sgml:
+When wal_level is minimal and a
+transaction commits after creating or rewriting a permanent table,
+materialized view, or index, this setting determines how to persist

"creating or truncation" a permanent table?  and maybe "refreshing
matview and reindex". I'm not sure that they can be merged that way.

Everything other than the item related to pg_visibility.sql is addressed in the attached.

The others look good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3068e1e94a..38a2edf860 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2889,13 +2889,13 @@ include_dir 'conf.d'
   

 When wal_level is minimal and a
-transaction commits after creating or rewriting a permanent table,
-materialized view, or index, this setting determines how to persist
-the new data.  If the data is smaller than this setting, write it to
-the WAL log; otherwise, use an fsync of the data file.  Depending on
-the properties of your storage, raising or lowering this value might
-help if such commits are slowing concurrent transactions.  The default
-is two megabytes (2MB).
+transaction commits after creating or truncating a permanent table,
+refreshing a materialized view, or reindexing, this setting determines
+how to persist the new data.  If the data is smaller than this
+setting, write it to the WAL log; otherwise, use an fsync of the data
+file.  Depending on the properties of your storage, raising or
+lowering this value might help if such commits are slowing concurrent
+transactions.  The default is two megabytes (2MB).

   
  
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 391a8a9ea3..682619c9db 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1118,16 +1118,18 @@ swap_relation_files(Oid r1, Oid r2, bool 
target_is_pg_class,
 */
{
Relationrel1,
-   rel2;
+   rel2 PG_USED_FOR_ASSERTS_ONLY;
 
rel1 = relation_open(r1, NoLock);
+#ifdef USE_ASSERT_CHECKING
rel2 = relation_open(r2, NoLock);
rel2->rd_createSubid = rel1->rd_createSubid;
rel2->rd_newRelfilenodeSubid = rel1->rd_newRelfilenodeSubid;
rel2->rd_firstRelfilenodeSubid = rel1->rd_firstRelfilenodeSubid;
+   relation_close(rel2, NoLock);
+#endif
RelationAssumeNewRelfilenode(rel1);
relation_close(rel1, NoLock);
-   relation_close(rel2, NoLock);
}
 
/*
diff --git a/src/test/regress/expected/alter_table.out 
b/src/test/regress/expected/alter_table.out
index 7c2181ac2f..3c500944cd 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -1985,7 +1985,8 @@ select * from another;
 
 drop table another;
 -- Create an index that skips WAL, then perform a SET DATA TYPE that skips
--- rewriting the index.
+-- rewriting the index.  Inadvertent changes to rd_createSubid cause an
+-- assertion failure.
 begin;
 create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
 alter table skip_wal_skip_rewrite_index alter c type varchar(20);
diff --git a/src/test/regress/expected/create_table.out 
b/src/test/regress/expected/create_table.out
index 6acf31725f..dae7595957 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -332,6 +332,7 @@ ERROR:  set-returning functions are not allowed in DEFAULT 

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-19 Thread Kyotaro Horiguchi
Sorry, just one fix. (omitting some typos, though..)

At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch  wrote in 
> > I think attached v35nm is ready for commit to master.  Would anyone like to
> > talk me out of back-patching this?  I would not enjoy back-patching it, but
> > it's hard to justify lack of back-patch for a data-loss bug.
> > 
> > Notable changes since v34:
> > 
> > - Separate a few freestanding fixes into their own patches.
> 
> All of the three patches look fine.
> 
> > On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
> > >   /* Record largest maybe-unsynced block of files under tracking  */
> > >   pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
> > > HASH_FIND, NULL);
> > > - if (pending)
> > > - {
> > > - BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
> > > -
> > > - if (pending->max_truncated < nblocks)
> > > - pending->max_truncated = nblocks;
> > > - }
> > > + pending->is_truncated = true;
> > 
> > - Fix this crashing when "pending" is NULL, as it is in this test case:
> > 
> >   begin;
> >   create temp table t ();
> >   create table t2 ();  -- cause pendingSyncHash to exist
> >   truncate t;
> >   rollback;
> 
> That's terrible... Thanks for fixint it.
> 
> > - Fix the "deleted while still in use" problem that Thomas Munro reported, 
> > by
> >   removing the heap_create() change.  Restoring the saved rd_createSubid had
> >   made obsolete the heap_create() change.  check-world now passes with
> >   wal_level=minimal and CLOBBER_CACHE_ALWAYS.
> 
> Ok, as in the previous mail.
> 
> > - Set rd_droppedSubid in RelationForgetRelation(), not
> >   RelationClearRelation().  RelationForgetRelation() knows it is processing 
> > a
> >   drop, but RelationClearRelation() could only infer that from 
> > circumstantial
> >   evidence.  This seems more future-proof to me.
> 
> Agreed. Different from RelationClearRelatoin, RelationForgetRelation
> is called only for "drop"ing the relation.
> 
> > - When reusing an index build, instead of storing the dropped relid in the
> >   IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), 
> > store
> >   the subid fields in the IndexStmt.  This is less code, and I felt
> >   RelationIdGetRelationCache() invited misuse.
> 
> Hmm. I'm not sure that index_create having the new subid parameters is
> good. And the last if(OidIsValid) clause handles storage persistence
> so I did that there. But I don't strongly against it.

Hmm. I'm not sure that index_create having the new subid parameters is
good. And the last if(OidIsValid) clause in ATExecAddIndex handles
storage persistence, so I did that there. But I'm not strongly against
it.

> Please give a bit more time to look it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-19 Thread Kyotaro Horiguchi
At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch  wrote in 
> I think attached v35nm is ready for commit to master.  Would anyone like to
> talk me out of back-patching this?  I would not enjoy back-patching it, but
> it's hard to justify lack of back-patch for a data-loss bug.
> 
> Notable changes since v34:
> 
> - Separate a few freestanding fixes into their own patches.

All of the three patches look fine.

> On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote:
> > --- a/src/backend/catalog/storage.c
> > +++ b/src/backend/catalog/storage.c
> > @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
> > /* Record largest maybe-unsynced block of files under tracking  */
> > pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
> >   HASH_FIND, NULL);
> > -   if (pending)
> > -   {
> > -   BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
> > -
> > -   if (pending->max_truncated < nblocks)
> > -   pending->max_truncated = nblocks;
> > -   }
> > +   pending->is_truncated = true;
> 
> - Fix this crashing when "pending" is NULL, as it is in this test case:
> 
>   begin;
>   create temp table t ();
>   create table t2 ();  -- cause pendingSyncHash to exist
>   truncate t;
>   rollback;

That's terrible... Thanks for fixing it.

> - Fix the "deleted while still in use" problem that Thomas Munro reported, by
>   removing the heap_create() change.  Restoring the saved rd_createSubid had
>   made obsolete the heap_create() change.  check-world now passes with
>   wal_level=minimal and CLOBBER_CACHE_ALWAYS.

Ok, as in the previous mail.

> - Set rd_droppedSubid in RelationForgetRelation(), not
>   RelationClearRelation().  RelationForgetRelation() knows it is processing a
>   drop, but RelationClearRelation() could only infer that from circumstantial
>   evidence.  This seems more future-proof to me.

Agreed. Unlike RelationClearRelation, RelationForgetRelation is
called only for dropping the relation.

> - When reusing an index build, instead of storing the dropped relid in the
>   IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store
>   the subid fields in the IndexStmt.  This is less code, and I felt
>   RelationIdGetRelationCache() invited misuse.

Hmm. I'm not sure that index_create having the new subid parameters is
good. And the last if(OidIsValid) clause handles storage persistence,
so I did that there. But I'm not strongly against it.

Please give a bit more time to look it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-18 Thread Noah Misch
I think attached v35nm is ready for commit to master.  Would anyone like to
talk me out of back-patching this?  I would not enjoy back-patching it, but
it's hard to justify lack of back-patch for a data-loss bug.

Notable changes since v34:

- Separate a few freestanding fixes into their own patches.

On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
>   /* Record largest maybe-unsynced block of files under tracking  */
>   pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
> HASH_FIND, NULL);
> - if (pending)
> - {
> - BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
> -
> - if (pending->max_truncated < nblocks)
> - pending->max_truncated = nblocks;
> - }
> + pending->is_truncated = true;

- Fix this crashing when "pending" is NULL, as it is in this test case:

  begin;
  create temp table t ();
  create table t2 ();  -- cause pendingSyncHash to exist
  truncate t;
  rollback;

- Fix the "deleted while still in use" problem that Thomas Munro reported, by
  removing the heap_create() change.  Restoring the saved rd_createSubid had
  made obsolete the heap_create() change.  check-world now passes with
  wal_level=minimal and CLOBBER_CACHE_ALWAYS.

- Set rd_droppedSubid in RelationForgetRelation(), not
  RelationClearRelation().  RelationForgetRelation() knows it is processing a
  drop, but RelationClearRelation() could only infer that from circumstantial
  evidence.  This seems more future-proof to me.

- When reusing an index build, instead of storing the dropped relid in the
  IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store
  the subid fields in the IndexStmt.  This is less code, and I felt
  RelationIdGetRelationCache() invited misuse.
Author: Noah Misch 
Commit: Noah Misch 

Fix cosmetic blemishes involving rd_createSubid.

Remove an obsolete comment from AtEOXact_cleanup().  Restore formatting
of a comment in struct RelationData, mangled by the pgindent run in
commit 9af4159fce6654aa0e081b00d02bca40b978745c.  Back-patch to 9.5 (all
supported versions), because another fix stacks on this.

diff --git a/src/backend/utils/cache/relcache.c 
b/src/backend/utils/cache/relcache.c
index 50f8912..44fa843 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3029,10 +3029,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
 *
 * During commit, reset the flag to zero, since we are now out of the
 * creating transaction.  During abort, simply delete the relcache entry
-* --- it isn't interesting any longer.  (NOTE: if we have forgotten the
-* new-ness of a new relation due to a forced cache flush, the entry 
will
-* get deleted anyway by shared-cache-inval processing of the aborted
-* pg_class insertion.)
+* --- it isn't interesting any longer.
 */
if (relation->rd_createSubid != InvalidSubTransactionId)
{
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 31d8a1a..bf7a7df 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,7 +63,7 @@ typedef struct RelationData
 * 
rd_replidindex) */
boolrd_statvalid;   /* is rd_statlist valid? */
 
-   /*
+   /*--
 * rd_createSubid is the ID of the highest subtransaction the rel has
 * survived into; or zero if the rel was not created in the current top
 * transaction.  This can be now be relied on, whereas previously it 
could
@@ -73,8 +73,13 @@ typedef struct RelationData
 * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
 * when a relation has multiple new relfilenodes within a single
 * transaction, with one of them occurring in a subsequently aborted
-* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
-* ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+* subtransaction, e.g.
+*  BEGIN;
+*  TRUNCATE t;
+*  SAVEPOINT save;
+*  TRUNCATE t;
+*  ROLLBACK TO save;
+*  -- rd_newRelfilenodeSubid is now forgotten
 */
SubTransactionId rd_createSubid;/* rel was created in current 
xact */
SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode 
assigned in
Author: Noah Misch 
Commit: Noah Misch 

During heap rebuild, lock any TOAST index until end of transaction.

swap_relation_files() calls toast_get_valid_index() to find and lock
this index, just before 

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-18 Thread Kyotaro Horiguchi
Oops. I played on the wrong branch and got stuck in a slow build on
Windows...

At Tue, 18 Feb 2020 00:53:37 -0800, Noah Misch  wrote in 
> On Tue, Feb 18, 2020 at 03:56:15PM +1300, Thomas Munro wrote:
> > CREATE TYPE priv_testtype1 AS (a int, b text);
> > +ERROR: relation 24844 deleted while still in use
> > REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC;
> > 
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923
> > 
> > It didn't fail on the same OS a couple of days earlier:
> > 
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686
> 
> Thanks for the report.  This reproduces consistently under
> CLOBBER_CACHE_ALWAYS (which, coincidentally, I started today).  Removing the
> heap_create() change fixes it.  Since we now restore a saved rd_createSubid,
> the heap_create() change is obsolete.  My next version will include that fix.

Yes, ATExecAddIndex correctly sets createSubid without that.

> The system uses rd_createSubid to mean two things.  First, rd_node is new.
> Second, the rel might not yet be in catalogs, so we can't rebuild its relcache
> entry.  The first can be false while the second is true, hence this failure.
> However, the second is true in a relatively-narrow period in which we don't
> run arbitrary user code.  Hence, that simple fix suffices.

I hadn't considered the second meaning. I thought it was caused by
invalidation, but I couldn't get a core dump on Windows 10.  The
comment for RelationCacheInvalidate seems to faintly explain the
second meaning.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-18 Thread Noah Misch
On Tue, Feb 18, 2020 at 03:56:15PM +1300, Thomas Munro wrote:
> CREATE TYPE priv_testtype1 AS (a int, b text);
> +ERROR: relation 24844 deleted while still in use
> REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC;
> 
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923
> 
> It didn't fail on the same OS a couple of days earlier:
> 
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686

Thanks for the report.  This reproduces consistently under
CLOBBER_CACHE_ALWAYS (which, coincidentally, I started today).  Removing the
heap_create() change fixes it.  Since we now restore a saved rd_createSubid,
the heap_create() change is obsolete.  My next version will include that fix.

The system uses rd_createSubid to mean two things.  First, rd_node is new.
Second, the rel might not yet be in catalogs, so we can't rebuild its relcache
entry.  The first can be false while the second is true, hence this failure.
However, the second is true in a relatively-narrow period in which we don't
run arbitrary user code.  Hence, that simple fix suffices.




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-17 Thread Thomas Munro
On Mon, Jan 27, 2020 at 11:30 PM Kyotaro Horiguchi
 wrote:
> Hello, this is rebased then addressed version.

Hi, I haven't followed this thread but I just noticed this strange
looking failure:

CREATE TYPE priv_testtype1 AS (a int, b text);
+ERROR: relation 24844 deleted while still in use
REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC;

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923

It didn't fail on the same OS a couple of days earlier:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-01-27 Thread Kyotaro Horiguchi
Hello, this is the rebased version with the review comments addressed.

- A valid rd_firstRelfilenodeSubid now causes drop-pending of the
  relcache entry, just as rd_createSubid does.  The forgetting in the
  last example (see the sketch after this list) no longer happens.

- Revert the (really) useless change of AssertPendingSyncs_RelationCache.

- Fix several comments. Some of the fixes are just rewording and some
  are related to the first change above.
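
The sequence meant by "the last example" is presumably the one from the
test case quoted elsewhere in this thread:

  create table t (c int);
  begin;
  alter table t alter c type bigint;  -- sets rd_firstRelfilenodeSubid
  savepoint q; drop table t; rollback to q;  -- previously forgot rd_firstRelfilenodeSubid
  commit;  -- no longer trips the assertion with this version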

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 91ae812abd0fcfe0172b7bb0ad563d3a7e5dd009 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v34 1/4] Rework WAL-skipping optimization

While wal_level=minimal, we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit. The machinery accelerates bulk-insertion operations,
but it fails for certain sequences of operations, and a crash just after
commit may leave broken table files.

This patch overhauls the machinery so that WAL-logging of all operations
is omitted for such relfilenodes. This patch also introduces a new
feature: small files are emitted as a WAL record instead of being
synced. The new GUC variable wal_skip_threshold controls the threshold.
---
 doc/src/sgml/config.sgml|  43 ++-
 doc/src/sgml/perform.sgml   |  47 +--
 src/backend/access/common/toast_internals.c |   4 +-
 src/backend/access/gist/gistutil.c  |  31 +-
 src/backend/access/gist/gistxlog.c  |  21 ++
 src/backend/access/heap/heapam.c|  45 +--
 src/backend/access/heap/heapam_handler.c|  22 +-
 src/backend/access/heap/rewriteheap.c   |  21 +-
 src/backend/access/nbtree/nbtsort.c |  41 +--
 src/backend/access/rmgrdesc/gistdesc.c  |   5 +
 src/backend/access/transam/README   |  45 ++-
 src/backend/access/transam/xact.c   |  15 +
 src/backend/access/transam/xloginsert.c |  10 +-
 src/backend/access/transam/xlogutils.c  |  18 +-
 src/backend/catalog/heap.c  |   4 +
 src/backend/catalog/storage.c   | 248 -
 src/backend/commands/cluster.c  |  12 +-
 src/backend/commands/copy.c |  58 +--
 src/backend/commands/createas.c |  13 +-
 src/backend/commands/matview.c  |  12 +-
 src/backend/commands/tablecmds.c|  11 +-
 src/backend/storage/buffer/bufmgr.c | 125 ++-
 src/backend/storage/lmgr/lock.c |  12 +
 src/backend/storage/smgr/md.c   |  36 +-
 src/backend/storage/smgr/smgr.c |  35 ++
 src/backend/utils/cache/relcache.c  | 161 +++--
 src/backend/utils/misc/guc.c|  13 +
 src/include/access/gist_private.h   |   2 +
 src/include/access/gistxlog.h   |   1 +
 src/include/access/heapam.h |   3 -
 src/include/access/rewriteheap.h|   2 +-
 src/include/access/tableam.h|  15 +-
 src/include/catalog/storage.h   |   6 +
 src/include/storage/bufmgr.h|   4 +
 src/include/storage/lock.h  |   3 +
 src/include/storage/smgr.h  |   1 +
 src/include/utils/rel.h |  55 ++-
 src/include/utils/relcache.h|   8 +-
 src/test/recovery/t/018_wal_optimize.pl | 374 
 src/test/regress/expected/alter_table.out   |   6 +
 src/test/regress/sql/alter_table.sql|   7 +
 41 files changed, 1243 insertions(+), 352 deletions(-)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e07dc01e80..ea1c866a15 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.


-In minimal level, WAL-logging of some bulk
-operations can be safely skipped, which can make those
-operations much faster (see ).
-Operations in which this optimization can be applied include:
-
- CREATE TABLE AS
- CREATE INDEX
- CLUSTER
- COPY into tables that were created or truncated in the same
- transaction
-
-But minimal WAL does not contain enough information to reconstruct the
-data from a base backup and the WAL logs, so replica or
-higher must be used to enable WAL archiving
-() and streaming replication.
+In minimal level, no information is logged for
+tables or indexes for the remainder of a transaction that creates or
+truncates them.  This can make bulk operations much faster (see
+).  But minimal WAL does not contain
+enough information to reconstruct the data from a base backup and the
+WAL logs, so replica or higher must be used to
+enable WAL archiving () and
+

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-01-26 Thread Kyotaro Horiguchi
By the way, the previous version looks somewhat different from what I
thought I posted..

At Sun, 26 Jan 2020 20:57:00 -0800, Noah Misch  wrote in 
> On Mon, Jan 27, 2020 at 01:44:13PM +0900, Kyotaro Horiguchi wrote:
> > > The purpose of this loop is to create relcache entries for rels locked in 
> > > the
> > > current transaction.  (The "r == NULL" case happens for rels no longer 
> > > visible
> > > in catalogs.  It is harmless.)  Since RelationIdGetRelationCache() never
> > > creates a relcache entry, calling it defeats that purpose.
> > > RelationIdGetRelation() is the right function to call.
> > 
> > I thought that the all required entry exist in the cache but actually
> > it's safer that recreate dropped caches. Does the following works?
> > 
> > r = RelationIdGetRelation(relid);
> > +   /* if not found, fetch a "dropped" entry if any  */
> > +   if (r == NULL)
> > +   r = RelationIdGetRelationCache(relid);
> > if (r == NULL)
> > continue;
> 
> That does not materially change the function's behavior.  Notice that the
> function does one thing with "r", which is to call RelationClose(r).  The
> function calls RelationIdGetRelation() for its side effects, not for its
> return value.

..Right.  The following loop accesses the relcache hash directly, so
there is no need to store the returned r in the rels array..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-01-26 Thread Noah Misch
On Mon, Jan 27, 2020 at 01:44:13PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 26 Jan 2020 20:22:01 -0800, Noah Misch  wrote in 
> > Diffing the two latest versions of one patch:
> > > --- v32-0002-Fix-the-defect-1.patch   2020-01-18 14:32:47.499129940 
> > > -0800
> > > +++ v33-0002-Fix-the-defect-1.patch   2020-01-26 16:23:52.846391035 
> > > -0800
> > > +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
> > > + LOCKTAG_RELATION)
> > > + continue;
> > > + relid = 
> > > ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
> > > +-r = RelationIdGetRelation(relid);
> > > +-if (r == NULL)
> > > ++r = RelationIdGetRelationCache(relid);
> > 
> > The purpose of this loop is to create relcache entries for rels locked in 
> > the
> > current transaction.  (The "r == NULL" case happens for rels no longer 
> > visible
> > in catalogs.  It is harmless.)  Since RelationIdGetRelationCache() never
> > creates a relcache entry, calling it defeats that purpose.
> > RelationIdGetRelation() is the right function to call.
> 
> I thought that the all required entry exist in the cache but actually
> it's safer that recreate dropped caches. Does the following works?
> 
>   r = RelationIdGetRelation(relid);
> +   /* if not found, fetch a "dropped" entry if any  */
> + if (r == NULL)
> + r = RelationIdGetRelationCache(relid);
>   if (r == NULL)
>   continue;

That does not materially change the function's behavior.  Notice that the
function does one thing with "r", which is to call RelationClose(r).  The
function calls RelationIdGetRelation() for its side effects, not for its
return value.




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-01-26 Thread Kyotaro Horiguchi
Thanks!

At Sun, 26 Jan 2020 20:22:01 -0800, Noah Misch  wrote in 
> Diffing the two latest versions of one patch:
> > --- v32-0002-Fix-the-defect-1.patch 2020-01-18 14:32:47.499129940 -0800
> > +++ v33-0002-Fix-the-defect-1.patch 2020-01-26 16:23:52.846391035 -0800
> > +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
> > +   LOCKTAG_RELATION)
> > +   continue;
> > +   relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
> > +-  r = RelationIdGetRelation(relid);
> > +-  if (r == NULL)
> > ++  r = RelationIdGetRelationCache(relid);
> 
> The purpose of this loop is to create relcache entries for rels locked in the
> current transaction.  (The "r == NULL" case happens for rels no longer visible
> in catalogs.  It is harmless.)  Since RelationIdGetRelationCache() never
> creates a relcache entry, calling it defeats that purpose.
> RelationIdGetRelation() is the right function to call.

I thought that all the required entries exist in the cache, but it's
actually safer to recreate dropped entries. Does the following work?

r = RelationIdGetRelation(relid);
+   /* if not found, fetch a "dropped" entry if any  */
+   if (r == NULL)
+   r = RelationIdGetRelationCache(relid);
if (r == NULL)
continue;

> On Tue, Jan 21, 2020 at 06:45:57PM +0900, Kyotaro Horiguchi wrote:
> > Three other fixes not mentined above are made. One is the useless
> > rd_firstRelfilenodeSubid in the condition to dicide whether to
> > preserve or not a relcache entry
> 
> It was not useless.  Test case:
> 
>   create table t (c int);
>   begin;
>   alter table t alter c type bigint;  -- sets rd_firstRelfilenodeSubid
>   savepoint q; drop table t; rollback to q;  -- forgets 
> rd_firstRelfilenodeSubid
>   commit;  -- assertion failure, after 
> s/RelationIdGetRelationCache/RelationIdGetRelation/ discussed above

Mmm? I somehow thought that that relcache entry could never be dropped,
and I believe I considered that case, of course.  But yes, you're right.

I'll post an updated version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-01-26 Thread Noah Misch
Diffing the two latest versions of one patch:
> --- v32-0002-Fix-the-defect-1.patch   2020-01-18 14:32:47.499129940 -0800
> +++ v33-0002-Fix-the-defect-1.patch   2020-01-26 16:23:52.846391035 -0800
> +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
> + LOCKTAG_RELATION)
> + continue;
> + relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
> +-r = RelationIdGetRelation(relid);
> +-if (r == NULL)
> ++r = RelationIdGetRelationCache(relid);

The purpose of this loop is to create relcache entries for rels locked in the
current transaction.  (The "r == NULL" case happens for rels no longer visible
in catalogs.  It is harmless.)  Since RelationIdGetRelationCache() never
creates a relcache entry, calling it defeats that purpose.
RelationIdGetRelation() is the right function to call.

On Tue, Jan 21, 2020 at 06:45:57PM +0900, Kyotaro Horiguchi wrote:
> Three other fixes not mentined above are made. One is the useless
> rd_firstRelfilenodeSubid in the condition to dicide whether to
> preserve or not a relcache entry

It was not useless.  Test case:

  create table t (c int);
  begin;
  alter table t alter c type bigint;  -- sets rd_firstRelfilenodeSubid
  savepoint q; drop table t; rollback to q;  -- forgets rd_firstRelfilenodeSubid
  commit;  -- assertion failure, after 
s/RelationIdGetRelationCache/RelationIdGetRelation/ discussed above




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-01-21 Thread Kyotaro Horiguchi
Thank you for the comment.

At Sat, 18 Jan 2020 19:51:39 -0800, Noah Misch  wrote in 
> On Tue, Jan 14, 2020 at 07:35:22PM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch  wrote 
> > > in 
> > > > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK 
> > > > TO
...
> > This is complex than expected. The DROP TABLE unconditionally removed
> > relcache entry. To fix that, I tried to use rd_isinvalid but it failed
> > because there's a state that a relcache invalid but the corresponding
> > catalog entry is alive.
> > 
> > In the attached patch 0002, I added a boolean in relcache that
> > indicates that the relation is already removed in catalog but not
> > committed.
> 
> This design could work, but some if its properties aren't ideal.  For example,
> RelationIdGetRelation() can return a !rd_isvalid relation when the relation
> has been dropped.  What others designs did you consider, if any?

I thought that entries with rd_isdropped set to true cannot be fetched
by other transactions, because the relid is not visible to them.  The
same session could still do so by repeatedly reindexing or invalidating
the same relation, and I think that is safe because the content of the
entry couldn't change and the cached content is reusable.  That being
said, it does make things unclear.

I came up with two alternatives. One is a variant of
RelationIdGetRelation for the purpose. The new function
RelationIdGetRelationCache is currently used (only) in ATExecAddIndex
so we could restrict it to return only dropped relations.

Another is another "stashed" relcache, but it seems to make things too
complex.

> On Thu, Jan 16, 2020 at 02:20:57PM +0900, Kyotaro Horiguchi wrote:
> > --- a/src/backend/utils/cache/relcache.c
> > +++ b/src/backend/utils/cache/relcache.c
> > @@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
> >  */
> > if (relation->rd_createSubid != InvalidSubTransactionId)
> > {
> > -   if (isCommit)
> > -   relation->rd_createSubid = InvalidSubTransactionId;
> > +   relation->rd_createSubid = InvalidSubTransactionId;
> > +
> > +   if (isCommit && !relation->rd_isdropped)
> > +   {} /* Nothing to do */
> 
> What is the purpose of this particular change?  This executes at the end of a
> top-level transaction.  We've already done any necessary syncing, and we're
> clearing any flags that caused WAL skipping.  I think it's no longer
> productive to treat dropped relations differently.

It executes the pending *relcache* drop we should have done in
ATPostAlterTypeCleanup (or in RelationClearRelation) if the newness
flags had been false.  The entry misses its chance of being removed
(and then bloats the relcache) if we don't do that there.  I added a
comment there explaining that.

> > @@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
> > }
> > }
> >  
> > +   /*
> > +* If this relation registered pending sync then dropped, subxact 
> > rollback
> > +* cancels the uncommitted drop, and commit propagates it to the parent.
> > +*/
> > +   if (relation->rd_isdropped)
> > +   {
> > +   Assert (!relation->rd_isvalid &&
> > +   (relation->rd_createSubid != 
> > InvalidSubTransactionId ||
> > +relation->rd_firstRelfilenodeSubid != 
> > InvalidSubTransactionId));
> > +   if (!isCommit)
> > +   relation->rd_isdropped = false;
> 
> This does the wrong thing when there exists some subtransaction rollback that
> does not rollback the DROP:

Sorry for my stupidity. I actually thought of something like that along
the way. In the end, I concluded that the dropped flag ought to behave
the same way as rd_createSubid.

> \pset null 'NULL'
> begin;
> create extension pg_visibility;
> create table droppedtest (c int);
> select 'droppedtest'::regclass::oid as oid \gset
> savepoint q; drop table droppedtest; release q; -- rd_dropped==true
> select * from pg_visibility_map(:oid); -- processes !rd_isvalid rel (not 
> ideal)
> savepoint q; select 1; rollback to q; -- rd_dropped==false (wrong)
> savepoint q; select 1; rollback to q;
> select pg_relation_size(:oid), pg_relation_filepath(:oid),
>   has_table_privilege(:oid, 'SELECT'); -- all nulls, okay
> select * from pg_visibility_map(:oid); -- assertion failure
> rollback;

And I taught RelationIdGetRelation not to return dropped relations,
so the (not ideal) cases just fail as before.

Three other fixes not mentioned above are included. One is the useless
rd_firstRelfilenodeSubid in the condition deciding whether to preserve
a relcache entry, plus the forgotten copying of the other newness
flags. Another is the forgotten SWAPFIELD on rd_droppedSubid. The last
is the forgotten change in out/equal/copyfuncs.

Please find the attached.

regards.

-- 
Kyotaro 

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-01-18 Thread Noah Misch
On Tue, Jan 14, 2020 at 07:35:22PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch  wrote in 
> > > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> > > 
> > > A test in transactions.sql now fails in 
> > > AssertPendingSyncs_RelationCache(),
> > > when running "make check" under wal_level=minimal.  I test this way:
> > > 
> > > printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' 
> > > >$PWD/minimal.conf
> > > make check TEMP_CONFIG=$PWD/minimal.conf
> > > 
> > > Self-contained demonstration:
> > >   begin;
> > >   create table t (c int);
> > >   savepoint q; drop table t; rollback to q;  -- forgets table is skipping 
> > > wal
> > >   commit;  -- assertion failure
> 
> This is complex than expected. The DROP TABLE unconditionally removed
> relcache entry. To fix that, I tried to use rd_isinvalid but it failed
> because there's a state that a relcache invalid but the corresponding
> catalog entry is alive.
> 
> In the attached patch 0002, I added a boolean in relcache that
> indicates that the relation is already removed in catalog but not
> committed.

This design could work, but some if its properties aren't ideal.  For example,
RelationIdGetRelation() can return a !rd_isvalid relation when the relation
has been dropped.  What others designs did you consider, if any?

On Thu, Jan 16, 2020 at 02:20:57PM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c
> @@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
>*/
>   if (relation->rd_createSubid != InvalidSubTransactionId)
>   {
> - if (isCommit)
> - relation->rd_createSubid = InvalidSubTransactionId;
> + relation->rd_createSubid = InvalidSubTransactionId;
> +
> + if (isCommit && !relation->rd_isdropped)
> + {} /* Nothing to do */

What is the purpose of this particular change?  This executes at the end of a
top-level transaction.  We've already done any necessary syncing, and we're
clearing any flags that caused WAL skipping.  I think it's no longer
productive to treat dropped relations differently.

> @@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
>   }
>   }
>  
> + /*
> +  * If this relation registered pending sync then dropped, subxact 
> rollback
> +  * cancels the uncommitted drop, and commit propagates it to the parent.
> +  */
> + if (relation->rd_isdropped)
> + {
> + Assert (!relation->rd_isvalid &&
> + (relation->rd_createSubid != 
> InvalidSubTransactionId ||
> +  relation->rd_firstRelfilenodeSubid != 
> InvalidSubTransactionId));
> + if (!isCommit)
> + relation->rd_isdropped = false;

This does the wrong thing when there exists some subtransaction rollback that
does not rollback the DROP:

\pset null 'NULL'
begin;
create extension pg_visibility;
create table droppedtest (c int);
select 'droppedtest'::regclass::oid as oid \gset
savepoint q; drop table droppedtest; release q; -- rd_dropped==true
select * from pg_visibility_map(:oid); -- processes !rd_isvalid rel (not ideal)
savepoint q; select 1; rollback to q; -- rd_dropped==false (wrong)
savepoint q; select 1; rollback to q;
select pg_relation_size(:oid), pg_relation_filepath(:oid),
  has_table_privilege(:oid, 'SELECT'); -- all nulls, okay
select * from pg_visibility_map(:oid); -- assertion failure
rollback;




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-01-15 Thread Kyotaro Horiguchi
All the known defects are fixed.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch  wrote in 
> === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only
> 
> storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated.  Is it
> possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
> net size decrease?  I haven't tested, but this sequence seems possible:
> 
>   TRUNCATE
> reduces MAIN_FORKNUM from 100 blocks to 0 blocks
> reduces FSM_FORKNUM from 3 blocks to 0 blocks
>   COPY
> raises MAIN_FORKNUM from 0 blocks to 110 blocks
> does not change FSM_FORKNUM
>   COMMIT
> should fsync, but wrongly chooses log_newpage_range() approach
> 
> If that's indeed a problem, beside the obvious option of tracking every fork's
> max_truncated, we could convert max_truncated to a bool and use fsync anytime
> the relation experienced an mdtruncate().  (While FSM_FORKNUM is not critical
> for database operations, the choice to subject it to checksums entails
> protecting it here.)  If that's not a problem, would you explain?

That causes a page-load failure, since the FSM can offer a nonexistent
heap block, and that failure leads to an ERROR for an SQL statement.
It's not critical, but it is surely a problem. I'd like to take the
bool option because the insert-truncate sequence rarely happens. That
case is not the main target of the optimization, so it is enough for
us to make sure the operation doesn't lead to such errors.
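
For illustration, a hypothetical reproduction of the quoted sequence
(table and file names are made up) would be:

  begin;
  truncate big_t;                     -- main fork 100 -> 0 blocks, FSM 3 -> 0 blocks
  copy big_t from '/tmp/big_t.data';  -- main fork 0 -> 110 blocks, FSM unchanged
  commit;  -- should fsync; picking log_newpage_range() here is the failure above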

The attached is the nm30 patch followed by three fix patches for the
three defects. The new member "RelationData.isremoved" is renamed to
"isdropped" in this version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 3a811c76874ae3c596e138369766ad00888c572c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v32 1/4] Rework WAL-skipping optimization

While wal_level=minimal, we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit. The machinery accelerates bulk-insertion operations,
but it fails for certain sequences of operations, and a crash just after
commit may leave broken table files.

This patch overhauls the machinery so that WAL-logging of all operations
is omitted for such relfilenodes. This patch also introduces a new
feature: small files are emitted as a WAL record instead of being
synced. The new GUC variable wal_skip_threshold controls the threshold.
---
 doc/src/sgml/config.sgml|  43 ++-
 doc/src/sgml/perform.sgml   |  47 +--
 src/backend/access/common/toast_internals.c |   4 +-
 src/backend/access/gist/gistutil.c  |  31 +-
 src/backend/access/gist/gistxlog.c  |  21 ++
 src/backend/access/heap/heapam.c|  45 +--
 src/backend/access/heap/heapam_handler.c|  22 +-
 src/backend/access/heap/rewriteheap.c   |  21 +-
 src/backend/access/nbtree/nbtsort.c |  41 +--
 src/backend/access/rmgrdesc/gistdesc.c  |   5 +
 src/backend/access/transam/README   |  45 ++-
 src/backend/access/transam/xact.c   |  15 +
 src/backend/access/transam/xloginsert.c |  10 +-
 src/backend/access/transam/xlogutils.c  |  18 +-
 src/backend/catalog/heap.c  |   4 +
 src/backend/catalog/storage.c   | 248 -
 src/backend/commands/cluster.c  |  12 +-
 src/backend/commands/copy.c |  58 +--
 src/backend/commands/createas.c |  11 +-
 src/backend/commands/matview.c  |  12 +-
 src/backend/commands/tablecmds.c|  11 +-
 src/backend/storage/buffer/bufmgr.c | 125 ++-
 src/backend/storage/lmgr/lock.c |  12 +
 src/backend/storage/smgr/md.c   |  36 +-
 src/backend/storage/smgr/smgr.c |  35 ++
 src/backend/utils/cache/relcache.c  | 159 +++--
 src/backend/utils/misc/guc.c|  13 +
 src/include/access/gist_private.h   |   2 +
 src/include/access/gistxlog.h   |   1 +
 src/include/access/heapam.h |   3 -
 src/include/access/rewriteheap.h|   2 +-
 src/include/access/tableam.h|  15 +-
 src/include/catalog/storage.h   |   6 +
 src/include/storage/bufmgr.h|   4 +
 src/include/storage/lock.h  |   3 +
 src/include/storage/smgr.h  |   1 +
 src/include/utils/rel.h |  57 ++-
 src/include/utils/relcache.h|   8 +-
 src/test/recovery/t/018_wal_optimize.pl | 374 
 src/test/regress/expected/alter_table.out   |   6 +
 src/test/regress/sql/alter_table.sql|   7 +
 41 files changed, 1242 insertions(+), 351 deletions(-)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d45b6f7cb..0e7a0bc0ee 100644
--- a/doc/src/sgml/config.sgml

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-01-15 Thread Kyotaro Horiguchi
Hello. I added a fix for defect 2.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch  wrote in 
> === Defect 2: Forgets to skip WAL due to oversimplification in heap_create()
> 
> In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
> to transfer WAL-skipped state to the new index relation.  Before v24nm, the
> new index relation skipped WAL unconditionally.  Since v24nm, the new index
> relation never skips WAL.  I've added a test to alter_table.sql that reveals
> this problem under wal_level=minimal.

The fix for this defect uses the mechanism that preserves the relcache
entry for a dropped relation.  If ATExecAddIndex can obtain such a
relcache entry for the old relation, that entry should hold the newness
flags, and we can copy them to the new relcache entry.  I added one
member named oldRelId to struct IndexStmt to let the function access
the relcache entry for the old index relation.
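
For reference, the ALTER TABLE case in question is the one exercised by
the test added to alter_table.sql in this thread, roughly:

  begin;
  create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
  alter table skip_wal_skip_rewrite_index alter c type varchar(20);  -- reuses the index build
  commit;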

I forgot to assign version 31 to the last patch, so I reused that
number for this version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 591872bdd7b18566fe2529d20e4073900dec32fd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v31 1/3] Rework WAL-skipping optimization

While wal_level=minimal, we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit. The machinery accelerates bulk-insertion operations,
but it fails for certain sequences of operations, and a crash just after
commit may leave broken table files.

This patch overhauls the machinery so that WAL-logging of all operations
is omitted for such relfilenodes. This patch also introduces a new
feature: small files are emitted as a WAL record instead of being
synced. The new GUC variable wal_skip_threshold controls the threshold.
---
 doc/src/sgml/config.sgml|  43 ++--
 doc/src/sgml/perform.sgml   |  47 +---
 src/backend/access/common/toast_internals.c |   4 +-
 src/backend/access/gist/gistutil.c  |  31 ++-
 src/backend/access/gist/gistxlog.c  |  21 ++
 src/backend/access/heap/heapam.c|  45 +---
 src/backend/access/heap/heapam_handler.c|  22 +-
 src/backend/access/heap/rewriteheap.c   |  21 +-
 src/backend/access/nbtree/nbtsort.c |  41 +---
 src/backend/access/rmgrdesc/gistdesc.c  |   5 +
 src/backend/access/transam/README   |  45 +++-
 src/backend/access/transam/xact.c   |  15 ++
 src/backend/access/transam/xloginsert.c |  10 +-
 src/backend/access/transam/xlogutils.c  |  18 +-
 src/backend/catalog/heap.c  |   4 +
 src/backend/catalog/storage.c   | 248 ++--
 src/backend/commands/cluster.c  |  12 +-
 src/backend/commands/copy.c |  58 +
 src/backend/commands/createas.c |  11 +-
 src/backend/commands/matview.c  |  12 +-
 src/backend/commands/tablecmds.c|  11 +-
 src/backend/storage/buffer/bufmgr.c | 125 +-
 src/backend/storage/lmgr/lock.c |  12 +
 src/backend/storage/smgr/md.c   |  36 ++-
 src/backend/storage/smgr/smgr.c |  35 +++
 src/backend/utils/cache/relcache.c  | 159 ++---
 src/backend/utils/misc/guc.c|  13 +
 src/include/access/gist_private.h   |   2 +
 src/include/access/gistxlog.h   |   1 +
 src/include/access/heapam.h |   3 -
 src/include/access/rewriteheap.h|   2 +-
 src/include/access/tableam.h|  15 +-
 src/include/catalog/storage.h   |   6 +
 src/include/storage/bufmgr.h|   4 +
 src/include/storage/lock.h  |   3 +
 src/include/storage/smgr.h  |   1 +
 src/include/utils/rel.h |  57 +++--
 src/include/utils/relcache.h|   8 +-
 src/test/regress/expected/alter_table.out   |   6 +
 src/test/regress/sql/alter_table.sql|   7 +
 40 files changed, 868 insertions(+), 351 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d45b6f7cb..0e7a0bc0ee 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.


-In minimal level, WAL-logging of some bulk
-operations can be safely skipped, which can make those
-operations much faster (see ).
-Operations in which this optimization can be applied include:
-
- CREATE TABLE AS
- CREATE INDEX
- CLUSTER
- COPY into tables that were created or truncated in the same
- transaction
-
-But minimal WAL does not contain enough information to reconstruct the
-data from a base backup and the WAL logs, so replica or
-higher 

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-01-14 Thread Kyotaro Horiguchi
Hello, this is a fix for defect 1 of 3.

At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Thank you for the findings.
> 
> At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch  wrote in 
> > By improving AssertPendingSyncs_RelationCache() and by testing with
> > -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
> > Would you fix these?
> 
> I'd like to do that, please give me som time.
> 
> > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> > 
> > A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> > when running "make check" under wal_level=minimal.  I test this way:
> > 
> > printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' 
> > >$PWD/minimal.conf
> > make check TEMP_CONFIG=$PWD/minimal.conf
> > 
> > Self-contained demonstration:
> >   begin;
> >   create table t (c int);
> >   savepoint q; drop table t; rollback to q;  -- forgets table is skipping 
> > wal
> >   commit;  -- assertion failure

This is more complex than expected. DROP TABLE unconditionally removes
the relcache entry. To fix that, I tried to use rd_isinvalid, but it
failed because there is a state where a relcache entry is invalid but
the corresponding catalog entry is still alive.

In the attached patch 0002, I added a boolean to the relcache that
indicates that the relation has already been removed from the catalog
but the removal is not yet committed. I needed to ignore invalid
relcache entries in AssertPendingSyncs_RelationCache, but I think that
is the right thing to do.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 75ce09ae56227f9b87b1e9fcae1cad016857344c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH 1/2] Rework WAL-skipping optimization

While wal_level=minimal, we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit. The machinery accelerates bulk-insertion operations,
but it fails for certain sequences of operations, and a crash just after
commit may leave broken table files.

This patch overhauls the machinery so that WAL-logging of all operations
is omitted for such relfilenodes. This patch also introduces a new
feature: small files are emitted as a WAL record instead of being
synced. The new GUC variable wal_skip_threshold controls the threshold.
---
 doc/src/sgml/config.sgml|  43 ++--
 doc/src/sgml/perform.sgml   |  47 +---
 src/backend/access/common/toast_internals.c |   4 +-
 src/backend/access/gist/gistutil.c  |  31 ++-
 src/backend/access/gist/gistxlog.c  |  21 ++
 src/backend/access/heap/heapam.c|  45 +---
 src/backend/access/heap/heapam_handler.c|  22 +-
 src/backend/access/heap/rewriteheap.c   |  21 +-
 src/backend/access/nbtree/nbtsort.c |  41 +---
 src/backend/access/rmgrdesc/gistdesc.c  |   5 +
 src/backend/access/transam/README   |  45 +++-
 src/backend/access/transam/xact.c   |  15 ++
 src/backend/access/transam/xloginsert.c |  10 +-
 src/backend/access/transam/xlogutils.c  |  18 +-
 src/backend/catalog/heap.c  |   4 +
 src/backend/catalog/storage.c   | 248 ++--
 src/backend/commands/cluster.c  |  12 +-
 src/backend/commands/copy.c |  58 +
 src/backend/commands/createas.c |  11 +-
 src/backend/commands/matview.c  |  12 +-
 src/backend/commands/tablecmds.c|  11 +-
 src/backend/storage/buffer/bufmgr.c | 125 +-
 src/backend/storage/lmgr/lock.c |  12 +
 src/backend/storage/smgr/md.c   |  36 ++-
 src/backend/storage/smgr/smgr.c |  35 +++
 src/backend/utils/cache/relcache.c  | 159 ++---
 src/backend/utils/misc/guc.c|  13 +
 src/include/access/gist_private.h   |   2 +
 src/include/access/gistxlog.h   |   1 +
 src/include/access/heapam.h |   3 -
 src/include/access/rewriteheap.h|   2 +-
 src/include/access/tableam.h|  15 +-
 src/include/catalog/storage.h   |   6 +
 src/include/storage/bufmgr.h|   4 +
 src/include/storage/lock.h  |   3 +
 src/include/storage/smgr.h  |   1 +
 src/include/utils/rel.h |  57 +++--
 src/include/utils/relcache.h|   8 +-
 src/test/regress/expected/alter_table.out   |   6 +
 src/test/regress/sql/alter_table.sql|   7 +
 40 files changed, 868 insertions(+), 351 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..12d07b11e4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.


-In minimal level, WAL-logging of some bulk
-

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-26 Thread Kyotaro Horiguchi
Hello, Noah.

At Wed, 25 Dec 2019 20:22:04 -0800, Noah Misch  wrote in 
> On Thu, Dec 26, 2019 at 12:46:39PM +0900, Kyotaro Horiguchi wrote:
> > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch  wrote in 
> > >   Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  
> > > Making
> > >   that work no matter what does ereport(ERROR) would be tricky and 
> > > low-value.
> > 
> > Right about ereport, but I'm not sure remove the whole assertion from abort.
> 
> You may think of a useful assert location that lacks the problems of asserting
> at abort.  For example, I considered asserting in PortalRunMulti() and
> PortalRun(), just after each command, if still in a transaction.

Thanks for the suggestion. I'll consider that.

> > > - Reverted most post-v24nm changes to swap_relation_files().  Under
> > >   "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> > >   rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not 
> > > right if
> > >   we're running CLUSTER for the second time in one transaction.  I used
> > 
> > I don't agree to that. As I think I have mentioned upthread, rel2 is
> > wrongly marked as "new in this transaction" at that time, which hinders
> > the opportunity of removal and such entries wrongly persist for the
> > backend life and causes problems. (That was found by abort-time
> > AssertPendingSyncs_RelationCache()..)
> 
> I can't reproduce rel2's relcache entry wrongly persisting for the life of a
> backend.  If that were happening, I would expect repeating a CLUSTER command N
> times to increase hash_get_num_entries(RelationIdCache) by at least N.  I
> tried that, but hash_get_num_entries(RelationIdCache) did not increase.  In a
> non-assert build, how can I reproduce problems caused by incorrect
> rd_createSubid on rel2?

As I wrote in the other mail, I don't see such a problem and agree to
the removal.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-25 Thread Kyotaro Horiguchi
At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> > - Reverted most post-v24nm changes to swap_relation_files().  Under
> >   "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> >   rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right 
> > if
> >   we're running CLUSTER for the second time in one transaction.  I used
> 
> I don't agree to that. As I think I have mentioned upthread, rel2 is
> wrongly marked as "new in this transaction" at that time, which hinders
> the opportunity of removal and such entries wrongly persist for the
> backend life and causes problems. (That was found by abort-time
> AssertPendingSyncs_RelationCache()..)

I played with the new version for a while and I don't see such a
problem. I don't clearly recall what I saw when I thought I saw a
problem, but I have changed my mind and agree with that. It's far more
reasonable and clearer, as long as it works correctly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-25 Thread Noah Misch
On Thu, Dec 26, 2019 at 12:46:39PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch  wrote in 
> >   Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
> >   that work no matter what does ereport(ERROR) would be tricky and 
> > low-value.
> 
> Right about ereport, but I'm not sure remove the whole assertion from abort.

You may think of a useful assert location that lacks the problems of asserting
at abort.  For example, I considered asserting in PortalRunMulti() and
PortalRun(), just after each command, if still in a transaction.

> > - Reverted most post-v24nm changes to swap_relation_files().  Under
> >   "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> >   rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right 
> > if
> >   we're running CLUSTER for the second time in one transaction.  I used
> 
> I don't agree to that. As I think I have mentioned upthread, rel2 is
> wrongly marked as "new in this transaction" at that time, which hinders
> the opportunity of removal and such entries wrongly persist for the
> backend life and causes problems. (That was found by abort-time
> AssertPendingSyncs_RelationCache()..)

I can't reproduce rel2's relcache entry wrongly persisting for the life of a
backend.  If that were happening, I would expect repeating a CLUSTER command N
times to increase hash_get_num_entries(RelationIdCache) by at least N.  I
tried that, but hash_get_num_entries(RelationIdCache) did not increase.  In a
non-assert build, how can I reproduce problems caused by incorrect
rd_createSubid on rel2?
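
A rough sketch of that experiment (table and index names assumed; the
entry count itself has to be read from the backend, e.g. in a debugger):

  create table cl_test (c int primary key);
  begin;
  cluster cl_test using cl_test_pkey;
  cluster cl_test using cl_test_pkey;  -- repeat N times within one transaction
  commit;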




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-25 Thread Kyotaro Horiguchi
Thank you for the findings.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch  wrote in 
> By improving AssertPendingSyncs_RelationCache() and by testing with
> -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
> Would you fix these?

I'd like to do that; please give me some time.

> === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> 
> A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> when running "make check" under wal_level=minimal.  I test this way:
> 
> printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' 
> >$PWD/minimal.conf
> make check TEMP_CONFIG=$PWD/minimal.conf
> 
> Self-contained demonstration:
>   begin;
>   create table t (c int);
>   savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
>   commit;  -- assertion failure
> 
> 
> === Defect 2: Forgets to skip WAL due to oversimplification in heap_create()
> 
> In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
> to transfer WAL-skipped state to the new index relation.  Before v24nm, the
> new index relation skipped WAL unconditionally.  Since v24nm, the new index
> relation never skips WAL.  I've added a test to alter_table.sql that reveals
> this problem under wal_level=minimal.
> 
> 
> === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only
> 
> storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated.  Is it
> possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
> net size decrease?  I haven't tested, but this sequence seems possible:
> 
>   TRUNCATE
> reduces MAIN_FORKNUM from 100 blocks to 0 blocks
> reduces FSM_FORKNUM from 3 blocks to 0 blocks
>   COPY
> raises MAIN_FORKNUM from 0 blocks to 110 blocks
> does not change FSM_FORKNUM
>   COMMIT
> should fsync, but wrongly chooses log_newpage_range() approach
> 
> If that's indeed a problem, beside the obvious option of tracking every fork's
> max_truncated, we could convert max_truncated to a bool and use fsync anytime
> the relation experienced an mdtruncate().  (While FSM_FORKNUM is not critical
> for database operations, the choice to subject it to checksums entails
> protecting it here.)  If that's not a problem, would you explain?
> 


> === Non-defect notes
> 
> Once you have a correct patch, would you run check-world with
> -DCLOBBER_CACHE_ALWAYS?  That may reveal additional defects.  It may take a
> day or more, but that's fine.

Sure.

> The new smgrimmedsync() calls are potentially fragile, because they sometimes
> target a file of a dropped relation.  However, the mdexists() test prevents
> anything bad from happening.  No change is necessary.  Example:
> 
>   SET wal_skip_threshold = 0;
>   BEGIN;
>   SAVEPOINT q;
>   CREATE TABLE t (c) AS SELECT 1;
>   ROLLBACK TO q;  -- truncates the relfilenode
>   CHECKPOINT;  -- unlinks the relfilenode
>   COMMIT;  -- calls mdexists() on the relfilenode
> 
> 
> === Notable changes in v30nm
> 
> - Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
>   Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

Ahh, I wrongly understood that MAX_KILOBYTES prevents that
setting. work_mem and maintenance_work_mem are cast to double or
long before the calculation. In this case it's enough that the
calculation unit becomes kilobytes instead of bytes.
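
For illustration, a minimal sketch of the two ways to write the test (the
names here are made up for the example, not taken from the patch):

  #include "postgres.h"        /* uint64, BLCKSZ */
  #include "storage/block.h"   /* BlockNumber */

  /* wal_skip_threshold is an int GUC measured in kilobytes. */
  extern int wal_skip_threshold;

  static bool
  exceeds_wal_skip_threshold(BlockNumber nblocks)
  {
      /*
       * Wrong: "wal_skip_threshold * 1024" is evaluated in int, so a setting
       * of 1TB (2^30 kB) wraps around to zero and the threshold acts like 0.
       *
       * Safe: widen to uint64 first (equivalently, compare in kilobytes).
       */
      return (uint64) nblocks * BLCKSZ >= (uint64) wal_skip_threshold * 1024;
  }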

> - Changed AssertPendingSyncs_RelationCache() to open all relations on which
>   the transaction holds locks.  This permits detection of cases where
>   RelationNeedsWAL() returns true but storage.c will sync the relation.
>
>   Removed the assertions from RelationIdGetRelation().  Using
>   "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
>   actually problematic, since invalidation updates rd_node while other code
>   updates rd_firstRelfilenodeSubid.  This is not a significant loss, now that
>   AssertPendingSyncs_RelationCache() opens relations.  (I considered making
>   the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
>   somewhere until the next CommandCounterIncrement(), which would make it
>   actually affect RelationNeedsWAL().  That might have been better in general,
>   but it felt complex without clear benefits.)
> 
>   Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
>   that work no matter what does ereport(ERROR) would be tricky and low-value.

Right about ereport, but I'm not sure about removing the whole assertion at abort.

> - Extracted the RelationTruncate() changes into new function
>   RelationPreTruncate(), so table access methods that can't use
>   RelationTruncate() have another way to request that work.

Sounds reasonable. Also the new behavior of max_truncated looks fine.
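
For example, a hypothetical table AM that cannot go through
RelationTruncate() could request the same bookkeeping roughly like this (a
sketch only; I assume here that RelationPreTruncate() takes just the
Relation, and the AM callback name follows the existing table AM API):

  #include "postgres.h"

  #include "catalog/storage.h"   /* RelationPreTruncate() */
  #include "utils/rel.h"

  /* Hypothetical truncate callback of a custom table access method. */
  static void
  myam_relation_nontransactional_truncate(Relation rel)
  {
      /*
       * Tell storage.c that this relation is shrinking, so the WAL-skipping
       * machinery knows it must fsync the file at commit instead of relying
       * on the small-relation FPI shortcut.
       */
      RelationPreTruncate(rel);

      /* ... AM-specific code that actually truncates the underlying files ... */
  }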

> - Changed wal_skip_threshold default to 2MB.  My second preference was for
>   4MB.  In your data, 2MB and 4MB had similar performance at optimal
>   wal_buffers, but 4MB performed worse at low 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-25 Thread Noah Misch
By improving AssertPendingSyncs_RelationCache() and by testing with
-DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
Would you fix these?


=== Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO

A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
when running "make check" under wal_level=minimal.  I test this way:

printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
make check TEMP_CONFIG=$PWD/minimal.conf

Self-contained demonstration:
  begin;
  create table t (c int);
  savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
  commit;  -- assertion failure


=== Defect 2: Forgets to skip WAL due to oversimplification in heap_create()

In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
to transfer WAL-skipped state to the new index relation.  Before v24nm, the
new index relation skipped WAL unconditionally.  Since v24nm, the new index
relation never skips WAL.  I've added a test to alter_table.sql that reveals
this problem under wal_level=minimal.


=== Defect 3: storage.c checks size decrease of MAIN_FORKNUM only

storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated.  Is it
possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
net size decrease?  I haven't tested, but this sequence seems possible:

  TRUNCATE
reduces MAIN_FORKNUM from 100 blocks to 0 blocks
reduces FSM_FORKNUM from 3 blocks to 0 blocks
  COPY
raises MAIN_FORKNUM from 0 blocks to 110 blocks
does not change FSM_FORKNUM
  COMMIT
should fsync, but wrongly chooses log_newpage_range() approach

If that's indeed a problem, beside the obvious option of tracking every fork's
max_truncated, we could convert max_truncated to a bool and use fsync anytime
the relation experienced an mdtruncate().  (While FSM_FORKNUM is not critical
for database operations, the choice to subject it to checksums entails
protecting it here.)  If that's not a problem, would you explain?
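
If we went the bool route, the bookkeeping could be as simple as this sketch
(illustrative names only, not a patch):

  #include "postgres.h"

  #include "storage/relfilenode.h"

  /*
   * Hypothetical pending-sync entry: rather than remembering per-fork
   * truncated sizes, just remember whether any fork was ever truncated
   * while the relfilenode was skipping WAL.
   */
  typedef struct PendingRelSync
  {
      RelFileNode rnode;          /* hash key */
      bool        is_truncated;   /* any mdtruncate() on any fork? */
  } PendingRelSync;

  /*
   * At commit: if is_truncated, always fsync the relation; otherwise the
   * log_newpage_range() shortcut remains safe for every fork.
   */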


=== Non-defect notes

Once you have a correct patch, would you run check-world with
-DCLOBBER_CACHE_ALWAYS?  That may reveal additional defects.  It may take a
day or more, but that's fine.

The new smgrimmedsync() calls are potentially fragile, because they sometimes
target a file of a dropped relation.  However, the mdexists() test prevents
anything bad from happening.  No change is necessary.  Example:

  SET wal_skip_threshold = 0;
  BEGIN;
  SAVEPOINT q;
  CREATE TABLE t (c) AS SELECT 1;
  ROLLBACK TO q;  -- truncates the relfilenode
  CHECKPOINT;  -- unlinks the relfilenode
  COMMIT;  -- calls mdexists() on the relfilenode
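
In code form, the guard amounts to roughly this sketch (paraphrased with
illustrative names, not the patch text):

  #include "postgres.h"

  #include "storage/backendid.h"
  #include "storage/smgr.h"

  /* Hypothetical fragment of commit-time pending-sync processing. */
  static void
  sync_one_pending_rel(RelFileNode rnode)
  {
      SMgrRelation srel = smgropen(rnode, InvalidBackendId);

      /*
       * The relfilenode may already be gone, e.g. after a subtransaction
       * rollback truncated it and a checkpoint unlinked the file.  The
       * existence check turns that case into a no-op instead of an error.
       */
      if (!smgrexists(srel, MAIN_FORKNUM))
          return;

      smgrimmedsync(srel, MAIN_FORKNUM);
  }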


=== Notable changes in v30nm

- Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
  Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

- Changed AssertPendingSyncs_RelationCache() to open all relations on which
  the transaction holds locks.  This permits detection of cases where
  RelationNeedsWAL() returns true but storage.c will sync the relation.

  Removed the assertions from RelationIdGetRelation().  Using
  "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
  actually problematic, since invalidation updates rd_node while other code
  updates rd_firstRelfilenodeSubid.  This is not a significant loss, now that
  AssertPendingSyncs_RelationCache() opens relations.  (I considered making
  the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
  somewhere until the next CommandCounterIncrement(), which would make it
  actually affect RelationNeedsWAL().  That might have been better in general,
  but it felt complex without clear benefits.)

  Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
  that work no matter what does ereport(ERROR) would be tricky and low-value.

- Extracted the RelationTruncate() changes into new function
  RelationPreTruncate(), so table access methods that can't use
  RelationTruncate() have another way to request that work.

- Changed wal_skip_threshold default to 2MB.  My second preference was for
  4MB.  In your data, 2MB and 4MB had similar performance at optimal
  wal_buffers, but 4MB performed worse at low wal_buffers.

- Reverted most post-v24nm changes to swap_relation_files().  Under
  "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
  rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right if
  we're running CLUSTER for the second time in one transaction.  I used
  relation_open(r1, NoLock) instead of AccessShareLock, because we deserve an
  assertion failure if we hold no lock at that point.

- Change toast_get_valid_index() to retain locks until end of transaction.
  When I adopted relation_open(r1, NoLock) in swap_relation_files(), that
  revealed that we retain no lock on the TOAST index.

- Ran pgindent and perltidy.  Updated some comments and names.

On Mon, Dec 09, 2019 at 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-25 Thread Kyotaro Horiguchi
At Tue, 24 Dec 2019 16:35:35 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I rebased the patch and changed the default value for the GUC variable
> wal_skip_threshold to 4096 kilobytes in config.sgml, storage.c and
> guc.c. 4096kB is choosed as it is the nice round number of 500 pages *
> 8kB = 4000kB.

The value in the doc was not correct. Fixed only the value from 3192
to 4096kB.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 2f184c140ab442ee29103be830b3389b71e8e609 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v29] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit. The machinery accelerates bulk-insertion operations
but it fails for certain sequences of operations, and a crash just
after commit may leave broken table files.

This patch overhauls the machinery so that WAL-logging of all
operations is omitted for such relfilenodes. This patch also
introduces a new feature whereby small files are emitted as WAL
records instead of being synced. The new GUC variable
wal_skip_threshold controls the threshold.
---
 doc/src/sgml/config.sgml |  43 ++--
 doc/src/sgml/perform.sgml|  47 +
 src/backend/access/gist/gistutil.c   |  31 ++-
 src/backend/access/gist/gistxlog.c   |  21 ++
 src/backend/access/heap/heapam.c |  45 +---
 src/backend/access/heap/heapam_handler.c |  22 +-
 src/backend/access/heap/rewriteheap.c|  21 +-
 src/backend/access/nbtree/nbtsort.c  |  41 +---
 src/backend/access/rmgrdesc/gistdesc.c   |   5 +
 src/backend/access/transam/README|  47 -
 src/backend/access/transam/xact.c|  15 ++
 src/backend/access/transam/xloginsert.c  |  10 +-
 src/backend/access/transam/xlogutils.c   |  17 +-
 src/backend/catalog/heap.c   |   4 +
 src/backend/catalog/storage.c| 257 +--
 src/backend/commands/cluster.c   |  31 +++
 src/backend/commands/copy.c  |  58 +
 src/backend/commands/createas.c  |  11 +-
 src/backend/commands/matview.c   |  12 +-
 src/backend/commands/tablecmds.c |  11 +-
 src/backend/storage/buffer/bufmgr.c  | 123 ++-
 src/backend/storage/smgr/md.c|  35 ++-
 src/backend/storage/smgr/smgr.c  |  37 
 src/backend/utils/cache/relcache.c   | 122 ---
 src/backend/utils/misc/guc.c |  13 ++
 src/bin/psql/input.c |   1 +
 src/include/access/gist_private.h|   2 +
 src/include/access/gistxlog.h|   1 +
 src/include/access/heapam.h  |   3 -
 src/include/access/rewriteheap.h |   2 +-
 src/include/access/tableam.h |  18 +-
 src/include/catalog/storage.h|   5 +
 src/include/storage/bufmgr.h |   4 +
 src/include/storage/smgr.h   |   1 +
 src/include/utils/rel.h  |  57 +++--
 src/include/utils/relcache.h |   8 +-
 src/test/regress/pg_regress.c|   2 +
 37 files changed, 839 insertions(+), 344 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..d893864c40 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.


-In minimal level, WAL-logging of some bulk
-operations can be safely skipped, which can make those
-operations much faster (see ).
-Operations in which this optimization can be applied include:
-
- CREATE TABLE AS
- CREATE INDEX
- CLUSTER
- COPY into tables that were created or truncated in the same
- transaction
-
-But minimal WAL does not contain enough information to reconstruct the
-data from a base backup and the WAL logs, so replica or
-higher must be used to enable WAL archiving
-() and streaming replication.
+In minimal level, no information is logged for
+tables or indexes for the remainder of a transaction that creates or
+truncates them.  This can make bulk operations much faster (see
+).  But minimal WAL does not contain
+enough information to reconstruct the data from a base backup and the
+WAL logs, so replica or higher must be used to
+enable WAL archiving () and
+streaming replication.


 In logical level, the same information is logged as
@@ -2887,6 +2880,26 @@ include_dir 'conf.d'
   
  
 
+ 
+  wal_skip_threshold (integer)
+  
+   wal_skip_threshold configuration parameter
+  
+  
+  
+   
+When wal_level is minimal and a
+transaction commits after creating or 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-23 Thread Kyotaro Horiguchi
At Tue, 10 Dec 2019 16:59:25 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> shared_buffers=1GB/wal_buffers=16MB(defalut). pgbench -s 20 uses 256MB
> of storage so all of them can be loaded on shared memory.
> 
> The attached graph shows larger benefit in TPS drop and latency
> increase for HDD. The DDL pages at the corsspoint between commit-FPW
> and commit-sync moves from roughly 300 to 200 in TPS and latency, and
> 1000 to 600 in DDL runtime. If we can rely on the two graphs, 500 (or
> 512) pages seems to be the most promising candidate for the default
> value of wal_skip_threshold.
> regards.

I rebased the patch and changed the default value for the GUC variable
wal_skip_threshold to 4096 kilobytes in config.sgml, storage.c and
guc.c. 4096kB was chosen as it is a nice round number close to 500
pages * 8kB = 4000kB.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 64b4f63016c776cd5102b70f5562328dc5d371fa Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit. The machinery accelerates bulk-insertion operations
but it fails for certain sequences of operations, and a crash just
after commit may leave broken table files.

This patch overhauls the machinery so that WAL-logging of all
operations is omitted for such relfilenodes. This patch also
introduces a new feature whereby small files are emitted as WAL
records instead of being synced. The new GUC variable
wal_skip_threshold controls the threshold.
---
 doc/src/sgml/config.sgml |  43 ++--
 doc/src/sgml/perform.sgml|  47 +
 src/backend/access/gist/gistutil.c   |  31 ++-
 src/backend/access/gist/gistxlog.c   |  21 ++
 src/backend/access/heap/heapam.c |  45 +---
 src/backend/access/heap/heapam_handler.c |  22 +-
 src/backend/access/heap/rewriteheap.c|  21 +-
 src/backend/access/nbtree/nbtsort.c  |  41 +---
 src/backend/access/rmgrdesc/gistdesc.c   |   5 +
 src/backend/access/transam/README|  47 -
 src/backend/access/transam/xact.c|  15 ++
 src/backend/access/transam/xloginsert.c  |  10 +-
 src/backend/access/transam/xlogutils.c   |  17 +-
 src/backend/catalog/heap.c   |   4 +
 src/backend/catalog/storage.c| 257 +--
 src/backend/commands/cluster.c   |  31 +++
 src/backend/commands/copy.c  |  58 +
 src/backend/commands/createas.c  |  11 +-
 src/backend/commands/matview.c   |  12 +-
 src/backend/commands/tablecmds.c |  11 +-
 src/backend/storage/buffer/bufmgr.c  | 123 ++-
 src/backend/storage/smgr/md.c|  35 ++-
 src/backend/storage/smgr/smgr.c  |  37 
 src/backend/utils/cache/relcache.c   | 122 ---
 src/backend/utils/misc/guc.c |  13 ++
 src/bin/psql/input.c |   1 +
 src/include/access/gist_private.h|   2 +
 src/include/access/gistxlog.h|   1 +
 src/include/access/heapam.h  |   3 -
 src/include/access/rewriteheap.h |   2 +-
 src/include/access/tableam.h |  18 +-
 src/include/catalog/storage.h|   5 +
 src/include/storage/bufmgr.h |   4 +
 src/include/storage/smgr.h   |   1 +
 src/include/utils/rel.h  |  57 +++--
 src/include/utils/relcache.h |   8 +-
 src/test/regress/pg_regress.c|   2 +
 37 files changed, 839 insertions(+), 344 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..d455a42c9d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.


-In minimal level, WAL-logging of some bulk
-operations can be safely skipped, which can make those
-operations much faster (see ).
-Operations in which this optimization can be applied include:
-
- CREATE TABLE AS
- CREATE INDEX
- CLUSTER
- COPY into tables that were created or truncated in the same
- transaction
-
-But minimal WAL does not contain enough information to reconstruct the
-data from a base backup and the WAL logs, so replica or
-higher must be used to enable WAL archiving
-() and streaming replication.
+In minimal level, no information is logged for
+tables or indexes for the remainder of a transaction that creates or
+truncates them.  This can make bulk operations much faster (see
+).  But minimal WAL does not contain
+enough information to reconstruct the data from a base backup and the
+WAL logs, so replica or higher must be 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-09 Thread Robert Haas
On Mon, Dec 9, 2019 at 4:04 AM Kyotaro Horiguchi
 wrote:
> Yeah, only 0.5GB of shared_buffers makes the default value of
> wal_buffers reach to the heaven. I think I can take numbers on that
> condition. (I doubt that it's meaningful if I increase only
> wal_buffers manually.)

Heaven seems a bit exalted, but I think we really only have a formula
because somebody might have really small shared_buffers for some
reason and be unhappy about us gobbling up a comparatively large
amount of memory for WAL buffers. The current limit means that normal
installations get what they need without manual tuning, and small
installations - where performance presumably sucks anyway for other
reasons - keep a small memory footprint.
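
For context, the auto-tuning rule behind that formula is roughly the
following (a from-memory sketch of the wal_buffers = -1 behavior, not a
quote of xlog.c, so treat the helper name and constants as approximations):

  /*
   * Sketch of wal_buffers auto-tuning: 1/32 of shared_buffers, but at least
   * 64kB and at most one WAL segment (16MB with the default segment size).
   */
  static int
  choose_wal_buffers_kb(int shared_buffers_kb, int wal_segment_kb)
  {
      int     xbuffers_kb = shared_buffers_kb / 32;

      if (xbuffers_kb > wal_segment_kb)
          xbuffers_kb = wal_segment_kb;   /* cap: one WAL segment */
      if (xbuffers_kb < 64)
          xbuffers_kb = 64;               /* floor: 64kB */
      return xbuffers_kb;
  }

With shared_buffers = 0.5GB, 1/32 hits the 16MB cap exactly, which is why
that setting already gets the maximum default mentioned upthread; smaller
shared_buffers scales the default down proportionally.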

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-09 Thread Kyotaro Horiguchi
Hello.

At Sun, 8 Dec 2019 10:09:51 -0800, Noah Misch  wrote in 
> I reviewed your latest code, and it's nearly complete.  mdimmedsync() syncs
> only "active segments" (as defined in md.c), but smgrDoPendingSyncs() must
> sync active and inactive segments.  This matters when mdtruncate() truncated
> the relation after the last checkpoint, causing active segments to become
> inactive.  In such cases, syncs of the inactive segments will have been queued
> for execution during the next checkpoint.  Since we skipped the
> XLOG_SMGR_TRUNCATE record, we must complete those syncs before commit.  Let's

Got it! You're so great. Thanks.

> just modify smgrimmedsync() to always sync active and inactive segments;
> that's fine to do in other smgrimmedsync() callers, even though they operate
> on relations that can't have inactive segments.

Agreed and done that way. Even though it's not harmful to leave inactive
segments open, I chose to close them after the sync. As mentioned in the
comment added to the function, inactive segments may not be closed if an
error happens during the file sync. md works properly even in that case
(as the file comment says) and, anyway, mdnblocks leaves the first
inactive segment open if there's no partial segment.

I don't understand why mdclose checks (v->mdfd_vfd >= 0) for an open
segment, but in any case mdimmedsync assumes that won't happen and I
follow that assumption.  (I suspect that the if condition in mdclose
should be an assertion..)

> On Tue, Dec 03, 2019 at 08:51:46PM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch  wrote in 
> > > On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > > > I measured the performance with the latest patch set.
> > > > 
> > > > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about 
> > > > > one
> > > > >minute when done via syncs.
> > > > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > > > 3. Wait 10s.
> > > > > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > > > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> 
> >  wal_buffers| 512
> 
> This value (4 MiB) is lower than a tuned production system would have.  In
> future benchmarks (if any) use wal_buffers=2048 (16 MiB).

Yeah, only 0.5GB of shared_buffers makes the default value of
wal_buffers reach to the heaven. I think I can take numbers on that
condition. (I doubt that it's meaningful if I increase only
wal_buffers manually.)

Anyway the default value ought to be defined based on the default
configuration.

In the attached patch, I merged all pieces in the previous version and
the change made this time (only md.c is changed this time).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From a523eeddbd8840a02c81267a217c8f4aecc1fe5e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v27] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit. The machinery accelerates bulk-insertion operations
but it fails for certain sequences of operations, and a crash just
after commit may leave broken table files.

This patch overhauls the machinery so that WAL-logging of all
operations is omitted for such relfilenodes. This patch also
introduces a new feature whereby small files are emitted as WAL
records instead of being synced. The new GUC variable
wal_skip_threshold controls the threshold.
---
 doc/src/sgml/config.sgml |  43 ++--
 doc/src/sgml/perform.sgml|  47 +
 src/backend/access/gist/gistutil.c   |  31 ++-
 src/backend/access/gist/gistxlog.c   |  21 ++
 src/backend/access/heap/heapam.c |  45 +---
 src/backend/access/heap/heapam_handler.c |  22 +-
 src/backend/access/heap/rewriteheap.c|  21 +-
 src/backend/access/nbtree/nbtsort.c  |  41 +---
 src/backend/access/rmgrdesc/gistdesc.c   |   5 +
 src/backend/access/transam/README|  47 -
 src/backend/access/transam/xact.c|  15 ++
 src/backend/access/transam/xloginsert.c  |  10 +-
 src/backend/access/transam/xlogutils.c   |  17 +-
 src/backend/catalog/heap.c   |   4 +
 src/backend/catalog/storage.c| 257 +--
 src/backend/commands/cluster.c   |  31 +++
 src/backend/commands/copy.c  |  58 +
 src/backend/commands/createas.c  |  11 +-
 src/backend/commands/matview.c   |  12 +-
 src/backend/commands/tablecmds.c |  11 +-
 src/backend/storage/buffer/bufmgr.c  | 123 ++-
 src/backend/storage/smgr/md.c|  35 ++-
 src/backend/storage/smgr/smgr.c  |  37 
 src/backend/utils/cache/relcache.c   | 122 ---
 src/backend/utils/misc/guc.c |  13 ++
 src/bin/psql/input.c |   1 +
 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-08 Thread Noah Misch
I reviewed your latest code, and it's nearly complete.  mdimmedsync() syncs
only "active segments" (as defined in md.c), but smgrDoPendingSyncs() must
sync active and inactive segments.  This matters when mdtruncate() truncated
the relation after the last checkpoint, causing active segments to become
inactive.  In such cases, syncs of the inactive segments will have been queued
for execution during the next checkpoint.  Since we skipped the
XLOG_SMGR_TRUNCATE record, we must complete those syncs before commit.  Let's
just modify smgrimmedsync() to always sync active and inactive segments;
that's fine to do in other smgrimmedsync() callers, even though they operate
on relations that can't have inactive segments.

On Tue, Dec 03, 2019 at 08:51:46PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch  wrote in 
> > On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > > I measured the performance with the latest patch set.
> > > 
> > > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about 
> > > > one
> > > >minute when done via syncs.
> > > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > > 3. Wait 10s.
> > > > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

>  wal_buffers| 512

This value (4 MiB) is lower than a tuned production system would have.  In
future benchmarks (if any) use wal_buffers=2048 (16 MiB).




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-12-03 Thread Kyotaro Horiguchi
Hello.

At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch  wrote in 
> On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > I measured the performance with the latest patch set.
> > 
> > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> > >minute when done via syncs.
> > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > 3. Wait 10s.
> > > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> 
> If you have the raw data requested in (5), please share them here so folks
> have the option to reproduce your graphs and calculations.

Sorry, I forgot to attach the scripts. The raw data vanished due to an
unstable connection and the steps were quite crude. I prioritized showing
some numbers at the time. I have revised the scripts in a more automated
way and will take numbers again.

> > > 2. Start server with wal_level = replica (all other variables are not
> > > changed) then run the attached ./bench.sh
> > 
> > The bench.sh attachment was missing; please attach it.  Please give the 
> > output
> > of this command:
> > 
> >   select name, setting from pg_settings where setting <> boot_val;

(I intentionally show all the results..)
=# select name, setting from pg_settings where setting<> boot_val;
name|  setting   
+
 application_name   | psql
 archive_command| (disabled)
 client_encoding| UTF8
 data_directory_mode| 0700
 default_text_search_config | pg_catalog.english
 lc_collate | en_US.UTF-8
 lc_ctype   | en_US.UTF-8
 lc_messages| en_US.UTF-8
 lc_monetary| en_US.UTF-8
 lc_numeric | en_US.UTF-8
 lc_time| en_US.UTF-8
 log_checkpoints| on
 log_file_mode  | 0600
 log_timezone   | Asia/Tokyo
 max_stack_depth| 2048
 max_wal_senders| 0
 max_wal_size   | 10240
 server_encoding| UTF8
 shared_buffers | 16384
 TimeZone   | Asia/Tokyo
 unix_socket_permissions| 0777
 wal_buffers| 512
 wal_level  | minimal
(23 rows)

The results for the "replica" setting in the benchmark script are used as
the base numbers (i.e., the denominator of the percentages).

> > 3. Restart server with wal_level = replica then run the bench.sh
> > twice.
> 
> I assume this is wal_level=minimal, not wal_level=replica.

Oops! That's wrong; I ran once with replica, then twice with minimal.


Anyway, I revised the benchmarking scripts and attached them.  The
parameters written in benchmain.sh were chosen so that running
./bench2.pl with those parameters against a wal_level=minimal server
takes around 60 seconds.

I'll send the complete data tomorrow (in JST). The attached f.txt is
the result of a preliminary test with only pages=100 and 250 (on HDD).

The attached files are:
  benchmain.sh - main script
  bench2.sh    - run a benchmark with a single set of parameters
  bench1.pl    - benchmark client program
  summarize.pl - script to summarize benchmain.sh's output
  f.txt.gz     - result only for pages=100, DDL count = 2200 (not 2250)

How to run:

$ /..unpatched_path../initdb -D 
 (wal_level=replica, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$ /..patched_path../initdb -D 
 (wal_level=minimal, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$ ./benchmain.sh > f.txt         # output raw data
$ ./summarize.pl [-v] < f.txt    # show summary


With the attached f.txt, summarize.pl gives the following output.
WAL wins with the that pages.

$ cat f.txt | ./summarize.pl
## params: wal_level=replica mode=none pages=100 count=353 scale=20
(% are relative to "before")
before: tps  262.3 (100.0%), lat39.840 ms (100.0%) (29 samples)
during: tps  120.7 ( 46.0%), lat   112.508 ms (282.4%) (35 samples)
 after: tps  106.3 ( 40.5%), lat   163.492 ms (410.4%) (86 samples)
DDL time:  34883 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=100 count=353 scale=20
(% are relative to "before")
before: tps  226.3 (100.0%), lat48.091 ms (100.0%) (29 samples)
during: tps   83.0 ( 36.7%), lat   184.942 ms (384.6%) (100 samples)
 after: tps   82.6 ( 36.5%), lat   196.863 ms (409.4%) (21 samples)
DDL time:  99239 ms ( 284.5% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=100 count=353 scale=20
(% are relative to "before")
before: tps  240.3 (100.0%), lat44.686 ms (100.0%) (29 samples)
during: tps  129.6 ( 53.9%), lat   113.585 ms (254.2%) (31 samples)
 after: tps  124.5 ( 51.8%), lat   141.992 ms (317.8%) (90 samples)
DDL time:  30392 ms (  87.1% relative to mode=none)
## params: wal_level=replica mode=none pages=250 count=258 scale=20
(% are relative to "before")
before: tps  266.3 (100.0%), lat45.884 ms (100.0%) (29 samples)
during: 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-28 Thread Noah Misch
On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> I measured the performance with the latest patch set.
> 
> > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> >minute when done via syncs.
> > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > 3. Wait 10s.
> > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

If you have the raw data requested in (5), please share them here so folks
have the option to reproduce your graphs and calculations.

> I did the following benchmarking.
> 
> 1. Initialize bench database
> 
>   $ pgbench -i -s 20
> 
> 2. Start server with wal_level = replica (all other variables are not
> changed) then run the attached ./bench.sh

The bench.sh attachment was missing; please attach it.  Please give the output
of this command:

  select name, setting from pg_settings where setting <> boot_val;

> 3. Restart server with wal_level = replica then run the bench.sh
> twice.

I assume this is wal_level=minimal, not wal_level=replica.




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-28 Thread Kyotaro Horiguchi
I measured the performance with the latest patch set.

> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
>minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

I did the following benchmarking.

1. Initialize bench database

  $ pgbench -i -s 20

2. Start server with wal_level = replica (all other variables are not
changed) then run the attached ./bench.sh

  $ ./bench.sh <count> <pages> <mode>

where count is the number of repetitions, pages is the number of pages
to write in a run, and mode is "s" (sync) or "w" (WAL). The mode
doesn't matter if wal_level = replica. The script shows the following
result.

| before: tps 240.2, lat 44.087 ms (29 samples)
| during: tps 109.1, lat 114.887 ms (14 samples)
| after : tps 269.9, lat 39.557 ms (107 samples)
| DDL time = 13965 ms
| # transaction type: 

before: mean numbers before "the DDL" starts.
during: mean numbers while "the DDL" is running.
after : mean numbers after "the DDL" ends.
DDL time: the time took to run "the DDL".

3. Restart server with wal_level = replica then run the bench.sh
twice.

  $ ./bench.sh <count> <pages> s
  $ ./bench.sh <count> <pages> w


Finally I got three graphs. (attached 1, 2, 3. PNGs)

* Graph 1 - The effect of the DDL on pgbench's TPS

 The vertical axis is "during TPS" / "before TPS" in %. Larger is
 better. The horizontal axis is the table size in pages.

 Replica and Minimal-sync are almost flat.  Minimal-WAL gets worse
 as table size increases. 500 pages seems to be the crosspoint.


* Graph 2 - The effect of the DDL on pgbench's latency.

 The vertical axis is "during latency" / "before latency" in
 %. Smaller is better. Like TPS, but WAL latency gets worse more
 quickly as table size increases. The crosspoint seems to be 300 pages
 or so.


* Graph 3 - The effect of pgbench's workload on DDL runtime.

 The vertical axis is "time the DDL takes to run with pgbench" /
 "time the DDL takes to run alone". Smaller is better. Replica and
 Minimal-SYNC show a similar tendency. On Minimal-WAL the DDL runs
 quite fast with small tables. The crosspoint seems to be about 2500
 pages.

Seeing this, I began to worry that the optimization might give a far
smaller advantage than expected. Putting that aside, it seems to me
that the default value for the threshold should be 500-1000 pages, the
same as the previous benchmark showed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-28 Thread Kyotaro Horiguchi
At Tue, 26 Nov 2019 21:37:52 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> It is not fully checked. I didn't merge and measure performance yet,
> but I post the status-quo patch for now.

It was actually an inconsistency caused by swap_relation_files.

1. rd_createSubid of the relcache entry for r2 is not turned off. This
   prevents the relcache entry from being flushed. Commit processes
   pendingSyncs and leaves the relcache entry with rd_createSubid !=
   Invalid. That is an inconsistency.

2. relation_open(r1) returns a relcache entry whose relfilenode still
   has the old value (relfilenode1), since the command counter has not
   been incremented. On the other hand, if it is incremented just
   before, AssertPendingSyncConsistency() aborts because of the
   inconsistency between relfilenode and rd_firstRel*.

As a result, I came back to thinking that we need to update both
relcache entries with the right relfilenode.

I once thought that taking AEL in the function has no side effects, but
the code path is executed also when wal_level = replica or higher. And
as I mentioned upthread, we can even get there without taking any lock
on r1, or sometimes with only ShareLock. So upgrading to AEL emits a
Standby/LOCK WAL record and propagates it to standbys. After all, I'd
like to take the weakest lock (AccessShareLock) there.

The attached is the new version of the patch.

- v26-0001-version-nm24.patch
  Same with v24

- v26-0002-change-swap_relation_files.patch

 Changes to swap_relation_files as mentioned above.

- v26-0003-Improve-the-performance-of-relation-syncs.patch

 Do multiple pending syncs by one shared_buffers scanning.

- v26-0004-Revert-FlushRelationBuffersWithoutRelcache.patch

 v26-0003 makes the function useless. Remove it.

- v26-0005-Fix-gistGetFakeLSN.patch

 gistGetFakeLSN fix.

- v26-0006-Sync-files-shrinked-by-truncation.patch

 Fix the problem of a commit-time FPI after truncation following a
 checkpoint. I'm not sure this is the right direction, but
 pendingSyncHash is removed from the pendingDeletes list again.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From ee96bb1e14969823eab79ab1531d68e8aadc1915 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v26 1/6] version nm24

Noah Misch's version 24.
---
 doc/src/sgml/config.sgml |  43 +++--
 doc/src/sgml/perform.sgml|  47 ++
 src/backend/access/gist/gistutil.c   |   7 +-
 src/backend/access/heap/heapam.c |  45 +-
 src/backend/access/heap/heapam_handler.c |  22 +--
 src/backend/access/heap/rewriteheap.c|  21 +--
 src/backend/access/nbtree/nbtsort.c  |  41 ++---
 src/backend/access/transam/README|  47 +-
 src/backend/access/transam/xact.c|  14 ++
 src/backend/access/transam/xloginsert.c  |  10 +-
 src/backend/access/transam/xlogutils.c   |  17 +-
 src/backend/catalog/heap.c   |   4 +
 src/backend/catalog/storage.c| 198 +--
 src/backend/commands/cluster.c   |  11 ++
 src/backend/commands/copy.c  |  58 +--
 src/backend/commands/createas.c  |  11 +-
 src/backend/commands/matview.c   |  12 +-
 src/backend/commands/tablecmds.c |  11 +-
 src/backend/storage/buffer/bufmgr.c  |  37 +++--
 src/backend/storage/smgr/md.c|   9 +-
 src/backend/utils/cache/relcache.c   | 122 ++
 src/backend/utils/misc/guc.c |  13 ++
 src/include/access/heapam.h  |   3 -
 src/include/access/rewriteheap.h |   2 +-
 src/include/access/tableam.h |  18 +--
 src/include/catalog/storage.h|   5 +
 src/include/storage/bufmgr.h |   5 +
 src/include/utils/rel.h  |  57 +--
 src/include/utils/relcache.h |   8 +-
 src/test/regress/pg_regress.c|   2 +
 30 files changed, 551 insertions(+), 349 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..d0f7dbd7d7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2483,21 +2483,14 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.


-In minimal level, WAL-logging of some bulk
-operations can be safely skipped, which can make those
-operations much faster (see ).
-Operations in which this optimization can be applied include:
-
- CREATE TABLE AS
- CREATE INDEX
- CLUSTER
- COPY into tables that were created or truncated in the same
- transaction
-
-But minimal WAL does not contain enough information to reconstruct the
-data from a base backup and the WAL logs, so replica or
-higher must be used to enable WAL archiving
-() and streaming replication.
+In minimal level, no information is logged for
+tables or indexes for the remainder of a transaction that creates or
+truncates them.  This 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-26 Thread Kyotaro Horiguchi
At Sun, 24 Nov 2019 22:08:39 -0500, Noah Misch  wrote in 
> On Mon, Nov 25, 2019 at 11:08:54AM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch  wrote in 
> > > I noticed an additional defect:
> > > 
> > > BEGIN;
> > > CREATE TABLE t (c) AS SELECT 1;
> > > CHECKPOINT; -- write and fsync the table's one page
> > > TRUNCATE t; -- no WAL
> > > COMMIT; -- no FPI, just the commit record
> > > 
> > > If we crash after the COMMIT and before the next fsync or OS-elected sync 
> > > of
> > > the table's file, the table will stay on disk with its pre-TRUNCATE 
> > > content.
> > 
> > The TRUNCATE replaces relfilenode in the catalog
> 
> No, it does not.  Since the relation is new in the transaction, the TRUNCATE
> uses the heap_truncate_one_rel() strategy.
..
> The zero-pages case is not special.  Here's an example of the problem with a
> nonzero size:

I got it. That is, if the file has had blocks beyond its size at
commit, we should sync the file even if it is small enough. It needs to
track the before-truncation size, as this patch used to do.

pendingSyncHash is resurrected to do truncation-size tracking. That
information cannot be stored in SMgrRelation, which can disappear due
to invalidation, nor in Relation, which is not available in the storage
layer. smgrDoPendingDeletes needs to be called at abort again to clean
up the now-useless hash. I'm not sure of the exact cause, but
AssertPendingSyncs_RelationCache() fails at abort (so it is not called
at abort).

smgrDoPendingSyncs and RelFileNodeSkippingWAL() become simpler by
using the hash.

It is not fully checked. I didn't merge and measure performance yet,
but I post the status-quo patch for now.

- v25-0001-version-nm.patch

Noah's v24 patch.

- v25-0002-Revert-FlushRelationBuffersWithoutRelcache.patch

Remove useless function (added by this patch..).

- v25-0003-Improve-the-performance-of-relation-syncs.patch

Make smgrDoPendingSyncs scan shared buffer once.

v25-0004-Adjust-gistGetFakeLSN.patch

Amendment for gistGetFakeLSN. This uses GetXLogInsertRecPtr as long as
it differs from the value at the previous call, and emits a dummy WAL
record if we need a new LSN. Since records other than switch_wal cannot
be empty, the dummy WAL record has an integer payload for now.
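
Roughly, the idea is the following sketch (not the attached patch itself; the
record id XLOG_GIST_ASSIGN_LSN and the int payload are placeholders I use
here for illustration):

  #include "postgres.h"

  #include "access/gistxlog.h"
  #include "access/rmgr.h"
  #include "access/xlog.h"
  #include "access/xloginsert.h"

  /* Hypothetical fake-LSN provider for a WAL-skipping GiST relation. */
  static XLogRecPtr
  gistFakeLSNForWalSkippedRel(void)
  {
      static XLogRecPtr lastlsn = InvalidXLogRecPtr;
      XLogRecPtr  currlsn = GetXLogInsertRecPtr();

      /* Reuse the current insert position while it keeps moving forward. */
      if (currlsn > lastlsn)
      {
          lastlsn = currlsn;
          return currlsn;
      }

      /* Otherwise emit a tiny dummy record, only to reserve a fresh LSN. */
      {
          int     dummy = 0;  /* records other than switch_wal cannot be empty */

          XLogBeginInsert();
          XLogRegisterData((char *) &dummy, sizeof(dummy));
          lastlsn = XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
          return lastlsn;
      }
  }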

v25-0005-Sync-files-shrinked-by-truncation.patch

Amendment for the truncation problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 86d7c2dee819b1171f0a02c56e4cda065c64246f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v25 1/5] version nm

---
 doc/src/sgml/config.sgml |  43 +++--
 doc/src/sgml/perform.sgml|  47 ++
 src/backend/access/gist/gistutil.c   |   7 +-
 src/backend/access/heap/heapam.c |  45 +-
 src/backend/access/heap/heapam_handler.c |  22 +--
 src/backend/access/heap/rewriteheap.c|  21 +--
 src/backend/access/nbtree/nbtsort.c  |  41 ++---
 src/backend/access/transam/README|  47 +-
 src/backend/access/transam/xact.c|  14 ++
 src/backend/access/transam/xloginsert.c  |  10 +-
 src/backend/access/transam/xlogutils.c   |  17 +-
 src/backend/catalog/heap.c   |   4 +
 src/backend/catalog/storage.c| 198 +--
 src/backend/commands/cluster.c   |  11 ++
 src/backend/commands/copy.c  |  58 +--
 src/backend/commands/createas.c  |  11 +-
 src/backend/commands/matview.c   |  12 +-
 src/backend/commands/tablecmds.c |  11 +-
 src/backend/storage/buffer/bufmgr.c  |  37 +++--
 src/backend/storage/smgr/md.c|   9 +-
 src/backend/utils/cache/relcache.c   | 122 ++
 src/backend/utils/misc/guc.c |  13 ++
 src/include/access/heapam.h  |   3 -
 src/include/access/rewriteheap.h |   2 +-
 src/include/access/tableam.h |  18 +--
 src/include/catalog/storage.h|   5 +
 src/include/storage/bufmgr.h |   5 +
 src/include/utils/rel.h  |  57 +--
 src/include/utils/relcache.h |   8 +-
 src/test/regress/pg_regress.c|   2 +
 30 files changed, 551 insertions(+), 349 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..d0f7dbd7d7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2483,21 +2483,14 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.


-In minimal level, WAL-logging of some bulk
-operations can be safely skipped, which can make those
-operations much faster (see ).
-Operations in which this optimization can be applied include:
-
- CREATE TABLE AS
- CREATE INDEX
- CLUSTER
- COPY into tables that were created or truncated in the same
- transaction
-
-But minimal WAL does not contain enough information to reconstruct 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-25 Thread Noah Misch
On Mon, Nov 25, 2019 at 03:58:14PM -0500, Robert Haas wrote:
> On Sat, Nov 23, 2019 at 4:21 PM Noah Misch  wrote:
> > I noticed an additional defect:
> >
> > BEGIN;
> > CREATE TABLE t (c) AS SELECT 1;
> > CHECKPOINT; -- write and fsync the table's one page
> > TRUNCATE t; -- no WAL
> > COMMIT; -- no FPI, just the commit record
> >
> > If we crash after the COMMIT and before the next fsync or OS-elected sync of
> > the table's file, the table will stay on disk with its pre-TRUNCATE content.
> 
> Shouldn't the TRUNCATE be triggering an fsync() to happen before
> COMMIT is permitted to complete?

With wal_skip_threshold=0, you do get an fsync().  The patch tries to avoid
at-commit fsync of small files by WAL-logging file contents instead.  However,
the patch doesn't WAL-log enough to handle files that decreased in size.

> You'd have the same problem if the
> TRUNCATE were replaced by INSERT, unless fsync() happens in that case.

I think an insert would be fine.  You'd get an FPI record for the relation's
one page, which fully reproduces the relation.
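
Restated as code, the commit-time choice the patch makes is roughly this
(a schematic sketch; function and variable names are approximate, not the
patch text):

  #include "postgres.h"

  #include "access/xloginsert.h"   /* log_newpage_range() */
  #include "storage/bufmgr.h"
  #include "storage/smgr.h"
  #include "utils/rel.h"

  extern int wal_skip_threshold;   /* GUC, in kilobytes */

  /* Hypothetical commit-time handling of one WAL-skipped relation. */
  static void
  commit_pending_sync(Relation rel, bool was_truncated)
  {
      BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

      if (!was_truncated &&
          (uint64) nblocks * BLCKSZ < (uint64) wal_skip_threshold * 1024)
      {
          /* Small file: dump its contents into WAL as full-page images. */
          log_newpage_range(rel, MAIN_FORKNUM, 0, nblocks, false);
      }
      else
      {
          /* Large file, or one that shrank: make the file itself durable. */
          FlushRelationBuffers(rel);
          RelationOpenSmgr(rel);
          smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
      }
  }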




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-25 Thread Robert Haas
On Sat, Nov 23, 2019 at 4:21 PM Noah Misch  wrote:
> I noticed an additional defect:
>
> BEGIN;
> CREATE TABLE t (c) AS SELECT 1;
> CHECKPOINT; -- write and fsync the table's one page
> TRUNCATE t; -- no WAL
> COMMIT; -- no FPI, just the commit record
>
> If we crash after the COMMIT and before the next fsync or OS-elected sync of
> the table's file, the table will stay on disk with its pre-TRUNCATE content.

Shouldn't the TRUNCATE be triggering an fsync() to happen before
COMMIT is permitted to complete? You'd have the same problem if the
TRUNCATE were replaced by INSERT, unless fsync() happens in that case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-24 Thread Noah Misch
On Mon, Nov 25, 2019 at 11:08:54AM +0900, Kyotaro Horiguchi wrote:
> At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch  wrote in 
> > This benchmark procedure may help:
> > 
> > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> >minute when done via syncs.
> > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > 3. Wait 10s.
> > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> > 
> > I would compare pgbench tps and latency between the seconds when DDL is and 
> > is
> > not running.  As you did in earlier tests, I would repeat it using various
> > page counts, with and without sync.
> 
> I understood the "DDL" is not pure DDLs but a kind of
> define-then-load, like "CREATE TABLE AS" , "CREATE TABLE" then "COPY
> FROM".

When I wrote "DDL", I meant the four-command transaction that you already used
in benchmarks.

> > I noticed an additional defect:
> > 
> > BEGIN;
> > CREATE TABLE t (c) AS SELECT 1;
> > CHECKPOINT; -- write and fsync the table's one page
> > TRUNCATE t; -- no WAL
> > COMMIT; -- no FPI, just the commit record
> > 
> > If we crash after the COMMIT and before the next fsync or OS-elected sync of
> > the table's file, the table will stay on disk with its pre-TRUNCATE content.
> 
> The TRUNCATE replaces relfilenode in the catalog

No, it does not.  Since the relation is new in the transaction, the TRUNCATE
uses the heap_truncate_one_rel() strategy.

> Since the file has no pages, it's right that no FPI emitted.

Correct.

> If we don't rely on such
> behavior, we can make sure thhat by turning the zero-pages case from
> WAL into file sync. I'll do that in the next version.

The zero-pages case is not special.  Here's an example of the problem with a
nonzero size:

BEGIN;
CREATE TABLE t (c) AS SELECT * FROM generate_series(1,10);
CHECKPOINT; -- write and fsync the table's many pages
TRUNCATE t; -- no WAL
INSERT INTO t VALUES (0); -- no WAL
COMMIT; -- FPI for one page; nothing removes the additional pages




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-24 Thread Kyotaro Horiguchi
At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch  wrote in 
> On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
> > By the way, before finalize this, I'd like to share the result of a
> > brief benchmarking.
> 
> What non-default settings did you use?  Please give the output of this or a
> similar command:

Only wal_level=minimal and max_wal_senders=0.

>   select name, setting from pg_settings where setting <> boot_val;
> 
> If you run more benchmarks and weren't already using wal_buffers=16MB, I
> recommend using it.

Roger.

> > With 10 pgbench sessions.
> > pages   SYNC WAL 
> > 1:   915 ms   301 ms
> > 3:  1634 ms   508 ms
> > 5:  1634 ms   293ms
> >10:  1671 ms  1043 ms
> >17:  1600 ms   333 ms
> >31:  1864 ms   314 ms
> >56:  1562 ms   448 ms
> >   100:  1538 ms   394 ms
> >   177:  1697 ms  1047 ms
> >   316:  3074 ms  1788 ms
> >   562:  3306 ms  1245 ms
> >  1000:  3440 ms  2182 ms
> >  1778:  5064 ms  6464 ms  # WAL's slope becomes steep
> >  3162:  8675 ms  8165 ms
> 
> For picking a default wal_skip_threshold, it would have been more informative
> to see how this changes pgbench latency statistics.  Some people want DDL to
> be fast, but more people want DDL not to reduce the performance of concurrent
> non-DDL.  This benchmark procedure may help:
> 
> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
>minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> 
> I would compare pgbench tps and latency between the seconds when DDL is and is
> not running.  As you did in earlier tests, I would repeat it using various
> page counts, with and without sync.

I understood the "DDL" is not pure DDLs but a kind of
define-then-load, like "CREATE TABLE AS" , "CREATE TABLE" then "COPY
FROM".

> On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
> > +Prefer to do the same in future access methods.  However, two other 
> > approaches
> > +can work.  First, an access method can irreversibly transition a given fork
> > +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
> > +smgrimmedsync().  Second, an access method can opt to write WAL
> > +unconditionally for permanent relations.  When using the second method, do 
> > not
> > +call RelationCopyStorage(), which skips WAL.
> > 
> > Even using these methods, TransactionCommit flushes out buffers then
> > sync files again. Isn't a description something like the following
> > needed?
> > 
> > ===
> > Even an access method switched a in-transaction created relfilenode to
> > WAL-writing, Commit(Prepare)Transaction flushed all buffers for the
> > file then smgrimmedsync() the file.
> > ===
> 
> It is enough that the text says to prefer the approach that core access
> methods use.  The extra flush and sync when using a non-preferred approach
> wastes some performance, but it is otherwise harmless.

Ah, right, and I agree.

> > +   rel1 = relation_open(r1, AccessExclusiveLock);
> > +   RelationAssumeNewRelfilenode(rel1);
> > 
> > It cannot be accessed from other sessions. Theoretically it doesn't
> > need a lock but NoLock cannot be used there since there's a path that
> > doesn't take lock on the relation. But AEL seems too strong and it
> > causes unecessary side effect. Couldn't we use weaker locks?
> 
> We could use NoLock.  I assumed we already hold AccessExclusiveLock, in which
> case this has no side effects.

I forgot that this optimization is used only in non-replication
configurations. So I agree that AEL has no side effects there.

> On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
> > At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch  wrote in 
> > > === Defect 1: gistGetFakeLSN()
> > > 
> > > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > > gistGetFakeLSN().  One could reproduce the problem just by running this
> > > sequence in psql:
> > > 
> > >   begin;
> > >   create table gist_point_tbl(id int4, p point);
> > >   create index gist_pointidx on gist_point_tbl using gist(p);
> > >   insert into gist_point_tbl (id, p)
> > >   select g, point(g*10, g*10) from generate_series(1, 1000) g;
> > > 
> > > I've included a wrong-in-general hack to make the test pass.  I see two 
> > > main
> > > options for fixing this:
> > > 
> > > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > > effect.  Make GiST use that for permanent relations that are skipping WAL.
> > > Further optimizations are possible.  For example, we could use a 
> > > backend-local
> > > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > > counter is greater a recent real LSN.  That 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-24 Thread Michael Paquier
On Sat, Nov 23, 2019 at 11:35:09AM -0500, Noah Misch wrote:
> That longstanding optimization is too useful to remove, but likely not useful
> enough to add today if we didn't already have it.  The initial-data-load use
> case remains plausible.  I can also imagine using wal_level=minimal for data
> warehouse applications where one can quickly rebuild from the authoritative
> data.

I can easily imagine cases where a user would like to use the benefit
of the optimization for an initial data load, and afterwards update
wal_level to replica so that they avoid the initial WAL burst, which
serves no real purpose.  So the first argument is pretty strong IMO,
the second much less so.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-23 Thread Noah Misch
On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
> By the way, before finalize this, I'd like to share the result of a
> brief benchmarking.

What non-default settings did you use?  Please give the output of this or a
similar command:

  select name, setting from pg_settings where setting <> boot_val;

If you run more benchmarks and weren't already using wal_buffers=16MB, I
recommend using it.

> With 10 pgbench sessions.
> pages   SYNC WAL 
> 1:   915 ms   301 ms
> 3:  1634 ms   508 ms
> 5:  1634 ms   293ms
>10:  1671 ms  1043 ms
>17:  1600 ms   333 ms
>31:  1864 ms   314 ms
>56:  1562 ms   448 ms
>   100:  1538 ms   394 ms
>   177:  1697 ms  1047 ms
>   316:  3074 ms  1788 ms
>   562:  3306 ms  1245 ms
>  1000:  3440 ms  2182 ms
>  1778:  5064 ms  6464 ms  # WAL's slope becomes steep
>  3162:  8675 ms  8165 ms

For picking a default wal_skip_threshold, it would have been more informative
to see how this changes pgbench latency statistics.  Some people want DDL to
be fast, but more people want DDL not to reduce the performance of concurrent
non-DDL.  This benchmark procedure may help:

1. Determine $DDL_COUNT, a number of DDL transactions that take about one
   minute when done via syncs.
2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
3. Wait 10s.
4. Start one DDL backend that runs $DDL_COUNT transactions.
5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

I would compare pgbench tps and latency between the seconds when DDL is and is
not running.  As you did in earlier tests, I would repeat it using various
page counts, with and without sync.

On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
> +Prefer to do the same in future access methods.  However, two other 
> approaches
> +can work.  First, an access method can irreversibly transition a given fork
> +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
> +smgrimmedsync().  Second, an access method can opt to write WAL
> +unconditionally for permanent relations.  When using the second method, do 
> not
> +call RelationCopyStorage(), which skips WAL.
> 
> Even using these methods, TransactionCommit flushes out buffers then
> sync files again. Isn't a description something like the following
> needed?
> 
> ===
> Even an access method switched a in-transaction created relfilenode to
> WAL-writing, Commit(Prepare)Transaction flushed all buffers for the
> file then smgrimmedsync() the file.
> ===

It is enough that the text says to prefer the approach that core access
methods use.  The extra flush and sync when using a non-preferred approach
wastes some performance, but it is otherwise harmless.

> + rel1 = relation_open(r1, AccessExclusiveLock);
> + RelationAssumeNewRelfilenode(rel1);
> 
> It cannot be accessed from other sessions. Theoretically it doesn't
> need a lock but NoLock cannot be used there since there's a path that
> doesn't take lock on the relation. But AEL seems too strong and it
> causes unecessary side effect. Couldn't we use weaker locks?

We could use NoLock.  I assumed we already hold AccessExclusiveLock, in which
case this has no side effects.

On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch  wrote in 
> > === Defect 1: gistGetFakeLSN()
> > 
> > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > gistGetFakeLSN().  One could reproduce the problem just by running this
> > sequence in psql:
> > 
> >   begin;
> >   create table gist_point_tbl(id int4, p point);
> >   create index gist_pointidx on gist_point_tbl using gist(p);
> >   insert into gist_point_tbl (id, p)
> >   select g, point(g*10, g*10) from generate_series(1, 1000) g;
> > 
> > I've included a wrong-in-general hack to make the test pass.  I see two main
> > options for fixing this:
> > 
> > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > effect.  Make GiST use that for permanent relations that are skipping WAL.
> > Further optimizations are possible.  For example, we could use a 
> > backend-local
> > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > counter is greater a recent real LSN.  That optimization is probably too
> > clever, though it would make the new WAL record almost never appear.
> > 
> > (b) Exempt GiST from most WAL skipping.  GiST index build could still skip
> > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead 
> > of
> > RelationNeedsWal(), and we'd need some hack for index_copy_data() and 
> > possibly
> > other AM-independent code that skips WAL.
> > 
> > Overall, I like the cleanliness of (a).  The main argument for (b) is that 
> > it
> > ensures we have all the 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-23 Thread Noah Misch
On Fri, Nov 22, 2019 at 01:21:31PM +0100, Peter Eisentraut wrote:
> On 2019-11-05 22:16, Robert Haas wrote:
> >First, I'd like to restate my understanding of the problem just to see
> >whether I've got the right idea and whether we're all on the same
> >page. When wal_level=minimal, we sometimes try to skip WAL logging on
> >newly-created relations in favor of fsync-ing the relation at commit
> >time.
> 
> How useful is this behavior, relative to all the effort required?
> 
> Even if the benefit is significant, how many users can accept running with
> wal_level=minimal and thus without replication or efficient backups?

That longstanding optimization is too useful to remove, but likely not useful
enough to add today if we didn't already have it.  The initial-data-load use
case remains plausible.  I can also imagine using wal_level=minimal for data
warehouse applications where one can quickly rebuild from the authoritative
data.

> Is there perhaps an alternative approach involving unlogged tables to get a
> similar performance benefit?

At wal_level=replica, it seems inevitable that ALTER TABLE SET LOGGED will
need to WAL-log the table contents.  I suppose we could keep wal_level=minimal
and change its only difference from wal_level=replica to be that ALTER TABLE
SET LOGGED skips WAL.  Currently, ALTER TABLE SET LOGGED also rewrites the
table; that would need to change.  I'd want to add ALTER INDEX SET LOGGED,
too.  After all that, users would need to modify their applications.  Overall,
it's possible, but it's not a clear win over the status quo.




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-22 Thread Peter Eisentraut

On 2019-11-05 22:16, Robert Haas wrote:

First, I'd like to restate my understanding of the problem just to see
whether I've got the right idea and whether we're all on the same
page. When wal_level=minimal, we sometimes try to skip WAL logging on
newly-created relations in favor of fsync-ing the relation at commit
time.


How useful is this behavior, relative to all the effort required?

Even if the benefit is significant, how many users can accept running 
with wal_level=minimal and thus without replication or efficient backups?


Is there perhaps an alternative approach involving unlogged tables to 
get a similar performance benefit?


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-21 Thread Kyotaro Horiguchi
At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> > For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> > smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> > sophisticated about optimizing the shared buffers scan.  Commit 279628a
> > introduced that, in 2013.  I think smgrDoPendingSyncs() should do likewise, 
> > to
> Seems reasonable. Please wait a minite.

This is a first cut at that. It makes the function
FlushRelationBuffersWithoutRelcache, which was introduced earlier in this work,
useless. The first patch reverts it, then the second patch adds the bulk sync
feature.

The new function FlushRelFileNodesAllBuffers, unlike
DropRelFileNodesAllBuffers, takes SMgrRelations, which FlushBuffer()
requires. So it takes a somewhat tricky approach, using the type
SMgrSortArray, a pointer to which is compatible with a pointer to RelFileNode.
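For readers unfamiliar with the trick, here is a minimal, self-contained
sketch of the idea; the type and field names are simplified stand-ins for
illustration, not necessarily the ones in the patch.

/*
 * Hedged sketch of the "first member" trick: because rnode is the first
 * member, a pointer to SMgrSortArray is also a valid pointer to its
 * RelFileNode, so sorting/searching code written for RelFileNode arrays
 * can be reused while the SMgrRelation travels alongside.
 */
#include <stdio.h>

typedef struct RelFileNode
{
	unsigned	spcNode;
	unsigned	dbNode;
	unsigned	relNode;
} RelFileNode;

typedef struct SMgrRelationData *SMgrRelation;	/* opaque here */

typedef struct SMgrSortArray
{
	RelFileNode rnode;			/* must stay the first member */
	SMgrRelation srel;
} SMgrSortArray;

int
main(void)
{
	SMgrSortArray entry = {{1663, 13593, 16384}, NULL};
	RelFileNode *as_rnode = (RelFileNode *) &entry;	/* same address as &entry.rnode */

	printf("relNode = %u\n", as_rnode->relNode);
	return 0;
}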

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c51b44734d88fb19b568c4c0240848c8be2b7cf4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 19:28:35 +0900
Subject: [PATCH 1/2] Revert FlushRelationBuffersWithoutRelcache.

Succeeding patch makes the function useless and the function is no
longer useful globally. Revert it.
---
 src/backend/storage/buffer/bufmgr.c | 27 ++-
 src/include/storage/bufmgr.h|  2 --
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 746ce477fc..67bbb26cae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,27 +3203,20 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-	RelationOpenSmgr(rel);
-
-	FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
-		RelationUsesLocalBuffers(rel));
-}
-
-void
-FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
-{
-	RelFileNode rnode = smgr->smgr_rnode.node;
-	int i;
+	int			i;
 	BufferDesc *bufHdr;
 
-	if (islocal)
+	/* Open rel at the smgr level if not already done */
+	RelationOpenSmgr(rel);
+
+	if (RelationUsesLocalBuffers(rel))
 	{
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
 
 			bufHdr = GetLocalBufferDescriptor(i);
-			if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+			if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
 				((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
 			{
@@ -3240,7 +3233,7 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
 
 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-smgrwrite(smgr,
+smgrwrite(rel->rd_smgr,
 		  bufHdr->tag.forkNum,
 		  bufHdr->tag.blockNum,
 		  localpage,
@@ -3270,18 +3263,18 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
 		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
 		 * and saves some cycles.
 		 */
-		if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
+		if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
 			continue;
 
 		ReservePrivateRefCountEntry();
 
 		buf_state = LockBufHdr(bufHdr);
-		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+		if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
 			(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-			FlushBuffer(bufHdr, smgr);
+			FlushBuffer(bufHdr, rel->rd_smgr);
 			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 			UnpinBuffer(bufHdr, true);
 		}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8097d5ab22..8cd1cf25d9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -192,8 +192,6 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
-extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
-bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
    int nforks, BlockNumber *firstDelBlock);
-- 
2.23.0

>From 882731fcf063269d0bf85c57f23c83b9570e5df5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 19:33:18 +0900
Subject: [PATCH 2/2] Improve the performance of relation syncs.

We can improve performance of syncing multiple files at once in the
same way as b41669118. This reduces the number of scans on the whole
shared_buffers from the number of synced relations to one.
---
 src/backend/catalog/storage.c   |  28 +--
 src/backend/storage/buffer/bufmgr.c | 113 
 src/backend/storage/smgr/smgr.c |  38 +-
 src/include/storage/bufmgr.h|   1 +
 src/include/storage/smgr.h 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-20 Thread Kyotaro Horiguchi
Wow.. This is embarrassing.. *^^*.

At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I should have replied this first.
> 
> At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch  wrote in 
> > On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > > I started pre-commit editing on 2019-10-28, and comment+README updates 
> > > have
> > > been the largest part of that.  I'll check my edits against the things you
> > > list here, and I'll share on-list before committing.  I've now marked the 
> > > CF
> > > entry Ready for Committer.
> > 
> > Having dedicated many days to that, I am attaching v24nm.  I know of two
> > remaining defects:
> > 
> > === Defect 1: gistGetFakeLSN()
> > 
> > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > gistGetFakeLSN().  One could reproduce the problem just by running this
> > sequence in psql:
> > 
> >   begin;
> >   create table gist_point_tbl(id int4, p point);
> >   create index gist_pointidx on gist_point_tbl using gist(p);
> >   insert into gist_point_tbl (id, p)
> >   select g, point(g*10, g*10) from generate_series(1, 1000) g;
> > 
> > I've included a wrong-in-general hack to make the test pass.  I see two main
> > options for fixing this:
> > 
> > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > effect.  Make GiST use that for permanent relations that are skipping WAL.
> > Further optimizations are possible.  For example, we could use a 
> > backend-local
> > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > counter is greater a recent real LSN.  That optimization is probably too
> > clever, though it would make the new WAL record almost never appear.
> > 
> > (b) Exempt GiST from most WAL skipping.  GiST index build could still skip
> > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead 
> > of
> > RelationNeedsWal(), and we'd need some hack for index_copy_data() and 
> > possibly
> > other AM-independent code that skips WAL.
> > 
> > Overall, I like the cleanliness of (a).  The main argument for (b) is that 
> > it
> > ensures we have all the features to opt-out of WAL skipping, which could be
> > useful for out-of-tree index access methods.  (I think we currently have the
> > features for a tableam to do so, but not for an indexam to do so.)  
> > Overall, I
> > lean toward (a).  Any other ideas or preferences?
> 
> I don't like (b), too.
> 
> What we need there is any sequential numbers for page LSN but really
> compatible with real LSN. Couldn't we use GetXLogInsertRecPtr() in the

> case?  Or, I'm not sure but I suppose that nothing happenes when
> UNLOGGED GiST index gets turned into LOGGED one.

Yes, I just forgot to remove these lines when writing the following.

> Rewriting table like SET LOGGED will work but not realistic.
> 
> > === Defect 2: repetitive work when syncing many relations
> > 
> > For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> > smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> > sophisticated about optimizing the shared buffers scan.  Commit 279628a
> > introduced that, in 2013.  I think smgrDoPendingSyncs() should do likewise, 
> > to
> > further reduce the chance of causing performance regressions.  (One could,
> > however, work around the problem by raising wal_skip_threshold.)  Kyotaro, 
> > if
> > you agree, could you modify v24nm to implement that?
> 
> Seems reasonable. Please wait a minite.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-20 Thread Kyotaro Horiguchi
I should have replied to this first.

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch  wrote in 
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that.  I'll check my edits against the things you
> > list here, and I'll share on-list before committing.  I've now marked the CF
> > entry Ready for Committer.
> 
> Having dedicated many days to that, I am attaching v24nm.  I know of two
> remaining defects:
> 
> === Defect 1: gistGetFakeLSN()
> 
> When I modified pg_regress.c to use wal_level=minimal for all suites,
> src/test/isolation/specs/predicate-gist.spec failed the assertion in
> gistGetFakeLSN().  One could reproduce the problem just by running this
> sequence in psql:
> 
>   begin;
>   create table gist_point_tbl(id int4, p point);
>   create index gist_pointidx on gist_point_tbl using gist(p);
>   insert into gist_point_tbl (id, p)
>   select g, point(g*10, g*10) from generate_series(1, 1000) g;
> 
> I've included a wrong-in-general hack to make the test pass.  I see two main
> options for fixing this:
> 
> (a) Introduce an empty WAL record that reserves an LSN and has no other
> effect.  Make GiST use that for permanent relations that are skipping WAL.
> Further optimizations are possible.  For example, we could use a backend-local
> counter (like the one gistGetFakeLSN() uses for temp relations) until the
> counter is greater a recent real LSN.  That optimization is probably too
> clever, though it would make the new WAL record almost never appear.
> 
> (b) Exempt GiST from most WAL skipping.  GiST index build could still skip
> WAL, but it would do its own smgrimmedsync() in addition to the one done at
> commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> other AM-independent code that skips WAL.
> 
> Overall, I like the cleanliness of (a).  The main argument for (b) is that it
> ensures we have all the features to opt-out of WAL skipping, which could be
> useful for out-of-tree index access methods.  (I think we currently have the
> features for a tableam to do so, but not for an indexam to do so.)  Overall, I
> lean toward (a).  Any other ideas or preferences?

I don't like (b), too.

What we need there is just some sequence of numbers for the page LSN that is
compatible with real LSNs. Couldn't we use GetXLogInsertRecPtr() in that
case?  Or, I'm not sure, but I suppose that nothing happens when an
UNLOGGED GiST index gets turned into a LOGGED one.

Rewriting the table, as SET LOGGED does, would work but is not realistic.

> === Defect 2: repetitive work when syncing many relations
> 
> For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> sophisticated about optimizing the shared buffers scan.  Commit 279628a
> introduced that, in 2013.  I think smgrDoPendingSyncs() should do likewise, to
> further reduce the chance of causing performance regressions.  (One could,
> however, work around the problem by raising wal_skip_threshold.)  Kyotaro, if
> you agree, could you modify v24nm to implement that?

Seems reasonable. Please wait a minute.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 66c52d6dd6..387b1f7d18 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1017,8 +1017,7 @@ gistGetFakeLSN(Relation rel)
 	 * XXX before commit fix this.  This is not correct for
 	 * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
 	 */
-	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
-		|| rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
 	{
 		/*
 		 * Temporary relations are only accessible in our session, so a simple
@@ -1026,6 +1025,15 @@ gistGetFakeLSN(Relation rel)
 		 */
 		return counter++;
 	}
+	else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+	{
+		/*
+		 * Even though we are skipping WAL-logging of a permanent relations,
+		 * the LSN must be a real one because WAL-logging starts after commit.
+		 */
+		Assert(!RelationNeedsWAL(rel));
+		return GetXLogInsertRecPtr();
+	}
 	else
 	{
 		/*


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-20 Thread Kyotaro Horiguchi
At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch  wrote in 
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that.  I'll check my edits against the things you
> > list here, and I'll share on-list before committing.  I've now marked the CF
> > entry Ready for Committer.

I looked the version.

> Notable changes in v24nm:
> 
> - Wrote section "Skipping WAL for New RelFileNode" in
>   src/backend/access/transam/README to be the main source concerning the new
>   coding rules.

Thanks for writing this.

+Prefer to do the same in future access methods.  However, two other approaches
+can work.  First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync().  Second, an access method can opt to write WAL
+unconditionally for permanent relations.  When using the second method, do not
+call RelationCopyStorage(), which skips WAL.

Even when using these methods, transaction commit flushes out the buffers and
then syncs the files again. Isn't a description something like the following
needed?

===
Even if an access method has switched an in-transaction-created relfilenode to
WAL-writing, Commit(Prepare)Transaction still flushes all buffers for the
file and then smgrimmedsync()s the file.
===

> - Updated numerous comments and doc sections.
> 
> - Eliminated the pendingSyncs list in favor of a "sync" field in
>   pendingDeletes.  I mostly did this to eliminate the possibility of the lists
>   getting out of sync.  This removed considerable parallel code for managing a
>   second list at end-of-xact.  We now call smgrDoPendingSyncs() only when
>   committing or preparing a top-level transaction.

Mmm. Right. The second list was a leftover from older versions, which maybe
needed additional work at rollback. Actually, as of v23 the function
syncs no files at rollback. It is wiser to merge the two.

> - Whenever code sets an rd_*Subid field of a Relation, it must call
>   EOXactListAdd().  swap_relation_files() was not doing so, so the field
>   remained set during the next transaction.  I introduced
>   RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
>   so it also affects the mapped relation case.

Ugh.. Thanks for pointing that out. By the way

+   /*
+* Recognize that rel1's relfilenode (swapped from rel2) is new in this
+* subtransaction. Since the next step for rel2 is deletion, don't bother
+* recording the newness of its relfilenode.
+*/
+   rel1 = relation_open(r1, AccessExclusiveLock);
+   RelationAssumeNewRelfilenode(rel1);

It cannot be accessed from other sessions. Theoretically it doesn't
need a lock, but NoLock cannot be used there since there's a path that
doesn't take a lock on the relation. But AEL seems too strong and it
causes unnecessary side effects. Couldn't we use a weaker lock?

... Time is up. I'll continue looking at this.

regards.

> - In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
>   rd_createSubid remained set.  (That happened before this patch, but it has
>   been harmless.)  I fixed this in heap_create().
> 
> - Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM.  A sync is necessary
>   when checksums are enabled.  Observe the precedent that
>   RelationCopyStorage() has not been exempting FSM_FORKNUM.
> 
> - Pass log_newpage_range() a "false" for page_std, for the same reason
>   RelationCopyStorage() does.
> 
> - log_newpage_range() ignored its forkNum and page_std arguments, so we logged
>   the wrong data for non-main forks.  Before this patch, callers always passed
>   MAIN_FORKNUM and "true", hence the lack of complaints.
> 
> - Restored table_finish_bulk_insert(), though heapam no longer provides a
>   callback.  The API is still well-defined, and other table AMs might have use
>   for it.  Removing it feels like a separate proposal.
> 
> - Removed TABLE_INSERT_SKIP_WAL.  Any out-of-tree code using it should revisit
>   itself in light of this patch.
> 
> - Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
>   it was overcounting.
> 
> - Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.
> 
> - Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
>   "Write Ahead Log" -> "Settings", between similar settings
>   wal_writer_flush_after and commit_delay.  The other place I considered was
>   "Resource Consumption" -> "Asynchronous Behavior", due to the similarity of
>   backend_flush_after.
> 
> - Gave each test a unique name.  Changed test table names to be descriptive,
>   e.g. test7 became trunc_trig.
> 
> - Squashed all patches into one.  Split patches are good when one could
>   reasonably choose to push a subset, but that didn't apply here.  I wouldn't
>   push a GUC implementation without its documentation.  Since the tests fail
>   without the main bug fix, I 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-19 Thread Kyotaro Horiguchi
I'm in the benchmarking week..

Thanks for reviewing!

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch  wrote in 
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that.  I'll check my edits against the things you
> > list here, and I'll share on-list before committing.  I've now marked the CF
> > entry Ready for Committer.

I'll look into that soon.

By the way, before finalizing this, I'd like to share the results of a
brief benchmark.

First, I measured the direct effect of WAL skipping.
I measured the time required to run the following sequence for the
COMMIT-FPW-WAL case and the COMMIT-fsync case. WAL and heap files are on a
non-server-spec HDD.

  BEGIN;
  TRUNCATE t;
  INSERT INTO t (SELECT a FROM generate_series(1, n) a);
  COMMIT;

REPLICA means the time with wal_level = replica.
SYNC    means the time with wal_level = minimal and forced file sync.
WAL     means the time with wal_level = minimal and forced commit-WAL.
pages   is the number of pages of the table.
(REPLICA comes from run.sh 1, SYNC/WAL come from run.sh 2)

pages  REPLICA     SYNC      WAL
    1:   144 ms   683 ms   217 ms
    3:   303 ms   995 ms   385 ms
    5:   271 ms  1007 ms   217 ms
   10:   157 ms  1043 ms   224 ms
   17:   189 ms  1007 ms   193 ms
   31:   202 ms  1091 ms   230 ms
   56:   265 ms  1175 ms   226 ms
  100:   510 ms  1307 ms   270 ms
  177:   790 ms  1523 ms   524 ms
  316:  1827 ms  1643 ms   719 ms
  562:  1904 ms  2109 ms  1148 ms
 1000:  3060 ms  2979 ms  2113 ms
 1778:  6077 ms  3945 ms  3618 ms
 3162: 13038 ms  7078 ms  6734 ms

There was a crossing point around 3000 pages. (bench1() finds that by
bisecting, run.sh 3).


With multiple sessions, the crossing point moves down, but it does not become
very small.

10 processes (run.pl 4 10)  The numbers in parentheses are WAL[n]/WAL[n-1].
pages     SYNC      WAL
  316:  8436 ms  4694 ms
  562: 12067 ms  9627 ms (x2.1) # WAL wins
 1000: 19154 ms 43262 ms (x4.5) # SYNC wins. WAL's slope becomes steep.
 1778: 32495 ms 63863 ms (x1.4)

100 processes (run.pl 4 100)
pages     SYNC      WAL
   10: 13275 ms  1868 ms
   17: 15919 ms  4438 ms (x2.3)
   31: 17063 ms  6431 ms (x1.5)
   56: 23193 ms 14276 ms (x2.2)  # WAL wins
  100: 35220 ms 67843 ms (x4.8)  # SYNC wins. WAL's slope becomes steep.

With 10 pgbench sessions.
pages     SYNC      WAL
    1:   915 ms   301 ms
    3:  1634 ms   508 ms
    5:  1634 ms   293 ms
   10:  1671 ms  1043 ms
   17:  1600 ms   333 ms
   31:  1864 ms   314 ms
   56:  1562 ms   448 ms
  100:  1538 ms   394 ms
  177:  1697 ms  1047 ms
  316:  3074 ms  1788 ms
  562:  3306 ms  1245 ms
 1000:  3440 ms  2182 ms
 1778:  5064 ms  6464 ms  # WAL's slope becomes steep
 3162:  8675 ms  8165 ms


I don't think the result for 100 processes is meaningful, so excluding
that result, a candidate for wal_skip_threshold could be 1000.

Thoughts? The attached is the benchmark script.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /usr/bin/perl

use strict;
use IPC::Open2;
use Time::HiRes qw (gettimeofday tv_interval);

my $tupperpage = 226;

my @time = ();

sub bench {
my ($header, $nprocs, $ntups, $threshold) = @_;
my @result = ();
my @rds = ();

for (my $ip = 0 ; $ip < $nprocs ; $ip++)
{
pipe(my $rd, my $wr);
$rds[$ip] = $rd;

my $pid = fork();

die "fork failed: $!\n" if ($pid < 0);
if ($pid == 0)
{
close($rd);

			my $pid = open2(my $psqlrd, my $psqlwr, "psql postgres");
print $psqlwr "SET wal_skip_threshold to $threshold;\n";
print $psqlwr "DROP TABLE IF EXISTS t$ip;";
print $psqlwr "CREATE TABLE t$ip (a int);\n";

my @st = gettimeofday();
for (my $i = 0 ; $i < 10 ; $i++)
{
print $psqlwr "BEGIN;";
print $psqlwr "TRUNCATE t$ip;";
print $psqlwr "INSERT INTO t$ip (SELECT a FROM 
generate_series(1, $ntups) a);";
print $psqlwr "COMMIT;";
}
close($psqlwr);
waitpid($pid, 0);

		print $wr $ip, " ", 1000 * tv_interval(\@st, [gettimeofday]), "\n";
exit;
}
close($wr);
}

my $rpid;
while (($rpid = wait()) == 0) {}

my $sum = 0;
for (my $ip = 0 ; $ip < $nprocs ; $ip++)
{
my $ret = readline($rds[$ip]);
die "format? $ret\n" if ($ret !~ /^([0-9]+) ([0-9.]+)$/);

$sum += $2;
}

printf "$header: procs $nprocs: time 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-17 Thread Noah Misch
On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> I started pre-commit editing on 2019-10-28, and comment+README updates have
> been the largest part of that.  I'll check my edits against the things you
> list here, and I'll share on-list before committing.  I've now marked the CF
> entry Ready for Committer.

Having dedicated many days to that, I am attaching v24nm.  I know of two
remaining defects:

=== Defect 1: gistGetFakeLSN()

When I modified pg_regress.c to use wal_level=minimal for all suites,
src/test/isolation/specs/predicate-gist.spec failed the assertion in
gistGetFakeLSN().  One could reproduce the problem just by running this
sequence in psql:

  begin;
  create table gist_point_tbl(id int4, p point);
  create index gist_pointidx on gist_point_tbl using gist(p);
  insert into gist_point_tbl (id, p)
  select g, point(g*10, g*10) from generate_series(1, 1000) g;

I've included a wrong-in-general hack to make the test pass.  I see two main
options for fixing this:

(a) Introduce an empty WAL record that reserves an LSN and has no other
effect.  Make GiST use that for permanent relations that are skipping WAL.
Further optimizations are possible.  For example, we could use a backend-local
counter (like the one gistGetFakeLSN() uses for temp relations) until the
counter is greater than a recent real LSN.  That optimization is probably too
clever, though it would make the new WAL record almost never appear.

(b) Exempt GiST from most WAL skipping.  GiST index build could still skip
WAL, but it would do its own smgrimmedsync() in addition to the one done at
commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
RelationNeedsWAL(), and we'd need some hack for index_copy_data() and possibly
other AM-independent code that skips WAL.

Overall, I like the cleanliness of (a).  The main argument for (b) is that it
ensures we have all the features to opt-out of WAL skipping, which could be
useful for out-of-tree index access methods.  (I think we currently have the
features for a tableam to do so, but not for an indexam to do so.)  Overall, I
lean toward (a).  Any other ideas or preferences?
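For concreteness, a rough sketch of what option (a) could look like on the
GiST side follows.  The record name XLOG_GIST_ASSIGN_LSN and the function
name are placeholders for illustration, not settled design; the redo routine
for such a record would do nothing.

/* Hedged sketch of option (a); would live in gistxlog.c. */
XLogRecPtr
gistXLogAssignLSN(void)
{
	int			dummy = 0;

	/*
	 * The record exists only to reserve a real LSN for a WAL-skipped
	 * permanent GiST index; replaying it is a no-op.  Register a dummy
	 * payload, since ordinary records are expected to carry some content,
	 * and mark the record unimportant so it does not by itself trigger
	 * checkpoints on an otherwise idle system.
	 */
	XLogBeginInsert();
	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
	XLogRegisterData((char *) &dummy, sizeof(dummy));
	return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
}

gistGetFakeLSN() could then call this for RELPERSISTENCE_PERMANENT relations
whenever RelationNeedsWAL() is false, optionally caching the returned value in
a backend-local variable so that a new record is emitted only when the cached
LSN stops being greater than the last real one.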

=== Defect 2: repetitive work when syncing many relations

For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
sophisticated about optimizing the shared buffers scan.  Commit 279628a
introduced that, in 2013.  I think smgrDoPendingSyncs() should do likewise, to
further reduce the chance of causing performance regressions.  (One could,
however, work around the problem by raising wal_skip_threshold.)  Kyotaro, if
you agree, could you modify v24nm to implement that?


Notable changes in v24nm:

- Wrote section "Skipping WAL for New RelFileNode" in
  src/backend/access/transam/README to be the main source concerning the new
  coding rules.

- Updated numerous comments and doc sections.

- Eliminated the pendingSyncs list in favor of a "sync" field in
  pendingDeletes.  I mostly did this to eliminate the possibility of the lists
  getting out of sync.  This removed considerable parallel code for managing a
  second list at end-of-xact.  We now call smgrDoPendingSyncs() only when
  committing or preparing a top-level transaction.

- Whenever code sets an rd_*Subid field of a Relation, it must call
  EOXactListAdd().  swap_relation_files() was not doing so, so the field
  remained set during the next transaction.  I introduced
  RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
  so it also affects the mapped relation case.

- In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
  rd_createSubid remained set.  (That happened before this patch, but it has
  been harmless.)  I fixed this in heap_create().

- Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM.  A sync is necessary
  when checksums are enabled.  Observe the precedent that
  RelationCopyStorage() has not been exempting FSM_FORKNUM.

- Pass log_newpage_range() a "false" for page_std, for the same reason
  RelationCopyStorage() does.

- log_newpage_range() ignored its forkNum and page_std arguments, so we logged
  the wrong data for non-main forks.  Before this patch, callers always passed
  MAIN_FORKNUM and "true", hence the lack of complaints.

- Restored table_finish_bulk_insert(), though heapam no longer provides a
  callback.  The API is still well-defined, and other table AMs might have use
  for it.  Removing it feels like a separate proposal.

- Removed TABLE_INSERT_SKIP_WAL.  Any out-of-tree code using it should revisit
  itself in light of this patch.

- Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
  it was overcounting.

- Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.

- Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
  "Write Ahead Log" -> "Settings", between similar settings
  

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-06 Thread Kyotaro Horiguchi
Thank you for looking at this.

At Tue, 5 Nov 2019 16:16:14 -0500, Robert Haas  wrote in 
> On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi
>  wrote:
> > This is the fixed verison v22.
> First, I'd like to restate my understanding of the problem just to see
..
> Second, for anyone who is not following this thread closely but is

Thanks for restating the issue and summarizing this patch. All of the
description matches my understanding.

> perform all of the relevant fsync() calls at commit time. This is
> further augmented with a mechanism that instead logs all the relation
> pages in lieu of fsync()ing if the relation is very small, on the
> theory that logging a few FPIs will be cheaper than an fsync(). I view
> this additional mechanism as perhaps a bit much for a bug fix patch,
> but I understand that the goal is to prevent a performance regression,
> and it's not really over the top, so I think it's probably OK.

Thanks. It would need some benchmarking, as mentioned upthread. My new
machine has started working steadily, so I will do that.

> sound. I think there are a number of places where the comments could
> be better; I'll include a few points on that further down. I also
> think that the code in swap_relation_files() which takes ExclusiveLock
> on the relations looks quite strange. It's hard to understand why it's
> needed at all, or why that lock level is used. On the flip side, I

Right. It *was* a mistake to use AccessExclusiveLock. On second thought,
callers must already have taken locks on them at the level required for
relfilenode swapping. However, one problematic case is the toast indexes
of the target relation, which are not locked at all. In the end I used
AccessShareLock, since other than NoLock it is the option that doesn't raise
the lock level. Anyway, the toast relation is not accessible outside the
session. (Done in the attached.)

> think that the test suite looks really impressive and should be of
> considerable help not only in making sure that this is fixed but
> detecting if it gets broken again in the future. Perhaps it doesn't
> cover every scenario we care about, but if that turns out to be the
> case, it seems like it would be easily to further generalize. I really
> like the idea of this *kind* of test framework.

The paths running swap_relation_files() are not covered: CLUSTER,
REFRESH MATERIALIZED VIEW, and ALTER TABLE. CLUSTER and ALTER TABLE can
interact with INSERTs, but MATVIEW cannot. Copying some of the existing test
cases that use them will work. (Not yet done.)

> Comments on comments, and other nitpicking:
> 
> - in-trasaction is mis-spelled in the doc patch. accidentially is
> mis-spelled in the 0002 patch.

Thanks. I found another couple of typos "issuing"->"issueing",
"skipped"->"skpped" by ispell'ing git diff output and all fixed.

> - I think the header comment for the new TAP test could do a far
> better job explaining the overall goal of this testing than it
> actually does.

I rewrote it... 

> - I think somewhere in relcache.c or rel.h there ought to be comments
> explaining the precise degree to which rd_createSubid,
> rd_newRelfilenodeSubid, and rd_firstRelfilenodeSubid are reliable,
> including problem scenarios. This patch removes some language of this
> sort from CopyFrom(), which was a funny place to have that information
> in the first place, but I don't see that it adds anything to replace
> it. I also think that we ought to explain - for the fields that are
> reliable - that they need to be reliable precisely for the purpose of
> not breaking this stuff. There's a bit of this right now:
> 
> + * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
> + * relfilenode change has took place in the current transaction. Unlike
> + * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID
> + * means that the currently active relfilenode is transaction-local and we
> + * sync the relation at commit instead of WAL-logging.
> 
> ...but I think that needs to be somewhat expanded and clarified.

Agreed. It may be crude, but I added descriptions of how the
variables work, and descriptions contrasting them with rd_first*.

# rd_first* is not a hint, in the sense that it is reliable, but it is
# mentioned as a hint in some places, which will need fixing.

If the fix of MarkBufferDirtyHint is ok, I'll merge it into 0002.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c5e7243ba05677ad84bc8b6b03077cadcaadf4b8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 25 Oct 2019 13:07:41 +0900
Subject: [PATCH v23 1/5] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 321 
 1 file changed, 321 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 00..ac62c77a42
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,321 @@
+# Test recovery for skipping 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-05 Thread Noah Misch
On Tue, Nov 05, 2019 at 04:16:14PM -0500, Robert Haas wrote:
> On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi  
> wrote:
> > This is the fixed verison v22.
> 
> I'd like to offer a few thoughts on this thread and on these patches,
> which is now more than 4 years old and more than 150 messages in
> length.
...

Your understanding matches mine.  Thanks for studying this.  I had been
feeling nervous about being the sole reviewer of the latest design.

> Comments on comments, and other nitpicking:

I started pre-commit editing on 2019-10-28, and comment+README updates have
been the largest part of that.  I'll check my edits against the things you
list here, and I'll share on-list before committing.  I've now marked the CF
entry Ready for Committer.




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-05 Thread Robert Haas
On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi
 wrote:
> This is the fixed verison v22.

I'd like to offer a few thoughts on this thread and on these patches,
which is now more than 4 years old and more than 150 messages in
length.

First, I'd like to restate my understanding of the problem just to see
whether I've got the right idea and whether we're all on the same
page. When wal_level=minimal, we sometimes try to skip WAL logging on
newly-created relations in favor of fsync-ing the relation at commit
time. The idea is that if the transaction aborts or is aborted by a
crash, the contents of the relation don't need to be reproduced
because they are irrelevant, so no WAL is needed, and if the
transaction commits we can't lose any data on a crash because we've
already fsync'd, and standbys don't matter because wal_level=minimal
precludes having any. However, we're not entirely consistent about
skipping WAL-logging: some operations do and others don't, and this
causes confusion if a crash occurs, because we might try to replay
some of the things that happened to that relation but not all of them.
For example, the original poster complained about a sequence of steps
where an index truncation was logged but subsequent index insertions
were not; a badly-timed crash will replay the truncation but can't
replay the index insertions because they weren't logged in the first
place; consequently, while the state was actually OK at the beginning
of replay, it's no longer OK by the end. Replaying nothing would've
been OK, but replaying some things and not others isn't.

Second, for anyone who is not following this thread closely but is
interested in a summary, I'd like to summarize how I believe that the
current patch proposes to solve the problem. As I understand it, the
approach taken by the patch is to try to change things so that we log
nothing at all for relations created or truncated in the current
top-level transaction, and everything for others. To achieve this, the
patch makes a number of changes, three of which seem to me to be
particularly key. One, the patch changes the relcache infrastructure
with the goal of making it possible to reliably identify whether a
relation has been created or truncated in the current toplevel
transaction; our current code does have tracking for this, but it's
not 100% accurate. Two, the patch changes the definition of
RelationNeedsWAL() so that it not only checks that the relation is a
permanent one, but also that either wal_level != minimal or the
relation was not created in the current transaction. It seems to me
that if RelationNeedsWAL() is used to gate every test for whether or
not to write WAL pertaining to a particular relation, this ought to
achieve the desired behavior of logging either everything or nothing.
It is not quite clear to me how we can be sure that we use that in
every relevant place. Three, the patch replaces the various ad-hoc
bits of code which fsync relations which perform unlogged operations
on permanent relations with a new tracking mechanism that arranges to
perform all of the relevant fsync() calls at commit time. This is
further augmented with a mechanism that instead logs all the relation
pages in lieu of fsync()ing if the relation is very small, on the
theory that logging a few FPIs will be cheaper than an fsync(). I view
this additional mechanism as perhaps a bit much for a bug fix patch,
but I understand that the goal is to prevent a performance regression,
and it's not really over the top, so I think it's probably OK.
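To make the second change concrete, the revised test is roughly of the
following shape (a sketch based on the description above, not a verbatim copy
of the patch; the exact field tests may differ):

/* Hedged sketch of the revised macro in rel.h. */
#define RelationNeedsWAL(relation)										\
	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&	\
	 (XLogIsNeeded() ||													\
	  ((relation)->rd_createSubid == InvalidSubTransactionId &&		\
	   (relation)->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))

That is, a permanent relation needs WAL unless wal_level is minimal and its
relfilenode was created or first assigned in the current top-level
transaction.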

Third, I'd like to offer a couple of general comments on the state of
these patches. Broadly, I think they look pretty good. They seem quite
well-engineered to me and as far as I can see the overall approach is
sound. I think there are a number of places where the comments could
be better; I'll include a few points on that further down. I also
think that the code in swap_relation_files() which takes ExclusiveLock
on the relations looks quite strange. It's hard to understand why it's
needed at all, or why that lock level is used. On the flip side, I
think that the test suite looks really impressive and should be of
considerable help not only in making sure that this is fixed but
detecting if it gets broken again in the future. Perhaps it doesn't
cover every scenario we care about, but if that turns out to be the
case, it seems like it would be easily to further generalize. I really
like the idea of this *kind* of test framework.

Comments on comments, and other nitpicking:

- in-trasaction is mis-spelled in the doc patch. accidentially is
mis-spelled in the 0002 patch.
- I think the header comment for the new TAP test could do a far
better job explaining the overall goal of this testing than it
actually does.
- I think somewhere in relcache.c or rel.h there ought to be comments
explaining the precise degree to which rd_createSubid,
rd_newRelfilenodeSubid, and rd_firstRelfilenodeSubid are reliable,
including problem scenarios. 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-10-25 Thread Kyotaro Horiguchi
On Fri, Oct 25, 2019 at 1:13 PM Kyotaro Horiguchi
 wrote:
> Hello. Thanks for the comment.
>
> # Sorry in advance for possilbe breaking the thread.
>
> > MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
> > rd_createSubid is set; see attached test case.  It needs to skip WAL 
> > whenever
> > RelationNeedsWAL() returns false.
>
> Thanks for pointing out that. And the test patch helped me very much.
>
> Most of callers can tell that to the function, but SetHintBits()
> cannot easily. Rather I think we shouldn't even try to do
> that. Instead, In the attached, MarkBufferDirtyHint() asks storage.c
> for sync-pending state of the relfilenode for the buffer. In the
> attached patch (0003) RelFileNodeSkippingWAL loops over pendingSyncs
> but it is called only at the time FPW is added so I believe it doesn't
> affect performance so much. However, we can use hash for pendingSyncs
> instead of liked list. Anyway the change is in its own file
> v21-0003-Fix-MarkBufferDirtyHint.patch, which will be merged into
> 0002.
>
> AFAICS all XLogInsert is guarded by RelationNeedsWAL() or in the
> non-wal_minimal code paths.
>
> > Cylinder and track sizes are obsolete as user-visible concepts.  (They're 
> > not
> > onstant for a given drive, and I think modern disks provide no way to read
> > the relevant parameters.)  I like the name "wal_skip_threshold", and my 
> > second
>
> I strongly agree. Thanks for the draft. I used it as-is. I don't come
> up with an appropriate second description of the GUC so I just removed
> it.
>
> # it was "For rotating magnetic disks, it is around the size of a
> # track or sylinder."
>
> > the relevant parameters.)  I like the name "wal_skip_threshold", and
> > my second choice would be "wal_skip_min_size".  Possibly documented
> > as follows:
> ..
> > Any other opinions on the GUC name?
>
> I prefer the first candidate. I already used the terminology in
> storage.c and the name fits more to the context.
>
> > * We emit newpage WAL records for smaller size of relations.
> > *
> > * Small WAL records have a chance to be emitted at once along with
> > * other backends' WAL records. We emit WAL records instead of syncing
> > * for files that are smaller than a certain threshold expecting faster
> - * commit. The threshold is defined by the GUC effective_io_block_size.
> + * commit. The threshold is defined by the GUC wal_skip_threshold.

> It's wrong that it also skips changing flags.
> I"ll fix it soon

This is the fixed version, v22.

The attached are:

- v22-0001-TAP-test-for-copy-truncation-optimization.patch
  Same as v20, 21

- v22-0002-Fix-WAL-skipping-feature.patch
  GUC name changed. Same as v21.

- v22-0003-Fix-MarkBufferDirtyHint.patch
  PoC of fixing the function. will be merged into 0002. (New in v21,
fixed in v22)

- v21-0004-Documentation-for-wal_skip_threshold.patch
  GUC name and description changed. (Previous 0003, same as v21)

- v21-0005-Additional-test-for-new-GUC-setting.patch
  including adjusted version of wal-optimize-noah-tests-v3.patch
  Maybe test names need further adjustment. (Previous 0004, same as v21)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


v22-0004-Documentation-for-wal_skip_threshold.patch
Description: Binary data


v22-0001-TAP-test-for-copy-truncation-optimization.patch
Description: Binary data


v22-0002-Fix-WAL-skipping-feature.patch
Description: Binary data


v22-0005-Additional-test-for-new-GUC-setting.patch
Description: Binary data


v22-0003-Fix-MarkBufferDirtyHint.patch
Description: Binary data


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-10-24 Thread Kyotaro Horiguchi
Ugh!

On Fri, Oct 25, 2019 at 13:13, Kyotaro Horiguchi wrote:

> that. Instead, In the attached, MarkBufferDirtyHint() asks storage.c
> for sync-pending state of the relfilenode for the buffer. In the
> attached patch (0003)
> regards.
>

It's wrong that it also skips changing flags.
I'll fix it soon.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-10-24 Thread Kyotaro Horiguchi
Hello. Thanks for the comment.

# Sorry in advance for possibly breaking the thread.

> MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
> rd_createSubid is set; see attached test case.  It needs to skip WAL whenever
> RelationNeedsWAL() returns false.

Thanks for pointing out that. And the test patch helped me very much.

Most callers can tell that to the function, but SetHintBits()
cannot easily. Rather, I think we shouldn't even try to do
that. Instead, in the attached, MarkBufferDirtyHint() asks storage.c
for the sync-pending state of the buffer's relfilenode. In the
attached patch (0003) RelFileNodeSkippingWAL loops over pendingSyncs,
but it is called only at the time an FPW is added, so I believe it doesn't
affect performance much. However, we could use a hash for pendingSyncs
instead of a linked list. Anyway, the change is in its own file,
v21-0003-Fix-MarkBufferDirtyHint.patch, which will be merged into
0002.
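A minimal sketch of what that storage.c helper could look like, assuming a
backend-local list of pending syncs (the struct and list names here are
illustrative, not necessarily those of the patch):

typedef struct PendingRelSync
{
	RelFileNode rnode;				/* relation file scheduled for sync */
	struct PendingRelSync *next;
} PendingRelSync;

static PendingRelSync *pendingSyncs = NULL;

/*
 * Return true if WAL is currently being skipped for this relfilenode, i.e.
 * the file will be fsync'd (or WAL-logged wholesale) at commit.  Callers
 * such as MarkBufferDirtyHint() can then avoid emitting FPIs for its pages.
 */
bool
RelFileNodeSkippingWAL(RelFileNode rnode)
{
	PendingRelSync *sync;

	for (sync = pendingSyncs; sync != NULL; sync = sync->next)
		if (RelFileNodeEquals(sync->rnode, rnode))
			return true;

	return false;
}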

AFAICS every XLogInsert is either guarded by RelationNeedsWAL() or in
code paths not taken under wal_level = minimal.

> Cylinder and track sizes are obsolete as user-visible concepts.  (They're not
> onstant for a given drive, and I think modern disks provide no way to read
> the relevant parameters.)  I like the name "wal_skip_threshold", and my second

I strongly agree. Thanks for the draft. I used it as-is. I couldn't come
up with an appropriate second description of the GUC, so I just removed
it.

# it was "For rotating magnetic disks, it is around the size of a
# track or sylinder."

> the relevant parameters.)  I like the name "wal_skip_threshold", and
> my second choice would be "wal_skip_min_size".  Possibly documented
> as follows:
..
> Any other opinions on the GUC name?

I prefer the first candidate. I already used that terminology in
storage.c, and the name fits the context better.

> * We emit newpage WAL records for smaller size of relations.
> *
> * Small WAL records have a chance to be emitted at once along with
> * other backends' WAL records. We emit WAL records instead of syncing
> * for files that are smaller than a certain threshold expecting faster
- * commit. The threshold is defined by the GUC effective_io_block_size.
+ * commit. The threshold is defined by the GUC wal_skip_threshold.

The attached are:

- v21-0001-TAP-test-for-copy-truncation-optimization.patch
  same as v20

- v21-0002-Fix-WAL-skipping-feature.patch
  GUC name changed.

- v21-0003-Fix-MarkBufferDirtyHint.patch
  PoC of fixing the function. will be merged into 0002. (New)

- v21-0004-Documentation-for-wal_skip_threshold.patch
  GUC name and description changed. (Previous 0003)

- v21-0005-Additional-test-for-new-GUC-setting.patch
  including adjusted version of wal-optimize-noah-tests-v3.patch
  Maybe test names need further adjustment. (Previous 0004)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 34149545942480d8dcc1cc587f40091b19b5aa39 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH v21 1/5] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 312 
 1 file changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 00..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+	my($node, $test_name) = @_;
+
+	my $db_oid = $node->safe_psql('postgres',
+	   "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+	my $prefix = "base/$db_oid/";
+	my $filepaths_referenced = $node->safe_psql('postgres', "
+	   SELECT pg_relation_filepath(oid) FROM pg_class
+	   WHERE reltablespace = 0 and relpersistence <> 't' and
+	   pg_relation_filepath(oid) IS NOT NULL;");
+	is_deeply([sort(map { "$prefix$_" }
+	grep(/^[0-9]+$/,
+		 slurp_dir($node->data_dir . "/$prefix")))],
+			  [sort split /\n/, $filepaths_referenced],
+			  $test_name);
+	return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+	my $wal_level = shift;
+
+	# Primary needs to have wal_level = minimal here
+	my $node = get_new_node("node_$wal_level");
+	$node->init;
+	$node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+	$node->start;
+
+	# Setup
+	my $tablespace_dir = 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-09-10 Thread Noah Misch
[Casual readers with opinions on GUC naming: consider skipping to the end.]

MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
rd_createSubid is set; see attached test case.  It needs to skip WAL whenever
RelationNeedsWAL() returns false.

On Tue, Aug 27, 2019 at 03:49:32PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch  wrote in 
> <20190826050843.gb3153...@rfd.leadboat.com>
> > Consider a one-page relfilenode.  Doing all the things you list for a single
> > page may be cheaper than locking millions of buffer headers.
> 
> If I understand you correctly, I would say that *all* buffers
> that don't belong to in-transaction-created files are skipped
> before taking locks. No lock conflict happens with other
> backends.
> 
> FlushRelationBuffers uses double-checked-locking as follows:

I had misread the code; you're right.

> > This should be GUC-controlled, especially since this is back-patch material.
> 
> Is this size of patch back-patchable?

Its size is not an obstacle.  It's not ideal to back-patch such a user-visible
performance change, but it would be worse to leave back branches able to
corrupt data during recovery.

On Wed, Aug 28, 2019 at 03:42:10PM +0900, Kyotaro Horiguchi wrote:
> - Use log_newpage instead of fsync for small tables.

> I'm trying to measure performance difference on WAL/fsync.

I would measure it with simultaneous pgbench instances:

1. DDL pgbench instance repeatedly creates and drops a table of X kilobytes,
   using --rate to make this happen a fixed number of times per second.
2. Regular pgbench instance runs the built-in script at maximum qps.

For each X, try one test run with effective_io_block_size = X-1 and one with
effective_io_block_size = X.  If the regular pgbench instance gets materially
higher qps with effective_io_block_size = X-1, the ideal default is =X.

> +  xreflabel="effective_io_block_size">
> +  effective_io_block_size (integer)
> +  
> +   effective_io_block_size configuration 
> parameter
> +  
> +  
> +  
> +   
> +Specifies the expected maximum size of a file for which 
> fsync returns in the minimum required duration. It is 
> approximately the size of a track or sylinder for magnetic disks.
> +The value is specified in kilobytes and the default is 
> 64 kilobytes.
> +   
> +   
> +When  is minimal,
> +WAL-logging is skipped for tables created in-trasaction.  If a table
> +is smaller than that size at commit, it is WAL-logged instead of
> +issueing fsync on it.
> +
> +   
> +  
> + 

Cylinder and track sizes are obsolete as user-visible concepts.  (They're not
constant for a given drive, and I think modern disks provide no way to read
the relevant parameters.)  I like the name "wal_skip_threshold", and my second
choice would be "wal_skip_min_size".  Possibly documented as follows:

  When wal_level is minimal and a transaction commits after creating or
  rewriting a permanent table, materialized view, or index, this setting
  determines how to persist the new data.  If the data is smaller than this
  setting, write it to the WAL log; otherwise, use an fsync of the data file.
  Depending on the properties of your storage, raising or lowering this value
  might help if such commits are slowing concurrent transactions.  The default
  is 64 kilobytes (64kB).

Any other opinions on the GUC name?
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index 95063ab..5d476a4 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -11,7 +11,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 28;
+use Test::More tests => 32;
 
 sub check_orphan_relfilenodes
 {
@@ -43,6 +43,8 @@ sub run_wal_optimize
 	$node->append_conf('postgresql.conf', qq(
 wal_level = $wal_level
 max_prepared_transactions = 1
+wal_log_hints = on
+effective_io_block_size = 0
 ));
 	$node->start;
 
@@ -194,6 +196,24 @@ max_prepared_transactions = 1
 	is($result, qq(3),
 	   "wal_level = $wal_level, SET TABLESPACE in subtransaction");
 
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3a5 (c int PRIMARY KEY);
+		SAVEPOINT q; INSERT INTO test3a5 VALUES (1); ROLLBACK TO q;
+		CHECKPOINT;
+		INSERT INTO test3a5 VALUES (1);  -- set index hint bit
+		INSERT INTO test3a5 VALUES (2);
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->psql('postgres', );
+	my($ret, $stdout, $stderr) = $node->psql(
+		'postgres', "INSERT INTO test3a5 VALUES (2);");
+	is($ret, qq(3),
+	   "wal_level = $wal_level, unique index LP_DEAD");
+	like($stderr, qr/violates unique/,
+	   "wal_level = $wal_level, unique index LP_DEAD message");
+
 	# UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
 	$node->safe_psql('postgres', "
 		BEGIN;


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-09-02 Thread Noah Misch
On Mon, Sep 02, 2019 at 05:15:00PM -0400, Alvaro Herrera wrote:
> I have updated this patch's status to "needs review", since v20 has not
> received any comments yet.
> 
> Noah, you're listed as committer for this patch.  Are you still on the
> hook for getting it done during the v13 timeframe?

Yes, assuming "getting it done" = "getting the CF entry to state other than
Needs Review".




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-09-02 Thread Alvaro Herrera
I have updated this patch's status to "needs review", since v20 has not
received any comments yet.

Noah, you're listed as committer for this patch.  Are you still on the
hook for getting it done during the v13 timeframe?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-28 Thread Kyotaro Horiguchi
Hello, Noah.

At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190827.154932.250364935.horikyota@gmail.com>
> I'm not sure whether the knob shows apparent performance gain and
> whether we can offer the criteria to identify the proper
> value. But I'll add this feature with a GUC
> effective_io_block_size defaults to 64kB as the threshold in the
> next version. (The name and default value are arguable, of course.)

This is a new version of the patch based on the discussion.

The differences from v19 are as follows.

- Removed the new stuff in two-phase.c.

  The action on PREPARE TRANSACTION is now taken in
  PrepareTransaction(). Instead of storing pending syncs in
  two-phase files, the function immediately syncs all files that
  can survive the transaction end. (twophase.c, xact.c)

- Separate pendingSyncs from pendingDeletes.

  pendingSyncs gets handled differently from pendingDeletes so it
  is separated.

- Let smgrDoPendingSyncs() avoid performing fsync on
  to-be-deleted files.

  In previous versions the function synced all recorded files even
  if they were being deleted.  Since we now use WAL-logging as the
  alternative to fsync, performance is of more significance
  than before.  Thus this version avoids useless fsyncs.

- Use log_newpage instead of fsync for small tables.

  As in the discussion up-thread, I understand that WAL-logging
  works better than fsync here.  smgrDoPendingSync issues
  log_newpage for all blocks of a table smaller than the GUC
  variable "effective_io_block_size" (see the sketch after this
  list).  I found log_newpage_range(), which does exactly what is
  needed here, but it requires a Relation, which is not available
  there.  I removed an assertion in CreateFakeRelcacheEntry so that
  it also works in non-recovery mode.

- Rebased and fixed some bugs.
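As referenced in the log_newpage item above, the per-relation decision could
look roughly like the sketch below.  It assumes the GUC is expressed in
kilobytes and that dirty shared buffers for the relation have already been
flushed before the fsync path is taken; the function name is invented for
illustration.

extern int	effective_io_block_size;	/* GUC in kB, as proposed here */

static void
sync_or_log_one_rel(RelFileNode rnode)
{
	SMgrRelation srel = smgropen(rnode, InvalidBackendId);
	BlockNumber total_blocks = 0;
	ForkNumber	fork;

	for (fork = 0; fork <= MAX_FORKNUM; fork++)
		if (smgrexists(srel, fork))
			total_blocks += smgrnblocks(srel, fork);

	if ((uint64) total_blocks * BLCKSZ / 1024 < (uint64) effective_io_block_size)
	{
		/* Small relation: emit full-page WAL records instead of syncing. */
		Relation	rel = CreateFakeRelcacheEntry(rnode);

		for (fork = 0; fork <= MAX_FORKNUM; fork++)
			if (smgrexists(srel, fork))
				log_newpage_range(rel, fork, 0, smgrnblocks(srel, fork), false);
		FreeFakeRelcacheEntry(rel);
	}
	else
	{
		/*
		 * Large relation: dirty shared buffers must have been flushed
		 * already (FlushRelationBuffers or an equivalent); then sync each
		 * existing fork to disk.
		 */
		for (fork = 0; fork <= MAX_FORKNUM; fork++)
			if (smgrexists(srel, fork))
				smgrimmedsync(srel, fork);
	}
}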

I'm trying to measure the performance difference between WAL and fsync.


By the way, smgrDoPendingDeletes() is called from CommitTransaction
and AbortTransaction directly, and from AbortSubTransaction via
AtSubAbort_smgr(), which calls only smgrDoPendingDeletes() and is
called only from AbortSubTransaction. I think these should be
unified one way or the other.  Any opinions?

CommitTransaction()
  + smgrDoPendingDeletes()

AbortTransaction()
  + smgrDoPendingDeletes()

AbortSubTransaction()
  AtSubAbort_smgr()
   + smgrDoPendingDeletes()

# Looking around, the prefixes AtEOXact/PreCommit/AtAbort don't
# seem to be used according to any consistent principle.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 83deb772808cdd3afdb44a7630656cc827adfe33 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 312 
 1 file changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 00..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+	my($node, $test_name) = @_;
+
+	my $db_oid = $node->safe_psql('postgres',
+	   "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+	my $prefix = "base/$db_oid/";
+	my $filepaths_referenced = $node->safe_psql('postgres', "
+	   SELECT pg_relation_filepath(oid) FROM pg_class
+	   WHERE reltablespace = 0 and relpersistence <> 't' and
+	   pg_relation_filepath(oid) IS NOT NULL;");
+	is_deeply([sort(map { "$prefix$_" }
+	grep(/^[0-9]+$/,
+		 slurp_dir($node->data_dir . "/$prefix")))],
+			  [sort split /\n/, $filepaths_referenced],
+			  $test_name);
+	return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+	my $wal_level = shift;
+
+	# Primary needs to have wal_level = minimal here
+	my $node = get_new_node("node_$wal_level");
+	$node->init;
+	$node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+	$node->start;
+
+	# Setup
+	my $tablespace_dir = $node->basedir . '/tablespace_other';
+	mkdir ($tablespace_dir);
+	$tablespace_dir = TestLib::perl2host($tablespace_dir);
+	$node->safe_psql('postgres',
+	   "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+	# Test direct truncation optimization.  No tuples
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test1 (id serial 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-27 Thread Kyotaro Horiguchi
Hello.

At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch  wrote in 
<20190826050843.gb3153...@rfd.leadboat.com>
noah> On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
noah> > At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch  
wrote in <20190820060314.ga3086...@rfd.leadboat.com>
> > > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > > I'm not sure the point of the behavior. I suppose that the "log"
> > > > is a sequence of new_page records. It also needs to be synced and
> > > > it is always larger than the file to be synced. I can't think of
> > > > an appropriate threshold without the point.
> > > 
> > > Yes, it would be a sequence of new-page records.  FlushRelationBuffers() 
> > > locks
> > > every buffer header containing a buffer of the current database.  The 
> > > belief
> > > has been that writing one page to xlog is cheaper than 
> > > FlushRelationBuffers()
> > > in a busy system with large shared_buffers.
> > 
> > I'm at a loss.. The decision between WAL and sync is made at
> > commit time, when we no longer have a pin on a buffer. When
> > emitting WAL, opposite to the assumption, lock needs to be
> > re-acquired for every page to emit log_new_page. What is worse,
> > we may need to reload evicted buffers.  If the file has been
> > CopyFrom'ed, ring buffer strategy makes the situnation farther
> > worse. That doesn't seem cheap at all..
> 
> Consider a one-page relfilenode.  Doing all the things you list for a single
> page may be cheaper than locking millions of buffer headers.

If I understand you correctly, I would say that *all* buffers
that don't belong to in-transaction-created files are skipped
before taking locks. No lock conflict happens with other
backends.

FlushRelationBuffers uses double-checked locking as follows:

FlushRelationBuffers_common():
..
  if (!islocal) {
    for (i for all buffers) {
      if (RelFileNodeEquals(bufHdr->tag.rnode, rnode)) {
        LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && valid && dirty) {
          PinBuffer_Locked(bufHdr);
          LWLockAcquire();
          FlushBuffer();

128GB of shared buffers contains 16M buffers. On my
perhaps-Windows-Vista-era box, such a loop takes 15ms. (Since the box
has only 6GB, the test ignores the cache effect that comes
from the difference in shared_buffers size.) (attached 1)

With WAL-emitting we find every buffer of the file using the buffer
hash, so we suffer partition locks instead of the 15ms of local
latency. That seems worse.
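
For reference, the per-page work implied on the WAL-emitting side is a
buffer-mapping hash probe under a partition lock, roughly like the
following (a sketch only; rnode and blkno stand for the target file and
block, and the calls are the existing buffer-manager ones):

    BufferTag   tag;
    uint32      hash;
    LWLock     *partition_lock;
    int         buf_id;

    INIT_BUFFERTAG(tag, rnode, MAIN_FORKNUM, blkno);
    hash = BufTableHashCode(&tag);
    partition_lock = BufMappingPartitionLock(hash);

    LWLockAcquire(partition_lock, LW_SHARED);
    buf_id = BufTableLookup(&tag, hash);  /* -1 when the page is not cached */
    LWLockRelease(partition_lock);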

> > If there were any chance on WAL for smaller files here, it would
> > be on the files smaller than the ring size of bulk-write
> > strategy(16MB).
> 
> Like you, I expect the optimal threshold is less than 16MB, though you should
> benchmark to see.  Under the ideal threshold, when a transaction creates a new
> relfilenode just smaller than the threshold, that transaction will be somewhat
> slower than it would be if the threshold were zero.  Locking every buffer

I looked closer at this.

For a 16MB file, the cost of write-fsyncing is almost the
same as that of WAL-emitting. It was about 200 ms on the
Vista-era machine with non-performant rotating magnetic disks
and xfs. (attached 2, 3) Although write-fsyncing of the relation
file causes no lock conflict with other backends, WAL-emitting
delays other backends' commits by at most that many milliseconds.


In summary, the characteristics of the two methods on a 16MB file
are as follows.

File write:
 - 15ms of buffer scan without locks (@128GB shared buffer)

 + no hash search for a buffer

 = take locks on all buffers only of the file one by one (to write)

 + plus 200ms of write-fdatasync (of whole the relation file),
which doesn't conflict with other backends. (except via CPU
time slots and IO bandwidth.)

WAL write : 
 + no buffer scan

 - 2048 times (16M/8k) of partition lock on finding every buffer
   for the target file, which can conflict with other backends.

 = take locks on all buffers only of the file one by one (to take FPW)

 - plus 200ms of open(create)-write-fdatasync (of a WAL file (of
   default size)), which can delay commits on other backends at
   most by that duration.

> header causes a distributed slow-down for other queries, and protecting the
> latency of non-DDL queries is typically more useful than accelerating
> TRUNCATE, CREATE TABLE, etc.  Writing more WAL also slows down other queries;
> beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
> than the buffer scan harms them.  That's about where the threshold should be.

If the discussion above is correct, we shouldn't use WAL-write
even for files around 16MB. For smaller shared_buffers and file
size, the delays are:

Scan all buffers takes:
  15  ms for 128GB shared_buffers
   4.5ms for 32GB shared_buffers

fdatasync takes:
  200 ms for  16MB/sync
   51 ms for   1MB/sync
   46 ms for 512kB/sync
   40 ms for 256kB/sync
   37 ms for 128kB/sync
   35 ms for 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-27 Thread Kyotaro Horiguchi
At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190827.154932.250364935.horikyota@gmail.com>
> 128GB shared buffers contain 16M buffers. On my
> perhaps-Windows-Vista-era box, such loop takes 15ms. (Since it
> has only 6GB, the test is ignoring the effect of cache that comes
> from the difference of the buffer size). (attached 1)
...
> For a 16MB file, the cost of write-fsyncing cost is almost the
> same to that of WAL-emitting cost. It was about 200 ms on the
> Vista-era machine with non-performant rotating magnetic disks
> with xfs. (attached 2, 3) Although write-fsyncing of relation
> file makes no lock conflict with other backends, WAL-emitting
> delays other backends' commits at most by that milliseconds.

FWIW, the attached are the programs I used to take the numbers.

testloop.c: times the loop over buffers in FlushRelationBuffers

testfile.c: times syncing a heap file. (i.e., one file of the given size)

testfile2.c: times emitting a WAL record. (i.e., 16MB per WAL file)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

typedef struct RelFileNode
{
  unsigned int spc;
  unsigned int db;
  unsigned int rel;
} RelFileNode;

typedef struct Buffer
{
  RelFileNode rnode;
} Buffer;

//#define NBUFFERS ((int)((128.0 * 1024 * 1024 * 1024) / (8.0 * 1024)))
#define NBUFFERS ((int)((32.0 * 1024 * 1024 * 1024) / (8.0 * 1024)))
int main(void) {
  int i;
  RelFileNode t = {1,2,3};
  Buffer *bufs = (Buffer *) malloc(sizeof(Buffer) * NBUFFERS);
  struct timeval st, ed;
  int matches = 0, unmatches = 0;
  Buffer *b;

  for (i = 0 ; i < NBUFFERS ; i++) {
    bufs[i].rnode.spc = random() * 100;
    bufs[i].rnode.db = random() * 100;
    bufs[i].rnode.rel = random() * 1;
  }

  /* start measuring */
  gettimeofday(&st, NULL);

  b = bufs;
  for (i = 0 ; i < NBUFFERS ; i++) {
    if (b->rnode.spc == t.spc && b->rnode.db == t.db && b->rnode.rel == t.rel)
      matches++;
    else
      unmatches++;

    b++;
  }
  gettimeofday(&ed, NULL);

  printf("%lf ms for %d loops, matches %d, unmatches %d\n",
 (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
  (ed.tv_usec - st.tv_usec) / 1000.0),
 i, matches, unmatches);

  return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

//#define FILE_SIZE (16 * 1024 * 1024)
//#define LOOPS 100

#define FILE_SIZE (64 * 1024)
#define LOOPS 1000

//#define FILE_SIZE (8 * 1024)
//#define LOOPS 1000

//#define FILE_SIZE (1 * 1024)
//#define LOOPS 1000

//#define FILE_SIZE (512)
//#define LOOPS 1000

//#define FILE_SIZE (128)
//#define LOOPS 1000

char buf[FILE_SIZE];
char fname[256];

int main(void) {
  int i, j;
  int fd = -1;
  struct timeval st, ed;
  double accum = 0.0;
  int bufperfile = (int)((16.0 * 1024 * 1024) / FILE_SIZE);

  for (i = 0 ; i < LOOPS ; i++) {
    snprintf(fname, 256, "test%03d.file", i);
    unlink(fname); // ignore errors
  }

  for (i = 0 ; i < LOOPS ; i++) {
    for (j = 0 ; j < FILE_SIZE ; j++)
      buf[j] = random()* 256;

    if (i % bufperfile == 0) {
      if (fd >= 0)
        close(fd);

      snprintf(fname, 256, "test%03d.file", i / bufperfile);
      fd = open(fname, O_CREAT | O_RDWR, 0644);
      if (fd < 0) {
        fprintf(stderr, "open error: %m\n");
        exit(1);
      }
      memset(buf, 0, sizeof(buf));
      if (write(fd, buf, sizeof(buf)) < 0) {
        fprintf(stderr, "init write error: %m\n");
        exit(1);
      }
      if (fsync(fd) < 0) {
        fprintf(stderr, "init fsync error: %m\n");
        exit(1);
      }
      if (lseek(fd, 0, SEEK_SET) < 0) {
        fprintf(stderr, "init lseek error: %m\n");
        exit(1);
      }
    }

    gettimeofday(&st, NULL);
    if (write(fd, buf, FILE_SIZE) < 0) {
      fprintf(stderr, "write error: %m\n");
      exit(1);
    }
    if (fdatasync(fd) < 0) {
      fprintf(stderr, "fdatasync error: %m\n");
      exit(1);
    }
    gettimeofday(&ed, NULL);

    accum += (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
                      (ed.tv_usec - st.tv_usec) / 1000.0);
  }

  printf("%.2lf ms for %d %dkB-records (%d MB), %.2lf ms per %dkB)\n",
         accum, i, FILE_SIZE / 1024, i * FILE_SIZE, accum / i,
         FILE_SIZE / 1024);

  return 0;
}

#include 
#include 
#include 
#include 
#include 

//#define FILE_SIZE (16 * 1024 * 1024)
//#define LOOPS 100

//#define FILE_SIZE (8 * 1024)
//#define LOOPS 1000

//#define FILE_SIZE (1 * 1024)
//#define LOOPS 1000

//#define FILE_SIZE (512)
//#define LOOPS 1000

#define FILE_SIZE (128)
#define LOOPS 1000

char buf[FILE_SIZE];

int main(void) {
  int i;
  int fd = -1;
  double accum = 0.0;
  struct timeval st, ed;

  for (i = 0 ; i < 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-25 Thread Noah Misch
On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch  wrote in 
> <20190820060314.ga3086...@rfd.leadboat.com>
> > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote 
> > > in <20190818035230.gb3021...@rfd.leadboat.com>
> > > > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another 
> > > > component
> > > > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the 
> > > > relation,
> > > > or if it's smaller than some threshold, WAL-log the contents of the 
> > > > whole file
> > > > at that point."  Please write the part to WAL-log the contents of small 
> > > > files
> > > > instead of syncing them.
> > > 
> > > I'm not sure the point of the behavior. I suppose that the "log"
> > > is a sequence of new_page records. It also needs to be synced and
> > > it is always larger than the file to be synced. I can't think of
> > > an appropriate threshold without the point.
> > 
> > Yes, it would be a sequence of new-page records.  FlushRelationBuffers() 
> > locks
> > every buffer header containing a buffer of the current database.  The belief
> > has been that writing one page to xlog is cheaper than 
> > FlushRelationBuffers()
> > in a busy system with large shared_buffers.
> 
> I'm at a loss.. The decision between WAL and sync is made at
> commit time, when we no longer have a pin on a buffer. When
> emitting WAL, opposite to the assumption, lock needs to be
> re-acquired for every page to emit log_new_page. What is worse,
> we may need to reload evicted buffers.  If the file has been
> CopyFrom'ed, ring buffer strategy makes the situnation farther
> worse. That doesn't seem cheap at all..

Consider a one-page relfilenode.  Doing all the things you list for a single
page may be cheaper than locking millions of buffer headers.

> If there were any chance on WAL for smaller files here, it would
> be on the files smaller than the ring size of bulk-write
> strategy(16MB).

Like you, I expect the optimal threshold is less than 16MB, though you should
benchmark to see.  Under the ideal threshold, when a transaction creates a new
relfilenode just smaller than the threshold, that transaction will be somewhat
slower than it would be if the threshold were zero.  Locking every buffer
header causes a distributed slow-down for other queries, and protecting the
latency of non-DDL queries is typically more useful than accelerating
TRUNCATE, CREATE TABLE, etc.  Writing more WAL also slows down other queries;
beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
than the buffer scan harms them.  That's about where the threshold should be.

This should be GUC-controlled, especially since this is back-patch material.
We won't necessarily pick the best value on the first attempt, and the best
value could depend on factors like the filesystem, the storage hardware, and
the database's latency goals.  One could define the GUC as an absolute size
(e.g. 1MB) or as a ratio of shared_buffers (e.g. GUC value of 0.001 means the
threshold is 1MB when shared_buffers is 1GB).  I'm not sure which is better.
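
For what it's worth, an absolute-size knob of that kind would be a small
addition to guc.c; a sketch is below.  The name effective_io_block_size is
the one proposed elsewhere in this thread, and the default, limits, and
wording here are placeholders rather than a decided design.

int     effective_io_block_size = 1024;     /* in kB; placeholder default */

/* entry for guc.c's ConfigureNamesInt[] (sketch) */
{
    {"effective_io_block_size", PGC_USERSET, WAL_SETTINGS,
        gettext_noop("Size of new files below which their contents are "
                     "WAL-logged at commit instead of being fsync'ed."),
        NULL,
        GUC_UNIT_KB
    },
    &effective_io_block_size,
    1024, 0, MAX_KILOBYTES,
    NULL, NULL, NULL
},

A ratio-of-shared_buffers variant would instead be an entry in
ConfigureNamesReal[].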




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-22 Thread Kyotaro Horiguchi
Hello.

At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch  wrote in 
<20190820060314.ga3086...@rfd.leadboat.com>
> On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote in 
> > <20190818035230.gb3021...@rfd.leadboat.com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> > 
> > Now TwoPhaseFileHeader has two new members for (commit-time)
> > pending syncs. Pending-syncs are useless on wal-replay, but that
> > is needed for commit-prepared.
> 
> There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
> command, which is far too late to be syncing new relation files.  (A crash may
> have already destroyed their data.)  PrepareTransaction(), which implements
> the PREPARE TRANSACTION command, is the right place for these syncs.
> 
> A failure in these new syncs needs to prevent the transaction from being
> marked committed.  Hence, in CommitTransaction(), these new syncs need to

Agreed.

> happen after the last step that could create assign a new relfilenode and
> before RecordTransactionCommit().  I suspect it's best to do it after
> PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

I don't find an obvious problem there. Since pending deletes and
pending syncs are separately processed, I'm planning to make a
separate list for syncs from deletes.

> > > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL 
> > > for
> > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There 
> > > may be
> > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() 
> > > return
> > > false due to this code, use RelationNeedsWAL() for multiple forks, and 
> > > then
> > > not actually sync all forks.
> > 
> > I agree that all forks needs syncing, but FSM and VM are checking
> > RelationNeedsWAL(modified). To make sure, are you suggesting to
> > sync all forks instead of emitting WAL for them, or suggesting
> > that VM and FSM to emit WALs even when the modified
> > RelationNeedsWAL returns false (+ sync all forks)?
> 
> I hadn't thought that far.  What do you think is best?

As in the latest patch, sync ALL forks and emit no WAL. We could
skip syncing the FSM but I'm not sure it's worth doing.


> > > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another 
> > > component
> > > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the 
> > > relation,
> > > or if it's smaller than some threshold, WAL-log the contents of the whole 
> > > file
> > > at that point."  Please write the part to WAL-log the contents of small 
> > > files
> > > instead of syncing them.
> > 
> > I'm not sure the point of the behavior. I suppose that the "log"
> > is a sequence of new_page records. It also needs to be synced and
> > it is always larger than the file to be synced. I can't think of
> > an appropriate threshold without the point.
> 
> Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
> every buffer header containing a buffer of the current database.  The belief
> has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> in a busy system with large shared_buffers.

I'm at a loss.. The decision between WAL and sync is made at
commit time, when we no longer have a pin on a buffer. When
emitting WAL, contrary to the assumption, the lock needs to be
re-acquired for every page to emit log_newpage. What is worse,
we may need to reload evicted buffers.  If the file has been
CopyFrom'ed, the ring buffer strategy makes the situation far
worse. That doesn't seem cheap at all..

If there were any chance on WAL for smaller files here, it would
be on the files smaller than the ring size of bulk-write
strategy(16MB).

If we pick up every buffer page of the file instead of scanning
through all buffers, that makes things worse by conflicts on
partition locks.

Any thoughts?



# Sorry time's up today.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-21 Thread Noah Misch
On Wed, Aug 21, 2019 at 04:32:38PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
>  wrote in 
> <20190819.185959.118543656.horikyota@gmail.com>
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote in 
> > <20190818035230.gb3021...@rfd.leadboat.com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
>  
> Now TwoPhaseFileHeader has two new members for pending syncs. It
> is useless on wal-replay, but that is needed for commit-prepared.

Syncs need to happen in PrepareTransaction(), not in commit-prepared.  I wrote
about that in https://postgr.es/m/20190820060314.ga3086...@rfd.leadboat.com




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-21 Thread Kyotaro Horiguchi
Hello. New version is attached.

At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190819.185959.118543656.horikyota@gmail.com>
> Thank you for taking time.
> 
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote in 
> <20190818035230.gb3021...@rfd.leadboat.com>
> > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
 
Now TwoPhaseFileHeader has two new members for pending syncs. They
are useless during WAL replay, but are needed for commit-prepared.

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/access/heap/heapam_handler.c
> > > +++ b/src/backend/access/heap/heapam_handler.c
> > > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
> > > Relation NewHeap, 
> ...
> > > - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> > > -
> > >   /* use_wal off requires smgr_targblock be initially invalid */
> > >   Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
> > 
> > Since you're deleting the use_wal variable, update that last comment.

Oops! Rewrote it.

> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
> ...
> > > + smgrimmedsync(srel, MAIN_FORKNUM);
> > 
> > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may 
> > be
> > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > not actually sync all forks.
> 
> I agree that all forks needs syncing, but FSM and VM are checking
> RelationNeedsWAL(modified). To make sure, are you suggesting to
> sync all forks instead of emitting WAL for them, or suggesting
> that VM and FSM to emit WALs even when the modified
> RelationNeedsWAL returns false (+ sync all forks)?

All forks are synced and have no WALs emitted (as before) in the
attached version 19. FSM and VM are not changed.

> > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component
> > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the 
> > relation,
> > or if it's smaller than some threshold, WAL-log the contents of the whole 
> > file
> > at that point."  Please write the part to WAL-log the contents of small 
> > files
> > instead of syncing them.
> 
> I'm not sure the point of the behavior. I suppose that the "log"
> is a sequence of new_page records. It also needs to be synced and
> it is always larger than the file to be synced. I can't think of
> an appropriate threshold without the point.

This is not included in this version. I'll continue to consider
this.

> > > --- a/src/backend/commands/copy.c
> > > +++ b/src/backend/commands/copy.c
> > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> > >* If it does commit, we'll have done the table_finish_bulk_insert() at
> > >* the bottom of this routine first.
> > >*
> > > -  * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > > -  * is not always set correctly, since in rare cases 
> > > rd_newRelfilenodeSubid
> > > -  * can be cleared before the end of the transaction. The exact case is
> > > -  * when a relation sets a new relfilenode twice in same transaction, yet
> > > -  * the second one fails in an aborted subtransaction, e.g.
> > > -  *
> > > -  * BEGIN;
> > > -  * TRUNCATE t;
> > > -  * SAVEPOINT save;
> > > -  * TRUNCATE t;
> > > -  * ROLLBACK TO save;
> > > -  * COPY ...
> > 
> > The comment material being deleted is still correct, so don't delete it.

The code is changed to use rd_firstRelfilenodeSubid instead of
rd_newRelfilenodeSubid, which has the issue mentioned in the
deleted section. So the comment is still right, but it is irrelevant
to the code here. The same thing is written in the comment in RelationData.

(In short, not reverted)

> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
..
> > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and 
> > audit
> > all the lines printed.  Many bits of code need to look at all three,
> > e.g. RelationClose().

I forgot to maintain rd_firstRelfilenodeSubid in many places, and the
assertion failure no longer happens after I fixed that. Contrary to
my previous mail, useless pending entries are of course removed
at subtransaction abort, and no needless syncs happen in that
sense. But another type of useless sync was seen with the
previous version 18.

(In short fixed.)


> >  This field needs to be 100% reliable.  In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-20 Thread Kyotaro Horiguchi
Hello.

At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190819.185959.118543656.horikyota@gmail.com>
> > The comment material being deleted is still correct, so don't delete it.
> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
> 
> (Un?)Fortunately, that doesn't fail.. (with rebased version on
> the recent master) I'll recheck that tomorrow.

I saw the assertion failure.  It's part of the intended
behavior. In this patch, the relcache doesn't hold the whole history
of relfilenodes, so we cannot remove useless pending syncs
perfectly. On the other hand, they are harmless except that they
cause an extra sync of files that are removed immediately. So I
chose that once-registered pending syncs are not removed.

If we want consistency here, we need to record the creator subxid in
the PendingRelOps (PendingRelDelete) struct and do rather large work at
subtransaction end.

> > > --- a/src/include/utils/rel.h
> > > +++ b/src/include/utils/rel.h
> > > @@ -74,11 +74,13 @@ typedef struct RelationData
> > >   SubTransactionId rd_createSubid;/* rel was created in current 
> > > xact */
> > >   SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode 
> > > assigned in
> > >   
> > >  * current xact */
> > > + SubTransactionId rd_firstRelfilenodeSubid;  /* new relfilenode 
> > > assigned
> > > + 
> > >  * first in current xact */
> > 
> > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and 
> > audit
> > all the lines printed.  Many bits of code need to look at all three,
> > e.g. RelationClose().
> 
> Agreed. I'll recheck that.
> 
> >  This field needs to be 100% reliable.  In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the relfilenode that would be in place if the top transaction rolled back.
> 
> I don't get this. I think the variable moves as you suggested. It
> is handled same way with fd_new* in AtEOSubXact_cleanup but the
> difference is in assignment but rollback. rd_fist* won't change
> after the first assignment so rollback of the subid means
> relfilenode is also rolled back to the initial value at the
> beginning of the top transaction.

So I'll add this in the next version to see how it looks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-20 Thread Noah Misch
On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote in 
> <20190818035230.gb3021...@rfd.leadboat.com>
> > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> 
> Now TwoPhaseFileHeader has two new members for (commit-time)
> pending syncs. Pending-syncs are useless on wal-replay, but that
> is needed for commit-prepared.

There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
command, which is far too late to be syncing new relation files.  (A crash may
have already destroyed their data.)  PrepareTransaction(), which implements
the PREPARE TRANSACTION command, is the right place for these syncs.

A failure in these new syncs needs to prevent the transaction from being
marked committed.  Hence, in CommitTransaction(), these new syncs need to
happen after the last step that could assign a new relfilenode and
before RecordTransactionCommit().  I suspect it's best to do it after
PreCommit_on_commit_actions() and before AtEOXact_LargeObject().
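
In code terms, the ordering being suggested inside CommitTransaction() is
roughly the following (a sketch only; smgrDoPendingSyncs() stands for
whatever function the patch ends up using for the new syncs):

    /* ... */
    PreCommit_on_commit_actions();

    /* sync (or WAL-log) files created WAL-skipped in this transaction;
     * a failure here must abort, i.e. happen before the commit record */
    smgrDoPendingSyncs(true);

    AtEOXact_LargeObject(true);
    /* ... */
    RecordTransactionCommit();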

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
> ...
> > > + smgrimmedsync(srel, MAIN_FORKNUM);
> > 
> > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may 
> > be
> > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > not actually sync all forks.
> 
> I agree that all forks needs syncing, but FSM and VM are checking
> RelationNeedsWAL(modified). To make sure, are you suggesting to
> sync all forks instead of emitting WAL for them, or suggesting
> that VM and FSM to emit WALs even when the modified
> RelationNeedsWAL returns false (+ sync all forks)?

I hadn't thought that far.  What do you think is best?

> > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component
> > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the 
> > relation,
> > or if it's smaller than some threshold, WAL-log the contents of the whole 
> > file
> > at that point."  Please write the part to WAL-log the contents of small 
> > files
> > instead of syncing them.
> 
> I'm not sure the point of the behavior. I suppose that the "log"
> is a sequence of new_page records. It also needs to be synced and
> it is always larger than the file to be synced. I can't think of
> an appropriate threshold without the point.

Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
every buffer header containing a buffer of the current database.  The belief
has been that writing one page to xlog is cheaper than FlushRelationBuffers()
in a busy system with large shared_buffers.

> > > --- a/src/backend/commands/copy.c
> > > +++ b/src/backend/commands/copy.c
> > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> > >* If it does commit, we'll have done the table_finish_bulk_insert() at
> > >* the bottom of this routine first.
> > >*
> > > -  * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > > -  * is not always set correctly, since in rare cases 
> > > rd_newRelfilenodeSubid
> > > -  * can be cleared before the end of the transaction. The exact case is
> > > -  * when a relation sets a new relfilenode twice in same transaction, yet
> > > -  * the second one fails in an aborted subtransaction, e.g.
> > > -  *
> > > -  * BEGIN;
> > > -  * TRUNCATE t;
> > > -  * SAVEPOINT save;
> > > -  * TRUNCATE t;
> > > -  * ROLLBACK TO save;
> > > -  * COPY ...
> > 
> > The comment material being deleted is still correct, so don't delete it.
> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
> 
> (Un?)Fortunately, that doesn't fail.. (with rebased version on
> the recent master) I'll recheck that tomorrow.

Did you build with --enable-cassert?

> > > --- a/src/include/utils/rel.h
> > > +++ b/src/include/utils/rel.h
> > > @@ -74,11 +74,13 @@ typedef struct RelationData
> > >   SubTransactionId rd_createSubid;/* rel was created in current 
> > > xact */
> > >   SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode 
> > > assigned in
> > >   
> > >  * current xact */
> > > + SubTransactionId rd_firstRelfilenodeSubid;  /* new relfilenode 
> > > assigned
> > > + 
> > >  * first in current xact 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-19 Thread Kyotaro Horiguchi
Thank you for taking time.

At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote in 
<20190818035230.gb3021...@rfd.leadboat.com>
> For two-phase commit, PrepareTransaction() needs to execute pending syncs.

Now TwoPhaseFileHeader has two new members for (commit-time)
pending syncs. Pending syncs are useless during WAL replay, but they
are needed for commit-prepared.


> On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
> > Relation NewHeap,   
...
> > -   use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> > -
> > /* use_wal off requires smgr_targblock be initially invalid */
> > Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
> 
> Since you're deleting the use_wal variable, update that last comment.

Oops. Rewrote it.

> > --- a/src/backend/catalog/storage.c
> > +++ b/src/backend/catalog/storage.c
> > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
...
> > +   smgrimmedsync(srel, MAIN_FORKNUM);
> 
> This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
> no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> false due to this code, use RelationNeedsWAL() for multiple forks, and then
> not actually sync all forks.

I agree that all forks need syncing, but FSM and VM are checking
the (modified) RelationNeedsWAL(). To make sure, are you suggesting
syncing all forks instead of emitting WAL for them, or suggesting
that VM and FSM emit WAL even when the modified
RelationNeedsWAL returns false (+ sync all forks)?

> The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component
> not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
> or if it's smaller than some threshold, WAL-log the contents of the whole file
> at that point."  Please write the part to WAL-log the contents of small files
> instead of syncing them.

I'm not sure what the point of that behavior is. I suppose that the "log"
is a sequence of new_page records. It also needs to be synced and
it is always larger than the file to be synced. I can't think of
an appropriate threshold without knowing the point.

> > --- a/src/backend/commands/copy.c
> > +++ b/src/backend/commands/copy.c
> > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> >  * If it does commit, we'll have done the table_finish_bulk_insert() at
> >  * the bottom of this routine first.
> >  *
> > -* As mentioned in comments in utils/rel.h, the in-same-transaction test
> > -* is not always set correctly, since in rare cases 
> > rd_newRelfilenodeSubid
> > -* can be cleared before the end of the transaction. The exact case is
> > -* when a relation sets a new relfilenode twice in same transaction, yet
> > -* the second one fails in an aborted subtransaction, e.g.
> > -*
> > -* BEGIN;
> > -* TRUNCATE t;
> > -* SAVEPOINT save;
> > -* TRUNCATE t;
> > -* ROLLBACK TO save;
> > -* COPY ...
> 
> The comment material being deleted is still correct, so don't delete it.
> Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> attached patch adds an assertion that RelationNeedsWAL() and the
> pendingDeletes array have the same opinion about the relfilenode, and it
> expands a test case to fail that assertion.

(Un?)Fortunately, that doesn't fail.. (with rebased version on
the recent master) I'll recheck that tomorrow.

> > --- a/src/include/utils/rel.h
> > +++ b/src/include/utils/rel.h
> > @@ -74,11 +74,13 @@ typedef struct RelationData
> > SubTransactionId rd_createSubid;/* rel was created in current 
> > xact */
> > SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode 
> > assigned in
> > 
> >  * current xact */
> > +   SubTransactionId rd_firstRelfilenodeSubid;  /* new relfilenode 
> > assigned
> > +   
> >  * first in current xact */
> 
> In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
> all the lines printed.  Many bits of code need to look at all three,
> e.g. RelationClose().

Agreed. I'll recheck that.

>  This field needs to be 100% reliable.  In other words,
> it must equal InvalidSubTransactionId if and only if the relfilenode matches
> the relfilenode that would be in place if the top transaction rolled back.

I don't get this. I think the variable moves as you suggested. It
is handled the same way as rd_new* in AtEOSubXact_cleanup, but the
difference is in assignment, not in rollback. rd_first* won't change
after the first assignment, so rollback of the subid means the
relfilenode is also rolled back to the initial value at the
beginning of the top transaction.

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-17 Thread Noah Misch
For two-phase commit, PrepareTransaction() needs to execute pending syncs.

On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
> Relation NewHeap,
>   /* Remember if it's a system catalog */
>   is_system_catalog = IsSystemRelation(OldHeap);
>  
> - /*
> -  * We need to log the copied data in WAL iff WAL archiving/streaming is
> -  * enabled AND it's a WAL-logged rel.
> -  */
> - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> -
>   /* use_wal off requires smgr_targblock be initially invalid */
>   Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);

Since you're deleting the use_wal variable, update that last comment.

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
>   {
>   SMgrRelation srel;
>  
> - srel = smgropen(pending->relnode, 
> pending->backend);
> -
> - /* allocate the initial array, or extend it, if 
> needed */
> - if (maxrels == 0)
> + if (pending->dosync)
>   {
> - maxrels = 8;
> - srels = palloc(sizeof(SMgrRelation) * 
> maxrels);
> + /* Perform pending sync of WAL-skipped 
> relation */
> + 
> FlushRelationBuffersWithoutRelcache(pending->relnode,
> + 
> false);
> + srel = smgropen(pending->relnode, 
> pending->backend);
> + smgrimmedsync(srel, MAIN_FORKNUM);

This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
false due to this code, use RelationNeedsWAL() for multiple forks, and then
not actually sync all forks.
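
Concretely, the pending-sync step would loop over every fork that exists
rather than hard-coding MAIN_FORKNUM, along these lines (sketch; srel is
the pending relation's SMgrRelation):

    ForkNumber  fork;

    for (fork = 0; fork <= MAX_FORKNUM; fork++)
        if (smgrexists(srel, fork))
            smgrimmedsync(srel, fork);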

The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component
not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point."  Please write the part to WAL-log the contents of small files
instead of syncing them.

> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
>* If it does commit, we'll have done the table_finish_bulk_insert() at
>* the bottom of this routine first.
>*
> -  * As mentioned in comments in utils/rel.h, the in-same-transaction test
> -  * is not always set correctly, since in rare cases 
> rd_newRelfilenodeSubid
> -  * can be cleared before the end of the transaction. The exact case is
> -  * when a relation sets a new relfilenode twice in same transaction, yet
> -  * the second one fails in an aborted subtransaction, e.g.
> -  *
> -  * BEGIN;
> -  * TRUNCATE t;
> -  * SAVEPOINT save;
> -  * TRUNCATE t;
> -  * ROLLBACK TO save;
> -  * COPY ...

The comment material being deleted is still correct, so don't delete it.
Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
attached patch adds an assertion that RelationNeedsWAL() and the
pendingDeletes array have the same opinion about the relfilenode, and it
expands a test case to fail that assertion.

> --- a/src/include/utils/rel.h
> +++ b/src/include/utils/rel.h
> @@ -74,11 +74,13 @@ typedef struct RelationData
>   SubTransactionId rd_createSubid;/* rel was created in current 
> xact */
>   SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode 
> assigned in
>   
>  * current xact */
> + SubTransactionId rd_firstRelfilenodeSubid;  /* new relfilenode 
> assigned
> + 
>  * first in current xact */

In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
all the lines printed.  Many bits of code need to look at all three,
e.g. RelationClose().  This field needs to be 100% reliable.  In other words,
it must equal InvalidSubTransactionId if and only if the relfilenode matches
the relfilenode that would be in place if the top transaction rolled back.
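
As a sketch of what that invariant implies at subtransaction end (for
example in AtEOSubXact_cleanup(); relation, mySubid, parentSubid and
isCommit are that function's usual parameters, and this is illustrative
rather than the patch's code):

    if (relation->rd_firstRelfilenodeSubid == mySubid)
    {
        /* On subcommit the assignment now belongs to the parent; on
         * subabort the on-disk relfilenode reverts, so forget it. */
        relation->rd_firstRelfilenodeSubid =
            isCommit ? parentSubid : InvalidSubTransactionId;
    }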

nm
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 6ebe75a..d74e9a5 100644
--- 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-02 Thread Kyotaro Horiguchi
Hello.

At Fri, 2 Aug 2019 11:35:06 +1200, Thomas Munro  wrote 
in 
> On Sat, Jul 27, 2019 at 6:26 PM Noah Misch  wrote:
> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > No substantial change have been made by this rebasing.
> >
> > Thanks.  I'll likely review this on 2019-08-20.  If someone opts to review 
> > it
> > earlier, I welcome that.
> 
> Cool.  That'll be in time to be marked committed in the September CF,
> this patch's 16th.

Yeah, this patch has been reborn far simpler and more generic (or
robust) thanks to Noah.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-01 Thread Thomas Munro
On Sat, Jul 27, 2019 at 6:26 PM Noah Misch  wrote:
> On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > No substantial change have been made by this rebasing.
>
> Thanks.  I'll likely review this on 2019-08-20.  If someone opts to review it
> earlier, I welcome that.

Cool.  That'll be in time to be marked committed in the September CF,
this patch's 16th.

-- 
Thomas Munro
https://enterprisedb.com




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-07-27 Thread Noah Misch
On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> No substantial change have been made by this rebasing.

Thanks.  I'll likely review this on 2019-08-20.  If someone opts to review it
earlier, I welcome that.




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-07-24 Thread Kyotaro Horiguchi
I found the CF-bot complaining about this.

It seems that some comment fixes in the recent 21039555cd are the
cause.

No substantial changes have been made by this rebasing.

regards.

On Fri, Jul 12, 2019 at 5:37 PM Kyotaro Horiguchi
 wrote:
>
> At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
>  wrote in 
> <20190712.173041.236938840.horikyota@gmail.com>
> > The v16 seems no longer works so I'll send further rebased version.
>
> It's just by renaming of TestLib::real_dir to perl2host.
> This is rebased version v17.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center



--
Kyotaro Horiguchi
NTT Open Source Software Center


v18-0001-TAP-test-for-copy-truncation-optimization.patch
Description: Binary data


v18-0002-Fix-WAL-skipping-feature.patch
Description: Binary data


v18-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch
Description: Binary data


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-07-12 Thread Kyotaro Horiguchi
At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190712.173041.236938840.horikyota@gmail.com>
> The v16 seems no longer works so I'll send further rebased version.

It's just due to the renaming of TestLib::real_dir to perl2host.
This is the rebased version, v17.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 9bcd4acb14c5cef2d4bdf20c9be8c86597a9cf7c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 291 
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 00..b26cd8efd5
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+	my($node, $test_name) = @_;
+
+	my $db_oid = $node->safe_psql('postgres',
+	   "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+	my $prefix = "base/$db_oid/";
+	my $filepaths_referenced = $node->safe_psql('postgres', "
+	   SELECT pg_relation_filepath(oid) FROM pg_class
+	   WHERE reltablespace = 0 and relpersistence <> 't' and
+	   pg_relation_filepath(oid) IS NOT NULL;");
+	is_deeply([sort(map { "$prefix$_" }
+	grep(/^[0-9]+$/,
+		 slurp_dir($node->data_dir . "/$prefix")))],
+			  [sort split /\n/, $filepaths_referenced],
+			  $test_name);
+	return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+	my $wal_level = shift;
+
+	# Primary needs to have wal_level = minimal here
+	my $node = get_new_node("node_$wal_level");
+	$node->init;
+	$node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+	$node->start;
+
+	# Setup
+	my $tablespace_dir = $node->basedir . '/tablespace_other';
+	mkdir ($tablespace_dir);
+	$tablespace_dir = TestLib::perl2host($tablespace_dir);
+	$node->safe_psql('postgres',
+	   "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+	# Test direct truncation optimization.  No tuples
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test1 (id serial PRIMARY KEY);
+		TRUNCATE test1;
+		COMMIT;");
+
+	$node->stop('immediate');
+	$node->start;
+
+	my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+	is($result, qq(0),
+	   "wal_level = $wal_level, optimized truncation with empty table");
+
+	# Test truncation with inserted tuples within the same transaction.
+	# Tuples inserted after the truncation should be seen.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test2 (id serial PRIMARY KEY);
+		INSERT INTO test2 VALUES (DEFAULT);
+		TRUNCATE test2;
+		INSERT INTO test2 VALUES (DEFAULT);
+		COMMIT;");
+
+	$node->stop('immediate');
+	$node->start;
+
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+	is($result, qq(1),
+	   "wal_level = $wal_level, optimized truncation with inserted table");
+
+	# Data file for COPY query in follow-up tests.
+	my $basedir = $node->basedir;
+	my $copy_file = "$basedir/copy_data.txt";
+	TestLib::append_to_file($copy_file, qq(2,3
+20001,30001
+20002,30002));
+
+	# Test truncation with inserted tuples using COPY.  Tuples copied after the
+	# truncation should be seen.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE test3;
+		COPY test3 FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+	is($result, qq(3),
+	   "wal_level = $wal_level, optimized truncation with copied table");
+
+	# Like previous test, but rollback SET TABLESPACE in a subtransaction.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE test3a;
+		SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+		COPY test3a FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+	is($result, qq(3),
+	   "wal_level = $wal_level, SET TABLESPACE in 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-07-12 Thread Kyotaro Horiguchi
Many messages seem to have been lost during the move to the new environment..
I'm digging through the archive but couldn't find the message for v15..

At Thu, 11 Jul 2019 18:03:35 -0700, Noah Misch  wrote in 
<20190712010335.gb1610...@rfd.leadboat.com>
> On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote:
> > Hello. Rebased the patch to master(bd56cd75d2).
> 
> It looks like you did more than just a rebase, because this v16 no longer
> modifies many files that v14 did modify.  (That's probably good, since you had
> pending review comments.)  What other changes did you make?

Yeah.. Maybe I forgot to send pre-v15 or v16 before rebasing.

v14: WAL-logging is controlled by AMs and syncing at commit is
controlled according to that behavior.  At-commit sync is still
controlled on a per-relation basis, which means it must be
processed before the transaction state becomes TRANS_COMMIT. So
it needs to be separated out of AtEOXact_RelationCache() into
PreCommit_RelationSync().

v15: The biggest change is that at-commit sync is now done on an smgr
   basis. The at-commit sync is registered at creation of a storage
   file (RelationCreateStorage), and smgrDoPendingDelete (or
   smgrDoPendingOperations after the rename) runs the syncs.  AMs are no
   longer involved, and all permanent relations are entirely WAL-skipped
   in the creation transaction while wal_level=minimal.  (A rough
   sketch of that registration appears after this list.)

   All storage files created for a relation are synced once and then
   removed at commit.

v16: rebased.
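
For the v15 design, the registration mentioned above would look roughly
like this inside RelationCreateStorage() (sketch; AddPendingSync is an
illustrative name, not necessarily what the patch uses):

    /* after smgrcreate(), for the main fork of a permanent relation */
    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
        AddPendingSync(&rnode);     /* fsync this file at commit */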

The v16 no longer seems to work, so I'll send a further rebased version.

Sorry for the late reply and confusion..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-07-11 Thread Noah Misch
On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote:
> Hello. Rebased the patch to master(bd56cd75d2).

It looks like you did more than just a rebase, because this v16 no longer
modifies many files that v14 did modify.  (That's probably good, since you had
pending review comments.)  What other changes did you make?




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-07-09 Thread Kyotaro Horiguchi
Hello. Rebased the patch onto master (bd56cd75d2).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From ac52e2c1c56a96c1745149ff4220a3a116d6c811 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 291 
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 00..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+	my($node, $test_name) = @_;
+
+	my $db_oid = $node->safe_psql('postgres',
+	   "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+	my $prefix = "base/$db_oid/";
+	my $filepaths_referenced = $node->safe_psql('postgres', "
+	   SELECT pg_relation_filepath(oid) FROM pg_class
+	   WHERE reltablespace = 0 and relpersistence <> 't' and
+	   pg_relation_filepath(oid) IS NOT NULL;");
+	is_deeply([sort(map { "$prefix$_" }
+	grep(/^[0-9]+$/,
+		 slurp_dir($node->data_dir . "/$prefix")))],
+			  [sort split /\n/, $filepaths_referenced],
+			  $test_name);
+	return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+	my $wal_level = shift;
+
+	# Primary needs to have wal_level = minimal here
+	my $node = get_new_node("node_$wal_level");
+	$node->init;
+	$node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+	$node->start;
+
+	# Setup
+	my $tablespace_dir = $node->basedir . '/tablespace_other';
+	mkdir ($tablespace_dir);
+	$tablespace_dir = TestLib::real_dir($tablespace_dir);
+	$node->safe_psql('postgres',
+	   "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+	# Test direct truncation optimization.  No tuples
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test1 (id serial PRIMARY KEY);
+		TRUNCATE test1;
+		COMMIT;");
+
+	$node->stop('immediate');
+	$node->start;
+
+	my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+	is($result, qq(0),
+	   "wal_level = $wal_level, optimized truncation with empty table");
+
+	# Test truncation with inserted tuples within the same transaction.
+	# Tuples inserted after the truncation should be seen.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test2 (id serial PRIMARY KEY);
+		INSERT INTO test2 VALUES (DEFAULT);
+		TRUNCATE test2;
+		INSERT INTO test2 VALUES (DEFAULT);
+		COMMIT;");
+
+	$node->stop('immediate');
+	$node->start;
+
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+	is($result, qq(1),
+	   "wal_level = $wal_level, optimized truncation with inserted table");
+
+	# Data file for COPY query in follow-up tests.
+	my $basedir = $node->basedir;
+	my $copy_file = "$basedir/copy_data.txt";
+	TestLib::append_to_file($copy_file, qq(2,3
+20001,30001
+20002,30002));
+
+	# Test truncation with inserted tuples using COPY.  Tuples copied after the
+	# truncation should be seen.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE test3;
+		COPY test3 FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+	is($result, qq(3),
+	   "wal_level = $wal_level, optimized truncation with copied table");
+
+	# Like previous test, but rollback SET TABLESPACE in a subtransaction.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE test3a;
+		SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+		COPY test3a FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+	is($result, qq(3),
+	   "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+	# in different subtransaction patterns
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE 

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-06-28 Thread Amit Kapila
On Tue, May 28, 2019 at 4:33 AM Noah Misch  wrote:
>
> On Mon, May 27, 2019 at 02:08:26PM +0900, Kyotaro HORIGUCHI wrote:
> > At Fri, 24 May 2019 19:33:32 -0700, Noah Misch  wrote in 
> > <20190525023332.ge1624...@rfd.leadboat.com>
> > > On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> > > > Following this direction, the attached PoC works *at least for*
> > > > the wal_optimization TAP tests, but doing pending flush not in
> > > > smgr but in relcache.
> > >
> > > This task, syncing files created in the current transaction, is not
> > > the kind of task normally assigned to a cache.  We already have a
> > > module, storage.c, that maintains state about files created in the
> > > current transaction.  Why did you use relcache instead of storage.c?
> >
> > The reason was that the at-commit sync needs a buffer flush beforehand.
> > But FlushRelationBufferWithoutRelCache() in v11 can do that, so
> > storage.c is a reasonable place for it.
>
> Okay.  I do want this to work in 9.5 and later, but I'm not aware of a reason
> relcache.c would be a better code location in older branches.  Unless you
> think of a reason to prefer relcache.c, please use storage.c.
>
> > > On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > > This is a tidier version of the patch.
> > >
> > > > - Move the substantial work to table/index AMs.
> > > >
> > > >   Each AM can decide whether to support WAL skip or not.
> > > >   Currently heap and nbtree support it.
> > >
> > > Why would an AM find it important to disable WAL skip?
> >
> > The reason is that it is currently the AM's responsibility to decide
> > whether to skip WAL or not.
>
> I see.  Skipping the sync would be a mere optimization; no AM would require it
> for correctness.  An AM might want RelationNeedsWAL() to keep returning true
> despite the sync happening, perhaps because it persists data somewhere other
> than the forks of pg_class.relfilenode.  Since the index and table APIs
> already assume one relfilenode captures all persistent data, I'm not seeing a
> use case for an AM overriding this behavior.  Let's take away the AM's
> responsibility for this decision, making the system simpler.  A future patch
> could let AM code decide, if someone finds a real-world use case for
> AM-specific logic around when to skip WAL.
>

It seems there is some feedback for this patch, and the CF is going to
start in 2 days.  Are you planning to work on this patch for the next
CF?  If not, it would be better to bump it.  It is not a good idea for
the patch to sit in "waiting on author" at the beginning of the CF
unless the author is actively working on it and is going to produce a
new version in the next few days.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-05-27 Thread Noah Misch
On Mon, May 27, 2019 at 02:08:26PM +0900, Kyotaro HORIGUCHI wrote:
> At Fri, 24 May 2019 19:33:32 -0700, Noah Misch  wrote in 
> <20190525023332.ge1624...@rfd.leadboat.com>
> > On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> > > Following this direction, the attached PoC works *at least for*
> > > the wal_optimization TAP tests, but doing pending flush not in
> > > smgr but in relcache.
> > 
> > This task, syncing files created in the current transaction, is not the kind
> > of task normally assigned to a cache.  We already have a module, storage.c,
> > that maintains state about files created in the current transaction.  Why
> > did you use relcache instead of storage.c?
> 
> The reason was that the at-commit sync needs a buffer flush beforehand.
> But FlushRelationBufferWithoutRelCache() in v11 can do that, so
> storage.c is a reasonable place for it.

Okay.  I do want this to work in 9.5 and later, but I'm not aware of a reason
relcache.c would be a better code location in older branches.  Unless you
think of a reason to prefer relcache.c, please use storage.c.

> > On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > This is a tidier version of the patch.
> > 
> > > - Move the substantial work to table/index AMs.
> > > 
> > >   Each AM can decide whether to support WAL skip or not.
> > >   Currently heap and nbtree support it.
> > 
> > Why would an AM find it important to disable WAL skip?
> 
> The reason is that it is currently the AM's responsibility to decide
> whether to skip WAL or not.

I see.  Skipping the sync would be a mere optimization; no AM would require it
for correctness.  An AM might want RelationNeedsWAL() to keep returning true
despite the sync happening, perhaps because it persists data somewhere other
than the forks of pg_class.relfilenode.  Since the index and table APIs
already assume one relfilenode captures all persistent data, I'm not seeing a
use case for an AM overriding this behavior.  Let's take away the AM's
responsibility for this decision, making the system simpler.  A future patch
could let AM code decide, if someone finds a real-world use case for
AM-specific logic around when to skip WAL.
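
For illustration, such a centralized test could look roughly like the sketch
below.  This is only a sketch, not the committed macro: relpersistence,
rd_createSubid, XLogIsNeeded() and InvalidSubTransactionId are existing
backend symbols, while rd_firstRelfilenodeSubid stands in for whatever
bookkeeping the patch ends up adding to record the subtransaction that
assigned the relation's current relfilenode.

/*
 * Sketch only: WAL is needed for a permanent relation unless wal_level is
 * minimal and the relation's current relfilenode was created or assigned
 * in the current transaction.  rd_firstRelfilenodeSubid is assumed
 * bookkeeping, not an existing RelationData field.
 */
#define RelationNeedsWALSketch(relation) \
	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
	 (XLogIsNeeded() || \
	  ((relation)->rd_createSubid == InvalidSubTransactionId && \
	   (relation)->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))

Under wal_level >= replica, XLogIsNeeded() is true and nothing changes; the
skip applies only to relfilenodes that the committing transaction itself
created, which it must then sync at commit.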




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-05-26 Thread Kyotaro HORIGUCHI
Thanks for the comment!

At Fri, 24 May 2019 19:33:32 -0700, Noah Misch  wrote in 
<20190525023332.ge1624...@rfd.leadboat.com>
> On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> > Following this direction, the attached PoC works *at least for*
> > the wal_optimization TAP tests, but doing pending flush not in
> > smgr but in relcache.
> 
> This task, syncing files created in the current transaction, is not the kind
> of task normally assigned to a cache.  We already have a module, storage.c,
> that maintains state about files created in the current transaction.  Why did
> you use relcache instead of storage.c?

The reason was that the at-commit sync needs a buffer flush beforehand.
But FlushRelationBufferWithoutRelCache() in v11 can do that, so
storage.c is a reasonable place for it.
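
For concreteness, a minimal sketch of that at-commit step, assuming a
Relation for the WAL-skipped table is still at hand, could look like the
following.  The real patch would drive this from storage.c's
pending-operation bookkeeping and a buffer-flush variant that does not
depend on the relcache; FlushRelationBuffers() and smgrimmedsync() are
existing backend calls, used here only for illustration.

#include "postgres.h"

#include "storage/bufmgr.h"		/* FlushRelationBuffers() */
#include "storage/smgr.h"		/* smgrimmedsync(), MAIN_FORKNUM */
#include "utils/rel.h"			/* RelationOpenSmgr(), rd_smgr */

/*
 * Sketch only: at commit, for a relation whose writes skipped WAL, write
 * out its dirty shared buffers and then fsync the main fork, so the data
 * survives a crash immediately following the commit.
 */
static void
sync_wal_skipped_relation(Relation rel)
{
	/* Push dirty pages for this relation out of shared buffers first. */
	FlushRelationBuffers(rel);

	/* Then force the underlying file to disk before the commit record. */
	RelationOpenSmgr(rel);
	smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
}

Syncing without the preceding flush would miss pages that are still dirty
only in shared buffers, which is why the flush has to come first.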

> On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> > This is a tidier version of the patch.
> 
> > - Move the substantial work to table/index AMs.
> > 
> >   Each AM can decide whether to support WAL skip or not.
> >   Currently heap and nbtree support it.
> 
> Why would an AM find it important to disable WAL skip?

The reason is that it is currently the AM's responsibility to decide
whether to skip WAL or not.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




