Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-24 Thread Noah Misch
On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote:
> - Some paths for single-value leaves are not covered:
> 
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
> 
> However, these paths do get regression test coverage on 32-bit
> machines. 64-bit builds only have leaves in the TID store, which
> doesn't (currently) delete entries, and doesn't instantiate the tree
> with the debug option.
> 
> - In RT_SET "if (found)" is not covered:
> 
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
> 
> That's because we don't yet have code that replaces an existing value
> with a value of a different length.

I saw a SIGSEGV there when using tidstore to write a fix for something else.
Patch attached.
Author: Noah Misch 
Commit: Noah Misch 

radixtree: Fix SIGSEGV at update of embeddable value to non-embeddable.

Also, fix a memory leak when updating from non-embeddable to embeddable.
Both were unreachable without adding C code.

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index dc4c00d..aa8f44c 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1749,6 +1749,9 @@ have_slot:
 
if (RT_VALUE_IS_EMBEDDABLE(value_p))
{
+   if (found && !RT_CHILDPTR_IS_VALUE(*slot))
+   RT_FREE_LEAF(tree, *slot);
+
/* store value directly in child pointer slot */
memcpy(slot, value_p, value_sz);
 
@@ -1765,7 +1768,7 @@ have_slot:
{
RT_CHILD_PTR leaf;
 
-   if (found)
+   if (found && !RT_CHILDPTR_IS_VALUE(*slot))
{
Assert(RT_PTR_ALLOC_IS_VALID(*slot));
leaf.alloc = *slot;
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
index 06c610e..d116927 100644
--- a/src/test/modules/test_tidstore/expected/test_tidstore.out
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -79,6 +79,96 @@ SELECT test_destroy();
  
 (1 row)
 
+-- Test replacements crossing RT_CHILDPTR_IS_VALUE in both directions
+SELECT test_create(false);
+ test_create 
+-------------
+ 
+(1 row)
+
+SELECT do_set_block_offsets(1, array[1]::int2[]); SELECT check_set_block_offsets();
+ do_set_block_offsets 
+----------------------
+                    1
+(1 row)
+
+ check_set_block_offsets 
+-------------------------
+ 
+(1 row)
+
+SELECT do_set_block_offsets(1, array[1,2]::int2[]); SELECT check_set_block_offsets();
+ do_set_block_offsets 
+----------------------
+                    1
+(1 row)
+
+ check_set_block_offsets 
+-------------------------
+ 
+(1 row)
+
+SELECT do_set_block_offsets(1, array[1,2,3]::int2[]); SELECT check_set_block_offsets();
+ do_set_block_offsets 
+----------------------
+                    1
+(1 row)
+
+ check_set_block_offsets 
+-------------------------
+ 
+(1 row)
+
+SELECT do_set_block_offsets(1, array[1,2,3,4]::int2[]); SELECT check_set_block_offsets();
+ do_set_block_offsets 
+----------------------
+                    1
+(1 row)
+
+ check_set_block_offsets 
+-------------------------
+ 
+(1 row)
+
+SELECT do_set_block_offsets(1, array[1,2,3]::int2[]); SELECT check_set_block_offsets();
+ do_set_block_offsets 
+----------------------
+                    1
+(1 row)
+
+ check_set_block_offsets 
+-------------------------
+ 
+(1 row)
+
+SELECT do_set_block_offsets(1, array[1,2]::int2[]); SELECT check_set_block_offsets();
+ do_set_block_offsets 
+----------------------
+                    1
+(1 row)
+
+ check_set_block_offsets 
+-------------------------
+ 
+(1 row)
+
+SELECT do_set_block_offsets(1, array[1]::int2[]); SELECT check_set_block_offsets();
+ do_set_block_offsets 
+----------------------
+                    1
+(1 row)
+
+ check_set_block_offsets 
+-------------------------
+ 
+(1 row)
+
+SELECT test_destroy();
+ test_destroy 
+--------------
+ 
+(1 row)
+
 -- Use shared memory this time. We can't do that in test_radixtree.sql,
 -- because unused static functions would raise warnings there.
 SELECT test_create(true);
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
index bb31877..704d869 100644
--- a/src/test/modules/test_tidstore/sql/test_tidstore.sql
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -14,6 +14,7 @@ CREATE TEMP TABLE hideblocks(blockno bigint);
 -- We use a higher number to test tidstore.
 \set maxoffset 512
 
+
 SELEC

datfrozen/relfrozen update race condition

2024-04-22 Thread Noah Misch
On Tue, May 24, 2016 at 03:01:13PM -0400, Tom Lane wrote:
> Also, I notice another problem in vac_truncate_clog() now that I'm looking
> at it: it's expecting that the pg_database datfrozenxid and datminmxid
> values will hold still while it's looking at them.  Since
> vac_update_datfrozenxid updates those values in-place, there's a race
> condition against VACUUMs happening in other databases.  We should fetch
> those values into local variables before doing the various tests inside
> the scan loop.

Commit 2d2e40e fixed the above.  There's another problem just like it, one
layer lower.  vac_update_datfrozenxid() has:

	if (TransactionIdPrecedes(classForm->relfrozenxid, newFrozenXid))
		newFrozenXid = classForm->relfrozenxid;

classForm points to buffer memory, and vac_update_relstats() inplace-updates
the buffer.  Like vac_truncate_clog(), we don't mind using an old value, but
those two lines must use the same value.  The attached test case shows this
bug making datfrozenxid move ahead of relfrozenxid.  The attached patch fixes
it.  (I noticed this while finishing up patches for the heap_inplace_update
writer race in https://postgr.es/m/20231102030915.d3.nmi...@google.com.)

I audited other read-only use of inplace-updated fields.  Others look safe,
because they hold rel locks that exclude VACUUM, or they make only
non-critical decisions.  Still, let's change some to the load-once style, to
improve the chance of future copy/paste finding the safe style.  I'm attaching
a patch for that, too.  I didn't add "volatile", because I couldn't think of
how we'd care if the load moved earlier.
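
For illustration, the load-once style amounts to this (a sketch only; the
attached patch is authoritative on names and placement):

	TransactionId relfrozenxid = classForm->relfrozenxid;	/* read once */

	if (TransactionIdPrecedes(relfrozenxid, newFrozenXid))
		newFrozenXid = relfrozenxid;	/* test and assignment agree */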
Author:     Noah Misch 
Commit: Noah Misch 

Test datfrozenxid/relfrozenxid update race condition

Not for commit, mostly because the injection point won't make sense
after the fix; it would be obvious that the test is unlikely to detect
similar future bugs.  (A commit-ready test would also need to remove
temporary WARNING messages added to clarify what's happening, and it
would need to hide outputs sensitive to exact XID consumption.)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b589279..73f0245 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -53,9 +53,11 @@
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/acl.h"
+#include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
+#include "utils/injection_point.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
@@ -1620,6 +1622,8 @@ vac_update_datfrozenxid(void)
while ((classTup = systable_getnext(scan)) != NULL)
{
Form_pg_class classForm = (Form_pg_class) GETSTRUCT(classTup);
+   TransactionId before, after;
+   before = classForm->relfrozenxid;
 
/*
 * Only consider relations able to hold unfrozen XIDs (anything else
@@ -1662,7 +1666,11 @@ vac_update_datfrozenxid(void)
 
/* determine new horizon */
		if (TransactionIdPrecedes(classForm->relfrozenxid, newFrozenXid))
+		{
+			if (namestrcmp(&classForm->relname, "pg_test_last_table") == 0)
+				INJECTION_POINT("between-relfrozenxid-reads");
			newFrozenXid = classForm->relfrozenxid;
+		}
}
 
if (MultiXactIdIsValid(classForm->relminmxid))
@@ -1678,6 +1686,12 @@ vac_update_datfrozenxid(void)
		if (MultiXactIdPrecedes(classForm->relminmxid, newMinMulti))
newMinMulti = classForm->relminmxid;
}
+
+   after = classForm->relfrozenxid;
+		elog(WARNING, "done reading %s before=%u after=%u myxid=%u myxmin=%u",
+			 NameStr(classForm->relname), before, after,
+			 MyProc->xid, MyProc->xmin);
}
 
/* we're done with pg_class */
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 13009cc..bb2db58 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -17,8 +17,11 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "storage/predicate_internals.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -606,8 +609,9 @@ pg_safe_snapshot_blocking_pids(PG

Re: TerminateOtherDBBackends code comments inconsistency.

2024-04-22 Thread Noah Misch
On Mon, Apr 15, 2024 at 11:17:54AM +0530, Amit Kapila wrote:
> On Thu, Apr 11, 2024 at 6:58 PM Kirill Reshke  wrote:
> >
> > While working on [0] i have noticed this comment in
> > TerminateOtherDBBackends function:
> >
> > /*
> > * Check whether we have the necessary rights to terminate other
> > * sessions. We don't terminate any session until we ensure that we
> > * have rights on all the sessions to be terminated. These checks are
> > * the same as we do in pg_terminate_backend.
> > *
> > * In this case we don't raise some warnings - like "PID %d is not a
> > * PostgreSQL server process", because for us already finished session
> > * is not a problem.
> > */
> >
> > This statement is not true after 3a9b18b.
> > "These checks are the same as we do in pg_terminate_backend."

The comment mismatch is a problem.  Thanks for reporting it.  The DROP
DATABASE doc mimics the comment, using "permissions are the same as with
pg_terminate_backend".

> > But the code is still correct, I assume... or not? In fact, we are
> > killing autovacuum workers which are working with a given database
> > (proc->roleId == 0), which is OK in that case. Are there any other
> > cases when proc->roleId == 0 but we should not be able to kill such a
> > process?
> >
> 
> Good question. I am not aware of such cases but I wonder if we should
> add a check similar to 3a9b18b [1] for the reason given in the commit
> message. I have added Noah to see if he has any suggestions on this
> matter.
> 
> [1] -
> commit 3a9b18b3095366cd0c4305441d426d04572d88c1
> Author: Noah Misch 
> Date:   Mon Nov 6 06:14:13 2023 -0800
> 
> Ban role pg_signal_backend from more superuser backend types.

That commit distinguished several scenarios.  Here's how they apply in
TerminateOtherDBBackends():

- logical replication launcher, autovacuum launcher: the proc->databaseId test
  already skips them, since they don't connect to a database.  Vignesh said
  this.

- autovacuum worker: should terminate, since CountOtherDBBackends() would
  terminate them in DROP DATABASE without FORCE.

- background workers that connect to a database: the right thing is less clear
  on these, but I lean toward allowing termination and changing the DROP
  DATABASE doc.  As a bgworker author, I would value being able to recommend
  DROP DATABASE FORCE if a worker is sticking around unexpectedly.  There's
  relatively little chance of a bgworker actually wanting to block DROP
  DATABASE FORCE or having an exploitable termination-time bug.

Thoughts?




Re: DROP OWNED BY fails to clean out pg_init_privs grants

2024-04-05 Thread Noah Misch
On Fri, Apr 05, 2024 at 07:10:59PM -0400, Tom Lane wrote:
> I wondered why buildfarm member copperhead has started to fail
> xversion-upgrade-HEAD-HEAD tests.  I soon reproduced the problem here:
> 
> pg_restore: creating ACL "regress_pg_dump_schema.TYPE "test_type""
> pg_restore: while PROCESSING TOC:
> pg_restore: from TOC entry 4355; 0 0 ACL TYPE "test_type" buildfarm
> pg_restore: error: could not execute query: ERROR:  role "74603" does not 
> exist
> Command was: SELECT pg_catalog.binary_upgrade_set_record_init_privs(true);
> GRANT ALL ON TYPE "regress_pg_dump_schema"."test_type" TO "74603";
> SELECT pg_catalog.binary_upgrade_set_record_init_privs(false);
> REVOKE ALL ON TYPE "regress_pg_dump_schema"."test_type" FROM "74603";
> 
> (So now I'm wondering why *only* copperhead has shown this so far.
> Are our other cross-version-upgrade testing animals AWOL?)
> 
> I believe this is a longstanding problem that was exposed by accident
> by commit 936e3fa37.  If you run "make installcheck" in HEAD's
> src/test/modules/test_pg_dump, and then poke around in the leftover
> contrib_regression database, you can find dangling grants in
> pg_init_privs:
> 
> contrib_regression=# table pg_init_privs;
>  objoid | classoid | objsubid | privtype |                   initprivs
> --------+----------+----------+----------+-----------------------------------------------
>   ...
>   43134 |     1259 |        0 | e        | {postgres=rwU/postgres,43125=U/postgres}
>   43128 |     1259 |        0 | e        | {postgres=arwdDxtm/postgres,43125=r/postgres}
>   ...
> 
> The fact that the DROP ROLE added by 936e3fa37 succeeded indicates
> that these role references weren't captured in pg_shdepend.
> I imagine that we also lack code that would allow DROP OWNED BY to
> follow up on such entries if they existed, but I've not checked that
> for sure.  In any case, there's probably a nontrivial amount of code
> to be written to make this work.
> 
> Given the lack of field complaints, I suspect that extension scripts
> simply don't grant privileges to random roles that aren't the
> extension's owner.  So I wonder a little bit if this is even worth
> fixing, as opposed to blocking off somehow.  But probably we should
> first try to fix it.

This sounds closely-related to the following thread:
https://www.postgresql.org/message-id/flat/1573808483712.96817%40Optiver.com




Re: AIX support

2024-04-05 Thread Noah Misch
On Fri, Apr 05, 2024 at 04:12:06PM +, Sriram RK wrote:
> 
> > What you do need to do to reproduce the described problems is
> > check out the Postgres git tree and rewind to just before
> > commit 0b16bb877, where we deleted AIX support.  Any attempt
> > to restore AIX support would have to start with reverting that
> > commit (and perhaps the followup f0827b443).

> Going ahead, we want to build the changes that were removed as part of 
> “0b16bb8776bb8”, with latest Xlc and gcc version.
> 
> We were building the source from the postgres ftp 
> server(https://ftp.postgresql.org/pub/source/), would like to understand if 
> there are any source level changes between the ftp server and the source on 
> github?

To succeed in this endeavor, you'll need to develop fluency in the tools to
answer questions like that, or bring in someone who is fluent.  In this case,
GNU diff is the standard tool for answering your question.  These resources
cover other topics you would need to learn:

https://wiki.postgresql.org/wiki/Developer_FAQ
https://wiki.postgresql.org/wiki/So,_you_want_to_be_a_developer%3F




Re: Query execution in Perl TAP tests needs work

2024-03-30 Thread Noah Misch
On Wed, Aug 30, 2023 at 09:48:42AM +1200, Thomas Munro wrote:
> On Wed, Aug 30, 2023 at 1:49 AM Noah Misch  wrote:
> > On Tue, Aug 29, 2023 at 04:25:24PM +1200, Thomas Munro wrote:
> > > On Tue, Aug 29, 2023 at 1:48 PM Noah Misch  wrote:
> > > > https://github.com/cpan-authors/IPC-Run/issues/166#issuecomment-1288190929
> > >
> > > Interesting.  But that shows a case with no pipes connected, using
> > > select() as a dumb sleep and ignoring SIGCHLD.  In our usage we have
> > > pipes connected, and I think select() should return when the child's
> > > output pipes become readable due to EOF.  I guess something about that
> > > might be b0rked on Windows?  I see there is an extra helper process
> > > doing socket<->pipe conversion (hah, that explains an extra ~10ms at
> > > the start in my timestamps)...
> >
> > In that case, let's assume it's not the same issue.
> 
> Yeah, I think it amounts to the same thing, if EOF never arrives.
> 
> I suspect that we could get ->safe_psql() down to about ~25ms baseline
> if someone could fix the posited IPC::Run EOF bug

I pushed optimizations in https://github.com/cpan-authors/IPC-Run/pull/172
that make the TAP portion of "make check-world" 7% faster on my GNU/Linux
machine.  I didn't confirm an EOF bug, but that change also reduces Windows
idle time in simple tests.  I didn't run Windows check-world with it.  For
non-Windows, we can get almost all the benefit from the attached one-liner.
(The relative benefit is probably lower for parallel check-world, where idle
threads matter less, and for slower machines.)
Author: Noah Misch 
Commit: Noah Misch 

Backport IPC::Run optimization to src/test/perl.

This one-liner makes the TAP portion of "make check-world" 7% faster on
a non-Windows machine.

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index 42d5a50..6f7a48b 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -189,6 +189,10 @@ Set to true when running under MSYS2.
 
 INIT
 {
+	# See https://github.com/cpan-authors/IPC-Run/commit/fc9288c for how this
+   # reduces idle time.  Remove this when IPC::Run 20231003.0 is too old to
+   # matter (when all versions that matter provide the optimization).
+   $SIG{CHLD} = sub { };
 
# Return EPIPE instead of killing the process with SIGPIPE.  An affected
# test may still fail, but it's more likely to report useful facts.


Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2024-03-29 Thread Noah Misch
On Fri, Mar 29, 2024 at 02:17:08PM -0400, Peter Geoghegan wrote:
> On Mon, Mar 25, 2024 at 2:24 PM Noah Misch  wrote:
> > I wasn't thinking about changing the pre-v17 bt_right_page_check_scankey()
> > code.  I got interested in this area when I saw the interaction of the new
> > "first key on the next page" logic with bt_right_page_check_scankey().  The
> > patch made bt_right_page_check_scankey() pass back rightfirstoffset.  The 
> > new
> > code then does palloc_btree_page() and PageGetItem() with that offset, which
> > bt_right_page_check_scankey() had already done.  That smelled like a 
> > misplaced
> > distribution of responsibility.  For a time, I suspected the new code should
> > move down into bt_right_page_check_scankey().  Then I transitioned to 
> > thinking
> > checkunique didn't need new code for the page boundary.

>  I did notice (I meant to point out) that I have concerns about this
> part of the new uniqueness check code:
> 
> "
> if (P_IGNORE(topaque) || !P_ISLEAF(topaque))
> break;
> "
> 
> My concern here is with the !P_ISLEAF(topaque) test -- it shouldn't be
> required. If the page in question isn't a leaf page, then the index
> must be corrupt (or the page deletion recycle safety/drain technique
> thing is buggy). The " !P_ISLEAF(topaque)" part of the check is either
> superfluous or something that ought to be reported as corruption --
> it's not a legal/expected state.

Good point.

> Separately, I dislike the way the target block changes within
> bt_target_page_check(). The general idea behind verify_nbtree.c's
> target block is that every block becomes the target exactly once, in a
> clearly defined place.

Agreed.




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-29 Thread Noah Misch
On Fri, Mar 29, 2024 at 09:17:55AM +0100, Jelte Fennema-Nio wrote:
> On Thu, 28 Mar 2024 at 19:03, Tom Lane  wrote:
> > Could we make this test bulletproof by using an injection point?
> > If not, I remain of the opinion that we're better off without it.
> 
> Possibly, and if so, I agree that would be better than the currently
> added test. But I honestly don't feel like spending the time on
> creating such a test.

The SQL test is more representative of real applications, and it's way simpler
to understand.  In general, I prefer 6-line SQL tests that catch a problem 10%
of the time over injection point tests that catch it 100% of the time.  For
low detection rate to be exciting, it needs to be low enough to have a serious
chance of all buildfarm members reporting green for the bad commit.  With ~115
buildfarm members running in the last day, 0.1% detection rate would have been
low enough to bother improving, but 4% would be high enough to call it good.
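
To quantify that (my arithmetic, treating the members as ~115 independent
trials): the chance a bad commit shows an all-green buildfarm is (1 - p)^115:

	p = 0.001:  0.999^115 ~= 0.89   (the bad commit probably slips through)
	p = 0.04:   0.96^115  ~= 0.009  (some member almost surely goes red)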




Re: AIX support

2024-03-28 Thread Noah Misch
On Thu, Mar 28, 2024 at 11:09:43AM +, Sriram RK wrote:
> We are setting up the build environment and trying to build the source and 
> also trying to analyze the assert from the Aix point of view.

The thread Alvaro and Tom cited contains an analysis.  It's a compiler bug.
You can get past the compiler bug by upgrading your compiler; both ibm-clang
17.1.1.2 and gcc 13.2.0 are free from the bug.

> Also, would like to know if we can access the buildfarm(power machines) to 
> run any of the specific tests to hit this assert.

https://portal.cfarm.net/users/new/ is the form to request access.  It lists
the eligibility criteria.




Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2024-03-25 Thread Noah Misch
On Mon, Mar 25, 2024 at 12:03:10PM -0400, Peter Geoghegan wrote:
> On Sun, Mar 24, 2024 at 10:03 PM Noah Misch  wrote:

> Separately, I now see that the committed patch just reuses the code
> that has long been used to check that things are in the correct order
> across page boundaries: this is the bt_right_page_check_scankey check,
> which existed in the very earliest versions of amcheck. So while I
> agree that we could just keep the original scan key (from the last
> item on every leaf page), and then make the check at the start of the
> next page instead (as opposed to making it at the end of the previous
> leaf page, which is how it works now), it's not obvious that that
> would be a good trade-off, all things considered.
> 
> It might still be a little better that way around, overall, but you're
> not just talking about changing the recently committed checkunique
> patch (I think). You're also talking about restructuring the long
> established bt_right_page_check_scankey check (otherwise, what's the
> point?). I'm not categorically opposed to that, but it's not as if

I wasn't thinking about changing the pre-v17 bt_right_page_check_scankey()
code.  I got interested in this area when I saw the interaction of the new
"first key on the next page" logic with bt_right_page_check_scankey().  The
patch made bt_right_page_check_scankey() pass back rightfirstoffset.  The new
code then does palloc_btree_page() and PageGetItem() with that offset, which
bt_right_page_check_scankey() had already done.  That smelled like a misplaced
distribution of responsibility.  For a time, I suspected the new code should
move down into bt_right_page_check_scankey().  Then I transitioned to thinking
checkunique didn't need new code for the page boundary.

> it'll allow you to throw out a bunch of code -- AFAICT that proposal
> doesn't have that clear advantage going for it. The race condition
> that is described at great length in bt_right_page_check_scankey isn't
> ever going to be a problem for the recently committed checkunique
> patch (as you more or less pointed out yourself), but obviously it is
> still a concern for the cross-page order check.
> 
> In summary, the old bt_right_page_check_scankey check is strictly
> concerned with the consistency of a physical data structure (the index
> itself), whereas the new checkunique check makes sure that the logical
> content of the database is consistent (the index, the heap, and all
> associated transaction status metadata have to be consistent). That
> means that the concerns that are described at length in
> bt_right_page_check_scankey (nor anything like those concerns) don't
> apply to the new checkunique check. We agree on all that, I think. But
> it's less clear that that presents us with an opportunity to simplify
> this patch.

See above for why I anticipated a simplification opportunity with respect to
new-in-v17 code.  Still, it may not pan out.

> > Adding checkunique raised runtime from 58s to 276s, because it checks

Side note: my last email incorrectly described that as "raises runtime by
476%".  It should have said "by 376%" or "by a factor of 4.76".

> > visibility for every heap tuple.  It could do the heap fetch and visibility
> > check lazily, when the index yields two heap TIDs for one scan key.  That
> > should give zero visibility checks for this particular test case, and it
> > doesn't add visibility checks to bloated-table cases.

> It seems like the implication of everything that you said about
> refactoring/moving the check was that doing so would enable this
> optimization (at least an implementation along the lines of your
> pseudo code). If that was what you intended, then it's not obvious to
> me why it is relevant. What, if anything, does it have to do with
> making the new checkunique visibility checks happen lazily?

Their connection is just being the two big-picture topics I found in
post-commit review.  Decisions about the cross-page check are indeed separable
from decisions about lazy vs. eager visibility checks.

Thanks,
nm




Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2024-03-24 Thread Noah Misch
On Mon, Oct 30, 2023 at 11:29:04AM +0400, Pavel Borisov wrote:
> On Wed, 25 Oct 2023 at 00:13, Alexander Korotkov  wrote:
> > I think this patch is ready to go.  I'm going to push it if there are
> > no objections.

> It's very good that this long-standing patch is finally committed. Thanks a 
> lot!

Agreed.  I gave this feature (commit 5ae2087) a try.  Thanks for implementing
it.  Could I get your input on two topics?


 1. Cross-page comparison at "first key on the next page" only

Cross-page comparisons got this discussion upthread:

On Tue, Mar 02, 2021 at 07:10:32PM -0800, Peter Geoghegan wrote:
> On Mon, Feb 8, 2021 at 2:46 AM Pavel Borisov  wrote:
> > Caveat: if the first entry on the next index page has a key equal to the 
> > key on a previous page AND all heap tid's corresponding to this entry are 
> > invisible, currently cross-page check can not detect unique constraint 
> > violation between previous index page entry and 2nd, 3d and next current 
> > index page entries. In this case, there would be a message that recommends 
> > doing VACUUM to remove the invisible entries from the index and repeat the 
> > check. (Generally, it is recommended to do vacuum before the check, but for 
> > the testing purpose I'd recommend turning it off to check the detection of 
> > visible-invisible-visible duplicates scenarios)

> You're going to have to "couple" buffer locks in the style of
> _bt_check_unique() (as well as keeping a buffer lock on "the first
> leaf page a duplicate might be on" throughout) if you need the test to
> work reliably.

The amcheck feature has no lock coupling at its "first key on the next page"
check.  I think that's fine, because amcheck takes one snapshot at the
beginning and looks for pairs of visible-to-that-snapshot heap tuples with the
same scan key.  _bt_check_unique(), unlike amcheck, must catch concurrent
inserts.  If amcheck "checkunique" wanted to detect duplicates that would
appear when all transactions commit, it would need lock coupling.  (I'm not
suggesting it do that.)  Do you see a problem with the lack of lock coupling
at "first key on the next page"?

> But why bother with that? The tool doesn't have to be
> 100% perfect at detecting corruption (nothing can be), and it's rather
> unlikely that it will matter for this test. A simple test that doesn't
> handle cross-page duplicates is still going to be very effective.

I agree, but perhaps the "first key on the next page" code is more complex
than general-case code would be.  If the lack of lock coupling is fine, then I
think memory context lifecycle is the only obstacle making index page
boundaries special.  Are there factors beyond that?  We already have
state->lowkey kept across pages via MemoryContextAlloc().  Similar lines of
code could preserve the scan key for checkunique, making the "first key on the
next page" code unnecessary.


 2. Raises runtime by 476% despite no dead tuples

I used the following to create a table larger than RAM, 17GB table and 10GB
index on a system with 12GB RAM:

\set count 500000000
begin;
set maintenance_work_mem = '1GB';
set client_min_messages = debug1;  -- debug2 is per-block spam
create temp table t as select n from generate_series(1,:count) t(n);
create unique index t_idx on t(n);
\dt+ t
\di+ t_idx
create extension amcheck;
select bt_index_check('t_idx', heapallindexed => false, checkunique => false);
select bt_index_check('t_idx', heapallindexed => false, checkunique => true);

Adding checkunique raised runtime from 58s to 276s, because it checks
visibility for every heap tuple.  It could do the heap fetch and visibility
check lazily, when the index yields two heap TIDs for one scan key.  That
should give zero visibility checks for this particular test case, and it
doesn't add visibility checks to bloated-table cases.  Pseudo-code:

/*---
 * scan_key is the last uniqueness-relevant scan key observed as
 * bt_check_level_from_leftmost() moves right to traverse the leaf 
level.
 * Will be NULL if the next tuple can't be the second tuple of a
 * uniqueness violation, because any of the following apply:
 * - we're evaluating the first leaf tuple of the entire index
 * - last scan key had anynullkeys (never forms a uniqueness violation 
w/
 *   any other scan key)
 */
scan_key = NULL;
/*
 * scan_key_known_visible==true indicates that scan_key_heap_tid is the
 * last _visible_ heap TID observed for scan_key.  Otherwise,
 * scan_key_heap_tid is the last heap TID observed for scan_key, and 
we've
 * not yet checked its visibility.
 */
bool scan_key_known_visible;
scan_key_heap_tid;
foreach itup (leftmost_leaf_level_tup .. rightmost_leaf_level_tup) {
if (itup.anynullkeys)
scan_key = NULL;
else if (scan_key != NULL &&
 

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-20 Thread Noah Misch
On Mon, Mar 18, 2024 at 07:40:10PM +0100, Alvaro Herrera wrote:
> I enabled the test again and also pushed the changes to dblink,
> isolationtester and fe_utils (which AFAICS is used by pg_dump,

I recommend adding a libpqsrv_cancel() function to libpq-be-fe-helpers.h, to
use from dblink and postgres_fdw.  pgxn modules calling PQcancel() from the
backend (citus pg_bulkload plproxy pmpp) then have a better chance to adopt
the new way.
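
For dblink, I'd picture something roughly like this in libpq-be-fe-helpers.h
(a sketch only; the name, signature, and timeout handling are mine, and the
error paths would need more care):

	/*
	 * Send a cancel request without blocking.  Returns NULL on success, else
	 * a palloc'd error message.  Uses the new PGcancelConn API.
	 */
	static inline char *
	libpqsrv_cancel(PGconn *conn, TimestampTz endtime)
	{
		PGcancelConn *cancel_conn = PQcancelCreate(conn);
		char	   *error = NULL;

		if (cancel_conn == NULL)
			return pstrdup("out of memory");

		if (!PQcancelStart(cancel_conn))
			error = pstrdup(PQcancelErrorMessage(cancel_conn));
		else
			for (;;)
			{
				PostgresPollingStatusType pollres = PQcancelPoll(cancel_conn);
				long		timeout;
				int			events = WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;

				if (pollres == PGRES_POLLING_OK)
					break;		/* cancel request sent */
				timeout = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
														  endtime);
				if (pollres == PGRES_POLLING_FAILED || timeout <= 0)
				{
					error = pstrdup(PQcancelErrorMessage(cancel_conn));
					break;
				}
				if (pollres == PGRES_POLLING_READING)
					events |= WL_SOCKET_READABLE;
				else
					events |= WL_SOCKET_WRITEABLE;
				(void) WaitLatchOrSocket(MyLatch, events,
										 PQcancelSocket(cancel_conn),
										 timeout, PG_WAIT_CLIENT);
				ResetLatch(MyLatch);
				CHECK_FOR_INTERRUPTS();
			}

		PQcancelFinish(cancel_conn);
		return error;
	}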




Re: Autogenerate some wait events code and documentation

2024-03-18 Thread Noah Misch
On Mon, Mar 18, 2024 at 09:04:44AM +, Bertrand Drouvot wrote:
> --- a/src/backend/utils/activity/wait_event_names.txt
> +++ b/src/backend/utils/activity/wait_event_names.txt
> @@ -24,7 +24,12 @@
>  #  SGML tables of wait events for inclusion in the documentation.
>  #
>  # When adding a new wait event, make sure it is placed in the appropriate
> -# ClassName section.
> +# ClassName section. If the wait event is backpatched to a version < 17 then
> +# put it under a "Backpatch" delimiter at the end of the related ClassName
> +# section.

Back-patch from v17 to pre-v17 won't use this, because v16 has hand-maintained
enums.  It's back-patch v18->v17 or v22->v17 where this will come up.

> +# Ensure that the wait events under the "Backpatch" delimiter are placed in 
> the
> +# same order as in the pre 17 wait_event_types.h (see VERSION_FILE_SYNC as an
> +# example).

I expect the normal practice will be to put the entry in its natural position
in git master, then put it in the backpatch section for any other branch.  In
other words, the backpatch regions are always empty in git master, and the
non-backpatch regions change in master only.
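
Concretely, using the format from Michael's example downthread, a released
branch would end up looking like (toy names):

	Section: ClassName - WaitEventFoo

	FOO_1	"Waiting in foo1."
	FOO_2	"Waiting in foo2."
	Backpatch:
	BAR_1	"Waiting in bar1."

while master carries BAR_1 in its natural, sorted position and leaves the
Backpatch region empty.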




Re: Autogenerate some wait events code and documentation

2024-03-17 Thread Noah Misch
On Mon, Mar 18, 2024 at 08:02:24AM +0900, Michael Paquier wrote:
> > 1. Don't back-patch wait events to v17+.  Use the closest existing event.
> > 2. Let wait_event_names.txt back-patches control the enum order.  For 
> > example,
> >a line could have an annotation that controls its position relative to 
> > the
> >auto-sorted lines.  For another example, the generator could stop 
> > sorting.
> > 3. Accept the renumbering, because the consequence isn't that horrible.

> I see an option (4), similar to your (2) without the per-line
> annotation: add a new magic keyword like the existing "Section" that
> is used in the first lines of generate-wait_event_types.pl where we
> generate tab-separated lines with the section name as prefix of each
> line.  So I can think of something like:
> Section: ClassName - WaitEventFoo
> FOO_1 "Waiting in foo1"
> FOO_2 "Waiting in foo2"
> Backpatch:
> BAR_1 "Waiting in bar1"
> BAR_2 "Waiting in bar2"
> 
> Then force the ordering for the docs and keep the elements in the
> backpatch section at the end of the enums in the order in the txt.
> One thing that could make sense is to enforce that "Backpatch" is at
> the end of a section, meaning that we would need a second keyword like
> a "Section: EndBackpatch" or something like that.  That's not strictly
> necessary IMO as the format of the txt is easy to follow.

Works for me, with or without the trailing keyword line.




Re: Autogenerate some wait events code and documentation

2024-03-17 Thread Noah Misch
On Wed, Jul 05, 2023 at 10:57:19AM +0900, Michael Paquier wrote:
> I have applied it.

I like the new developer experience of adding a wait event.  After release of
v17, how should we approach back-patching an event, like was done in commits
8fa4a1a 1396b5c 78c0f85?  Each of those commits put the new event at the end
of its released-branch wait_event.h enum.  In v17,
generate-wait_event_types.pl sorts events to position them.  Adding an event
will renumber others, which can make an extension report the wrong event until
recompiled.  Extensions citus, pg_bulkload, and vector refer to static events.
If a back-patch added WAIT_EVENT_MESSAGE_QUEUE_SOMETHING_NEW, an old-build
pg_bulkload report of WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN would show up in
pg_stat_activity as WAIT_EVENT_PARALLEL_BITMAP_SCAN.  (WAIT_EVENT_EXTENSION is
not part of a generated enum, fortunately.)  Some options:

1. Don't back-patch wait events to v17+.  Use the closest existing event.
2. Let wait_event_names.txt back-patches control the enum order.  For example,
   a line could have an annotation that controls its position relative to the
   auto-sorted lines.  For another example, the generator could stop sorting.
3. Accept the renumbering, because the consequence isn't that horrible.

Option (3) is worse than (1), but I don't have a recommendation between (1)
and (2).  I tend to like (1), a concern being the ease of accidental
violations.  If we had the ABI compliance checking that
https://postgr.es/m/flat/cah2-wzk7tvglxzoz8a22af-gmo5ghojwtyrvak5zgovtrce...@mail.gmail.com
explored, (1) would be plenty safe.  Should anything change here, or not?
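
A toy illustration of the renumbering hazard (my example, not actual
PostgreSQL enum members):

	/* v17.0 generated enum, as baked into a compiled extension */
	typedef enum { WE_ALPHA, WE_BRAVO, WE_DELTA } WaitEventToy;

	/*
	 * If v17.1 back-patches WE_CHARLIE and the generator keeps sorting,
	 * WE_DELTA moves from 2 to 3.  An extension built against 17.0 still
	 * reports 2, which a 17.1 server now decodes as WE_CHARLIE.
	 */
	typedef enum { WE_ALPHA, WE_BRAVO, WE_CHARLIE, WE_DELTA } WaitEventToy2;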




Re: 035_standby_logical_decoding unbounded hang

2024-03-07 Thread Noah Misch
On Thu, Mar 07, 2024 at 02:46:55PM +0500, Andrey M. Borodin wrote:
> I’m not sure if it is connected, but so far many patches in CFbot keep 
> hanging in this test. For example [0].

> [0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292 

Relevant part:

[22:03:10.292] stderr:
[22:03:10.292] # poll_query_until timed out executing this query:
[22:03:10.292] # SELECT (SELECT catalog_xmin::text::int - 770 from pg_catalog.pg_replication_slots where slot_name = 'injection_activeslot') > 0
[22:03:10.292] # expecting this output:
[22:03:10.292] # t
[22:03:10.292] # last actual query output:
[22:03:10.292] # f
[22:03:10.292] # with stderr:
[22:03:10.292] # Tests were run but no plan was declared and done_testing() was not seen.
[22:03:10.292] # Looks like your test exited with 255 just after 57.

The injection_activeslot test got added after $SUBJECT, in 08a52ab151
(2024-03-06).  It's now reverted in 65db0cfb4c (2024-03-07).




Re: Relation bulk write facility

2024-02-27 Thread Noah Misch
On Wed, Feb 28, 2024 at 12:24:01AM +0400, Heikki Linnakangas wrote:
> Here's a patch to fully remove AIX support.

> Subject: [PATCH 1/1] Remove AIX support
> 
> There isn't a lot of user demand for AIX support, no one has stepped
> up to the plate to properly maintain it, so it's best to remove it

Regardless of how someone were to step up to maintain it, we'd be telling them
such contributions have negative value and must stop.  We're expelling AIX due
to low demand, compiler bugs, its ABI, and its shlib symbol export needs.

> altogether. AIX is still supported for stable versions.
> 
> The acute issue that triggered this decision was that after commit
> 8af2565248, the AIX buildfarm members have been hitting this
> assertion:
> 
> TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, 
> buffer)"), File: "md.c", Line: 472, PID: 2949728
> 
> Apperently the "pg_attribute_aligned(a)" attribute doesn't work on AIX
> (linker?) for values larger than PG_IO_ALIGN_SIZE.

No; see https://postgr.es/m/20240225194322.a5%40rfd.leadboat.com




Re: Relation bulk write facility

2024-02-25 Thread Noah Misch
On Sun, Feb 25, 2024 at 04:34:47PM +0200, Heikki Linnakangas wrote:
> I agree with Andres though, that unless someone raises their hand and
> volunteers to properly maintain the AIX support, we should drop it.

There's no way forward in which AIX support stops doing net harm.  Even if AIX
enthusiasts intercepted would-be buildfarm failures and fixed them before
buildfarm.postgresql.org could see them, the damage from the broader community
seeing the AIX-specific code would outweigh the benefits of AIX support.  I've
now disabled the animals for v17+, though each may do one more run before
picking up the disable.

My upthread observation about xcoff section alignment was a red herring.  gcc
populates symbol-level alignment, and section-level alignment is unnecessary
if symbol-level alignment is correct.  The simplest workaround for $SUBJECT
AIX failure would be to remove the "const", based on the results of the
attached test program.  The pg_prewarm.c var is like al4096_static in the
outputs below, hence the lack of trouble there.  The bulk_write.c var is like
al4096_static_const_initialized.

 gcc 8.3.0
al4096   4096 @ 0x11000c000 (mod 0)
al4096_initialized   4096 @ 0x11fd0 (mod 4048 - BUG)
al4096_const 4096 @ 0x11000f000 (mod 0)
al4096_const_initialized 4096 @ 0x1cd00 (mod 3328 - BUG)
al4096_static4096 @ 0x110005000 (mod 0)
al4096_static_initialized4096 @ 0x110008000 (mod 0)
al4096_static_const  4096 @ 0x10c10 (mod 3088 - BUG)
al4096_static_const_initialized  4096 @ 0x13c10 (mod 3088 - BUG)
 xlc 12.01..
al4096   4096 @ 0x110008000 (mod 0)
al4096_initialized   4096 @ 0x110004000 (mod 0)
al4096_const 4096 @ 0x11000b000 (mod 0)
al4096_const_initialized 4096 @ 0x17000 (mod 0)
al4096_static4096 @ 0x11000e000 (mod 0)
al4096_static_initialized4096 @ 0x110001000 (mod 0)
al4096_static_const  4096 @ 0x110011000 (mod 0)
al4096_static_const_initialized  4096 @ 0x107d0 (mod 2000 - BUG)
 ibm-clang 17.1.1.2
al4096   4096 @ 0x110001000 (mod 0)
al4096_initialized   4096 @ 0x110004000 (mod 0)
al4096_const 4096 @ 0x11000 (mod 0)
al4096_const_initialized 4096 @ 0x15000 (mod 0)
al4096_static4096 @ 0x110008000 (mod 0)
al4096_static_initialized4096 @ 0x11000b000 (mod 0)
al4096_static_const  4096 @ 0x19000 (mod 0)
al4096_static_const_initialized  4096 @ 0x1d000 (mod 0)
#include <stdio.h>
#include <stdint.h>

/* add a byte so the compiler has to go out of its way to achieve alignment for
   consecutive instances */
typedef union PGIOAlignedBlock
{
__attribute__((aligned(4096)))
	char		data[1 + 8192];
	double		force_align_d;
	uint64_t	force_align_i64;
} PGIOAlignedBlock;

char	pad;

/* with and without each of: static, const, initializer */
PGIOAlignedBlock al4096;
PGIOAlignedBlock al4096_initialized = {{0}};
const PGIOAlignedBlock al4096_const;
const PGIOAlignedBlock al4096_const_initialized = {{0}};
static PGIOAlignedBlock al4096_static;
static PGIOAlignedBlock al4096_static_initialized = {{0}};
static const PGIOAlignedBlock al4096_static_const;
static const PGIOAlignedBlock al4096_static_const_initialized = {{0}};

/* in case last position is special */
static const PGIOAlignedBlock last;
static const PGIOAlignedBlock last_initialized = {{0}};

#define DUMP(want_align, var) dump_internal(#var, (want_align), &(var))

static void
dump_internal(const char *ident, unsigned want_align, const void *addr)
{
	unsigned	mod = (uintptr_t) addr % want_align;

printf("%-32s %4u @ 0x%p (mod %u%s)\n", ident, want_align, addr,
   mod, mod ? " - BUG" : "");
}

int
main(int argc, char **argv)
{
DUMP(4096, al4096);
DUMP(4096, al4096_initialized);
DUMP(4096, al4096_const);
DUMP(4096, al4096_const_initialized);
DUMP(4096, al4096_static);
DUMP(4096, al4096_static_initialized);
DUMP(4096, al4096_static_const);
DUMP(4096, al4096_static_const_initialized);
return 0;
}


Re: Relation bulk write facility

2024-02-24 Thread Noah Misch
On Sun, Feb 25, 2024 at 09:13:47AM +1300, Thomas Munro wrote:
> On Sun, Feb 25, 2024 at 9:12 AM Thomas Munro  wrote:
> > On Sun, Feb 25, 2024 at 8:50 AM Noah Misch  wrote:
> > > On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
> > > section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
> > > blockbuffer).  If I'm reading this right, neither AIX gcc nor xlc is 
> > > marking
> > > the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:
> >
> > Ah, that is a bit of a hazard that we should probably document.
> >
> > I guess the ideas to fix this would be: use smgrzeroextend() instead
> > of this coding, and/or perhaps look at the coding of pg_pwrite_zeros()
> > (function-local static) for any other place that needs such a thing,
> > if it would be satisfied by function-local scope?

True.  Alternatively, could arrange for "#define PG_O_DIRECT 0" on AIX, which
disables the alignment assertions (and debug_io_direct).

> Erm, wait, how does that function-local static object work differently?

I don't know specifically, but I expect they're different parts of the gcc
implementation.  Aligning an xcoff section may entail some xcoff-specific gcc
component.  Aligning a function-local object just changes the early
instructions of the function; it's independent of the object format.




Re: Relation bulk write facility

2024-02-24 Thread Noah Misch
On Sun, Feb 25, 2024 at 07:52:16AM +1300, Thomas Munro wrote:
> On Sun, Feb 25, 2024 at 6:24 AM Noah Misch  wrote:
> > On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
> > > Committed this. Thanks everyone!
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-02-24%2015%3A13%3A14 got:
> > TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 43188608
> >
> > with this stack trace:
> > #5  0x10005cf0 in ExceptionalCondition (conditionName=0x1015d790  "`", fileName=0x0, lineNumber=16780064) at assert.c:66
> > #6  0x102daba8 in mdextend (reln=0x1042628c , forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at md.c:472
> > #7  0x102d6760 in smgrextend (reln=0x306e6670, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at smgr.c:541
> > #8  0x104c8dac in smgr_bulk_flush (bulkstate=0x306e6000) at bulk_write.c:245
> 
> So that's:
> 
> static const PGIOAlignedBlock zero_buffer = {{0}};  /* worth BLCKSZ */
> 
> ...
> smgrextend(bulkstate->smgr, bulkstate->forknum,
>            bulkstate->pages_written++,
>            &zero_buffer,
>            true);
> 
> ... where PGIOAlignedBlock is:
> 
> typedef union PGIOAlignedBlock
> {
> #ifdef pg_attribute_aligned
> pg_attribute_aligned(PG_IO_ALIGN_SIZE)
> #endif
> char		data[BLCKSZ];
> ...
> 
> We see this happen with both xlc and gcc (new enough to know how to do
> this).  One idea would be that the AIX *linker* is unable to align it,
> as that is the common tool-chain component here (and unlike stack and
> heap objects, this scope is the linker's job).  There is a
> pre-existing example of a zero-buffer that is at file scope like that:
> pg_prewarm.c.  Perhaps it doesn't get tested?
> 
> Hmm.

GCC docs do say "For some linkers, the maximum supported alignment may be very
very small.", but AIX "man LD" says "data sections are aligned on a boundary
so as to satisfy the alignment of all CSECTs in the sections".  It also has -H
and -K flags to force some particular higher alignment.

On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
blockbuffer).  If I'm reading this right, neither AIX gcc nor xlc is marking
the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:

$ /opt/cfarm/binutils-latest/bin/objdump --section-headers ~/farm/*/HEAD/pgsqlkeep.*/src/backend/storage/smgr/bulk_write.o

/home/nm/farm/gcc64/HEAD/pgsqlkeep.2024-02-24_00-03-22/src/backend/storage/smgr/bulk_write.o: file format aix5coff64-rs6000

Sections:
Idx Name  Size  VMA   LMA   File off  Algn
  0 .text 277c      00f0  2**2
  CONTENTS, ALLOC, LOAD, RELOC, CODE
  1 .data 00e4  277c  277c  286c  2**3
  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .debug0001f7ea      2950  2**3
  CONTENTS

/home/nm/farm/xlc32/HEAD/pgsqlkeep.2024-02-24_15-12-23/src/backend/storage/smgr/bulk_write.o: file format aixcoff-rs6000

Sections:
Idx Name  Size  VMA   LMA   File off  Algn
  0 .text 0880      0180  2**2
  CONTENTS, ALLOC, LOAD, RELOC, CODE
  1 .data 410c  0880  0880  0a00  2**3
  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss    498c  498c    2**3
  ALLOC
  3 .debug8448      4b24  2**3
  CONTENTS
  4 .except   0018      4b0c  2**3
  CONTENTS, LOAD

$ /opt/cfarm/binutils-latest/bin/objdump --section-headers ~/farm/*/HEAD/pgsqlkeep.*/contrib/pg_prewarm/pg_prewarm.o

/home/nm/farm/gcc32/HEAD/pgsqlkeep.2024-01-21_03-16-12/contrib/pg_prewarm/pg_prewarm.o: file format aixcoff-rs6000

Sections:
Idx Name  Size  VMA   LMA   File off  Algn
  0 .text 0a6c      00b4  2**2
  CONTENTS, ALLOC, LOAD, RELOC, CODE
  1 .data 0044  0a6c  0a6c  0b20  2**3
  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss  2550  0ab0  0ab0    2**3
  ALLOC
  3 .debug0001c50e      0b64  2**3
  CONTENTS

/home/nm/farm/gcc64/HEAD/pgsqlkeep.2024-02-15_17-13-04/contrib/pg_p

Re: Relation bulk write facility

2024-02-24 Thread Noah Misch
On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
> Committed this. Thanks everyone!

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-02-24%2015%3A13%3A14 got:
TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 43188608

with this stack trace:
#5  0x10005cf0 in ExceptionalCondition (conditionName=0x1015d790  "`", fileName=0x0, lineNumber=16780064) at assert.c:66
#6  0x102daba8 in mdextend (reln=0x1042628c , forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at md.c:472
#7  0x102d6760 in smgrextend (reln=0x306e6670, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904) at smgr.c:541
#8  0x104c8dac in smgr_bulk_flush (bulkstate=0x306e6000) at bulk_write.c:245
#9  0x107baf24 in _bt_blwritepage (wstate=0x100d0a14 , buf=0x304f13b0, blkno=807631240) at nbtsort.c:638
#10 0x107bccd8 in _bt_buildadd (wstate=0x104c9384 , state=0x304eb190, itup=0xe10, truncextra=805686672) at nbtsort.c:984
#11 0x107bc86c in _bt_sort_dedup_finish_pending (wstate=0x3b6, state=0x19, dstate=0x3) at nbtsort.c:1036
#12 0x107bc188 in _bt_load (wstate=0x10, btspool=0x0, btspool2=0x0) at nbtsort.c:1331
#13 0x107bd4ec in _bt_leafbuild (btspool=0x101589fc , btspool2=0x0) at nbtsort.c:571
#14 0x107be028 in btbuild (heap=0x304d2a00, index=0x4e1f, indexInfo=0x3) at nbtsort.c:329
#15 0x1013538c in index_build (heapRelation=0x2, indexRelation=0x10bdc518 , indexInfo=0x2, isreindex=10, parallel=false) at index.c:3047
#16 0x101389e0 in index_create (heapRelation=0x1001aac0 , indexRelationName=0x20 , indexRelationId=804393376, parentIndexRelid=805686672, parentConstraintId=268544704, relFileNumber=805309688, indexInfo=0x3009a328, indexColNames=0x30237a20, accessMethodId=403, tableSpaceId=0, collationIds=0x304d29d8, opclassIds=0x304d29f8, opclassOptions=0x304d2a18, coloptions=0x304d2a38, reloptions=0, flags=0, constr_flags=0, allow_system_table_mods=false, is_internal=false, constraintId=0x2ff211b4) at index.c:1260
#17 0x1050342c in DefineIndex (tableId=19994, stmt=0x2ff21370, indexRelationId=0, parentIndexId=0, parentConstraintId=0, total_parts=0, is_alter_table=false, check_rights=true, check_not_in_use=true, skip_build=false, quiet=false) at indexcmds.c:1204
#18 0x104b4474 in ProcessUtilitySlow (pstate=, pstmt=0x3009a408, queryString=0x30099730 "CREATE INDEX dupindexcols_i ON dupindexcols (f1, id, f1 text_pattern_ops);",

If there are other ways I should poke at it, let me know.





Re: Removing unneeded self joins

2024-02-24 Thread Noah Misch
Hello,

On Sat, Feb 24, 2024 at 01:02:01PM +0200, Alexander Korotkov wrote:
> On Sat, Feb 24, 2024 at 7:12 AM Noah Misch  wrote:
> > On Sat, Feb 24, 2024 at 12:36:59AM +0200, Alexander Korotkov wrote:
> > > On Thu, Feb 22, 2024 at 10:51 AM Andrei Lepikhov 
> > >  wrote:
> > > > On 21/2/2024 14:26, Richard Guo wrote:
> > > > > I think the right fix for these issues is to introduce a new element
> > > > > 'sublevels_up' in ReplaceVarnoContext, and enhance 
> > > > > replace_varno_walker
> > > > > to 1) recurse into subselects with sublevels_up increased, and 2)
> > > > > perform the replacement only when varlevelsup is equal to 
> > > > > sublevels_up.
> > > > This code looks good. No idea how we have lost it before.
> > >
> > > Thanks to Richard for the patch and to Andrei for review.  I also find
> > > code looking good.  Pushed with minor edits from me.
> >
> > I feel this, commit 466979e, misses a few of our project standards:
> >
> > - The patch makes many non-whitespace changes to existing test queries.  
> > This
> >   makes it hard to review the consequences of the non-test part of the 
> > patch.
> >   Did you minimize such edits?  Of course, not every such edit is avoidable.
> >
> > - The commit message doesn't convey whether this is refactoring or is a bug
> >   fix.  This makes it hard to write release notes, among other things.  From
> >   this mailing list thread, I gather it's a bug fix in 489072ab7a, hence
> >   v17-specific.  The commit message for 489072ab7a is also silent about that
> >   commit's status as refactoring or as a bug fix.
> >
> > - Normally, I could answer the previous question by reading the test case
> >   diffs.  However, in addition to the first point about non-whitespace
> >   changes, the first three join.sql patch hunks just change whitespace.
> >   Worse, since they move line breaks, "git diff -w" doesn't filter them out.
> >
> > To what extent are those community standards vs. points of individual
> > committer preference?  Please tell me where I'm wrong here.
> 
> I agree that commit 466979e is my individual committer failure.  I
> should have written a better, more clear commit message and separate
> tests refactoring from the bug fix.
> 
> I'm not so sure about 489072ab7a (except it provides a wrong fix).  It
> has a "Reported-by:" field meaning it's a problem reported by a
> particular person.  The "Discussion:" points directly to the reported
> test case.  And commit contains the relevant test case.  The commit
> message could be more wordy though.

Agreed, the first and third points don't apply to 489072ab7a.  Thanks to that,
one can deduce from its new test case query that it fixes a bug.  It sounds
like we agree about commit 466979e, so that's good.




Re: incremental backup mishandles XLOG_DBASE_CREATE_FILE_COPY

2024-02-24 Thread Noah Misch
On Sat, Feb 24, 2024 at 04:16:24PM +0530, Robert Haas wrote:
> On Sat, Feb 24, 2024 at 10:05 AM Noah Misch  wrote:
> > On Fri, Feb 23, 2024 at 08:47:52PM +0530, Robert Haas wrote:
> > > I thought about whether there were any other WAL records that have
> > > similar problems to XLOG_DBASE_CREATE_FILE_COPY and didn't come up
> > > with anything. If anyone knows of any similar cases, please let me
> > > know.
> >
> > Regarding records the summarizer potentially can't ignore that don't deal in
> > relfilenodes, these come to mind:
> >
> > XLOG_DBASE_DROP - covered in this thread's patch
> > XLOG_RELMAP_UPDATE
> > XLOG_TBLSPC_CREATE
> > XLOG_TBLSPC_DROP
> > XLOG_XACT_PREPARE
> 
> At present, only relation data files are ever sent incrementally; I
> don't think any of these touch those.

Agreed, those don't touch relation data files.  I think you've got all
relation data file mutations.  XLOG_DBASE_CREATE_FILE_COPY and XLOG_DBASE_DROP
are the only record types that touch a relation data file without mentioning
it in XLogRecordBlockHeader, XACT_XINFO_HAS_RELFILELOCATORS, or an RM_SMGR_ID
rlocator field.

> > Also, any record that writes XIDs needs to update nextXid; likewise for 
> > other
> > ID spaces.  See the comment at "XLOG stuff" in heap_lock_tuple().  Perhaps 
> > you
> > don't summarize past a checkpoint, making that irrelevant.
> 
> I'm not quite following this. It's true that we summarize from one
> redo pointer to the next; but also, our summary is only trying to
> ascertain which relation data blocks have been modified. Therefore, I
> don't understand the relevance of nextXid here.

No relevance, given incremental backup is incremental with respect to relation
data blocks only.




Re: Removing unneeded self joins

2024-02-23 Thread Noah Misch
On Sat, Feb 24, 2024 at 12:36:59AM +0200, Alexander Korotkov wrote:
> On Thu, Feb 22, 2024 at 10:51 AM Andrei Lepikhov  
> wrote:
> > On 21/2/2024 14:26, Richard Guo wrote:
> > > I think the right fix for these issues is to introduce a new element
> > > 'sublevels_up' in ReplaceVarnoContext, and enhance replace_varno_walker
> > > to 1) recurse into subselects with sublevels_up increased, and 2)
> > > perform the replacement only when varlevelsup is equal to sublevels_up.
> > This code looks good. No idea how we have lost it before.
> 
> Thanks to Richard for the patch and to Andrei for review.  I also find
> code looking good.  Pushed with minor edits from me.

I feel this, commit 466979e, misses a few of our project standards:

- The patch makes many non-whitespace changes to existing test queries.  This
  makes it hard to review the consequences of the non-test part of the patch.
  Did you minimize such edits?  Of course, not every such edit is avoidable.

- The commit message doesn't convey whether this is refactoring or is a bug
  fix.  This makes it hard to write release notes, among other things.  From
  this mailing list thread, I gather it's a bug fix in 489072ab7a, hence
  v17-specific.  The commit message for 489072ab7a is also silent about that
  commit's status as refactoring or as a bug fix.

- Normally, I could answer the previous question by reading the test case
  diffs.  However, in addition to the first point about non-whitespace
  changes, the first three join.sql patch hunks just change whitespace.
  Worse, since they move line breaks, "git diff -w" doesn't filter them out.

To what extent are those community standards vs. points of individual
committer preference?  Please tell me where I'm wrong here.




Re: incremental backup mishandles XLOG_DBASE_CREATE_FILE_COPY

2024-02-23 Thread Noah Misch
On Fri, Feb 23, 2024 at 08:47:52PM +0530, Robert Haas wrote:
> If XLOG_DBASE_CREATE_FILE_COPY occurs between an incremental backup
> and its reference backup, every relation whose DB OID and tablespace
> OID match the corresponding values in that record should be backed up
> in full. Currently that's not happening, because the WAL summarizer
> doesn't see the XLOG_DBASE_CREATE_FILE_COPY as referencing any
> particular relfilenode and so basically ignores it. The same happens
> for XLOG_DBASE_CREATE_WAL_LOG, but that case is OK because that only
> covers creating the directory itself, not anything underneath it, and
> there will be separate WAL records telling us the relfilenodes created
> below the new directory and the pages modified therein.

XLOG_DBASE_CREATE_WAL_LOG creates PG_VERSION in addition to creating the
directory.  I see your patch covers it.

> I thought about whether there were any other WAL records that have
> similar problems to XLOG_DBASE_CREATE_FILE_COPY and didn't come up
> with anything. If anyone knows of any similar cases, please let me
> know.

Regarding records the summarizer potentially can't ignore that don't deal in
relfilenodes, these come to mind:

XLOG_DBASE_DROP - covered in this thread's patch
XLOG_RELMAP_UPDATE
XLOG_TBLSPC_CREATE
XLOG_TBLSPC_DROP
XLOG_XACT_PREPARE

Also, any record that writes XIDs needs to update nextXid; likewise for other
ID spaces.  See the comment at "XLOG stuff" in heap_lock_tuple().  Perhaps you
don't summarize past a checkpoint, making that irrelevant.

If walsummarizer.c handles any of the above, my brief look missed it.  I also
didn't find the string "clog" or "slru" anywhere in dc21234 "Add support for
incremental backup", 174c480 "Add a new WAL summarizer process.", or thread
https://postgr.es/m/flat/CA%2BTgmoYOYZfMCyOXFyC-P%2B-mdrZqm5pP2N7S-r0z3_402h9rsA%40mail.gmail.com
"trying again to get incremental backup".  I wouldn't be surprised if you
treat clog, pg_filenode.map, and/or 2PC state files as unconditionally
non-incremental, in which case some of the above doesn't need explicit
summarization code.  I stopped looking for that logic, though.

> --- a/src/backend/postmaster/walsummarizer.c
> +++ b/src/backend/postmaster/walsummarizer.c

> +  * Technically, this special handling is only needed in the case of
> +  * XLOG_DBASE_CREATE_FILE_COPY, because that can create a whole bunch
> +  * of relation files in a directory without logging anything
> +  * specific to each one. If we didn't mark the whole DB OID/TS OID
> +  * combination in some way, then a tablespace that was dropped after
s/tablespace/database/ I suspect.
> +  * the reference backup and recreated using the FILE_COPY method prior
> +  * to the incremental backup would look just like one that was never
> +  * touched at all, which would be catastrophic.




Re: 035_standby_logical_decoding unbounded hang

2024-02-19 Thread Noah Misch
On Fri, Feb 16, 2024 at 06:37:38AM +, Bertrand Drouvot wrote:
> On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote:
> > On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote:
> > > What about creating a sub, say wait_for_restart_lsn_calculation() in 
> > > Cluster.pm
> > > and then make use of it in create_logical_slot_on_standby() and above? 
> > > (something
> > > like wait_for_restart_lsn_calculation-v1.patch attached).
> > 
> > Waiting for restart_lsn is just a prerequisite for calling
> > pg_log_standby_snapshot(), so I wouldn't separate those two.
> 
> Yeah, makes sense.
> 
> > If we're
> > extracting a sub, I would move the pg_log_standby_snapshot() call into the 
> > sub
> > and make the API like one of these:
> > 
> >   $standby->wait_for_subscription_starting_point($primary, $slot_name);
> >   $primary->log_standby_snapshot($standby, $slot_name);
> > 
> > Would you like to finish the patch in such a way?
> 
> Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used
> log_standby_snapshot() as it seems more generic to me.

Thanks.  Pushed at commit 0e16281.




Re: 035_standby_logical_decoding unbounded hang

2024-02-15 Thread Noah Misch
On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote:
> On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> > The 035_standby_logical_decoding.pl hang is
> > a race condition arising from an event sequence like this:
> > 
> > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> > - Test script calls pg_log_standby_snapshot() on primary.  Emits 
> > XLOG_RUNNING_XACTS.
> > - checkpoint_timeout makes a primary checkpoint finish.  Emits 
> > XLOG_RUNNING_XACTS.
> > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits 
> > XLOG_RUNNING_XACTS.
> > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> > 
> > Other test code already has a solution for this, so the attached patches 
> > add a
> > timeout and copy the existing solution.  I'm also attaching the hack that
> > makes it 100% reproducible.

> I did a few tests and confirm that the proposed solution fixes the corner 
> case.

Thanks for reviewing.

> What about creating a sub, say wait_for_restart_lsn_calculation() in 
> Cluster.pm
> and then make use of it in create_logical_slot_on_standby() and above? 
> (something
> like wait_for_restart_lsn_calculation-v1.patch attached).

Waiting for restart_lsn is just a prerequisite for calling
pg_log_standby_snapshot(), so I wouldn't separate those two.  If we're
extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
and make the API like one of these:

  $standby->wait_for_subscription_starting_point($primary, $slot_name);
  $primary->log_standby_snapshot($standby, $slot_name);

Would you like to finish the patch in such a way?




Re: common signal handler protection

2024-02-14 Thread Noah Misch
On Mon, Dec 18, 2023 at 11:32:47AM -0600, Nathan Bossart wrote:
> rebased for cfbot

I took a look over each of these.  +1 for all.  Thank you.




035_standby_logical_decoding unbounded hang

2024-02-10 Thread Noah Misch
Coincidentally, one of my buildfarm animals hung for several weeks in a
different test, 035_standby_logical_decoding.pl.  A LOG_SNAPSHOT_INTERVAL_MS
reduction was part of making it reproducible:

On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
> On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > > If you look at the buildfarm's failures page and filter down to
> > > just subscriptionCheck failures, what you find is that all of the
> > > last 6 such failures are in 031_column_list.pl:

> > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> > reported a replacement BgWriterDelay value reproducing it.
> 
> Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
> BgWriterDelay.

I'm reusing this thread just in case there's overlap with the
031_column_list.pl cause and fix.  The 035_standby_logical_decoding.pl hang is
a race condition arising from an event sequence like this:

- Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
- Test script calls pg_log_standby_snapshot() on primary.  Emits 
XLOG_RUNNING_XACTS.
- checkpoint_timeout makes a primary checkpoint finish.  Emits 
XLOG_RUNNING_XACTS.
- bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
- CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.

Other test code already has a solution for this, so the attached patches add a
timeout and copy the existing solution.  I'm also attaching the hack that
makes it 100% reproducible.
Author: Noah Misch 
Commit: Noah Misch 

[not for commit] Demonstrate 035_standby_logical_decoding.pl hang.

There are three ways XLOG_RUNNING_XACTS can show up:
- SELECT pg_log_standby_snapshot()
- CHECKPOINT
- bgwriter LOG_SNAPSHOT_INTERVAL_MS

An idle primary won't do the second two.  If you're unlucky enough for
all three to happen before CREATE_REPLICATION_SLOT's
DecodingContextFindStartpoint() call starts watching, then
CREATE_REPLICATION_SLOT will hang indefinitely.

diff --git a/src/backend/postmaster/bgwriter.c 
b/src/backend/postmaster/bgwriter.c
index f11ce27..07f987c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -70,7 +70,7 @@ int   BgWriterDelay = 200;
  * Interval in which standby snapshots are logged into the WAL stream, in
  * milliseconds.
  */
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1
 
 /*
  * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c 
b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9270d7b..0f170a4 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -1085,6 +1085,7 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char 
*slotname,
appendStringInfoString(&cmd, " PHYSICAL RESERVE_WAL");
}
 
+   pg_usleep(1000 * 1000);
res = libpqrcv_PQexec(conn->streamConn, cmd.data);
pfree(cmd.data);
 
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl 
b/src/test/recovery/t/035_standby_logical_decoding.pl
index cebfa52..c371369 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -467,6 +467,7 @@ $psql_subscriber{run}->pump_nb();
 
 # Speed up the subscription creation
 $node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_primary->safe_psql('postgres', "CHECKPOINT");
 
 # Explicitly shut down psql instance gracefully - to avoid hangs
 # or worse on windows
Author: Noah Misch 
Commit: Noah Misch 

Bound waits in 035_standby_logical_decoding.pl.

One IPC::Run::start() used an IPC::Run::timer() without checking for
expiration.  The other used no timeout or timer.  Back-patch to v16,
which introduced the test.

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl 
b/src/test/recovery/t/035_standby_logical_decoding.pl
index c371369..61da915 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -21,7 +21,6 @@ my $node_cascading_standby =
   PostgreSQL::Test::Cluster->new('cascading_standby');
 my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
-my $psql_timeout = IPC::Run::timer($default_timeout);
 my $res;
 
 # Name for the physical slot on primary
@@ -90,7 +89,8 @@ sub make_slot_active
'>',
$to_std

Re: Popcount optimization using AVX512

2024-02-10 Thread Noah Misch
On Fri, Feb 09, 2024 at 08:33:23PM -0800, Andres Freund wrote:
> On 2024-02-09 15:27:57 -0800, Noah Misch wrote:
> > On Fri, Feb 09, 2024 at 10:24:32AM -0800, Andres Freund wrote:
> > > On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:
> > > > This suggests that finding a way to make the ifunc stuff work (with good
> > > > performance) is critical to this work.
> > > 
> > > Ifuncs are effectively implemented as a function call via a pointer, 
> > > they're
> > > not magic, unfortunately. The sole trick they provide is that you don't
> > > manually have to use the function pointer.
> > 
> > The IFUNC creators introduced it so glibc could use arch-specific memcpy 
> > with
> > the instruction sequence of a non-pointer, extern function call, not the
> > instruction sequence of a function pointer call.
> 
> My understanding is that the ifunc mechanism just avoid the need for repeated
> indirect calls/jumps to implement a single function call, not the use of
> indirect function calls at all. Calls into shared libraries, like libc, are
> indirected via the GOT / PLT, i.e. an indirect function call/jump.  Without
> ifuncs, the target of the function call would then have to dispatch to the
> resolved function. Ifuncs allow to avoid this repeated dispatch by moving the
> dispatch to the dynamic linker stage, modifying the contents of the GOT/PLT to
> point to the right function. Thus ifuncs are an optimization when calling a
> function in a shared library that's then dispatched depending on the cpu
> capabilities.
> 
> However, in our case, where the code is in the same binary, function calls
> implemented in the main binary directly (possibly via a static library) don't
> go through GOT/PLT. In such a case, use of ifuncs turns a normal direct
> function call into one going through the GOT/PLT, i.e. makes it indirect. The
> same is true for calls within a shared library if either explicit symbol
> visibility is used, or -symbolic, -Wl,-Bsymbolic or such is used. Therefore
> there's no efficiency gain of ifuncs over a call via function pointer.
> 
> 
> This isn't because ifunc is implemented badly or something - the reason for
> this is that dynamic relocations aren't typically implemented by patching all
> callsites (".text relocations"), which is what you would need to avoid the
> need for an indirect call to something that fundamentally cannot be a constant
> address at link time. The reason text relocations are disfavored is that
> they can make program startup quite slow, that they require allowing
> modifications to executable pages which are disliked due to the security
> implications, and that they make the code non-shareable, as the in-memory
> executable code has to differ from the on-disk code.
> 
> 
> I actually think ifuncs within the same binary are a tad *slower* than plain
> function pointer calls, unless -fno-plt is used. Without -fno-plt, an ifunc is
> called by 1) a direct call into the PLT, 2) loading the target address from
> the GOT, 3) making an indirect jump to that address.  Whereas a "plain
> indirect function call" is just 1) load target address from variable 2) making
> an indirect jump to that address. With -fno-plt the callsites themselves load
> the address from the GOT.

That sounds more accurate than what I wrote.  Thanks.




Re: Popcount optimization using AVX512

2024-02-09 Thread Noah Misch
On Fri, Feb 09, 2024 at 10:24:32AM -0800, Andres Freund wrote:
> On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:
> > This suggests that finding a way to make the ifunc stuff work (with good
> > performance) is critical to this work.
> 
> Ifuncs are effectively implemented as a function call via a pointer, they're
> not magic, unfortunately. The sole trick they provide is that you don't
> manually have to use the function pointer.

The IFUNC creators introduced it so glibc could use arch-specific memcpy with
the instruction sequence of a non-pointer, extern function call, not the
instruction sequence of a function pointer call.  I don't know why the
upthread ifunc_test.patch benchmark found ifunc performing worse than function
pointers.  However, it would be odd if toolchains have replaced the original
IFUNC with something equivalent to or slower than function pointers.




Re: Should we remove -Wdeclaration-after-statement?

2024-02-07 Thread Noah Misch
On Mon, Jan 29, 2024 at 04:03:44PM +0100, Jelte Fennema-Nio wrote:
> I feel like this is the type of change where there's not much
> discussion to be had. And the only way to resolve it is to use some
> voting to gauge community opinion.
> 
> So my suggestion is for people to respond with -1, -0.5, +-0, +0.5, or
> +1 to indicate support against/for the change.

I'm +1 for the change, for these reasons:

- Fewer back-patch merge conflicts.  The decls section of long functions is a
  classic conflict point.
- A mid-func decl demonstrates that its var is unused in the first half of the
  func (see the sketch below).
- We write Perl in the mixed decls style, without problems.

For me personally, the "inconsistency" concern is negligible.  We allowed "for
(int i = 0", and that inconsistency has been invisible to me.
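
To illustrate the second point, a contrived sketch (not from any patch):

  #include <assert.h>

  static int
  count_matches(const int *vals, int nvals, int key)
  {
      assert(vals != NULL && nvals >= 0);     /* statements come first */

      int nmatches = 0;       /* declared here: provably unused above */

      for (int i = 0; i < nvals; i++)
          if (vals[i] == key)
              nmatches++;
      return nmatches;
  }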




Re: Draft release notes for minor releases are up

2024-02-04 Thread Noah Misch
On Sun, Feb 04, 2024 at 01:13:53PM -0500, Tom Lane wrote:
> I now have this text for your CREATE DATABASE fixes:
> 
>  
>   Ensure durability of CREATE DATABASE (Noah Misch)
>  
> 
>  
>   If an operating system crash occurred during or shortly
>   after CREATE DATABASE, recovery could fail, or
>   subsequent connections to the new database could fail.  If a base
>   backup was taken in that window, similar problems could be observed
>   when trying to use the backup.  The symptom would be that the
>   database directory, PG_VERSION file, or
>   pg_filenode.map file was missing or empty.
>  

Thanks for updating it; this text works for me.

> This is ignoring the point that the set of observable symptoms might
> differ between the OS crash and base-backup-recovery cases, but
> I'm not sure that that's actually true, and in any case I don't think
> it matters for the release notes.

I agree with stopping short of adding that detail; it wouldn't help users make
a known decision.




Re: Draft release notes for minor releases are up

2024-02-02 Thread Noah Misch
On Fri, Feb 02, 2024 at 08:18:50PM -0500, Tom Lane wrote:
> Noah Misch  writes:
> > Shall the top of the notes advise to reindex GIN indexes?
> 
> I thought about that, but it's a pretty low-probability failure
> I think, so I didn't write that advice.  Maybe I misjudged it.

I can see there being failures so low-probability as to omit that text, e.g. a
failure identified theoretically and requiring a process to lose the CPU for
hours.  For this one, the reporter seems to have arrived at it without a
deliberate search.  This one just needs a recovery at the right WAL record,
then two processes reaching the incomplete split concurrently.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Noah Misch
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > If you look at the buildfarm's failures page and filter down to
> > just subscriptionCheck failures, what you find is that all of the
> > last 6 such failures are in 031_column_list.pl:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
> > 
> > There are some further back too:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
> > 
> > but this definitely got way more common in the last few days.
> 
> > I don't see anything that 031_column_list.pl is doing that is much
> > different from other subscription tests, so why is it the only one
> > failing?  And more to the point, what's going wrong exactly?
> 
> I don't know, but
> https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> reported a replacement BgWriterDelay value reproducing it.

Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
BgWriterDelay.

> That hasn't reproduced it in ~10 runs on my machine, though.

After 207 successes, it did fail once for me.

> > I am suspicious that this somehow represents a failure of the
> > historical catalog decoding logic, but I don't see how that theory
> > explains this only breaking in one test script.




Re: Draft release notes for minor releases are up

2024-02-02 Thread Noah Misch
On Fri, Feb 02, 2024 at 12:54:48PM -0500, Tom Lane wrote:
> First-draft release notes for 16.2 are available at
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=87dcc5e45fad3021514f01360d3a2abd4e6480ee

> +
> +
> + 
> +  Fix insufficient locking when cleaning up an incomplete split of
> +  a GIN index's internal page (Fei Changhong, Heikki Linnakangas)
> + 
> +
> + 
> +  The code tried to do this with shared rather than exclusive lock on
> +  the buffer.  This could lead to index corruption if two processes
> +  attempted the cleanup concurrently.
> + 
> +

Shall the top of the notes advise to reindex GIN indexes?

> +
> +
> + 
> +  Add more interlocks between CREATE DATABASE and
> +  base backup (Noah Misch)
> + 
> +
> + 
> +  This fixes some cases where a base backup taken concurrently
> +  with CREATE DATABASE could produce a corrupt
> +  image of the new database.
> + 
> +

Things I'd like to capture for this one:

- Commit 0b6517a3b deals with crash recovery, not base backups.
- Connection establishment will fail if one of these bugs corrupted the
  database, so there's no need to worry about silent corruption.  (My commit
  messages didn't make that clear.)

Perhaps like this:

diff --git a/doc/src/sgml/release-16.sgml b/doc/src/sgml/release-16.sgml
index 21387e3..8997279 100644
--- a/doc/src/sgml/release-16.sgml
+++ b/doc/src/sgml/release-16.sgml
@@ -750,15 +750,15 @@ Branch: REL_16_STABLE [48a6bf5c4] 2024-02-01 13:44:22 
-0800
 Branch: REL_15_STABLE [8fa4a1ac6] 2024-02-01 13:44:23 -0800
 -->
  
-  Add more interlocks between CREATE DATABASE and
-  base backup (Noah Misch)
+  Fix durability of CREATE DATABASE (Noah Misch)
  
 
  
-  This fixes some cases where a base backup taken concurrently
-  with CREATE DATABASE could produce a corrupt
-  image of the new database.
+  Recovery failed, or establishing connections to the new database failed.
+  Effects required an operating system crash or base backup, concurrently
+  with or shortly after the CREATE DATABASE.
  
+
 
 
 




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Noah Misch
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> If you look at the buildfarm's failures page and filter down to
> just subscriptionCheck failures, what you find is that all of the
> last 6 such failures are in 031_column_list.pl:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
> 
> There are some further back too:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
> 
> but this definitely got way more common in the last few days.

> I don't see anything that 031_column_list.pl is doing that is much
> different from other subscription tests, so why is it the only one
> failing?  And more to the point, what's going wrong exactly?

I don't know, but
https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
reported a replacement BgWriterDelay value reproducing it.  That hasn't
reproduced it in ~10 runs on my machine, though.

> I am suspicious that this somehow represents a failure of the
> historical catalog decoding logic, but I don't see how that theory
> explains this only breaking in one test script.




Re: dblink query interruptibility

2024-01-26 Thread Noah Misch
On Thu, Jan 25, 2024 at 12:28:43PM +0900, Fujii Masao wrote:
> On Thu, Jan 25, 2024 at 5:45 AM Noah Misch  wrote:
> > On Thu, Jan 25, 2024 at 04:23:39AM +0900, Fujii Masao wrote:
> > > I found that this patch was committed at d3c5f37dd5 and changed the
> > > error message in postgres_fdw slightly. Here's an example:
> > >
> > > #1. Begin a new transaction.
> > > #2. Execute a query accessing to a foreign table, like SELECT * FROM
> > > 
> > > #3. Terminate the *remote* session corresponding to the foreign table.
> > > #4. Commit the transaction, and then currently the following error
> > > message is output.
> > >
> > > ERROR:  FATAL:  terminating connection due to administrator command
> > > server closed the connection unexpectedly
> > > This probably means the server terminated abnormally
> > > before or while processing the request.
> > >invalid socket
> > >
> > > Previously, before commit d3c5f37dd5, the error message at #4 did not
> > > include "invalid socket." Now, after the commit, it does.

Other clients have witnessed the extra "invalid socket" message:
https://dba.stackexchange.com/questions/335081/how-to-investigate-intermittent-postgres-connection-errors
https://stackoverflow.com/questions/77781358/pgbackrest-backup-error-with-exit-code-57
https://github.com/timescale/timescaledb/issues/102

> > > + /* Consume whatever data is available from the socket */
> > > + if (PQconsumeInput(conn) == 0)
> > > + {
> > > + /* trouble; expect PQgetResult() to return NULL */
> > > + break;
> > > + }
> > > + }
> > > +
> > > + /* Now we can collect and return the next PGresult */
> > > + return PQgetResult(conn);
> > >
> > > This code appears to cause the change. When the remote session ends,
> > > PQconsumeInput() returns 0 and marks conn->socket as invalid.
> > > Subsequent PQgetResult() calls pqWait(), detecting the invalid socket
> > > and appending "invalid socket" to the error message.

What do you think of making PQconsumeInput() set PGASYNC_READY and
CONNECTION_BAD in this case?  Since libpq appended "server closed the
connection unexpectedly", it knows those indicators are correct.  That way,
PQgetResult() won't issue a pointless pqWait() call.
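
A hypothetical sketch of that suggestion, applied to PQconsumeInput()'s
read-failure path in fe-exec.c (field names per libpq-int.h; whether this is
safe for every existing caller is exactly the question):

  if (pqReadData(conn) < 0)
  {
      conn->status = CONNECTION_BAD;      /* libpq knows the connection died */
      conn->asyncStatus = PGASYNC_READY;  /* lets PQgetResult() skip pqWait() */
      return 0;
  }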

> > > I think the "invalid socket" message is unsuitable in this scenario,
> > > and PQgetResult() should not be called after PQconsumeInput() returns
> > > 0. Thought?
> >
> > The documentation is absolute about the necessity of PQgetResult():
> 
> The documentation looks unclear to me regarding what should be done
> when PQconsumeInput() returns 0. So I'm not sure if PQgetResult()
> must be called even in that case.

I agree PQconsumeInput() docs don't specify how to interpret it returning 0.

> As far as I read some functions like libpqrcv_PQgetResult() that use
> PQconsumeInput(), it appears that they basically report the error message
> using PQerrorMessage(), without calling PQgetResult(),
> when PQconsumeInput() returns 0.

libpqrcv_PQgetResult() is part of walreceiver, where any ERROR becomes FATAL.
Hence, it can't hurt anything by eagerly skipping to ERROR.  I designed
libpqsrv_exec() to mimic PQexec() as closely as possible, so it would be a
drop-in replacement for arbitrary callers.  Ideally, accepting interrupts
would be the only caller-visible difference.

I know of three ways PQconsumeInput() can return 0, along with my untested
estimates of how they work:

a. Protocol violation.  handleSyncLoss() sets PGASYNC_READY and
   CONNECTION_BAD.  PQgetResult() is optional.

b. Connection broken.  PQgetResult() is optional.

c. ENOMEM.  PGASYNC_ and CONNECTION_ status don't change.  Applications choose
   among (c1) free memory and retry, (c2) close the connection, or (c3) call
   PQgetResult() to break protocol sync and set PGASYNC_IDLE:

Comparing PQconsumeInput() with the PQgetResult() block under "while
(conn->asyncStatus == PGASYNC_BUSY)", there's a key difference that
PQgetResult() sets PGASYNC_IDLE on most errors, including ENOMEM.  That
prevents PQexec() subroutine PQexecFinish() from busy looping on ENOMEM, but I
suspect that also silently breaks protocol sync.  While we could change it
from (c3) to (c2) by dropping the connection via handleSyncLoss() or
equivalent, I'm not confident about that being better.

libpqsrv_exec() implements (c3) by way of calling PQgetResult() after
PQconsumeInput() fails.  If PQisBusy(), the same ENOMEM typically will repeat,
yielding (c3).  If memory became available in that brief time, PQgetResult()
may instead block.  That blocking is unwanted but unimportant.




Re: dblink query interruptibility

2024-01-24 Thread Noah Misch
On Thu, Jan 25, 2024 at 04:23:39AM +0900, Fujii Masao wrote:
> I found that this patch was committed at d3c5f37dd5 and changed the
> error message in postgres_fdw slightly. Here's an example:
> 
> #1. Begin a new transaction.
> #2. Execute a query accessing to a foreign table, like SELECT * FROM
> 
> #3. Terminate the *remote* session corresponding to the foreign table.
> #4. Commit the transaction, and then currently the following error
> message is output.
> 
> ERROR:  FATAL:  terminating connection due to administrator command
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>invalid socket
> 
> Previously, before commit d3c5f37dd5, the error message at #4 did not
> include "invalid socket." Now, after the commit, it does. Is this
> change intentional?

No.  It's a consequence of an intentional change in libpq call sequence, but I
was unaware I was changing an error message.

> + /* Consume whatever data is available from the socket */
> + if (PQconsumeInput(conn) == 0)
> + {
> + /* trouble; expect PQgetResult() to return NULL */
> + break;
> + }
> + }
> +
> + /* Now we can collect and return the next PGresult */
> + return PQgetResult(conn);
> 
> This code appears to cause the change. When the remote session ends,
> PQconsumeInput() returns 0 and marks conn->socket as invalid.
> Subsequent PQgetResult() calls pqWait(), detecting the invalid socket
> and appending "invalid socket" to the error message.
> 
> I think the "invalid socket" message is unsuitable in this scenario,
> and PQgetResult() should not be called after PQconsumeInput() returns
> 0. Thought?

The documentation is absolute about the necessity of PQgetResult():

  PQsendQuery cannot be called again (on the same connection) until
  PQgetResult has returned a null pointer, indicating that the command is
  done.

  PQgetResult must be called repeatedly until it returns a null pointer,
  indicating that the command is done. (If called when no command is active,
  PQgetResult will just return a null pointer at once.)

Similar statements also appear in libpq-pipeline-results,
libpq-pipeline-errors, and libpq-copy.
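
That is, even after trouble, the canonical drain loop still applies; a minimal
sketch:

  #include <libpq-fe.h>

  /* drain all pending results, per the documentation quoted above */
  static void
  drain_results(PGconn *conn)
  {
      PGresult   *res;

      while ((res = PQgetResult(conn)) != NULL)
          PQclear(res);       /* one could inspect PQresultStatus() here */
  }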


So, unless the documentation or my reading of it is wrong there, I think the
answer is something other than skipping PQgetResult().  Perhaps PQgetResult()
should not append "invalid socket" in this case?  The extra line is a net
negative, though it's not wrong and not awful.

Thanks for reporting the change.




Re: Recovering from detoast-related catcache invalidations

2024-01-14 Thread Noah Misch
On Fri, Jan 12, 2024 at 03:47:13PM -0500, Tom Lane wrote:
> I wrote:
> > This is uncomfortably much in bed with the tuple table slot code,
> > perhaps, but I don't see a way to do it more cleanly unless we want
> > to add some new provisions to that API.  Andres, do you have any
> > thoughts about that?
> 
> Oh!  After nosing around a bit more I remembered systable_recheck_tuple,
> which is meant for exactly this purpose.  So v4 attached.

systable_recheck_tuple() is blind to heap_inplace_update(), so it's not a
general proxy for invalidation messages.  The commit for $SUBJECT (ad98fb1)
doesn't create any new malfunctions, but I expect the systable_recheck_tuple()
part will change again before the heap_inplace_update() story is over
(https://postgr.es/m/flat/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com).




Re: broken master regress tests

2023-12-12 Thread Noah Misch
On Tue, Oct 10, 2023 at 05:54:34PM -0700, Andres Freund wrote:
> On 2023-10-10 17:08:25 -0700, Jeff Davis wrote:
> > After this, it seems "make check" no longer picks up the locale from
> > the system environment by default.
> 
> Yea. I wonder if the better fix would have been to copy setenv("LC_MESSAGES", 
> "C", 1);
> to the initdb template creation. That afaict also fixes the issue, with a
> smaller blast radius?

+1, that would restore the testing semantics known from v16-.  I think the
intent of the template was to optimize without changing semantics, and the
above proposal aligns with that.  Currently, for v17 alone, one needs
installcheck or build file hacks to reproduce a locale-specific test failure.

An alternative would be to declare that the tests are supported in one
encoding+locale only, then stop testing others in the buildfarm.  Even if we
did that, I'm fairly sure we'd standardize on UTF8, not SQL_ASCII, as the one
testable encoding.




Re: Is WAL_DEBUG related code still relevant today?

2023-12-07 Thread Noah Misch
On Thu, Dec 07, 2023 at 04:50:30PM +0530, Bharath Rupireddy wrote:
> On Mon, Dec 4, 2023 at 12:37 AM Noah Misch  wrote:
> > On Sun, Dec 03, 2023 at 08:30:24PM +0530, Bharath Rupireddy wrote:
> > > On Sun, Dec 3, 2023 at 4:00 AM Nathan Bossart  
> > > wrote:
> > > > On Sat, Dec 02, 2023 at 07:36:29PM +0530, Bharath Rupireddy wrote:

> The interesting pieces that WAL_DEBUG code does are the following:
> 
> 1. Decodes the WAL record right after it's written to WAL buffers in
> XLogInsertRecord. What problem does it help to detect?

I think it helped me understand why a test case I was writing didn't reach the
bug I expected it to reach.

> 2. Emits a log message for every WAL record applied in the main redo
> apply loop. Enabling this isn't cheap for sure even for developer
> environments (I've observed a 10% increase in recovery test time)
> 3. Emits log messages for WAL writes/flushes and WAL buffer page
> initializations. These messages don't have to be hiding under a macro,
> but a DEBUGX level is sufficient.
> 
> > > > > I have no idea if anyone uses WAL_DEBUG macro and wal_debug GUCs in
> > > > > production, if we have somebody using it, I think we need to fix the
> >
> > I don't use it in production, but I use it more than any other of our many
> > DEBUG macros.
> 
> I'm just curious to know what sorts of problems WAL_DEBUG code helps
> debug with. Is the WAL_DEBUG code (1) or (2) or (3) that helped you
> the most?

For me, (1) and (2) came up several times, and (3) came up once.  I don't
remember which of (1) or (2) helped more.

> Is converting the LOG messages (3) to DEBUGX level going to
> help in your case?

Not in my case.




Re: reindexing an invalid index should not use ERRCODE_INDEX_CORRUPTED

2023-12-06 Thread Noah Misch
On Wed, Dec 06, 2023 at 03:17:12PM +0900, Michael Paquier wrote:
> On Sat, Nov 18, 2023 at 04:32:36PM -0800, Noah Misch wrote:
> > On Sat, Nov 18, 2023 at 03:09:58PM -0800, Andres Freund wrote:
> >> Unfortunately, there is a case of such an sqlstate that's not at all 
> >> indicating
> >> corruption, namely REINDEX CONCURRENTLY when the index is invalid:
> >> 
> >> if (!indexRelation->rd_index->indisvalid)
> >> ereport(WARNING,
> >> (errcode(ERRCODE_INDEX_CORRUPTED),
> >>  errmsg("cannot reindex invalid index 
> >> \"%s.%s\" concurrently, skipping",
> >> 
> >> get_namespace_name(get_rel_namespace(cellOid)),
> >> get_rel_name(cellOid;
> >> 
> >> The only thing required to get to this is an interrupted CREATE INDEX
> >> CONCURRENTLY, which I don't think can be fairly characterized as 
> >> "corruption".
> >> 
> >> ISTM something like ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE would be more
> >> appropriate?
> > 
> > +1, that's a clear improvement.
> 
> The same thing can be said a couple of lines above where the code uses
> ERRCODE_FEATURE_NOT_SUPPORTED but your suggestion of
> ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE would be better.
> 
> Would the attached be OK for you?

Okay.

> > The "cannot" part of the message is also inaccurate, and it's not clear to 
> > me
> > why we have this specific restriction at all.  REINDEX INDEX CONCURRENTLY
> > accepts such indexes, so I doubt it's an implementation gap.
> 
> If you would reword that, what would you change?

I'd do "skipping reindex of invalid index \"%s.%s\"".  If one wanted more,
errhint("Use DROP INDEX or REINDEX INDEX.") would fit.




Re: Is WAL_DEBUG related code still relevant today?

2023-12-03 Thread Noah Misch
On Sun, Dec 03, 2023 at 08:30:24PM +0530, Bharath Rupireddy wrote:
> On Sun, Dec 3, 2023 at 4:00 AM Nathan Bossart  
> wrote:
> > On Sat, Dec 02, 2023 at 07:36:29PM +0530, Bharath Rupireddy wrote:
> > > b) Remove both the WAL_DEBUG macro and the wal_debug GUC. I don't
> > > think (2) is needed to be in core especially when tools like
> > > pg_walinspect and pg_waldump can do the same job. And, the messages in
> > > (3) and (4) can be turned to some DEBUGX level without being under the
> > > WAL_DEBUG macro.
> >
> > Is there anything provided by wal_debug that can't be found via
> > pg_walinspect/pg_waldump?
> 
> I don't think so. The WAL record decoding can be achieved with
> pg_walinspect or pg_waldump.

Can be, but the WAL_DEBUG model is mighty convenient:
- Cooperates with backtrace_functions
- Change log_line_prefix to correlate any log_line_prefix fact with WAL records
- See WAL records interleaved with non-WAL log messages

> > > I have no idea if anyone uses WAL_DEBUG macro and wal_debug GUCs in
> > > production, if we have somebody using it, I think we need to fix the

I don't use it in production, but I use it more than any other of our many
DEBUG macros.

> > > compilation and test failure issues, and start testing this code
> > > (perhaps I can think of setting up a buildfarm member to help here).

WAL_DEBUG compiles and works just fine on GNU/Linux.  I'm not surprised the
failure to compile on Windows has escaped notice, because Windows-specific WAL
behaviors are so rare.  We consistently do our WAL-related development on
non-Windows.  Needless to say, I wouldn't object to fixing WAL_DEBUG for
Windows.

Fixing tests is less valuable, especially since it's clear when a test fails
through extra messages the test didn't expect.  I bet other DEBUG macros make
some tests fail that way, which doesn't devalue those macros.  A test patch
might be okay nonetheless, but a buildfarm member is more likely to have
negative value.  It would create urgent work.  In the hypothetical buildfarm
member's absence, the project would be just fine if that work never happens.
A buildfarm member that compiles but doesn't test could be okay.




dblink query interruptibility

2023-11-21 Thread Noah Misch
=== Background

Something as simple as the following doesn't respond to cancellation.  In
v15+, any DROP DATABASE will hang as long as it's running:

  SELECT dblink_exec(
$$dbname='$$||current_database()||$$' port=$$||current_setting('port'),
'SELECT pg_sleep(15)');

https://postgr.es/m/4b584c99.8090...@enterprisedb.com proposed a fix back in
2010.  Latches and the libpqsrv facility have changed the server programming
environment since that patch.  The problem statement also came up here:

On Thu, Dec 08, 2022 at 06:08:15PM -0800, Andres Freund wrote:
> dblink.c uses a lot of other blocking libpq functions, which obviously also
> isn't ok.


=== Key decisions

This patch adds to libpqsrv facility.  It dutifully follows the existing
naming scheme.  For greppability, I am favoring renaming new and old functions
such that the libpq name is a substring of this facility's name.  That is,
rename libpqsrv_disconnect to srvPQfinish or maybe libpqsrv_PQfinish().  Now
is better than later, while pgxn contains no references to libpqsrv.  Does
anyone else have a preference between naming schemes?  If nobody does, I'll
keep today's libpqsrv_disconnect() style.

I was tempted to add a timeout argument to each libpqsrv function, which would
allow libpqsrv_get_result_last() to replace pgfdw_get_cleanup_result().  We
can always add a timeout-accepting function later and make this thread's
function name a thin wrapper around it.  Does anyone feel a mandatory timeout
argument, accepting -1 for no timeout, would be the right thing?
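
For concreteness, a sketch of that shape, with today's function becoming a
thin wrapper (the _timed name and the millisecond unit are illustrative, not
settled):

  /* hypothetical: wait at most timeout_ms; -1 means wait indefinitely */
  extern PGresult *libpqsrv_get_result_last_timed(PGconn *conn,
                                                  uint32 wait_event_info,
                                                  long timeout_ms);

  static inline PGresult *
  libpqsrv_get_result_last(PGconn *conn, uint32 wait_event_info)
  {
      return libpqsrv_get_result_last_timed(conn, wait_event_info, -1);
  }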


=== Minor topics

It would be nice to replace libpqrcv_PQgetResult() and friends with the new
functions.  I refrained since they use ProcessWalRcvInterrupts(), not
CHECK_FOR_INTERRUPTS().  Since walreceiver already reaches
CHECK_FOR_INTERRUPTS() via libpqsrv_connect_params(), things might just work.

This patch contains a PQexecParams() wrapper, called nowhere in
postgresql.git.  It's inessential, but twelve pgxn modules do mention
PQexecParams.  Just one mentions PQexecPrepared, and none mention PQdescribe*.

The patch makes postgres_fdw adopt its functions, as part of confirming the
functions are general enough.  postgres_fdw create_cursor() has been passing
the "DECLARE CURSOR FOR inner_query" string for some error messages and just
inner_query for others.  I almost standardized on the longer one, but the test
suite checks that.  Hence, I standardized on just inner_query.

I wrote this because pglogical needs these functions to cooperate with v15+
DROP DATABASE (https://github.com/2ndQuadrant/pglogical/issues/418).

Thanks,
nm
Author: Noah Misch 
Commit: Noah Misch 

Make dblink interruptible, via new libpqsrv APIs.

This replaces dblink's blocking libpq calls, allowing cancellation and
allowing DROP DATABASE (of a database not involved in the query).  Apart
from explicit dblink_cancel_query() calls, dblink still doesn't cancel
the remote side.  The replacement for the blocking calls consists of
new, general-purpose query execution wrappers in the libpqsrv facility.
Out-of-tree extensions should adopt these.  Use them in postgres_fdw,
replacing a local implementation from which the libpqsrv implementation
derives.  This is a bug fix for dblink.  Code inspection identified the
bug at least thirteen years ago, but user complaints have not appeared.
Hence, no back-patch for now.

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 195b278..4624e53 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -133,6 +133,7 @@ static HTAB *remoteConnHash = NULL;
 /* custom wait event values, retrieved from shared memory */
 static uint32 dblink_we_connect = 0;
 static uint32 dblink_we_get_conn = 0;
+static uint32 dblink_we_get_result = 0;
 
 /*
  * Following is list that holds multiple remote connections.
@@ -252,6 +253,9 @@ dblink_init(void)
 {
if (!pconn)
{
+   if (dblink_we_get_result == 0)
+   dblink_we_get_result = 
WaitEventExtensionNew("DblinkGetResult");
+
pconn = (remoteConn *) MemoryContextAlloc(TopMemoryContext, 
sizeof(remoteConn));
pconn->conn = NULL;
pconn->openCursorCount = 0;
@@ -442,7 +446,7 @@ dblink_open(PG_FUNCTION_ARGS)
/* If we are not in a transaction, start one */
if (PQtransactionStatus(conn) == PQTRANS_IDLE)
{
-   res = PQexec(conn, "BEGIN");
+   res = libpqsrv_exec(conn, "BEGIN", dblink_we_get_result);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
dblink_res_internalerror(conn, res, "begin error");
PQclear(res);
@@ -461,7 +465,7 @@ dblink_open(PG_FUNCTION_ARGS)
(rconn->openCursorCount)++;
 
appendStringInfo(&buf, "DECLARE

Re: reindexing an invalid index should not use ERRCODE_INDEX_CORRUPTED

2023-11-18 Thread Noah Misch
On Sat, Nov 18, 2023 at 03:09:58PM -0800, Andres Freund wrote:
> We currently provide no way to learn about a postgres instance having
> corruption than searching the logs for corruption events than matching by
> sqlstate, for ERRCODE_DATA_CORRUPTED and ERRCODE_INDEX_CORRUPTED.
> 
> Unfortunately, there is a case of such an sqlstate that's not at all 
> indicating
> corruption, namely REINDEX CONCURRENTLY when the index is invalid:
> 
> if (!indexRelation->rd_index->indisvalid)
> ereport(WARNING,
> (errcode(ERRCODE_INDEX_CORRUPTED),
>  errmsg("cannot reindex invalid index 
> \"%s.%s\" concurrently, skipping",
> 
> get_namespace_name(get_rel_namespace(cellOid)),
> get_rel_name(cellOid;
> 
> The only thing required to get to this is an interrupted CREATE INDEX
> CONCURRENTLY, which I don't think can be fairly characterized as "corruption".
> 
> ISTM something like ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE would be more
> appropriate?

+1, that's a clear improvement.

The "cannot" part of the message is also inaccurate, and it's not clear to me
why we have this specific restriction at all.  REINDEX INDEX CONCURRENTLY
accepts such indexes, so I doubt it's an implementation gap.  Since an INVALID
index often duplicates some valid index, I could see an argument that
reindexing INVALID indexes as part of a table-level REINDEX is wanted less
often than not.  But that argument would be just as pertinent to REINDEX TABLE
(w/o CONCURRENTLY), which doesn't impose this restriction.  Hmmm.




Re: 2023-11-09 release announcement draft

2023-11-07 Thread Noah Misch
On Tue, Nov 07, 2023 at 09:02:03AM -0500, Jonathan S. Katz wrote:
> On 11/6/23 9:52 PM, Noah Misch wrote:
> > On Mon, Nov 06, 2023 at 05:04:25PM -0500, Jonathan S. Katz wrote:

> > Delete lines starting here ...
> > 
> > > This is the **final release of PostgreSQL 11**. PostgreSQL 10 will no 
> > > longer
> > > receive
> > > [security and bug fixes](https://www.postgresql.org/support/versioning/).
> > > If you are running PostgreSQL 10 in a production environment, we suggest 
> > > that
> > > you make plans to upgrade.
> > 
> > ... to here.  They're redundant with "PostgreSQL 11 EOL Notice" below:
> 
> Initially, I strongly disagreed with this recommendation, as I've seen
> enough people say that they were unaware that a community version is EOL. We
> can't say this enough.
> 
> However, I did decide to clip it out because the notice is just below.

I just figured it was a copy-paste error, given the similarity of nearby
sentences.  I have no concern with a general goal of saying more about the end
of v11.




Re: Popcount optimization using AVX512

2023-11-06 Thread Noah Misch
On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:
> On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:
> > On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:
> >> Nathan Bossart  writes:
> >> > Like I said, I don't have any proposals yet, but assuming we do want to
> >> > support newer intrinsics, either open-coded or via auto-vectorization, I
> >> > suspect we'll need to gather consensus for a new policy/strategy.
> >> 
> >> Yeah.  The function-pointer solution kind of sucks, because for the
> >> sort of operation we're considering here, adding a call and return
> >> is probably order-of-100% overhead.  Worse, it adds similar overhead
> >> for everyone who doesn't get the benefit of the optimization.
> > 
> > The glibc/gcc "ifunc" mechanism was designed to solve this problem of 
> > choosing
> > a function implementation based on the runtime CPU, without incurring 
> > function
> > pointer overhead.  I would not attempt to use AVX512 on non-glibc systems, 
> > and
> > I would use ifunc to select the desired popcount implementation on glibc:
> > https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
> 
> Thanks, that seems promising for the function pointer cases.  I'll plan on
> trying to convert one of the existing ones to use it.  BTW it looks like
> LLVM has something similar [0].
> 
> IIUC this unfortunately wouldn't help for cases where we wanted to keep
> stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
> unless we applied it to the calling functions, but that doesn't sound
> particularly maintainable.

Agreed, it doesn't solve inline cases.  If the gains are big enough, we should
move toward packages containing N CPU-specialized copies of the postgres
binary, with bin/postgres just exec'ing the right one.
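
A minimal sketch of that dispatch, assuming hypothetical per-CPU install paths
and GCC's CPU-detection builtins:

  #include <unistd.h>

  int
  main(int argc, char **argv)
  {
      (void) argc;
      __builtin_cpu_init();
      /* paths are hypothetical; each binary is built for its CPU class */
      if (__builtin_cpu_supports("avx512f"))
          execv("/usr/lib/postgresql/bin/postgres-avx512", argv);
      execv("/usr/lib/postgresql/bin/postgres-baseline", argv);
      return 1;               /* execv returned: both exec attempts failed */
  }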




Re: Popcount optimization using AVX512

2023-11-06 Thread Noah Misch
On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:
> Nathan Bossart  writes:
> > Like I said, I don't have any proposals yet, but assuming we do want to
> > support newer intrinsics, either open-coded or via auto-vectorization, I
> > suspect we'll need to gather consensus for a new policy/strategy.
> 
> Yeah.  The function-pointer solution kind of sucks, because for the
> sort of operation we're considering here, adding a call and return
> is probably order-of-100% overhead.  Worse, it adds similar overhead
> for everyone who doesn't get the benefit of the optimization.

The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead.  I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
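
For reference, a self-contained glibc/GCC sketch of the mechanism (names are
illustrative; the resolver runs once, at dynamic link time, to fill in the
PLT/GOT entry):

  #include <stdint.h>

  static int
  popcount64_portable(uint64_t x)
  {
      int n = 0;

      for (; x; x &= x - 1)       /* clear the lowest set bit */
          n++;
      return n;
  }

  static __attribute__((target("popcnt"))) int
  popcount64_popcnt(uint64_t x)
  {
      return __builtin_popcountll(x);     /* emits the POPCNT instruction */
  }

  /* resolver: run by the dynamic linker to pick the implementation */
  static void *
  resolve_popcount64(void)
  {
      __builtin_cpu_init();
      if (__builtin_cpu_supports("popcnt"))
          return (void *) popcount64_popcnt;
      return (void *) popcount64_portable;
  }

  int pg_popcount64(uint64_t x) __attribute__((ifunc("resolve_popcount64")));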




Re: 2023-11-09 release announcement draft

2023-11-06 Thread Noah Misch
On Mon, Nov 06, 2023 at 05:04:25PM -0500, Jonathan S. Katz wrote:
> The PostgreSQL Global Development Group has released an update to all 
> supported
> versions of PostgreSQL, including 16.1, 15.5, 14.10, 13.13, 12.17, and 11.22
> This release fixes over 55 bugs reported over the last several months.
> 
> This release includes fixes for indexes where in certain cases, we advise
> reindexing. Please see the "Update" section for more details.

s/"Update" section/"Updating" section/ or change section title below.

Delete lines starting here ...

> This is the **final release of PostgreSQL 11**. PostgreSQL 10 will no longer
> receive
> [security and bug fixes](https://www.postgresql.org/support/versioning/).
> If you are running PostgreSQL 10 in a production environment, we suggest that
> you make plans to upgrade.

... to here.  They're redundant with "PostgreSQL 11 EOL Notice" below:

> For the full list of changes, please review the
> [release notes](https://www.postgresql.org/docs/release/).
> 
> PostgreSQL 11 EOL Notice
> 
> 
> **This is the final release of PostgreSQL 11**. PostgreSQL 11 is now 
> end-of-life
> and will no longer receive security and bug fixes. If you are
> running PostgreSQL 11 in a production environment, we suggest that you make
> plans to upgrade to a newer, supported version of PostgreSQL. Please see our
> [versioning policy](https://www.postgresql.org/support/versioning/) for more
> information.




Re: Tab completion regression test failed on illumos

2023-11-01 Thread Noah Misch
On Wed, Nov 01, 2023 at 03:19:39PM +0800, Japin Li wrote:
> I try to run regression test on illumos, the 010_tab_completion will
> failed because of timeout.

> Any suggestions?  Thanks in advance!

This test failed for me, in a different way, when I briefly installed IO::Pty
on a Solaris buildfarm member:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2023-01-03%2022%3A39%3A26

The IPC::Run pty tests also fail on Solaris.  If I were debugging this, I'd
start by fixing IO::Pty and IPC::Run to pass their own test suites on Solaris
or illumos.  Then I'd see if problems continue for this postgresql test.




Re: race condition in pg_class

2023-11-01 Thread Noah Misch
I prototyped two ways, one with a special t_ctid and one with LockTuple().

On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > Noah Misch  writes:
> > > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:

> > > I anticipate a new locktag per catalog that can receive inplace updates,
> > > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> > 
> > We could perhaps make this work by using the existing tuple-lock
> > infrastructure, rather than inventing new locktags (a choice that
> > spills to a lot of places including clients that examine pg_locks).
> 
> That could be okay.  It would be weird to reuse a short-term lock like that
> one as something held till end of transaction.  But the alternative of new
> locktags ain't perfect, as you say.

That worked.
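
For illustration, the LockTuple() flavor of the handshake amounts to something
like this sketch (lock mode is a placeholder; fetch and WAL details are elided
to comments):

  /* transactional updater (GRANT etc.): take the tuple lock before
   * fetching the old tuple, and hold it until end of transaction */
  LockTuple(relation, &otid, ExclusiveLock);

  /* heap_inplace_update(): bracket the fetch-latest + mutate step */
  LockTuple(relation, &otid, ExclusiveLock);
  /* ... re-fetch latest tuple version, modify in place, WAL-log ... */
  UnlockTuple(relation, &otid, ExclusiveLock);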

> > I would prefer though to find a solution that only depends on making
> > heap_inplace_update protect itself, without high-level cooperation
> > from the possibly-interfering updater.  This is basically because
> > I'm still afraid that we're defining the problem too narrowly.
> > For one thing, I have nearly zero confidence that GRANT et al are
> > the only problematic source of conflicting transactional updates.
> 
> Likewise here, but I have fair confidence that an assertion would flush out
> the rest.  heap_inplace_update() would assert that the backend holds one of
> the acceptable locks.  It could even be an elog; heap_inplace_update() can
> tolerate that cost.

That check would fall in both heap_inplace_update() and heap_update().  After
all, a heap_inplace_update() check won't detect an omission in GRANT.

> > For another, I'm worried that some extension may be using
> > heap_inplace_update against a catalog we're not considering here.
> 
> A pgxn search finds "citus" using heap_inplace_update().
> 
> > I'd also like to find a solution that fixes the case of a conflicting
> > manual UPDATE (although certainly that's a stretch goal we may not be
> > able to reach).
> 
> It would be nice.

I expect most approaches could get there by having ExecModifyTable() arrange
for the expected locking or other actions.  That's analogous to how
heap_update() takes care of sinval even for a manual UPDATE.

> > I wonder if there's a way for heap_inplace_update to mark the tuple
> > header as just-updated in a way that regular heap_update could
> > recognize.  (For standard catalog updates, we'd then end up erroring
> > in simple_heap_update, which I think is fine.)  We can't update xmin,
> > because the VACUUM callers don't have an XID; but maybe there's some
> > other way?  I'm speculating about putting a funny value into xmax,
> > or something like that, and having heap_update check that what it
> > sees in xmax matches what was in the tuple the update started with.
> 
> Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
> could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
> heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
> TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
> heap_update() could complain if the field changed vs. last seen value.  This
> feels like something to regret later in terms of limiting our ability to
> harness those fields for more-valuable ends or compact them away in a future
> page format.  I can't pinpoint a specific loss, so the idea might have legs.
> Nontransactional data in separate tables or in new metapages smells like the
> right long-term state.  A project wanting to reuse the tuple header bits could
> introduce such storage to unblock its own bit reuse.

heap_update() does not have the pre-modification xmax today, so I used t_ctid.
heap_modify_tuple() preserves t_ctid, so heap_update() already has the
pre-modification t_ctid in key cases.  For details of how the prototype uses
t_ctid, see comment at "#define InplaceCanaryOffsetNumber".  The prototype
doesn't prevent corruption in the following scenario, because the aborted
ALTER TABLE RENAME overwrites the special t_ctid:

  == session 1
  drop table t;
  create table t (c int);
  begin;
  -- in gdb, set breakpoint on heap_modify_tuple
  grant select on t to public;

  == session 2
  alter table t add primary key (c);
  begin; alter table t rename to t2; rollback;

  == back in session 1
  -- release breakpoint
  -- want error (would get it w/o the begin;alter;rollback)
  commit;

I'm missing how to mark the tuple in a fashion accessible to a second
heap_update() after a rolled-back heap_update().  The mark needs enough bits
"N" so it's implausible for 2^N inplace updates to happen between GRANT
fetching the old 

Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> >> I'm inclined to propose that heap_inplace_update should check to
> >> make sure that it's operating on the latest version of the tuple
> >> (including, I guess, waiting for an uncommitted update?) and throw
> >> error if not.  I think this is what your B3 option is, but maybe
> >> I misinterpreted.  It might be better to throw error immediately
> >> instead of waiting to see if the other updater commits.
> 
> > That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> > waiting for GRANT to commit but instead inplace-updating every member of the
> > update chain.  For B2, I was thinking we don't need to error.  There are two
> > problematic orders of events.  The easy one is heap_inplace_update() 
> > mutating
> > a tuple that already has an xmax.  That's the one in the test case upthread,
> > and detecting it is trivial.  The harder one is heap_inplace_update() 
> > mutating
> > a tuple after GRANT fetches the old tuple, before GRANT enters 
> > heap_update().
> 
> Ugh ... you're right, what I was imagining would not catch that last case.
> 
> > I anticipate a new locktag per catalog that can receive inplace updates,
> > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> 
> We could perhaps make this work by using the existing tuple-lock
> infrastructure, rather than inventing new locktags (a choice that
> spills to a lot of places including clients that examine pg_locks).

That could be okay.  It would be weird to reuse a short-term lock like that
one as something held till end of transaction.  But the alternative of new
locktags ain't perfect, as you say.

> I would prefer though to find a solution that only depends on making
> heap_inplace_update protect itself, without high-level cooperation
> from the possibly-interfering updater.  This is basically because
> I'm still afraid that we're defining the problem too narrowly.
> For one thing, I have nearly zero confidence that GRANT et al are
> the only problematic source of conflicting transactional updates.

Likewise here, but I have fair confidence that an assertion would flush out
the rest.  heap_inplace_update() would assert that the backend holds one of
the acceptable locks.  It could even be an elog; heap_inplace_update() can
tolerate that cost.
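
A minimal sketch of that check, with the accepted lock level as a placeholder:

  /* hypothetical guard; the accepted lock set is the open design question */
  Assert(CheckRelationLockedByMe(relation, ShareUpdateExclusiveLock, true));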

> For another, I'm worried that some extension may be using
> heap_inplace_update against a catalog we're not considering here.

A pgxn search finds "citus" using heap_inplace_update().

> I'd also like to find a solution that fixes the case of a conflicting
> manual UPDATE (although certainly that's a stretch goal we may not be
> able to reach).

It would be nice.

> I wonder if there's a way for heap_inplace_update to mark the tuple
> header as just-updated in a way that regular heap_update could
> recognize.  (For standard catalog updates, we'd then end up erroring
> in simple_heap_update, which I think is fine.)  We can't update xmin,
> because the VACUUM callers don't have an XID; but maybe there's some
> other way?  I'm speculating about putting a funny value into xmax,
> or something like that, and having heap_update check that what it
> sees in xmax matches what was in the tuple the update started with.

Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
heap_update() could complain if the field changed vs. last seen value.  This
feels like something to regret later in terms of limiting our ability to
harness those fields for more-valuable ends or compact them away in a future
page format.  I can't pinpoint a specific loss, so the idea might have legs.
Nontransactional data in separate tables or in new metapages smells like the
right long-term state.  A project wanting to reuse the tuple header bits could
introduce such storage to unblock its own bit reuse.
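
To make the xmax idea concrete, a minimal sketch (assumptions only; none of
this is committed code, and a real patch would also need WAL and visibility
auditing):

static void
inplace_perturb_xmax(HeapTupleHeader htup)
{
	TransactionId xmax = HeapTupleHeaderGetRawXmax(htup);

	/* bump the raw xmax so a concurrent updater's recheck sees a change */
	TransactionIdAdvance(xmax);
	htup->t_infomask |= HEAP_XMAX_INVALID;	/* never treat it as a real locker */
	HeapTupleHeaderSetXmax(htup, xmax);
}

heap_update() would then compare HeapTupleHeaderGetRawXmax() of the latest
tuple version against the value remembered from its initial fetch and raise
an error on mismatch.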

> Or we could try to get rid of in-place updates, but that seems like
> a mighty big lift.  All of the existing callers, except maybe
> the johnny-come-lately dropdb usage, have solid documented reasons
> to do it that way.

Yes, removing that smells problematic.




Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> >> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> >> in GRANT.
> 
> > The main choice is accepting more DDL blocking vs. accepting inefficient
> > relcache builds.  Options I'm seeing:
> 
> It looks to me like you're only thinking about relhasindex, but it
> seems to me that any call of heap_inplace_update brings some
> risk of this kind.  Excluding the bootstrap-mode-only usage in
> create_toast_table, I see four callers:
> 
> * index_update_stats updating a pg_class tuple's
>   relhasindex, relpages, reltuples, relallvisible
> 
> * vac_update_relstats updating a pg_class tuple's
>   relpages, reltuples, relallvisible, relhasindex, relhasrules,
>   relhastriggers, relfrozenxid, relminmxid
> 
> * vac_update_datfrozenxid updating a pg_database tuple's
>   datfrozenxid, datminmxid
> 
> * dropdb updating a pg_database tuple's datconnlimit
> 
> So we have just as much of a problem with GRANTs on databases
> as GRANTs on relations.  Also, it looks like we can lose
> knowledge of the presence of rules and triggers, which seems
> nearly as bad as forgetting about indexes.  The rest of these
> updates might not be correctness-critical, although I wonder
> how bollixed things could get if we forget an advancement of
> relfrozenxid or datfrozenxid (especially if the calling
> transaction goes on to make other changes that assume that
> the update happened).

Thanks for researching that.  Let's treat frozenxid stuff as critical; I
wouldn't want to advance XID limits based on a datfrozenxid that later gets
rolled back.  I agree relhasrules and relhastriggers are also critical.  The
"inefficient relcache builds" option family can't solve cases like
relfrozenxid and datconnlimit, so that leaves us with the "more DDL blocking"
option family.

> BTW, vac_update_datfrozenxid believes (correctly I think) that
> it cannot use the syscache copy of a tuple as the basis for in-place
> update, because syscache will have detoasted any toastable fields.
> These other callers are ignoring that, which seems like it should
> result in heap_inplace_update failing with "wrong tuple length".
> I wonder how come we're not seeing reports of that from the field.

Good question.  Perhaps we'll need some test cases that exercise each inplace
update against a row having a toast pointer.  It's too easy to go a long time
without encountering those in the field.

> I'm inclined to propose that heap_inplace_update should check to
> make sure that it's operating on the latest version of the tuple
> (including, I guess, waiting for an uncommitted update?) and throw
> error if not.  I think this is what your B3 option is, but maybe
> I misinterpreted.  It might be better to throw error immediately
> instead of waiting to see if the other updater commits.

That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
waiting for GRANT to commit but instead inplace-updating every member of the
update chain.  For B2, I was thinking we don't need to error.  There are two
problematic orders of events.  The easy one is heap_inplace_update() mutating
a tuple that already has an xmax.  That's the one in the test case upthread,
and detecting it is trivial.  The harder one is heap_inplace_update() mutating
a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
I anticipate a new locktag per catalog that can receive inplace updates,
i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.  Here's a
walk-through for the pg_database case.  GRANT will use the following sequence
of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- fetch latest pg_database tuple
- heap_update()
- COMMIT, releasing LOCKTAG_DATABASE_DEFINITION

vac_update_datfrozenxid() sequence of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- (now, all GRANTs on the given database have committed or aborted)
- fetch latest pg_database tuple
- heap_inplace_update()
- release LOCKTAG_DATABASE_DEFINITION, even if xact not ending
- continue with other steps, e.g. vac_truncate_clog()

How does that compare to what you envisioned?  vac_update_datfrozenxid() could
further use xmax as a best-efforts thing to catch conflict with manual UPDATE
statements, but it wouldn't solve the case where the UPDATE had fetched the
tuple but not yet heap_update()'d it.
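
In code, both sequences would begin with something like this (sketch only;
LOCKTAG_DATABASE_DEFINITION and its SET_LOCKTAG macro are the proposed
additions, so neither exists today):

static void
lock_database_definition(Oid datid)
{
	LOCKTAG		tag;

	SET_LOCKTAG_DATABASE_DEFINITION(tag, datid);	/* hypothetical macro */
	(void) LockAcquire(&tag, ExclusiveLock, false, false);
}

GRANT would call this before fetching the tuple and let transaction end
release the lock; vac_update_datfrozenxid() would call it the same way but
pair it with an explicit LockRelease() right after heap_inplace_update().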




Re: race condition in pg_class

2023-10-27 Thread Noah Misch
On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> > We are running PG13.10 and recently we have encountered what appears to be
> > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> > some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating an index on a relation (which
> > previously didn't have any indexes), its pg_class.relhasindex remains set to
> > "false", which is illegal, I think.

It's damaging.  The table will behave like it has no indexes.  If something
adds an index later, old indexes will reappear, corrupt, having not received
updates during the relhasindex=false era.  ("pg_amcheck --heapallindexed" can
detect this.)

> > Index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
> 
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
> 
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
> 
> == session 2
> alter table t add primary key (c);
> 
> == back in session 1
> commit;
> 
> 
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.

The main choice is accepting more DDL blocking vs. accepting inefficient
relcache builds.  Options I'm seeing:

=== "more DDL blocking" option family

B1. Take ShareUpdateExclusiveLock in GRANT, REVOKE, and anything that makes
transactional pg_class updates without holding some stronger lock.  New
asserts could catch future commands failing to do this.

B2. Take some shorter-lived lock around pg_class tuple formation, such that
GRANT blocks CREATE INDEX, but two CREATE INDEX don't block each other.
Anything performing a transactional update of a pg_class row would acquire
the lock in exclusive mode before fetching the old tuple and hold it till
end of transaction.  relhasindex=true in-place updates would acquire it
the same way, but they would release it after the inplace update.  I
expect a new heavyweight lock type, e.g. LOCKTAG_RELATION_DEFINITION, with
the same key as LOCKTAG_RELATION.  This has less blocking than the
previous option, but it's more complicated to explain to both users and
developers.

B3. Have CREATE INDEX do an EvalPlanQual()-style thing to update all successor
tuple versions.  Like the previous option, this would require new locking,
but the new lock would not need to persist till end of xact.  It would be
even more complicated to explain to users and developers.  (If this is
promising enough to warrant more detail, let me know.)

B4. Use transactional updates to set relhasindex=true.  Two CREATE INDEX
commands on the same table would block each other.  If we did it the way
most DDL does today, they'd get "tuple concurrently updated" failures
after the blocking ends.

=== "inefficient relcache builds" option family

R1. Ignore relhasindex; possibly remove it in v17.  Relcache builds et
al. will issue more superfluous queries.

R2. As a weird variant of the previous option, keep relhasindex and make all
transactional updates of pg_class set relhasindex=true pessimistically.
(VACUUM will set it back to false.)

=== other

O1. This is another case where the sometimes-discussed "pg_class_nt" for
nontransactional columns would help.  I'm ruling that out as too hard to
back-patch.


Are there other options important to consider?  I currently like (B1) the
most, followed closely by (R1) and (B2).  A key unknown is the prevalence of
index-free tables.  Low prevalence would argue in favor of (R1).  In my
limited experience, they've been rare.  That said, I assume relcache builds
happen a lot more than GRANTs, so it's harder to bound the damage from (R1)
compared to the damage from (B1).  Thoughts on this decision?

Thanks,
nm




Re: race condition in pg_class

2023-10-26 Thread Noah Misch
On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating an index on a relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

This is going to be a problem with any operation that does a transactional
pg_class update without taking a lock that conflicts with ShareLock.  GRANT
doesn't lock the table at all, so I can reproduce this in v17 as follows:

== session 1
create table t (c int);
begin;
grant select on t to public;

== session 2
alter table t add primary key (c);

== back in session 1
commit;


We'll likely need to change how we maintain relhasindex or perhaps take a lock
in GRANT.

> Looking into the WAL via waldump gave us the following picture (full
> waldump output is attached):

> 1202295045 - create index statement
> 1202298790 and 1202298791 are some other concurrent operations,
> unfortunately I wasn't able to determine what they are

Can you explore that as follows?

- PITR to just before the COMMIT record.
- Save all rows of pg_class.
- PITR to just after the COMMIT record.
- Save all rows of pg_class.
- Diff the two sets of saved rows.

Which columns changed?  The evidence you've shown would be consistent with a
transaction doing GRANT or REVOKE on dozens of tables.  If the changed column
is something other than relacl, that would be great to know.

On the off-chance it's relevant, what extensions do you have (\dx in psql)?




Re: post-recovery amcheck expectations

2023-10-24 Thread Noah Misch
On Tue, Oct 24, 2023 at 07:03:34PM -0700, Peter Geoghegan wrote:
> On Mon, Oct 23, 2023 at 7:28 PM Noah Misch  wrote:
> > > That makes sense to me. I believe that it's not possible to have a
> > > string of consecutive sibling pages that are all half-dead (regardless
> > > of the BlockNumber order of sibling pages, even). But I'd probably
> > > have written the fix in roughly the same way. Although...maybe you
> > > should try to detect a string of half-dead pages? Hard to say if it's
> > > worth the trouble.
> >
> > I imagined a string of half-dead siblings could arise in structure like this:
> >
> >  *   1
> >  *  /|\
> >  * 4 <-> 2 <-> 3
> >
> > With events like this:
> >
> > - DELETE renders blk 4 deletable.
> > - Crash with concurrent VACUUM, leaving 4 half-dead after having visited 1-4.
> > - DELETE renders blk 2 deletable.
> > - Crash with concurrent VACUUM, leaving 2 half-dead after having visited 1-2.
> >
> > I didn't try to reproduce that, and something may well prevent it.
> 
> FWIW a couple of factors prevent it (in the absence of corruption). These are:
> 
> 1. Only VACUUM can delete pages, and in general the only possible
> source of half-dead pages is an unfortunately timed crash/error within
> VACUUM. Each interrupted VACUUM can leave behind at most one half-dead
> page.

Agreed.

> 2. One thing that makes VACUUM back out of deleting an empty page is
> the presence of a half-dead right sibling leaf page left behind by
> some VACUUM that was interrupted at some point in the past -- see
> _bt_rightsib_halfdeadflag() for details.
> 
> Obviously, factors 1 and 2 together make three consecutive half-dead
> sibling pages impossible.

Can't it still happen if the sequence of unfortunately timed crashes causes
deletions from left to right?  Take this example, expanding the one above.
Half-kill 4, crash, half-kill 3, crash, half-kill 2 in:

 *  1
 *   // | \\
 * 4 <-> 3 <-> 2 <-> 1

(That's not to say it has ever happened outside of a test.)




Re: post-recovery amcheck expectations

2023-10-23 Thread Noah Misch
On Mon, Oct 23, 2023 at 04:46:23PM -0700, Peter Geoghegan wrote:
> On Fri, Oct 20, 2023 at 8:55 PM Noah Misch  wrote:
> > > > I lean toward fixing this by
> > > > having amcheck scan left; if left links reach only half-dead or deleted pages,
> > > > that's as good as the present child block being P_LEFTMOST.
> > >
> > > Also my preference.
> >
> > Done mostly that way, except I didn't accept deleted pages.  Making this work
> > on !readonly would take more than that, and readonly shouldn't need that.
> 
> That makes sense to me. I believe that it's not possible to have a
> string of consecutive sibling pages that are all half-dead (regardless
> of the BlockNumber order of sibling pages, even). But I'd probably
> have written the fix in roughly the same way. Although...maybe you
> should try to detect a string of half-dead pages? Hard to say if it's
> worth the trouble.

I imagined a string of half-dead siblings could arise in structure like this:

 *   1
 *  /|\
 * 4 <-> 2 <-> 3

With events like this:

- DELETE renders blk 4 deletable.
- Crash with concurrent VACUUM, leaving 4 half-dead after having visited 1-4.
- DELETE renders blk 2 deletable.
- Crash with concurrent VACUUM, leaving 2 half-dead after having visited 1-2.

I didn't try to reproduce that, and something may well prevent it.

> Suggest adding a CHECK_FOR_INTERRUPTS() call to the loop, too, just
> for good luck.

Added.  That gave me the idea to check for circular links, like other parts of
amcheck do.  Net diff:

--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -949,11 +949,16 @@ bt_leftmost_ignoring_half_dead(BtreeCheckState *state,
 		Page		page = palloc_btree_page(state, reached);
 		BTPageOpaque reached_opaque = BTPageGetOpaque(page);
 
+		CHECK_FOR_INTERRUPTS();
+
 		/*
-		 * _bt_unlink_halfdead_page() writes that side-links will continue to
-		 * point to the siblings.  We can easily check btpo_next.
+		 * Try to detect btpo_prev circular links.  _bt_unlink_halfdead_page()
+		 * writes that side-links will continue to point to the siblings.
+		 * Check btpo_next for that property.
 		 */
-		all_half_dead = P_ISHALFDEAD(reached_opaque) &&
+		all_half_dead = P_ISHALFDEAD(reached_opaque) &&
+			reached != start &&
+			reached != reached_from &&
 			reached_opaque->btpo_next == reached_from;
 		if (all_half_dead)
 		{




Re: interval_ops shall stop using btequalimage (deduplication)

2023-10-23 Thread Noah Misch
On Mon, Oct 16, 2023 at 11:21:20PM -0700, Donghang Lin wrote:
> > I've also caught btree posting lists where one TID refers to a '1d' heap
> > tuple, while another TID refers to a '24h' heap tuple.  amcheck complains.
> Index-only scans can return the '1d' bits where the actual tuple had the '24h'
> bits.
> 
> Have a build without the patch. I can't reproduce amcheck complaints in
> release mode
> where all these statements succeed.

The queries I shared don't create the problematic structure, just an assertion
failure.

> >  * Generic "equalimage" support function.
> >  *
> >  * B-Tree operator classes whose equality function could safely be replaced by
> >  * datum_image_eq() in all cases can use this as their "equalimage" support
> >  * function.
> It seems to me that as long as a data type has a deterministic sort, but is
> not necessarily equalimage, it should be able to support deduplication.
> E.g. for the interval type, we could add a byte-wise tie breaker after '24h'
> and '1day' compare equal.  In the btree, '24h' and '1day' would still be
> adjacent, with '1day' always sorted before '24h' in a btree page.  Can we
> then do dedup for each value without problem?

Yes.  I'm not aware of correctness obstacles arising if one did that.
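
For illustration, such a comparator might look about like this (a sketch; no
such btree support function exists today, and adopting it would require
changes well beyond the comparator itself):

static int
interval_cmp_with_tiebreak(Datum a, Datum b)
{
	int32		cmp = DatumGetInt32(DirectFunctionCall2(interval_cmp, a, b));

	if (cmp != 0)
		return cmp;
	/* equal under interval semantics; fall back to the stored binary image */
	return memcmp(DatumGetPointer(a), DatumGetPointer(b), sizeof(Interval));
}

That keeps '1 day' and '24:00:00' adjacent yet distinguishable, which is the
ordering property deduplication would need.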




Re: pgstatindex vs. !indisready

2023-10-22 Thread Noah Misch
On Wed, Oct 04, 2023 at 09:00:23AM +0900, Michael Paquier wrote:
> On Sun, Oct 01, 2023 at 06:31:26PM -0700, Noah Misch wrote:
> > The !indisvalid index may be missing tuples, yes.  In what way does that make
> > one of those four operations incorrect?
> 
> Reminding myself of what these four do, it looks that you're right and
> that the state of indisvalid is not going to change what they report.
> 
> Still, I'd like to agree with Tom's point to be more conservative and
> check also for indisvalid which is what the planner does.  These
> functions will be used in SELECTs, and one thing that worries me is
> that checks based on indisready may get copy-pasted somewhere else,
> leading to incorrect results where they get copied.  (indisready &&
> !indisvalid) is a "short"-term combination in a concurrent build
> process, as well (depends on how long one waits for the old snapshots
> before switching indisvalid, but that's way shorter than the period of
> time where the built indexes remain valid).

Neither choice would harm the user experience in an important way, so I've
followed your preference in the attached patch.
Author: Noah Misch 
Commit: Noah Misch 

Diagnose !indisvalid in more SQL functions.

pgstatindex failed with ERRCODE_DATA_CORRUPTED, of the "can't-happen"
class XX.  The other functions succeeded on an empty index; they might
have malfunctioned if the failed index build left torn I/O or other
complex state.  Report an ERROR in statistics functions pgstatindex,
pgstatginindex, pgstathashindex, and pgstattuple.  Report DEBUG1 and
skip all index I/O in maintenance functions brin_desummarize_range,
brin_summarize_new_values, brin_summarize_range, and
gin_clean_pending_list.  Back-patch to v11 (all supported versions).

Reviewed by FIXME.

Discussion: https://postgr.es/m/20231001195309...@google.com

diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index d69ac1c..8e5a4d6 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -238,6 +238,18 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("cannot access temporary tables of other sessions")));
 
 	/*
+	 * A !indisready index could lead to ERRCODE_DATA_CORRUPTED later, so exit
+	 * early.  We're capable of assessing an indisready&&!indisvalid index,
+	 * but the results could be confusing.  For example, the index's size
+	 * could be too low for a valid index of the table.
+	 */
+	if (!rel->rd_index->indisvalid)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("index \"%s\" is not valid",
+						RelationGetRelationName(rel))));
+
+	/*
 	 * Read metapage
 	 */
 	{
@@ -523,6 +535,13 @@ pgstatginindex_internal(Oid relid, FunctionCallInfo fcinfo)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary indexes of other sessions")));
 
+	/* see pgstatindex_impl */
+	if (!rel->rd_index->indisvalid)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("index \"%s\" is not valid",
+						RelationGetRelationName(rel))));
+
 	/*
 	 * Read metapage
 	 */
@@ -600,6 +619,13 @@ pgstathashindex(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary indexes of other sessions")));
 
+	/* see pgstatindex_impl */
+	if (!rel->rd_index->indisvalid)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("index \"%s\" is not valid",
+						RelationGetRelationName(rel))));
+
 	/* Get the information we need from the metapage. */
 	memset(&stats, 0, sizeof(stats));
 	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 93b7834..3bd8b96 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -259,6 +259,13 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 	}
 	else if (rel->rd_rel->relkind == RELKIND_INDEX)
 	{
+		/* see pgstatindex_impl */
+		if (!rel->

Re: post-recovery amcheck expectations

2023-10-20 Thread Noah Misch
On Mon, Oct 09, 2023 at 04:46:26PM -0700, Peter Geoghegan wrote:
> On Wed, Oct 4, 2023 at 7:52 PM Noah Misch  wrote:

> Might make sense to test the fix for this issue using a similar
> approach: by adding custom code that randomly throws errors at a point
> that stresses the implementation. I'm referring to the point at which
> VACUUM is between the first and second phase of page deletion: right
> before (or directly after) _bt_unlink_halfdead_page() is called. That
> isn't fundamentally different to the approach from your TAP test, but
> seems like it might add some interesting variation.

My initial manual test was like that, actually.

> > I lean toward fixing this by
> > having amcheck scan left; if left links reach only half-dead or deleted 
> > pages,
> > that's as good as the present child block being P_LEFTMOST.
> 
> Also my preference.

Done mostly that way, except I didn't accept deleted pages.  Making this work
on !readonly would take more than that, and readonly shouldn't need that.

> > There's a
> > different error from bt_index_check(), and I've not yet studied how to fix
> > that:
> >
> > >   ERROR:  left link/right link pair in index "not_leftmost_pk" not in agreement
> >   DETAIL:  Block=0 left block=0 left link from block=4294967295.
> 
> This looks like this might be a straightforward case of confusing
> P_NONE for InvalidBlockNumber (or vice-versa). They're both used to
> indicate "no such block" here.

Roughly so.  It turned out this scenario was passing leftcurrent=P_NONE to
bt_recheck_sibling_links(), causing that function to use BTPageGetOpaque() on
the metapage.

> > For some other amcheck expectations, the comments suggest reliance on the
> > bt_index_parent_check() ShareLock.  I haven't tried to make test cases for
> > them, but perhaps recovery can trick them the same way.  Examples:
> >
> >   errmsg("downlink or sibling link points to deleted block in index \"%s\"",
> >   errmsg("block %u is not leftmost in index \"%s\"",
> >   errmsg("block %u is not true root in index \"%s\"",
> 
> These are all much older. They're certainly all from before the
> relevant checks were first added (by commit d114cc53), and seem much
> less likely to be buggy.

After I fixed the original error, the "block %u is not leftmost" surfaced
next.  The attached patch fixes that, too.  I didn't investigate the others.
The original test was flaky in response to WAL flush timing, but this one
survives thousands of runs.
Author: Noah Misch 
Commit: Noah Misch 

amcheck: Distinguish interrupted page deletion from corruption.

This prevents false-positive reports about "the first child of leftmost
target page is not leftmost of its level", "block %u is not leftmost"
and "left link/right link pair".  They appeared if amcheck ran before
VACUUM cleaned things, after a cluster exited recovery between the
first-stage and second-stage WAL records of a deletion.  Back-patch to
v11 (all supported versions).

Reviewed by FIXME.

Discussion: https://postgr.es/m/20231005025232.c7.nmi...@google.com

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index b82f221..f830bd3 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -12,6 +12,7 @@ PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree check_heap
 
+EXTRA_INSTALL = contrib/pg_walinspect
 TAP_TESTS = 1
 
 ifdef USE_PGXS
diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index 5b55cf3..51d8107 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -42,6 +42,7 @@ tests += {
   't/001_verify_heapam.pl',
   't/002_cic.pl',
   't/003_cic_2pc.pl',
+  't/004_pitr.pl',
 ],
   },
 }
diff --git a/contrib/amcheck/t/004_pitr.pl b/contrib/amcheck/t/004_pitr.pl
new file mode 100644
index 000..6bcc159
--- /dev/null
+++ b/contrib/amcheck/t/004_pitr.pl
@@ -0,0 +1,82 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Test integrity of intermediate states by PITR to those states
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# origin node: generate WAL records of interest.
+my $origin = PostgreSQL::Test::Cluster->new('origin');
+$origin->init(has_archiving => 1, allows_streaming => 1);
+$origin->append_conf('postgresql.conf', 'autovacuum = off');
+$origin->start;
+$origin->backup('my_backup');
+# Create a table with each of 6 PK values spanning 1/4 of a block.  Delete the
+# first four, so one index leaf is eligible for deletion.  Make a replication
+# slot just so pg_walinspect will always

Re: Remove last traces of HPPA support

2023-10-20 Thread Noah Misch
On Fri, Oct 20, 2023 at 12:40:00PM -0700, Andres Freund wrote:
> On 2023-10-19 17:23:04 -0700, Noah Misch wrote:
> > On Thu, Oct 19, 2023 at 11:16:28AM -0400, Tom Lane wrote:
> > > We removed support for the HP-UX OS in v16, but left in support
> > > for the PA-RISC architecture, mainly because I thought that its
> > > spinlock mechanism is weird enough to be a good stress test
> > > for our spinlock infrastructure.  It still is that, but my
> > > one remaining HPPA machine has gone to the great recycle heap
> > > in the sky.  There seems little point in keeping around nominal
> > > support for an architecture that we can't test and no one is
> > > using anymore.
> > > 
> > > Hence, the attached removes the remaining support for HPPA.
> > > Any objections?
> > 
> > I wouldn't do this.  NetBSD/hppa still claims to exist, as does the OpenBSD
> > equivalent.  I presume its pkgsrc compiles this code.  The code is basically
> > zero-maintenance, so there's not much to gain from deleting it preemptively.
> 
> In addition to the point Tom has made, I think it's also not correct that hppa
> doesn't impose a burden: hppa is the only of our architectures that doesn't
> actually support atomic operations, requiring us to have infrastructure to
> backfill atomics using spinlocks. This does preclude some uses of atomics,
> e.g. in signal handlers - I think Thomas wanted to do so for some concurrency
> primitive.

If the next thing is a patch removing half of the fallback atomics, that is a
solid reason to remove hppa.  The code removed in the last proposed patch was
not that and was code that never changes, hence my reaction.




Re: Remove last traces of HPPA support

2023-10-19 Thread Noah Misch
On Thu, Oct 19, 2023 at 11:16:28AM -0400, Tom Lane wrote:
> We removed support for the HP-UX OS in v16, but left in support
> for the PA-RISC architecture, mainly because I thought that its
> spinlock mechanism is weird enough to be a good stress test
> for our spinlock infrastructure.  It still is that, but my
> one remaining HPPA machine has gone to the great recycle heap
> in the sky.  There seems little point in keeping around nominal
> support for an architecture that we can't test and no one is
> using anymore.
> 
> Hence, the attached removes the remaining support for HPPA.
> Any objections?

I wouldn't do this.  NetBSD/hppa still claims to exist, as does the OpenBSD
equivalent.  I presume its pkgsrc compiles this code.  The code is basically
zero-maintenance, so there's not much to gain from deleting it preemptively.




Re: Add support for AT LOCAL

2023-10-18 Thread Noah Misch
On Wed, Oct 18, 2023 at 04:45:46PM -0400, Tom Lane wrote:
> > On Wed, Oct 18, 2023 at 12:02 PM Tom Lane  wrote:
> >> Probably.  Independent of that, it's fair to ask why we're still
> >> testing against xlc 12.1 and not the considerably-more-recent xlclang,
> >> or at least xlc 16.1.  (I also wonder why we're still testing AIX 7.1
> >> rather than an OS version that's not EOL.)

> The machine belongs to OSU (via the gcc compile farm), and I see
> that they have another one that's POWER8 and is running AIX 7.3 [1].
> So in principle the buildfarm animals could just be moved over
> to that one.
> 
> Perhaps Noah has some particular attachment to 7.1, but now that that's
> EOL it seems like we shouldn't be paying so much attention to it.
> My guess is that it's still there in the compile farm because the gcc
> people think it's still useful to have access to POWER7 hardware; but
> I doubt there's enough difference for our purposes to be worth dealing
> with a dead OS and ancient compiler.

No particular attachment.  From 2019 to 2023-08, hoverfly tested xlc16 on AIX
7.2; its run ended when cfarm119's owner replaced cfarm119 with an AIX 7.3,
ibm-clang v17.1.1 machine.  Since 2015, hornet and mandrill have tested xlc12
on AIX 7.1.  That said, given your finding that later xlc versions have the
same code generation bug, the choice of version is a side issue.  A migration
to ibm-clang wouldn't have prevented this week's xlc-prompted commits.

I feel the gravity and longevity of xlc bugs has been out of proportion with
the compiler's contribution to PostgreSQL.  I would find it reasonable to
revoke xlc support in v17+, leaving AIX gcc support in place.  The main
contribution of AIX has been to find the bug behind commit a1b8aa1.  That
benefited from the AIX kernel, not from any particular compiler.  hornet and
mandrill would continue to test v16-.

By the way, I once tried to report an xlc bug.  Their system was tailored to
accept bugs from paid support customers only.  I submitted it via some sales
inquiry form, just in case, but never heard back.




Re: Add support for AT LOCAL

2023-10-16 Thread Noah Misch
On Mon, Oct 16, 2023 at 01:54:23AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sun, Oct 15, 2023 at 09:58:04PM -0700, Noah Misch wrote:
> >> Works for me.  I've started a test run with the xlc version change.
> 
> > It failed similarly:
> 
> > +  23:59:00-07| 4294966103:4294967295:00+00 | 4294966103:4294967295:00+00 | 4294966103:4294967295:00+00
> > +  23:59:59.99-07 | 4294966103:00:00.01+00  | 4294966103:00:00.01+00  | 4294966103:00:00.01+00
> 
> Ugh.  So if the failure is robust enough to persist across
> several major xlc versions, why couldn't Thomas reproduce it?

Beats me.  hornet wipes its starting environment down to OBJECT_MODE=32_64
PERL5LIB=/home/nm/sw/cpan64/lib/perl5 SPECIES=xlc64 PATH=/usr/bin, then
applies all the environment settings seen in buildfarm logs.




Re: Add support for AT LOCAL

2023-10-15 Thread Noah Misch
On Sun, Oct 15, 2023 at 09:58:04PM -0700, Noah Misch wrote:
> On Sun, Oct 15, 2023 at 11:30:17PM -0400, Tom Lane wrote:
> > Thomas Munro  writes:
> > > On Mon, Oct 16, 2023 at 4:02 PM Tom Lane  wrote:
> > >> I'm having a hard time not believing that this is a compiler bug.
> > >> Looking back at 8d2a01ae12cd and its speculation that xlc is overly
> > >> liberal about reordering code around sequence points ... I wonder
> > >> if it'd help to do this calculation in a local variable, and only
> > >> assign the final value to result->time ?  But we have to reproduce
> > >> the problem first.
> > 
> > > If that can be shown I would vote for switching to /opt/IBM/xlc/16.1.0
> > > and not changing a single bit of PostgreSQL.
> > 
> > If switching to 16.1 removes the failure, I'd agree.  It's hard
> > to believe that any significant number of users still care about
> > building PG with xlc 12.
> 
> Works for me.  I've started a test run with the xlc version change.

It failed similarly:

+  23:59:00-07| 4294966103:4294967295:00+00 | 4294966103:4294967295:00+00 | 4294966103:4294967295:00+00
+  23:59:59.99-07 | 4294966103:00:00.01+00  | 4294966103:00:00.01+00  | 4294966103:00:00.01+00




Re: Add support for AT LOCAL

2023-10-15 Thread Noah Misch
On Sun, Oct 15, 2023 at 11:30:17PM -0400, Tom Lane wrote:
> Thomas Munro  writes:
> > On Mon, Oct 16, 2023 at 4:02 PM Tom Lane  wrote:
> >> I'm having a hard time not believing that this is a compiler bug.
> >> Looking back at 8d2a01ae12cd and its speculation that xlc is overly
> >> liberal about reordering code around sequence points ... I wonder
> >> if it'd help to do this calculation in a local variable, and only
> >> assign the final value to result->time ?  But we have to reproduce
> >> the problem first.
> 
> > If that can be shown I would vote for switching to /opt/IBM/xlc/16.1.0
> > and not changing a single bit of PostgreSQL.
> 
> If switching to 16.1 removes the failure, I'd agree.  It's hard
> to believe that any significant number of users still care about
> building PG with xlc 12.

Works for me.  I've started a test run with the xlc version change.




Re: interval_ops shall stop using btequalimage (deduplication)

2023-10-12 Thread Noah Misch
On Wed, Oct 11, 2023 at 01:00:44PM -0700, Peter Geoghegan wrote:
> On Wed, Oct 11, 2023 at 11:38 AM Noah Misch  wrote:
> > Interesting.  So, >99% of interval-type indexes, even ones WITH
> > (deduplicate_items=off), will get amcheck failures.  The <1% of exceptions
> > might include indexes having allequalimage=off due to an additional column,
> > e.g. a two-column (interval, numeric) index.  If interval indexes are common
> > enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively
> > rare, that could argue for giving amcheck a special case.  Specifically,
> > downgrade its "metapage incorrectly indicates that deduplication is safe" from
> > ERROR to WARNING for interval_ops only.
> 
> I am not aware of any user actually running "deduplicate_items = off"
> in production, for any index. It was added purely as a defensive thing
> -- not because I anticipated any real need to disable deduplication.
> Deduplication was optimized for being enabled by default.

Sure.  Low-importance background information: deduplicate_items=off got on my
radar while I was wondering if ALTER INDEX ... SET (deduplicate_items=off)
would clear allequalimage.  If it had, we could have advised people to use
ALTER INDEX, then rebuild only those indexes still failing "pg_amcheck
--heapallindexed".  ALTER INDEX doesn't do that, ruling out that idea.

> > Without that special case (i.e. with
> > the v1 patch), the release notes should probably resemble, "After updating,
> > run REINDEX on all indexes having an interval-type column."
> 
> +1
> 
> > There's little
> > point in recommending pg_amcheck if >99% will fail.  I'm inclined to bet that
> > interval-type indexes are rare, so I lean against adding the amcheck special
> > case.  It's not a strong preference.  Other opinions?

> exactly one case like that post-fix (interval_ops is at least the only
> affected core code opfamily), so why not point that out directly with
> a HINT? A HINT could go a long way towards putting the problem in
> context, without really adding a special case, and without any real
> question of users being misled.

Works for me.  Added.
Author: Noah Misch 
Commit: Noah Misch 

Dissociate btequalimage() from interval_ops, ending its deduplication.

Under interval_ops, some equal values are distinguishable.  One such
pair is '24:00:00' and '1 day'.  With that being so, btequalimage()
breaches the documented contract for the "equalimage" btree support
function.  This can cause incorrect results from index-only scans.
Users should REINDEX any btree indexes having interval-type columns.
After updating, pg_amcheck will report an error for almost all such
indexes.  This fix makes interval_ops simply omit the support function,
like numeric_ops does.  Back-patch to v13, where btequalimage() first
appeared.  In back branches, for the benefit of old catalog content,
btequalimage() code will return false for type "interval".  Going
forward, back-branch initdb will include the catalog change.

Reviewed by Peter Geoghegan.

Discussion: https://postgr.es/m/20231011013317.22.nmi...@google.com

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index dbb83d8..3e07a3e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -31,6 +31,7 @@
 #include "access/xact.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_opfamily_d.h"
 #include "commands/tablecmds.h"
 #include "common/pg_prng.h"
 #include "lib/bloomfilter.h"
@@ -338,10 +339,20 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 				 errmsg("index \"%s\" metapage has equalimage field set on unsupported nbtree version",
 						RelationGetRelationName(indrel))));
 	if (allequalimage && !_bt_allequalimage(indrel, false))
+	{
+		bool		has_interval_ops = false;
+
+		for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(indrel); i++)
+			if (indrel->rd_opfamily[i] == INTERVAL_BTREE_FAM_OID)
+				has_interval_ops = true;
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("index \"%s\" metapage incorrectly indicates that deduplication is safe",
 						RelationGetR

Re: interval_ops shall stop using btequalimage (deduplication)

2023-10-11 Thread Noah Misch
On Tue, Oct 10, 2023 at 09:35:45PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 10, 2023 at 8:51 PM Peter Geoghegan  wrote:
> > I don't see any reason to delay committing your fix. The issue that
> > you've highlighted is exactly the kind of issue that I anticipated
> > might happen at some point. This seems straightforward.
> 
> BTW, we don't need to recommend the heapallindexed option in the
> release notes. Calling bt_check_index() will reliably indicate that
> corruption is present when called against existing interval_ops
> indexes once your fix is in. Simply having an index whose metapage's
> allequalimage field is spuriously set to true will be recognized as
> corruption right away. Obviously, this will be no less true with an
> existing interval_ops index that happens to be completely empty.

Interesting.  So, >99% of interval-type indexes, even ones WITH
(deduplicate_items=off), will get amcheck failures.  The <1% of exceptions
might include indexes having allequalimage=off due to an additional column,
e.g. a two-column (interval, numeric) index.  If interval indexes are common
enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively
rare, that could argue for giving amcheck a special case.  Specifically,
downgrade its "metapage incorrectly indicates that deduplication is safe" from
ERROR to WARNING for interval_ops only.  Without that special case (i.e. with
the v1 patch), the release notes should probably resemble, "After updating,
run REINDEX on all indexes having an interval-type column."  There's little
point in recommending pg_amcheck if >99% will fail.  I'm inclined to bet that
interval-type indexes are rare, so I lean against adding the amcheck special
case.  It's not a strong preference.  Other opinions?

If users want to reduce their exposure now, they could do "ALTER INDEX ... SET
(deduplicate_items = off)" and then REINDEX any indexes already failing
"pg_amcheck --heapallindexed".  allequalimage will remain wrong but have no
ill effects beyond making amcheck fail.  Another REINDEX after the update
would let amcheck pass.




Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

2023-10-10 Thread Noah Misch
On Sun, Oct 08, 2023 at 10:00:03PM -0700, David G. Johnston wrote:
> On Sun, Oct 8, 2023 at 9:10 PM Noah Misch  wrote:
> > I didn't think of any phrasing that clearly explained things without the
> > reader consulting the code.  I considered these:

I'm leaning toward:

  "socket file descriptor out of range for select(): %d" [style guide violation]

A true style guide purist might bury it in a detail message:

  ERROR: socket file descriptor out of range: %d
  DETAIL: select() accepts file descriptors from 0 to 1023, inclusive, in this build.
  HINT: Try fewer concurrent database clients.

> >   "socket file descriptor out of range: %d" [what range?]
> >
> Quick drive-by...but it seems that < 0 is a distinctly different problem
> than exceeding FD_SETSIZE.  I'm unsure what would cause the former but the
> error for the later seems like:
> 
> error: "Requested socket file descriptor %d exceeds connection limit of
> %d", fd, FD_SETSIZE-1
> hint: Reduce the requested number of concurrent connections
> 
> In short, the concept of range doesn't seem applicable here.  There is an
> error state at the max, and some invalid system state condition where the
> position within a set is somehow negative.  These should be separated -
> with the < 0 check happening first.

I view it as: the range of select()-able file descriptors is [0,FD_SETSIZE).
Negative is out of range.

> And apparently this condition isn't applicable when dealing with jobs in
> connect_slot?

For both pgbench.c and parallel_slot.c, there are sufficient negative-FD
checks elsewhere in the file.  Ideally, either both files would have redundant
checks, or neither file would.  I didn't feel the need to mess with that part
of the status quo.

> Also, I see that for connections we immediately issue FD_SET
> so this check on the boundary of the file descriptor makes sense.  But the
> remaining code in connect_slot doesn't involve FD_SET so the test for the
> file descriptor falling within its maximum seems to be coming out of
> nowhere.  Likely this is all good, and the lack of symmetry is just due to
> the natural progressive development of the code, but it stands out to the
> uninitiated looking at this patch.

True.  The approach in parallel_slot.c is to check the FD number each time it
opens a socket, after which its loop with FD_SET() doesn't recheck.  That's a
bit more efficient than the pgbench.c way, because each file may call FD_SET()
many times per socket.  Again, I didn't mess with that part of the status quo.




Re: interval_ops shall stop using btequalimage (deduplication)

2023-10-10 Thread Noah Misch
On Tue, Oct 10, 2023 at 08:12:36PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 10, 2023 at 6:33 PM Noah Misch  wrote:
> > interval_ops, however, recognizes equal-but-distinguishable values:
> 
> > Fails with:
> >
> >   2498151 2023-10-10 05:06:46.177 GMT DEBUG:  building index "ti" on table 
> > "t" serially
> >   2498151 2023-10-10 05:06:46.178 GMT DEBUG:  index "ti" can safely use 
> > deduplication
> >   TRAP: failed Assert("!itup_key->allequalimage || keepnatts == 
> > _bt_keep_natts_fast(rel, lastleft, firstright)"), File: "nbtutils.c", Line: 
> > 2443, PID: 2498151
> 
> Nice catch.
> 
> Out of curiosity, how did you figure this out? Did it just occur to
> you that interval_ops had a behavior that made deduplication unsafe?
> Or did the problem come to your attention after running amcheck on a
> customer database? Or was it something else?

My friend got an amcheck "lacks matching index tuple" failure, and they asked
me about it.  I ran into the assertion failure while reproducing things.

> I'm a little surprised that it took this long to notice
> the interval_ops issue.

Agreed.  I don't usually store interval values in tables, and I'm not sure
I've ever indexed one.  Who knows.

> Do we really need to change the catalog contents when backpatching?

Not really.  I think we usually do.  On the other hand, unlike some past
cases, there's no functional need for the catalog changes.  The catalog
changes just get a bit of efficiency.  No strong preference here.




interval_ops shall stop using btequalimage (deduplication)

2023-10-10 Thread Noah Misch
The btequalimage() header comment says:

 * Generic "equalimage" support function.
 *
 * B-Tree operator classes whose equality function could safely be replaced by
 * datum_image_eq() in all cases can use this as their "equalimage" support
 * function.

interval_ops, however, recognizes equal-but-distinguishable values:

  create temp table t (c interval);
  insert into t values ('1d'::interval), ('24h');
  table t;
  select distinct c from t;

The CREATE INDEX of the following test:

  begin;
  create table t (c interval);
  insert into t select x from generate_series(1,500), (values ('1 year 1 month'::interval), ('1 year 30 days')) t(x);
  select distinct c from t;
  create index ti on t (c);
  rollback;

Fails with:

  2498151 2023-10-10 05:06:46.177 GMT DEBUG:  building index "ti" on table "t" serially
  2498151 2023-10-10 05:06:46.178 GMT DEBUG:  index "ti" can safely use deduplication
  TRAP: failed Assert("!itup_key->allequalimage || keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright)"), File: "nbtutils.c", Line: 2443, PID: 2498151

I've also caught btree posting lists where one TID refers to a '1d' heap
tuple, while another TID refers to a '24h' heap tuple.  amcheck complains.
Index-only scans can return the '1d' bits where the actual tuple had the '24h'
bits.  Are there other consequences to highlight in the release notes?  The
back-branch patch is larger, to fix things without initdb.  Hence, I'm
attaching patches for HEAD and for v16 (trivial to merge back from there).  I
glanced at the other opfamilies permitting deduplication, and they look okay:

[local] test=*# select amproc, amproclefttype = amprocrighttype as l_eq_r, array_agg(array[opfname, amproclefttype::regtype::text]) from pg_amproc join pg_opfamily f on amprocfamily = f.oid where amprocnum = 4 and opfmethod = 403 group by 1,2;
─[ RECORD 1 ]──────────────────────────────────────────────────────────────────
amproc    │ btequalimage
l_eq_r    │ t
array_agg │ {{bit_ops,bit},{bool_ops,boolean},{bytea_ops,bytea},{char_ops,"\"char\""},{datetime_ops,date},{datetime_ops,"timestamp without time zone"},{datetime_ops,"timestamp with time zone"},{network_ops,inet},{integer_ops,smallint},{integer_ops,integer},{integer_ops,bigint},{interval_ops,interval},{macaddr_ops,macaddr},{oid_ops,oid},{oidvector_ops,oidvector},{time_ops,"time without time zone"},{timetz_ops,"time with time zone"},{varbit_ops,"bit varying"},{text_pattern_ops,text},{bpchar_pattern_ops,character},{money_ops,money},{tid_ops,tid},{uuid_ops,uuid},{pg_lsn_ops,pg_lsn},{macaddr8_ops,macaddr8},{enum_ops,anyenum},{xid8_ops,xid8}}
─[ RECORD 2 ]──────────────────────────────────────────────────────────────────
amproc    │ btvarstrequalimage
l_eq_r    │ t
array_agg │ {{bpchar_ops,character},{text_ops,text},{text_ops,name}}

Thanks,
nm
Author: Noah Misch 
Commit: Noah Misch 

Dissociate btequalimage() from interval_ops, ending its deduplication.

Under interval_ops, some equal values are distinguishable.  One such
pair is '24:00:00' and '1 day'.  With that being so, btequalimage()
breaches the documented contract for the "equalimage" btree support
function.  This can cause incorrect results from index-only scans.
Users should REINDEX any indexes having interval-type columns and for
which "pg_amcheck --heapallindexed" reports an error.  This fix makes
interval_ops simply omit the support function, like numeric_ops does.
Back-patch to v13, where btequalimage() first appeared.  In back
branches, for the benefit of old catalog content, btequalimage() code
will return false for type "interval".  Going forward, back-branch
initdb will 

Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

2023-10-08 Thread Noah Misch
On Mon, Oct 09, 2023 at 04:22:52PM +1300, Thomas Munro wrote:
> On Mon, Oct 9, 2023 at 3:25 PM Noah Misch  wrote:
> > The "fd >= FD_SETSIZE" check is irrelevant on Windows.  See comments in the
> > attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
> > The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
> > it's an innocent victim.  Bisect blamed 8488bab "ci: Use windows VMs instead
> > of windows containers" (2023-02), long before the failures began.  I'll guess
> > some 2023-08 Windows update or reconfiguration altered file descriptor
> > assignment, hence the onset of failures.  In my tests of v16, the highest file
> > descriptor was 948.  I could make v16 fail by changing --client=5 to
> > --client=90 in the test.  With the attached patch and --client=90, v16 peaked
> > at file descriptor 2040.
> 
> Ohhh.  Thanks for figuring that out.  I'd never noticed that quirk.  I
> didn't/can't test it but the patch looks reasonable after reading the
> referenced docs.

For what it's worth, I did all that testing through CI, using hacks like
https://cirrus-ci.com/task/5352974499708928 to get the relevant information.
I didn't bother trying non-CI, since buildfarm animals aren't failing.

> Maybe instead of just "out of range" I'd say "out of
> range for select()" since otherwise it might seem a little baffling --
> what range?

I was trying to follow this from the style guide:

  Avoid mentioning called function names, either; instead say what the code was 
trying to do:
  BAD:open() failed: %m
  BETTER: could not open file %s: %m

I didn't think of any phrasing that clearly explained things without the
reader consulting the code.  I considered these:

  "socket file descriptor out of range: %d" [what range?]
  "socket file descriptor out of range for select(): %d" [style guide violation]
  "socket file descriptor out of range for checking whether ready for reading: 
%d" [?]
  "socket file descriptor out of range for synchronous I/O multiplexing: %d" 
[term from POSIX]

> Random water cooler speculation about future ideas:  I wonder
> whether/when we can eventually ditch select() and use WSAPoll() for
> this on Windows, which is supposed to be like poll().  There's a
> comment explaining that we prefer select() because it has a higher
> resolution sleep than poll() (us vs ms), so we wouldn't want to use
> poll() on Unixen, but we know that Windows doesn't even use high
> resolution timers for any user space APIs anyway so that's just not a
> concern on that platform.  The usual reason nobody ever uses WSAPoll()
> is because the Curl guys told[1] everyone that it's terrible in a way
> that would quite specifically break our usage.  But I wonder, because
> the documentation now says "As of Windows 10 version 2004, when a TCP
> socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is
> indicated", it *sounds* like it might have been fixed ~3 years ago?
> One idea would be to hide it inside WaitEventSet, and let WaitEventSet
> be used in front end code (we couldn't use the
> WaitForMultipleObjects() version because it's hard-limited to 64
> events, but in front end code we don't need latches and other stuff,
> so we could have a sockets-only version with WSAPoll()).  The idea of
> using WaitEventSet for pgbench on Unix was already mentioned a couple
> of times by others for general scalability reasons -- that way we
> could finish up using a better kernel interface on all supported
> platforms.  We'd probably want to add high resolution time-out support
> (I already posted a patch for that somewhere), or we'd be back to 1ms
> timeouts.
> 
> [1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/

Interesting.  I have no current knowledge to add there.




Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

2023-10-08 Thread Noah Misch
On Mon, Sep 04, 2023 at 03:18:40PM +1200, Thomas Munro wrote:
> Somehow these tests have recently become unstable and have failed a few times:
> 
> https://github.com/postgres/postgres/commits/REL_15_STABLE
> 
> The failures are like:
> 
> [22:32:26.722] # Failed test 'pgbench simple update stdout
> /(?^:builtin: simple update)/'
> [22:32:26.722] # at t/001_pgbench_with_server.pl line 119.
> [22:32:26.722] # 'pgbench (15.4)
> [22:32:26.722] # '
> [22:32:26.722] # doesn't match '(?^:builtin: simple update)'

Fun.  That's a test of "pgbench -C".  The test harness isn't reporting
pgbench's stderr, so I hacked things to get that and the actual file
descriptor values being assigned.  The failure case gets "pgbench: error: too
many client connections for select()" in stderr, from this pgbench.c function:

static void
add_socket_to_set(socket_set *sa, int fd, int idx)
{
	if (fd < 0 || fd >= FD_SETSIZE)
	{
		/*
		 * Doing a hard exit here is a bit grotty, but it doesn't seem worth
		 * complicating the API to make it less grotty.
		 */
		pg_fatal("too many client connections for select()");
	}
	FD_SET(fd, &sa->fds);
	if (fd > sa->maxfd)
		sa->maxfd = fd;
}

The "fd >= FD_SETSIZE" check is irrelevant on Windows.  See comments in the
attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
it's an innocent victim.  Bisect blamed 8488bab "ci: Use windows VMs instead
of windows containers" (2023-02), long before the failures began.  I'll guess
some 2023-08 Windows update or reconfiguration altered file descriptor
assignment, hence the onset of failures.  In my tests of v16, the highest file
descriptor was 948.  I could make v16 fail by changing --client=5 to
--client=90 in the test.  With the attached patch and --client=90, v16 peaked
at file descriptor 2040.
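
For reference, winsock declares fd_set roughly like this (paraphrased from
winsock2.h), which is why capacity is bounded by the count of sockets rather
than by their numeric values:

typedef struct fd_set
{
	u_int	fd_count;				/* how many SOCKETs are set */
	SOCKET	fd_array[FD_SETSIZE];	/* the handles themselves, any values */
} fd_set;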

Thanks,
nm

P.S. Later, we should change test code so the pgbench stderr can't grow an
extra line without that line appearing in test logs.
Author: Noah Misch 
Commit: Noah Misch 

Don't spuriously report FD_SETSIZE exhaustion on Windows.

Starting on 2023-08-03, this intermittently terminated a "pgbench -C"
test in CI.  It could affect a high-client-count "pgbench" without "-C".
While parallel reindexdb and vacuumdb reach the same problematic check,
sufficient client count and/or connection turnover is less plausible for
them.  Given the lack of examples from the buildfarm or from manual
builds, reproducing this must entail rare operating system
configurations.  Also correct the associated error message, which was
wrong for non-Windows and did not comply with the style guide.
Back-patch to v12, where the pgbench check first appeared.  While v11
vacuumdb has the problematic check, reaching it with typical vacuumdb
usage is implausible.

Reviewed by FIXME.

Discussion: https://postgr.es/m/CA+hUKG+JwvTNdcyJTriy9BbtzF1veSRQ=9m_zkfn9_lqe7k...@mail.gmail.com

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e3919395..a1f566c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7837,14 +7837,22 @@ clear_socket_set(socket_set *sa)
 static void
 add_socket_to_set(socket_set *sa, int fd, int idx)
 {
+   /* See connect_slot() for background on this code. */
+#ifdef WIN32
+   if (sa->fds.fd_count + 1 >= FD_SETSIZE)
+   {
+   pg_log_error("too many concurrent database clients for this 
platform: %d",
+sa->fds.fd_count + 1);
+   exit(1);
+   }
+#else
if (fd < 0 || fd >= FD_SETSIZE)
{
-   /*
-* Doing a hard exit here is a bit grotty, but it doesn't seem 
worth
-* complicating the API to make it less grotty.
-*/
-   pg_fatal("too many client connections for select()");
+   pg_log_error("socket file descriptor out of range: %d", fd);
+   pg_log_error_hint("Try fewer concurrent database clients.");
+   exit(1);
}
+#endif
FD_SET(fd, >fds);
if (fd > sa->maxfd)
sa->maxfd = fd;
diff --git a/src/fe_utils/parallel_slot.c b/src/fe_utils/parallel_slot.c
index aca238b..2ee4b6d 100644
--- a/src/fe_utils/parallel_slot.c
+++ b/src/fe_utils/parallel_slot.c
@@ -295,8 +295,40 @@ connect_slot(ParallelSlotArray *sa, int slotno, const char *dbname)
 	slot->connection = connectDatabase(sa->cparams, sa->progname, sa->echo,
 									   false, true);
sa->cparams->override_dbname = old_override;
 
- 

Re: Trigger violates foreign key constraint

2023-10-08 Thread Noah Misch
On Mon, Oct 02, 2023 at 09:49:53AM -0400, Tom Lane wrote:
> Laurenz Albe  writes:
> > CREATE FUNCTION silly() RETURNS trigger LANGUAGE plpgsql AS 'BEGIN RETURN 
> > NULL; END;';
> > CREATE TRIGGER silly BEFORE DELETE ON child FOR EACH ROW EXECUTE FUNCTION 
> > silly();
> 
> > The trigger function cancels the cascaded delete on "child", and we are 
> > left with
> > a row in "child" that references no row in "parent".
> 
> Yes.  This is by design: triggers operate at a lower level than
> foreign keys, so an ill-conceived trigger can break an FK constraint.
> That's documented somewhere, though maybe not visibly enough.
> 
> There are good reasons to want triggers to be able to see and
> react to FK-driven updates,

I agree with that, but I also think it's a bug that other triggers can
invalidate the constraint, without even going out of their way to do so.
Ideally, triggers would be able to react, yet when all non-superuser-defined
code settles, the constraint would still hold.  While UNIQUE indexes over
expressions aren't that strict, at least for those you need to commit the
clear malfeasance of redefining an IMMUTABLE function.

On Mon, Oct 02, 2023 at 12:02:17PM +0200, Laurenz Albe wrote:
> Perhaps it would be enough to run "RI_FKey_noaction_del" after
> "RI_FKey_cascade_del", although that would impact the performance.

Yes.  A cure that doubles the number of heap fetches would be worse than the
disease, but a more-optimized version of this idea could work.  The FK system
could use a broader optimization-oriented rewrite, to deal with the unbounded
memory usage and redundant I/O.  If that happens, it could be a good time to
plan for closing the trigger hole.
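
For concreteness, here is the full shape of the failure, with hypothetical
"parent"/"child" definitions filled in around Laurenz's trigger (my sketch;
his message showed only the function and trigger):

CREATE TABLE parent (id int PRIMARY KEY);
CREATE TABLE child (id int PRIMARY KEY,
                    parent_id int REFERENCES parent ON DELETE CASCADE);
INSERT INTO parent VALUES (1);
INSERT INTO child VALUES (1, 1);
CREATE FUNCTION silly() RETURNS trigger LANGUAGE plpgsql AS
  'BEGIN RETURN NULL; END;';
CREATE TRIGGER silly BEFORE DELETE ON child
  FOR EACH ROW EXECUTE FUNCTION silly();
DELETE FROM parent WHERE id = 1;  -- succeeds, no error
SELECT * FROM child;              -- still returns (1,1): row orphaned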




Re: Unlogged relation copy is not fsync'd

2023-10-07 Thread Noah Misch
On Wed, Sep 20, 2023 at 11:22:10PM -0700, Noah Misch wrote:
> On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
> > On 05/09/2023 21:20, Robert Haas wrote:
> 
> > Thinking about this some more, I think this is still not 100% correct, even
> > with the patch I posted earlier:
> > 
> > >   /*
> > >* When we WAL-logged rel pages, we must nonetheless fsync them.  The
> > >* reason is that since we're copying outside shared buffers, a 
> > > CHECKPOINT
> > >* occurring during the copy has no way to flush the previously written
> > >* data to disk (indeed it won't know the new rel even exists).  A crash
> > >* later on would replay WAL from the checkpoint, therefore it wouldn't
> > >* replay our earlier WAL entries. If we do not fsync those pages here,
> > >* they might still not be on disk when the crash occurs.
> > >*/
> > >   if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> > >   smgrimmedsync(dst, forkNum);
> > 
> > Let's walk through each case one by one:
> > 
> > 1. Temporary table. No fsync() needed. This is correct.
> > 
> > 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
> > also because we bypassed the buffer manager. Correct.
> 
> Agreed.
> 
> > 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
> > it, because we bypassed the buffer manager. Like the comment explains. This
> > is now correct with the patch.
> 
> This has two subtypes:
> 
> 3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
> you wrote.
> 
> 3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
> fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
> AddPendingSync().  (RelationCreateStorage() could start calling
> AddPendingSync() for this case.  I think we chose not to do that because there
> will never be additional writes to the init fork, and smgrDoPendingSyncs()
> would send the fork to FlushRelationsAllBuffers() even though the fork will
> never appear in shared buffers.  On the other hand, grouping the sync with the
> other end-of-xact syncs could improve efficiency for some filesystems.  Also,
> the number of distinguishable cases is unpleasantly high.)
> 
> > 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
> > WAL-logged it, because we bypassed the buffer manager. Like the comment
> > explains. Correct.
> > 
> > 5. Permanent rel, use_wal == false. We skip fsync, because it means that
> > it's a new relation, so we have a sync request pending for it. (An assertion
> > for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
> > we will either fsync it or we will WAL-log all the pages if it was a small
> > relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
> > WAL-log the pages, we have the same race condition that's explained in the
> > comment, because we bypassed the buffer manager:
> > 
> > 1. RelationCopyStorage() copies some of the pages.
> > 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
> > dirty segment when the relation was created)
> > 3. RelationCopyStorage() copies the rest of the pages.
> > 4. smgrDoPendingSyncs() WAL-logs all the pages.
> 
> smgrDoPendingSyncs() delegates to log_newpage_range().  log_newpage_range()
> loads each page into the buffer manager and calls MarkBufferDirty() on each.
> Hence, ...
> 
> > 5. Another checkpoint happens. It does *not* fsync the relation.
> 
> ... I think this will fsync the relation.  No?
> 
> > 6. Crash.

While we're cataloging gaps, I think the middle sentence is incorrect in the
following heapam_relation_set_new_filelocator() comment:

/*
 * If required, set up an init fork for an unlogged table so that it can
 * be correctly reinitialized on restart.  Recovery may remove it while
 * replaying, for example, an XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 * record.  Therefore, logging is necessary even if wal_level=minimal.
 */
if (persistence == RELPERSISTENCE_UNLOGGED)
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
   rel->rd_rel->relkind == RELKIND_MATVIEW ||
   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
smgrcreate(srel, INIT_FORKNUM, false);
log_smgrcreate(newrlocator, INIT_FORKNUM);
}

XLOG_DBASE_CREATE_FILE_COPY last had that problem before fbcbc5d (2005-06)
made it issue a checkpoint.  XLOG_DBASE_CREATE_WAL_LOG never had that problem.
XLOG_TBLSPC_CREATE last had that problem before 97ddda8a82 (2021-08).  In
general, if we reintroduced such a bug, it would affect all new rels under
wal_level=minimal, not just init forks.  Having said all that,
log_smgrcreate() calls are never conditional on wal_level=minimal; the above
code is correct.




post-recovery amcheck expectations

2023-10-04 Thread Noah Misch
Suppose we start with this nbtree (subset of a diagram from verify_nbtree.c):

 *     1
 *    / \
 *   2 <-> 3

We're deleting 2, the leftmost leaf under a leftmost internal page.  After the
MARK_PAGE_HALFDEAD record, the first downlink from 1 will lead to 3, which
still has a btpo_prev pointing to 2.  bt_index_parent_check() complains here:

	/* The first page we visit at the level should be leftmost */
	if (first && !BlockNumberIsValid(state->prevrightlink) && !P_LEFTMOST(opaque))
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("the first child of leftmost target page is not leftmost of its level in index \"%s\"",
						RelationGetRelationName(state->rel)),
				 errdetail_internal("Target block=%u child block=%u target page lsn=%X/%X.",
									state->targetblock, blkno,
									LSN_FORMAT_ARGS(state->targetlsn))));

One can encounter this if recovery ends between a MARK_PAGE_HALFDEAD record
and its corresponding UNLINK_PAGE record.  See the attached test case.  The
index is actually fine in such a state, right?  I lean toward fixing this by
having amcheck scan left; if left links reach only half-dead or deleted pages,
that's as good as the present child block being P_LEFTMOST.  There's a
different error from bt_index_check(), and I've not yet studied how to fix
that:

  ERROR:  left link/right link pair in index "not_leftmost_pk" not in agreement
  DETAIL:  Block=0 left block=0 left link from block=4294967295.

Alternatively, one could view this as a need for the user to VACUUM between
recovery and amcheck.  The documentation could direct users to "VACUUM
(DISABLE_PAGE_SKIPPING off, INDEX_CLEANUP on, TRUNCATE off)" if not done since
last recovery.  Does anyone prefer that or some other alternative?
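
Spelled out, that per-database routine would look something like the
following (a sketch of the documentation-only alternative, using the index
name from the attached test):

VACUUM (DISABLE_PAGE_SKIPPING off, INDEX_CLEANUP on, TRUNCATE off);
SELECT bt_index_check('not_leftmost_pk');
SELECT bt_index_parent_check('not_leftmost_pk');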

For some other amcheck expectations, the comments suggest reliance on the
bt_index_parent_check() ShareLock.  I haven't tried to make test cases for
them, but perhaps recovery can trick them the same way.  Examples:

  errmsg("downlink or sibling link points to deleted block in index \"%s\"",
  errmsg("block %u is not leftmost in index \"%s\"",
  errmsg("block %u is not true root in index \"%s\"",

Thanks,
nm
Author: Noah Misch 
Commit: Noah Misch 



diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index b82f221..9a7f4f7 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -13,6 +13,7 @@ PGFILEDESC = "amcheck - function for verifying relation integrity"
 REGRESS = check check_btree check_heap
 
 TAP_TESTS = 1
+EXTRA_INSTALL=contrib/pg_walinspect
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/amcheck/t/004_pitr.pl b/contrib/amcheck/t/004_pitr.pl
new file mode 100644
index 000..ec6d87e
--- /dev/null
+++ b/contrib/amcheck/t/004_pitr.pl
@@ -0,0 +1,65 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Test integrity of intermediate states by PITR to those states
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# origin node: generate WAL records of interest.
+my $origin = PostgreSQL::Test::Cluster->new('origin');
+$origin->init(has_archiving => 1, allows_streaming => 1);
+$origin->append_conf('postgresql.conf', 'autovacuum = off');
+$origin->start;
+$origin->backup('my_backup');
+# Create a table with each of 6 PK values spanning 1/4 of a block.  Delete the
+# first four, so one index leaf is eligible for deletion.  Make a replication
+# slot just so pg_walinspect will always have access to later WAL.
+my $setup = <safe_psql('postgres', $setup);
+my $before_vacuum_lsn =
+  $origin->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+# VACUUM to delete the aforementioned leaf page.  Find the LSN of that
+# UNLINK_PAGE record.
+my $exec = <safe_psql('postgres', $exec);
+$origin->stop;
+
+# replica node: amcheck at notable points in the WAL stream
+my $replica = PostgreSQL::Test::Cluster->new('replica');
+$replica->init_from_backup($origin, 'my_backup', has_restoring => 1);
+$replica->append_conf('postgresql.conf',
+   "recovery_target_lsn = '$unlink_lsn'");
+$replica->append_conf('postgresql.conf', 'recovery_target_inclusive = off');
+$replica->append_conf('postgresql.conf', 'recovery_target_action = promote');
+$replica->start;
+$replica->poll_query_until('postgres', "SELECT pg_is_in_recovery(

Re: pgstatindex vs. !indisready

2023-10-01 Thread Noah Misch
On Mon, Oct 02, 2023 at 09:24:33AM +0900, Michael Paquier wrote:
> On Sun, Oct 01, 2023 at 04:20:42PM -0700, Peter Geoghegan wrote:
> > On Sun, Oct 1, 2023 at 2:00 PM Noah Misch  wrote:
> >> [brin_desummarize_range brin_summarize_new_values brin_summarize_range
> >> gin_clean_pending_list] currently succeed.  I propose to make them emit a
> >> DEBUG1 message and return early, like amcheck does, except on !indisready.
> >> This would allow users to continue running them on all indexes of the
> >> applicable access method.  Doing these operations on an
> >> indisready&&!indisvalid index is entirely reasonable, since they relate to
> >> INSERT/UPDATE/DELETE operations.
> 
> Hmm.  Still slightly incorrect in some cases?  Before being switched
> to indisvalid, an index built concurrently may miss some tuples
> deleted before the reference snapshot used to build the index was
> taken.

The !indisvalid index may be missing tuples, yes.  In what way does that make
one of those four operations incorrect?




Re: pgstatindex vs. !indisready

2023-10-01 Thread Noah Misch
On Sun, Oct 01, 2023 at 04:37:25PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > Running pgstatindex on some !indisready indexes fails with "ERROR:  XX001:
> > could not read block 0 in file".  This reproduces it:
> > ...
> > Since XX001 = ERRCODE_DATA_CORRUPTED appears in the "can't-happen" class, 
> > it's
> > not a good fit for this scenario.  I propose to have pgstatindex fail early 
> > on
> > !indisready indexes.
> 
> +1
> 
> > We could go a step further and also fail on
> > indisready&&!indisvalid indexes, which are complete enough to accept inserts
> > but not complete enough to answer queries.  I don't see a reason to do that,
> > but maybe someone else will.
> 
> Hmm.  Seems like the numbers pgstatindex would produce for a
> not-yet-complete index would be rather irrelevant, even if the case
> doesn't risk any outright problems.  I'd be inclined to be
> conservative and insist on indisvalid being true too.

Okay.  If no other preferences appear, I'll do pgstatindex that way.

> > This made me wonder about functions handling unlogged rels during recovery. 
> >  I
> > used the attached hack to test most regclass-arg functions.

I forgot to test the same battery of functions on !indisready indexes.  I've
now done that, using the attached script.  While I didn't get additional
class-XX errors, more should change:

[pgstatginindex pgstathashindex pgstattuple] currently succeed.  Like
pgstatindex, they should ERROR, including in the back branches.

[brin_desummarize_range brin_summarize_new_values brin_summarize_range
gin_clean_pending_list] currently succeed.  I propose to make them emit a
DEBUG1 message and return early, like amcheck does, except on !indisready.
This would allow users to continue running them on all indexes of the
applicable access method.  Doing these operations on an
indisready&&!indisvalid index is entirely reasonable, since they relate to
INSERT/UPDATE/DELETE operations.

[pg_freespace pg_indexes_size pg_prewarm] currently succeed, and I propose to
leave them that way.  No undefined behavior arises.  pg_freespace needs to be
resilient to garbage data anyway, given the lack of WAL for the FSM.  One
could argue for a relkind check in pg_indexes_size.  One could argue for
treating pg_prewarm like amcheck (DEBUG1 and return).
rollback;
begin;
create extension if not exists btree_gin;
create or replace function error(text) returns text language plpgsql as $$
begin
raise exception 'break index build';
return $1;
end;
$$ immutable;
drop table if exists u;
create table u (c text);
insert into u values ('foo');
commit;
create index concurrently ubtree on u (error(c));
create index concurrently ugin on u using gin (error(c));
create index concurrently uhash on u using hash (error(c));
create index concurrently ubrin on u using brin (error(c));

\set utable '''ubtree''::regclass'
\set useq '''ubtree''::regclass'
\set VERBOSITY verbose
begin;
create or replace function error(text) returns text language sql
  as 'select $1' immutable;

SET log_error_verbosity = verbose;
SET log_min_messages = debug1; -- for amcheck
--SET client_min_messages = debug1;

create extension pg_visibility;
create extension amcheck;
create extension pgstattuple;
create extension pg_surgery;
create extension pg_prewarm;
create extension pg_freespacemap;
create extension pgrowlocks;

SELECT * FROM pgrowlocks(:utable::text);
SELECT brin_desummarize_range('ubrin',1);
SELECT brin_summarize_new_values('ubrin');
SELECT brin_summarize_range('ubrin',1);
SELECT bt_index_check('ubtree');
SELECT bt_index_check('ubtree',true);
SELECT bt_index_parent_check('ubtree');
SELECT bt_index_parent_check('ubtree',true);
SELECT bt_index_parent_check('ubtree',true,true);
SELECT gin_clean_pending_list('ugin');
SELECT heap_force_freeze(:utable,array['(0,1)'::tid]);
SELECT heap_force_kill(:utable,array['(0,1)'::tid]);
SELECT nextval(:useq);
SELECT currval(:useq);
SELECT pg_check_frozen(:utable);
SELECT pg_check_visible(:utable);
SELECT pg_column_is_updatable(:utable,'1',true);
--SELECT pg_extension_config_dump(:utable,'select 1');
SELECT pg_freespace(:utable);
SELECT pg_freespace(:utable,1);
SELECT pg_get_replica_identity_index(:utable);
SELECT pg_index_column_has_property(:utable,1,'asc');
SELECT pg_indexes_size(:utable);
SELECT pg_index_has_property('ubrin','asc');
--SELECT pg_nextoid(:utable,name,:utable);
SELECT pg_partition_ancestors(:utable);
SELECT pg_partition_root(:utable);
SELECT pg_partition_tree(:utable);
SELECT pg_prewarm(:utable);
SELECT pg_relation_filenode(:utable);
SELECT pg_relation_filepath(:utable);
SELECT pg_relation_is_publishable(:utable);
SELECT pg_relation_is_updatable(:utable,true);
SELECT pg_relation_size(:utable);
SELECT pg_relation_size(:utable,'main');
SELECT pg_relpages(:utable);
SELECT pg_sequence_last_value(:useq);
SELECT pgstatginindex('ugi

pgstatindex vs. !indisready

2023-10-01 Thread Noah Misch
Running pgstatindex on some !indisready indexes fails with "ERROR:  XX001:
could not read block 0 in file".  This reproduces it:

begin;
drop table if exists not_indisready_t;
create table not_indisready_t (c int);
insert into not_indisready_t values (1),(1);
commit;
create unique index concurrently not_indisready_i on not_indisready_t(c);
begin;
create extension pgstattuple;
\set VERBOSITY verbose
select * from pgstatindex('not_indisready_i');
\set VERBOSITY default
rollback;

Since XX001 = ERRCODE_DATA_CORRUPTED appears in the "can't-happen" class, it's
not a good fit for this scenario.  I propose to have pgstatindex fail early on
!indisready indexes.  We could go a step further and also fail on
indisready&&!indisvalid indexes, which are complete enough to accept inserts
but not complete enough to answer queries.  I don't see a reason to do that,
but maybe someone else will.
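
To see the catalog state behind the failure, one can inspect the index's
pg_index flags after the failed CREATE INDEX CONCURRENTLY; the expected
output here is my reading of the failure mode, not captured from a run:

SELECT indisready, indisvalid
FROM pg_index
WHERE indexrelid = 'not_indisready_i'::regclass;
-- expecting indisready = f, indisvalid = f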

This made me wonder about functions handling unlogged rels during recovery.  I
used the attached hack to test most regclass-arg functions.  While some of
these errors from src/test/recovery/tmp_check/log/001_stream_rep_standby_2.log
may deserve improvement, there were no class-XX messages:

2023-10-01 12:24:05.992 PDT [646914:11] 001_stream_rep.pl ERROR:  58P01: could not open file "base/5/16862": No such file or directory
2023-10-01 12:24:05.996 PDT [646914:118] 001_stream_rep.pl ERROR:  22023: fork "main" does not exist for this relation

Thanks,
nm
Author: Noah Misch 
Commit: Noah Misch 



diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index c60314d..ad54859 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -10,6 +10,14 @@
 #-
 
 EXTRA_INSTALL=contrib/pg_prewarm contrib/pg_stat_statements contrib/test_decoding
+EXTRA_INSTALL += \
+  contrib/pg_visibility \
+  contrib/amcheck \
+  contrib/pgstattuple \
+  contrib/btree_gin \
+  contrib/pg_surgery \
+  contrib/pg_freespacemap \
+  contrib/pgrowlocks
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c72ba0..e441cb2 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -45,6 +45,24 @@ $node_standby_2->start;
 # Create some content on primary and check its presence in standby nodes
 $node_primary->safe_psql('postgres',
"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
+$node_primary->safe_psql('postgres', "
+create extension pg_visibility;
+create extension amcheck;
+create extension pgstattuple;
+create extension btree_gin;
+create extension pg_surgery;
+create extension pg_prewarm;
+create extension pg_freespacemap;
+create extension pgrowlocks;
+
+create unlogged sequence useq;
+create unlogged table u (c text);
+insert into u values ('foo');
+create index ubtree on u (c);
+create index ugin on u using gin (c);
+create index uhash on u using hash (c);
+create index ubrin on u using brin (c);
+");
 
 # Wait for standbys to catch up
 $node_primary->wait_for_replay_catchup($node_standby_1);
@@ -60,6 +78,88 @@ $result =
 print "standby 2: $result\n";
 is($result, qq(1002), 'check streamed content on standby 2');
 
+# Call regclass-arg functions on unlogged objects.  Function list from:
+#
+# DO $$
+# DECLARE
+#  extname text;
+# BEGIN
+#  FOR extname in SELECT name from pg_available_extensions LOOP
+#  EXECUTE format('CREATE EXTENSION IF NOT EXISTS %I CASCADE', 
extname);
+#  END LOOP;
+# END
+# $$;
+# select oid::regprocedure from pg_proc
+# where proargtypes::oid[] @> array['regclass'::regtype::oid]
+# order by oid::regprocedure::text;
+#
+# Removed pageinspect functions, since the superuser might choose to run them
+# in cases we wouldn't expect.  Added pgrowlocks(text).
+$node_standby_2->psql('postgres', <<'EOSQL', on_error_stop => 0);
+\set utable '''u''::regclass'
+\set VERBOSITY verbose
+SET log_error_verbosity = verbose;
+SET log_min_messages = debug1; -- for amcheck
+SET client_min_messages = debug1;
+
+SELECT * FROM pgrowlocks(:utable::text);
+SELECT brin_desummarize_range('ubrin',1);
+SELECT brin_summarize_new_values('ubrin');
+SELECT brin_summarize_range('ubrin',1);
+SELECT bt_index_check('ubtree');
+SELECT bt_index_check('ubtree',true);
+SELECT bt_index_parent_check('ubtree');
+SELECT bt_index_parent_check('ubtree',true);
+SELECT bt_index_parent_check('ubtree',true,true);
+SELECT gin_clean_pending_list('ugin');
+SELECT heap_force_freeze(:utable,array['(0,1)'::tid]);
+SELECT heap_force_kill(:utable,array['(0,1)'::tid]);
+SELECT nextval('useq');
+SELECT currval('useq');
+SELECT pg_check_frozen(:utable);
+SELECT pg_check_visible(:utable);
+SELECT pg_column_is_updatable(:utable,'1',true);
+--SELECT pg_extension_config_dump(:utable,'select 1');
+SEL

Re: Testing autovacuum wraparound (including failsafe)

2023-09-29 Thread Noah Misch
On Fri, Sep 29, 2023 at 12:17:04PM +0200, Daniel Gustafsson wrote:
> +# Bump the query timeout to avoid false negatives on slow test systems.
> +my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
> 
> Should we bump the timeout like this for all systems?  I interpreted Noah's
> comment such that it should be possible for slower systems to override, not
> that it should be extended everywhere, but I might have missed something.

This is the conventional way to do it.  For an operation far slower than a
typical timeout_default situation, the patch can and should dilate the default
timeout like this.  The patch version as of my last comment was extending the
timeout but also blocking users from extending it further via
PG_TEST_TIMEOUT_DEFAULT.  The latest version restores PG_TEST_TIMEOUT_DEFAULT
reactivity, resolving my comment.




Re: Fail hard if xlogreader.c fails on out-of-memory

2023-09-26 Thread Noah Misch
On Wed, Sep 27, 2023 at 11:06:37AM +1300, Thomas Munro wrote:
> On Tue, Sep 26, 2023 at 8:38 PM Michael Paquier  wrote:
> > Thoughts and/or comments are welcome.
> 
> I don't have an opinion yet on your other thread about making this
> stuff configurable for replicas, but for the simple crash recovery
> case shown here, hard failure makes sense to me.

> Recycled pages can't fool us into making a huge allocation any more.
> If xl_tot_len implies more than one page but the next page's
> xlp_pageaddr is too low, then either the xl_tot_len you read was
> recycled garbage bits, or it was legitimate but the overwrite of the
> following page didn't make it to disk; either way, we don't have a
> record, so we have an end-of-wal condition.  The xlp_rem_len check
> defends against the second page making it to disk while the first one
> still contains recycled garbage where the xl_tot_len should be*.
> 
> What Michael wants to do now is remove the 2004-era assumption that
> malloc failure implies bogus data.  It must be pretty unlikely in a 64
> bit world with overcommitted virtual memory, but a legitimate
> xl_tot_len can falsely end recovery and lose data, as reported from a
> production case analysed by his colleagues.  In other words, we can
> actually distinguish between lack of resources and recycled bogus
> data, so why treat them the same?

Indeed.  Hard failure is fine, and ENOMEM=end-of-WAL definitely isn't.

> *A more detailed analysis would talk about sectors (page header is
> atomic)

I think the page header is atomic on POSIX-compliant filesystems but not
atomic on ext4.  That doesn't change the conclusion on $SUBJECT.




Re: Unlogged relation copy is not fsync'd

2023-09-21 Thread Noah Misch
On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
> On 05/09/2023 21:20, Robert Haas wrote:

> Thinking about this some more, I think this is still not 100% correct, even
> with the patch I posted earlier:
> 
> > /*
> >  * When we WAL-logged rel pages, we must nonetheless fsync them.  The
> >  * reason is that since we're copying outside shared buffers, a 
> > CHECKPOINT
> >  * occurring during the copy has no way to flush the previously written
> >  * data to disk (indeed it won't know the new rel even exists).  A crash
> >  * later on would replay WAL from the checkpoint, therefore it wouldn't
> >  * replay our earlier WAL entries. If we do not fsync those pages here,
> >  * they might still not be on disk when the crash occurs.
> >  */
> > if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> > smgrimmedsync(dst, forkNum);
> 
> Let's walk through each case one by one:
> 
> 1. Temporary table. No fsync() needed. This is correct.
> 
> 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
> also because we bypassed the buffer manager. Correct.

Agreed.

> 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
> it, because we bypassed the buffer manager. Like the comment explains. This
> is now correct with the patch.

This has two subtypes:

3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
you wrote.

3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
AddPendingSync().  (RelationCreateStorage() could start calling
AddPendingSync() for this case.  I think we chose not to do that because there
will never be additional writes to the init fork, and smgrDoPendingSyncs()
would send the fork to FlushRelationsAllBuffers() even though the fork will
never appear in shared buffers.  On the other hand, grouping the sync with the
other end-of-xact syncs could improve efficiency for some filesystems.  Also,
the number of distinguishable cases is unpleasantly high.)

> 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
> WAL-logged it, because we bypassed the buffer manager. Like the comment
> explains. Correct.
> 
> 5. Permanent rel, use_wal == false. We skip fsync, because it means that
> it's a new relation, so we have a sync request pending for it. (An assertion
> for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
> we will either fsync it or we will WAL-log all the pages if it was a small
> relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
> WAL-log the pages, we have the same race condition that's explained in the
> comment, because we bypassed the buffer manager:
> 
> 1. RelationCopyStorage() copies some of the pages.
> 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
> dirty segment when the relation was created)
> 3. RelationCopyStorage() copies the rest of the pages.
> 4. smgrDoPendingSyncs() WAL-logs all the pages.

smgrDoPendingSyncs() delegates to log_newpage_range().  log_newpage_range()
loads each page into the buffer manager and calls MarkBufferDirty() on each.
Hence, ...

> 5. Another checkpoint happens. It does *not* fsync the relation.

... I think this will fsync the relation.  No?

> 6. Crash.




Re: Optionally using a better backtrace library?

2023-09-04 Thread Noah Misch
On Mon, Jul 03, 2023 at 11:58:25AM +0200, Alvaro Herrera wrote:
> On 2023-Jul-02, Andres Freund wrote:
> > I like that we now have a builtin backtrace ability. Unfortunately I think 
> > the
> > backtraces are often not very useful, because only externally visible
> > functions are symbolized.
> 
> Agreed, these backtraces are pretty close to useless.  Not completely,
> but I haven't found a practical way to use them for actual debugging
> of production problems.

For what it's worth, I use the attached script to convert the current
errbacktrace output to a fully-symbolized backtrace.  Nonetheless, ...

> > I hacked it up for ereport() to debug something, and the backtraces are
> > considerably better:
> > 
> > 2023-07-02 10:52:54.863 PDT [1398207][client backend][:0][[unknown]] LOG:  
> > will crash
> > 2023-07-02 10:52:54.863 PDT [1398207][client backend][:0][[unknown]] 
> > BACKTRACE:
> > [0x55fcd03e6143] PostgresMain: 
> > ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4126
> > [0x55fcd031154c] BackendRun: 
> > ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4461
> > [0x55fcd0310dd8] BackendStartup: 
> > ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4189
> > [0x55fcd030ce75] ServerLoop: 
> > ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1779
> 
> Yeah, this looks much more usable.

... +1 for offering this.
#! /usr/bin/perl

# Usage: errbacktrace2line EXE_FILE ) {
if (/^[[:space:]]*(\/.*\.so)$last2_fields/) {
my($obj, $offset, $return_addr) = ($1, $2, $3);
conditional_flush_addr $obj;
push @addr, ($offset || $return_addr);
} elsif (/$last2_fields/) {
my($offset, $return_addr) = ($1, $2);
conditional_flush_addr $bin;
push @addr, ($offset || $return_addr);
}
}

flush_addr;


Re: backtrace_functions emits trace for any elog

2023-09-04 Thread Noah Misch
On Mon, Sep 04, 2023 at 09:30:32PM +0100, Ilya Gladyshev wrote:
> I used backtrace_functions to debug one of my ideas and found its behavior 
> counter-intuitive and contradictory to it own docs. I think the GUC is 
> supposed to be used to dump backtrace only on elog(ERROR) (should it also be 
> done for higher levels? not sure about this), but, in fact, it does that for 
> any log-level. I have attached a patch that checks log-level before attaching 
> backtrace.

This would make the feature much less useful.  Better to change the docs.




Re: Create shorthand for including all extra tests

2023-09-04 Thread Noah Misch
On Mon, Sep 04, 2023 at 04:30:31PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Mon, Sep 04, 2023 at 08:16:44PM +0200, Daniel Gustafsson wrote:
> >> On 4 Sep 2023, at 17:01, Tom Lane  wrote:
> >>> I think this is a seriously bad idea.  The entire point of not including
> >>> certain tests in check-world by default is that the omitted tests are
> >>> security hazards, so a developer or buildfarm owner should review each
> >>> one before deciding whether to activate it on their machine.
> 
> > Other than PG_TEST_EXTRA=wal_consistency_checking, they have the same 
> > hazard:
> > they treat the loopback interface as private, so anyone with access to
> > loopback interface ports can hijack the test.  I'd be fine with e.g.
> > PG_TEST_EXTRA=private-lo activating all of those.  We don't gain by inviting
> > the tester to review the tests to rediscover this common factor.
> 
> Yeah, I could live with something like that from the security standpoint.
> Not sure if it helps Nazir's use-case though.  Maybe we could invent
> categories that can be used in place of individual test names?
> For now,
> 
>   PG_TEST_EXTRA="needs-private-lo slow"
> 
> would cover the territory of "all", and I think it'd be very seldom
> that we'd have to invent new categories here (though maybe I lack
> imagination today).

I could imagine categories for filesystem bytes and RAM bytes.  Also, while
needs-private-lo has a bounded definition, "slow" doesn't.  If today's one
"slow" test increases check-world duration by 1.1x, we may not let a
100x-increase test use the same keyword.

If one introduced needs-private-lo, the present spelling of "all" would be
"needs-private-lo wal_consistency_checking".  Looks okay to me.  Doing nothing
here wouldn't be ruinous, of course.




Re: Create shorthand for including all extra tests

2023-09-04 Thread Noah Misch
On Mon, Sep 04, 2023 at 08:16:44PM +0200, Daniel Gustafsson wrote:
> > On 4 Sep 2023, at 17:01, Tom Lane  wrote:
> > Nazir Bilal Yavuz  writes:
> >> I created an 'all' option for PG_TEST_EXTRA to enable all test suites
> >> defined under PG_TEST_EXTRA.
> > 
> > I think this is a seriously bad idea.  The entire point of not including
> > certain tests in check-world by default is that the omitted tests are
> > security hazards, so a developer or buildfarm owner should review each
> > one before deciding whether to activate it on their machine.
> 
> I dunno, I've certainly managed to not run the tests I hoped to multiple 
> times.
> I think it could be useful for sandboxed testrunners which are destroyed after
> each run. There is for sure a foot-gun angle to it, no question about that.

Other than PG_TEST_EXTRA=wal_consistency_checking, they have the same hazard:
they treat the loopback interface as private, so anyone with access to
loopback interface ports can hijack the test.  I'd be fine with e.g.
PG_TEST_EXTRA=private-lo activating all of those.  We don't gain by inviting
the tester to review the tests to rediscover this common factor.




Re: Testing autovacuum wraparound (including failsafe)

2023-09-02 Thread Noah Misch
On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote:
> > On 12 Jul 2023, at 09:52, Masahiko Sawada  wrote:
> > Agreed. The timeout can be set by manually setting
> > PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it
> > now require setting PG_TET_EXTRA to run it.
> 
> +# bump the query timeout to avoid false negatives on slow test syetems.
> typo: s/syetems/systems/
> 
> 
> +# bump the query timeout to avoid false negatives on slow test syetems.
> +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600;
> Does this actually work?  Utils.pm read the environment variable at compile
> time in the BEGIN block so this setting won't be seen?  A quick testprogram
> seems to confirm this but I might be missing something.

The correct way to get a longer timeout is "IPC::Run::timer(4 *
$PostgreSQL::Test::Utils::timeout_default);".  Even if changing env worked,
that would be removing the ability for even-slower systems to set timeouts
greater than 10min.




Re: Query execution in Perl TAP tests needs work

2023-08-29 Thread Noah Misch
On Tue, Aug 29, 2023 at 04:25:24PM +1200, Thomas Munro wrote:
> On Tue, Aug 29, 2023 at 1:48 PM Noah Misch  wrote:
> > https://github.com/cpan-authors/IPC-Run/issues/166#issuecomment-1288190929
> 
> Interesting.  But that shows a case with no pipes connected, using
> select() as a dumb sleep and ignoring SIGCHLD.  In our usage we have
> pipes connected, and I think select() should return when the child's
> output pipes become readable due to EOF.  I guess something about that
> might be b0rked on Windows?  I see there is an extra helper process
> doing socket<->pipe conversion (hah, that explains an extra ~10ms at
> the start in my timestamps)...

In that case, let's assume it's not the same issue.




Re: Query execution in Perl TAP tests needs work

2023-08-28 Thread Noah Misch
On Mon, Aug 28, 2023 at 05:29:56PM +1200, Thomas Munro wrote:
> CI shows Windows
> consuming 4 CPUs at 100% for a full 10 minutes to run a test suite
> that finishes in 2-3 minutes everywhere else with the same number of
> CPUs.  Could there be an event handling snafu in IPC::Run or elsewhere
> nearby?  It seems like there must be either a busy loop or a busted
> sleep/wakeup... somewhere?

That smells like this one:
https://github.com/cpan-authors/IPC-Run/issues/166#issuecomment-1288190929

> As an experiment, I hacked up a not-good-enough-to-share experiment
> where $node->safe_psql() would automatically cache a BackgroundPsql
> object and reuse it, and the times for that test dropped ~51 -> ~9s on
> Windows, and ~7 -> ~2s on the Unixen.

Nice!

> suppose there are quite a few ways we could do better:
> 
> 1.  Don't fork anything at all: open (and cache) a connection directly
> from Perl.
> 1a.  Write xsub or ffi bindings for libpq.  Or vendor (parts) of the
> popular Perl xsub library?
> 1b.  Write our own mini pure-perl pq client module.  Or vendor (parts)
> of some existing one.
> 2.  Use long-lived psql sessions.
> 2a.  Something building on BackgroundPsql.
> 2b.  Maybe give psql or a new libpq-wrapper a new low level stdio/pipe
> protocol that is more fun to talk to from Perl/machines?

(2a) seems adequate and easiest to achieve.  If DBD::Pg were under a
compatible license, I'd say use it as the vendor for (1a).  Maybe we can get
it relicensed?  Building a separate way of connecting from Perl would be sad.




pg_waldump vs. all-zeros WAL files; server creation of such files

2023-08-12 Thread Noah Misch
The attached 010_zero.pl, when run as part of the pg_waldump test suite, fails
at today's master (c36b636) and v15 (1bc19df).  It passes at v14 (5a32af3).
Command "pg_waldump --start 0/0100 --end 0/01000100" fails as follows:

  pg_waldump: error: WAL segment size must be a power of two between 1 MB and 1 GB, but the WAL file "00010002" header specifies 0 bytes

Where it fails, the server has created an all-zeros WAL file under that name.
Where it succeeds, that file doesn't exist at all.  Two decisions to make:

- Should a clean server shutdown ever leave an all-zeros WAL file?  I think
  yes, it's okay to let that happen.
- Should "pg_waldump --start $X --end $Y" open files not needed for the
  requested range?  I think no.

Bisect of master got:
30a53b7 Wed Mar 8 16:56:37 2023 +0100 Allow tailoring of ICU locales with 
custom rules
Doesn't fail at $(git merge-base REL_15_STABLE master).  Bisect of v15 got:
811203d Sat Aug 6 11:50:23 2022 -0400 Fix data-corruption hazard in WAL-logged 
CREATE DATABASE.

I suspect those are innocent.  They changed the exact WAL content, which I
expect somehow caused creation of segment 2.

Oddly, I find only one other report of this:
https://www.postgresql.org/message-id/CAJ6DU3HiJ5FHbqPua19jAD%3DwLgiXBTjuHfbmv1jCOaNOpB3cCQ%40mail.gmail.com

Thanks,
nm


010_zero.pl
Description: Perl program


Re: Inconsistent results with libc sorting on Windows

2023-08-11 Thread Noah Misch
On Fri, Aug 11, 2023 at 11:48:18AM +0200, Juan José Santamaría Flecha wrote:
> On Fri, Aug 11, 2023 at 7:29 AM Noah Misch  wrote:
> 
> >
> > The LCMapStringEx() solution is elegant.  I do see
> >
> > https://learn.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications
> > says, "If an application calls the function to create a sort key for a
> > string
> > containing an Arabic kashida, the function creates no sort key value."
> > That's
> > aggravating.
> >
> 
> I think the problem there is that it is just poorly explained.
> Take for example the attached program that compares "normal" and "kashida"
> 'Raħīm' taken from [1], you'll get:
> 
> c1 = c2
> c2 = c1
> 
> meaning that "normal" and "kashida" are the same string. So, probably that
> phrase should read something like: "If an application calls the function to
> create a sort key for a string containing an Arabic kashida character, the
> function will ignore that character and no sort key value will be generated
> for it."

Good.  That sounds fine.  Thanks for clarifying.




Re: Inconsistent results with libc sorting on Windows

2023-08-10 Thread Noah Misch
On Wed, Jun 14, 2023 at 12:50:28PM +0200, Juan José Santamaría Flecha wrote:
> On Wed, Jun 14, 2023 at 4:13 AM Peter Geoghegan  wrote:
> 
> > On Tue, Jun 13, 2023 at 5:59 PM Thomas Munro 
> > wrote:
> > > Trying to follow along here... you're doing the moral equivalent of
> > > strxfrm(), so sort keys have the transitive property but direct string
> > > comparisons don't?  Or is this because LCIDs reach a different
> > > algorithm somehow (or otherwise why do you need to use LCIDs for this,
> > > when there is a non-LCID version of that function, with a warning not
> > > to use the older LCID version[1]?)
> >
> > I'm reminded of the fact that the abbreviated keys strxfrm() debacle
> > (back when 9.5 was released) was caused by a bug in strcoll() -- not a
> > bug in strxfrm() itself. From our point of view the problem was that
> > strxfrm() failed to be bug compatible with strcoll() due to a buggy
> > strcoll() optimization.
> >
> > I believe that strxfrm() is generally less likely to have bugs than
> > strcoll(). There are far fewer opportunities to dodge unnecessary work
> > in the case of strxfrm()-like algorithms (offering something like
> > ICU's pg_strnxfrm_prefix_icu() prefix optimization is the only one).
> > On the other hand, collation library implementers are likely to
> > heavily optimize strcoll() for typical use-cases such as sorting and
> > binary search. Using strxfrm() for everything is discouraged [1].
> >
> 
> Yes, I think the situation is quite similar to what you describe, with its
> WIN32 peculiarities. Take for example the attached program, it'll output:
> 
> s1 = s2
> s2 = s3
> s1 > s3
> c1 > c2
> c2 > c3
> c1 > c3
> 
> As you can see the test for CompareStringEx() is broken, but we get a sane
> answer with LCMapStringEx().

The LCMapStringEx() solution is elegant.  I do see
https://learn.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications
says, "If an application calls the function to create a sort key for a string
containing an Arabic kashida, the function creates no sort key value."  That's
aggravating.




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-10 Thread Noah Misch
On Tue, Aug 08, 2023 at 07:59:55PM -0700, Andres Freund wrote:
> On 2023-08-08 22:29:50 -0400, Robert Treat wrote:

> 3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)

> > The other likely option would be to seek out cloud credits

> I tried to start that progress within microsoft, fwiw.  Perhaps Joe and
> Jonathan know how to start within AWS?  And perhaps Noah inside GCP?
> 
> It'd be the least work to get it up and running in GCP, as it's already
> running there

I'm looking at this.  Thanks for bringing it to my attention.




Re: 2023-08-10 release announcement draft

2023-08-08 Thread Noah Misch
On Mon, Aug 07, 2023 at 10:03:44PM -0400, Jonathan S. Katz wrote:
> Fixes in PostgreSQL 16 Beta 3
> -
> 
> The following includes fixes included in PostgreSQL 16 Beta 3:

With both "includes" and "included" there, this line reads awkwardly to me.
I'd just delete the line, since the heading has the same information.

> * Add the `\drg` command to `psql` to display information about role grants.
> * Add timeline ID to filenames generated with `pg_waldump --save-fullpage`.
> * Fix crash after a deadlock occurs in a parallel `VACUUM` worker.
> 
> Please see the [release 
> notes](https://www.postgresql.org/docs/16/release-16.html)
> for a complete list of new and changed features:
> 
>   
> [https://www.postgresql.org/docs/16/release-16.html](https://www.postgresql.org/docs/16/release-16.html)




Re: PG 16 draft release notes ready

2023-08-05 Thread Noah Misch
On Thu, May 18, 2023 at 04:49:47PM -0400, Bruce Momjian wrote:
>   https://momjian.us/pgsql_docs/release-16.html

> 
> Restrict the privileges of CREATEROLE roles (Robert Haas)
> 
> Previously roles with CREATEROLE privileges could change many aspects of any 
> non-superuser role.  Such changes, including adding members, now require the 
> role requesting the change to have ADMIN OPTION
> permission.
> 
> Improve logic of CREATEROLE roles ability to control other roles (Robert Haas)
> 
> For example, they can change the CREATEDB, REPLICATION, and BYPASSRLS 
> properties only if they also have those permissions.
> 

CREATEROLE is a radically different feature in v16.  In v15-, it was an
almost-superuser.  In v16, informally speaking, it can create and administer
its own collection of roles, but it can't administer roles outside its
collection or grant memberships or permissions not offered to itself.  Hence,
let's move these two into the incompatibilities section.  Let's also merge
them, since f1358ca52 is just doing to clauses like CREATEDB what cf5eb37c5
did to role memberships.

> 
> Allow GRANT to control role inheritance behavior (Robert Haas)
> 
> By default, role inheritance is controlled by the inheritance status of the 
> member role.  The new GRANT clauses WITH INHERIT and WITH ADMIN can now 
> override this.
> 
> Allow roles that create other roles to automatically inherit the new role's 
> rights or SET ROLE to the new role (Robert Haas, Shi Yu)
> 
> This is controlled by server variable createrole_self_grant.
> 

Similarly, v16 radically changes the CREATE ROLE ... WITH INHERIT clause.  The
clause used to "change the behavior of already-existing grants."  Let's merge
these two and move the combination to the incompatibilities section.
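
A sketch of the grant-level control being described, again mine rather than
taken from the notes:

CREATE ROLE a NOINHERIT;
CREATE ROLE b;
GRANT b TO a;                    -- default: inheritance follows a's NOINHERIT
GRANT b TO a WITH INHERIT TRUE;  -- v16: this membership inherits regardless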

> Remove libpq support for SCM credential authentication (Michael Paquier)

Since the point of removing it is the deep unlikelihood of anyone using it, I
wouldn't list this in "incompatibilities".

> Deprecate createuser option --role (Nathan Bossart)

This is indeed a deprecation, not a removal.  By the definition of
"deprecate", it's not an incompatibility.




Re: Getting rid of OverrideSearhPath in namespace.c

2023-07-31 Thread Noah Misch
On Mon, Jul 17, 2023 at 05:11:46PM +0300, Aleksander Alekseev wrote:
> > As a follow-up for the CVE-2023-2454 fix, I think that it makes sense to
> > completely remove unsafe functions
> > PushOverrideSearchPath()/PopOverrideSearchPath(), which are not used in the
> > core now.
> > Please look at the patch attached.
> >
> > [...]
> >
> > What do you think?
> 
> +1 to remove dead code.
> 
> The proposed patch however removes get_collation_oid(), apparently by
> mistake. Other than that the patch looks fine and passes `make
> installcheck-world`.
> 
> I added an entry to the nearest CF [1].
> 
> > Beside that, maybe it's worth to rename three functions in "Override" in
> > their names: GetOverrideSearchPath(), CopyOverrideSearchPath(),
> > OverrideSearchPathMatchesCurrent(), and then maybe struct 
> > OverrideSearchPath.
> > Noah Misch proposed name GetSearchPathMatcher() for the former.
> 
> +1 as well. I added the corresponding 0002 patch.

Pushed both.  Thanks.




Re: Postgres v15 windows bincheck regression test failures

2023-07-29 Thread Noah Misch
On Fri, Jul 28, 2023 at 04:00:00PM +0300, Alexander Lakhin wrote:
> 28.07.2023 14:42, Noah Misch wrote:
> >That was about a bug that appears when using TCP sockets. ...
> 
> Yes, and according to the failed test output, TCP sockets were used:
> 
> #   'pg_amcheck: error: connection to server at
> "127.0.0.1", port 49393 failed: server closed the connection
> unexpectedly
> #   This probably means the server terminated abnormally
> #   before or while processing the request.

I think we were talking about different details.  Agreed, bug #16678 probably
did cause the failure in the original post.  I was saying that bug has no
connection to the "scary comment", though.




Re: Postgres v15 windows bincheck regression test failures

2023-07-28 Thread Noah Misch
On Fri, Jul 28, 2023 at 07:00:01AM +0300, Alexander Lakhin wrote:
> 28.07.2023 05:17, Noah Misch wrote:
> >On Tue, Jun 20, 2023 at 07:49:52AM -0400, Russell Foster wrote:
> >>/*
> >>* We don't use Unix-domain sockets on Windows by default, even if the
> >>* build supports them.  (See comment at remove_temp() for a reason.)
> >>* Override at your own risk.
> >>*/
> >>
> >>Is there some sort of race condition in the SSPI code that sometimes
> >>doesn't gracefully finish/close the connection when the backend
> >>decides to exit due to error?
> >No.  remove_temp() is part of test driver "pg_regress".  Non-test usage is
> >unaffected.  Even for test usage, folks have reported no failures from the
> >cause mentioned in the remove_temp() comment.
> 
> It seems to me that it's just another manifestation of bug #16678 ([1]).
> See also commits 6051857fc and 29992a6a5.
> 
> [1] 
> https://www.postgresql.org/message-id/flat/16678-253e48d34dc0c376%40postgresql.org

That was about a bug that appears when using TCP sockets.  The remove_temp()
comment involves code that doesn't run when using TCP sockets.  I don't think
they can be manifestations of the same phenomenon.




Re: Postgres v15 windows bincheck regression test failures

2023-07-27 Thread Noah Misch
On Tue, Jun 20, 2023 at 07:49:52AM -0400, Russell Foster wrote:
> /*
> * We don't use Unix-domain sockets on Windows by default, even if the
> * build supports them.  (See comment at remove_temp() for a reason.)
> * Override at your own risk.
> */
> 
> Is there some sort of race condition in the SSPI code that sometimes
> doesn't gracefully finish/close the connection when the backend
> decides to exit due to error?

No.  remove_temp() is part of test driver "pg_regress".  Non-test usage is
unaffected.  Even for test usage, folks have reported no failures from the
cause mentioned in the remove_temp() comment.



