Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 10:03:04PM +, Amonson, Paul D wrote:
>> * I think we need to verify there isn't a huge performance regression for
>>   smaller arrays.  IIUC those will still require an AVX512 instruction or
>>   two as well as a function call, which might add some noticeable overhead.
> 
> Not considering your changes, I had already tested small buffers. At less
> than 512 bytes there was no measurable regression (there was one extra
> condition check) and for 512+ bytes it moved from no regression to some
> gains between 512 and 4096 bytes. Assuming you introduced no extra
> function calls, it should be the same.

Cool.  I think we should run the benchmarks again to be safe, though.

>> I forgot to mention that I also want to understand whether we can
>> actually assume availability of XGETBV when CPUID says we support
>> AVX512:
> 
> You cannot assume as there are edge cases where AVX-512 was found on
> system one during compile but it's not actually available in a kernel on
> a second system at runtime despite the CPU actually having the hardware
> feature.

Yeah, I understand that much, but I want to know how portable the XGETBV
instruction is.  Unless I can assume that all x86_64 systems and compilers
support that instruction, we might need an additional configure check
and/or CPUID check.  It looks like MSVC has had support for the _xgetbv
intrinsic for quite a while, but I'm still researching the other cases.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> -----Original Message-----
>
> Cool.  I think we should run the benchmarks again to be safe, though.

Ok, sure go ahead. :)

> >> I forgot to mention that I also want to understand whether we can
> >> actually assume availability of XGETBV when CPUID says we support
> >> AVX512:
> >
> > You cannot assume as there are edge cases where AVX-512 was found on
> > system one during compile but it's not actually available in a kernel
> > on a second system at runtime despite the CPU actually having the
> > hardware feature.
> 
> Yeah, I understand that much, but I want to know how portable the XGETBV
> instruction is.  Unless I can assume that all x86_64 systems and compilers
> support that instruction, we might need an additional configure check and/or
> CPUID check.  It looks like MSVC has had support for the _xgetbv intrinsic for
> quite a while, but I'm still researching the other cases.

I see Google web references to the xgetbv instruction as far back as 2009 for
Intel 64-bit HW and 2010 for AMD 64-bit HW; maybe you could test for the
_xgetbv() MSVC built-in. How far back do you need to go?

Thanks,
Paul





Re: Possibility to disable `ALTER SYSTEM`

2024-03-29 Thread Robert Haas
On Fri, Mar 29, 2024 at 10:48 AM Bruce Momjian  wrote:
> On Fri, Mar 29, 2024 at 08:46:33AM -0400, Robert Haas wrote:
> > On Thu, Mar 28, 2024 at 3:33 PM Bruce Momjian  wrote:
> > > I am fine with moving ahead.  I thought my later emails explaining we
> > > have to be careful communicated that.
> >
> > OK. Thanks for clarifying. I've committed the patch with the two
> > wording changes that you suggested in your subsequent email.
>
> Great, I know this has been frustrating, and you are right that this
> wouldn't have been any simpler next year.

Thanks, Bruce.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 10:59:40AM -0500, Nathan Bossart wrote:
> It might be nice if we conditionally built pg_popcount_avx512.o in autoconf
> builds, too, but AFAICT we still need to wrap most of that code with
> macros, so I'm not sure it's worth the trouble.  I'll take another look at
> this...

If we assumed that TRY_POPCNT_FAST would be set and either
HAVE__GET_CPUID_COUNT or HAVE__CPUIDEX would be set whenever
USE_AVX512_POPCNT_WITH_RUNTIME_CHECK is set, we could probably remove the
surrounding macros and just compile pg_popcount_avx512.c conditionally
based on USE_AVX512_POPCNT_WITH_RUNTIME_CHECK.  However, the surrounding
code seems to be pretty cautious about these assumptions (e.g., the CPUID
macros are checked before setting TRY_POPCNT_FAST), so this would stray
from the nearby precedent a bit.

A counterexample is the CRC32C code.  AFAICT we assume the presence of
CPUID in that code (and #error otherwise).  I imagine it's probably safe to
assume the compiler understands CPUID if it understands AVX512 intrinsics,
but that is still mostly a guess.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote:
> > We don't do MSVC via autoconf/Make.  We used to have a special build
> > framework for MSVC which parsed Makefiles to produce "solution" files,
> > but it was removed as soon as Meson was mature enough to build.  See
> > commit 1301c80b2167.  If it builds with Meson, you're good.
> 
> The latest cfbot build for this seems to indicate that at least newer MSVC
> knows AVX512 intrinsics without any special compiler flags [0], so maybe
> what I had in v14 is good enough.  A previous version of the patch set [1] had
> the following lines:
> 
> +  if host_system == 'windows'
> +test_flags = ['/arch:AVX512']
> +  endif
> 
> I'm not sure if this is needed for older MSVC or something else.  IIRC I
> couldn't find any other examples of this sort of thing in the meson scripts,
> either.  Paul, do you recall why you added this?

I asked internal folks here in-the-know and they suggested I add it. I 
personally am not a Windows guy. If it works without it and you are comfortable 
not including the lines, I am fine with it.

Thanks,
Paul





Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-29 Thread Tom Lane
I wrote:
> I experimented with the attached modified version of the patch,
> which probes just after the relevant VACUUMs and reduces the
> crankiness of ConditionalLockBufferForCleanup a bit to more nearly
> approximate what we're likely to see in the buildfarm.

Sigh, forgot to attach the patch, not that you couldn't have
guessed what's in it.

regards, tom lane

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e63c86cae4..6accaec26d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1351,6 +1351,11 @@ vac_estimate_reltuples(Relation relation,
 	old_density = old_rel_tuples / old_rel_pages;
 	unscanned_pages = (double) total_pages - (double) scanned_pages;
 	total_tuples = old_density * unscanned_pages + scanned_tuples;
+	ereport(LOG,
+			(errmsg("vac_estimate_reltuples(%s): od %g, sp %u tp %u, st %g orl %g tt %g",
+	RelationGetRelationName(relation),
+	old_density, scanned_pages, total_pages,
+	scanned_tuples, old_rel_tuples, total_tuples)));
 	return floor(total_tuples + 0.5);
 }
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..53cc74d88f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5041,6 +5041,8 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 
 	Assert(BufferIsValid(buffer));
 
+	if (rand() % 100 == 0) return false;
+
 	if (BufferIsLocal(buffer))
 	{
 		refcount = LocalRefCount[-buffer - 1];
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index 8370c1561c..8841581d0a 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -1,4 +1,24 @@
+select c.relname,c.relpages,c.reltuples,c.relallvisible,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+where c.relname like 'tenk_' order by c.relname;
+ relname | relpages | reltuples | relallvisible | autovacuum_count | autoanalyze_count 
+-+--+---+---+--+---
+ tenk1   |  345 | 1 |   345 |0 | 0
+ tenk2   |  345 | 1 |   345 |0 | 0
+(2 rows)
+
 VACUUM;
+select c.relname,c.relpages,c.reltuples,c.relallvisible,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+where c.relname like 'tenk_' order by c.relname;
+ relname | relpages | reltuples | relallvisible | autovacuum_count | autoanalyze_count 
+-+--+---+---+--+---
+ tenk1   |  345 | 1 |   345 |0 | 0
+ tenk2   |  345 | 1 |   345 |0 | 0
+(2 rows)
+
 --
 -- Sanity check: every system catalog that has OIDs should have
 -- a unique index on OID.  This ensures that the OIDs will be unique,
diff --git a/src/test/regress/expected/test_setup.out b/src/test/regress/expected/test_setup.out
index 3d0eeec996..84943ecbae 100644
--- a/src/test/regress/expected/test_setup.out
+++ b/src/test/regress/expected/test_setup.out
@@ -138,6 +138,16 @@ COPY tenk1 FROM :'filename';
 VACUUM ANALYZE tenk1;
 CREATE TABLE tenk2 AS SELECT * FROM tenk1;
 VACUUM ANALYZE tenk2;
+select c.relname,c.relpages,c.reltuples,c.relallvisible,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+where c.relname like 'tenk_' order by c.relname;
+ relname | relpages | reltuples | relallvisible | autovacuum_count | autoanalyze_count 
+-+--+---+---+--+---
+ tenk1   |  345 | 1 |   345 |0 | 0
+ tenk2   |  345 | 1 |   345 |0 | 0
+(2 rows)
+
 CREATE TABLE person (
 	name 		text,
 	age			int4,
diff --git a/src/test/regress/sql/sanity_check.sql b/src/test/regress/sql/sanity_check.sql
index 162e5324b5..28635a1287 100644
--- a/src/test/regress/sql/sanity_check.sql
+++ b/src/test/regress/sql/sanity_check.sql
@@ -1,5 +1,15 @@
+select c.relname,c.relpages,c.reltuples,c.relallvisible,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+where c.relname like 'tenk_' order by c.relname;
+
 VACUUM;
 
+select c.relname,c.relpages,c.reltuples,c.relallvisible,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+where c.relname like 'tenk_' order by c.relname;
+
 --
 -- Sanity check: every system catalog that has OIDs should have
 -- a unique index on OID.  This ensures that the OIDs will be unique,
diff --git a/src/test/regress/sql/test_setup.sql b/src/test/regress/sql/test_setup.sql
index 

Re: Allowing DESC for a PRIMARY KEY column

2024-03-29 Thread Mitar
Hi!

On Fri, Mar 29, 2024 at 9:41 PM Tom Lane  wrote:
> You would need a lot stronger case than "I didn't bother checking
> whether I really need this".

Thanks! I have tested it this way (based on your example):

create table t (id int not null, revision int not null);
create unique index on t (id, revision desc);
explain select * from t where id=123 order by revision desc limit 1;
   QUERY PLAN

 Limit  (cost=0.15..3.45 rows=1 width=8)
   ->  Index Only Scan using t_id_revision_idx on t  (cost=0.15..36.35
rows=11 width=8)
 Index Cond: (id = 123)
(3 rows)

It is very similar, with only the direction difference. Based on [1] I
was under the impression that "Index Only Scan Backward" is much
slower than "Index Only Scan", but based on your answer it seems I
misunderstood and backwards scanning is comparable with forward
scanning? Especially this section:

"Consider a two-column index on (x, y): this can satisfy ORDER BY x, y
if we scan forward, or ORDER BY x DESC, y DESC if we scan backward.
But it might be that the application frequently needs to use ORDER BY
x ASC, y DESC. There is no way to get that ordering from a plain
index, but it is possible if the index is defined as (x ASC, y DESC)
or (x DESC, y ASC)."

I am curious, what is then an example where the quote from [1]
applies? Really just if I would be doing ORDER BY id, revision DESC on
the whole table? Because one future query I am working on is where I
select all rows but for only the latest (highest) revision. Curious if
that will have an effect there.
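For the archives, the case where the docs' quote bites is a mixed-direction ORDER BY; a sketch (table and index names invented for the example):

```sql
create table doc (id int not null, revision int not null);
create index doc_plain_idx on doc (id, revision);

-- A plain index serves "order by id, revision" (forward scan) and
-- "order by id desc, revision desc" (backward scan) equally well.
-- It cannot directly produce this mixed ordering, so a Sort node
-- should appear in the plan:
explain select * from doc order by id asc, revision desc;

-- An index declared with mixed directions matches it directly, and the
-- plan should become a plain index scan with no Sort step:
create index doc_mixed_idx on doc (id asc, revision desc);
explain select * from doc order by id asc, revision desc;
```

The single-id case in the message above never needs the mixed ordering across id values, which is why the backward scan suffices there.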


Mitar

[1] https://www.postgresql.org/docs/16/indexes-ordering.html

-- 
https://mitar.tnode.com/
https://twitter.com/mitar_m
https://noc.social/@mitar




Re: LLVM 18

2024-03-29 Thread Thomas Munro
On Sat, Mar 30, 2024 at 7:07 AM Christoph Berg  wrote:
> Ubuntu in their infinite wisdom have switched to LLVM 18 as default
> for their upcoming 24.04 "noble" LTS release while Debian is still
> defaulting to 16. I'm now seeing LLVM crashes on the 4 architectures
> we support on noble.
>
> Should LLVM 18 be supported by now?

Hi Christoph,

Seems there is a bug somewhere, probably (?) not in our code, but
perhaps we should be looking for a workaround...  here's the thread:

https://www.postgresql.org/message-id/flat/CAFj8pRACpVFr7LMdVYENUkScG5FCYMZDDdSGNU-tch%2Bw98OxYg%40mail.gmail.com




Re: Statistics Import and Export

2024-03-29 Thread Corey Huinker
>
>
> There's still a failure in the pg_upgrade TAP test. One seems to be
> ordering, so perhaps we need to ORDER BY the attribute number. Others
> seem to be missing relstats and I'm not sure why yet. I suggest doing
> some manual pg_upgrade tests and comparing the before/after dumps to
> see if you can reproduce a smaller version of the problem.
>

That's fixed in my current working version, as is a tsvector-specific
issue. Working on the TAP issue.


Re: Statistics Import and Export

2024-03-29 Thread Corey Huinker
>
> (I've not read the patch yet, but I assume the switch works like
> other pg_dump filters in that you can apply it on the restore
> side?)
>

Correct. It follows the existing --no-something pattern.


Re: LLVM 18

2024-03-29 Thread Christoph Berg
Re: Thomas Munro
> By the way, while testing on my Debian system with apt.llvm.org
> packages, I discovered that we crash with its latest llvm-18 package,
> namely:

Ubuntu in their infinite wisdom have switched to LLVM 18 as default
for their upcoming 24.04 "noble" LTS release while Debian is still
defaulting to 16. I'm now seeing LLVM crashes on the 4 architectures
we support on noble.

Should LLVM 18 be supported by now?

Christoph




Re: Why is parula failing?

2024-03-29 Thread Tom Lane
David Rowley  writes:
> On Wed, 27 Mar 2024 at 18:28, Tom Lane  wrote:
>> Let's wait a bit to see if it fails in HEAD ... but if not, would
>> it be reasonable to back-patch the additional debugging output?

> I think REL_16_STABLE has told us that it's not an auto-vacuum issue.
> I'm uncertain what a few more failures in master will tell us aside
> from if reltuples == 48 is consistent or if that value is going to
> fluctuate.

> Let's give it a week and see if it fails a few more times.

We have another failure today [1] with the same symptom:

  ab_a2  |0 |-1 ||0 | 0
- ab_a2_b1   |0 |-1 ||0 | 0
+ ab_a2_b1   |0 |48 ||0 | 0
  ab_a2_b1_a_idx |1 | 0 | t  |  |

Different table, same "48" reltuples.  But I have to confess that
I'd not looked closely enough at the previous failure, because
now that I have, this is well out in WTFF territory: how can
reltuples be greater than zero when relpages is zero?  This can't
be a state that autovacuum would have left behind, unless it's
really seriously broken.  I think we need to be looking for
explanations like "memory stomp" or "compiler bug".

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-29%2012%3A46%3A02




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 03:08:28PM -0500, Nathan Bossart wrote:
>> +#if defined(HAVE__GET_CPUID)
>> +__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
>> +#elif defined(HAVE__CPUID)
>> +__cpuidex(exx, 7, 0);
> 
> Is there any reason we can't use __get_cpuid() and __cpuid() here, given
> the sub-leaf is 0?

The answer to this seems to be "no."  After additional research,
__get_cpuid_count/__cpuidex seem new enough that we probably want configure
checks for them, so I'll add those back in the next version of the patch.
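For completeness, the configure probe for __get_cpuid_count could look roughly like the existing CPUID checks. This is a sketch of the autoconf side only, with the cache-variable and macro naming guessed from nearby precedent rather than taken from any posted patch:

```
# Hypothetical configure.ac fragment, modeled on the existing __get_cpuid check
AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
  [unsigned int exx[[4]] = {0, 0, 0, 0};
   __get_cpuid_count(7, 0, &exx[[0]], &exx[[1]], &exx[[2]], &exx[[3]]);])],
  [pgac_cv__get_cpuid_count=yes],
  [pgac_cv__get_cpuid_count=no])])
if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
  AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
fi
```

A matching check for __cpuidex on the MSVC side would presumably live in the meson scripts, since MSVC isn't driven by autoconf.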

Apologies for the stream of consciousness today...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Statistics Import and Export

2024-03-29 Thread Stephen Frost
Greetings,

On Fri, Mar 29, 2024 at 11:05 Jeff Davis  wrote:

> On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> > 0002:
> > - All relstats and attrstats calls are now their own statement
> > instead of a compound statement
> > - moved the archive TOC entry from post-data back to SECTION_NONE (as
> > it was modeled on object COMMENTs), which seems to work better.
> > - remove meta-query in favor of more conventional query building
> > - removed all changes to fe_utils/
>
> Can we get a consensus on whether the default should be with stats or
> without? That seems like the most important thing remaining in the
> pg_dump changes.


I’d certainly think “with stats” would be the preferred default of our
users.

Thanks!

Stephen


Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2024-03-29 Thread Noah Misch
On Fri, Mar 29, 2024 at 02:17:08PM -0400, Peter Geoghegan wrote:
> On Mon, Mar 25, 2024 at 2:24 PM Noah Misch  wrote:
> > I wasn't thinking about changing the pre-v17 bt_right_page_check_scankey()
> > code.  I got interested in this area when I saw the interaction of the new
> > "first key on the next page" logic with bt_right_page_check_scankey().  The
> > patch made bt_right_page_check_scankey() pass back rightfirstoffset.  The new
> > code then does palloc_btree_page() and PageGetItem() with that offset, which
> > bt_right_page_check_scankey() had already done.  That smelled like a misplaced
> > distribution of responsibility.  For a time, I suspected the new code should
> > move down into bt_right_page_check_scankey().  Then I transitioned to thinking
> > checkunique didn't need new code for the page boundary.

>  I did notice (I meant to point out) that I have concerns about this
> part of the new uniqueness check code:
> 
> "
> if (P_IGNORE(topaque) || !P_ISLEAF(topaque))
> break;
> "
> 
> My concern here is with the !P_ISLEAF(topaque) test -- it shouldn't be
> required. If the page in question isn't a leaf page, then the index
> must be corrupt (or the page deletion recycle safety/drain technique
> thing is buggy). The " !P_ISLEAF(topaque)" part of the check is either
> superfluous or something that ought to be reported as corruption --
> it's not a legal/expected state.

Good point.

> Separately, I dislike the way the target block changes within
> bt_target_page_check(). The general idea behind verify_nbtree.c's
> target block is that every block becomes the target exactly once, in a
> clearly defined place.

Agreed.




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Thomas Munro
On Sat, Mar 30, 2024 at 12:40 PM Tomas Vondra wrote:
> Sorry, I meant the prefetch (readahead) built into ZFS. I may be wrong
> but I don't think the regular RA (in linux kernel) works for ZFS, right?

Right, it has a separate page cache ("ARC") and its own prefetch settings:

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html

That's probably why Linux posix_fadvise didn't affect it, well that
and the fact, at a wild guess, that Solaris didn't have that system
call...

> I was wondering if we could use this (posix_fadvise) to improve that,
> essentially by issuing fadvise even for sequential patterns. But now
> that I think about that, if posix_fadvise works since 2.2, maybe RA
> works too now?)

It should work fine.  I am planning to look into this a bit some day
soon -- I think there may be some interesting interactions between
systems with big pages/records like ZFS/BTRFS/... and io_combine_limit
that might offer interesting optimisation tweak opportunities, but
first things first...




Re: Statistics Import and Export

2024-03-29 Thread Stephen Frost
Greetings,

On Fri, Mar 29, 2024 at 19:35 Jeff Davis  wrote:

> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> > I’d certainly think “with stats” would be the preferred default of
> > our users.
>
> I'm concerned there could still be paths that lead to an error. For
> pg_restore, or when loading a SQL file, a single error isn't fatal
> (unless -e is specified), but it still could be somewhat scary to see
> errors during a reload.


I understand that point.

> Also, it's new behavior, so it may cause some minor surprises, or there
> might be minor interactions to work out. For instance, dumping stats
> doesn't make a lot of sense if pg_upgrade (or something else) is just
> going to run analyze anyway.


But we don’t expect anything to run analyze … do we?  So I’m not sure why
it makes sense to raise this as a concern.

> What do you think about starting off with it as non-default, and then
> switching it to default in 18?


What’s different, given the above arguments, in making the change with 18
instead of now?  I also suspect that if we say “we will change the default
later” … that later won’t ever come and we will end up making our users
always have to remember to say “with-stats” instead.

The stats are important which is why the effort is being made in the first
place. If just doing an analyze after loading the data was good enough then
this wouldn’t be getting worked on.

Independently, I had a thought around doing an analyze as the data is being
loaded .. but we can’t do that for indexes (but we could perhaps analyze
the indexed values as we build the index..).  This works when we do a
truncate or create the table in the same transaction, so we would tie into
some of the existing logic that we have around that.  Would also adjust
COPY to accept an option that specifies the anticipated number of rows
being loaded (which we can figure out during the dump phase reasonably..).
Perhaps this would lead to a pg_dump option to do the data load as a
transaction with a truncate before the copy (point here being to be able to
still do parallel load while getting the benefits from knowing that we are
completely reloading the table). Just some other thoughts- which I don’t
intend to take away from the current effort at all, which I see as valuable
and should be enabled by default.

Thanks!

Stephen



Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2024-03-29 Thread Peter Geoghegan
On Mon, Mar 25, 2024 at 2:24 PM Noah Misch  wrote:
> I wasn't thinking about changing the pre-v17 bt_right_page_check_scankey()
> code.  I got interested in this area when I saw the interaction of the new
> "first key on the next page" logic with bt_right_page_check_scankey().  The
> patch made bt_right_page_check_scankey() pass back rightfirstoffset.  The new
> code then does palloc_btree_page() and PageGetItem() with that offset, which
> bt_right_page_check_scankey() had already done.  That smelled like a misplaced
> distribution of responsibility.  For a time, I suspected the new code should
> move down into bt_right_page_check_scankey().  Then I transitioned to thinking
> checkunique didn't need new code for the page boundary.

Ah, I see. Somehow I missed this point when I recently took a fresh
look at the committed patch.

 I did notice (I meant to point out) that I have concerns about this
part of the new uniqueness check code:

"
if (P_IGNORE(topaque) || !P_ISLEAF(topaque))
break;
"

My concern here is with the !P_ISLEAF(topaque) test -- it shouldn't be
required. If the page in question isn't a leaf page, then the index
must be corrupt (or the page deletion recycle safety/drain technique
thing is buggy). The " !P_ISLEAF(topaque)" part of the check is either
superfluous or something that ought to be reported as corruption --
it's not a legal/expected state.

Separately, I dislike the way the target block changes within
bt_target_page_check(). The general idea behind verify_nbtree.c's
target block is that every block becomes the target exactly once, in a
clearly defined place. All corruption (in the index structure itself)
is formally considered to be a problem with that particular target
block. I want to be able to clearly distinguish between the target and
target's right sibling here, to explain my concerns, but they're kinda
both the target, so that's a lot harder than it should be. (Admittedly
directly blaming the target block has always been a little bit
arbitrary, at least in certain cases, but even there it provides
structure that makes things much easier to describe unambiguously.)

-- 
Peter Geoghegan




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
Okay, here is a slightly different approach that I've dubbed the "maximum
assumption" approach.  In short, I wanted to see how much we could simplify
the patch by making all possibly-reasonable assumptions about the compiler
and CPU.  These include:

* If the compiler understands AVX512 intrinsics, we assume that it also
  knows about the required CPUID and XGETBV intrinsics, and we assume that
  the conditions for TRY_POPCNT_FAST are true.
* If this is x86_64, CPUID will be supported by the CPU.
* If CPUID indicates AVX512 POPCNT support, the CPU also supports XGETBV.

Do any of these assumptions seem unreasonable or unlikely to be true for
all practical purposes?  I don't mind adding back some or all of the
configure/runtime checks if they seem necessary.  I guess the real test
will be the buildfarm...

Another big change in this version is that I've moved
pg_popcount_avx512_available() to its own file so that we only compile
pg_popcount_avx512() with the special compiler flags.  This is just an
oversight in previous versions.

Finally, I've modified the build scripts so that the AVX512 popcount stuff
is conditionally built based on the configure checks for both
autoconf/meson.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From d7864391c455ea77b8e555e40a358c59de1bd702 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v16 1/1] AVX512 popcount support

---
 config/c-compiler.m4 |  34 +
 configure| 100 +++
 configure.ac |  14 
 meson.build  |  35 ++
 src/Makefile.global.in   |   4 ++
 src/include/pg_config.h.in   |   3 +
 src/include/port/pg_bitutils.h   |  17 +
 src/makefiles/meson.build|   3 +-
 src/port/Makefile|   6 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  56 ++-
 src/port/pg_popcount_avx512.c|  40 +++
 src/port/pg_popcount_avx512_choose.c |  61 
 13 files changed, 340 insertions(+), 39 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..7d13368b23 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..86c471f4ec 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,8 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+CFLAGS_POPCNT
+PG_POPCNT_OBJS
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17438,6 +17440,104 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
 
 fi
 
+# Check for AVX512 popcount intrinsics
+#
+PG_POPCNT_OBJS=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = 

Re: Security lessons from liblzma

2024-03-29 Thread Thomas Munro
On Sat, Mar 30, 2024 at 11:37 AM Bruce Momjian  wrote:
> You might have seen reports today about a very complex exploit added to
> recent versions of liblzma.  Fortunately, it was only enabled two months
> ago and has not been pushed to most stable operating systems like Debian
> and Ubuntu.  The original detection report is:
>
> https://www.openwall.com/lists/oss-security/2024/03/29/4

Incredible work from Andres.  The attackers made a serious strategic
mistake: they made PostgreSQL slightly slower.




Re: Security lessons from liblzma

2024-03-29 Thread Daniel Gustafsson
> On 29 Mar 2024, at 23:59, Andres Freund  wrote:
> On 2024-03-29 18:37:24 -0400, Bruce Momjian wrote:

>> Now, we don't take pull requests, and all our committers are known
>> individuals, but this might have cautionary lessons for us.
> 
> I am doubtful that every committer would find something sneaky hidden in
> e.g. one of the test changes in a large commit. It's not too hard to hide
> something sneaky.

One take-away for me is how important it is to ship recipes for regenerating
any testdata which is included in generated/compiled/binary format.  Kind of
how we in our tree ship the config for test TLS certificates and keys which can
be manually inspected, and used to rebuild the testdata (although the risk for
injections in this particular case seems low).  Bad things can still be
injected, but formats which allow manual review at least goes some way towards
lowering risk.
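As a concrete illustration of that principle, a test certificate checked in as a binary blob can instead ship alongside the recipe that regenerates it from reviewable inputs. All file names and the subject below are invented for the example:

```shell
# Regenerate a throwaway self-signed test certificate and key from scratch;
# the inputs (subject, key size, lifetime) are all visible and reviewable
# here, rather than hidden inside an opaque checked-in binary.
openssl req -new -x509 -days 2 -nodes -newkey rsa:2048 \
  -subj "/CN=regen-demo" \
  -keyout /tmp/regen_demo.key -out /tmp/regen_demo.crt

# Anyone can inspect what was produced instead of trusting the blob:
openssl x509 -in /tmp/regen_demo.crt -noout -subject
```

The same idea applies to any compiled or generated test fixture: check in the generator and its inputs, and treat the binary artifact as a cache, not a source.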

--
Daniel Gustafsson





Re: Remove excessive trailing semicolons

2024-03-29 Thread Daniel Gustafsson
> On 29 Mar 2024, at 10:14, Richard Guo  wrote:

> Noticed some newly introduced excessive trailing semicolons:
> 
> $ git grep -E ";;$" -- *.c *.h
> src/include/lib/radixtree.h:int deletepos = slot - n4->children;;
> src/test/modules/test_tidstore/test_tidstore.c: BlockNumber prevblkno = 0;;
> 
> Here is a trivial patch to remove them.

Thanks, applied!

--
Daniel Gustafsson





Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Thomas Munro
On Sat, Mar 30, 2024 at 12:34 PM Tomas Vondra wrote:
> Hmmm. I admit I didn't think about the "always prefetch" flag too much,
> but I did imagine it'd only affect some places (e.g. BHS, but not for
> sequential scans). If it could be done by lowering the combine limit,
> that could work too - in fact, I was wondering if we should have combine
> limit as a tablespace parameter too.

Good idea!  Will add.  Planning to commit the basic patch very soon,
I'm just thinking about what to do with Heikki's recent optimisation
feedback (ie can it be done as follow-up, he thinks so, I'm thinking
about that today as time is running short).

> But I think adding such knobs should be only the last resort - I myself
> don't know how to set these parameters, how could we expect users to
> pick good values? Better to have something that "just works".

Agreed.

> I admit I never 100% understood when exactly the kernel RA kicks in, but
> I always thought it's enough for the patterns to be only "close enough"
> to sequential. Isn't the problem that this only skips fadvise for 100%
> sequential patterns, but keeps prefetching for cases the RA would deal
> on it's own? So maybe we should either relax the conditions when to skip
> fadvise, or combine even pages that are not perfectly sequential (I'm
> not sure if that's possible only for fadvise), though.

Yes that might be worth considering, if we know/guess what the OS RA
window size is for a tablespace.  I will post a patch for that for
consideration/testing as a potential follow-up as it's super easy,
just for experimentation.  I just fear that it's getting into the
realms of "hard to explain/understand" but on the other hand I guess
we already have the mechanism and have to explain it.
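In toy form, the relaxed suppression test could be sketched like this (illustrative names only, not the real read_stream.c logic, which tracks seq_blocknum in per-stream state; the window size here is an assumed guess at the kernel RA window):

```c
#include <stdbool.h>

typedef unsigned int BlockNumber;

/*
 * Sketch: suppress posix_fadvise() not only for exactly sequential reads
 * (next == seq_blocknum, the current behaviour) but for any read landing
 * within a guessed kernel readahead window, trusting the kernel RA there.
 */
static bool
need_advice(BlockNumber next, BlockNumber seq_blocknum, BlockNumber ra_blocks)
{
	if (next == seq_blocknum)
		return false;			/* exactly sequential: current behaviour */
	if (next > seq_blocknum && next - seq_blocknum < ra_blocks)
		return false;			/* "close enough": let kernel RA handle it */
	return true;				/* random enough to be worth advising */
}
```

The tunable here is exactly the hard-to-explain part: ra_blocks would have to come from a tablespace setting or a guess at the OS configuration.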




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 10:29:47PM +, Amonson, Paul D wrote:
> I see in the meson.build you added the new file twice?
> 
> @@ -7,6 +7,7 @@ pgport_sources = [
>'noblock.c',
>'path.c',
>'pg_bitutils.c',
> +  'pg_popcount_avx512.c',
>'pg_strong_random.c',
>'pgcheckdir.c',
>'pgmkdirp.c',
> @@ -84,6 +85,7 @@ replace_funcs_pos = [
>['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
>['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
>['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
> +  ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512_popcnt'],
> 
> I was putting the file with special flags ONLY in the second section and all 
> seemed to work. :)

Ah, yes, I think that's a mistake, and without looking closely, might
explain the MSVC warnings [0]:

[22:05:47.444] pg_popcount_avx512.c.obj : warning LNK4006: pg_popcount_avx512_available already defined in pg_popcount_a...

It might be nice if we conditionally built pg_popcount_avx512.o in autoconf
builds, too, but AFAICT we still need to wrap most of that code with
macros, so I'm not sure it's worth the trouble.  I'll take another look at
this...

[0] http://commitfest.cputube.org/highlights/all.html#4883

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From c924b57f8479e51aa30c8e3cfe194a2ab85497ff Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v15 1/1] AVX512 popcount support

---
 config/c-compiler.m4   |  34 +++
 configure  | 165 +
 configure.ac   |  34 +++
 meson.build|  59 
 src/Makefile.global.in |   1 +
 src/include/pg_config.h.in |   9 ++
 src/include/port/pg_bitutils.h |  20 
 src/makefiles/meson.build  |   1 +
 src/port/Makefile  |   6 ++
 src/port/meson.build   |   5 +-
 src/port/pg_bitutils.c |  56 ---
 src/port/pg_popcount_avx512.c  |  98 
 12 files changed, 450 insertions(+), 38 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..f881e7ec28 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_AVX512_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..189264b86e 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include 
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo 

Re: Popcount optimization using AVX512

2024-03-29 Thread Tom Lane
Nathan Bossart  writes:
>> I see google web references to the xgetbv instruction as far back as 2009
>> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for
>> _xgetbv() MSVC built-in. How far back do you need to go?

> Hm.  It seems unlikely that a compiler would understand AVX512 intrinsics
> and not XGETBV then.  I guess the other question is whether CPUID
> indicating AVX512 is enabled implies the availability of XGETBV on the CPU.
> If that's not safe, we might need to add another CPUID test.

Some quick googling says that (1) XGETBV predates AVX and (2) if you
are worried about old CPUs, you should check CPUID to verify whether
XGETBV exists before trying to use it.  I did not look for the
bit-level details on how to do that.

regards, tom lane




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 12:30:14PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>>> I see google web references to the xgetbv instruction as far back as 2009
>>> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for
>>> _xgetbv() MSVC built-in. How far back do you need to go?
> 
>> Hm.  It seems unlikely that a compiler would understand AVX512 intrinsics
>> and not XGETBV then.  I guess the other question is whether CPUID
>> indicating AVX512 is enabled implies the availability of XGETBV on the CPU.
>> If that's not safe, we might need to add another CPUID test.
> 
> Some quick googling says that (1) XGETBV predates AVX and (2) if you
> are worried about old CPUs, you should check CPUID to verify whether
> XGETBV exists before trying to use it.  I did not look for the
> bit-level details on how to do that.

That extra CPUID check should translate to exactly one additional line of
code, so I think I'm inclined to just add it.
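For reference, that one-extra-line check together with the XGETBV test it guards might look roughly like this (a sketch with illustrative names, not the committed code; the bit positions are OSXSAVE = CPUID.1:ECX bit 27 and XCR0 mask 0xe6 covering the XMM/YMM/opmask/ZMM save states, and the inline asm sidesteps needing -mxsave for the _xgetbv() intrinsic):

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <cpuid.h>

static uint64_t
read_xcr0(void)
{
	uint32_t	low,
				high;

	/* plain inline asm, so no -mxsave is needed for _xgetbv() */
	__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(0));
	return ((uint64_t) high << 32) | low;
}

static bool
zmm_state_enabled(void)
{
	unsigned int exx[4] = {0, 0, 0, 0};

	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
	if ((exx[2] & (1 << 27)) == 0)	/* OSXSAVE: XGETBV is usable */
		return false;

	/* XCR0 bits 1-2 (XMM/YMM) and 5-7 (opmask, ZMM) must all be set */
	return (read_xcr0() & 0xe6) == 0xe6;
}
#else
static bool
zmm_state_enabled(void)
{
	return false;				/* non-x86_64: no AVX512 */
}
#endif
```

The OSXSAVE test is the "one additional line": without it, executing xgetbv on a CPU or OS that doesn't expose XSAVE would fault.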

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Thomas Munro
On Sat, Mar 30, 2024 at 10:39 AM Thomas Munro  wrote:
> On Sat, Mar 30, 2024 at 4:53 AM Tomas Vondra
>  wrote:
> > ... Maybe there should be some flag to force
> > issuing fadvise even for sequential patterns, perhaps at the tablespace
> > level? ...
>
> Yeah, I've wondered about trying harder to "second guess" the Linux
> RA.  At the moment, read_stream.c detects *exactly* sequential reads
> (see seq_blocknum) to suppress advice, but if we knew/guessed the RA
> window size, we could (1) detect it with the same window that Linux
> will use to detect it, and (2) [new realisation from yesterday's
> testing] we could even "tickle" it to wake it up in certain cases
> where it otherwise wouldn't, by temporarily using a smaller
> io_combine_limit if certain patterns come along.  I think that sounds
> like madness (I suspect that any place where the latter would help is
> a place where you could turn RA up a bit higher for the same effect
> without weird kludges), or another way to put it would be to call it
> "overfitting" to the pre-existing quirks; but maybe it's a future
> research idea...

I guess I missed a step when responding that suggestion: I don't think
we could have an "issue advice always" flag, because it doesn't seem
to work out as well as letting the kernel do it, and a global flag
like that would affect everything else including sequential scans
(once the streaming seq scan patch goes in).  But suppose we could do
that, maybe even just for BHS.  In my little test yesterday it had to
issue a lot of them, patched eic=4, to beat the kernel's RA with
unpatched eic=0:

eic    unpatched    patched
0      4172         9572
1      30846        10376
2      18435        5562
4      18980        3503

So if we forced fadvise to be issued with a GUC, it still wouldn't be
good enough in this case.  So we might need to try to understand what
exactly is waking the RA up for unpatched but not patched, and try to
tickle it by doing a little less I/O combining (for example just
setting io_combine_limit=1 gives the same number for eic=0, a major
clue), but that seems to be going down a weird path, and tuning such a
copying algorithm seems too hard.




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-29 Thread Noah Misch
On Fri, Mar 29, 2024 at 09:17:55AM +0100, Jelte Fennema-Nio wrote:
> On Thu, 28 Mar 2024 at 19:03, Tom Lane  wrote:
> > Could we make this test bulletproof by using an injection point?
> > If not, I remain of the opinion that we're better off without it.
> 
> Possibly, and if so, I agree that would be better than the currently
> added test. But I honestly don't feel like spending the time on
> creating such a test.

The SQL test is more representative of real applications, and it's way simpler
to understand.  In general, I prefer 6-line SQL tests that catch a problem 10%
of the time over injection point tests that catch it 100% of the time.  For
low detection rate to be exciting, it needs to be low enough to have a serious
chance of all buildfarm members reporting green for the bad commit.  With ~115
buildfarm members running in the last day, 0.1% detection rate would have been
low enough to bother improving, but 4% would be high enough to call it good.
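The arithmetic behind those thresholds is simple enough to sketch (assuming ~115 members and independent runs, with the cutoffs named above):

```c
/* probability that every buildfarm member misses the bug in one day */
static double
prob_all_green(double per_member_detection, int members)
{
	double		p = 1.0;

	for (int i = 0; i < members; i++)
		p *= 1.0 - per_member_detection;
	return p;
}
```

With 115 members, a 4% per-run detection rate leaves under a 1% chance of an all-green day for the bad commit, while 0.1% leaves roughly an 89% chance — which is why the former is good enough and the latter worth improving.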




Re: Security lessons from liblzma

2024-03-29 Thread Andres Freund
Hi,

On 2024-03-29 18:37:24 -0400, Bruce Momjian wrote:
> You might have seen reports today about a very complex exploit added to
> recent versions of liblzma.  Fortunately, it was only enabled two months
> ago and has not been pushed to most stable operating systems like Debian
> and Ubuntu.  The original detection report is:
> 
> https://www.openwall.com/lists/oss-security/2024/03/29/4
> 
> And this ycombinator discussion has details:
> 
> https://news.ycombinator.com/item?id=39865810
> 
> It looks like an earlier commit with a binary blob "test data"
> contained the bulk of the backdoor, then the configure script
> enabled it, and then later commits patched up valgrind errors
> caused by the backdoor. See the commit links in the "Compromised
> Repository" section.
> 
> and I think the configure came in through the autoconf output file
> 'configure', not configure.ac:
>
> This is my main take-away from this. We must stop using upstream
> configure and other "binary" scripts. Delete them all and run
> "autoreconf -fi" to recreate them. (Debian already does something
> like this I think.)

I don't think that's a useful lesson here, actually. In this case, if you ran
autoreconf -fi in a released tarball, it'd just recreate precisely what
the tarball already contained, including the exploit.

I think the issue in this case rather was that the tarball contains files that
are not in the release - a lot of them.  The attackers injected the
'activating' part of the backdoor into the release tarball, while it was not
present in the git tree. They did that because they knew that distributions
often build from release tarballs.

If the pre-backdoor release tarball had been identical to the git
repository, this would likely have been noticed by packagers - but it was
normal for there to be a lot of new files.

We traditionally also had a lot of generated files in the tarball that weren't
in our git tree - but we removed a lot of that a few months ago, when we
stopped including bison/flex/other generated code.

We still include generated docs, but that's much harder to exploit at
scale.  However, they still make it harder to verify that the release is
exactly the same as the git tree.


> Now, we don't take pull requests, and all our committers are known
> individuals, but this might have cautionary lessons for us.

I am doubtful that every committer would find something sneaky hidden in
e.g. one of the test changes in a large commit. It's not too hard to hide
something sneaky. In comparison to that, hiding something in configure.ac seems
less likely to succeed IMO; that tends to be more scrutinized. And hiding
something just in configure directly wouldn't get you far, it'd just get removed
when the committer, or some other committer at a later time, regenerates configure.

Greetings,

Andres Freund




Re: Statistics Import and Export

2024-03-29 Thread Corey Huinker
On Fri, Mar 29, 2024 at 7:34 PM Jeff Davis  wrote:

> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> > I’d certainly think “with stats” would be the preferred default of
> > our users.
>
> I'm concerned there could still be paths that lead to an error. For
> pg_restore, or when loading a SQL file, a single error isn't fatal
> (unless -e is specified), but it still could be somewhat scary to see
> errors during a reload.
>

To that end, I'm going to be modifying the "Optimizer statistics are not
transferred by pg_upgrade..." message when stats _were_ transferred,
with additional instructions that the user should treat any stats-ish
error messages encountered as a reason to manually analyze that table. We
should probably say something about extended stats as well.


Re: Combine Prune and Freeze records emitted by vacuum

2024-03-29 Thread Heikki Linnakangas

On 29/03/2024 07:04, Melanie Plageman wrote:

On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote:

These comments could use another pass. I had added some extra
(probably redundant) content when I thought I was refactoring it a
certain way and then changed my mind.

Attached is a diff with some ideas I had for a bit of code simplification.

Are you working on cleaning this patch up or should I pick it up?


Attached v9 is rebased over master. But, more importantly, I took
another pass at heap_prune_chain() and am pretty happy with what I came
up with. See 0021. I simplified the traversal logic and then grouped the
chain processing into three branches at the end. I find it much easier
to understand what we are doing for different types of HOT chains.


Ah yes, agreed, that's nicer.

The 'survivor' variable is a little confusing, especially here:

if (!survivor)
{
int i;

/*
 * If no DEAD tuple was found, and the root is redirected, mark 
it as
 * such.
 */
if ((i = ItemIdIsRedirected(rootlp)))
heap_prune_record_unchanged_lp_redirect(prstate, 
rootoffnum);

/* the rest of tuples in the chain are normal, unchanged tuples 
*/
for (; i < nchain; i++)
heap_prune_record_unchanged(dp, prstate, chainitems[i]);
}

You would think that "if (!survivor)" means that there are no live 
tuples on the chain, i.e. no survivors. But in fact it's the opposite; 
it means that they are all live. Maybe call it 'ndeadchain' instead, 
meaning the number of dead items in the chain.



I got rid of revisited. We can put it back, but I was thinking: we stash
all HOT tuples and then loop over them later, calling record_unchanged()
on the ones that aren't marked. But, if we have a lot of HOT tuples, is
this really that much better than just looping through all the offsets
and calling record_unchanged() on just the ones that aren't marked?


Well, it requires looping through all the offsets one more time, and 
unless you have a lot of HOT tuples, most items would be marked already. 
But maybe the overhead is negligible anyway.



I've done that in my version. While testing this, I found that only
on-access pruning needed this final loop calling record_unchanged() on
items not yet marked. I know we can't skip this final loop entirely in
the ON ACCESS case because it calls record_prunable(), but we could
consider moving that back out into heap_prune_chain()? Or what do you
think?


Hmm, why is that different with on-access pruning?

Here's another idea: In the first loop through the offsets, where we 
gather the HTSV status of each item, also collect the offsets of all HOT 
and non-HOT items to two separate arrays. Call heap_prune_chain() for 
all the non-HOT items first, and then process any remaining HOT tuples 
that haven't been marked yet.
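In toy form, that two-array idea could look something like this (stand-in types and names only; the real code would use OffsetNumber arrays, heap_prune_chain(), and heap_prune_record_unchanged()):

```c
#include <stdbool.h>

#define MAX_ITEMS 32

/* stand-in for a line pointer plus its gathered HTSV status */
typedef struct
{
	bool		is_hot;
	bool		processed;		/* already handled by chain processing? */
} DemoItem;

/*
 * Sketch: bucket offsets into non-HOT and HOT arrays during the initial
 * status-gathering pass, process chains rooted at non-HOT items first,
 * then mop up any HOT items no chain reached.
 */
static void
demo_prune_page(DemoItem *items, int nitems)
{
	int			non_hot[MAX_ITEMS];
	int			hot[MAX_ITEMS];
	int			n_non_hot = 0;
	int			n_hot = 0;

	for (int i = 0; i < nitems; i++)
	{
		if (items[i].is_hot)
			hot[n_hot++] = i;
		else
			non_hot[n_non_hot++] = i;
	}

	/* stand-in for heap_prune_chain(): marks the root and its chain */
	for (int i = 0; i < n_non_hot; i++)
		items[non_hot[i]].processed = true;

	/* stand-in for record_unchanged(): only items no chain reached */
	for (int i = 0; i < n_hot; i++)
	{
		if (!items[hot[i]].processed)
			items[hot[i]].processed = true;
	}
}
```

The point of the split is that the second loop walks only the (usually short) HOT array rather than every offset on the page.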



I haven't finished updating all the comments, but I am really interested
to know what you think about heap_prune_chain() now.


Looks much better now, thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Streaming I/O, vectored I/O (WIP)

2024-03-29 Thread Heikki Linnakangas

On 29/03/2024 09:01, Thomas Munro wrote:

On Fri, Mar 29, 2024 at 9:45 AM Heikki Linnakangas  wrote:

master (213c959a29):8.0 s
streaming-api v13:  9.5 s


Hmm, that's not great, and I think I know one factor that has
confounded my investigation and the conflicting reports I have
received from a couple of people: some are using meson, which is
defaulting to -O3 by default, and others are using make which gives
you -O2 by default, but at -O2, GCC doesn't inline that
StartReadBuffer specialisation that is used in the "fast path", and
possibly more.  Some of that gap is closed by using
pg_attribute_inline_always.  Clang fails to inline at any level.  So I
should probably use the "always" macro there because that is the
intention.  Still processing the rest of your email...


Ah yeah, I also noticed that the inlining didn't happen with some 
compilers and flags. I use a mix of gcc and clang and meson and autoconf 
in my local environment.


The above micro-benchmarks were with meson and gcc -O3. GCC version:

$ gcc --version
gcc (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-29 Thread Euler Taveira
On Fri, Mar 29, 2024, at 9:44 AM, Alexander Korotkov wrote:
> This generally makes sense, but I'm not sure about this.  The
> milliseconds timeout was used initially but received critics in [1].

Alexander, I see why you changed the patch.

Peter suggested to use an interval but you proposed another data type:
float. The advantage of the interval data type is that you don't need to
carefully think about the unit, however, if you use the integer data
type you have to propose one. (If that's the case, milliseconds is a
good granularity for this feature.) I don't have a strong preference
between integer and interval data types but I don't like the float for
this case. The 2 main reasons are (a) that we treat time units (hours,
minutes, seconds, ...) as integers so it seems natural for a human being
to use a unit time as integer and (b) depending on the number of digits
after the decimal separator you still don't have an integer in the
internal unit, hence, you have to round it to integer.

We already have functions that use an integer (such as pg_terminate_backend)
and an interval (such as pg_sleep_for), and if I searched correctly this
would be the first timeout argument that uses a float.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 02:13:12PM -0500, Nathan Bossart wrote:
> * If the compiler understands AVX512 intrinsics, we assume that it also
>   knows about the required CPUID and XGETBV intrinsics, and we assume that
>   the conditions for TRY_POPCNT_FAST are true.

Bleh, cfbot's 32-bit build is unhappy with this [0].  It looks like it's
trying to build the AVX512 stuff, but TRY_POPCNT_FAST isn't set.

[19:39:11.306] ../src/port/pg_popcount_avx512.c:39:18: warning: implicit declaration of function ‘pg_popcount_fast’; did you mean ‘pg_popcount’? [-Wimplicit-function-declaration]
[19:39:11.306]    39 |  return popcnt + pg_popcount_fast(buf, bytes);
[19:39:11.306]       |                  ^~~~~~~~~~~~~~~~
[19:39:11.306]       |                  pg_popcount

There's also a complaint about the inline assembly:

[19:39:11.443] ../src/port/pg_popcount_avx512_choose.c:55:1: error: inconsistent operand constraints in an ‘asm’
[19:39:11.443]    55 | __asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr));
[19:39:11.443]       | ^~~~~~~

I'm looking into this...

> +#if defined(HAVE__GET_CPUID)
> + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
> +#elif defined(HAVE__CPUID)
> + __cpuidex(exx, 7, 0);

Is there any reason we can't use __get_cpuid() and __cpuid() here, given
the sub-leaf is 0?

[0] https://cirrus-ci.com/task/5475113447981056

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Tomas Vondra
On 3/29/24 23:03, Thomas Munro wrote:
> On Sat, Mar 30, 2024 at 10:39 AM Thomas Munro  wrote:
>> On Sat, Mar 30, 2024 at 4:53 AM Tomas Vondra
>>  wrote:
>>> ... Maybe there should be some flag to force
>>> issuing fadvise even for sequential patterns, perhaps at the tablespace
>>> level? ...
>>
>> Yeah, I've wondered about trying harder to "second guess" the Linux
>> RA.  At the moment, read_stream.c detects *exactly* sequential reads
>> (see seq_blocknum) to suppress advice, but if we knew/guessed the RA
>> window size, we could (1) detect it with the same window that Linux
>> will use to detect it, and (2) [new realisation from yesterday's
>> testing] we could even "tickle" it to wake it up in certain cases
>> where it otherwise wouldn't, by temporarily using a smaller
>> io_combine_limit if certain patterns come along.  I think that sounds
>> like madness (I suspect that any place where the latter would help is
>> a place where you could turn RA up a bit higher for the same effect
>> without weird kludges), or another way to put it would be to call it
>> "overfitting" to the pre-existing quirks; but maybe it's a future
>> research idea...
> 

I don't know if I'd call this overfitting - yes, we certainly don't want
to tailor this code to only work with the Linux RA, but OTOH the RA is
what most systems do. And if we plan to rely on that, we probably have
to "respect" how it works ...

Moving to a "clean" approach that however triggers regressions does not
seem like a great thing for users. I'm not saying the goal has to be "no
regressions", that would be rather impossible. At this point I still try
to understand what's causing this.

BTW are you suggesting that increasing the RA distance could maybe fix
the regressions? I can give it a try, but I was assuming that 128kB
readahead would be enough for combine_limit=8kB.

> I guess I missed a step when responding that suggestion: I don't think
> we could have an "issue advice always" flag, because it doesn't seem
> to work out as well as letting the kernel do it, and a global flag
> like that would affect everything else including sequential scans
> (once the streaming seq scan patch goes in).  But suppose we could do
> that, maybe even just for BHS.  In my little test yesterday it had to
> issue a lot of them, patched eic=4, to beat the kernel's RA with
> unpatched eic=0:
> 
> eic    unpatched    patched
> 0      4172         9572
> 1      30846        10376
> 2      18435        5562
> 4      18980        3503
> 
> So if we forced fadvise to be issued with a GUC, it still wouldn't be
> good enough in this case.  So we might need to try to understand what
> exactly is waking the RA up for unpatched but not patched, and try to
> tickle it by doing a little less I/O combining (for example just
> setting io_combine_limit=1 gives the same number for eic=0, a major
> clue), but that seems to be going down a weird path, and tuning such a
> copying algorithm seems too hard.

Hmmm. I admit I didn't think about the "always prefetch" flag too much,
but I did imagine it'd only affect some places (e.g. BHS, but not for
sequential scans). If it could be done by lowering the combine limit,
that could work too - in fact, I was wondering if we should have combine
limit as a tablespace parameter too.

But I think adding such knobs should be only the last resort - I myself
don't know how to set these parameters, how could we expect users to
pick good values? Better to have something that "just works".

I admit I never 100% understood when exactly the kernel RA kicks in, but
I always thought it's enough for the patterns to be only "close enough"
to sequential. Isn't the problem that this only skips fadvise for 100%
sequential patterns, but keeps prefetching for cases the RA would deal
with on its own? So maybe we should either relax the conditions when to skip
fadvise, or combine even pages that are not perfectly sequential (I'm
not sure if that's possible only for fadvise), though.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Statistics Import and Export

2024-03-29 Thread Jeff Davis
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

Also, it's new behavior, so it may cause some minor surprises, or there
might be minor interactions to work out. For instance, dumping stats
doesn't make a lot of sense if pg_upgrade (or something else) is just
going to run analyze anyway.

What do you think about starting off with it as non-default, and then
switching it to default in 18?

Regards,
Jeff Davis





RE: Popcount optimization using AVX512

2024-03-29 Thread Shankaran, Akash
> From: Nathan Bossart  
> Sent: Friday, March 29, 2024 9:17 AM
> To: Amonson, Paul D 

> On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote:
>> Yeah, I understand that much, but I want to know how portable the 
>> XGETBV instruction is.  Unless I can assume that all x86_64 systems 
>> and compilers support that instruction, we might need an additional 
>> configure check and/or CPUID check.  It looks like MSVC has had 
>> support for the _xgetbv intrinsic for quite a while, but I'm still 
>> researching the other cases.
> 
> I see google web references to the xgetbv instruction as far back as 
> 2009 for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could 
> test for
> _xgetbv() MSVC built-in. How far back do you need to go?

> Hm.  It seems unlikely that a compiler would understand AVX512 intrinsics and 
> not XGETBV then.  I guess the other question is whether CPUID indicating 
> AVX512 is enabled implies the availability of XGETBV on the CPU.
> If that's not safe, we might need to add another CPUID test.

> It would probably be easy enough to add a couple of tests for this, but if we 
> don't have reason to believe there's any practical case to do so, I don't 
> know why we would.  I'm curious what others think about this.

This seems unlikely. Machines supporting AVX512 intrinsics would support 
XGETBV. The xgetbv instruction seems to be part of the XSAVE feature set as per 
the Intel developer manual [2]. XGETBV/XSAVE came first, and seems to be available 
in all x86 systems since 2011, starting with the Intel Sandy Bridge architecture 
and the AMD Opteron Gen4 [0].
AVX512 first came into a product in 2016 [1].
[0]: https://kb.vmware.com/s/article/1005764
[1]: https://en.wikipedia.org/wiki/AVX-512
[2]: https://cdrdv2-public.intel.com/774475/252046-sdm-change-document.pdf

- Akash Shankaran





Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-29 Thread Tom Lane
Alexander Lakhin  writes:
> 29.03.2024 16:51, Tom Lane wrote:
>> Ouch.  So what's triggering that?  The intention of test_setup
>> surely is to provide a uniform starting point.

> Thanks for your attention to the issue!
> Please try the attached...

I experimented with the attached modified version of the patch,
which probes just after the relevant VACUUMs and reduces the
crankiness of ConditionalLockBufferForCleanup a bit to more nearly
approximate what we're likely to see in the buildfarm.  There are
two clearly-visible effects:

1. The initial VACUUM fails to count some pages as all-visible,
presumably exactly the same ones that ConditionalLockBufferForCleanup
fails on.  This feels like a bug.  We still scan the pages (else
reltuples would be wrong); why would we not recognize that they are
all-visible?

2. The re-VACUUM in sanity_check.sql corrects most of the
relallvisible discrepancy, presumably because it's preferentially
going after the pages that didn't get marked the first time.
However, it's distorting reltuples.  Interestingly, the distortion
is worse with less-cranky ConditionalLockBufferForCleanup.

I believe the cause of the reltuples distortion is that the
re-VACUUM will nearly always scan the last page of the relation,
which would usually contain fewer tuples than the rest, but
then it counts that page equally with the rest to compute the
new tuple density.  The fewer other pages are included in that
computation, the worse the new density estimate is, accounting
for the effect that when ConditionalLockBufferForCleanup is more
prone to failure the error gets smaller.

The comments in vac_estimate_reltuples already point out that
vacuum tends to always hit the last page and claim that we
"handle that here", but it's doing nothing about the likelihood
that the last page has fewer than the normal number of tuples.
I wonder if we could adjust the density calculation to account
for that.  I don't think it'd be unreasonable to just assume
that the last page is only half full.  Or we could try to get
the vacuum logic to report the last-page count separately ...
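The last-page bias is easy to see with toy numbers (a sketch of the idea, not the vac_estimate_reltuples code): if full pages hold 100 tuples and the always-scanned last page holds 10, the naive per-page density comes out too low, while weighting that page as half a page pulls the estimate in the right direction.

```c
/* naive: every scanned page is treated as a typical page */
static double
naive_density(double scanned_tuples, double scanned_pages)
{
	return scanned_tuples / scanned_pages;
}

/*
 * The suggestion above, roughly: assume the always-scanned last page is
 * only half full, i.e. weight it as half a page in the density estimate.
 */
static double
last_page_half_full_density(double scanned_tuples, double scanned_pages)
{
	return scanned_tuples / (scanned_pages - 0.5);
}
```

With five scanned pages (four full plus the 10-tuple last one, 410 tuples total) the naive estimate is 82 tuples/page while the adjusted one is about 91, closer to the true 100 — and the fewer pages are scanned, the bigger the gap, matching the effect described above.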

I tried the patch in v16 too and got similar results, so these
are not new problems.

regards, tom lane




Re: Why is parula failing?

2024-03-29 Thread Tom Lane
I wrote:
> I'd not looked closely enough at the previous failure, because
> now that I have, this is well out in WTFF territory: how can
> reltuples be greater than zero when relpages is zero?  This can't
> be a state that autovacuum would have left behind, unless it's
> really seriously broken.  I think we need to be looking for
> explanations like "memory stomp" or "compiler bug".

... in connection with which, I can't help noticing that parula
is using a very old compiler:

configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)

From some quick checking around, that would have to be near the
beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64
to a primary architecture until earlier that same year).  It's not
exactly hard to believe that there were some lingering compiler bugs.
I wonder why parula is using that when its underlying system seems
markedly newer (the kernel at least has a recent build date).

regards, tom lane




Allowing DESC for a PRIMARY KEY column

2024-03-29 Thread Mitar
Hi!

I have the same problem as [1]. I have table something like:

CREATE TABLE "values" (
  id int NOT NULL,
  revision int NOT NULL,
  data jsonb NOT NULL,
  PRIMARY KEY (id, revision)
)

And I would like to be able to specify PRIMARY KEY (id, revision DESC)
because the most common query I am making is:

SELECT data FROM "values" WHERE id=123 ORDER BY revision DESC LIMIT 1

My understanding, based on [2], is that the primary key index cannot
help here, unless it is defined with DESC on revision. But this does
not seem to be possible. Would you entertain a patch adding this
feature? It seems pretty straightforward?


Mitar

[1] 
https://stackoverflow.com/questions/45597101/primary-key-with-asc-or-desc-ordering
[2] https://www.postgresql.org/docs/16/indexes-ordering.html

-- 
https://mitar.tnode.com/
https://twitter.com/mitar_m
https://noc.social/@mitar




Re: WIP Incremental JSON Parser

2024-03-29 Thread Jacob Champion
On Fri, Mar 29, 2024 at 9:42 AM Andrew Dunstan  wrote:
> Here's a new set of patches, with I think everything except the error case 
> Asserts attended to. There's a change to add semantic processing to the test 
> suite in patch 4, but I'd fold that into patch 1 when committing.

Thanks! 0004 did highlight one last bug for me -- the return value
from semantic callbacks is ignored, so applications can't interrupt
the parsing early:

> +   if (ostart != NULL)
> +   (*ostart) (sem->semstate);
> +   inc_lex_level(lex);

Anything not JSON_SUCCESS should be returned immediately, I think, for
this and the other calls.

> +   /* make sure we've used all the input */
> +   if (lex->token_terminator - lex->token_start != ptok->len)
> +   return JSON_INVALID_TOKEN;

nit: Debugging this, if anyone ever hits it, is going to be confusing.
The error message is going to claim that a valid token is invalid, as
opposed to saying that the parser has a bug. Short of introducing
something like JSON_INTERNAL_ERROR, maybe an Assert() should be added
at least?

> +   case JSON_NESTING_TOO_DEEP:
> +   return(_("Json nested too deep, maximum permitted depth is 
> 6400"));

nit: other error messages use all-caps, "JSON"

> +   char        chunk_start[100],
> +               chunk_end[100];
> +
> +   snprintf(chunk_start, 100, "%s", ib->buf.data);
> +   snprintf(chunk_end, 100, "%s", ib->buf.data + (ib->buf.len - 
> (MIN_CHUNK + 99)));
> +   elog(NOTICE, "incremental 
> manifest:\nchunk_start='%s',\nchunk_end='%s'", chunk_start, chunk_end);

Is this NOTICE intended to be part of the final commit?

> +   inc_state = json_parse_manifest_incremental_init();
> +
> +   buffer = pg_malloc(chunk_size + 64);

These `+ 64`s (there are two of them) seem to be left over from
debugging. In v7 they were just `+ 1`.

--

From this point onward, I think all of my feedback is around
maintenance characteristics, which is firmly in YMMV territory, and I
don't intend to hold up a v1 of the feature for it since I'm not the
maintainer. :D

The complexity around the checksum handling and "final chunk"-ing is
unfortunate, but not a fault of this patch -- just a weird consequence
of the layering violation in the format, where the checksum is inside
the object being checksummed. I don't like it, but I don't think
there's much to be done about it.

By far the trickiest part of the implementation is the partial token
logic, because of all the new ways there are to handle different types
of failures. I think any changes to the incremental handling in
json_lex() are going to need intense scrutiny from someone who already
has the mental model loaded up. I'm going snowblind on the patch and
I'm no longer the right person to review how hard it is to get up to
speed with the architecture, but I'd say it's not easy.

For something as security-critical as a JSON parser, I'd usually want
to counteract that by sinking the observable behaviors in concrete and
getting both the effective test coverage *and* the code coverage as
close to 100% as we can stand. (By that, I mean that purposely
introducing observable bugs into the parser should result in test
failures. We're not there yet when it comes to the semantic callbacks,
at least.) But I don't think the current JSON parser is held to that
standard currently, so it seems strange for me to ask for that here.

I think it'd be good for a v1.x of this feature to focus on
simplification of the code, and hopefully consolidate and/or eliminate
some of the duplicated parsing work so that the mental model isn't
quite so big.

Thanks!
--Jacob




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Thomas Munro
On Sat, Mar 30, 2024 at 4:53 AM Tomas Vondra
 wrote:
> Two observations:
>
> * The combine limit seems to have negligible impact. There's no visible
> difference between combine_limit=8kB and 128kB.
>
> * Parallel queries seem to work about the same as master (especially for
> optimal cases, but even for not optimal ones).
>
>
> The optimal plans with kernel readahead (two charts in the first row)
> look fairly good. There are a couple regressed cases, but a bunch of
> faster ones too.

Thanks for doing this!

> The optimal plans without kernel read ahead (two charts in the second
> row) perform pretty poorly - there are massive regressions. But I think
> the obvious reason is that the streaming read API skips prefetches for
> sequential access patterns, relying on kernel to do the readahead. But
> if the kernel readahead is disabled for the device, that obviously can't
> happen ...

Right, it does seem that this whole concept is sensitive on the
'borderline' between sequential and random, and this patch changes
that a bit and we lose some.  It's becoming much clearer to me that
master is already exposing weird kinks, and the streaming version is
mostly better, certainly on low IOPS systems.  I suspect that there
must be queries in the wild that would run much faster with eic=0 than
eic=1 today due to that, and while the streaming version also loses in
some cases, it seems that it mostly loses because of not triggering
RA, which can at least be improved by increasing the RA window.  On
the flip side, master is more prone to running out of IOPS and there
is no way to tune your way out of that.

> I think the question is how much we can (want to) rely on the readahead
> to be done by the kernel. ...

We already rely on it everywhere, for basic things like sequential scan.

> ... Maybe there should be some flag to force
> issuing fadvise even for sequential patterns, perhaps at the tablespace
> level? ...

Yeah, I've wondered about trying harder to "second guess" the Linux
RA.  At the moment, read_stream.c detects *exactly* sequential reads
(see seq_blocknum) to suppress advice, but if we knew/guessed the RA
window size, we could (1) detect it with the same window that Linux
will use to detect it, and (2) [new realisation from yesterday's
testing] we could even "tickle" it to wake it up in certain cases
where it otherwise wouldn't, by temporarily using a smaller
io_combine_limit if certain patterns come along.  I think that sounds
like madness (I suspect that any place where the latter would help is
a place where you could turn RA up a bit higher for the same effect
without weird kludges), or another way to put it would be to call it
"overfitting" to the pre-existing quirks; but maybe it's a future
research idea...
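For reference, the exactly-sequential detection mentioned above boils down
to something like this toy sketch (the real logic lives in read_stream.c;
the function name and types here are invented):

```c
#include <stdbool.h>

typedef long BlockNum;

/*
 * Toy sketch of the advice-suppression idea: remember the block number
 * that would continue a strictly sequential run, and skip issuing
 * posix_fadvise() exactly when the next read matches it, leaving that
 * case to kernel readahead.
 */
static bool
need_advice(BlockNum blocknum, BlockNum *seq_blocknum)
{
	bool		sequential = (blocknum == *seq_blocknum);

	/* the next strictly sequential read would be the following block */
	*seq_blocknum = blocknum + 1;

	return !sequential;
}
```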

> I don't recall seeing a system with disabled readahead, but I'm
> sure there are cases where it may not really work - it clearly can't
> work with direct I/O, ...

Right, for direct I/O everything is slow right now including seq scan.
We need to start asynchronous reads in the background (imagine
literally just a bunch of background "I/O workers" running preadv() on
your behalf to get your future buffers ready for you, or equivalently
Linux io_uring).  That's the real goal of this project: restructuring
so we have the information we need to do that, ie teach every part of
PostgreSQL to predict the future in a standard and centralised way.
Should work out better than RA heuristics, because we're not just
driving in a straight line, we can turn corners too.

> ... but I've also not been very successful with
> prefetching on ZFS.

posix_fadvise() did not do anything in OpenZFS before 2.2, maybe you
have an older version?

> I certainly admit the data sets are synthetic and perhaps adversarial.
> My intent was to cover a wide range of data sets, to trigger even less
> common cases. It's certainly up to debate how serious the regressions on
> those data sets are in practice, I'm not suggesting "this strange data
> set makes it slower than master, so we can't commit this".

Right, yeah.  Thanks!  Your initial results seemed discouraging, but
looking closer I'm starting to feel a lot more positive about
streaming BHS.




Re: Table AM Interface Enhancements

2024-03-29 Thread Pavel Borisov
>
> I think that, for cleaner code, this could be removed:
> >vlock:
>  >   CHECK_FOR_INTERRUPTS();
> together with the CHECK_FOR_INTERRUPTS(); in heapam_tuple_insert_with_arbiter
> placed at the beginning of the while loop.
>
To clarify things, I wrote this only about the CHECK_FOR_INTERRUPTS();
rearrangement.

Regards,
Pavel


Re: Combine Prune and Freeze records emitted by vacuum

2024-03-29 Thread Melanie Plageman
On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas  wrote:
>
> On 29/03/2024 07:04, Melanie Plageman wrote:
> > On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote:
> >> These comments could use another pass. I had added some extra
> >> (probably redundant) content when I thought I was refactoring it a
> >> certain way and then changed my mind.
> >>
> >> Attached is a diff with some ideas I had for a bit of code simplification.
> >>
> >> Are you working on cleaning this patch up or should I pick it up?
> >
> > Attached v9 is rebased over master. But, more importantly, I took
> > another pass at heap_prune_chain() and am pretty happy with what I came
> > up with. See 0021. I simplified the traversal logic and then grouped the
> > chain processing into three branches at the end. I find it much easier
> > to understand what we are doing for different types of HOT chains.
>
> Ah yes, agreed, that's nicer.
>
> The 'survivor' variable is a little confusing, especially here:
>
> if (!survivor)
> {
> int i;
>
> /*
>  * If no DEAD tuple was found, and the root is redirected, 
> mark it as
>  * such.
>  */
> if ((i = ItemIdIsRedirected(rootlp)))
> heap_prune_record_unchanged_lp_redirect(prstate, 
> rootoffnum);
>
> /* the rest of tuples in the chain are normal, unchanged 
> tuples */
> for (; i < nchain; i++)
> heap_prune_record_unchanged(dp, prstate, 
> chainitems[i]);
> }
>
> You would think that "if (!survivor)" means that there are no live
> tuples on the chain, i.e. no survivors. But in fact it's the opposite;
> it means that they are all live. Maybe call it 'ndeadchain' instead,
> meaning the number of dead items in the chain.

Makes sense.

> > I've done that in my version. While testing this, I found that only
> > on-access pruning needed this final loop calling record_unchanged() on
> > items not yet marked. I know we can't skip this final loop entirely in
> > the ON ACCESS case because it calls record_prunable(), but we could
> > consider moving that back out into heap_prune_chain()? Or what do you
> > think?
>
> Hmm, why is that different with on-access pruning?

Well, it is highly possible we just don't hit cases like this with
vacuum in our test suite (not that it is unreachable by vacuum).

It's just very easy to get in this situation with on-access pruning.
Imagine an UPDATE which caused the following chain:

RECENTLY_DEAD -> DELETE_IN_PROGRESS -> INSERT_IN_PROGRESS

It invokes heap_page_prune_and_freeze() (assume the page meets the
criteria for on-access pruning) and eventually enters
heap_prune_chain() with the first offset in this chain.

The first item is LP_NORMAL and the tuple is RECENTLY_DEAD, so the
survivor variable stays 0 and we record_unchanged() for that tuple and
return. The next two items are LP_NORMAL and the tuples are HOT
tuples, so we just return from the "fast path" at the top of
heap_prune_chain(). After invoking heap_prune_chain() for all of them,
the first offset is marked but the other two are not. Thus, we end up
having to record_unchanged() later. This kind of thing is a super
common case that we see all the time in queries in the regression test
suite.

> Here's another idea: In the first loop through the offsets, where we
> gather the HTSV status of each item, also collect the offsets of all HOT
> and non-HOT items to two separate arrays. Call heap_prune_chain() for
> all the non-HOT items first, and then process any remaining HOT tuples
> that haven't been marked yet.

That's an interesting idea. I'll try it out and see how it works.

> > I haven't finished updating all the comments, but I am really interested
> > to know what you think about heap_prune_chain() now.
>
> Looks much better now, thanks!

I am currently doing chain traversal refactoring in heap_prune_chain()
on top of master as the first patch in the set.

- Melanie




RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> A counterexample is the CRC32C code.  AFAICT we assume the presence of
> CPUID in that code (and #error otherwise).  I imagine it's probably safe to
> assume the compiler understands CPUID if it understands AVX512 intrinsics,
> but that is still mostly a guess.

If AVX-512 intrinsics are available, then yes you will have CPUID. CPUID is 
much older in the hardware/software timeline than AVX-512.

Thanks,
Paul





Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-03-29 Thread Laurenz Albe
On Fri, 2024-03-29 at 14:07 +0100, Laurenz Albe wrote:
> I had a look at patch 0001 (0002 will follow).

Here is the code review for patch number 2:


> diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
[...]
+static bool
+SetupGOutput(PGresult *result, FILE **gfile_fout, bool *is_pipe)
[...]
+static void
+CloseGOutput(FILE *gfile_fout, bool is_pipe)

It makes sense to factor out this code.
But shouldn't these functions have a prototype at the beginning of the file?

> +   /*
> +* If FETCH_COUNT is set and the context allows it, use the single row
> +* mode to fetch results and have no more than FETCH_COUNT rows in
> +* memory.
> +*/

That comment talks about single-row mode, when you are using chunked mode.
You probably forgot to modify the comment from a previous version of the patch.

> +   if (fetch_count > 0 && !pset.crosstab_flag && !pset.gexec_flag && 
> !is_watch
> +   && !pset.gset_prefix && pset.show_all_results)
> +   {
> +   /*
> +* The row-by-chunks fetch is not enabled when SHOW_ALL_RESULTS is 
> false,
> +* since we would need to accumulate all rows before knowing
> +* whether they need to be discarded or displayed, which contradicts
> +* FETCH_COUNT.
> +*/
> +   if (!PQsetChunkedRowsMode(pset.db, fetch_count))
> +   {

I think that comment should be before the "if" statement, not inside it.

Here is a suggestion for a consolidated comment:

  Fetch the result in chunks if FETCH_COUNT is set.  We don't enable chunking
  if SHOW_ALL_RESULTS is false, since that requires us to accumulate all rows
  before we can tell what should be displayed, which would counter the idea
  of FETCH_COUNT.  Chunk fetching is also disabled if \gset, \crosstab,
  \gexec and \watch are used.

> +   if (fetch_count > 0 && result_status == PGRES_TUPLES_CHUNK)

Could it be that result_status == PGRES_TUPLES_CHUNK, but fetch_count is 0?
If not, perhaps there should be an Assert that verifies that, and the "if"
statement should only check for the latter condition.

> --- a/src/bin/psql/t/001_basic.pl
> +++ b/src/bin/psql/t/001_basic.pl
> @@ -184,10 +184,10 @@ like(
> "\\set FETCH_COUNT 1\nSELECT error;\n\\errverbose",
> on_error_stop => 0))[2],
> qr/\A^psql::2: ERROR:  .*$
> -^LINE 2: SELECT error;$
> +^LINE 1: SELECT error;$
>  ^ *^.*$
>  ^psql::3: error: ERROR:  [0-9A-Z]{5}: .*$
> -^LINE 2: SELECT error;$
> +^LINE 1: SELECT error;$

Why does the output change?  Perhaps there is a good and harmless
explanation, but the naïve expectation would be that it doesn't.


The patch does not apply any more because of a conflict with the
non-blocking PQcancel patch.

After fixing the problem manually, it builds without warning.
The regression tests pass, and the feature works as expected.

Yours,
Laurenz Albe




Re: Allowing DESC for a PRIMARY KEY column

2024-03-29 Thread Tom Lane
Mitar  writes:
> And I would like to be able to specify PRIMARY KEY (id, revision DESC)
> because the most common query I am making is:
> SELECT data FROM "values" WHERE id=123 ORDER BY revision DESC LIMIT 1

Did you experiment with whether that actually needs a special index?
I get

regression=# create table t(id int, revision int, primary key(id, revision));
CREATE TABLE
regression=# explain select * from t WHERE id=123 ORDER BY revision DESC LIMIT 1;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Limit  (cost=0.15..3.45 rows=1 width=8)
   ->  Index Only Scan Backward using t_pkey on t  (cost=0.15..36.35 rows=11 width=8)
         Index Cond: (id = 123)
(3 rows)

> My understanding, based on [2], is that the primary key index cannot
> help here, unless it is defined with DESC on revision. But this does
> not seem to be possible. Would you entertain a patch adding this
> feature? It seems pretty straightforward?

You would need a lot stronger case than "I didn't bother checking
whether I really need this".

regards, tom lane




Security lessons from liblzma

2024-03-29 Thread Bruce Momjian
You might have seen reports today about a very complex exploit added to
recent versions of liblzma.  Fortunately, it was only enabled two months
ago and has not been pushed to most stable operating systems like Debian
and Ubuntu.  The original detection report is:

https://www.openwall.com/lists/oss-security/2024/03/29/4

And this ycombinator discussion has details:

https://news.ycombinator.com/item?id=39865810

It looks like an earlier commit with a binary blob "test data"
contained the bulk of the backdoor, then the configure script
enabled it, and then later commits patched up valgrind errors
caused by the backdoor. See the commit links in the "Compromised
Repository" section.

and I think the configure came in through the autoconf output file
'configure', not configure.ac:

This is my main take-away from this. We must stop using upstream
configure and other "binary" scripts. Delete them all and run
"autoreconf -fi" to recreate them. (Debian already does something
like this I think.)

Now, we don't take pull requests, and all our committers are known
individuals, but this might have cautionary lessons for us.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Security lessons from liblzma

2024-03-29 Thread Tom Lane
Daniel Gustafsson  writes:
> One take-away for me is how important it is to ship recipes for regenerating
> any testdata which is included in generated/compiled/binary format.

IMO that's a hard, no-exceptions requirement just for
maintainability's sake, never mind security risks.

regards, tom lane




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Tomas Vondra
On 3/29/24 22:39, Thomas Munro wrote:
> ...
> 
>> I don't recall seeing a system with disabled readahead, but I'm
>> sure there are cases where it may not really work - it clearly can't
>> work with direct I/O, ...
> 
> Right, for direct I/O everything is slow right now including seq scan.
> We need to start asynchronous reads in the background (imagine
> literally just a bunch of background "I/O workers" running preadv() on
> your behalf to get your future buffers ready for you, or equivalently
> Linux io_uring).  That's the real goal of this project: restructuring
> so we have the information we need to do that, ie teach every part of
> PostgreSQL to predict the future in a standard and centralised way.
> Should work out better than RA heuristics, because we're not just
> driving in a straight line, we can turn corners too.
> 
>> ... but I've also not been very successful with
>> prefetching on ZFS.
> 
> posix_fadvise() did not do anything in OpenZFS before 2.2, maybe you
> have an older version?
> 

Sorry, I meant the prefetch (readahead) built into ZFS. I may be wrong
but I don't think the regular RA (in linux kernel) works for ZFS, right?

I was wondering if we could use this (posix_fadvise) to improve that,
essentially by issuing fadvise even for sequential patterns. But now
that I think about it, if posix_fadvise has worked since 2.2, maybe RA
works too now?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP Incremental JSON Parser

2024-03-29 Thread Andrew Dunstan



On 2024-03-26 Tu 17:52, Jacob Champion wrote:

On Mon, Mar 25, 2024 at 3:14 PM Jacob Champion
 wrote:

- add Assert calls in impossible error cases [2]

To expand on this one, I think these parts of the code (marked with
`<- here`) are impossible to reach:


switch (top)
{
    case JSON_TOKEN_STRING:
        if (next_prediction(pstack) == JSON_TOKEN_COLON)
            ctx = JSON_PARSE_STRING;
        else
            ctx = JSON_PARSE_VALUE;        <- here
        break;
    case JSON_TOKEN_NUMBER:                <- here
    case JSON_TOKEN_TRUE:                  <- here
    case JSON_TOKEN_FALSE:                 <- here
    case JSON_TOKEN_NULL:                  <- here
    case JSON_TOKEN_ARRAY_START:           <- here
    case JSON_TOKEN_OBJECT_START:          <- here
        ctx = JSON_PARSE_VALUE;
        break;
    case JSON_TOKEN_ARRAY_END:             <- here
        ctx = JSON_PARSE_ARRAY_NEXT;
        break;
    case JSON_TOKEN_OBJECT_END:            <- here
        ctx = JSON_PARSE_OBJECT_NEXT;
        break;
    case JSON_TOKEN_COMMA:                 <- here
        if (next_prediction(pstack) == JSON_TOKEN_STRING)
            ctx = JSON_PARSE_OBJECT_NEXT;
        else
            ctx = JSON_PARSE_ARRAY_NEXT;
        break;

Since none of these cases are non-terminals, the only way to get to
this part of the code is if (top != tok). But inspecting the
productions and transitions that can put these tokens on the stack, it
looks like the only way for them to be at the top of the stack to
begin with is if (tok == top). (Otherwise, we would have chosen a
different production, or else errored out on a non-terminal.)

This case is possible...


 case JSON_TOKEN_STRING:
 if (next_prediction(pstack) == JSON_TOKEN_COLON)
 ctx = JSON_PARSE_STRING;

...if we're in the middle of JSON_PROD_[MORE_]KEY_PAIRS. But the
corresponding else case is not, because if we're in the middle of a
_KEY_PAIRS production, the next_prediction() _must_ be
JSON_TOKEN_COLON.

Do you agree, or am I missing a way to get there?




One good way of testing would be to add the Asserts, build with 
-DFORCE_JSON_PSTACK, and run the standard regression suite, which has a 
fairly comprehensive set of JSON errors. I'll play with that.


cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Comments on Custom RMGRs

2024-03-29 Thread Jeff Davis
On Fri, 2024-03-29 at 18:20 +0700, Danil Anisimow wrote:
> 
> In [rmgr_003.v3.patch] I added a phase argument to RmgrCheckpoint().
> Currently it is only called in two places: before and after
> CheckPointBuffers().

I am fine with this.

You've moved the discussion forward in two ways:

  1. Changes to pg_stat_statements to actually use the API; and
  2. The hook is called at multiple points.

Those at least partially address the concerns raised by Andres and
Robert. But given that there was pushback from multiple people on the
feature, I'd like to hear from at least one of them. It's very late in
the cycle so I'm not sure we'll get more feedback in time, though.

Regards,
Jeff Davis





Re: Building with meson on NixOS/nixpkgs

2024-03-29 Thread walther

Wolfgang Walther:
To build on NixOS/nixpkgs I came up with a few small patches to 
meson.build. All of this works fine with Autoconf/Make already.


In v3, I added another small patch for meson, this one about proper 
handling of -Dlibedit_preferred when used together with -Dreadline=enabled.


Best,

Wolfgang

From 1e7db8b57f69ed9866fdf874bbd0dcb33d25c045 Mon Sep 17 00:00:00 2001
From: Wolfgang Walther 
Date: Sat, 2 Mar 2024 17:18:38 +0100
Subject: [PATCH v3 1/4] Fallback to uuid for ossp-uuid with meson

The upstream name for the ossp-uuid package / pkg-config file is "uuid". Many
distributions change this to be "ossp-uuid" to not conflict with e2fsprogs.

This lookup fails on distributions which don't change this name, for example
NixOS / nixpkgs. Both "ossp-uuid" and "uuid" are also checked in configure.ac.
---
 meson.build | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index 18b5be842e3..4be5f65e8b6 100644
--- a/meson.build
+++ b/meson.build
@@ -1346,7 +1346,8 @@ if uuidopt != 'none'
 uuidfunc = 'uuid_to_string'
 uuidheader = 'uuid.h'
   elif uuidopt == 'ossp'
-uuid = dependency('ossp-uuid', required: true)
+# upstream is called "uuid", but many distros change this to "ossp-uuid"
+uuid = dependency('ossp-uuid', 'uuid', required: true)
 uuidfunc = 'uuid_export'
 uuidheader = 'uuid.h'
   else
-- 
2.44.0

From 28e8067b6b04ac601946fab4aa0849b3dfec5a7d Mon Sep 17 00:00:00 2001
From: Wolfgang Walther 
Date: Sat, 2 Mar 2024 22:06:25 +0100
Subject: [PATCH v3 2/4] Fallback to clang in PATH with meson

Some distributions put clang into a different path than the llvm binary path.

For example, this is the case on NixOS / nixpkgs, which failed to find clang
with meson before this patch.
---
 meson.build | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index 4be5f65e8b6..4a6af978fd4 100644
--- a/meson.build
+++ b/meson.build
@@ -759,7 +759,10 @@ if add_languages('cpp', required: llvmopt, native: false)
 llvm_binpath = llvm.get_variable(configtool: 'bindir')
 
 ccache = find_program('ccache', native: true, required: false)
-clang = find_program(llvm_binpath / 'clang', required: true)
+
+# Some distros put LLVM and clang in different paths, so fallback to
+# find via PATH, too.
+clang = find_program(llvm_binpath / 'clang', 'clang', required: true)
   endif
 elif llvmopt.auto()
   message('llvm requires a C++ compiler')
-- 
2.44.0

From 01fda05ea14e8cf0d201b749c84b1f3cb532f353 Mon Sep 17 00:00:00 2001
From: Wolfgang Walther 
Date: Mon, 11 Mar 2024 19:54:41 +0100
Subject: [PATCH v3 3/4] Support absolute bindir/libdir in regression tests
 with meson

Passing an absolute bindir/libdir will install the binaries and libraries to
/tmp_install/ and /tmp_install/ respectively.

This path is correctly passed to the regression test suite via configure/make,
but not via meson, yet. This is because the "/" operator in the following expression
throws away the whole left side when the right side is an absolute path:

  test_install_location / get_option('libdir')

This was already correctly handled for dir_prefix, which is likely absolute as well.
This patch handles both bindir and libdir in the same way - prefixing absolute paths
with the tmp_install path correctly.
---
 meson.build | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/meson.build b/meson.build
index 4a6af978fd4..3e628659843 100644
--- a/meson.build
+++ b/meson.build
@@ -3048,15 +3048,17 @@ test_install_destdir = meson.build_root() / 'tmp_install/'
 if build_system != 'windows'
   # On unixoid systems this is trivial, we just prepend the destdir
   assert(dir_prefix.startswith('/')) # enforced by meson
-  test_install_location = '@0@@1@'.format(test_install_destdir, dir_prefix)
+  temp_install_bindir = '@0@@1@'.format(test_install_destdir, dir_prefix / dir_bin)
+  temp_install_libdir = '@0@@1@'.format(test_install_destdir, dir_prefix / dir_lib)
 else
   # drives, drive-relative paths, etc make this complicated on windows, call
   # into a copy of meson's logic for it
   command = [
 python, '-c',
 'import sys; from pathlib import PurePath; d1=sys.argv[1]; d2=sys.argv[2]; print(str(PurePath(d1, *PurePath(d2).parts[1:])))',
-test_install_destdir, dir_prefix]
-  test_install_location = run_command(command, check: true).stdout().strip()
+test_install_destdir]
+  temp_install_bindir = run_command(command, dir_prefix / dir_bin, check: true).stdout().strip()
+  temp_install_libdir = run_command(command, dir_prefix / dir_lib, check: true).stdout().strip()
 endif
 
 meson_install_args = meson_args + ['install'] + {
@@ -3093,7 +3095,6 @@ testport = 4
 
 test_env = environment()
 
-temp_install_bindir = test_install_location / get_option('bindir')
 test_initdb_template = meson.build_root() / 'tmp_install' / 'initdb-template'
 test_env.set('PG_REGRESS', pg_regress.full_path())
 

Re: Statistics Import and Export

2024-03-29 Thread Tom Lane
Jeff Davis  writes:
> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
>> I’d certainly think “with stats” would be the preferred default of
>> our users.

> What do you think about starting off with it as non-default, and then
> switching it to default in 18?

I'm with Stephen: I find it very hard to imagine that there's any
users who wouldn't want this as default.  If we do what you suggest,
then there will be three historical behaviors to cope with not two.
That doesn't sound like it'll make anyone's life better.

As for the "it might break" argument, that could be leveled against
any nontrivial patch.  You're at least offering an opt-out switch,
which is something we more often don't do.

(I've not read the patch yet, but I assume the switch works like
other pg_dump filters in that you can apply it on the restore
side?)

regards, tom lane




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote:
> We don't do MSVC via autoconf/Make.  We used to have a special build
> framework for MSVC which parsed Makefiles to produce "solution" files,
> but it was removed as soon as Meson was mature enough to build.  See
> commit 1301c80b2167.  If it builds with Meson, you're good.

The latest cfbot build for this seems to indicate that at least newer MSVC
knows AVX512 intrinsics without any special compiler flags [0], so maybe
what I had in v14 is good enough.  A previous version of the patch set [1]
had the following lines:

+  if host_system == 'windows'
+    test_flags = ['/arch:AVX512']
+  endif

I'm not sure if this is needed for older MSVC or something else.  IIRC I
couldn't find any other examples of this sort of thing in the meson
scripts, either.  Paul, do you recall why you added this?

[0] https://cirrus-ci.com/task/5787206636273664?logs=configure#L159
[1] 
https://postgr.es/m/attachment/158206/v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote:
>> Yeah, I understand that much, but I want to know how portable the XGETBV
>> instruction is.  Unless I can assume that all x86_64 systems and compilers
>> support that instruction, we might need an additional configure check and/or
>> CPUID check.  It looks like MSVC has had support for the _xgetbv intrinsic 
>> for
>> quite a while, but I'm still researching the other cases.
> 
> I see google web references to the xgetbv instruction as far back as 2009
> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for
> _xgetbv() MSVC built-in. How far back do you need to go?

Hm.  It seems unlikely that a compiler would understand AVX512 intrinsics
and not XGETBV then.  I guess the other question is whether CPUID
indicating AVX512 is enabled implies the availability of XGETBV on the CPU.
If that's not safe, we might need to add another CPUID test.

It would probably be easy enough to add a couple of tests for this, but if
we don't have reason to believe there's any practical case to do so, I
don't know why we would.  I'm curious what others think about this.
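To make that concrete, the kind of runtime gate being discussed could be
sketched as follows.  This is a hypothetical illustration rather than code
from any patch: the function name is invented, and it assumes GCC/Clang-style
__get_cpuid() plus the _xgetbv() intrinsic enabled per-function via a
target("xsave") attribute.  Checking the CPUID OSXSAVE bit first means XGETBV
is never executed on a CPU/OS combination that lacks XSAVE support.

```c
#include <stdbool.h>

#if defined(__x86_64__)
#include <cpuid.h>
#include <immintrin.h>

/*
 * Return true only if the OS has enabled the ZMM register state.  CPUID
 * leaf 1, ECX bit 27 (OSXSAVE) guarantees that XGETBV is executable, so we
 * test it before issuing _xgetbv().
 */
__attribute__((target("xsave")))
static bool
zmm_regs_available(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;
	if ((ecx & (1U << 27)) == 0)	/* no OSXSAVE: XGETBV would fault */
		return false;

	/* XCR0 bits 5-7: opmask and ZMM register state saved/restored by the OS */
	return (_xgetbv(0) & 0xe0) == 0xe0;
}
#else
static bool
zmm_regs_available(void)
{
	return false;				/* non-x86_64: feature compiled out */
}
#endif
```

On non-x86_64 builds the stub simply reports false, mirroring how such a
check would be hidden behind a configure guard.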

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: broken JIT support on Fedora 40

2024-03-29 Thread Thomas Munro
On Fri, Mar 22, 2024 at 7:15 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > For verification, I've modified the deform.outblock to call LLVMBuildRet
> > instead of LLVMBuildRetVoid and this seems to help -- inline and deform
> > stages are still performed as before, but nothing crashes. But of course
> > it doesn't sound right that inlining pass cannot process such code.

Thanks for investigating and filing the issue.  It doesn't seem to be
moving yet.  Do you want to share the LLVMBuildRet() workaround?
Maybe we need to consider shipping something like that in the
meantime?




Re: Statistics Import and Export

2024-03-29 Thread Jeff Davis
On Fri, 2024-03-29 at 20:54 -0400, Stephen Frost wrote:
> What’s different, given the above arguments, in making the change
> with 18 instead of now?

Acknowledged. You, Tom, and Corey (and perhaps everyone else) seem to
be aligned here, so that's consensus enough for me. Default is with
stats, --no-statistics to disable them.

> Independently, I had a thought around doing an analyze as the data is
> being loaded ..

Right, I think there are some interesting things to pursue here. I also
had an idea to use logical decoding to get a streaming sample, which
would be better randomness than block sampling. At this point that's
just an idea, I haven't looked into it seriously.

Regards,
Jeff Davis





Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
Here's a v17 of the patch.  This one has configure checks for everything
(i.e., CPUID, XGETBV, and the AVX512 intrinsics) as well as the relevant
runtime checks (i.e., we call CPUID to check for XGETBV and AVX512 POPCNT
availability, and we call XGETBV to ensure the ZMM registers are enabled).
I restricted the AVX512 configure checks to x86_64 since we know we won't
have TRY_POPCNT_FAST on 32-bit, and we rely on pg_popcount_fast() as our
fallback implementation in the AVX512 version.  Finally, I removed the
inline assembly in favor of using the _xgetbv() intrinsic on all systems.
It looks like that's available on gcc, clang, and msvc, although it
sometimes requires -mxsave, so that's applied to
pg_popcount_avx512_choose.o as needed.  I doubt this will lead to SIGILLs,
but it's admittedly a little shaky.
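For readers following along, the shape of the AVX512 path can be sketched
from the same intrinsics the configure probes exercise.  This is a
simplified, hypothetical kernel rather than the patch's code: the function
names are invented, the runtime gate shown checks only the CPUID
AVX512-VPOPCNTDQ bit (a complete gate would also verify OSXSAVE/XGETBV and
the ZMM state, as described above), tail bytes are left to the scalar
fallback, and a per-function target attribute stands in for the per-file
compiler flags the build applies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <cpuid.h>
#include <immintrin.h>

/* CPUID.(EAX=7,ECX=0):ECX bit 14 advertises AVX512-VPOPCNTDQ. */
static bool
avx512_popcnt_available(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return (ecx & (1U << 14)) != 0;
}

/*
 * Count set bits 64 bytes at a time.  Any remaining tail bytes would be
 * handed to the portable/fast scalar implementation.
 */
__attribute__((target("avx512vpopcntdq,avx512f")))
static uint64_t
popcount_avx512(const char *buf, size_t bytes)
{
	__m512i		accum = _mm512_setzero_si512();

	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
	{
		const __m512i val = _mm512_loadu_si512((const __m512i *) buf);

		accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(val));
		buf += sizeof(__m512i);
	}
	return (uint64_t) _mm512_reduce_add_epi64(accum);
}
#endif							/* __x86_64__ */
```

A 128-byte buffer of 0xff bytes, for example, should count 128 * 8 = 1024
set bits, matching the portable implementations.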

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From a26b209927cc6b266b33f74fd734772eff87bff9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v17 1/1] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  56 +++---
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 13 files changed, 638 insertions(+), 39 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# -
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF 

Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-29 Thread Alexander Lakhin

29.03.2024 11:59, Alexander Lakhin wrote:
> This simple change fixes the issue for me:
> -VACUUM ANALYZE tenk2;
> +VACUUM (ANALYZE, DISABLE_PAGE_SKIPPING) tenk2;

I'm sorry, I wasn't persevering enough when testing that...
After more test runs, I see that in fact it doesn't help.

Best regards,
Alexander




Re: remaining sql/json patches

2024-03-29 Thread jian he
On Fri, Mar 29, 2024 at 11:20 AM jian he  wrote:
>
>
> +
> +JSON_TABLE (
> +  context_item, path_expression [ AS json_path_name ]
> +  [ PASSING { value AS varname } [, ...] ]
> +  COLUMNS ( json_table_column [, ...] )
> +  [ { ERROR | EMPTY } ON ERROR ]
> +)
> top level (not in the COLUMNS clause) also allows
> NULL ON ERROR.
>
We can also specify DEFAULT expression ON ERROR.
like:
SELECT * FROM JSON_TABLE(jsonb'"1.23"', 'strict $.a' COLUMNS (js2 int
PATH '$') default '1' on error);

+   
+ name type [ FORMAT JSON [ ENCODING UTF8 ] ] [ PATH json_path_specification ]
+
+ Inserts a composite SQL/JSON item into the output row.
+
+
+ The provided PATH expression is evaluated and
+ the column is filled with the produced SQL/JSON item.  If the
+ PATH expression is omitted, path expression
+ $.name is used,
+ where name is the provided column name.
+ In this case, the column name must correspond to one of the
+ keys within the SQL/JSON item produced by the row pattern.
+
+
+ Optionally, you can specify WRAPPER,
+ QUOTES clauses to format the output and
+ ON EMPTY and ON ERROR to handle
+ those scenarios appropriately.
+

Similarly, I am not sure about the description of "composite SQL/JSON item",
judging by the following 3 examples:
SELECT * FROM JSON_TABLE(jsonb'{"a": "z"}', '$.a' COLUMNS (js2 text
format json PATH '$' omit quotes));
SELECT * FROM JSON_TABLE(jsonb'{"a": "z"}', '$.a' COLUMNS (js2 text
format json PATH '$'));
SELECT * FROM JSON_TABLE(jsonb'{"a": "z"}', '$.a' COLUMNS (js2 text PATH '$'));

I think the FORMAT JSON specification means that, if the specified type is
text- or varchar-related and no quotes behavior is specified, then the
output produced with FORMAT JSON can be cast to the json data type.
So FORMAT JSON seems unrelated to the array and record data types.

also the last para can be:
+
+ Optionally, you can specify WRAPPER,
+ QUOTES clauses to format the output and
+ ON EMPTY and ON ERROR to handle
+ those missing values and structural errors, respectively.
+


+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only string constants are supported in JSON_TABLE"
+   " path specification"),
should be:

+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only string constants are supported in JSON_TABLE path
specification"),


+   
+ AS json_path_name
+
+ The optional json_path_name serves as an
+ identifier of the provided json_path_specification.
+ The path name must be unique and distinct from the column names.
+ When using the PLAN clause, you must specify the names
+ for all the paths, including the row pattern. Each path name can appear in
+ the PLAN clause only once.
+
As of v46, we don't have a PLAN clause.
Also, "must be unique and distinct from the column names" seems incorrect;
for example:
SELECT * FROM JSON_TABLE(jsonb'"1.23"', '$.a' as js2 COLUMNS (js2 int
PATH '$'));




Re: Add new error_action COPY ON_ERROR "log"

2024-03-29 Thread torikoshia

On 2024-03-28 21:36, torikoshia wrote:

On 2024-03-28 17:27, Bharath Rupireddy wrote:
On Thu, Mar 28, 2024 at 1:43 PM torikoshia 
 wrote:


Attached patch fixes the doc,


May I know which patch you are referring to? And, what do you mean by
"fixes the doc"?


Ugh, I replied to the wrong thread.
Sorry for making you confused and please ignore it.


but I'm wondering perhaps it might be
better to modify the codes to prohibit abbreviation of the value.


Please help me understand the meaning here.

When seeing a query which abbreviates the ON_ERROR value, I feel it's not
obvious what happens, compared to other options which tolerate
abbreviation of the value, such as FREEZE or HEADER.

   COPY t1 FROM stdin WITH (ON_ERROR);

What do you think?


So, do you mean to prohibit ON_ERROR being specified without any value
like in COPY t1 FROM stdin WITH (ON_ERROR);? If yes, I think all the
other options do allow that [1].

Even if we were to do something like this, shall we discuss this 
separately?


Having said that, what do you think of the v13 patch posted upthread?


It looks good to me other than what Sawada-san lastly pointed out.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 29, 2024 at 09:39:31AM +0530, Amit Kapila wrote:
> On Wed, Mar 27, 2024 at 9:00 PM Bharath Rupireddy
>  wrote:
> >
> >
> > Thanks. I'm attaching v29 patches. 0001 managing inactive_since on the
> > standby for sync slots.
> >
> 
> Commit message states: "why we can't just update inactive_since for
> synced slots on the standby with the value received from remote slot
> on the primary. This is consistent with any other slot parameter i.e.
> all of them are synced from the primary."
> 
> The inactive_since is not consistent with other slot parameters which
> we copy. We don't perform anything related to those other parameters
> like say two_phase phase which can change that property. However, we
> do acquire the slot, advance the slot (as per recent discussion [1]),
> and release it. Since these operations can impact inactive_since, it
> seems to me that inactive_since is not the same as other parameters.
> It can have a different value than the primary. Why would anyone want
> to know the value of inactive_since from primary after the standby is
> promoted?

I think it can be useful "before" it is promoted and in case the primary is 
down.
I agree that tracking the activity time of a synced slot can be useful, why
not creating a dedicated field for that purpose (and keep inactive_since a
perfect "copy" of the primary)?

> Now, the other concern is that calling GetCurrentTimestamp()
> could be costly when the values for the slot are not going to be
> updated but if that happens we can optimize such that before acquiring
> the slot we can have some minimal pre-checks to ensure whether we need
> to update the slot or not.

Right, but for a very active slot it is likely that we call 
GetCurrentTimestamp()
during almost each sync cycle.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Statistics Import and Export

2024-03-29 Thread Jeff Davis
On Tue, 2024-03-26 at 00:16 +0100, Tomas Vondra wrote:
> I did take a closer look at v13 today. I have a bunch of comments and
> some minor whitespace fixes in the attached review patches.

I also attached a patch implementing a different approach to the
pg_dump support. Instead of trying to create a query that uses SQL
"format()" to create more SQL, I did all the formatting in C. It turned
out to be about 30% fewer lines, and I find it more understandable and
consistent with the way other stuff in pg_dump happens.

The attached patch is pretty rough -- not many comments, and perhaps
some things should be moved around. I only tested very basic
dump/reload in SQL format.

Regards,
Jeff Davis

From 7ca575e5a02bf380af92b6144622468a501f7636 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Sat, 16 Mar 2024 17:21:10 -0400
Subject: [PATCH vjeff] Enable dumping of table/index stats in pg_dump.

For each table/matview/index dumped, it will also generate a statement
that calls all of the pg_set_relation_stats() and
pg_set_attribute_stats() calls necessary to restore the statistics of
the current system onto the destination system.

As is the pattern with pg_dump options, this can be disabled with
--no-statistics.
---
 src/bin/pg_dump/pg_backup.h  |   2 +
 src/bin/pg_dump/pg_backup_archiver.c |   5 +
 src/bin/pg_dump/pg_dump.c| 229 ++-
 src/bin/pg_dump/pg_dump.h|   1 +
 src/bin/pg_dump/pg_dumpall.c |   5 +
 src/bin/pg_dump/pg_restore.c |   3 +
 6 files changed, 243 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 9ef2f2017e..1db5cf52eb 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -112,6 +112,7 @@ typedef struct _restoreOptions
 	int			no_publications;	/* Skip publication entries */
 	int			no_security_labels; /* Skip security label entries */
 	int			no_subscriptions;	/* Skip subscription entries */
+	int			no_statistics;		/* Skip statistics import */
 	int			strict_names;
 
 	const char *filename;
@@ -179,6 +180,7 @@ typedef struct _dumpOptions
 	int			no_security_labels;
 	int			no_publications;
 	int			no_subscriptions;
+	int			no_statistics;
 	int			no_toast_compression;
 	int			no_unlogged_table_data;
 	int			serializable_deferrable;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..d5f61399d9 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -2833,6 +2833,10 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	if (ropt->no_subscriptions && strcmp(te->desc, "SUBSCRIPTION") == 0)
 		return 0;
 
+	/* If it's a stats dump, maybe ignore it */
+	if (ropt->no_statistics && strcmp(te->desc, "STATISTICS") == 0)
+		return 0;
+
 	/* Ignore it if section is not to be dumped/restored */
 	switch (curSection)
 	{
@@ -2862,6 +2866,7 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	 */
 	if (strcmp(te->desc, "ACL") == 0 ||
 		strcmp(te->desc, "COMMENT") == 0 ||
+		strcmp(te->desc, "STATISTICS") == 0 ||
 		strcmp(te->desc, "SECURITY LABEL") == 0)
 	{
 		/* Database properties react to createDB, not selectivity options. */
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index b1c4c3ec7f..d483122998 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -428,6 +428,7 @@ main(int argc, char **argv)
 		{"no-comments", no_argument, _comments, 1},
 		{"no-publications", no_argument, _publications, 1},
 		{"no-security-labels", no_argument, _security_labels, 1},
+		{"no-statistics", no_argument, _statistics, 1},
 		{"no-subscriptions", no_argument, _subscriptions, 1},
 		{"no-toast-compression", no_argument, _toast_compression, 1},
 		{"no-unlogged-table-data", no_argument, _unlogged_table_data, 1},
@@ -1144,6 +1145,7 @@ help(const char *progname)
 	printf(_("  --no-commentsdo not dump comments\n"));
 	printf(_("  --no-publicationsdo not dump publications\n"));
 	printf(_("  --no-security-labels do not dump security label assignments\n"));
+	printf(_("  --no-statistics  do not dump statistics\n"));
 	printf(_("  --no-subscriptions   do not dump subscriptions\n"));
 	printf(_("  --no-table-access-method do not dump table access methods\n"));
 	printf(_("  --no-tablespaces do not dump tablespace assignments\n"));
@@ -7001,6 +7003,7 @@ getTables(Archive *fout, int *numTables)
 
 		/* Tables have data */
 		tblinfo[i].dobj.components |= DUMP_COMPONENT_DATA;
+		tblinfo[i].dobj.components |= DUMP_COMPONENT_STATISTICS;
 
 		/* Mark whether table has an ACL */
 		if (!PQgetisnull(res, i, i_relacl))
@@ -7498,6 +7501,7 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
 			indxinfo[j].dobj.catId.tableoid = atooid(PQgetvalue(res, j, i_tableoid));
 			indxinfo[j].dobj.catId.oid = 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-29 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 6:15 PM John Naylor  wrote:
>
> On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada  
> wrote:
> >
> > Pushed the refactoring patch.
> >
> > I've attached the rebased vacuum improvement patch for cfbot. I
> > mentioned in the commit message that this patch eliminates the 1GB
> > limitation.
> >
> > I think the patch is in good shape. Do you have other comments or
> > suggestions, John?
>
> I'll do another pass tomorrow, but first I wanted to get in another
> slightly-challenging in-situ test.

Thanks!

>
> About the "tuples missed" -- I didn't expect contention during this
> test. I believe that's completely unrelated behavior, but wanted to
> mention it anyway, since I found it confusing.

I haven't investigated it enough, but bgwriter might be related to the contention.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Psql meta-command conninfo+

2024-03-29 Thread Imseih (AWS), Sami
Thank you for the updated patch.

First and foremost, thank you very much for the review.

> The initial and central idea was always to keep the metacommand
> "\conninfo" in its original state, that is, to preserve the string as it is.
> The idea of "\conninfo+" is to expand this to include more information.
> If I change the "\conninfo" command and transform it into a table,
> I would have to remove all the translation part (files) that has already
> been implemented in the past. I believe that keeping the string and
> its translations is a good idea and there is no great reason to change that.

You are right. Not much to be gained to change this.



For the current patch, I have a few more comments.

1/ The output should be reorganized to show the fields that are part of
"conninfo" first.

I suggest it should look like this:

Current Connection Information
-[ RECORD 1 ]+-
Database                     | postgres
Authenticated User           | postgres
Socket Directory             | /tmp
Host                         |
Server Port                  | 5432
Server Address               |
Client Address               |
Client Port                  |
Backend PID                  | 30846
System User                  |
Current User                 | postgres
Session User                 | postgres
Application Name             | psql
SSL Connection               | f
SSL Protocol                 |
SSL Cipher                   |
SSL Compression              |
GSSAPI Authenticated         | f
GSSAPI Principal             |
GSSAPI Encrypted             | f
GSSAPI Credentials Delegated | f

With regards to the documentation, I think it's a good idea that every field
has a description. However, I have some comments:

1/ For the description of the conninfo command, what about simplifying it
like below?

"Outputs a string displaying information about the current database connection.
When + is appended, more details about the connection are displayed in table
format:"

2/ I don't think the descriptions need to start with "Displays". It is very
repetitive.

3/ For the "Socket Directory" description, this could be NULL if the host was
not specified. What about the below?

"The socket directory of the connection. NULL if the host or hostaddr are
specified for the connection"

4/ For most of the fields, they are just the output of a function, such as
"pg_catalog.system_user()". What about having the docs simply link to the
documentation of the function? This way we are not copying descriptions and
do not have to be concerned if the description of the function changes in
the future.

5/ "true" and "false" do not need double quotes. This is not the convention
used in other places in the docs.

Regards,

Sami





Re: Synchronizing slots from primary to standby

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 6:36 AM Zhijie Hou (Fujitsu)
 wrote:
>
>
> Attach a new version patch which fixed an un-initialized variable issue and
> added some comments.
>

The other approach to fix this issue could be that the slotsync worker
get the serialized snapshot using pg_read_binary_file() corresponding
to restart_lsn and writes those at standby. But there are cases when
we won't have such a file like (a) when we initially create the slot
and reach the consistent_point, or (b) also by the time the slotsync
worker starts to read the remote snapshot file, the snapshot file
could have been removed by the checkpointer on the primary (if the
restart_lsn of the remote has been advanced in this window). So, in
such cases, we anyway need to advance the slot. I think these could be
optimizations that we could do in the future.

Few comments:
=
1.
- if (slot->data.database != MyDatabaseId)
+ if (slot->data.database != MyDatabaseId && !fast_forward)
  ereport(ERROR,
  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  errmsg("replication slot \"%s\" was not created in this database",
@@ -526,7 +527,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
  * Do not allow consumption of a "synchronized" slot until the standby
  * gets promoted.
  */
- if (RecoveryInProgress() && slot->data.synced)
+ if (RecoveryInProgress() && slot->data.synced && !IsSyncingReplicationSlots())


Add comments at both of the above places.


2.
+extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto,
+   bool *found_consistent_point);
+

This API looks a bit awkward as the functionality doesn't match the
name. How about having a function with name
LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto,
ready_for_decoding) with the same functionality as your patch has for
pg_logical_replication_slot_advance() and then invoke it both from
pg_logical_replication_slot_advance and slotsync.c. The function name
is too big, we can think of a shorter name. Any ideas?


-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-03-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 29, 2024 at 01:06:15AM +, Zhijie Hou (Fujitsu) wrote:
> Attach a new version patch which fixed an un-initialized variable issue and
> added some comments. Also, temporarily enable DEBUG2 for the 040 tap-test
> so that we can analyze the possible CFbot failures easily.
> 

Thanks!

+   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
+   {
+   /*
+* By advancing the restart_lsn, confirmed_lsn, and xmin using
+* fast-forward logical decoding, we ensure that the required snapshots
+* are saved to disk. This enables logical decoding to quickly reach a
+* consistent point at the restart_lsn, eliminating the risk of missing
+* data during snapshot creation.
+*/
+   pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
+   found_consistent_point);
+   ReplicationSlotsComputeRequiredLSN();
+   updated_lsn = true;
+   }

Instead of using pg_logical_replication_slot_advance() for each synced slot
and during sync cycles, what about:

- keep sync slot synchronization as it is currently (not using
pg_logical_replication_slot_advance())
- create a "hidden" logical slot if the sync slot feature is on
- at the time of promotion, use pg_logical_replication_slot_advance() on this
hidden slot only to advance to the max lsn of the synced slots

I'm not sure that would be enough, just asking your thoughts on this (the
benefit would be to avoid calling pg_logical_replication_slot_advance() on
each synced slot and during the sync cycles).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
>
> On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  wrote:
> >
> > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > I've attached new version patches.
> >
> > Since the previous patch conflicts with the current HEAD, I've
> > attached the rebased patches.
>
> Thanks for the updated patch.
> One comment:
> I felt we can mention the improvement where we update memory
> accounting info at transaction level instead of per change level which
> is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> ReorderBufferSerializeTXN also in the commit message:

Agreed.

I think the patch is in good shape. I'll push the patch with the
suggestion next week, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-29 Thread Richard Guo
On Fri, Mar 29, 2024 at 1:33 AM Tom Lane  wrote:

> Tomas Vondra  writes:
> > Yeah. I think it's good to design the data/queries in such a way that
> > the behavior does not flip due to minor noise like in this case.
>
> +1


Agreed.  The query in problem is:

-- we can pull up the sublink into the inner JoinExpr.
explain (costs off)
SELECT * FROM tenk1 A INNER JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);

For this query, the RHS of the semijoin can be unique-ified, allowing it
to be joined to anything else.  Hence, both
join orders 'A/C/B' (as in the answer file) and 'B/C/A' (as in the
reported plan diff) are legal.

So I'm wondering if we can make this test case more stable by using
'c.odd > b.odd' instead of 'c.odd = b.odd' in the subquery, as attached.
As a result, the RHS of the semijoin cannot be unique-ified any more, so
that the only legal join order is 'A/B/C'.  We would not have different
join orders due to noises in the estimates, while still testing what we
intend to test.

Thanks
Richard


v1-0001-Stabilize-a-test-case-in-subselect.patch
Description: Binary data


Re: Can't find not null constraint, but \d+ shows that

2024-03-29 Thread jian he
hi.
about v4, i think, i understand the changes you made.
RemoveConstraintById(Oid conId)
will drop a single constraint record.
if the constraint is primary key, then primary key associated
attnotnull should set to false.
but sometimes it shouldn't.


for example:
drop table if exists t2;
CREATE TABLE t2(c0 int, c1 int);
ALTER TABLE  t2 ADD CONSTRAINT t2_pk PRIMARY KEY(c0, c1);
ALTER TABLE t2 ALTER COLUMN c0 ADD GENERATED ALWAYS AS IDENTITY;
ALTER TABLE  t2 DROP c1;

+ * If this was a NOT NULL or the primary key, the constrained columns must
+ * have had pg_attribute.attnotnull set.  See if we need to reset it, and
+ * do so.
+ */
+ if (unconstrained_cols != NIL)

unconstrained_cols is not NIL, which means we have dropped a primary key.
I am confused by "If this was a NOT NULL or the primary key".

+
+ /*
+ * Since the above deletion has been made visible, we can now
+ * search for any remaining constraints on this column (or these
+ * columns, in the case we're dropping a multicol primary key.)
+ * Then, verify whether any further NOT NULL exists, and reset
+ * attnotnull if none.
+ */
+ contup = findNotNullConstraintAttnum(RelationGetRelid(rel), attnum);
+ if (HeapTupleIsValid(contup))
+ {
+ heap_freetuple(contup);
+ heap_freetuple(atttup);
+ continue;
+ }

I am a little bit confused by the above comment.
I think the comments should say: if contup is valid, that means we already
have one "not null" constraint associated with the attnum; in that case, we
must not reset attnotnull, otherwise the consistency between attnotnull and
the "not null" table constraint will be broken.

other than that, the change in RemoveConstraintById looks sane.




Re: Synchronizing slots from primary to standby

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 9:34 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Thanks for updating the patch! Here is a comment for it.
>
> ```
> +/*
> + * By advancing the restart_lsn, confirmed_lsn, and xmin using
> + * fast-forward logical decoding, we can verify whether a consistent
> + * snapshot can be built. This process also involves saving necessary
> + * snapshots to disk during decoding, ensuring that logical decoding
> + * efficiently reaches a consistent point at the restart_lsn without
> + * the potential loss of data during snapshot creation.
> + */
> +pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
> +found_consistent_point);
> +ReplicationSlotsComputeRequiredLSN();
> +updated_lsn = true;
> ```
>
> You added them like pg_replication_slot_advance(), but the function also calls
> ReplicationSlotsComputeRequiredXmin(false) at that time. According to the
> related commit b48df81 and discussions [1], I know it is needed only for
> physical slots, but it makes more consistent to call requiredXmin() as well,
> per [2]:
>

Yeah, I also think it is okay to call for the sake of consistency with
pg_replication_slot_advance().

-- 
With Regards,
Amit Kapila.




Remove excessive trailing semicolons

2024-03-29 Thread Richard Guo
Noticed some newly introduced excessive trailing semicolons:

$ git grep -E ";;$" -- *.c *.h
src/include/lib/radixtree.h:            int deletepos = slot - n4->children;;
src/test/modules/test_tidstore/test_tidstore.c: BlockNumber prevblkno = 0;;

Here is a trivial patch to remove them.

Thanks
Richard


v1-0001-Remove-excessive-trailing-semicolons.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 11:49 AM Bertrand Drouvot
 wrote:
>
> On Fri, Mar 29, 2024 at 09:39:31AM +0530, Amit Kapila wrote:
> >
> > Commit message states: "why we can't just update inactive_since for
> > synced slots on the standby with the value received from remote slot
> > on the primary. This is consistent with any other slot parameter i.e.
> > all of them are synced from the primary."
> >
> > The inactive_since is not consistent with other slot parameters which
> > we copy. We don't perform anything related to those other parameters
> > like say two_phase phase which can change that property. However, we
> > do acquire the slot, advance the slot (as per recent discussion [1]),
> > and release it. Since these operations can impact inactive_since, it
> > seems to me that inactive_since is not the same as other parameters.
> > It can have a different value than the primary. Why would anyone want
> > to know the value of inactive_since from primary after the standby is
> > promoted?
>
> I think it can be useful "before" it is promoted and in case the primary is 
> down.
>

It is not clear to me what a user is going to do by checking the
inactivity time of slots when the corresponding server is down. I
thought the idea was to check such slots and see if they need to be
dropped or enabled again to avoid excessive disk usage, etc.
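That cleanup use case can be sketched in Python; the seven-day retention
threshold and the slot names here are illustrative assumptions, not anything
proposed in this thread:

```python
from datetime import datetime, timedelta, timezone

def slots_to_drop(slots, max_inactive=timedelta(days=7), now=None):
    # Flag slots whose inactive_since is older than a retention threshold,
    # i.e. candidates for dropping to avoid excessive disk usage.
    # slots is a list of (slot_name, inactive_since-or-None) pairs.
    now = now or datetime.now(timezone.utc)
    return [name for name, inactive_since in slots
            if inactive_since is not None
            and now - inactive_since > max_inactive]

now = datetime(2024, 3, 29, tzinfo=timezone.utc)
slots = [("old_slot", now - timedelta(days=30)),
         ("fresh_slot", now - timedelta(hours=1)),
         ("active_slot", None)]
print(slots_to_drop(slots, now=now))  # ['old_slot']
```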

> I agree that tracking the activity time of a synced slot can be useful, why
> not creating a dedicated field for that purpose (and keep inactive_since a
> perfect "copy" of the primary)?
>

We can have a separate field for this but not sure if it is worth it.

> > Now, the other concern is that calling GetCurrentTimestamp()
> > could be costly when the values for the slot are not going to be
> > updated but if that happens we can optimize such that before acquiring
> > the slot we can have some minimal pre-checks to ensure whether we need
> > to update the slot or not.
>
> Right, but for a very active slot it is likely that we call 
> GetCurrentTimestamp()
> during almost each sync cycle.
>

True, but if we have to save a slot to disk each time to persist the
changes (for an active slot) then probably GetCurrentTimestamp()
shouldn't be costly enough to matter.
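A minimal sketch of such a pre-check, using hypothetical dict-based slot
fields (the real code would operate on ReplicationSlot structs):

```python
def sync_slot(local, remote, now):
    # Pre-check: if the remote values didn't move, skip the update entirely,
    # avoiding both the slot acquisition and the GetCurrentTimestamp() call.
    if (local["confirmed_lsn"] == remote["confirmed_lsn"]
            and local["restart_lsn"] == remote["restart_lsn"]):
        return local, False
    updated = {**local, **remote}
    updated["inactive_since"] = now  # timestamp taken only on a real update
    return updated, True
```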

-- 
With Regards,
Amit Kapila.




Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-29 Thread Alexander Lakhin

28.03.2024 20:33, Tom Lane wrote:



But I'm a bit confused - how come the estimates do change at all? The
analyze simply fetches 30k rows, and tenk only has 10k of them. So we
should have *exact* numbers, and it should be exactly the same for all
the analyze runs. So how come it changes like this?

It's plausible that the VACUUM ANALYZE done by test_setup fails
ConditionalLockBufferForCleanup() sometimes because of concurrent
activity like checkpointer writes.  I'm not quite sure how we
get from that to the observed symptom though.  Maybe the
VACUUM needs DISABLE_PAGE_SKIPPING?


Yeah, the way from ConditionalLockBufferForCleanup() returning false to
reltuples < 1 is not one-step, as I thought initially. There is also
sanity_check doing VACUUM in between. So, effectively the troublesome
scenario is:
VACUUM ANALYZE tenk2; -- with cleanup lock not granted for some blocks
VACUUM tenk2;

In this scenario, lazy_scan_heap() -> vac_estimate_reltuples() is called
twice.
First, with rel_pages: 384, vacrel->scanned_pages: 384,
vacrel->live_tuples: 1 and it results in
vacrel->new_live_tuples = 1,

And second, with rel_pages: 345, vacrel->scanned_pages: 80,
vacrel->live_tuples: 2315 (for instance), and we get
vacrel->new_live_tuples = 9996,

With unmodified ConditionalLockBufferForCleanup() the second call is
performed with rel_pages: 345, vacrel->scanned_pages: 1,
vacrel->live_tuples: 24 and it returns 1.
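The extrapolation behind those numbers can be sketched in Python — a
simplified model of vac_estimate_reltuples(), where the prior reltuples
value of 10000 is an assumption chosen to reproduce the 9996 estimate above:

```python
import math

def vac_estimate_reltuples(total_pages, scanned_pages, scanned_tuples,
                           old_rel_pages, old_rel_tuples):
    # Simplified model: tuples actually seen on the scanned pages, plus the
    # old tuple density extrapolated over the pages that were skipped.
    if scanned_pages >= total_pages:
        return scanned_tuples
    old_density = old_rel_tuples / old_rel_pages
    unscanned_pages = total_pages - scanned_pages
    return math.floor(old_density * unscanned_pages + scanned_tuples + 0.5)

# Second VACUUM from the scenario above: 80 of 345 pages scanned, 2315 live
# tuples seen, previous estimate assumed to be 10000 tuples over 345 pages.
print(vac_estimate_reltuples(345, 80, 2315, 345, 10000))  # 9996
```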

This simple change fixes the issue for me:
-VACUUM ANALYZE tenk2;
+VACUUM (ANALYZE, DISABLE_PAGE_SKIPPING) tenk2;

But it looks like subselect is not the only test that can fail due to
vacuum instability. I see that create_index also suffers from a cranky
ConditionalLockBufferForCleanup() (+if (rand() % 10 == 0)
return false; ), although it is placed in parallel_schedule before
sanity_check, so this failure needs another explanation:
-  QUERY PLAN

- Index Only Scan using tenk1_thous_tenthous on tenk1
-   Index Cond: (thousand < 2)
-   Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+  QUERY PLAN
+--
+ Sort
+   Sort Key: thousand
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY 
('{1001,3000}'::integer[])))
+(4 rows)

Best regards,
Alexander




Re: Synchronizing slots from primary to standby

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 1:08 PM Bertrand Drouvot
 wrote:
>
> On Fri, Mar 29, 2024 at 07:23:11AM +, Zhijie Hou (Fujitsu) wrote:
> > On Friday, March 29, 2024 2:48 PM Bertrand Drouvot 
> >  wrote:
> > >
> > > Hi,
> > >
> > > On Fri, Mar 29, 2024 at 01:06:15AM +, Zhijie Hou (Fujitsu) wrote:
> > > > Attach a new version patch which fixed an un-initialized variable
> > > > issue and added some comments. Also, temporarily enable DEBUG2 for the
> > > > 040 tap-test so that we can analyze the possible CFbot failures easily.
> > > >
> > >
> > > Thanks!
> > >
> > > +   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
> > > +   {
> > > +   /*
> > > +* By advancing the restart_lsn, confirmed_lsn, and xmin 
> > > using
> > > +* fast-forward logical decoding, we ensure that the 
> > > required
> > > snapshots
> > > +* are saved to disk. This enables logical decoding to 
> > > quickly
> > > reach a
> > > +* consistent point at the restart_lsn, eliminating the 
> > > risk of
> > > missing
> > > +* data during snapshot creation.
> > > +*/
> > > +
> > > pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
> > > +
> > > found_consistent_point);
> > > +   ReplicationSlotsComputeRequiredLSN();
> > > +   updated_lsn = true;
> > > +   }
> > >
> > > Instead of using pg_logical_replication_slot_advance() for each synced 
> > > slot and
> > > during sync cycles what about?:
> > >
> > > - keep sync slot synchronization as it is currently (not using
> > > pg_logical_replication_slot_advance())
> > > - create "an hidden" logical slot if sync slot feature is on
> > > - at the time of promotion use pg_logical_replication_slot_advance() on 
> > > this
> > > hidden slot only to advance to the max lsn of the synced slots
> > >
> > > I'm not sure that would be enough, just asking your thoughts on this 
> > > (benefits
> > > would be to avoid calling pg_logical_replication_slot_advance() on each 
> > > sync
> > > slots and during the sync cycles).
> >
> > Thanks for the idea !
> >
> > I considered about this. I think advancing the "hidden" slot on promotion 
> > may be a
> > bit late, because if we cannot reach the consistent point after advancing 
> > the
> > "hidden" slot, then it means we may need to remove all the synced slots as 
> > we
> > are not sure if they are usable(will not loss data) after promotion.
>
> What about advancing the hidden slot during the sync cycles then?
>
> > The current approach is to mark such un-consistent slot as temp and persist
> > them once it reaches consistent point, so that user can ensure the slot can 
> > be
> > used after promotion once persisted.
>
> Right, but do we need to do so for all the sync slots? Would a single hidden
> slot be enough?
>

Even if we mark one of the synced slots as persistent without reaching
a consistent state, it could create a problem after promotion. And,
how a single hidden slot would serve the purpose, different synced
slots will have different restart/confirmed_flush LSN and we won't be
able to perform advancing for those using a single slot. For example,
say for first synced slot, it has not reached a consistent state and
then how can it try for the second slot? This sounds quite tricky to
make work. We should go with something simple where the chances of
introducing bugs are lesser.

-- 
With Regards,
Amit Kapila.




Logical replication failure modes

2024-03-29 Thread Philip Warner

I am trying to discover the causes of occasional data loss in logical
replication; it is VERY rare and happens every few weeks or months. 


Our setup is a source DB running in docker on an AWS cloud server. The
source database is stored on local disks on the cloud server. 


The replication target is a K8 POD running in an AWS instance with an
attached persistent AWS disk. The disk mounting is managed by K8.
Periodically this POD is deleted and restarted in an orderly way, and
the persistent disk stores the database. 


What we are seeing is *very* occasional records not being replicated in
the more active tables. 


Sometimes we have a backlog of several GB of data due to missing fields
in the target or network outages etc. 


I am also seeing signs that some triggers are not being applied (in the
same time frame): i.e. data *is* inserted, but triggers that summarize
that data are not summarizing some rows, and the dates on those
non-summarized rows correspond to dates on unrelated missing rows in
other tables.

This all leads me to conclude that there might be missing transactions?
Or non-applied transactions etc. But it is further complicated by the
fact that there is a second target database that *does* have all the
missing records. 


Any insights or avenues of exploration would be very welcome!

RE: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> Agreed.
> 
> I think the patch is in good shape. I'll push the patch with the
> suggestion next week, barring any objections.

Thanks for working on this. Agreed it is committable.
Few minor comments:

```
+ * Either txn or change must be non-NULL at least. We update the memory
+ * counter of txn if it's non-NULL, otherwise change->txn.
```

IIUC no one checks the restriction. Should we add an Assert() for it, e.g.:
Assert(txn || change)? 

```
+/* make sure enough space for a new node */
...
+/* make sure enough space for a new node */
```

Should be started with upper case?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Synchronizing slots from primary to standby

2024-03-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 29, 2024 at 02:35:22PM +0530, Amit Kapila wrote:
> On Fri, Mar 29, 2024 at 1:08 PM Bertrand Drouvot
>  wrote:
> >
> > On Fri, Mar 29, 2024 at 07:23:11AM +, Zhijie Hou (Fujitsu) wrote:
> > > On Friday, March 29, 2024 2:48 PM Bertrand Drouvot 
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Fri, Mar 29, 2024 at 01:06:15AM +, Zhijie Hou (Fujitsu) wrote:
> > > > > Attach a new version patch which fixed an un-initialized variable
> > > > > issue and added some comments. Also, temporarily enable DEBUG2 for the
> > > > > 040 tap-test so that we can analyze the possible CFbot failures 
> > > > > easily.
> > > > >
> > > >
> > > > Thanks!
> > > >
> > > > +   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
> > > > +   {
> > > > +   /*
> > > > +* By advancing the restart_lsn, confirmed_lsn, and 
> > > > xmin using
> > > > +* fast-forward logical decoding, we ensure that the 
> > > > required
> > > > snapshots
> > > > +* are saved to disk. This enables logical decoding to 
> > > > quickly
> > > > reach a
> > > > +* consistent point at the restart_lsn, eliminating the 
> > > > risk of
> > > > missing
> > > > +* data during snapshot creation.
> > > > +*/
> > > > +
> > > > pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
> > > > +
> > > > found_consistent_point);
> > > > +   ReplicationSlotsComputeRequiredLSN();
> > > > +   updated_lsn = true;
> > > > +   }
> > > >
> > > > Instead of using pg_logical_replication_slot_advance() for each synced 
> > > > slot and
> > > > during sync cycles what about?:
> > > >
> > > > - keep sync slot synchronization as it is currently (not using
> > > > pg_logical_replication_slot_advance())
> > > > - create "an hidden" logical slot if sync slot feature is on
> > > > - at the time of promotion use pg_logical_replication_slot_advance() on 
> > > > this
> > > > hidden slot only to advance to the max lsn of the synced slots
> > > >
> > > > I'm not sure that would be enough, just asking your thoughts on this 
> > > > (benefits
> > > > would be to avoid calling pg_logical_replication_slot_advance() on each 
> > > > sync
> > > > slots and during the sync cycles).
> > >
> > > Thanks for the idea !
> > >
> > > I considered about this. I think advancing the "hidden" slot on promotion 
> > > may be a
> > > bit late, because if we cannot reach the consistent point after advancing 
> > > the
> > > "hidden" slot, then it means we may need to remove all the synced slots 
> > > as we
> > > are not sure if they are usable(will not loss data) after promotion.
> >
> > What about advancing the hidden slot during the sync cycles then?
> >
> > > The current approach is to mark such un-consistent slot as temp and 
> > > persist
> > > them once it reaches consistent point, so that user can ensure the slot 
> > > can be
> > > used after promotion once persisted.
> >
> > Right, but do we need to do so for all the sync slots? Would a single hidden
> > slot be enough?
> >
> 
> Even if we mark one of the synced slots as persistent without reaching
> a consistent state, it could create a problem after promotion. And,
> how a single hidden slot would serve the purpose, different synced
> slots will have different restart/confirmed_flush LSN and we won't be
> able to perform advancing for those using a single slot. For example,
> say for first synced slot, it has not reached a consistent state and
> then how can it try for the second slot? This sounds quite tricky to
> make work. We should go with something simple where the chances of
> introducing bugs are lesser.

Yeah, better to go with something simple.

+   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
+   {
+   /*
+* By advancing the restart_lsn, confirmed_lsn, and xmin using
+* fast-forward logical decoding, we ensure that the required 
snapshots
+* are saved to disk. This enables logical decoding to quickly 
reach a
+* consistent point at the restart_lsn, eliminating the risk of 
missing
+* data during snapshot creation.
+*/
+   pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
+   
found_consistent_point);

In our case, what about skipping WaitForStandbyConfirmation() in
pg_logical_replication_slot_advance()? (It could go until the
RecoveryInProgress() check in StandbySlotsHaveCaughtup() if we don't skip it).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada  wrote:
>
> On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
> >
> > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  wrote:
> > >
> > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > >
> > > > I've attached new version patches.
> > >
> > > Since the previous patch conflicts with the current HEAD, I've
> > > attached the rebased patches.
> >
> > Thanks for the updated patch.
> > One comment:
> > I felt we can mention the improvement where we update memory
> > accounting info at transaction level instead of per change level which
> > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> > ReorderBufferSerializeTXN also in the commit message:
>
> Agreed.
>
> I think the patch is in good shape. I'll push the patch with the
> suggestion next week, barring any objections.
>

Few minor comments:
1.
@@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
  Assert(txn->nentries_mem == 0);
  }

+ ReorderBufferMaybeResetMaxHeap(rb);
+

Can we write a comment about why this reset is required here?
Otherwise, the reason is not apparent.

2.
Although using max-heap to select the largest
+ * transaction is effective when there are many transactions being decoded,
+ * there is generally no need to use it as long as all transactions being
+ * decoded are top-level transactions. Therefore, we use MaxConnections as the
+ * threshold so we can prevent switching to the state unless we use
+ * subtransactions.
+ */
+#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections

Isn't using max-heap equally effective in finding the largest
transaction whether there are top-level or top-level plus
subtransactions? This comment indicates it is only effective when
there are subtransactions.
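For what it's worth, the selection itself can be sketched in Python (heapq
is a min-heap, so sizes are negated to get max-heap behaviour; the
transaction names and sizes are made up):

```python
import heapq

def largest_txn(txn_sizes):
    # Pick the transaction using the most memory.  This mirrors the patch's
    # binaryheap keyed on per-transaction memory usage; the mechanics are
    # identical whether the entries are top-level transactions or
    # subtransactions.
    heap = [(-size, xid) for xid, size in txn_sizes.items()]
    heapq.heapify(heap)
    neg_size, xid = heapq.heappop(heap)
    return xid, -neg_size

print(largest_txn({"txn1": 512, "txn2": 8192, "txn3": 1024}))  # ('txn2', 8192)
```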

-- 
With Regards,
Amit Kapila.




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Tomas Vondra



On 3/29/24 02:12, Thomas Munro wrote:
> On Fri, Mar 29, 2024 at 10:43 AM Tomas Vondra
>  wrote:
>> I think there's some sort of bug, triggering this assert in heapam
>>
>>   Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
> 
> Thanks for the repro.  I can't seem to reproduce it (still trying) but
> I assume this is with Melanie's v11 patch set which had
> v11-0016-v10-Read-Stream-API.patch.
> 
> Would you mind removing that commit and instead applying the v13
> stream_read.c patches[1]?  v10 stream_read.c was a little confused
> about random I/O combining, which I fixed with a small adjustment to
> the conditions for the "if" statement right at the end of
> read_stream_look_ahead().  Sorry about that.  The fixed version, with
> eic=4, with your test query using WHERE a < a, ends its scan with:
> 

I'll give that a try. Unfortunately, the v11 still has the
problem I reported about a week ago:

  ERROR:  prefetch and main iterators are out of sync

So I can't run the full benchmarks :-( but master vs. streaming read API
should work, I think.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Logical replication failure modes

2024-03-29 Thread Philip Warner

I should have added that the source DB is 16.1 and the target is 16.0

Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences

2024-03-29 Thread Akshat Jaimini
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hii,

Thanks for the updated patch. I ran make installcheck-world after applying the 
patch and recompiling. It did fail for a particular test, but from the logs 
the failure seems to be unrelated to this patch, since it fails for the 
following:

==
select error_trap_test();
-  error_trap_test  

- division_by_zero detected
-(1 row)
-
+ERROR:  cannot start subtransactions during a parallel operation
+CONTEXT:  PL/pgSQL function error_trap_test() line 2 during statement block 
entry
+parallel worker
 reset debug_parallel_query;
 drop function error_trap_test();
 drop function zero_divide();
==

The code seems to implement the feature and has good, explanatory comments 
associated with it.
I believe we can go ahead with committing the patch, although I would request 
some senior contributors to also take a look at it since I am relatively new 
to patch reviews.
Changing the status to 'Ready for Committer'.

Regards,
Akshat Jaimini

The new status of this patch is: Ready for Committer


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Thomas Munro
I spent a bit of time today testing Melanie's v11, except with
read_stream.c v13, on Linux, ext4, and 3000 IOPS cloud storage.  I
think I now know roughly what's going on.  Here are some numbers,
using your random table from above and a simple SELECT * FROM t WHERE
a < 100 OR a = 123456.  I'll keep parallelism out of this for now.
These are milliseconds:

eic  unpatched  patched
0         4172     9572
1        30846    10376
2        18435     5562
4        18980     3503
8        18980     2680
16       18976     3233

So with eic=0, unpatched wins.  The reason is that Linux readahead
wakes up and scans the table at 150MB/s, because there are enough
clusters to trigger it.  But patched doesn't look quite so sequential
because we removed the sequential accesses by I/O combining...

At eic=1, unpatched completely collapses.  I'm not sure why exactly.

Once you go above eic=1, Linux seems to get out of the way and just do
what we asked it to do: iostat shows exactly 3000 IOPS, exactly 8KB
avg read size, and (therefore) throughput of 24MB/sec, though you can
see the queue depth being exactly what we asked it to do, e.g. 7.9 or
whatever for eic=8, while patched eats it for breakfast because it
issues wide requests, averaging around 27KB.
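The I/O combining effect can be sketched in Python — a simplified model of
merging adjacent block reads, not the actual read_stream.c logic:

```python
def combine_reads(blocks, max_combine=16):
    # Merge runs of adjacent block numbers into single wider reads, up to a
    # cap, so random 8KB requests become multi-block (~27KB average) ones.
    # Returns (start_block, length) pairs; assumes a non-empty sorted list.
    runs = []
    start = prev = blocks[0]
    for b in blocks[1:]:
        if b == prev + 1 and b - start + 1 <= max_combine:
            prev = b
        else:
            runs.append((start, prev - start + 1))
            start = prev = b
    runs.append((start, prev - start + 1))
    return runs

print(combine_reads([7, 8, 9, 42, 43, 90]))  # [(7, 3), (42, 2), (90, 1)]
```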

It seems more informative to look at the absolute numbers rather than
the A/B ratios, because then you can see how the numbers themselves
are already completely nuts, sort of interference patterns from
interaction with kernel heuristics.

On the other hand this might be a pretty unusual data distribution.
People who store random numbers or hashes or whatever probably don't
really search for ranges of them (unless they're trying to mine
bitcoins in SQL).  I dunno.  Maybe we need more realistic tests, or
maybe we're just discovering all the things that are bad about the
pre-existing code.




Re: Comments on Custom RMGRs

2024-03-29 Thread Danil Anisimow
On Fri, Mar 22, 2024 at 2:02 AM Jeff Davis  wrote:
> On Thu, 2024-03-21 at 19:47 +0700, Danil Anisimow wrote:
> > [pgss_001.v1.patch] adds a custom resource manager to the
> > pg_stat_statements extension.
>
> Did you consider moving the logic for loading the initial contents from
> disk from pgss_shmem_startup to .rmgr_startup?

I tried it, but .rmgr_startup is not called if the system was shut down
cleanly.

> My biggest concern is that it might not be quite right for a table AM
> that has complex state that needs action to be taken at a slightly
> different time, e.g. right after CheckPointBuffers().

> Then again, the rmgr is a low-level API, and any extension using it
> should be prepared to adapt to changes. If it works for pgss, then we
> know it works for at least one thing, and we can always improve it
> later. For instance, we might call the hook several times and pass it a
> "phase" argument.

In [rmgr_003.v3.patch] I added a phase argument to RmgrCheckpoint().
Currently it is only called in two places: before and after
CheckPointBuffers().

--
Regards,
Daniil Anisimov
Postgres Professional: http://postgrespro.com
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 3e2f1d4a23..5a1fbe8379 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -44,8 +44,8 @@
 
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
-	{ name, redo, desc, identify, startup, cleanup, mask, decode },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,checkpoint) \
+	{ name, redo, desc, identify, startup, cleanup, mask, decode, checkpoint },
 
 RmgrData	RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
@@ -83,6 +83,25 @@ RmgrCleanup(void)
 	}
 }
 
+/*
+ * Checkpoint all resource managers.
+ *
+ * See CreateCheckPoint for details about flags.
+ * phase shows a position in which RmgrCheckpoint is called in CheckPointGuts.
+ */
+void
+RmgrCheckpoint(int flags, RmgrCheckpointPhase phase)
+{
+	for (int rmid = 0; rmid <= RM_MAX_ID; rmid++)
+	{
+		if (!RmgrIdExists(rmid))
+			continue;
+
+		if (RmgrTable[rmid].rm_checkpoint != NULL)
+			RmgrTable[rmid].rm_checkpoint(flags, phase);
+	}
+}
+
 /*
  * Emit ERROR when we encounter a record with an RmgrId we don't
  * recognize.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 20a5f86209..d7ecab6769 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7357,8 +7357,13 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
 	CheckPointPredicate();
+
+	RmgrCheckpoint(flags, RMGR_CHECKPOINT_BEFORE_BUFFERS);
+
 	CheckPointBuffers(flags);
 
+	RmgrCheckpoint(flags, RMGR_CHECKPOINT_AFTER_BUFFERS);
+
 	/* Perform all queued up fsyncs */
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 22f7351fdc..11ae1e7af4 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -28,7 +28,7 @@
  * RmgrNames is an array of the built-in resource manager names, to make error
  * messages a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,checkpoint) \
   name,
 
 static const char *const RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 6b8c17bb4c..2bb5ba8c9f 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,checkpoint) \
 	{ name, desc, identify},
 
 static const RmgrDescData RmgrDescTable[RM_N_BUILTIN_IDS] = {
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index 3b6a497e1b..34ddc0210c 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,checkpoint) \
 	symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 78e6b908c6..0b03cc69be 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -24,26 +24,26 @@
  * Changes to this list possibly need an XLOG_PAGE_MAGIC bump.
  */
 
-/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */

Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-29 Thread Alexander Lakhin

29.03.2024 11:59, Alexander Lakhin wrote:


But it looks like subselect is not the only test that can fail due to
vacuum instability. I see that create_index also suffers from cranky
ConditionalLockBufferForCleanup() (+if (rand() % 10 == 0)
return false; ), although it placed in parallel_schedule before
sanity_check, so this failure needs another explanation:
-  QUERY PLAN

- Index Only Scan using tenk1_thous_tenthous on tenk1
-   Index Cond: (thousand < 2)
-   Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+  QUERY PLAN
+--
+ Sort
+   Sort Key: thousand
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY 
('{1001,3000}'::integer[])))
+(4 rows)


I think that deviation can be explained by the fact that cost_index() takes
baserel->allvisfrac (derived from pg_class.relallvisible) into account for
the index-only-scan case, and I see the following difference when a test
run fails:
   relname | relpages | reltuples | relallvisible | indisvalid | autovacuum_count | autoanalyze_count
-----------+----------+-----------+---------------+------------+------------------+-------------------
- tenk1    |      345 |         1 |           345 |            |                0 |                 0
+ tenk1    |      345 |         1 |           305 |            |                0 |                 0

Best regards,
Alexander




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-29 Thread Alexander Korotkov
Hi, Euler!

On Fri, Mar 29, 2024 at 1:38 AM Euler Taveira  wrote:
> On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
>
> Fixed along with other issues spotted by Alexander Lakhin.
>
>
> [I didn't read the whole thread. I'm sorry if I missed something ...]
>
> You renamed the function in a previous version but let me suggest another one:
> pg_wal_replay_wait. It uses the same pattern as the other recovery control
> functions [1]. I think "for" doesn't add much for the function name and "lsn" 
> is
> used in functions that return an LSN (that's not the case here).
>
> postgres=# \df pg_wal_replay*
> List of functions
> -[ RECORD 1 ]---+-
> Schema  | pg_catalog
> Name| pg_wal_replay_pause
> Result data type| void
> Argument data types |
> Type| func
> -[ RECORD 2 ]---+-
> Schema  | pg_catalog
> Name| pg_wal_replay_resume
> Result data type| void
> Argument data types |
> Type| func

Makes sense to me.  I tried to make a new procedure name consistent
with functions acquiring various WAL positions.  But you're right,
it's better to be consistent with other functions controlling wal
replay.

> Regarding the arguments, I think the timeout should be bigint. There is at 
> least
> another function that implements a timeout that uses bigint.
>
> postgres=# \df pg_terminate_backend
> List of functions
> -[ RECORD 1 ]---+--
> Schema  | pg_catalog
> Name| pg_terminate_backend
> Result data type| boolean
> Argument data types | pid integer, timeout bigint DEFAULT 0
> Type| func
>
> I also suggests that the timeout unit should be milliseconds, hence, using
> bigint is perfectly fine for the timeout argument.
>
> +   
> +Throws an ERROR if the target lsn was not replayed
> +on standby within given timeout.  Parameter 
> timeout
> +is the time in seconds to wait for the 
> target_lsn
> +replay. When timeout value equals to zero no
> +timeout is applied.
> +   

This generally makes sense, but I'm not sure about this.  The
milliseconds timeout was used initially but received criticism in [1].
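For reference, the semantics under discussion (a timeout of zero meaning
wait forever, an ERROR on expiry) can be sketched in Python as a simple
poll loop:

```python
import time

def wait_for_target(reached, timeout_ms=0, poll_s=0.01):
    # Poll until reached() reports the target LSN has been replayed.
    # timeout_ms == 0 means wait forever; otherwise raise once the deadline
    # passes, mirroring the documented ERROR-on-timeout behaviour.
    deadline = None if timeout_ms == 0 else time.monotonic() + timeout_ms / 1000
    while not reached():
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("target LSN was not replayed within timeout")
        time.sleep(poll_s)
    return True
```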

Links.
1. 
https://www.postgresql.org/message-id/b45ff979-9d12-4828-a22a-e4cb327e115c%40eisentraut.org

--
Regards,
Alexander Korotkov


v14-0001-Implement-pg_wal_replay_wait-stored-procedure.patch
Description: Binary data


Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-29 Thread Richard Guo
On Thu, Mar 28, 2024 at 11:00 PM Alexander Lakhin 
wrote:

> When running multiple 027_stream_regress.pl test instances in parallel
> (and with aggressive autovacuum) on a rather slow machine, I encountered
> test failures due to the subselect test instability just as the following
> failures on buildfarm:
> 1)
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit=2024-03-27%2010%3A16%3A12
>
> ---
> /home/bf/bf-build/grassquit/HEAD/pgsql/src/test/regress/expected/subselect.out
> 2024-03-19 22:20:34.435867114 +
> +++
> /home/bf/bf-build/grassquit/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/subselect.out
>
> 2024-03-27 10:28:38.185776605 +
> @@ -2067,16 +2067,16 @@
>  QUERY PLAN
>   -
>Hash Join
> -   Hash Cond: (c.odd = b.odd)
> +   Hash Cond: (c.hundred = a.hundred)
>  ->  Hash Join
> - Hash Cond: (a.hundred = c.hundred)
> - ->  Seq Scan on tenk1 a
> + Hash Cond: (b.odd = c.odd)
> + ->  Seq Scan on tenk2 b
>->  Hash
>  ->  HashAggregate
>Group Key: c.odd, c.hundred
>->  Seq Scan on tenk2 c
>  ->  Hash
> - ->  Seq Scan on tenk2 b
> + ->  Seq Scan on tenk1 a
>   (11 rows)


FWIW, this issue is also being reproduced in Cirrus CI, as Matthias
reported in another thread [1] days ago.

[1]
https://www.postgresql.org/message-id/CAEze2WiiE-iTKxgWQzcjyiiiA4q-zsdkkAdCaD_E83xA2g2BLA%40mail.gmail.com

Thanks
Richard


Re: Can't find not null constraint, but \d+ shows that

2024-03-29 Thread Tender Wang
jian he  wrote on Fri, Mar 29, 2024 at 14:56:

> hi.
> about v4, i think, i understand the changes you made.
> RemoveConstraintById(Oid conId)
> will drop a single constraint record.
> if the constraint is primary key, then primary key associated
> attnotnull should set to false.
> but sometimes it shouldn't.
>

Yeah, indeed.

>
>
> for example:
> drop table if exists t2;
> CREATE TABLE t2(c0 int, c1 int);
> ALTER TABLE  t2 ADD CONSTRAINT t2_pk PRIMARY KEY(c0, c1);
> ALTER TABLE t2 ALTER COLUMN c0 ADD GENERATED ALWAYS AS IDENTITY;
> ALTER TABLE  t2 DROP c1;
>
> + * If this was a NOT NULL or the primary key, the constrained columns must
> + * have had pg_attribute.attnotnull set.  See if we need to reset it, and
> + * do so.
> + */
> + if (unconstrained_cols != NIL)
>
> unconstrained_cols is not NIL, which means we have dropped a primary key.
> I am confused by "If this was a NOT NULL or the primary key".
>

NOT NULL means the column definition has a not-null constraint. For
example:
create table t1(a int not null);
Here pg_attribute.attnotnull is set to true.
The primary key case is similar:
create table t2(a int primary key);
Here pg_attribute.attnotnull is also set to true.

I think the above cases explain the meaning of the comments in
dropconstraint_internal.
But here, in RemoveConstraintById(), we only care about the primary key
case, so NOT NULL had better be removed from the comment.


>
> +
> + /*
> + * Since the above deletion has been made visible, we can now
> + * search for any remaining constraints on this column (or these
> + * columns, in the case we're dropping a multicol primary key.)
> + * Then, verify whether any further NOT NULL exists, and reset
> + * attnotnull if none.
> + */
> + contup = findNotNullConstraintAttnum(RelationGetRelid(rel), attnum);
> + if (HeapTupleIsValid(contup))
> + {
> + heap_freetuple(contup);
> + heap_freetuple(atttup);
> + continue;
> + }
>
> I am a little bit confused by the above comment.
> I think the comments should say,
> if contup is valid, that means, we already have one  "not null"
> constraint associate with the attnum
> in that condition, we must not set attnotnull, otherwise the
> consistency between attnotnull and "not null"
> table constraint will be broken.
>
> other than that, the change in RemoveConstraintById looks sane.
>
The above comments want to say that after the pk constraint is dropped, if
there are still tuples in pg_constraint, that means the column definition has
a not-null constraint. So we can't set pg_attribute.attnotnull to false.

For example:
create table t1(a int not null);
alter table t1 add constraint t1_pk primary key(a);
alter table t1 drop constraint t1_pk;
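
After that sequence, a quick catalog check (a hypothetical session; the query
is mine, not from the patch) should still show attnotnull = true, since the
column-level not-null constraint survives the drop of the primary key:

```sql
-- Hypothetical verification: the not-null constraint from the column
-- definition keeps attnotnull set even after the pk is dropped.
SELECT attnotnull
FROM pg_attribute
WHERE attrelid = 't1'::regclass AND attname = 'a';
-- expected: t
```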


-- 
Tender Wang
OpenPie:  https://en.openpie.com/


Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-29 Thread Jelte Fennema-Nio
On Thu, 28 Mar 2024 at 19:03, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > It doesn't fail when it's too fast -- it's just that it doesn't cover
> > the case we want to cover.
>
> That's hardly better, because then you think you have test
> coverage but maybe you don't.

Honestly, that seems quite a lot better. Instead of having randomly
failing builds, you have a test that creates coverage 80+% of the
time. And that also seems a lot better than having no coverage at all
(which is what we had for the last 7 years since introduction of
cancellations to postgres_fdw). It would be good to expand the comment
in the test though saying that the test might not always cover the
intended code path, due to timing problems.

> Could we make this test bulletproof by using an injection point?
> If not, I remain of the opinion that we're better off without it.

Possibly, and if so, I agree that would be better than the currently
added test. But I honestly don't feel like spending the time on
creating such a test. And given 7 years have passed without someone
adding any test for this codepath at all, I don't expect anyone else
will either.

If you both feel we're better off without the test, feel free to
remove it. This was just some small missing test coverage that I
noticed while working on this patch, that I thought I'd quickly
address. I don't particularly care a lot about the specific test.




Re: Streaming I/O, vectored I/O (WIP)

2024-03-29 Thread Thomas Munro
On Fri, Mar 29, 2024 at 9:45 AM Heikki Linnakangas  wrote:
> master (213c959a29):8.0 s
> streaming-api v13:  9.5 s

Hmm, that's not great, and I think I know one factor that has
confounded my investigation and the conflicting reports I have
received from a couple of people: some are using meson, which defaults
to -O3, and others are using make, which gives you -O2 by default; but
at -O2, GCC doesn't inline that StartReadBuffer specialisation that is
used in the "fast path", and possibly more.  Some of that gap is
closed by using pg_attribute_always_inline.  Clang fails to inline at
any level.  So I should probably use the "always" macro there because
that is the intention.  Still processing the rest of your email...




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-29 Thread John Naylor
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada  wrote:
> I think the patch is in good shape. Do you have other comments or
> suggestions, John?

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1918,11 +1918,6 @@ include_dir 'conf.d'
 too high.  It may be useful to control for this by separately
 setting .

-   
-Note that for the collection of dead tuple identifiers,
-VACUUM is only able to utilize up to a maximum of
-1GB of memory.
-   
   
  

This is mentioned twice for two different GUCs -- need to remove the
other one, too. Other than that, I just have minor nits:

- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TID store, a storage for dead TIDs

I think I've helped edit this sentence before, but I still don't quite
like it. I'm thinking now "is storage for the dead tuple IDs".

- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.

I think "maximum" is redundant with "upper bounds".

I also feel the commit message needs more "meat" -- we need to clearly
narrate the features and benefits. I've attached how I would write it,
but feel free to use what you like to match your taste.

I've marked it Ready for Committer.
Use TID Store for storage of dead tuple IDs during lazy vacuum

Previously, we used a simple array for storing dead tuple IDs during
lazy vacuum, which had a number of problems:

* The array used a single allocation and so was limited to 1GB.
* The allocation was pessimistically sized according to table size.
* Lookup with binary search was slow because of poor CPU cache and
  branch prediction behavior.

This commit replaces that array with the TID store from commit XX.

Since the backing radix tree makes small allocations as needed,
the 1GB limit is now gone. Further, the total memory used is now
often smaller by an order of magnitude or more, depending on the
distribution of blocks and offsets. These two features should make
multiple rounds of heap scanning and index cleanup an extremely rare
event. TID lookup during index cleanup is also several times faster,
even more so when index order is correlated with heap tuple order.

Since there is no longer a predictable relationship between the number
of dead tuples vacuumed and the space taken up by their TIDs, the number
of tuples no longer provides any meaningful insights for users, nor
is the maximum number predictable. For that reason this commit also
changes to byte-based progress reporting, with the relevant columns
of pg_stat_progress_vacuum renamed accordingly to max_dead_tuple_bytes
and dead_tuple_bytes.

For parallel vacuum, both the TID store and supplemental information
specific to vacuum are shared among the parallel vacuum workers. As
with the previous array, we don't take any locks on TidStore during
parallel vacuum since writes are still only done by the leader process.

XXX: bump catalog version

Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
Discussion: 
https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com


RE: Synchronizing slots from primary to standby

2024-03-29 Thread Zhijie Hou (Fujitsu)
On Friday, March 29, 2024 2:48 PM Bertrand Drouvot 
 wrote:
> 
> Hi,
> 
> On Fri, Mar 29, 2024 at 01:06:15AM +, Zhijie Hou (Fujitsu) wrote:
> > Attach a new version patch which fixed an un-initialized variable
> > issue and added some comments. Also, temporarily enable DEBUG2 for the
> > 040 tap-test so that we can analyze the possible CFbot failures easily.
> >
> 
> Thanks!
> 
> +   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
> +   {
> +   /*
> +* By advancing the restart_lsn, confirmed_lsn, and xmin using
> +* fast-forward logical decoding, we ensure that the required
> snapshots
> +* are saved to disk. This enables logical decoding to quickly
> reach a
> +* consistent point at the restart_lsn, eliminating the risk 
> of
> missing
> +* data during snapshot creation.
> +*/
> +
> pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
> +
> found_consistent_point);
> +   ReplicationSlotsComputeRequiredLSN();
> +   updated_lsn = true;
> +   }
> 
> Instead of using pg_logical_replication_slot_advance() for each synced slot 
> and
> during sync cycles what about?:
> 
> - keep sync slot synchronization as it is currently (not using
> pg_logical_replication_slot_advance())
> - create "an hidden" logical slot if sync slot feature is on
> - at the time of promotion use pg_logical_replication_slot_advance() on this
> hidden slot only to advance to the max lsn of the synced slots
> 
> I'm not sure that would be enough, just asking your thoughts on this (benefits
> would be to avoid calling pg_logical_replication_slot_advance() on each sync
> slots and during the sync cycles).

Thanks for the idea !

I considered this. I think advancing the "hidden" slot on promotion may be a
bit late, because if we cannot reach the consistent point after advancing the
"hidden" slot, then it means we may need to remove all the synced slots, as we
are not sure if they are usable (will not lose data) after promotion. And it
may confuse users a bit, as they have seen these slots as sync-ready.

The current approach is to mark such an un-consistent slot as temporary and
persist it once it reaches the consistent point, so that users can be sure
the slot can be used after promotion once it is persisted.


Another optimization idea is to check for the snapshot file's existence
before calling the slot_advance(). If the file already exists, we skip the
decoding and directly update the restart_lsn. This way, we could also avoid
some duplicate decoding work.

Best Regards,
Hou zj




Re: Synchronizing slots from primary to standby

2024-03-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 29, 2024 at 07:23:11AM +, Zhijie Hou (Fujitsu) wrote:
> On Friday, March 29, 2024 2:48 PM Bertrand Drouvot 
>  wrote:
> > 
> > Hi,
> > 
> > On Fri, Mar 29, 2024 at 01:06:15AM +, Zhijie Hou (Fujitsu) wrote:
> > > Attach a new version patch which fixed an un-initialized variable
> > > issue and added some comments. Also, temporarily enable DEBUG2 for the
> > > 040 tap-test so that we can analyze the possible CFbot failures easily.
> > >
> > 
> > Thanks!
> > 
> > +   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
> > +   {
> > +   /*
> > +* By advancing the restart_lsn, confirmed_lsn, and xmin 
> > using
> > +* fast-forward logical decoding, we ensure that the 
> > required
> > snapshots
> > +* are saved to disk. This enables logical decoding to 
> > quickly
> > reach a
> > +* consistent point at the restart_lsn, eliminating the 
> > risk of
> > missing
> > +* data during snapshot creation.
> > +*/
> > +
> > pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
> > +
> > found_consistent_point);
> > +   ReplicationSlotsComputeRequiredLSN();
> > +   updated_lsn = true;
> > +   }
> > 
> > Instead of using pg_logical_replication_slot_advance() for each synced slot 
> > and
> > during sync cycles what about?:
> > 
> > - keep sync slot synchronization as it is currently (not using
> > pg_logical_replication_slot_advance())
> > - create "an hidden" logical slot if sync slot feature is on
> > - at the time of promotion use pg_logical_replication_slot_advance() on this
> > hidden slot only to advance to the max lsn of the synced slots
> > 
> > I'm not sure that would be enough, just asking your thoughts on this 
> > (benefits
> > would be to avoid calling pg_logical_replication_slot_advance() on each sync
> > slots and during the sync cycles).
> 
> Thanks for the idea !
> 
> I considered this. I think advancing the "hidden" slot on promotion may be a
> bit late, because if we cannot reach the consistent point after advancing the
> "hidden" slot, then it means we may need to remove all the synced slots, as we
> are not sure if they are usable (will not lose data) after promotion.

What about advancing the hidden slot during the sync cycles then?

> The current approach is to mark such an un-consistent slot as temporary and
> persist it once it reaches the consistent point, so that users can be sure
> the slot can be used after promotion once it is persisted.

Right, but do we need to do so for all the sync slots? Would a single hidden
slot be enough?

> Another optimization idea is to check for the snapshot file's existence
> before calling the slot_advance(). If the file already exists, we skip the
> decoding and directly update the restart_lsn. This way, we could also avoid
> some duplicate decoding work.

Yeah, I think it's a good idea (even better if we can do this check without
performing any I/O).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-29 Thread Bruce Momjian
On Fri, Mar 29, 2024 at 08:46:33AM -0400, Robert Haas wrote:
> On Thu, Mar 28, 2024 at 3:33 PM Bruce Momjian  wrote:
> > I am fine with moving ahead.  I thought my later emails explaining we
> > have to be careful communicated that.
> 
> OK. Thanks for clarifying. I've committed the patch with the two
> wording changes that you suggested in your subsequent email.

Great, I know this has been frustrating, and you are right that this
wouldn't have been any simpler next year.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Statistics Import and Export

2024-03-29 Thread Jeff Davis
On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> That is fairly close to what I came up with per our conversation
> (attached below), but I really like the att_stats_arginfo construct
> and I definitely want to adopt that and expand it to a third
> dimension that flags the fields that cannot be null. I will
> incorporate that into v15.

Sounds good. I think it cuts down on the boilerplate.

> 0002:
> - All relstats and attrstats calls are now their own statement
> instead of a compound statement
> - moved the archive TOC entry from post-data back to SECTION_NONE (as
> it was modeled on object COMMENTs), which seems to work better.
> - remove meta-query in favor of more conventional query building
> - removed all changes to fe_utils/

Can we get a consensus on whether the default should be with stats or
without? That seems like the most important thing remaining in the
pg_dump changes.

There's still a failure in the pg_upgrade TAP test. One seems to be
ordering, so perhaps we need to ORDER BY the attribute number. Others
seem to be missing relstats and I'm not sure why yet. I suggest doing
some manual pg_upgrade tests and comparing the before/after dumps to
see if you can reproduce a smaller version of the problem.

Regards,
Jeff Davis





Re: Possibility to disable `ALTER SYSTEM`

2024-03-29 Thread Robert Haas
On Thu, Mar 28, 2024 at 3:33 PM Bruce Momjian  wrote:
> I am fine with moving ahead.  I thought my later emails explaining we
> have to be careful communicated that.

OK. Thanks for clarifying. I've committed the patch with the two
wording changes that you suggested in your subsequent email.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-03-29 Thread Laurenz Albe
On Tue, 2024-01-30 at 15:29 +0100, Daniel Verite wrote:
> PFA a rebased version.

I had a look at patch 0001 (0002 will follow).

> - 
> -  Retrieving Query Results Row-by-Row
> + 
> +  Retrieving Query Results by chunks

That should be "in chunks".

> +
> +   
> +
> +  PQsetChunkedRowsMode
> +PQsetChunkedRowsMode
> + 
> +  
> +   Select the mode retrieving results in chunks for the 
> currently-executing query.

That is questionable English.  How about

  Select to receive the results for the currently-executing query in chunks.

> +   This function is similar to  linkend="libpq-PQsetSingleRowMode"/>,
> +   except that it can retrieve a user-specified number of rows
> +   per call to , instead of a single 
> row.

The "user-specified number" is "maxRows".  So a better wording would be:

  ... except that it can retrieve maxRows rows
  per call to  instead of a single row.

> -error.  But in single-row mode, those rows will have already been
> +error.  But in single-row or chunked modes, those rows will have already 
> been

I'd say it should be "in *the* single-row or chunk modes".

> --- a/src/interfaces/libpq/fe-exec.c
> +++ b/src/interfaces/libpq/fe-exec.c
> @@ -41,7 +41,8 @@ char *const pgresStatus[] = {
> "PGRES_COPY_BOTH",
> "PGRES_SINGLE_TUPLE",
> "PGRES_PIPELINE_SYNC",
> -   "PGRES_PIPELINE_ABORTED"
> +   "PGRES_PIPELINE_ABORTED",
> +   "PGRES_TUPLES_CHUNK"
>  };

I think that PGRES_SINGLE_TUPLE and PGRES_TUPLES_CHUNK should be next to each
other, but that's no big thing.
The same applies to the change in src/interfaces/libpq/libpq-fe.h

I understand that we need to keep the single-row mode for compatibility
reasons.  But I think that under the hood, "single-row mode" should be the
same as "chunk mode with chunk size one".
That should save some code repetition.
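
For illustration, here is roughly what a client loop would look like under
the proposed API (PQsetChunkedRowsMode and PGRES_TUPLES_CHUNK are names taken
from the patch under review, not released libpq, and may change before
commit):

```c
#include <stdio.h>
#include <libpq-fe.h>

/* Sketch only: assumes the chunked-rows API from the patch under review. */
static void
fetch_in_chunks(PGconn *conn)
{
	PGresult   *res;

	if (!PQsendQuery(conn, "SELECT generate_series(1, 100000)"))
		return;
	if (!PQsetChunkedRowsMode(conn, 1000))	/* at most 1000 rows per result */
		return;

	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) == PGRES_TUPLES_CHUNK)
		{
			for (int i = 0; i < PQntuples(res); i++)
				(void) PQgetvalue(res, i, 0);	/* process one row */
		}
		else if (PQresultStatus(res) != PGRES_TUPLES_OK)
			fprintf(stderr, "%s", PQerrorMessage(conn));
		PQclear(res);
	}
}
```

Under the suggestion above, PQsetSingleRowMode(conn) would then simply behave
like PQsetChunkedRowsMode(conn, 1).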

Yours,
Laurenz Albe




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-29 Thread Pavel Borisov
Hi, hackers!

On Fri, 29 Mar 2024 at 16:45, Alexander Korotkov 
wrote:

> Hi, Euler!
>
> On Fri, Mar 29, 2024 at 1:38 AM Euler Taveira  wrote:
> > On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
> >
> > Fixed along with other issues spotted by Alexander Lakhin.
> >
> >
> > [I didn't read the whole thread. I'm sorry if I missed something ...]
> >
> > You renamed the function in a previous version but let me suggest
> another one:
> > pg_wal_replay_wait. It uses the same pattern as the other recovery
> control
> > functions [1]. I think "for" doesn't add much for the function name and
> "lsn" is
> > used in functions that return an LSN (that's not the case here).
> >
> > postgres=# \df pg_wal_replay*
> > List of functions
> > -[ RECORD 1 ]---+-
> > Schema  | pg_catalog
> > Name| pg_wal_replay_pause
> > Result data type| void
> > Argument data types |
> > Type| func
> > -[ RECORD 2 ]---+-
> > Schema  | pg_catalog
> > Name| pg_wal_replay_resume
> > Result data type| void
> > Argument data types |
> > Type| func
>
> Makes sense to me.  I tried to make a new procedure name consistent
> with functions acquiring various WAL positions.  But you're right,
> it's better to be consistent with other functions controlling wal
> replay.
>
> > Regarding the arguments, I think the timeout should be bigint. There is
> at least
> > another function that implements a timeout that uses bigint.
> >
> > postgres=# \df pg_terminate_backend
> > List of functions
> > -[ RECORD 1 ]---+--
> > Schema  | pg_catalog
> > Name| pg_terminate_backend
> > Result data type| boolean
> > Argument data types | pid integer, timeout bigint DEFAULT 0
> > Type| func
> >
> > I also suggests that the timeout unit should be milliseconds, hence,
> using
> > bigint is perfectly fine for the timeout argument.
> >
> > +   
> > +Throws an ERROR if the target lsn was not
> replayed
> > +on standby within given timeout.  Parameter
> timeout
> > +is the time in seconds to wait for the
> target_lsn
> > +replay. When timeout value equals to
> zero no
> > +timeout is applied.
> > +   
>
> This generally makes sense, but I'm not sure about this.  The
> milliseconds timeout was used initially but received critics in [1].
>
I see in Postgres we already have different units for timeouts:

e.g. in GUCs:
wal_receiver_status_interval in seconds
tcp_keepalives_idle in seconds

commit_delay in microseconds

deadlock_timeout in milliseconds
max_standby_archive_delay in milliseconds
vacuum_cost_delay in milliseconds
autovacuum_vacuum_cost_delay in milliseconds
etc..

I haven't counted precisely, but I feel that milliseconds are the most
often used in both guc's and functions. So I'd propose using milliseconds
for the patch as it was proposed originally.

Regards,
Pavel Borisov
Supabase.


Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-29 Thread Tom Lane
Richard Guo  writes:
> On Fri, Mar 29, 2024 at 1:33 AM Tom Lane  wrote:
>> Tomas Vondra  writes:
>>> Yeah. I think it's good to design the data/queries in such a way that
>>> the behavior does not flip due to minor noise like in this case.

>> +1

> Agreed.  The query in problem is:
> -- we can pull up the sublink into the inner JoinExpr.
> explain (costs off)
> SELECT * FROM tenk1 A INNER JOIN tenk2 B
> ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);

> So I'm wondering if we can make this test case more stable by using
> 'c.odd > b.odd' instead of 'c.odd = b.odd' in the subquery, as attached.

I'm not sure that that is testing the same thing (since it's no longer
an equijoin), or that it would fix the issue.  The problem really is
that all three baserels have identical statistics so there's no
difference in cost between different join orders, and then it's mostly
a matter of unspecified implementation details which order we will
choose, and even the smallest change in one rel's statistics can
flip it.  The way we have fixed similar issues elsewhere is to add a
scan-level WHERE restriction that makes one of the baserels smaller,
breaking the symmetry.  So I'd try something like

 explain (costs off)
 SELECT * FROM tenk1 A INNER JOIN tenk2 B
-ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
+ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd)
+WHERE a.thousand < 750;

(I first tried reducing the size of B, but that caused the join
order to change; restricting A makes it keep the existing plan.)

Might or might not need to mess with the size of C, but since that
one needs uniquification it's different from the others already.

regards, tom lane



