Re: Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-13 Thread Soumyadeep Chakraborty
Hey Michael,

Really appreciate the review!

On Wed, Aug 11, 2021 at 12:40 AM Michael Paquier  wrote:

> Agreed that the current behavior is confusing.  As you are using the
> commit timestamp for the comparison, this is right.  One small-ish
> comment I have about the code is that we should mention
> recovery_min_apply_delay for HandleStartupProcInterrupts(), and not
> only the trigger file location.

Cool, updated.

> +# First, set the delay for the next commit to some obscenely high value.
> It has no need to be obscenely high, just high enough to give slow machines
> time to catch the wait event lookup.  So this could use better wording to
> explain this choice.

Sounds good. Done.

> Anyway, it seems to me that something is incorrect in this new test
> (manual tests pass here, the TAP test is off).  One thing we need to
> make sure for any new test added is that it correctly breaks if the
> fix proposed is *not* in place.  And as far as I can see, the test
> passes even if the recalculation of delayUntil is done within the loop
> that reloads the configuration.  The thing may be a bit tricky to make
> reliable as the parameter reloading may cause wait_event to not be
> RecoveryApplyDelay, so we should have at least a check based on a scan
> of pg_stat_replication.replay_lsn on the primary.  Perhaps you have
> an other idea?

Hmm, please see attached v4 which just shifts the test to the middle,
like v1. When I run the test without the code change, the test hangs
as expected in:

# Now the standby should catch up.
$node_primary->wait_for_catchup('standby', 'replay');

and passes with the code change, as expected. I can't explain why the
test doesn't freeze up in v3 in wait_for_catchup() at the end.

As for checking on the wait event, since we only signal the standby
after performing the check, that should be fine. Nothing else would be
sending a SIGHUP before the check. Is that assumption correct?

> When using wait_for_catchup(), I would recommend to list "replay" for
> this test, even if that's the default mode, to make clear what the
> intention is.

Done.

Regards,
Soumyadeep (VMware)
From 824076a977af0e40da1c7eb9e4aeac9a8e81a7ee Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Mon, 2 Aug 2021 20:50:37 -0700
Subject: [PATCH v4 1/1] Refresh delayUntil in recovery delay loop

This ensures that the wait interval in the loop is correctly
recalculated with the updated value of recovery_min_apply_delay.

Now, if one changes the GUC while we are waiting, those changes will
take effect. Practical applications include instantly cancelling a long
wait time by setting recovery_min_apply_delay to 0. This can be useful
in tests.
---
 src/backend/access/transam/xlog.c   | 13 +---
 src/test/recovery/t/005_replay_delay.pl | 41 ++---
 2 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d0ec6a834be..74ad7ff905b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6234,12 +6234,11 @@ recoveryApplyDelay(XLogReaderState *record)
 	if (!getRecordTimestamp(record, &xtime))
 		return false;
 
-	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
-
 	/*
 	 * Exit without arming the latch if it's already past time to apply this
 	 * record
 	 */
+	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
 	msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
 	if (msecs <= 0)
 		return false;
@@ -6248,14 +6247,20 @@ recoveryApplyDelay(XLogReaderState *record)
 	{
 		ResetLatch(&XLogCtl->recoveryWakeupLatch);
 
-		/* might change the trigger file's location */
+		/* might change the trigger file's location and recovery_min_apply_delay */
 		HandleStartupProcInterrupts();
 
 		if (CheckForStandbyTrigger())
 			break;
 
 		/*
-		 * Wait for difference between GetCurrentTimestamp() and delayUntil
+		 * Recalculate delayUntil as recovery_min_apply_delay could have changed
+		 * since last time.
+		 */
+		delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
+
+		/*
+		 * Wait for difference between GetCurrentTimestamp() and delayUntil.
 		 */
 		msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
 delayUntil);
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 0b56380e0a7..ad8093df41a 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 # Initialize primary node
 my $node_primary = PostgresNode->new('primary');
@@ -56,6 +56,39 @@ $node_standby->poll_query_until('postgres',
 ok(time() - $primary_insert_time >= $delay,
 	"standby applies WAL only after replication delay");
 
+# Now test to see if updates to recovery_min_apply_delay apply when a standby 

Re: [UNVERIFIED SENDER] Re: [bug] Logical Decoding of relation rewrite with toast does not reset toast_hash

2021-08-13 Thread David Zhang

Hi Drouvot,

> I don't see extra data in your output and it looks like your
> copy/paste is missing some content, no?
>
> On my side, that looks good and here is what I get with the patch applied:

I ran the test again, and now I got the same output as yours; it looks
good to me. (The issue I mentioned in the previous email was caused by my
console output.)


Thank you,

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca




Re: Autovacuum on partitioned table (autoanalyze)

2021-08-13 Thread Álvaro Herrera
Here is a proposal for 14.  This patch has four main changes:

* The mod counts are only propagated to the topmost parent, not to each 
ancestor.  This means that we'll only analyze the topmost partitioned table and 
not each intermediate partitioned table; seems a good compromise to avoid 
sampling all partitions multiple times per round.

* One pgstat message is sent containing many partition/parent pairs, not just 
one. This reduces the number of messages sent.  123 partitions fit in one 
message (messages are 1000 bytes).  This is done once per autovacuum worker 
run, so it shouldn't be too bad.

* There's a sleep between sending the message and re-reading stats.  It would 
be great to have a mechanism by which pgstat collector says "I've received and 
processed up to this point", but we don't have that; what we can do is sleep 
PGSTAT_STAT_INTERVAL and then reread the file, so we're certain that the file 
we read is at least as new as that time.  This is far longer than it takes to 
process the messages.  Note that if the messages do take longer than that to be 
processed by the collector, it's not a big loss anyway; those tables will be 
processed by the next autovacuum run.

* I changed vacuum_expand_rel to put the main-rel OID at the end. (This is a 
variation of Horiguchi-san's proposed patch; instead of making the complete list 
be in the opposite order, it's just that one OID that appears at the other 
end.) This has the same effect as his patch: any error reports thrown by 
vacuum/analyze mention the first partition rather than the main table.  This 
part is in 0002 and I'm not totally convinced it's a sane idea.

Minor changes:
* I reduced autovacuum from three passes over pg_class to two passes, per your 
observation that we can acquire toast association together with processing 
partitions, and then use that in the second pass to collect everything.

* I moved the catalog-accessing code to partition.c, so we don't need to have 
pgstat.c doing it.

Some doc changes are pending, and some more commentary in parts of the code, 
but I think this is much more sensible.  I do lament the lack of a syscache for 
pg_inherits.

From 3e904de5f15cfc69692ad2aea64c0034445d957e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 10 Aug 2021 13:05:59 -0400
Subject: [PATCH 1/2] Propagate counts up only to topmost ancestor

Ignore intermediate partitions, to avoid redundant sampling of
partitions.  If needed, those intermediate partitions can be analyzed
manually.
---
 src/backend/catalog/partition.c |  53 +++
 src/backend/commands/analyze.c  |   3 +-
 src/backend/postmaster/autovacuum.c | 222 ++--
 src/backend/postmaster/pgstat.c |  60 
 src/include/catalog/partition.h |   1 +
 src/include/pgstat.h|  21 ++-
 6 files changed, 211 insertions(+), 149 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 790f4ccb92..017d5ba5a2 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -26,6 +26,7 @@
 #include "nodes/makefuncs.h"
 #include "optimizer/optimizer.h"
 #include "partitioning/partbounds.h"
+#include "pgstat.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/fmgroids.h"
 #include "utils/partcache.h"
@@ -166,6 +167,58 @@ get_partition_ancestors_worker(Relation inhRel, Oid relid, List **ancestors)
 	get_partition_ancestors_worker(inhRel, parentOid, ancestors);
 }
 
+/*
+ * Inform pgstats collector about the topmost ancestor of
+ * each of the given partitions.
+ */
+void
+partition_analyze_report_ancestors(List *partitions)
+{
+	List	   *report_parts = NIL;
+	List	   *ancestors = NIL;
+	Relation	inhRel;
+	ListCell   *lc;
+
+	inhRel = table_open(InheritsRelationId, AccessShareLock);
+
+	/*
+	 * Search pg_inherits for the topmost ancestor of each given partition,
+	 * and if found, store both their OIDs in lists.
+	 *
+	 * By the end of this loop, partitions and ancestors are lists to be
+	 * read in parallel, where the i'th element of ancestors is the topmost
+	 * ancestor of the i'th element of partitions.
+	 */
+	foreach(lc, partitions)
+	{
+		Oid		partition_id = lfirst_oid(lc);
+		Oid		cur_relid;
+
+		cur_relid = partition_id;
+		for (;;)
+		{
+			bool	detach_pending;
+			Oid		parent_relid;
+
+			parent_relid = get_partition_parent_worker(inhRel, cur_relid,
+	   &detach_pending);
+			if ((!OidIsValid(parent_relid) || detach_pending) &&
+cur_relid != partition_id)
+			{
+report_parts = lappend_oid(report_parts, partition_id);
+ancestors = lappend_oid(ancestors, cur_relid);
+break;
+			}
+
+			cur_relid = parent_relid;
+		}
+	}
+
+	table_close(inhRel, AccessShareLock);
+
+	pgstat_report_anl_ancestors(report_parts, ancestors);
+}
+
 /*
  * index_get_partition
  *		Return the OID of index of the given partition that is a child
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 0099a04bbe..c930e3e3cd 100644
--- 

Re: Default to TIMESTAMP WITH TIME ZONE?

2021-08-13 Thread David G. Johnston
On Friday, August 13, 2021, Tom Lane  wrote:
>
> The one thing I could potentially see us doing is more strongly
> encouraging the use of the names "timestamp" and "timestamptz",
> up to and including changing what format_type() et al. put out.


 +1.  Having the canonical form be timestamptz would make pretending the
“with time zone” version doesn’t exist much easier.

David J.


Re: [PATCH] Native spinlock support on RISC-V

2021-08-13 Thread Tom Lane
Andres Freund  writes:
> On 2021-08-13 13:37:02 -0400, Tom Lane wrote:
>> so it seems like someday we might want to expend some effort on native
>> atomics.  I agree that that day need not be today, though.  This patch
>> seems sufficient until we get to the point of (at least) having some
>> RISC-V in the buildfarm.

> Gcc's intrinsics are pretty good these days (and if not, a *lot* of
> projects just outright break).
> What's more, it turns out that using intrinsics with compilers of the
> last ~5 years often generates *better* code than inline assembly
> (e.g. because the compiler can utilize condition codes more effectively
> and has more detailed information about clobbers). So for new platforms
> that'll only have support by new compilers it doesn't really make sense
> to add inline assembler imo.

I didn't say it had to be __asm blocks ;-).  I was thinking more of the
sort of stuff we have in e.g. arch-arm.h and arch-ia64.h, where we know
some things about what is efficient or less efficient on a particular
architecture.  gcc will do its best to provide implementations of
its builtins, but that doesn't mean that using a particular one of
them is always the most sane thing to do.

But anyway, that seems like minor optimization that maybe someday
somebody will be motivated to do.  We're on the same page about this
being enough for today.

I did not like confusing the RISC-V case with the ARM case: duplicating
the code block seems better, in case there's ever reason to add
arch-specific stuff like SPIN_DELAY.  So I split it off to its own
code block and pushed it.

regards, tom lane




Re: [PATCH] Native spinlock support on RISC-V

2021-08-13 Thread Andres Freund
Hi,

On 2021-08-13 13:37:02 -0400, Tom Lane wrote:
> "Andres Freund"  writes:
> > On Fri, Aug 13, 2021, at 19:25, Tom Lane wrote:
> >> I now have looked at the patch, and it seems good as far as it goes,
> >> but I wonder whether some effort ought to be expended in
> >> src/include/port/atomics/.
> 
> > That should automatically pick up the intrinsic. I think we should do the 
> > same on modern compilers for spinlocks, but that's a separate discussion I 
> > guess.
> 
> I was looking at the comment in atomics.h about
> 
>  * Provide a full fallback of the pg_*_barrier(), pg_atomic**_flag and
>  * pg_atomic_* APIs for platforms without sufficient spinlock and/or atomics
>  * support. In the case of spinlock backed atomics the emulation is expected
>  * to be efficient, although less so than native atomics support.
> 
> so it seems like someday we might want to expend some effort on native
> atomics.  I agree that that day need not be today, though.  This patch
> seems sufficient until we get to the point of (at least) having some
> RISC-V in the buildfarm.

For gcc compatible compilers the relevant comments would be

 * There exist generic, hardware independent, implementations for several
 * compilers which might be sufficient, although possibly not optimal, for a
 * new platform. If no such generic implementation is available spinlocks (or
 * even OS provided semaphores) will be used to implement the API.

and

/*
 * Compiler specific, but architecture independent implementations.
 *
 * Provide architecture independent implementations of the atomic
 * facilities. At the very least compiler barriers should be provided, but a
 * full implementation of
 * * pg_compiler_barrier(), pg_write_barrier(), pg_read_barrier()
 * * pg_atomic_compare_exchange_u32(), pg_atomic_fetch_add_u32()
 * using compiler intrinsics are a good idea.
 */

Gcc's intrinsics are pretty good these days (and if not, a *lot* of
projects just outright break).

What's more, it turns out that using intrinsics with compilers of the
last ~5 years often generates *better* code than inline assembly
(e.g. because the compiler can utilize condition codes more effectively
and has more detailed information about clobbers). So for new platforms
that'll only have support by new compilers it doesn't really make sense
to add inline assembler imo.

Greetings,

Andres Freund




Re: [PATCH] Native spinlock support on RISC-V

2021-08-13 Thread Tom Lane
"Andres Freund"  writes:
> On Fri, Aug 13, 2021, at 19:25, Tom Lane wrote:
>> I now have looked at the patch, and it seems good as far as it goes,
>> but I wonder whether some effort ought to be expended in
>> src/include/port/atomics/.

> That should automatically pick up the intrinsic. I think we should do the 
> same on modern compilers for spinlocks, but that's a separate discussion I 
> guess.

I was looking at the comment in atomics.h about

 * Provide a full fallback of the pg_*_barrier(), pg_atomic**_flag and
 * pg_atomic_* APIs for platforms without sufficient spinlock and/or atomics
 * support. In the case of spinlock backed atomics the emulation is expected
 * to be efficient, although less so than native atomics support.

so it seems like someday we might want to expend some effort on native
atomics.  I agree that that day need not be today, though.  This patch
seems sufficient until we get to the point of (at least) having some
RISC-V in the buildfarm.

regards, tom lane




Re: [PATCH] Native spinlock support on RISC-V

2021-08-13 Thread Andres Freund
Hi,

On Fri, Aug 13, 2021, at 19:25, Tom Lane wrote:
> I wrote:
> > Andres Freund  writes:
> >> Should we backpatch this? It's not like we're going to break existing
> >> risc-v systems by enabling spinlock support...
> 
> > Yeah, why not?  If you were building with --disable-spinlocks before,
> > this shouldn't change anything for you.
> > (I haven't actually looked at the patch, mind you, but in principle
> > it shouldn't break anything that worked before.)
> 
> I now have looked at the patch, and it seems good as far as it goes,
> but I wonder whether some effort ought to be expended in
> src/include/port/atomics/.

That should automatically pick up the intrinsic. I think we should do the same 
on modern compilers for spinlocks, but that's a separate discussion I guess.

Andres




Re: [PATCH] Native spinlock support on RISC-V

2021-08-13 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> Should we backpatch this? It's not like we're going to break existing
>> risc-v systems by enabling spinlock support...

> Yeah, why not?  If you were building with --disable-spinlocks before,
> this shouldn't change anything for you.
> (I haven't actually looked at the patch, mind you, but in principle
> it shouldn't break anything that worked before.)

I now have looked at the patch, and it seems good as far as it goes,
but I wonder whether some effort ought to be expended in
src/include/port/atomics/.

regards, tom lane




Re: Default to TIMESTAMP WITH TIME ZONE?

2021-08-13 Thread Tom Lane
"David G. Johnston"  writes:
> On Fri, Aug 13, 2021 at 9:28 AM Simon Riggs 
> wrote:
>> The only hope is to eventually change the default, so probably
>> the best thing is to apply pressure via the SQL Std process.

> Then there is no hope because this makes the situation worse.

Agreed; the points I made upthread are just as valid if the change
is made in the standard.  But I'd be astonished if the SQL committee
would consider such a change anyway.

The one thing I could potentially see us doing is more strongly
encouraging the use of the names "timestamp" and "timestamptz",
up to and including changing what format_type() et al. put out.
Yeah, "timestamptz" is not standard, but so what?  At least it's
not actually *contrary* to the standard, as the original proposal
here is.  (Also, while I hate to bring it up in this context,
our timestamptz data type is not really compatible with the spec
in the first place, so that a case could be made that this behavior
is more honest/spec-compatible than what we do today.)

If you wanted to be even more in people's faces about it, you
could print the type names as "timestamptz" and "timestamp
without time zone".

regards, tom lane




Re: Default to TIMESTAMP WITH TIME ZONE?

2021-08-13 Thread David G. Johnston
On Fri, Aug 13, 2021 at 9:28 AM Simon Riggs 
wrote:

>
> The only hope is to eventually change the default, so probably
> the best thing is to apply pressure via the SQL Std process.
>
>
Then there is no hope because this makes the situation worse.

If anything I'd suggest the SQL standard should probably just admit this
"default behavior of timestamp" is a bad idea and deprecate its existence.
IOW, the only two standard conforming syntaxes are the
explicit WITH/WITHOUT TIME ZONE ones.  Any database implementation that
implements "timestamp" as a type alias is doing so in an implementation
dependent way.  Code that wants to be SQL-standard conforming and portable
needs to use the explicit types.

David J.


Re: Multiple Postgres process are running in background

2021-08-13 Thread Ranier Vilela
On Fri, Aug 13, 2021 at 11:55 AM Ram Charan Kallem <
ramcharan.kal...@non.se.com> wrote:

> Hi,
>
>
>
> We are using Postgres 10 (single client) and observed that there are
> multiple PostgreSQL server processes running in the background.
>
> Why are these additional processes created, or is this expected behavior?
>
This is normal and expected behavior.

pgsql-hackers@ is not an appropriate place for such questions.

regards,
Ranier Vilela


Re: Default to TIMESTAMP WITH TIME ZONE?

2021-08-13 Thread Simon Riggs
On Fri, 13 Aug 2021 at 17:23, Andrew Dunstan  wrote:

> What do other DBMSs do?

I think MySQL defaults to WITH TIME ZONE, not sure, but I would bet a
few others follow the standard.

> This strikes me as primarily an education issue
> (I did a webinar on it not that long ago)

Yes, agreed.

> If you want to protect against people using tz-less timestamp, maybe an
> event trigger would be a solution, although maybe that's using a
> sledgehammer to crack a nut.

If you know about the issue, I guess you fix it in lots of different ways.

The issue is about those that don't know and my patch didn't help them
either. The only hope is to eventually change the default, so probably
the best thing is to apply pressure via the SQL Std process.

-- 
Simon Riggs        http://www.EnterpriseDB.com/




Re: Default to TIMESTAMP WITH TIME ZONE?

2021-08-13 Thread Andrew Dunstan


On 8/12/21 7:25 PM, Simon Riggs wrote:
> I thought we found that changing behavior via GUC usually ends badly.
>> Yeah.  Changing from SQL-spec to not-SQL-spec behavior is going to be
>> one tough sell to begin with, even without the point that that's been
>> our behavior for over two decades.  But proposing to do it via a GUC
>> is just not-even-worth-discussing territory.  That would force every
>> wannabe-portable program to cope with both behaviors; which would
>> end up meaning that not only do you still have to take care to write
>> WITH TIME ZONE when you want that, but *also* you'd have to be sure
>> to write WITHOUT TIME ZONE when you want that.  In short, the worst
>> of both worlds.
> All of which I agree with, but this wasn't a cute idea of mine, this
> was what our users have requested because of the extreme annoyance
> caused by the current behavior.
>

What do other DBMSs do? This strikes me as primarily an education issue
(I did a webinar on it not that long ago).


If you want to protect against people using tz-less timestamp, maybe an
event trigger would be a solution, although maybe that's using a
sledgehammer to crack a nut.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





[PATCH] Error out if SKIP LOCKED and WITH TIES are both specified

2021-08-13 Thread David Christensen
Both bugs #16676[1] and #17141[2] illustrate that the combination of SKIP
LOCKED and FETCH FIRST WITH TIES breaks expectations when it comes to rows
returned to other sessions accessing the same row.  Since this situation is
detectable from the syntax and hard to fix otherwise, forbid it for now,
with the potential to fix it in the future.

[1]
https://www.postgresql.org/message-id/16676-fd62c3c835880da6%40postgresql.org
[2]
https://www.postgresql.org/message-id/17141-913d78b9675aac8e%40postgresql.org

Proposed backpatch to 13.


0001-Error-out-if-SKIP-LOCKED-and-WITH-TIES-are-both-spec.patch
Description: Binary data


Re: [PATCH] Native spinlock support on RISC-V

2021-08-13 Thread Tom Lane
Andres Freund  writes:
> On 2021-08-13 11:09:04 -0400, Tom Lane wrote:
>> Marek Szuba  writes:
>>> Tested against PostgreSQL 13.3 on a physical rv64gc system (BeagleV
>>> Starlight beta board) - builds and installs fine, all tests pass.

> Should we backpatch this? It's not like we're going to break existing
> risc-v systems by enabling spinlock support...

Yeah, why not?  If you were building with --disable-spinlocks before,
this shouldn't change anything for you.

(I haven't actually looked at the patch, mind you, but in principle
it shouldn't break anything that worked before.)

regards, tom lane




Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-13 Thread Andres Freund
Hi,

(dropping -committers to avoid moderation stalls due to xposting to multiple
lists - I find that more annoying than helpful)

On 2021-08-13 14:38:37 +0530, Amit Kapila wrote:
> > What I'm wondering is why it is a good idea to have a SharedFileSet specific
> > cleanup mechanism. One that only operates on process lifetime level, rather
> > than something more granular. I get that the of the files here needs to be
> > longer than a transaction, but that can easily be addressed by having a 
> > longer
> > lived resource owner.
> >
> > Process lifetime may work well for the current worker.c, but even there it
> > doesn't seem optimal. One e.g. could easily imagine that we'd want to handle
> > connection errors or configuration changes without restarting the worker, in
> > which case process lifetime obviously isn't a good idea anymore.
> >
> 
> I don't deny that we can make this work at a more granular level. IIRC,
> at that time, we tried to follow AtProcExit_Files which cleans up temp
> files at proc exit and we needed something similar for temporary files
> used via SharedFileSet.

The comparison to AtProcExit_Files isn't convincing to me - normally temp
files are cleaned up long before that via resowner cleanup.

To me the reuse of SharedFileSet for worker.c as executed seems like a bad
design. As things stand there's little code shared between dsm/non-dsm shared
file sets. The cleanup semantics are entirely different. Several functions
don't work if used on the "wrong kind" of set (e.g. SharedFileSetAttach()).


> I think we can extend this API but I guess it is better to then do it
> for dsm-based as well so that these get tracked via resowner.

DSM segments are resowner managed already, so it's not obvious that that'd buy
us much? Although I guess there could be a few edge cases that'd look cleaner,
because we could reliably trigger cleanup in the leader instead of relying on
dsm detach hooks + refcounts to manage when a set is physically deleted?

Greetings,

Andres Freund




Re: Default to TIMESTAMP WITH TIME ZONE?

2021-08-13 Thread Bruce Momjian
On Fri, Aug 13, 2021 at 07:09:15AM -0700, David G. Johnston wrote:
> On Fri, Aug 13, 2021 at 12:33 AM Gavin Flower 
> wrote:
> 
> 
> I always use the tz version, except when I forget.
> 
> 
> I find it nearly impossible for me to forget how this works.  But that is
> probably because I just pretend that the standard, multi-word, data types 
> don't
> even exist.  It's not that "timestamp" defaults to "WITHOUT TIME ZONE" but
> rather that there are only two types in existence: timestamp and timestamptz. 
> It's self-evident which one doesn't handle time zones.
> 
> 
> 
> Suspect that it is extremely rare when one would not want to use the TZ
> option, which suggests to me that using TZ should be the default.
> 
> 
> 
> I would agree, but at this point I'd leave it up to documentation and
> educational materials to teach/encourage, not runtime behaviors.

Yes, only using 'timestamptz' does fix it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: [PATCH] Native spinlock support on RISC-V

2021-08-13 Thread Andres Freund
Hi,

On 2021-08-13 11:09:04 -0400, Tom Lane wrote:
> Marek Szuba  writes:
> > Tested against PostgreSQL 13.3 on a physical rv64gc system (BeagleV
> > Starlight beta board) - builds and installs fine, all tests pass.

Seems like a good idea to me.

> Cool ... I had hoped to acquire one of those myself, but I didn't
> make the cut.

Should we backpatch this? It's not like we're going to break existing
risc-v systems by enabling spinlock support...

Greetings,

Andres Freund




Re: [PATCH] Native spinlock support on RISC-V

2021-08-13 Thread Tom Lane
Marek Szuba  writes:
> Tested against PostgreSQL 13.3 on a physical rv64gc system (BeagleV 
> Starlight beta board) - builds and installs fine, all tests pass.

Cool ... I had hoped to acquire one of those myself, but I didn't
make the cut.

regards, tom lane




[PATCH] Native spinlock support on RISC-V

2021-08-13 Thread Marek Szuba

Hello,

The attached patch adds native spinlock support to PostgreSQL on RISC-V 
systems. As suspected by Richard W.M. Jones of Red Hat back in 2016, the 
__sync_lock_test_and_set() approach applied on arm and arm64 works here 
as well.


Tested against PostgreSQL 13.3 on a physical rv64gc system (BeagleV 
Starlight beta board) - builds and installs fine, all tests pass. From 
what I can see in gcc documentation this should in theory work on rv32 
(and possibly rv128) as well, therefore the patch as it stands covers 
all RISC-V systems (i.e. doesn't check the value of __riscv_xlen) - but I 
haven't confirmed this experimentally.


--
MS
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -315,12 +315,12 @@
 #endif	 /* __ia64__ || __ia64 */
 
 /*
- * On ARM and ARM64, we use __sync_lock_test_and_set(int *, int) if available.
+ * On ARM, ARM64 and RISC-V, we use __sync_lock_test_and_set(int *, int) if available.
  *
  * We use the int-width variant of the builtin because it works on more chips
  * than other widths.
  */
-#if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
+#if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64) || defined(__riscv)
 #ifdef HAVE_GCC__SYNC_INT32_TAS
 #define HAS_TEST_AND_SET
 
@@ -337,7 +337,7 @@
 #define S_UNLOCK(lock) __sync_lock_release(lock)
 
 #endif	 /* HAVE_GCC__SYNC_INT32_TAS */
-#endif	 /* __arm__ || __arm || __aarch64__ || __aarch64 */
+#endif	 /* __arm__ || __arm || __aarch64__ || __aarch64 || __riscv */
 
 
 /* S/390 and S/390x Linux (32- and 64-bit zSeries) */
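
For readers without s_lock.h in front of them, the guarded block above boils
down to something like the following sketch (a simplified rendition of the
existing ARM/ARM64 path that the patch extends to __riscv, not a verbatim
copy of the header):

typedef int slock_t;

static inline int
tas(volatile slock_t *lock)
{
	/* returns the previous value; non-zero means the lock was already held */
	return __sync_lock_test_and_set(lock, 1);
}

static inline void
s_unlock(volatile slock_t *lock)
{
	__sync_lock_release(lock);
}

/* a caller spins until the previous value was 0, i.e. the lock was free */
static inline void
s_lock(volatile slock_t *lock)
{
	while (tas(lock))
		;						/* the real implementation adds a spin delay/backoff here */
}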




RE: Multiple Postgres process are running in background

2021-08-13 Thread Ram Charan Kallem
Hi,

We are using Postgres 10 (single client) and observed that there are multiple
PostgreSQL server processes running in the background.
Why are these additional processes created, or is this expected behavior?


Regards,
RamCharan



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-13 Thread Amit Kapila
On Thu, Aug 12, 2021 at 6:24 PM Andres Freund  wrote:
>
> On 2021-08-12 05:48:19 -0700, Andres Freund wrote:
> > I think SharedFileSetInit() needs a comment explaining that it needs to be
> > called in a process-lifetime memory context if used without dsm
> > segments. Because otherwise SharedFileSetDeleteOnProcExit() will access
> > already freed memory (both for filesetlist and the SharedFileSet itself).
>
> Oh. And I think it's not ok that SharedFileSetDeleteAll() unconditionally does
> SharedFileSetUnregister(). SharedFileSetUnregister() asserts out if there's no
> match, but DSM based sets are never entered into filesetlist. So one cannot
> have a non-DSM and DSM set coexisting. Which seems surprising.
>

Oops, it should be allowed to have both non-DSM and DSM sets
coexisting. I think we can remove the Assert from
SharedFileSetUnregister(). The other way could be to pass a parameter
to SharedFileSetDeleteAll() to tell it whether to unregister or not.

-- 
With Regards,
Amit Kapila.




Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-13 Thread Amit Kapila
On Thu, Aug 12, 2021 at 6:18 PM Andres Freund  wrote:
>
> Hi,
>
> On 2021-08-12 15:06:23 +0530, Amit Kapila wrote:
> > On Thu, Aug 12, 2021 at 1:52 PM Andres Freund  wrote:
> > > I'm not so sure. Why does sharedfileset have its own proc exit hook in the
> > > first place? ISTM that this should be dealt with using resowners, rathers 
> > > than
> > > a sharedfileset specific mechanism?
> > >
>
> > The underlying temporary files need to be closed at xact end but need
> > to survive across transactions.
>
> Why do they need to be closed at xact end? To avoid wasting memory due to too
> many buffered files?
>

Yes.

>
> > These are registered with the resource owner via
> > PathNameOpenTemporaryFile/PathNameCreateTemporaryFile and then closed
> > at xact end. So, we need a way to remove the files used by the process
> > (apply worker in this particular case) before process exit and used
> > this proc_exit hook (possibly on the lines of AtProcExit_Files).
>
> What I'm wondering is why it is a good idea to have a SharedFileSet specific
> cleanup mechanism. One that only operates on process lifetime level, rather
> than something more granular. I get that the lifetime of the files here needs to be
> longer than a transaction, but that can easily be addressed by having a longer
> lived resource owner.
>
> Process lifetime may work well for the current worker.c, but even there it
> doesn't seem optimal. One e.g. could easily imagine that we'd want to handle
> connection errors or configuration changes without restarting the worker, in
> which case process lifetime obviously isn't a good idea anymore.
>

I don't deny that we can make this work at a more granular level. IIRC,
at that time, we tried to follow AtProcExit_Files which cleans up temp
files at proc exit and we needed something similar for temporary files
used via SharedFileSet. I think we can extend this API but I guess it
is better to then do it for dsm-based as well so that these get
tracked via resowner.

>
> I think SharedFileSetInit() needs a comment explaining that it needs to be
> called in a process-lifetime memory context if used without dsm
> segments.
>

We already have a comment about proc_exit clean up of files but will
extend that a bit about memory context.

-- 
With Regards,
Amit Kapila.




Re: pg_settings.pending_restart not set when line removed

2021-08-13 Thread Tom Lane
Alexander Kukushkin  writes:
> IMO this is totally wrong, because the actual value didn't change: it was an
> empty string in the config and now it remains an empty string due to the
> default value in the guc.c

I can't get very excited about that.  The existing message about
"parameter \"%s\" cannot be changed without restarting the server"
was emitted without regard to that fine point, and nobody has
complained about it.

regards, tom lane




Re: pg_settings.pending_restart not set when line removed

2021-08-13 Thread Alexander Kukushkin
Hi Alvaro,



On Tue, 27 Jul 2021 at 22:17, Alvaro Herrera 
wrote:

> I tested this -- it works correctly AFAICS.
>

Nope, IMO it doesn't work correctly.
Let's say we have recovery_target = '' in the config:
localhost/postgres=# select name, setting, setting is null, pending_restart
from pg_settings where name = 'recovery_target';
      name       │ setting │ ?column? │ pending_restart
─────────────────┼─────────┼──────────┼─────────────────
 recovery_target │         │ f        │ f
(1 row)


After that we remove it from the config and call pg_ctl reload. It sets the
pending_restart flag.
localhost/postgres=# select name, setting, setting is null, pending_restart
from pg_settings where name = 'recovery_target';
      name       │ setting │ ?column? │ pending_restart
─────────────────┼─────────┼──────────┼─────────────────
 recovery_target │         │ f        │ t
(1 row)

IMO this is totally wrong, because the actual value didn't change: it was an
empty string in the config and now it remains an empty string due to the
default value in guc.c.

Regards,
--
Alexander Kukushkin


Re: Default to TIMESTAMP WITH TIME ZONE?

2021-08-13 Thread David G. Johnston
On Fri, Aug 13, 2021 at 12:33 AM Gavin Flower 
wrote:

>
> I always use the tz version, except when I forget.


I find it nearly impossible for me to forget how this works.  But that is
probably because I just pretend that the standard, multi-word, data types
don't even exist.  It's not that "timestamp" defaults to "WITHOUT TIME
ZONE" but rather that there are only two types in existence: timestamp and
timestamptz.  It's self-evident which one doesn't handle time zones.


> Suspect that it is extremely rare when one would not want to use the TZ
> option, which suggests to me that using TZ should be the default.
>
>
I would agree, but at this point I'd leave it up to documentation and
educational materials to teach/encourage, not runtime behaviors.

David J.


Proposal: More structured logging

2021-08-13 Thread Ronan Dunklau
Hello,

The logging system already captures a lot of information in the ErrorData. But 
at present there is no way for log message authors to include more metadata 
about the log entry outside of the log message itself. For example, the 
extension name can be useful for filtering / dispatching log messages.
This can be done in the log message itself, but that requires specialized 
parsing of the log output. 

Even though among the available loggers in core, only the csv logger would be 
able to make sense of it, I feel it would be beneficial to add a tagging system 
to logs, by adding different (tag, value) combinations to a log entry, with an 
API similar to what we do for other ErrorData fields:

ereport(NOTICE,
 (errmsg("My log message")),
 (errtag("EMITTER", "MYEXTENSION")),
 (errtag("MSG-ID", "%d", error_message_id))
);

Please find attached a very small POC patch to better demonstrate what I 
propose.  

Third-party logging hooks could then exploit those values to output them 
correctly. For example, the json logger written by Michael Paquier here: 
https://github.com/michaelpq/pg_plugins/tree/master/jsonlog, or the 
seemingly abandoned journald hook here: https://github.com/intgr/pg_journal 
could add this information in a structured way.
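
As a rough illustration (not part of the attached patch), such a hook could
consume the proposed tags along these lines, assuming the tags list and
ErrorTag fields that the POC adds to ErrorData; my_emit_log_hook is a
made-up name:

static void
my_emit_log_hook(ErrorData *edata)
{
	ListCell   *lc;
	StringInfoData buf;

	initStringInfo(&buf);

	/* edata->tags and ErrorTag come from the attached POC patch */
	foreach(lc, edata->tags)
	{
		ErrorTag   *tag = (ErrorTag *) lfirst(lc);

		appendStringInfo(&buf, "%s=%s ", tag->tagname, tag->tagvalue);
	}

	/* hand edata->message plus the collected tags to the structured sink */
	pfree(buf.data);
}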

I think the pgaudit extension (or something similar) could also make good use 
of such a feature instead of dumping a CSV entry in the log file. 

As for Postgres own log messages, I'm sure we could find practical use cases 
for tagging messages like this.

On a related note, the only structured logger we have in-core is the CSV 
logger. Many users nowadays end up feeding the logs to journald either by 
capturing the stderr output with systemd, or by having syslog implemented by 
journald itself. Would there be any interest in having native journald support 
as a logging_destination?

Best regards,

-- 
Ronan Dunklau

From 270ffc5ed2fbe5f5076bddee14c5fb3555b87e4f Mon Sep 17 00:00:00 2001
From: Ronan Dunklau 
Date: Fri, 13 Aug 2021 15:03:18 +0200
Subject: [PATCH v1 1/2] Add ErrorTag support

---
 src/backend/utils/error/elog.c | 48 ++
 src/include/utils/elog.h   | 10 +++
 2 files changed, 58 insertions(+)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index a3e1c59a82..5b9b1b8a72 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -465,6 +465,7 @@ errstart(int elevel, const char *domain)
 		edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION;
 	/* errno is saved here so that error parameter eval can't change it */
 	edata->saved_errno = errno;
+	edata->tags = NIL;
 
 	/*
 	 * Any allocations for this error state level should go into ErrorContext
@@ -516,6 +517,7 @@ errfinish(const char *filename, int lineno, const char *funcname)
 	int			elevel;
 	MemoryContext oldcontext;
 	ErrorContextCallback *econtext;
+	ListCell   *lc;
 
 	recursion_depth++;
 	CHECK_STACK_DEPTH();
@@ -621,7 +623,18 @@ errfinish(const char *filename, int lineno, const char *funcname)
 		pfree(edata->constraint_name);
 	if (edata->internalquery)
 		pfree(edata->internalquery);
+	/* Every tag should have been palloc'ed */
+	if (edata->tags != NIL)
+	{
+		foreach(lc, edata->tags)
+		{
+			ErrorTag   *tag = (ErrorTag *) lfirst(lc);
 
+			pfree(tag->tagvalue);
+			pfree(tag);
+		}
+		pfree(edata->tags);
+	}
 	errordata_stack_depth--;
 
 	/* Exit error-handling context */
@@ -1192,6 +1205,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural,
 	return 0;	/* return value does not matter */
 }
 
+int
+errtag(const char *tag, const char *fmt_value,...)
+{
+	ErrorData  *edata = &errordata[errordata_stack_depth];
+	ErrorTag   *etag;
+	MemoryContext oldcontext;
+	StringInfoData buf;
+
+	recursion_depth++;
+	CHECK_STACK_DEPTH();
+	oldcontext = MemoryContextSwitchTo(edata->assoc_context);
+	etag = palloc(sizeof(ErrorTag));
+	etag->tagname = tag;
+	initStringInfo(&buf);
+	for (;;)
+	{
+		va_list		args;
+		int			needed;
+
+		errno = edata->saved_errno;
+		va_start(args, fmt_value);
+		needed = appendStringInfoVA(&buf, fmt_value, args);
+		va_end(args);
+		if (needed == 0)
+			break;
+		enlargeStringInfo(&buf, needed);
+	}
+	etag->tagvalue = pstrdup(buf.data);
+	edata->tags = lappend(edata->tags, etag);
+	pfree(buf.data);
+	MemoryContextSwitchTo(oldcontext);
+	recursion_depth--;
+	return 0;
+}
+
 
 /*
  * errcontext_msg --- add a context error message text to the current error
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index f53607e12e..1c490d1b11 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -15,6 +15,7 @@
 #define ELOG_H
 
 #include <setjmp.h>
+#include "nodes/pg_list.h"
 
 /* Error level codes */
 #define DEBUG5		10			/* Debugging messages, in categories of
@@ -193,6 +194,8 @@ extern int	errhint(const char *fmt,...) pg_attribute_printf(1, 2);
 extern int	errhint_plural(const char *fmt_singular, const char *fmt_plural,
 		   unsigned long n,...) 

Re: CI/windows docker vs "am a service" autodetection on windows

2021-08-13 Thread Andres Freund
Hi,

Magnus, Michael, Anyone - I'd appreciate a look.

On 2021-03-05 12:55:37 -0800, Andres Freund wrote:
> > After fighting with a windows VM for a bit (ugh), it turns out that yes,
> > there is stderr, but that fileno(stderr) returns -2, and
> > GetStdHandle(STD_ERROR_HANDLE) returns NULL (not INVALID_HANDLE_VALUE).
> > 
> > The complexity however is that while that's true for pg_ctl within
> > pgwin32_ServiceMain:
> > checking what stderr=7FF8687DFCB0 is (handle: 0, fileno=-2)
> > but not for postmaster or backends
> > WARNING: 01000: checking what stderr=7FF880F5FCB0 is (handle: 92, 
> > fileno=2)
> > 
> > which makes sense in a way, because we don't tell CreateProcessAsUser()
> > that it should pass stdin/out/err down (which then seems to magically
> > get access to the "topmost" console applications output - damn, this
> > stuff is weird).
> 
> That part is not too hard to address - it seems we only need to do that
> in pg_ctl pgwin32_doRunAsService(). It seems that the
> stdin/stderr/stdout being set to invalid will then be passed down to
> postmaster children.
> 
> https://docs.microsoft.com/en-us/windows/console/getstdhandle
> "If an application does not have associated standard handles, such as a
> service running on an interactive desktop, and has not redirected them,
> the return value is NULL."
> 
> There does seem to be some difference between what services get as std*
> - GetStdHandle() returns NULL, and what explicitly passing down invalid
> handles to postmaster does - GetStdHandle() returns
> INVALID_HANDLE_VALUE. But passing down NULL rather than
> INVALID_HANDLE_VALUE to postmaster seems to lead to postmaster
> re-opening console buffers.
> 
> Patch attached.

> I'd like to commit something to address this issue to master soon - it
> allows us to run a lot more tests in cirrus-ci... But probably not
> backpatch it [yet] - there've not really been field complaints, and who
> knows if there end up being some unintentional side-effects...

Because it'd allow us to run more tests as part of cfbot and other CI
efforts, I'd like to push this forward. So I'm planning to commit this
to master soon-ish, unless somebody wants to take this over? I'm really
not a windows person...

Greetings,

Andres Freund
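
For readers skimming the quoted findings, a minimal sketch of the handle-based
check being described (hypothetical helper name; only the GetStdHandle() /
fileno() observations are taken from the message above):

#include <windows.h>
#include <stdio.h>
#include <stdbool.h>

static bool
running_as_service_without_handles(void)
{
	/*
	 * Per the observations above: inside pgwin32_ServiceMain, fileno(stderr)
	 * is -2 and GetStdHandle(STD_ERROR_HANDLE) returns NULL (not
	 * INVALID_HANDLE_VALUE) when no standard handles were passed down.
	 */
	return (GetStdHandle(STD_ERROR_HANDLE) == NULL || fileno(stderr) < 0);
}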




Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS

2021-08-13 Thread Andres Freund
On 2021-08-05 19:56:49 -0700, Andres Freund wrote:
> Done in the attached patch. I don't think we need to add more to the docs than
> the flag being required?

Pushed that patch now. If we want further additions to the docs we can
do so separately.




Re: SI messages sent when excuting ROLLBACK PREPARED command

2021-08-13 Thread Michael Paquier
On Wed, Aug 11, 2021 at 03:14:11PM +0900, Michael Paquier wrote:
> I would just tweak the comment block at the top of what's being
> changed, as per the attached.  Please let me know if there are any
> objections. 

And applied as of 710796f.
--
Michael




Re: [PATCH] Hooks at XactCommand level

2021-08-13 Thread Gilles Darold
On 13/08/2021 at 11:58, Andres Freund wrote:
> Hi,
>
> On 2021-08-10 10:12:26 +0200, Gilles Darold wrote:
>> Sorry for the response delay. I have thought about adding this odd hook to be
>> able to implement this feature through an extension because I don't think
>> this is something that should be implemented in core. There were also
>> patch proposals which were all rejected.
>>
>> We usually implement the feature at client side which is imo enough for the
>> use cases. But the problem is that this is a catastrophe in terms of
>> performance. I have done a small benchmark to illustrate the problem. This
>> is a single-process client on the same host as the PG backend.
>>
>> For 10,000 tuples inserted with 50% of failures and rollback at statement
>> level handled at client side:
>>
>>     Expected: 5001, Count:  5001
>>     DML insert took: 13 wallclock secs ( 0.53 usr +  0.94 sys =  1.47
>> CPU)
> Something seems off here. This suggests every insert took 2.6ms. That
> seems awfully long, unless your network latency is substantial. I did a
> quick test implementing this in the naive-most way in pgbench, and I get
> better times - and there's *lots* of room for improvement.
>
> I used a pgbench script that sent the following:
> BEGIN;
> SAVEPOINT insert_fail;
> INSERT INTO testinsert(data) VALUES (1);
> ROLLBACK TO SAVEPOINT insert_fail;
> SAVEPOINT insert_success;
> INSERT INTO testinsert(data) VALUES (1);
> RELEASE SAVEPOINT insert_success;
> {repeat 5 times}
> COMMIT;
>
> I.e. 5 failing and 5 succeeding insertions wrapped in one transaction. I
> get >2500 tps, i.e. > 25k rows/sec. And it's not hard to optimize that
> further - the {ROLLBACK TO,RELEASE} SAVEPOINT; SAVEPOINT; INSERT can be
> sent in one roundtrip. That gets me to somewhere around 40k rows/sec.
>
>
> BEGIN;
>
> \startpipeline
> SAVEPOINT insert_fail;
> INSERT INTO testinsert(data) VALUES (1);
> \endpipeline
>
> \startpipeline
> ROLLBACK TO SAVEPOINT insert_fail;
> SAVEPOINT insert_success;
> INSERT INTO testinsert(data) VALUES (1);
> \endpipeline
>
> \startpipeline
> RELEASE SAVEPOINT insert_success;
> SAVEPOINT insert_fail;
> INSERT INTO testinsert(data) VALUES (1);
> \endpipeline
>
> \startpipeline
> ROLLBACK TO SAVEPOINT insert_fail;
> SAVEPOINT insert_success;
> INSERT INTO testinsert(data) VALUES (1);
> \endpipeline
>
> {repeat last two blocks three times}
> COMMIT;
>
> Greetings,
>
> Andres Freund


I have written a Perl script to mimic what I have found in an Oracle
batch script to import data into a table. I had this use case in a recent
migration; the only difference is that the batch was written in Java.


$dbh->do("BEGIN") or die "FATAL: " . $dbh->errstr . "\n";
my $start = new Benchmark;
my $sth = $dbh->prepare("INSERT INTO t1 VALUES (?, ?)");
exit 1 if (not defined $sth);
for (my $i = 0; $i <= 10000; $i++)
{
    $dbh->do("SAVEPOINT foo") or die "FATAL: " . $dbh->errstr . "\n";
    # Generate a duplicate key each two row inserted
    my $val = $i;
    $val = $i-1 if ($i % 2 != 0);
    unless ($sth->execute($val, 'insert '.$i)) {
    $dbh->do("ROLLBACK TO foo") or die "FATAL: " .
$dbh->errstr . "\n";
    } else {
    $dbh->do("RELEASE foo") or die "FATAL: " . $dbh->errstr
. "\n";
    }
}
$sth->finish();
my $end = new Benchmark;

$dbh->do("COMMIT;");

my $td = timediff($end, $start);
print "DML insert took: " . timestr($td) . "\n";


The timings reported are from my personal computer; there is no network
latency, it uses localhost. Anyway, the objective was not to bench the
DML throughput but the overhead of the rollback at statement level done
at client side versus server side. I guess that you might see the same
speed gain of around x3 to x5 or more depending on the number of tuples?


The full script can be found here
https://github.com/darold/pg_statement_rollbackv2/blob/main/test/batch_script_example.pl


Cheers,

-- 
Gilles Darold






Re: Shared memory size computation oversight?

2021-08-13 Thread Julien Rouhaud
On Fri, Aug 13, 2021 at 10:52:50AM +0200, Peter Eisentraut wrote:
> On 12.08.21 16:18, Julien Rouhaud wrote:
> > 
> > But changing RequestAddinShmemSpace() to apply CACHELINEALIGN() would
> > only really work for that specific usage only?  If an extension does
> > multiple allocations you can't rely on correct results.
> 
> I think you can do different things here to create inconsistent results, but
> I think there should be one common, standard, normal, straightforward way to
> get a correct result.

Unless I'm missing something, the standard and straightforward way to get a
correct result is to account for padding bytes in the C code, as it's currently
done in all contrib modules.  The issue is that those modules aren't using the
correct alignment anymore.
> 
> > > Btw., I think your patch was wrong to apply CACHELINEALIGN() to
> > > intermediate results rather than at the end.
> > 
> > I'm not sure why you think it's wrong.  It's ShmemInitStruct() that
> > will apply the padding, so if the extension calls it multiple times
> > (whether directly or indirectly), then the padding will be added
> > multiple times.  Which means that in theory the extension should
> > account for it multiple time in the amount of memory it's requesting.
> 
> Yeah, in that case it's probably rather the case that there is one
> CACHELINEALIGN() too few, since pg_stat_statements does two separate shmem
> allocations.

I still don't get it.  Aligning the total shmem size is totally different from
properly padding all single allocation sizes, and is almost never the right
answer.

Using a naive example, if your extension needs to ShmemInitStruct() 64B twice,
postgres will consume 2*128B.  But if you only rely on RequestAddinShmemSpace()
to add a CACHELINEALIGN(), then no padding at all will be added, and you'll
then be not one but two CACHELINEALIGN() too few.

But again, the real issue is not the CACHELINEALIGN() roundings, as those have
a more or less stable size and are already accounted for in the extra 100kB,
but the dynahash size estimation which seems to be increasingly off as the
number of entries grows.
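
To make the naive example above concrete, here is roughly what the sizing side
of such an extension has to do today (hypothetical names; the two 64B areas
stand in for the two ShmemInitStruct() calls):

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

#define AREA_A_SIZE 64
#define AREA_B_SIZE 64

static Size
my_shmem_size(void)
{
	Size		size = 0;

	/*
	 * ShmemInitStruct()/ShmemAllocRaw() round every single allocation up to
	 * CACHELINEALIGN, so each allocation has to be padded individually here;
	 * padding only the final total would under-request.
	 */
	size = add_size(size, CACHELINEALIGN(AREA_A_SIZE));
	size = add_size(size, CACHELINEALIGN(AREA_B_SIZE));

	return size;
}

void
_PG_init(void)
{
	if (!process_shared_preload_libraries_in_progress)
		return;

	RequestAddinShmemSpace(my_shmem_size());
}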




Re: Bug in huge simplehash

2021-08-13 Thread Andres Freund


On 2021-08-13 14:40:08 +0300, Yura Sokolov wrote:
> Ranier Vilela wrote on 2021-08-13 14:12:
> > Em sex., 13 de ago. de 2021 às 07:15, Andres Freund
> >  escreveu:
> > > Personally I find it more obvious to understand the intended
> > > behaviour
> > > with ~0 (i.e. all bits set) than with a width truncation.
> > 
> > https://godbolt.org/z/57WcjKqMj
> > The generated code is identical.
> 
> I believe, you mean https://godbolt.org/z/qWzE1ePTE

I don't think the choice of instructions matters. This is called around
creation and resizing - the other costs are several orders of magnitude
more expensive than determining the mask.
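
As a standalone illustration of the sizemask question discussed in this thread
(not code from the thread itself): at SH_MAX_SIZE, i.e. 2^32, the plain width
truncation and the explicit all-bits-set spelling produce the same mask,
whereas the removed special case set it to 0:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* SH_MAX_SIZE in simplehash.h is PG_UINT32_MAX + 1, i.e. 2^32 */
	uint64_t	size = UINT64_C(1) << 32;

	uint32_t	mask_truncated = (uint32_t) (size - 1); /* the patched computation */
	uint32_t	mask_allbits = ~(uint32_t) 0;			/* the suggested spelling */

	/* both print ffffffff; the old special case wrongly produced 0 */
	printf("%08x %08x\n", mask_truncated, mask_allbits);
	return 0;
}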




Re: Bug in huge simplehash

2021-08-13 Thread Yura Sokolov

Ranier Vilela wrote on 2021-08-13 14:12:

On Fri, Aug 13, 2021 at 07:15, Andres Freund
 wrote:


Hi,

On 2021-08-13 12:44:17 +0300, Yura Sokolov wrote:

Andres Freund wrote on 2021-08-13 12:21:

Any chance you'd write a test for simplehash with such huge

amount of

values? It'd require a small bit of trickery to be practical. On

systems

with MAP_NORESERVE it should be feasible.


Which way C structures should be tested in postgres?
dynahash/simplhash - do they have any tests currently?
I'll do if hint is given.


We don't have a great way right now :(. I think the best is to have
a
SQL callable function that uses some API. See e.g. test_atomic_ops()
et
al in src/test/regress/regress.c


>  static inline void
> -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
> +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
>  {
>   uint64  size;
>
> @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb,

uint32

> newsize)
>
>   /* now set size */
>   tb->size = size;
> -
> - if (tb->size == SH_MAX_SIZE)
> - tb->sizemask = 0;
> - else
> - tb->sizemask = tb->size - 1;
> + tb->sizemask = (uint32)(size - 1);

ISTM using ~0 would be nicer here?


I don't think so.
To be rigid it should be `~(uint32)0`.
But I believe it doesn't differ from `tb->sizemask = (uint32)(size

- 1)`

that is landed with patch, therefore why `if` is needed?


Personally I find it more obvious to understand the intended
behaviour
with ~0 (i.e. all bits set) than with a width truncation.


https://godbolt.org/z/57WcjKqMj
The generated code is identical.


I believe, you mean https://godbolt.org/z/qWzE1ePTE


regards,
Ranier Vilela





Re: Bug in huge simplehash

2021-08-13 Thread Ranier Vilela
On Fri, Aug 13, 2021 at 07:15, Andres Freund 
wrote:

> Hi,
>
> On 2021-08-13 12:44:17 +0300, Yura Sokolov wrote:
> > Andres Freund wrote on 2021-08-13 12:21:
> > > Any chance you'd write a test for simplehash with such huge amount of
> > > values? It'd require a small bit of trickery to be practical. On
> systems
> > > with MAP_NORESERVE it should be feasible.
> >
> > Which way C structures should be tested in postgres?
> > dynahash/simplhash - do they have any tests currently?
> > I'll do if hint is given.
>
> We don't have a great way right now :(. I think the best is to have a
> SQL callable function that uses some API. See e.g. test_atomic_ops() et
> al in src/test/regress/regress.c
>
>
> > > >  static inline void
> > > > -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
> > > > +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
> > > >  {
> > > >   uint64  size;
> > > >
> > > > @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32
> > > > newsize)
> > > >
> > > >   /* now set size */
> > > >   tb->size = size;
> > > > -
> > > > - if (tb->size == SH_MAX_SIZE)
> > > > - tb->sizemask = 0;
> > > > - else
> > > > - tb->sizemask = tb->size - 1;
> > > > + tb->sizemask = (uint32)(size - 1);
> > >
> > > ISTM using ~0 would be nicer here?
> >
> > I don't think so.
> > To be rigid it should be `~(uint32)0`.
> > But I believe it doesn't differ from `tb->sizemask = (uint32)(size - 1)`
> > that is landed with patch, therefore why `if` is needed?
>
> Personally I find it more obvious to understand the intended behaviour
> with ~0 (i.e. all bits set) than with a width truncation.
>
https://godbolt.org/z/57WcjKqMj
The generated code is identical.

regards,
Ranier Vilela


Re: logical replication empty transactions

2021-08-13 Thread Ajin Cherian
On Mon, Aug 2, 2021 at 7:20 PM Amit Kapila  wrote:
>
> On Fri, Jul 23, 2021 at 3:39 PM Ajin Cherian  wrote:
> >
>
> Let's first split the patch for prepared and non-prepared cases as
> that will help to focus on each of them separately. BTW, why haven't
> you considered implementing point 1b as explained by Andres in his
> email [1]? I think we can send a keepalive message in case of
> synchronous replication when we skip an empty transaction, otherwise,
> it might delay in responding to transactions synchronous_commit mode.
> I think in the tests done in the thread, it might not have been shown
> because we are already sending keepalives too frequently. But what if
> someone disables wal_sender_timeout or kept it to a very large value?
> See WalSndKeepaliveIfNecessary. The other thing you might want to look
> at is if the reason for frequent keepalives is the same as described
> in the email [2].
>

I have tried to address the comment here by modifying the
ctx->update_progress callback function (WalSndUpdateProgress) provided
for plugins. I have added an option by which the callback can specify
whether it wants to send keepalives. When the callback is called with
that option set, walsender updates a flag force_keep_alive_syncrep.
The walsender, in its WalSndWaitForWal loop, checks this flag and, if
synchronous replication is enabled, sends a keepalive. Currently this
logic is added as an else branch to the logic that is already there in
WalSndWaitForWal, which is probably considered unnecessary and a source
of the keepalive flood that you talked about, so I can change that
according to how that fix shapes up there. I have also added an extern
function in syncrep.c that makes it possible for walsender to query
whether synchronous replication is turned on.

The reason I had to turn on a flag and rely on WalSndWaitForWal to send
the keepalive in its next iteration is that I tried doing this directly
when a commit is skipped, but it didn't work. The reason is that when
the commit is being decoded, sentptr is at the commit LSN at that
moment, so the keepalive would be sent for the commit LSN, while the
syncrep wait is waiting for the end_lsn of the transaction, which is
the next LSN. So sending a keepalive at the moment the commit is
decoded doesn't seem to solve the problem of the waiting synchronous
reply.

> Few other miscellaneous comments:
> 1.
> static void
>  pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
> ReorderBufferTXN *txn,
> - XLogRecPtr commit_lsn)
> + XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
> + TimestampTz prepare_time)
>  {
> + PGOutputTxnData*txndata = (PGOutputTxnData *) 
> txn->output_plugin_private;
> +
>   OutputPluginUpdateProgress(ctx);
>
> + /*
> + * If the BEGIN PREPARE was not yet sent, then it means there were no
> + * relevant changes encountered, so we can skip the COMMIT PREPARED
> + * message too.
> + */
> + if (txndata)
> + {
> + bool skip = !txndata->sent_begin_txn;
> + pfree(txndata);
> + txn->output_plugin_private = NULL;
>
> How is this supposed to work after the restart when prepared is sent
> before the restart and we are just sending commit_prepared after
> restart? Won't this lead to sending commit_prepared even when the
> corresponding prepare is not sent? Can we think of a better way to
> deal with this?
>

I have tried to resolve this by adding logic in worker.c to silently
ignore spurious commit_prepared messages. But this change required
checking whether the prepare exists on the subscriber before attempting
the commit_prepared, and the current API that checks this requires the
prepare time and the transaction end_lsn. For this I had to change the
protocol of commit_prepared, and I understand that this would break
backward compatibility between subscriber and publisher (you have
raised this issue as well). I am not sure how else to handle this; let
me know if you have any other ideas. One option could be to have
another API that checks whether the prepare exists on the subscriber
using the prepared 'gid' alone, without checking prepare_time or
end_lsn. Let me know if this idea works.

I have left out the patch 0002 for prepared transactions until we
arrive at a decision on how to address the above issue.

Peter,
I have also addressed the comments you've raised on patch 0001; please
have a look and confirm.

Regards,
Ajin Cherian
Fujitsu Australia.


v13-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: Bug in huge simplehash

2021-08-13 Thread Andres Freund
Hi,

On 2021-08-13 12:44:17 +0300, Yura Sokolov wrote:
> Andres Freund wrote on 2021-08-13 12:21:
> > Any chance you'd write a test for simplehash with such huge amount of
> > values? It'd require a small bit of trickery to be practical. On systems
> > with MAP_NORESERVE it should be feasible.
> 
> Which way C structures should be tested in postgres?
> dynahash/simplhash - do they have any tests currently?
> I'll do if hint is given.

We don't have a great way right now :(. I think the best is to have a
SQL callable function that uses some API. See e.g. test_atomic_ops() et
al in src/test/regress/regress.c
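
A rough sketch (not part of the original message) of what such a
SQL-callable test could look like, following the test_atomic_ops()
pattern; the hash definition and the function name are made up for
illustration and would need a matching CREATE FUNCTION entry:

#include "postgres.h"

#include "fmgr.h"

typedef struct test_entry
{
	uint32		key;
	char		status;			/* used internally by simplehash */
} test_entry;

#define SH_PREFIX testhash
#define SH_ELEMENT_TYPE test_entry
#define SH_KEY_TYPE uint32
#define SH_KEY key
#define SH_HASH_KEY(tb, key) (key)
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

PG_FUNCTION_INFO_V1(test_simplehash);

Datum
test_simplehash(PG_FUNCTION_ARGS)
{
	testhash_hash *h = testhash_create(CurrentMemoryContext, 1024, NULL);
	bool		found;
	uint32		i;

	/* insert a batch of keys, then verify they can all be looked up */
	for (i = 0; i < 100000; i++)
		testhash_insert(h, i, &found);

	for (i = 0; i < 100000; i++)
		if (testhash_lookup(h, i) == NULL)
			elog(ERROR, "simplehash lookup failed for key %u", i);

	testhash_destroy(h);
	PG_RETURN_VOID();
}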


> > >  static inline void
> > > -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
> > > +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
> > >  {
> > >   uint64  size;
> > > 
> > > @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32
> > > newsize)
> > > 
> > >   /* now set size */
> > >   tb->size = size;
> > > -
> > > - if (tb->size == SH_MAX_SIZE)
> > > - tb->sizemask = 0;
> > > - else
> > > - tb->sizemask = tb->size - 1;
> > > + tb->sizemask = (uint32)(size - 1);
> > 
> > ISTM using ~0 would be nicer here?
> 
> I don't think so.
> To be rigid it should be `~(uint32)0`.
> But I believe it doesn't differ from `tb->sizemask = (uint32)(size - 1)`
> that is landed with patch, therefore why `if` is needed?

Personally I find it more obvious to understand the intended behaviour
with ~0 (i.e. all bits set) than with a width truncation.

Greetings,

Andres Freund




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2021-08-13 Thread Andres Freund
Hi,

On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:
> On Tue, Aug 3, 2021 at 2:13 PM Andres Freund  wrote:
> > > Also, I'm unsure how writing the buffer action stats out in
> > > pgstat_write_statsfiles() will work, since I think that backends can
> > > update their buffer action stats after we would have already persisted
> > > the data from the BufferActionStatsArray -- causing us to lose those
> > > updates.
> >
> > I was thinking it'd work differently. Whenever a connection ends, it reports
> > its data up to pgstats.c (otherwise we'd loose those stats). By the time
> > shutdown happens, they all need to have already have reported their stats - 
> > so
> > we don't need to do anything to get the data to pgstats.c during shutdown
> > time.
> >
> 
> When you say "whenever a connection ends", what part of the code are you
> referring to specifically?

pgstat_beshutdown_hook()


> Also, when you say "shutdown", do you mean a backend shutting down or
> all backends shutting down (including postmaster) -- like pg_ctl stop?

Admittedly our language is very imprecise around this :(. What I meant
is that backends would report their own stats up to the stats collector
when the connection ends (in pgstat_beshutdown_hook()). That means that
when the whole server (pgstat and then postmaster, potentially via
pg_ctl stop) shuts down, all the per-connection stats have already been
reported up to pgstat.


> > > diff --git a/src/backend/catalog/system_views.sql 
> > > b/src/backend/catalog/system_views.sql
> > > index 55f6e3711d..96cac0a74e 100644
> > > --- a/src/backend/catalog/system_views.sql
> > > +++ b/src/backend/catalog/system_views.sql
> > > @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
> > >  pg_stat_get_bgwriter_buf_written_checkpoints() AS 
> > > buffers_checkpoint,
> > >  pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
> > >  pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> > > -pg_stat_get_buf_written_backend() AS buffers_backend,
> > > -pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> > > -pg_stat_get_buf_alloc() AS buffers_alloc,
> > >  pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
> >
> > Material for a separate patch, not this. But if we're going to break
> > monitoring queries anyway, I think we should consider also renaming
> > maxwritten_clean (and perhaps a few others), because nobody understands what
> > that is supposed to mean.

> Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?

No - I just meant that now that we're breaking pg_stat_bgwriter queries,
we should also rename the columns to be easier to understand. But it
should be a separate patch / commit...


> > > @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 
> > > *num_buf_alloc)
> > >*/
> > >   *complete_passes += nextVictimBuffer / NBuffers;
> > >   }
> > > -
> > > - if (num_buf_alloc)
> > > - {
> > > - *num_buf_alloc = 
> > > pg_atomic_exchange_u32(>numBufferAllocs, 0);
> > > - }
> > >   SpinLockRelease(>buffer_strategy_lock);
> > >   return result;
> > >  }
> >
> > Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
> > suspect this patch shouldn't get rid of numBufferAllocs at the same time as
> > overhauling the stats stuff. Perhaps we don't need both - but it's not 
> > obvious
> > that that's the case / how we can make that work.
> >
> >
> 
> I initially meant to add a function to the patch like
> pg_stat_get_buffer_actions() but which took a BufferActionType and
> BackendType as parameters and returned a single value which is the
> number of buffer action types of that type for that type of backend.
> 
> let's say I defined it like this:
> uint64
>   pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
>   BufferActionType ba_type)
> 
> Then, I intended to use that in StrategySyncStart() to set num_buf_alloc
> by subtracting the value of StrategyControl->numBufferAllocs from the
> value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
> BA_Alloc), val, then adding that value, val, to
> StrategyControl->numBufferAllocs.

I don't think you could restrict this to B_BG_WRITER? The whole point of
this logic is that bgwriter uses the stats for *all* backends to get the
"usage rate" for buffers, which it then uses to control how many buffers
to clean.


> I think that would have the same behavior as current, though I'm not
> sure if the performance would end up being better or worse. It wouldn't
> be atomically incrementing StrategyControl->numBufferAllocs, but it
> would do a few additional atomic operations in StrategySyncStart() than
> before. Also, we would do all the work done by
> pg_stat_get_buffer_actions() in StrategySyncStart().

I think it'd be better to separate changing the bgwriter pacing logic
(and thus 

Re: [PATCH] Hooks at XactCommand level

2021-08-13 Thread Andres Freund
Hi,

On 2021-08-10 10:12:26 +0200, Gilles Darold wrote:
> Sorry for the response delay. I have though about adding this odd hook to be
> able to implement this feature through an extension because I don't think
> this is something that should be implemented in core. There were also
> patches proposals which were all rejected.
>
> We usually implement the feature at client side which is imo enough for the
> use cases. But the problem is that this a catastrophe in term of
> performances. I have done a small benchmark to illustrate the problem. This
> is a single process client on the same host than the PG backend.
>
> For 10,000 tuples inserted with 50% of failures and rollback at statement
> level handled at client side:
>
>     Expected: 5001, Count:  5001
>     DML insert took: 13 wallclock secs ( 0.53 usr +  0.94 sys =  1.47
> CPU)

Something seems off here. This suggests every insert took 2.6ms. That
seems awfully long, unless your network latency is substantial. I did a
quick test implementing this in the naive-most way in pgbench, and I get
better times - and there's *lots* of room for improvement.

I used a pgbench script that sent the following:
BEGIN;
SAVEPOINT insert_fail;
INSERT INTO testinsert(data) VALUES (1);
ROLLBACK TO SAVEPOINT insert_fail;
SAVEPOINT insert_success;
INSERT INTO testinsert(data) VALUES (1);
RELEASE SAVEPOINT insert_success;
{repeat 5 times}
COMMIT;

I.e. 5 failing and 5 succeeding insertions wrapped in one transaction. I
get >2500 tps, i.e. > 25k rows/sec. And it's not hard to optimize that
further - the {ROLLBACK TO,RELEASE} SAVEPOINT; SAVEPOINT; INSERT can be
sent in one roundtrip. That gets me to somewhere around 40k rows/sec.


BEGIN;

\startpipeline
SAVEPOINT insert_fail;
INSERT INTO testinsert(data) VALUES (1);
\endpipeline

\startpipeline
ROLLBACK TO SAVEPOINT insert_fail;
SAVEPOINT insert_success;
INSERT INTO testinsert(data) VALUES (1);
\endpipeline

\startpipeline
RELEASE SAVEPOINT insert_success;
SAVEPOINT insert_fail;
INSERT INTO testinsert(data) VALUES (1);
\endpipeline

\startpipeline
ROLLBACK TO SAVEPOINT insert_fail;
SAVEPOINT insert_success;
INSERT INTO testinsert(data) VALUES (1);
\endpipeline

{repeat last two blocks three times}
COMMIT;

Greetings,

Andres Freund




Re: Bug in huge simplehash

2021-08-13 Thread Yura Sokolov

Andres Freund wrote on 2021-08-13 12:21:
> Hi,
>
> On 2021-08-10 11:52:59 +0300, Yura Sokolov wrote:
>> - sizemask is set only in SH_COMPUTE_PARAMETERS . And it is set in
>> this way:
>>
>> /* now set size */
>> tb->size = size;
>>
>> if (tb->size == SH_MAX_SIZE)
>> tb->sizemask = 0;
>> else
>> tb->sizemask = tb->size - 1;
>>
>>   that means, when we are resizing to SH_MAX_SIZE, sizemask becomes
>> zero.
>
> I think that was intended to be ~0.

I believe so.

>> Ahh... ok, patch is updated to fix this as well.
>
> Any chance you'd write a test for simplehash with such huge amount of
> values? It'd require a small bit of trickery to be practical. On
> systems with MAP_NORESERVE it should be feasible.

Which way C structures should be tested in postgres?
dynahash/simplhash - do they have any tests currently?
I'll do if hint is given.

>>  static inline void
>> -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
>> +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
>>  {
>> 	uint64  size;
>>
>> @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32
>> newsize)
>>
>> 	/* now set size */
>> 	tb->size = size;
>> -
>> -	if (tb->size == SH_MAX_SIZE)
>> -		tb->sizemask = 0;
>> -	else
>> -		tb->sizemask = tb->size - 1;
>> +	tb->sizemask = (uint32)(size - 1);
>
> ISTM using ~0 would be nicer here?

I don't think so.
To be rigid it should be `~(uint32)0`.
But I believe it doesn't differ from `tb->sizemask = (uint32)(size - 1)`
that is landed with patch, therefore why `if` is needed?

> Greetings,
>
> Andres Freund





Re: Autovacuum on partitioned table (autoanalyze)

2021-08-13 Thread Andres Freund
Hi,

On 2021-08-11 18:33:07 -0400, Alvaro Herrera wrote:
> After thinking about the described issues for a while, my proposal is to
> completely revamp the way this feature works.  See below.
>
> Now, the proposal seems awfully invasive, but it's *the* way I see to
> avoid the pgstat traffic.  For pg14, maybe we can live with it, and just
> use the smaller patches that Horiguchi-san and I have posted, which
> solve the other issues; also, Euler Taveira suggested that we could add
> a reloption to turn the feature off completely for some tables (maybe
> make it off by default and have a reloption to turn it on for specific
> partition hierarchies), so that it doesn't cause unduly pain for people
> with large partitioning hierarchies.

I think we should revert the changes for 14 - to me the feature clearly isn't
mature enough to be released.


> * PgStat_StatTabEntry gets a new "Oid reportAncestorOid" member. This is
>   the OID of a single partitioned ancestor, to which the changed-tuple
>   counts are propagated up.
>   Normally this is the topmost ancestor; but if the user wishes some
>   intermediate ancestor to receive the counts they can use
>   ALTER TABLE the_intermediate_ancestor SET (autovacuum_enabled=on).
>
> * Corollary 1: for the normal case of single-level partitioning, the
>   parent partitioned table behaves as currently.
>
> * Corollary 2: for multi-level partitioning with no especially
>   configured intermediate ancestors, only the leaf partitions and the
>   top-level partitioned table will be analyzed.  Intermediate ancestors
>   are ignored by autovacuum.
>
> * Corollary 3: for multi-level partitioning with some intermediate
>   ancestor(s) marked as autovacuum_enabled=on, that ancestor will
>   receive all the counts from all of its partitions, so it will get
>   analyzed itself; and it'll also forward those counts up to its
>   report-ancestor.

This seems awfully confusing to me.

One fundamental issue here is that we separately build stats for partitioned
tables and partitions. Can we instead tackle this by reusing the stats for
partitions for the inheritance stats?  I think it's a bit easier to do that
for partitioned tables than for old school inheritance roots, because there
are no other rows in partitioned tables.

Greetings,

Andres Freund




Re: Bug in huge simplehash

2021-08-13 Thread Andres Freund
Hi,

On 2021-08-10 11:52:59 +0300, Yura Sokolov wrote:
> - sizemask is set only in SH_COMPUTE_PARAMETERS . And it is set in this way:
> 
> /* now set size */
> tb->size = size;
> 
> if (tb->size == SH_MAX_SIZE)
> tb->sizemask = 0;
> else
> tb->sizemask = tb->size - 1;
> 
>   that means, when we are resizing to SH_MAX_SIZE, sizemask becomes zero.

I think that was intended to be ~0.


> Ahh... ok, patch is updated to fix this as well.

Any chance you'd write a test for simplehash with such huge amount of
values? It'd require a small bit of trickery to be practical. On systems
with MAP_NORESERVE it should be feasible.
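
A minimal standalone sketch (not part of the original message) of the
MAP_NORESERVE idea: reserve a huge anonymous mapping without committing
memory, so a test that only touches a handful of pages near the top
stays cheap. Linux-specific, and the sizes here are arbitrary:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
	size_t		sz = (size_t) 1 << 40;	/* 1 TiB of address space */
	char	   *p;

	p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (p == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	/* only the pages actually written consume real memory */
	memset(p + sz - 4096, 0xff, 4096);
	printf("reserved %zu bytes, touched one page\n", sz);

	munmap(p, sz);
	return 0;
}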


>  static inline void
> -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
> +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
>  {
>   uint64  size;
>  
> @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
>  
>   /* now set size */
>   tb->size = size;
> -
> - if (tb->size == SH_MAX_SIZE)
> - tb->sizemask = 0;
> - else
> - tb->sizemask = tb->size - 1;
> + tb->sizemask = (uint32)(size - 1);

ISTM using ~0 would be nicer here?

Greetings,

Andres Freund




Re: [bug] Logical Decoding of relation rewrite with toast does not reset toast_hash

2021-08-13 Thread Amit Kapila
On Fri, Aug 13, 2021 at 11:47 AM Drouvot, Bertrand  wrote:
>
> On 8/12/21 1:00 PM, Amit Kapila wrote:
> >>
> >> - sets "relrewrite" for the toast.
> >>
> > --- a/src/backend/commands/tablecmds.c
> > +++ b/src/backend/commands/tablecmds.c
> > @@ -3861,6 +3861,10 @@ RenameRelationInternal(Oid myrelid, const char
> > *newrelname, bool is_internal, bo
> >*/
> >namestrcpy(&(relform->relname), newrelname);
> >
> > + /* reset relrewrite for toast */
> > + if (relform->relkind == RELKIND_TOASTVALUE)
> > + relform->relrewrite = InvalidOid;
> > +
> >
> > I find this change quite ad-hoc. I think this API is quite generic to
> > make such a change. I see two ways for this (a) pass a new bool flag
> > (reset_toast_rewrite) in this API and then make this change, (b) in
> > the specific place where we need this, change relrewrite separately
> > via a new API.
> >
> > I would prefer (b) in the ideal case, but I understand it would be an
> > additional cost, so maybe (a) is also okay. What do you people think?
>
> I would prefer a) because:
>
> - b) would need to update the exact same tuple one more time (means
> doing more or less the same work: open relation, search for the tuple,
> update the tuple...)
>
> - a) would still give the ability for someone reading the code to
> understand where the relrewrite reset is needed (as opposed to the way
> the patch is currently written)
>
> - finish_heap_swap() with swap_toast_by_content set to false, is the
> only place where we currently need to reset explicitly relrewrite (so
> that we would have the new API produced by b) being called only at that
> place).
>
> - That means that b) would be only for code readability but at the price
> of extra cost.
>

Would anybody else like to weigh in? I think it would be good if a few
others also share their opinion, as we need to backpatch this bug-fix.
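
For concreteness, a rough sketch (not part of the original message, and
only an assumption of what option (b) might look like) of a dedicated
helper that clears pg_class.relrewrite, which finish_heap_swap() could
call for the TOAST relation when swap_toast_by_content is false:

#include "postgres.h"

#include "access/htup_details.h"
#include "access/table.h"
#include "catalog/indexing.h"
#include "catalog/pg_class.h"
#include "utils/syscache.h"

/* Hypothetical helper: reset pg_class.relrewrite for the given relation */
static void
ResetRelRewrite(Oid myrelid)
{
	Relation	relrelation;
	HeapTuple	reltup;
	Form_pg_class relform;

	relrelation = table_open(RelationRelationId, RowExclusiveLock);

	reltup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(myrelid));
	if (!HeapTupleIsValid(reltup))
		elog(ERROR, "cache lookup failed for relation %u", myrelid);
	relform = (Form_pg_class) GETSTRUCT(reltup);

	relform->relrewrite = InvalidOid;

	CatalogTupleUpdate(relrelation, &reltup->t_self, reltup);

	heap_freetuple(reltup);
	table_close(relrelation, RowExclusiveLock);
}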

-- 
With Regards,
Amit Kapila.




Re: Shared memory size computation oversight?

2021-08-13 Thread Peter Eisentraut

On 12.08.21 16:18, Julien Rouhaud wrote:
> On Thu, Aug 12, 2021 at 9:34 PM Peter Eisentraut
>  wrote:
>>
>> On 27.02.21 09:08, Julien Rouhaud wrote:
>>> PFA a patch that fixes pg_prewarm and pg_stat_statements explicit alignment to
>>> CACHELINEALIGN, and also update the alignment in hash_estimate_size() to what I
>>> think ShmemInitHash will actually consume.
>>
>> For extensions, wouldn't it make things better if
>> RequestAddinShmemSpace() applied CACHELINEALIGN() to its argument?
>>
>> If you consider the typical sequence of RequestAddinShmemSpace(mysize())
>> and later ShmemInitStruct(..., mysize(), ...),
>
> But changing RequestAddinShmemSpace() to apply CACHELINEALIGN() would
> only really work for that specific usage only?  If an extension does
> multiple allocations you can't rely on correct results.

I think you can do different things here to create inconsistent results,
but I think there should be one common, standard, normal,
straightforward way to get a correct result.

>> Btw., I think your patch was wrong to apply CACHELINEALIGN() to
>> intermediate results rather than at the end.
>
> I'm not sure why you think it's wrong.  It's ShmemInitStruct() that
> will apply the padding, so if the extension calls it multiple times
> (whether directly or indirectly), then the padding will be added
> multiple times.  Which means that in theory the extension should
> account for it multiple time in the amount of memory it's requesting.

Yeah, in that case it's probably rather the case that there is one
CACHELINEALIGN() too few, since pg_stat_statements does two separate
shmem allocations.
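
To make the sequence under discussion concrete, here is a minimal
sketch (not part of the original message; the extension and struct
names are made up) of the usual pattern where the size passed to
RequestAddinShmemSpace() mirrors what ShmemInitStruct() will later hand
out, padded to a cache line:

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

typedef struct MyExtSharedState
{
	int64		counter;
} MyExtSharedState;

static MyExtSharedState *my_state = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

static Size
my_ext_memsize(void)
{
	/* pad to a cache line, mirroring what the shmem allocator will do */
	return CACHELINEALIGN(sizeof(MyExtSharedState));
}

static void
my_ext_shmem_startup(void)
{
	bool		found;

	if (prev_shmem_startup_hook)
		prev_shmem_startup_hook();

	my_state = ShmemInitStruct("my_ext state", my_ext_memsize(), &found);
	if (!found)
		my_state->counter = 0;
}

void
_PG_init(void)
{
	if (!process_shared_preload_libraries_in_progress)
		return;

	/* PG 14 style: request the (aligned) size up front in _PG_init() */
	RequestAddinShmemSpace(my_ext_memsize());

	prev_shmem_startup_hook = shmem_startup_hook;
	shmem_startup_hook = my_ext_shmem_startup;
}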





Re: Default to TIMESTAMP WITH TIME ZONE?

2021-08-13 Thread Gavin Flower

On 13/08/21 5:14 pm, Greg Stark wrote:
> I think having a GUC to change to a different set of semantics is not workable.
>
> However that doesn't mean we can't do anything. We could have a GUC
> that just disables allowing creating columns of type timestamp without
> tz. That could print an error with a hint suggesting you can change
> the variable if you really want to allow them.
>
> I still would hesitate to make it the default but admins could set it
> to help their developers.

[...]

I always use the tz version, except when I forget.  Initially, I was
totally unaware of the need & usefulness of storing time with a time
zone, so I would use the GUC.

I suspect it is extremely rare that one would not want to use the TZ
option, which suggests to me that using TZ should be the default.



Cheers,
Gavin





Re: Postgres picks suboptimal index after building of an extended statistics

2021-08-13 Thread Andrey V. Lepikhov

On 8/12/21 4:26 AM, Tomas Vondra wrote:
> On 8/11/21 2:48 AM, Peter Geoghegan wrote:
>> On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov
>>  wrote:
>>> Ivan Frolkov reported a problem with choosing a non-optimal index during
>>> a query optimization. This problem appeared after building of an
>>> extended statistics.
>>
>> Any thoughts on this, Tomas?
>
> Thanks for reminding me, I missed / forgot about this thread.
>
> I agree the current behavior is unfortunate, but I'm not convinced the
> proposed patch is fixing the right place - doesn't this mean the index
> costing won't match the row estimates displayed by EXPLAIN?

I think it is not a problem. In EXPLAIN you will see only 1 row
with/without this patch.

> I wonder if we should teach clauselist_selectivity about UNIQUE indexes,
> and improve the cardinality estimates directly, not just costing for
> index scans.

This idea looks better. I will try to implement it.

> Also, is it correct that the patch calculates num_sa_scans only when
> (numIndexTuples >= 0.0)?

Thanks, fixed.

--
regards,
Andrey Lepikhov
Postgres Professional
From 8a4ad08d61a5d14a45ef5e182f002e918f0eaccc Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Wed, 23 Jun 2021 12:05:24 +0500
Subject: [PATCH] In the case of an unique one row btree index scan only one
 row can be returned. In the genericcostestimate() routine we must arrange the
 index selectivity value in accordance with this knowledge.

---
 src/backend/utils/adt/selfuncs.c | 78 
 1 file changed, 48 insertions(+), 30 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 0c8c05f6c2..9538c4a5b4 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6369,15 +6369,9 @@ genericcostestimate(PlannerInfo *root,
 	double		num_scans;
 	double		qual_op_cost;
 	double		qual_arg_cost;
-	List	   *selectivityQuals;
 	ListCell   *l;
 
-	/*
-	 * If the index is partial, AND the index predicate with the explicitly
-	 * given indexquals to produce a more accurate idea of the index
-	 * selectivity.
-	 */
-	selectivityQuals = add_predicate_to_index_quals(index, indexQuals);
+	numIndexTuples = costs->numIndexTuples;
 
 	/*
 	 * Check for ScalarArrayOpExpr index quals, and estimate the number of
@@ -6398,36 +6392,59 @@ genericcostestimate(PlannerInfo *root,
 		}
 	}
 
-	/* Estimate the fraction of main-table tuples that will be visited */
-	indexSelectivity = clauselist_selectivity(root, selectivityQuals,
-			  index->rel->relid,
-			  JOIN_INNER,
-			  NULL);
+	if (numIndexTuples >= 0.0)
+	{
+		List		*selectivityQuals;
 
-	/*
-	 * If caller didn't give us an estimate, estimate the number of index
-	 * tuples that will be visited.  We do it in this rather peculiar-looking
-	 * way in order to get the right answer for partial indexes.
-	 */
-	numIndexTuples = costs->numIndexTuples;
-	if (numIndexTuples <= 0.0)
+		/*
+		 * If the index is partial, AND the index predicate with the explicitly
+		 * given indexquals to produce a more accurate idea of the index
+		 * selectivity.
+		 */
+		selectivityQuals = add_predicate_to_index_quals(index, indexQuals);
+
+		/* Estimate the fraction of main-table tuples that will be visited */
+		indexSelectivity = clauselist_selectivity(root, selectivityQuals,
+  index->rel->relid,
+  JOIN_INNER,
+  NULL);
+
+		/*
+		 * If caller didn't give us an estimate, estimate the number of index
+		 * tuples that will be visited.  We do it in this rather peculiar-looking
+		 * way in order to get the right answer for partial indexes.
+		 */
+		if (numIndexTuples == 0.0)
+		{
+			numIndexTuples = indexSelectivity * index->rel->tuples;
+
+			/*
+			 * The above calculation counts all the tuples visited across all
+			 * scans induced by ScalarArrayOpExpr nodes.  We want to consider the
+			 * average per-indexscan number, so adjust.  This is a handy place to
+			 * round to integer, too.  (If caller supplied tuple estimate, it's
+			 * responsible for handling these considerations.)
+			 */
+			numIndexTuples = rint(numIndexTuples / num_sa_scans);
+		}
+	}
+	else
 	{
-		numIndexTuples = indexSelectivity * index->rel->tuples;
+		Assert(numIndexTuples == -1.0);
 
 		/*
-		 * The above calculation counts all the tuples visited across all
-		 * scans induced by ScalarArrayOpExpr nodes.  We want to consider the
-		 * average per-indexscan number, so adjust.  This is a handy place to
-		 * round to integer, too.  (If caller supplied tuple estimate, it's
-		 * responsible for handling these considerations.)
+		 * Unique one row scan can select no more than one row. It needs to
+		 * manually set the selectivity of the index. The value of numIndexTuples
+		 * will be corrected later.
 		 */
-		numIndexTuples = rint(numIndexTuples / num_sa_scans);
+		indexSelectivity = 1.0 / index->rel->tuples;
 	}
 
 	/*
 	 * We can bound the number of tuples by the index size in any case. Also,
 	 * 

Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-08-13 Thread Greg Nancarrow
On Thu, Aug 12, 2021 at 5:37 AM Robert Haas  wrote:
>
> 1. Then why doesn't the approach I proposed fix it?
>

I think that with your approach, it is not doing the expected
initialization done by SetTransactionSnapshot() (which is called by
RestoreTransactionSnapshot(), which your approach skips in the case of
the SQL script that reproduces the problem, because
IsolationUsesXactSnapshot() returns false for XACT_READ_COMMITTED).
There's some comments in SetTransactionSnapshot() explaining the
tricky parts of this initialization, testing that the source
transaction is still running, dealing with a race condition, and
setting up TransactionXmin.
Also, there's an "if (IsolationUsesXactSnapshot()) ..." block within
that function, doing some required setup for transaction-snapshot
mode, so it doesn't seem like a good idea to not call
RestoreTransactionSnapshot() if !IsolationUsesXactSnapshot(), as the
function is obviously catering for both cases, when the isolation
level does and doesn't use a transaction snapshot. So I think
SetTransactionSnapshot() always needs to be called.

With your proposed approach, what I'm seeing is that the worker calls
GetTransactionSnapshot() at some point, which then builds a new
snapshot, and results in increasing TransactionXmin (probably because
another concurrent transaction has since completed). This snapshot is
thus later than the snapshot used in the execution state of the query
being executed. This causes the Assert in
SubTransGetTopmostTransaction() to fire because the xid doesn't follow
or equal the TransactionXmin value.

> 2. Consider the case where the toplevel query is something like SELECT
> complexfunc() FROM generate_series(1,10) g -- in a case like this, I
> think complexfunc() can cause snapshots to be taken internally. For
> example suppose we end up inside exec_eval_simple_expr, or
> SPI_cursor_open_internal, in either case with read_only = false. Here
> we're going to again call GetTransactionSnapshot() and then execute a
> query which may use parallelism.
>
>

A query always uses the ActiveSnapshot at the time the QueryDesc is
created  - so as long as you don't (as the current code does) obtain a
potentially later snapshot and try to restore that in the worker as
the TransactionSnapshot (or let the worker create a new snapshot,
because no TransactionSnapshot was restored, which may have a greater
xmin than the ActiveSnapshot) then I think it should be OK.


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Logical replication keepalive flood

2021-08-13 Thread Peter Smith
FYI - Here are some more counter results with/without the skip patch
[1] applied.

This is the same test setup as before except now using *synchronous* pub/sub.

//

Test setup:
- using synchronous pub/sub
- subscription is for the pgbench_history table
- pgbench is run for 10 seconds
- config for all the wal_sender/receiver timeout GUCs are just default values

WITHOUT the skip-first patch applied
=

RUN #1
--
LOG:  Total records: 310
LOG: 24:49 /   131 /49:8403 /9270
LOG:944: 1 / 0 / 1: 159693904 / 159693904
LOG:   8280: 1 / 2 / 2: 480 / 480

RUN #2
--
LOG:  Total records: 275
LOG: 24:45 /   129 /46:8580 /8766
LOG:   5392: 1 / 0 / 1: 160107248 / 160107248

RUN #3
--
LOG:  Total records: 330
LOG: 24:50 /   144 /51:8705 /8705
LOG:   3704: 1 / 0 / 1: 160510344 / 160510344
LOG:   8280: 1 / 2 / 2: 468 / 468

WITH the skip-first patch applied
=

RUN #1
--
LOG:  Total records: 247
LOG: 24: 5 /   172 /44: 3601700 / 3601700
LOG:   8280: 1 / 1 / 1:1192 /1192

RUN #2
--
LOG:  Total records: 338
LOG: 24: 8 /   199 /55:1335 /1335
LOG:   7597: 1 / 1 / 1:   11712 /   11712
LOG:   8280: 1 / 1 / 2: 480 / 480

RUN #3
--
LOG:  Total records: 292
LOG: 24: 4 /   184 /49: 719 / 719

//

As before there is a big % reduction of keepalives after the patch,
except here there was never really much of a "flood" in the first
place.

--
[1] 
https://www.postgresql.org/message-id/CAHut%2BPtyMBzweYUpb_QazVL6Uze2Yc5M5Ti2Xwee_eWM3Jrbog%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Skipping logical replication transactions on subscriber side

2021-08-13 Thread Greg Nancarrow
On Fri, Aug 13, 2021 at 2:06 PM Amit Kapila  wrote:
>
> On Thu, Aug 12, 2021 at 5:41 PM Greg Nancarrow  wrote:
> >
> > On Thu, Aug 12, 2021 at 9:18 PM Amit Kapila  wrote:
> > >
> > > > A minor comment on the 0001 patch: In the message I think that using
> > > > "ID" would look better than lowercase "id" and AFAICS it's more
> > > > consistent with existing messages.
> > > >
> > > > + appendStringInfo(, _(" in transaction id %u with commit timestamp 
> > > > %s"),
> > > >
> > >
> > > You have a point but I think in this case it might look a bit odd as
> > > we have another field 'commit timestamp' after that which is
> > > lowercase.
> > >
> >
> > I did a quick search and I couldn't find any other messages in the
> > Postgres code that use "transaction id", but I could find some that
> > use "transaction ID" and "transaction identifier".
> >
>
> Okay, but that doesn't mean using it here is bad. I am personally fine
> with a message containing something like "... in transaction
> id 740 with commit timestamp 2021-08-10 14:44:38.058174+05:30" but I
> won't mind if you and or others find some other way convenient. Any
> opinion from others?
>

Just to be clear, all I was saying is that I thought using uppercase
"ID" looked better in the message, and was more consistent with
existing logged messages, than using lowercase "id".
i.e. my suggestion was a trivial change:

BEFORE:
+ appendStringInfo(, _(" in transaction id %u with commit timestamp %s"),
AFTER:
+ appendStringInfo(, _(" in transaction ID %u with commit timestamp %s"),

But it was just a suggestion. Maybe others feel differently.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Next Steps with Hash Indexes

2021-08-13 Thread Dilip Kumar
On Fri, Aug 13, 2021 at 9:31 AM Amit Kapila  wrote:
>
> On Thu, Aug 12, 2021 at 8:30 PM Robert Haas  wrote:
> >
> > On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila  
> > wrote:
> > > The design of the patch has changed since the initial proposal. It
> > > tries to perform unique inserts by holding a write lock on the bucket
> > > page to avoid duplicate inserts.
> >
> > Do you mean that you're holding a buffer lwlock while you search the
> > whole bucket? If so, that's surely not OK.
> >
>
> I think here you are worried that after holding lwlock we might
> perform reads of overflow pages which is not a good idea. I think
> there are fewer chances of having overflow pages for unique indexes so
> we don't expect such cases in common

I think if we identify the bucket based on the hash value of all the
columns then there should be fewer overflow buckets, but IIUC, in this
patch the bucket is identified based on the hash value of the first
column only, so there could be a lot of duplicates on the first column.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com