Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-06 Thread John Naylor
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada wrote:
> In addition to two patches, I've attached the third patch. It's not
> part of radix tree implementation but introduces a contrib module
> bench_radix_tree, a tool for radix tree performance benchmarking. It
> measures loading and lookup performance of both the radix tree and a
> flat array.

Hi Masahiko, I've been using these benchmarks, along with my own
variations, to try various things that I've mentioned. I'm long overdue for
an update, but the picture is not yet complete.

For now, I have two questions that I can't figure out on my own:

1. There seems to be some non-obvious limit on the number of keys that are
loaded (or at least on what the numbers report). This is independent of the
number of tids per block. Example below:

john=# select * from bench_shuffle_search(0, 8*1000*1000);
NOTICE:  num_keys = 8000000, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 250000, n256 = 981
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 8000000 |        268435456 |            48000000 |        661 |            29 |          276 |             389

john=# select * from bench_shuffle_search(0, 9*1000*1000);
NOTICE:  num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 262144, n256 = 1028
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 8388608 |        276824064 |            54000000 |        718 |            33 |          311 |             446

The array is the right size, but nkeys hasn't kept pace. (Note that 8388608
is exactly 2^23, which suggests a power-of-two cap somewhere.) Can you
reproduce this? Attached is the patch I'm using to show the stats when
running the test. (Side note: the numbers look unfavorable for the radix
tree because I'm using 1 tid per block here.)

2. I found that bench_shuffle_search() is much *faster* for traditional
binary search on an array than bench_seq_search(). I've found this to be
true in every case. This seems counterintuitive to me -- any idea why this
is? Example:

john=# select * from bench_seq_search(0, 1000000);
NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |        168 |           106 |          827 |            3348

john=# select * from bench_shuffle_search(0, 1000000);
NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |        171 |           107 |          827 |            1400

--
John Naylor
EDB: http://www.enterprisedb.com
From 43a50a385930ee340d0a3b003910c704a0ff342c Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Thu, 6 Oct 2022 09:07:41 +0700
Subject: [PATCH v65 1/5] Turn on per-node counts in benchmark

Also add .gitignore, fix whitespace, and change to NOTICE
---
 contrib/bench_radix_tree/.gitignore | 3 +++
 contrib/bench_radix_tree/bench_radix_tree.c | 5 +
 src/backend/lib/radixtree.c | 2 +-
 src/include/lib/radixtree.h | 2 +-
 4 files changed, 10 insertions(+), 2 deletions(-)
 create mode 100644 contrib/bench_radix_tree/.gitignore

diff --git a/contrib/bench_radix_tree/.gitignore b/contrib/bench_radix_tree/.gitignore
new file mode 100644
index 0000000000..8830f5460d
--- /dev/null
+++ b/contrib/bench_radix_tree/.gitignore
@@ -0,0 +1,3 @@
+*data
+log/*
+results/*
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..36c5218ae7 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "lib/radixtree.h"
+#include 
 #include "miscadmin.h"
 #include "utils/timestamp.h"
 
@@ -183,6 +184,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
 	TimestampDifference(start_time, end_time, &secs, &usecs);
 	rt_load_ms = secs * 1000 + usecs / 1000;
 
+	rt_stats(rt);
+
 	/* measure the load time of the array */
 	itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
 	  sizeof(ItemPointerData) * ntids);
@@ -292,6 +295,8 @@ bench_load_random_int(PG_FUNCTION_ARGS)
 	TimestampDifference(start_time, end_time, &secs, &usecs);
 	load_time_ms = secs * 1000 + usecs / 1000;
 
+	rt_stats(rt);
+
 	MemSet(nulls, false, sizeof(nulls));
 	values[0] = 

Re: proposal: possibility to read dumped table's name from file

2022-10-06 Thread Pavel Stehule
Hi

I am sending a version with a hand-written parser and meson support

On Mon, Oct 3, 2022 at 6:34 AM, Julien Rouhaud wrote:

> Hi,
>
> > You started rewriting it, but you didn't finish it.
> >
> > Unfortunately, there is no clear consensus on using a bison parser for
> > this purpose. I understand that the complexity of this language is so low
> > that the benefit of a bison grammar is low too. Personally, I don't have
> > any problem with using bison for this purpose. In this case, I think we
> > are comparing two similarly long ways, but unfortunately, customers that
> > have a problem with long command lines still have this problem.
> >
> > Can we go forward? Daniel is strongly against a handwritten parser. Is
> > there somebody strongly against a bison-based parser? There is no other
> > way.
>
> I don't have a strong opinion either, but given that two people argued
> against a bison parser (vs. only one arguing for it), and that the current
> habit is to rely on hand-written parsers for simple cases (e.g. jsonapi.c /
> pg_parse_json()), it seems that we should go back to Pavel's original
> parser.
>
> I only had a quick look, but it indeed seems trivial; it just maybe needs
> a bit of refactoring to avoid some code duplication (getFiltersFromFile is
> duplicated, and getDatabaseExcludeFiltersFromFile could be removed if
> getFiltersFromFile knew about the 2 patterns).
>

I checked this code again, and I don't think that refactoring would be easy.
getFiltersFromFile is not duplicated; it is just probably badly named.

These routines are used from pg_dump, pg_dumpall and pg_restore. There are
significant differences in the supported objects and in the types used for
the returned lists (dumpOptions, SimpleStringList, and RestoreOptions). If I
had one routine, I would need to implement some mechanism for specifying the
supported objects, plus a special type that could be used as a proxy between
caller and parser to hold the lists of parsed values. To make the names less
confusing, I renamed the routines to read_dump_filters, read_dumpall_filters
and read_restore_filters.

Regards

Pavel
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 8b9d9f4cad..955bfcfdad 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -779,6 +779,80 @@ PostgreSQL documentation
   
  
 
+ 
+  --filter=filename
+  
+   
+Specify a filename from which to read patterns for objects to include
+or exclude from the dump. The patterns are interpreted according to the
+same rules as the corresponding options:
+-t/--table for tables,
+-n/--schema for schemas,
+--include-foreign-data for data on foreign servers and
+--exclude-table-data for table data.
+To read from STDIN, use - as the
+filename.  The --filter option can be specified in
+conjunction with the above listed options for including or excluding
+objects, and can also be specified more than once for multiple filter
+files.
+   
+
+   
+The file lists one object pattern per row, with the following format:
+
+{ include | exclude } { table | schema | foreign_data | data } PATTERN
+
+   
+
+   
+The first keyword specifies whether the objects matched by the pattern
+are to be included or excluded. The second keyword specifies the type
+of object to be filtered using the pattern:
+
+ 
+  
+   table: tables, works like
+   -t/--table
+  
+ 
+ 
+  
+   schema: schemas, works like
+   -n/--schema
+  
+ 
+ 
+  
+   foreign_data: data on foreign servers, works like
+   --include-foreign-data. This keyword can only be
+   used with the include keyword.
+  
+ 
+ 
+  
+   data: table data, works like
+   --exclude-table-data. This keyword can only be
+   used with the exclude keyword.
+  
+ 
+
+   
+
+   
+Lines starting with # are considered comments and
+ignored. Comments can also be placed after an object pattern. Blank lines
+are also ignored. See  for how to
+perform quoting in patterns.
+   
+
+   
+Example files are listed below in the 
+section.
+   
+
+  
+ 
+
  
   --if-exists
   
@@ -1119,6 +1193,7 @@ PostgreSQL documentation
 schema (-n/--schema) and
 table (-t/--table) qualifier
 match at least one extension/schema/table in the database to be dumped.
+This also applies to filters used with --filter.
 Note that if none of the extension/schema/table qualifiers find
 matches, pg_dump will generate an error
 even without --strict-names.
@@ -1528,6 +1603,19 @@ CREATE DATABASE foo WITH TEMPLATE template0;
 
 
 $ pg_dump -t 

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Amit Kapila
On Fri, Oct 7, 2022 at 8:38 AM Peter Smith  wrote:
>
> On Thu, Oct 6, 2022 at 10:38 PM Amit Kapila  wrote:
> >
> > On Fri, Sep 30, 2022 at 1:56 PM Peter Smith  wrote:
> > >
> > > Here are my review comments for the v35-0001 patch:
> > >
> > > ==
> > >
> > > 3. GENERAL
> > >
> > > (this comment was written after I wrote all the other ones below so
> > > there might be some unintended overlaps...)
> > >
> > > I found the mixed use of the same member names having different
> > > meanings to be quite confusing.
> > >
> > > e.g.1
> > > PGOutputData 'streaming' is now a single-char internal representation
> > > of the subscription streaming mode ('f','t','p')
> > > - bool streaming;
> > > + char streaming;
> > >
> > > e.g.2
> > > WalRcvStreamOptions 'streaming' is a C string version of the
> > > subscription streaming mode ("on", "parallel")
> > > - bool streaming; /* Streaming of large transactions */
> > > + char*streaming; /* Streaming of large transactions */
> > >
> > > e.g.3
> > > SubOpts 'streaming' is again like the first example - a single char
> > > for the mode.
> > > - bool streaming;
> > > + char streaming;
> > >
> > >
> > > IMO everything would become much simpler if you did:
> > >
> > > 3a.
> > > Rename "char streaming;" -> "char streaming_mode;"
> > >
> > > 3b.
> > > Re-designed the "char *streaming;" code to also use the single char
> > > notation, then also call that member 'streaming_mode'. Then everything
> > > will be consistent.
> > >
> >
> > Won't this impact the previous version publisher which already uses
> > on/off? We may need to maintain multiple values which would be
> > confusing.
> >
>
> I only meant that the *internal* struct member names mentioned could
> change - not anything exposed as user-visible parameter names or
> column names etc. Or were you referring to it as causing unnecessary
> troubles for back-patching? Anyway, the main point of this review
> comment was #3b.
>

My response was for 3b only.

> Unless I am mistaken, there is no reason why that one
> cannot be changed to use 'char' instead of 'char *', for consistency
> across all the same named members.
>

I feel this will bring more complexity to the code if you have to keep
it working with old-version publishers.

> > >
> > > 9. - parallel_apply_can_start
> > >
> > > +/*
> > > + * Returns true, if it is allowed to start a parallel apply worker, 
> > > false,
> > > + * otherwise.
> > > + */
> > > +static bool
> > > +parallel_apply_can_start(TransactionId xid)
> > >
> > > (The commas are strange)
> > >
> > > SUGGESTION
> > > Returns true if it is OK to start a parallel apply worker, false 
> > > otherwise.
> > >
> >
> > +1 for this.
> > >
> > > 28. - logicalrep_worker_detach
> > >
> > > + /* Stop the parallel apply workers. */
> > > + if (am_leader_apply_worker())
> > > + {
> > >
> > > Should that comment rather say like below?
> > >
> > > /* If this is the leader apply worker then stop all of its parallel
> > > apply workers. */
> > >
> >
> > I think this would be just saying what is apparent from the code, so
> > not sure if it is an improvement.
> >
> > >
> > > 38. - apply_handle_commit_prepared
> > >
> > > + *
> > > + * Note that we don't need to wait here if the transaction was prepared 
> > > in a
> > > + * parallel apply worker. Because we have already waited for the prepare 
> > > to
> > > + * finish in apply_handle_stream_prepare() which will ensure all the 
> > > operations
> > > + * in that transaction have happened in the subscriber and no concurrent
> > > + * transaction can create deadlock or transaction dependency issues.
> > >   */
> > >  static void
> > >  apply_handle_commit_prepared(StringInfo s)
> > >
> > > "worker. Because" -> "worker because"
> > >
> >
> > I think this will make this line too long. Can we think of breaking it
> > in some way?
>
> OK, how about below:
>
> Note that we don't need to wait here if the transaction was prepared
> in a parallel apply worker. In that case, we have already waited for
> the prepare to finish in apply_handle_stream_prepare() which will
> ensure all the operations in that transaction have happened in the
> subscriber, so no concurrent transaction can cause deadlock or
> transaction dependency issues.
>

Yeah, this looks better.

> >
> > > 58.
> > >
> > > --- fail - streaming must be boolean
> > > +-- fail - streaming must be boolean or 'parallel'
> > >  CREATE SUBSCRIPTION regress_testsub CONNECTION
> > > 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> > > false, streaming = foo);
> > >
> > > I think there are tests already for explicitly create/set the
> > > subscription parameter streaming = on/off/parallel
> > >
> > > But what about when there is no value explicitly specified? Shouldn't
> > > there also be tests like below to check that *implied* boolean true
> > > still works for this enum?
> > >
> > > CREATE SUBSCRIPTION ... WITH (streaming)
> > > ALTER SUBSCRIPTION ... SET (streaming)
> > >
> >
> > I think before 

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Amit Kapila
On Fri, Oct 7, 2022 at 8:47 AM Masahiko Sawada  wrote:
>
> On Thu, Oct 6, 2022 at 9:04 PM houzj.f...@fujitsu.com wrote:
> >
> > I think the root cause of this kind of deadlock problem is a table
> > structure difference between publisher and subscriber (similar to the
> > unique index difference reported earlier [1]). So, I think we'd better
> > disallow this case. For example, to avoid the reported problem, we could
> > support parallel apply only if pubviaroot is false on the publisher and
> > the replicated tables' types (relkind) are the same between publisher
> > and subscriber.
> >
> > Although it might restrict some use cases, I think it only restricts
> > cases where the partitioned table's structure differs between publisher
> > and subscriber. Users can still use parallel apply when the table
> > structure is the same between publisher and subscriber, which seems
> > acceptable to me. And we can also document that the feature is expected
> > to be used when the tables' structures are the same. Thoughts?
>
> I'm concerned that it could be a big restriction for users. Having
> different partitioned table structures on the publisher and the
> subscriber is a quite common use case.
>
> From the feature's perspective, the root cause seems to be the fact that
> the apply worker does both the receiving and the applying of changes.
> Since it cannot receive subsequent messages while waiting for a lock on a
> table, the parallel apply worker also cannot move forward. If we had a
> dedicated receiver process, it could off-load messages to the workers
> while another process waits for a lock. So I think that separating the
> receiver and the apply worker could be a building block for
> parallel apply.

I think the disadvantage that comes to mind is the overhead of passing
messages between the receiver and applier processes even for non-parallel
cases, and I don't think it is advisable to have separate handling for
non-parallel cases. The other thing is that we need to somehow deal with
feedback messages, which help to move synchronous replicas forward and to
update the subscriber's progress, which in turn keeps the restart point
updated. These messages also act as heartbeat messages between the
walsender and the apply process.

To deal with this, one idea is that we could have two connections to the
walsender process, one from the walreceiver and the other from the apply
process, but to me that could lead to a big increase in resource
consumption and would bring another set of complexities into the system.
Beyond that, I think we have two possibilities: (a) pass all messages to
the leader apply worker and let it decide whether to execute them serially
or hand them to a parallel apply worker. However, that can again deadlock
in the truncate scenario we discussed, because the main apply worker won't
be able to receive new messages once it is blocked at the truncate command.
(b) Have the walreceiver process itself take care of passing streaming
transactions to parallel apply workers; but then the walreceiver needs to
wait at the transaction end to maintain commit order, which means it can
also deadlock if the truncate happens in a streaming xact.

The other alternative is to allow the walreceiver process to wait for the
apply process to finish the transaction and then send the feedback, but
that again seems to be an overhead if we have to do it even for small
transactions, and in particular it can delay synchronous replication.
Even if we don't consider the overhead, it can still lead to a deadlock
because the walreceiver won't be able to move forward in the scenario we
are discussing.

As for your point about having different partition structures on the
publisher and subscriber, I don't know how common that will be once we
have DDL replication. Also, the default value of
publish_via_partition_root is false, which doesn't seem to indicate
that this is a very common case.

We have fixed quite a few issues in this area in the last release or two
which were found during development, so I am not sure whether these setups
are used often in the field, but that could just be a coincidence. Also,
it will only matter if there are large transactions operating on such
tables, and whether those are common is not easy to predict.

-- 
With Regards,
Amit Kapila.




Re: Query Jumbling for CALL and SET utility statements

2022-10-06 Thread Julien Rouhaud
On Thu, Oct 06, 2022 at 11:51:52PM -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > While studying this thread a bit more, I've been reminded of the fact
> > that this would treat different flavors of BEGIN/COMMIT commands (mix
> > of upper/lower characters, etc.) as different entries in
> > pg_stat_statements, and it feels inconsistent to me that we'd begin
> > jumbling the 2PC and savepoint commands with their nodes but not do
> > that for the rest of the commands, even if, as mentioned upthread,
> > applications may not mix grammars.
>
> I've been thinking since the beginning of this thread that there
> was no coherent, defensible rationale being offered for jumbling
> some utility statements and not others.

Only a very small subset causes trouble in real-life scenarios, but I agree
that cherry-picking some utility statements isn't a great approach.

> I wonder if the answer is to jumble them all.  We avoided that
> up to now because it would imply a ton of manual effort and
> future code maintenance ... but now that the backend/nodes/
> infrastructure is largely auto-generated, could we auto-generate
> the jumbling code?

That's a good idea.  Naively, it seems doable as the infrastructure in
gen_node_support.pl already supports everything that should be needed (like
per-member annotation).
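
To make that concrete, here is a sketch of what such an annotation could
look like, in the style gen_node_support.pl already parses.  The attribute
name "query_jumble_ignore" is invented here for illustration; no such
attribute exists at the time of this thread.

/* pg_node_attr() expands to nothing for the compiler; gen_node_support.pl
 * parses it to drive code generation (mirroring src/include/nodes/nodes.h). */
#define pg_node_attr(...)

typedef int NodeTag;			/* stand-ins so this sketch compiles alone */
typedef unsigned int Oid;

/*
 * An abbreviated Const-like node: the hypothetical annotation would tell
 * the generated jumble code to skip the token location, so that otherwise
 * identical queries compute the same query ID.
 */
typedef struct ConstSketch
{
	NodeTag		type;
	Oid			consttype;		/* jumbled: part of the query shape */
	int			location pg_node_attr(query_jumble_ignore);
	/* ... value fields elided ... */
} ConstSketch;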




Re: Query Jumbling for CALL and SET utility statements

2022-10-06 Thread Michael Paquier
On Thu, Oct 06, 2022 at 11:51:52PM -0400, Tom Lane wrote:
> I've been thinking since the beginning of this thread that there
> was no coherent, defensible rationale being offered for jumbling
> some utility statements and not others.

Yeah.  The potential performance impact of all the TransactionStmts
worries me a bit, though. 

> I wonder if the answer is to jumble them all.  We avoided that
> up to now because it would imply a ton of manual effort and
> future code maintenance ... but now that the backend/nodes/
> infrastructure is largely auto-generated, could we auto-generate
> the jumbling code?

Probably.  One part that may be tricky though is the location of the
constants we'd like to make generic, but perhaps this could be handled
by using a dedicated variable type that just maps to int?  It does not
seem like a mandatory requirement to add that everywhere as a first
step, either.
--
Michael
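
One way to read the "dedicated variable type" idea is to give token
locations their own typedef, which the compiler treats as a plain int but
which the generated jumbling code could recognize and skip (or record as a
replaceable constant's position).  A sketch; the name "ParseLoc" is an
assumption for illustration only:

/* Hypothetical: to the compiler this is a plain int; to
 * gen_node_support.pl it could mean "don't jumble this field". */
typedef int ParseLoc;

typedef struct AConstSketch	/* abbreviated A_Const-like node */
{
	int			kind;		/* stand-in for the real tag/value fields */
	ParseLoc	location;	/* byte offset of the literal in the query */
} AConstSketch;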




Re: shadow variables - pg15 edition

2022-10-06 Thread David Rowley
On Fri, 7 Oct 2022 at 13:24, David Rowley  wrote:
> Since I just committed the patch to fix the final warnings, I think we
> should go ahead and commit the patch you wrote to add
> -Wshadow=compatible-local to the standard build flags. I don't mind
> doing this.

Pushed.

David




Re: Query Jumbling for CALL and SET utility statements

2022-10-06 Thread Tom Lane
Michael Paquier  writes:
> While studying this thread a bit more, I've been reminded of the fact
> that this would treat different flavors of BEGIN/COMMIT commands (mix
> of upper/lower characters, etc.) as different entries in
> pg_stat_statements, and it feels inconsistent to me that we'd begin 
> jumbling the 2PC and savepoint commands with their nodes but not do
> that for the rest of the commands, even if, as mentioned upthread,
> applications may not mix grammars.

I've been thinking since the beginning of this thread that there
was no coherent, defensible rationale being offered for jumbling
some utility statements and not others.

I wonder if the answer is to jumble them all.  We avoided that
up to now because it would imply a ton of manual effort and
future code maintenance ... but now that the backend/nodes/
infrastructure is largely auto-generated, could we auto-generate
the jumbling code?

regards, tom lane




Re: Query Jumbling for CALL and SET utility statements

2022-10-06 Thread Michael Paquier
On Thu, Oct 06, 2022 at 10:43:57AM +0200, Drouvot, Bertrand wrote:
> On 10/6/22 8:39 AM, Michael Paquier wrote:
>> I am not seeing SAVEPOINT, RELEASE, ROLLBACK .. TO SAVEPOINT
>> mentioned on this thread.  Would these be worth considering in what
>> gets compiled?  That would cover the remaining bits of
>> TransactionStmt.  The ODBC driver abuses savepoints, for example,
>> so this could be useful for monitoring purposes in such cases.
> 
> Agree. I'll look at those too.

Thanks.

While studying this thread a bit more, I've been reminded of the fact
that this would treat different flavors of BEGIN/COMMIT commands (mix
of upper/lower characters, etc.) as different entries in
pg_stat_statements, and it feels inconsistent to me that we'd begin
jumbling the 2PC and savepoint commands with their nodes but not do
that for the rest of the commands, even if, as mentioned upthread,
applications may not mix grammars.  If they do, one could end up
seeing incorrect reports, and I'd like to think that this would make
the lives of a lot of people easier.

SET/RESET and CALL occur much less frequently than the transaction
commands, so it is fine by me to include both of them under the
utility statement switch.  For OLTP workloads (I've seen quite a bit
of 2PC used across multiple nodes for short transactions with writes
involving more than two remote nodes), with a lot of BEGIN/COMMIT or
even 2PC commands issued, could the performance impact be noticeable?
It may make sense to control these with a different GUC switch, where
we drop the string-only approach under track_utility completely.  In
short, I don't have any objections about the business with SET and
CALL, but the transaction part worries me a bit.  As a first step, we
could cut the cake in two parts, and just focus on SET/RESET and
CALL, which was the main point of discussion of this thread to begin
with.
--
Michael




Re: Record SET session in VariableSetStmt

2022-10-06 Thread Julien Rouhaud
On Fri, Oct 07, 2022 at 10:30:28AM +0900, Michael Paquier wrote:
> On Thu, Oct 06, 2022 at 08:28:27PM +0800, Julien Rouhaud wrote:
> > If we move to a real jumbling of VariableSetStmt, we should keep the rules
> > consistent with the rest of the jumble code and ignore an explicit 
> > "SESSION" in
> > the original command.
> 
> Hm, interesting bit, I should study this area more.  So the query ID
> calculation actually only cares about the contents of the Nodes
> parsed, while the query string used is the one when the entry is
> created for the first time.  It seems like the patch to add
> TransactionStmt nodes into the jumbling misses something here, as we'd
> still compile different query IDs depending on the query string itself
> for simple commands like BEGIN or COMMIT.  I'll reply on the other
> thread about all that.

Ah, indeed we have different TransactionStmtKind for BEGIN and START!
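
For reference, the kinds in question (from src/include/nodes/parsenodes.h
in PostgreSQL 15) already distinguish the two spellings at the node level:

typedef enum TransactionStmtKind
{
	TRANS_STMT_BEGIN,
	TRANS_STMT_START,			/* semantically identical to BEGIN */
	TRANS_STMT_COMMIT,
	TRANS_STMT_ROLLBACK,
	/* ... savepoint and prepared-transaction kinds elided ... */
} TransactionStmtKind;

So once TransactionStmt is jumbled by node contents, BEGIN and begin
collapse into one entry, while BEGIN and START TRANSACTION stay distinct.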




Re: use has_privs_of_role() for pg_hba.conf

2022-10-06 Thread Nathan Bossart
On Fri, Oct 07, 2022 at 11:06:47AM +0900, Michael Paquier wrote:
> Rather than putting that in a separate script, which means
> initializing a new node, etc. could it be better to put that in
> 001_password.pl instead?  It would be cheaper.

Works for me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 3a274e6b4f7fc8bf7eda3579d8bb2378c73f2098 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 1 Apr 2022 11:40:51 -0700
Subject: [PATCH v3 1/1] Use has_privs_of_role for samerole, samegroup, and +
 in pg_hba.conf.

6198420 ensured that has_privs_of_role() is used for predefined
roles, which means that the role privilege inheritance hierarchy is
checked instead of mere role membership.  However, inheritance is
still not respected for pg_hba.conf.  This change alters the
authentication logic to consider role privileges instead of just
role membership.

Do not back-patch.

Author: Nathan Bossart
---
 doc/src/sgml/client-auth.sgml | 18 -
 src/backend/libpq/hba.c   | 19 -
 src/backend/utils/adt/acl.c   | 23 +++
 src/include/utils/acl.h   |  1 +
 src/test/authentication/t/001_password.pl | 48 +++
 5 files changed, 91 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index c6f1b70fd3..b06b57f169 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -221,12 +221,12 @@ hostnogssenc  database  usersameuser specifies that the record
matches if the requested database has the same name as the
requested user.  The value samerole specifies that
-   the requested user must be a member of the role with the same
+   the requested user must have privileges of the role with the same
name as the requested database.  (samegroup is an
obsolete but still accepted spelling of samerole.)
-   Superusers are not considered to be members of a role for the
-   purposes of samerole unless they are explicitly
-   members of the role, directly or indirectly, and not just by
+   Superusers are not considered to have privileges of a role for the
+   purposes of samerole unless they explicitly have
+   privileges of the role, directly or indirectly, and not just by
virtue of being a superuser.
The value replication specifies that the record
matches if a physical replication connection is requested, however, it
@@ -252,10 +252,10 @@ hostnogssenc  database  user+.
(Recall that there is no real distinction between users and groups
in PostgreSQL; a + mark really means
-   match any of the roles that are directly or indirectly members
+   match any of the roles that directly or indirectly have privileges
of this role, while a name without a + mark matches
only that specific role.) For this purpose, a superuser is only
-   considered to be a member of a role if they are explicitly a member
+   considered to have privileges of a role if they explicitly have privileges
of the role, directly or indirectly, and not just by virtue of
being a superuser.
Multiple user names can be supplied by separating them with commas.
@@ -788,9 +788,9 @@ host    all all 192.168.0.0/16  ident map=omicron
 # If these are the only three lines for local connections, they will
 # allow local users to connect only to their own databases (databases
 # with the same name as their database user name) except for administrators
-# and members of role "support", who can connect to all databases.  The file
-# $PGDATA/admins contains a list of names of administrators.  Passwords
-# are required in all cases.
+# and roles with privileges of role "support", who can connect to all
+# databases.  The file $PGDATA/admins contains a list of names of
+# administrators.  Passwords are required in all cases.
 #
 # TYPE  DATABASEUSERADDRESS METHOD
 local   sameuserall md5
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 4637426d62..faa675f4af 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -547,13 +547,13 @@ tokenize_auth_file(const char *filename, FILE *file, List **tok_lines,
 
 
 /*
- * Does user belong to role?
+ * Does user have privileges of role?
  *
  * userid is the OID of the role given as the attempted login identifier.
- * We check to see if it is a member of the specified role name.
+ * We check to see if it has privileges of the specified role name.
  */
 static bool
-is_member(Oid userid, const char *role)
+has_privs(Oid userid, const char *role)
 {
 	Oid			roleid;
 
@@ -566,11 +566,12 @@ is_member(Oid userid, const char *role)
 		return false;			/* if target role not exist, say "no" */
 
 	/*
-	 * See if user is directly or indirectly a member of role. For this

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Masahiko Sawada
On Thu, Oct 6, 2022 at 9:04 PM houzj.f...@fujitsu.com wrote:
>
>
>
> > -Original Message-
> > From: Masahiko Sawada 
> > Sent: Thursday, October 6, 2022 4:07 PM
> > To: Hou, Zhijie/侯 志杰 
> > Cc: Amit Kapila ; Wang, Wei/王 威
> > ; Peter Smith ; Dilip
> > Kumar ; Shi, Yu/侍 雨 ;
> > PostgreSQL Hackers 
> > Subject: Re: Perform streaming logical transactions by background workers 
> > and
> > parallel apply
> >
> > On Tue, Sep 27, 2022 at 9:26 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Saturday, September 24, 2022 7:40 PM Amit Kapila
> >  wrote:
> > > >
> > > > On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila 
> > > > wrote:
> > > > >
> > > > > On Thu, Sep 22, 2022 at 8:59 AM wangw.f...@fujitsu.com
> > > > >  wrote:
> > > > > >
> > > > >
> > > > > Few comments on v33-0001
> > > > > ===
> > > > >
> > > >
> > > > Some more comments on v33-0001
> > > > =
> > > > 1.
> > > > + /* Information from the corresponding LogicalRepWorker slot. */
> > > > + uint16 logicalrep_worker_generation;
> > > > +
> > > > + int logicalrep_worker_slot_no;
> > > > +} ParallelApplyWorkerShared;
> > > >
> > > > Both these variables are read/changed by leader/parallel workers without
> > > > using any lock (mutex). It seems currently there is no problem because 
> > > > of
> > the
> > > > way the patch is using in_parallel_apply_xact but I think it won't be a 
> > > > good
> > idea
> > > > to rely on it. I suggest using mutex to operate on these variables and 
> > > > also
> > check
> > > > if the slot_no is in a valid range after reading it in 
> > > > parallel_apply_free_worker,
> > > > otherwise error out using elog.
> > >
> > > Changed.
> > >
> > > > 2.
> > > >  static void
> > > >  apply_handle_stream_stop(StringInfo s)
> > > >  {
> > > > - if (!in_streamed_transaction)
> > > > + ParallelApplyWorkerInfo *winfo = NULL; TransApplyAction apply_action;
> > > > +
> > > > + if (!am_parallel_apply_worker() &&
> > > > + (!in_streamed_transaction && !stream_apply_worker))
> > > >   ereport(ERROR,
> > > >   (errcode(ERRCODE_PROTOCOL_VIOLATION),
> > > >   errmsg_internal("STREAM STOP message without STREAM START")));
> > > >
> > > > This check won't be able to detect missing stream start messages for 
> > > > parallel
> > > > apply workers apart from the first pair of start/stop. I thought of 
> > > > adding
> > > > in_remote_transaction check along with
> > > > am_parallel_apply_worker() to detect the same but that also won't work
> > > > because the parallel worker doesn't reset it at the stop message.
> > > > Another possibility is to introduce yet another variable for this but 
> > > > that
> > doesn't
> > > > seem worth it. I would like to keep this check simple.
> > > > Can you think of any better way?
> > >
> > > I feel we can reuse the in_streamed_transaction in parallel apply worker 
> > > to
> > > simplify the check there. I tried to set this flag in parallel apply 
> > > worker
> > > when stream starts and reset it when stream stop so that we can directly 
> > > check
> > > this flag for duplicate stream start message and other related things.
> > >
> > > > 3. I think we can skip sending start/stop messages from the leader to 
> > > > the
> > > > parallel worker because unlike apply worker it will process only one
> > > > transaction-at-a-time. However, it is not clear whether that is worth 
> > > > the
> > effort
> > > > because it is sent after logical_decoding_work_mem changes. For now, I 
> > > > have
> > > > added a comment for this in the attached patch but let me if I am 
> > > > missing
> > > > something or if I am wrong.
> > >
> > > I think the suggested comments look good.
> > >
> > > > 4.
> > > > postgres=# select pid, leader_pid, application_name, backend_type from
> > > > pg_stat_activity;
> > > >   pid  | leader_pid | application_name | backend_type
> > > > ---++--+--
> > > >  27624 ||  | logical replication launcher
> > > >  17336 || psql | client backend
> > > >  26312 ||  | logical replication worker
> > > >  26376 || psql | client backend
> > > >  14004 ||  | logical replication worker
> > > >
> > > > Here, the second worker entry is for the parallel worker. Isn't it 
> > > > better if we
> > > > distinguish this by keeping type as a logical replication parallel 
> > > > worker? I
> > think
> > > > for this you need to change bgw_type in logicalrep_worker_launch().
> > >
> > > Changed.
> > >
> > > > 5. Can we name parallel_apply_subxact_info_add() as
> > > > parallel_apply_start_subtrans()?
> > > >
> > > > Apart from the above, I have added/edited a few comments and made a few
> > > > other cosmetic changes in the attached.
> > >
> >
> > While looking at v35 patch, I realized that there are some cases where
> > the logical replication gets stuck depending on 

Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-06 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 8:00 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-06 14:10:46 +0900, Kyotaro Horiguchi wrote:
> > +1. FWIW, the attached is an example of what it looks like if we
> > avoid file format change.
>
> What about going the other direction - simply removing the name from the
> stats entry altogether? I don't actually think we need it anymore. Unless I
> am missing something right now - entirely possible! - the danger that
> pgstat_acquire_replslot() mentions doesn't actually exist [anymore]. After a
> crash we throw away the old stats data, and if a slot is dropped while the
> server is shut down, we'll not load the slot data at startup.

+1. I think it works. Since the replication slot index doesn't change
while the server is running, we can fetch the name from
ReplicationSlotCtl->replication_slots.

If we don't need the name in the stats entry, is pgstat_acquire_replslot()
no longer necessary?

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
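
A sketch of what that lookup could look like, assuming the index stays
valid for the lifetime of the server and that the caller handles any
locking it needs (types as declared in replication/slot.h):

#include "postgres.h"
#include "replication/slot.h"

/* Fetch a slot's name by its index in the shared array; error handling
 * and locking are omitted from this sketch. */
static const char *
slot_name_from_index(int index)
{
	ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[index];

	return NameStr(slot->data.name);
}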




Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Peter Smith
On Thu, Oct 6, 2022 at 10:38 PM Amit Kapila  wrote:
>
> On Fri, Sep 30, 2022 at 1:56 PM Peter Smith  wrote:
> >
> > Here are my review comments for the v35-0001 patch:
> >
> > ==
> >
> > 3. GENERAL
> >
> > (this comment was written after I wrote all the other ones below so
> > there might be some unintended overlaps...)
> >
> > I found the mixed use of the same member names having different
> > meanings to be quite confusing.
> >
> > e.g.1
> > PGOutputData 'streaming' is now a single-char internal representation
> > of the subscription streaming mode ('f','t','p')
> > - bool streaming;
> > + char streaming;
> >
> > e.g.2
> > WalRcvStreamOptions 'streaming' is a C string version of the
> > subscription streaming mode ("on", "parallel")
> > - bool streaming; /* Streaming of large transactions */
> > + char*streaming; /* Streaming of large transactions */
> >
> > e.g.3
> > SubOpts 'streaming' is again like the first example - a single char
> > for the mode.
> > - bool streaming;
> > + char streaming;
> >
> >
> > IMO everything would become much simpler if you did:
> >
> > 3a.
> > Rename "char streaming;" -> "char streaming_mode;"
> >
> > 3b.
> > Re-designed the "char *streaming;" code to also use the single char
> > notation, then also call that member 'streaming_mode'. Then everything
> > will be consistent.
> >
>
> Won't this impact the previous version publisher which already uses
> on/off? We may need to maintain multiple values which would be
> confusing.
>

I only meant that the *internal* struct member names mentioned could
change - not anything exposed as user-visible parameter names or
column names etc. Or were you referring to it as causing unnecessary
troubles for back-patching? Anyway, the main point of this review
comment was #3b. Unless I am mistaken, there is no reason why that one
cannot be changed to use 'char' instead of 'char *', for consistency
across all the same named members.

> >
> > 9. - parallel_apply_can_start
> >
> > +/*
> > + * Returns true, if it is allowed to start a parallel apply worker, false,
> > + * otherwise.
> > + */
> > +static bool
> > +parallel_apply_can_start(TransactionId xid)
> >
> > (The commas are strange)
> >
> > SUGGESTION
> > Returns true if it is OK to start a parallel apply worker, false otherwise.
> >
>
> +1 for this.
> >
> > 28. - logicalrep_worker_detach
> >
> > + /* Stop the parallel apply workers. */
> > + if (am_leader_apply_worker())
> > + {
> >
> > Should that comment rather say like below?
> >
> > /* If this is the leader apply worker then stop all of its parallel
> > apply workers. */
> >
>
> I think this would be just saying what is apparent from the code, so
> not sure if it is an improvement.
>
> >
> > 38. - apply_handle_commit_prepared
> >
> > + *
> > + * Note that we don't need to wait here if the transaction was prepared in 
> > a
> > + * parallel apply worker. Because we have already waited for the prepare to
> > + * finish in apply_handle_stream_prepare() which will ensure all the 
> > operations
> > + * in that transaction have happened in the subscriber and no concurrent
> > + * transaction can create deadlock or transaction dependency issues.
> >   */
> >  static void
> >  apply_handle_commit_prepared(StringInfo s)
> >
> > "worker. Because" -> "worker because"
> >
>
> I think this will make this line too long. Can we think of breaking it
> in some way?

OK, how about below:

Note that we don't need to wait here if the transaction was prepared
in a parallel apply worker. In that case, we have already waited for
the prepare to finish in apply_handle_stream_prepare() which will
ensure all the operations in that transaction have happened in the
subscriber, so no concurrent transaction can cause deadlock or
transaction dependency issues.

>
> >
> > 43.
> >
> >   /*
> > - * Initialize the worker's stream_fileset if we haven't yet. This will be
> > - * used for the entire duration of the worker so create it in a permanent
> > - * context. We create this on the very first streaming message from any
> > - * transaction and then use it for this and other streaming transactions.
> > - * Now, we could create a fileset at the start of the worker as well but
> > - * then we won't be sure that it will ever be used.
> > + * For the first stream start, check if there is any free parallel apply
> > + * worker we can use to process this transaction.
> >   */
> > - if (MyLogicalRepWorker->stream_fileset == NULL)
> > + if (first_segment)
> > + parallel_apply_start_worker(stream_xid);
> >
> > This comment update seems misleading. The
> > parallel_apply_start_worker() isn't just checking if there is a free
> > worker. All that free worker logic stuff is *inside* the
> > parallel_apply_start_worker() function, so maybe no need to mention
> > about it here at the caller.
> >
>
> It will be good to have some comments here instead of completely removing it.
>
> >
> > 39. - apply_handle_stream_abort
> >
> > + /* We receive abort information only when 

create subscription - improved warning message

2022-10-06 Thread Peter Smith
WARNING:  tables were not subscribed, you will have to run ALTER
SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables

~

When I first encountered the above CREATE SUBSCRIPTION warning message,
I thought it was dubious-looking English...

On closer inspection, I think the message has some other things that
could be improved:
a) it is quite long, which IIUC is generally frowned upon
b) IMO most of the text is more like a "hint" about what to do

~

PSA a patch which modifies this warning as follows:

BEFORE

test_sub=# create subscription sub1 connection 'host=localhost
port=test_pub' publication pub1 with (connect = false);
WARNING:  tables were not subscribed, you will have to run ALTER
SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
CREATE SUBSCRIPTION

AFTER

test_sub=# create subscription sub1 connection 'host=localhost
port=test_pub' publication pub1 with (connect = false);
WARNING:  tables were not subscribed
HINT:  You will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION
to subscribe the tables.
CREATE SUBSCRIPTION
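
In ereport() terms the change is essentially a split of one errmsg() into
an errmsg() plus an errhint(); a sketch of the shape (the actual patch may
differ in details such as wording and translation markers):

ereport(WARNING,
		(errmsg("tables were not subscribed"),
		 errhint("You will have to run %s to subscribe the tables.",
				 "ALTER SUBSCRIPTION ... REFRESH PUBLICATION")));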

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: use has_privs_of_role() for pg_hba.conf

2022-10-06 Thread Michael Paquier
On Thu, Oct 06, 2022 at 10:43:43AM -0700, Nathan Bossart wrote:
> Here is a new version of the patch with a test.

Thanks, that helps a lot.  Now I grasp the difference, even though your
previous patch was already switching the documentation to say exactly
that.  On the ground of 6198420, it looks indeed strange not to do the
same for pg_hba.conf.  That makes the whole story more consistent, for
one.

+$node->safe_psql('postgres', "CREATE DATABASE role1;");
+$node->safe_psql('postgres', "CREATE ROLE role1 LOGIN PASSWORD 'pass';");
+$node->safe_psql('postgres', "CREATE ROLE role2 LOGIN SUPERUSER INHERIT IN 
ROLE role1 PASSWORD 'pass';");
+$node->safe_psql('postgres', "CREATE ROLE role3 LOGIN SUPERUSER NOINHERIT IN 
ROLE role1 PASSWORD 'pass';");
So this comes down to role3, where HEAD allows a connection as long as
it is a member of role1 for +role1, samegroup and samerole, but the
patch would prevent the connection when role3 does not inherit the
permissions of role1, even if it is a superuser.

samegroup is a synonym of samerole, but fine by me to keep the full
coverage and all three sets.
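
Spelled out with the two checks from src/include/utils/acl.h, the role3
case above diverges like this (a sketch; in real code the OIDs would come
from get_role_oid()):

#include "postgres.h"
#include "utils/acl.h"

static void
illustrate(Oid role3_oid, Oid role1_oid)
{
	/* true: role3 was granted membership in role1 */
	bool		member = is_member_of_role(role3_oid, role1_oid);

	/* false: NOINHERIT means the privileges don't follow automatically */
	bool		privs = has_privs_of_role(role3_oid, role1_oid);

	(void) member;
	(void) privs;
}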

Rather than putting that in a separate script, which means
initializing a new node, etc. could it be better to put that in
001_password.pl instead?  It would be cheaper.
--
Michael




Re: Record SET session in VariableSetStmt

2022-10-06 Thread Michael Paquier
On Thu, Oct 06, 2022 at 08:28:27PM +0800, Julien Rouhaud wrote:
> If we move to a real jumbling of VariableSetStmt, we should keep the rules
> consistent with the rest of the jumble code and ignore an explicit "SESSION" 
> in
> the original command.

Hm, interesting bit, I should study this area more.  So the query ID
calculation actually only cares about the contents of the Nodes
parsed, while the query string used is the one when the entry is
created for the first time.  It seems like the patch to add
TransactionStmt nodes into the jumbling misses something here, as we'd
still compile different query IDs depending on the query string itself
for simple commands like BEGIN or COMMIT.  I'll reply on the other
thread about all that.
--
Michael




Re: Issue with pg_stat_subscription_stats

2022-10-06 Thread Michael Paquier
On Thu, Oct 06, 2022 at 04:43:43PM -0700, Andres Freund wrote:
> Thanks for bringing this thread up, I'd lost track of it.

The merit goes to Sawada-san here, who has poked me about this thread
:p
--
Michael




Re: Possible solution for masking chosen columns when using pg_dump

2022-10-06 Thread Виктория Шепард
Hi,
I took a look, here are several suggestions for improvement:

- Masking is not core functionality of pg_dump, so it would be better to
put most of the related code in a separate file, like parallel.c or
dumputils.c. This would help slow the growth of the already huge pg_dump
file.

- Also, it can be hard to use a lot of different functions for different
fields; maybe it would be better to specify the functions in a file.

- How will it work for the same field and table names in different schemas?
Can we specify the exact schema for the field?

- misspelling in a word
>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/

- Why did you use 256 here?
> char* table = (char*) pg_malloc(256 * sizeof(char));
Also, you need to allocate one byte more, because you have to store the
terminating '\0' (see the sketch after this review list).

- Instead of addFuncToDatabase you could run your query using something
already defined in fe_utils/query_utils.c. It would also be better to set
up a connection only once and create all the functions over it, since
establishing a connection is a resource-intensive procedure. And there are
a lot of magic numbers; better to leave some comments explaining why there
are 64 or 512.

- It seems that you are not using temp_string
> char   *temp_string = (char*)malloc(256 * sizeof(char));

- Grammar issues
>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns
with this name),
>* name of masking function and name of schema containing this function
(public if not specified)
>*/
the name of its table
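
A minimal illustration of the allocation point above: strlen() does not
count the terminating NUL, so one extra byte must be allocated (pg_malloc
as used throughout pg_dump):

#include <string.h>
#include "common/fe_memutils.h"

static char *
copy_name(const char *src)
{
	size_t		len = strlen(src);
	char	   *dst = pg_malloc(len + 1);	/* +1 for the trailing '\0' */

	memcpy(dst, src, len + 1);				/* copies the NUL too */
	return dst;
}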


On Mon, Oct 3, 2022 at 20:45, Julien Rouhaud wrote:

> Hi,
>
> On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
> >
> > Hello, here's my take on masking data when using pg_dump
> >
> > The main idea is using PostgreSQL functions to replace data during a
> > SELECT.
> > When table data is dumped, a SELECT a,b,c,d ... FROM ... query is
> > generated, and the columns that are marked for masking are replaced with
> > the result of functions on those columns.
> > Example: columns name, count are to be masked, so the query will look
> > like: SELECT id, mask_text(name), mask_int(count), date FROM ...
> >
> > So about the interface: I added 2 more command-line options:
> >
> > --mask-columns, which specifies what columns from what tables will be
> masked
> > usage example:
> > --mask-columns "t1.name, t2.description" - both columns
> > will be masked with their corresponding functions
> > or --mask-columns name - ALL columns named "name" in all
> > dumped tables will be masked with the corresponding function
> >
> > --mask-function, which specifies what functions will mask data
> > usage example:
> > --mask-function mask_int - corresponding columns will be
> masked with function named "mask_int" from default schema (public)
> > or --mask-function my_schema.mask_varchar - same as above
> but with specified schema where the function is stored
> > or --mask-function somedir/filename - the function is
> "defined" here - more on the structure below
>
> FTR I wrote an extension POC [1] last weekend that does that but on the
> backend
> side.  The main advantage is that it's working with any existing versions
> of
> pg_dump (or any client relying on COPY or even plain interactive SQL
> statements), and that the DBA can force a dedicated role to only get a
> masked
> dump, even if they forgot to ask for it.
>
> I only had a quick look at your patch, but it seems that you left some
> TODOs in Russian, which isn't helpful, at least to me.
>
> [1] https://github.com/rjuju/pg_anonymize
>
>
>


Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
David Rowley  writes:
> On Fri, 7 Oct 2022 at 12:35, Tom Lane  wrote:
>> Which leaves me with the attached proposed wording.

> No objections here.

Cool, I'll push in a little bit.

> With these comments I'd be using slot MCTX_UNUSED4_ID first, then I'd
> probably be looking at MCTX_UNUSED5_ID after adjusting wipe_mem to do
> something other than setting bytes to 0x7F.

Well, the only way that you could free up a bitpattern that way is
to make wipe_mem use something ending in 000 or 001.  I'd be against
using 000 because then wiped memory might appear to contain valid
(aligned) pointers.  But perhaps 001 would be ok.

> I'd then use
> MCTX_UNUSED3_ID since that pattern is only used for larger chunks with
> glibc (per your findings).  After that, I'd probably start looking
> into making more than 3 bits available. If that wasn't possible, I'd
> be using MCTX_UNUSED2_ID and at last resort MCTX_UNUSED1_ID.

If we get to having three-quarters or seven-eighths of the bitpatterns
being valid IDs, we'll have precious little ability to detect garbage.
So personally I'd put "find a fourth bit" higher on the priority list.

In any case, we needn't invest more effort here until someone comes up
with a fifth context method ... and I don't recall hearing discussions
of even a fourth one yet.

regards, tom lane
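
To make the bit arithmetic concrete: the method ID lives in the low three
bits of the 64-bit chunk header, so memory wiped with repeating 0x7F bytes
decodes to ID 0b111, the last enum slot.  A standalone sketch:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define METHODID_MASK UINT64_C(0x7)	/* low 3 bits of the chunk header */

int
main(void)
{
	uint64_t	header;

	memset(&header, 0x7F, sizeof(header));	/* what wipe_mem leaves behind */
	printf("method id = %llu\n",			/* prints 7, i.e. 0b111 */
		   (unsigned long long) (header & METHODID_MASK));
	return 0;
}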




Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread David Rowley
On Fri, 7 Oct 2022 at 12:35, Tom Lane  wrote:
> Which leaves me with the attached proposed wording.

No objections here.

With these comments I'd be using slot MCTX_UNUSED4_ID first, then I'd
probably be looking at MCTX_UNUSED5_ID after adjusting wipe_mem to do
something other than setting bytes to 0x7F. I'd then use
MCTX_UNUSED3_ID since that pattern is only used for larger chunks with
glibc (per your findings).  After that, I'd probably start looking
into making more than 3 bits available. If that wasn't possible, I'd
be using MCTX_UNUSED2_ID and at last resort MCTX_UNUSED1_ID.

David




Re: Avoid mix char with bool type in comparisons

2022-10-06 Thread Ranier Vilela
On Thu, Oct 6, 2022 at 9:21 PM, Tom Lane wrote:

> Ranier Vilela  writes:
> > My main concerns is this point:
> >   /* If already matched on earlier page, do no extra work */
> > - if (key->entryRes[j])
> > + if (key->entryRes[j] == GIN_TRUE)
> >   continue;
>
> > If GIN_MAYBE cases are erroneously ignored.
>
> So, if that's a bug, you should be able to produce a test case?
>
No Tom, unfortunately I don't have the knowledge to create a test with
GIN_MAYBE values.

With the patch, all current tests pass.
Either there are no bugs, or there are no tests that detect this specific
case.
And I agree with you, without a test showing the bug,
there's not much chance of the patch progressing.

Unless someone with more knowledge can say that the patch improves
robustness.

regards,
Ranier Vilela


Re: Issue with pg_stat_subscription_stats

2022-10-06 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 9:27 AM Andres Freund  wrote:
>
> On 2022-10-06 16:43:43 -0700, Andres Freund wrote:
> > On 2022-10-06 14:10:56 +0900, Michael Paquier wrote:
> > > On Tue, Jul 12, 2022 at 09:31:16AM +0530, Amit Kapila wrote:
> > > > I am not against backpatching this but OTOH it doesn't appear critical
> > > > enough to block one's work, so not backpatching should be fine.
> > >
> > > We are just talking about the reset timestamp not being set
> > > when the object is created, right?  This does not strike me as
> > > critical, so applying it only on HEAD is fine IMO.  A few months ago,
> > > while in beta, I would have been fine with something applied to
> > > REL_15_STABLE.  Now that we are in RC, that's not worth taking a risk
> > > in my opinion.
> >
> > Agreed.
> >
> > > Amit or Andres, are you planning to double-check and perhaps merge
> > > this patch to take care of the inconsistency?
> >
> > I'll run it through CI and then to master unless somebody pipes up in the
> > meantime.
>
> And pushed. Thanks all!

Thanks!

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Issue with pg_stat_subscription_stats

2022-10-06 Thread Andres Freund
On 2022-10-06 16:43:43 -0700, Andres Freund wrote:
> On 2022-10-06 14:10:56 +0900, Michael Paquier wrote:
> > On Tue, Jul 12, 2022 at 09:31:16AM +0530, Amit Kapila wrote:
> > > I am not against backpatching this but OTOH it doesn't appear critical
> > > enough to block one's work, so not backpatching should be fine.
> > 
> > We are just talking about the reset timestamp not being set
> > when the object is created, right?  This does not strike me as
> > critical, so applying it only on HEAD is fine IMO.  A few months ago,
> > while in beta, I would have been fine with something applied to
> > REL_15_STABLE.  Now that we are in RC, that's not worth taking a risk
> > in my opinion.
> 
> Agreed.
> 
> > Amit or Andres, are you planning to double-check and perhaps merge
> > this patch to take care of the inconsistency?
> 
> I'll run it through CI and then to master unless somebody pipes up in the
> meantime.

And pushed. Thanks all!




Re: meson: Add support for building with precompiled headers

2022-10-06 Thread Andres Freund
Hi,

On 2022-10-06 09:06:42 +0200, Peter Eisentraut wrote:
> On 05.10.22 21:08, Andres Freund wrote:
> > This is a patch split off from the initial meson thread [1] as it's
> > functionally largely independent (as suggested in [2]).
> > 
> > Using precompiled headers substantially speeds up building for windows, due 
> > to
> > the vast amount of headers included via windows.h. A cross build from
> > linux targetting mingw goes from
> 
> These patches look ok to me.  I can't really comment on the Windows details,
> but it sounds all reasonable.

Thanks for reviewing!


> Small issue:
> 
> +override CFLAGS += -DFD_SETSIZE=1024
> 
> (and similar)
> 
> should be CPPFLAGS.

Pushed with that adjusted.

Greetings,

Andres Freund




Re: shadow variables - pg15 edition

2022-10-06 Thread David Rowley
On Thu, 6 Oct 2022 at 13:39, Andres Freund  wrote:
> I attached a patch to add -Wshadow=compatible-local to our set of warnings.

Since I just committed the patch to fix the final warnings, I think we
should go ahead and commit the patch you wrote to add
-Wshadow=compatible-local to the standard build flags. I don't mind
doing this.

Does anyone think we shouldn't do it? Please let it be known soon.

David
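
For anyone who hasn't used it: gcc's -Wshadow=compatible-local warns only
when a local variable shadows another local (or a parameter) of a
compatible type, the variant of shadowing most likely to be a real bug.
A minimal reproducer:

/* Compile with: gcc -c -Wshadow=compatible-local shadow.c */
static int
sum_to(int n)
{
	int			total = 0;

	for (int i = 0; i < n; i++)
	{
		int			total = i;	/* warning: declaration of 'total' shadows a
								 * previous local [-Wshadow=compatible-local] */

		(void) total;
	}
	return total;				/* always 0: the inner 'total' was new */
}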




Re: Avoid mix char with bool type in comparisons

2022-10-06 Thread Tom Lane
Ranier Vilela  writes:
> My main concern is this point:
>   /* If already matched on earlier page, do no extra work */
> - if (key->entryRes[j])
> + if (key->entryRes[j] == GIN_TRUE)
>   continue;

> If GIN_MAYBE cases are erroneously ignored.

So, if that's a bug, you should be able to produce a test case?

regards, tom lane




Re: Avoid mix char with bool type in comparisons

2022-10-06 Thread Ranier Vilela
On Thu, Oct 6, 2022 at 8:52 PM, Tom Lane wrote:

> Ranier Vilela  writes:
> > So, any use of this GinTernaryValue are:
>
> > 1. if (key->entryRes[j]) be FALSE if GIN_FALSE
> > 2. if (key->entryRes[j]) be TRUE if GIN_TRUE
> > 3. if (key->entryRes[j]) be TRUE if GIN_MAYBE
>
> Yeah, that's how it's designed.  Unless you can point to a bug,
> I do not think we ought to change this code.
>
Thanks for answering.

My main concern is this point:
  /* If already matched on earlier page, do no extra work */
- if (key->entryRes[j])
+ if (key->entryRes[j] == GIN_TRUE)
  continue;

If GIN_MAYBE cases are erroneously ignored.

regards,
Ranier Vilela
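
For readers following along, the definitions in question (values as in
src/include/access/gin.h) show why the two tests differ exactly when
entryRes[j] is GIN_MAYBE:

#include <stdbool.h>

typedef char GinTernaryValue;	/* ternary logic stored in a char */

#define GIN_FALSE	0			/* does not match */
#define GIN_TRUE	1			/* matches */
#define GIN_MAYBE	2			/* don't know */

static bool
truthy_test(GinTernaryValue v)
{
	return v != GIN_FALSE;		/* existing code: true for TRUE and MAYBE */
}

static bool
strict_test(GinTernaryValue v)
{
	return v == GIN_TRUE;		/* proposed: true only for TRUE */
}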


Re: Support logical replication of DDLs

2022-10-06 Thread Ajin Cherian
On Fri, Oct 7, 2022 at 8:30 AM Zheng Li  wrote:
>
> > > Some tweaking is made in deparse_drop_command in order to make DROP
> > > TRANSFORM deparsing work. This is because the objidentity captured in
> > > currentEventTriggerState->SQLDropList contains the keyword 'on', for
> > > example "for typename on language lang", but the keyword 'on' is not
> > > needed in the current DROP TRANSFORM syntax. So we need to remove the
> > > 'on' keyword in objidentity. I'm not sure if this is the best way to
> > > handle it, maybe we can consider directly modifying what's captured in
> > > currentEventTriggerState->SQLDropList
> > > so we don't have the "on" keyword to begin with?
> >
> > The exact output format for identity is not set in stone; we should only
> > set it in stone once we have an actual working case for them.  This is
> > the first such use, so it seems OK to make minor modifications (such as
> > removing an undesirable ON) if it's a reasonable change and allows
> > consumer code to be more easily written.
>
> > So, +1 to dropping ON here.  However, if there are further strings that
> > need to be modified, let's see what they are.
>
> Thanks for confirming. Attaching the new patch set that removes the
> undesirable ON from getObjectIdentityParts() for TRANSFORM.
>
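(For illustration, a tiny standalone sketch of the string transformation
described above - dropping the "on" keyword from the captured object
identity so it matches the DROP TRANSFORM syntax. This is ordinary C, not
the patch's actual deparse_drop_command code, and the identity string is
just the example from the discussion.)

#include <stdio.h>
#include <string.h>

/* Remove the first " on " from an identity such as
 * "for typename on language lang". */
static void
strip_on_keyword(char *identity)
{
	char	   *p = strstr(identity, " on ");

	if (p != NULL)
		memmove(p, p + 3, strlen(p + 3) + 1);	/* drop " on", keep the
												 * following space */
}

int
main(void)
{
	char		ident[] = "for typename on language lang";

	strip_on_keyword(ident);
	printf("%s\n", ident);		/* for typename language lang */
	return 0;
}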
Thanks for the new patch-set.
Could you add the changes to patch 1 and patch 2, rather than adding a
new patch? Otherwise, we'll have a separate patch for each command, and
it will take twice the work to keep each one updated for every new
command added.

thanks,
Ajin Cherian
Fujitsu Australia




Re: Avoid mix char with bool type in comparisons

2022-10-06 Thread Tom Lane
Ranier Vilela  writes:
> So, the behavior of any test of this GinTernaryValue is:

> 1. if (key->entryRes[j]) evaluates false for GIN_FALSE
> 2. if (key->entryRes[j]) evaluates true for GIN_TRUE
> 3. if (key->entryRes[j]) evaluates true for GIN_MAYBE

Yeah, that's how it's designed.  Unless you can point to a bug,
I do not think we ought to change this code.

regards, tom lane




Re: Issue with pg_stat_subscription_stats

2022-10-06 Thread Andres Freund
Hi,

On 2022-10-06 14:10:56 +0900, Michael Paquier wrote:
> On Tue, Jul 12, 2022 at 09:31:16AM +0530, Amit Kapila wrote:
> > I am not against backpatching this but OTOH it doesn't appear critical
> > enough to block one's work, so not backpatching should be fine.
> 
> We are just talking about the reset timestamp not being set
> when the object is created, right?  This does not strike me as
> critical, so applying it only on HEAD is fine IMO.  A few months ago,
> while in beta, I would have been fine with something applied to
> REL_15_STABLE.  Now that we are in RC, that's not worth taking a risk
> in my opinion.

Agreed.

> Amit or Andres, are you planning to double-check and perhaps merge
> this patch to take care of the inconsistency?

I'll run it through CI and then to master unless somebody pipes up in the
meantime.

Thanks for bringing this thread up, I'd lost track of it.

Greetings,

Andres Freund




Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
I wrote:
> So I'm still inclined to leave 001 and 010 both unused, but the
> reason why is different than I thought before.

Which leaves me with the attached proposed wording.

regards, tom lane

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index c80236d285..b1a3c74830 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -32,6 +32,11 @@
 #include "utils/memutils_internal.h"
 
 
+static void BogusFree(void *pointer);
+static void *BogusRealloc(void *pointer, Size size);
+static MemoryContext BogusGetChunkContext(void *pointer);
+static Size BogusGetChunkSpace(void *pointer);
+
 /*
  *	  GLOBAL MEMORY			 *
  */
@@ -74,10 +79,42 @@ static const MemoryContextMethods mcxt_methods[] = {
 	[MCTX_SLAB_ID].get_chunk_context = SlabGetChunkContext,
 	[MCTX_SLAB_ID].get_chunk_space = SlabGetChunkSpace,
 	[MCTX_SLAB_ID].is_empty = SlabIsEmpty,
-	[MCTX_SLAB_ID].stats = SlabStats
+	[MCTX_SLAB_ID].stats = SlabStats,
 #ifdef MEMORY_CONTEXT_CHECKING
-	,[MCTX_SLAB_ID].check = SlabCheck
+	[MCTX_SLAB_ID].check = SlabCheck,
 #endif
+
+	/*
+	 * Unused (as yet) IDs should have dummy entries here.  This allows us to
+	 * fail cleanly if a bogus pointer is passed to pfree or the like.  It
+	 * seems sufficient to provide routines for the methods that might get
+	 * invoked from inspection of a chunk (see MCXT_METHOD calls below).
+	 */
+
+	[MCTX_UNUSED1_ID].free_p = BogusFree,
+	[MCTX_UNUSED1_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED1_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED1_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED2_ID].free_p = BogusFree,
+	[MCTX_UNUSED2_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED2_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED2_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED3_ID].free_p = BogusFree,
+	[MCTX_UNUSED3_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED3_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED3_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED4_ID].free_p = BogusFree,
+	[MCTX_UNUSED4_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED4_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED4_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED5_ID].free_p = BogusFree,
+	[MCTX_UNUSED5_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED5_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED5_ID].get_chunk_space = BogusGetChunkSpace,
 };
 
 /*
@@ -125,6 +162,77 @@ static void MemoryContextStatsPrint(MemoryContext context, void *passthru,
 #define MCXT_METHOD(pointer, method) \
 	mcxt_methods[GetMemoryChunkMethodID(pointer)].method
 
+/*
+ * GetMemoryChunkMethodID
+ *		Return the MemoryContextMethodID from the uint64 chunk header which
+ *		directly precedes 'pointer'.
+ */
+static inline MemoryContextMethodID
+GetMemoryChunkMethodID(const void *pointer)
+{
+	uint64		header;
+
+	/*
+	 * Try to detect bogus pointers handed to us, poorly though we can.
+	 * Presumably, a pointer that isn't MAXALIGNED isn't pointing at an
+	 * allocated chunk.
+	 */
+	Assert(pointer == (const void *) MAXALIGN(pointer));
+
+	header = *((const uint64 *) ((const char *) pointer - sizeof(uint64)));
+
+	return (MemoryContextMethodID) (header & MEMORY_CONTEXT_METHODID_MASK);
+}
+
+/*
+ * GetMemoryChunkHeader
+ *		Return the uint64 chunk header which directly precedes 'pointer'.
+ *
+ * This is only used after GetMemoryChunkMethodID, so no need for error checks.
+ */
+static inline uint64
+GetMemoryChunkHeader(const void *pointer)
+{
+	return *((const uint64 *) ((const char *) pointer - sizeof(uint64)));
+}
+
+/*
+ * Support routines to trap use of invalid memory context method IDs
+ * (from calling pfree or the like on a bogus pointer).  As a possible
+ * aid in debugging, we report the header word along with the pointer
+ * address (if we got here, there must be an accessible header word).
+ */
+static void
+BogusFree(void *pointer)
+{
+	elog(ERROR, "pfree called with invalid pointer %p (header 0x%016llx)",
+		 pointer, (long long) GetMemoryChunkHeader(pointer));
+}
+
+static void *
+BogusRealloc(void *pointer, Size size)
+{
+	elog(ERROR, "repalloc called with invalid pointer %p (header 0x%016llx)",
+		 pointer, (long long) GetMemoryChunkHeader(pointer));
+	return NULL;/* keep compiler quiet */
+}
+
+static MemoryContext
+BogusGetChunkContext(void *pointer)
+{
+	elog(ERROR, "GetMemoryChunkContext called with invalid pointer %p (header 0x%016llx)",
+		 pointer, (long long) GetMemoryChunkHeader(pointer));
+	return NULL;/* keep compiler quiet */
+}
+
+static Size
+BogusGetChunkSpace(void *pointer)
+{
+	elog(ERROR, "GetMemoryChunkSpace called with invalid pointer %p (header 0x%016llx)",
+		 pointer, (long long) GetMemoryChunkHeader(pointer));
+	return 0;	/* keep compiler quiet */
+}
+
 
 

Refactor to introduce pg_strcoll().

2022-10-06 Thread Jeff Davis
Refactors to isolate strcoll, wcscoll, and ucol_strcoll into
pg_locale.c which seems like a better place. Most of the buffer
manipulation and equality optimizations are still left in varlena.c.

Patch attached.

I'm not able to easily test on Windows, but it should be close to
correct as I just moved some code around.

Is there a good way to look for regressions (perf or correctness) when
making changes in this area, especially on Windows and/or with strange
collation rules? What I did doesn't look like a problem, but there's
always a chance of a regression.
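(One cheap way to spot-check for ordering regressions - an editorial
sketch, not part of the patch: a standalone harness that prints
normalized strcoll() results for a set of string pairs; run it before and
after the change and diff the output. The locale name and test strings
here are arbitrary choices.)

#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	const char *strs[] = {"cote", "coté", "côte", "côté", "abc", "ABC"};
	int			n = sizeof(strs) / sizeof(strs[0]);

	if (setlocale(LC_COLLATE, "fr_FR.UTF-8") == NULL)
	{
		fprintf(stderr, "locale not available\n");
		return 1;
	}
	for (int i = 0; i < n; i++)
		for (int j = 0; j < n; j++)
		{
			int			r = strcoll(strs[i], strs[j]);

			/* normalize to -1/0/1 so raw magnitudes don't add diff noise */
			printf("%s vs %s: %d\n", strs[i], strs[j], (r > 0) - (r < 0));
		}
	return 0;
}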


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From 644ee9ac28652884d91f17f4fc7755104538d9f1 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 6 Oct 2022 10:46:36 -0700
Subject: [PATCH] Refactor: introduce pg_strcoll().

Isolate collation routines into pg_locale.c and simplify varlena.c.
---
 src/backend/utils/adt/pg_locale.c | 178 +
 src/backend/utils/adt/varlena.c   | 207 +-
 src/include/utils/pg_locale.h |   3 +-
 3 files changed, 182 insertions(+), 206 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..696ff75201 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1639,6 +1639,184 @@ pg_newlocale_from_collation(Oid collid)
 	return cache_entry->locale;
 }
 
+/*
+ * win32_utf8_wcscoll
+ *
+ * Convert UTF8 arguments to wide characters and invoke wcscoll() or
+ * wcscoll_l(). Will allocate on large input.
+ */
+#ifdef WIN32
+#define TEXTBUFLEN		1024
+static int
+win32_utf8_wcscoll(const char *arg1, int len1, const char *arg2, int len2,
+   pg_locale_t locale)
+{
+	char		a1buf[TEXTBUFLEN];
+	char		a2buf[TEXTBUFLEN];
+	char	   *a1p,
+			   *a2p;
+	int			a1len;
+	int			a2len;
+	int			r;
+	int			result;
+
+	if (len1 >= TEXTBUFLEN / 2)
+	{
+		a1len = len1 * 2 + 2;
+		a1p = palloc(a1len);
+	}
+	else
+	{
+		a1len = TEXTBUFLEN;
+		a1p = a1buf;
+	}
+	if (len2 >= TEXTBUFLEN / 2)
+	{
+		a2len = len2 * 2 + 2;
+		a2p = palloc(a2len);
+	}
+	else
+	{
+		a2len = TEXTBUFLEN;
+		a2p = a2buf;
+	}
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+	(errmsg("could not convert string to UTF-16: error code %lu",
+			GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+	(errmsg("could not convert string to UTF-16: error code %lu",
+			GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+	else
+#endif
+		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
+ * headers */
+		ereport(ERROR,
+(errmsg("could not compare Unicode strings: %m")));
+
+	if (a1p != a1buf)
+		pfree(a1p);
+	if (a2p != a2buf)
+		pfree(a2p);
+
+	return result;
+}
+#endif
+
+/*
+ * pg_strcoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. Arguments must be NUL-terminated. If the locale is not specified,
+ * use the database collation.
+ *
+ * If the collation is deterministic, break ties with memcmp(), and then with
+ * the string length.
+ */
+int
+pg_strcoll(const char *arg1, int len1, const char *arg2, int len2,
+		   pg_locale_t locale)
+{
+	int result;
+
+#ifdef WIN32
+	/* Win32 does not have UTF-8, so we need to map to UTF-16 */
+	if (GetDatabaseEncoding() == PG_UTF8
+		&& (!locale || locale->provider == COLLPROVIDER_LIBC))
+	{
+		result = win32_utf8_wcscoll(arg1, len1, arg2, len2, locale);
+	}
+	else
+#endif			/* WIN32 */
+	if (locale)
+	{
+		if (locale->provider == COLLPROVIDER_ICU)
+		{
+#ifdef USE_ICU
+#ifdef HAVE_UCOL_STRCOLLUTF8
+			if (GetDatabaseEncoding() == PG_UTF8)
+			{
+UErrorCode	status;
+
+status = U_ZERO_ERROR;
+result = ucol_strcollUTF8(locale->info.icu.ucol,
+		  arg1, len1,
+		  arg2, len2,
+		  &status);
+if (U_FAILURE(status))
+	ereport(ERROR,
+			(errmsg("collation failed: %s", u_errorName(status))));
+			}
+			else
+#endif
+			{
+int32_t		ulen1,
+			ulen2;
+UChar	   *uchar1,
+		   *uchar2;
+
+ulen1 = icu_to_uchar(&uchar1, arg1, len1);
+ulen2 = icu_to_uchar(&uchar2, arg2, len2);
+
+result = ucol_strcoll(locale->info.icu.ucol,
+	  uchar1, ulen1,
+	  uchar2, ulen2);
+
+pfree(uchar1);
+pfree(uchar2);
+			}
+#else			/* not USE_ICU */
+			/* shouldn't happen */
+			elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif			/* not USE_ICU */
+		}
+		else
+		{
+#ifdef HAVE_LOCALE_T
+			

Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
I wrote:
> FreeBSD 13.0, arm64: Usually the low-order nibble is 0000 or 0001,
> but for some smaller values of N it sometimes comes up as 0010.
> NetBSD 9.2, amd64: results similar to FreeBSD.

I looked into NetBSD's malloc.c, and what I discovered is that
their implementation doesn't have any chunk headers: chunks of
the same size are allocated consecutively within pages, and all
the bookkeeping data is somewhere else.  Presumably FreeBSD is
the same.  So the apparent special case with 0010 is an illusion,
even though I saw it on two different machines (maybe it's a
specific value that we're allocating??)  The most likely case
is 0000 due to the immediately previous word having never been
used (note that like palloc, they round chunk sizes up to powers
of two, so unused space at the end of a chunk is common).  I'm
not sure whether the cases I saw with 0001 are chance artifacts
or reflect some real mechanism, but probably the former.  I
thought for a bit that that might be the effects of wipe_mem
on the previous chunk, but palloc'd storage would never share
the same page as malloc'd storage under this allocator, because
we grab it from malloc in larger-than-page chunks.

However ... after looking into glibc's malloc.c, I find that
it does use a chunk header, and very conveniently the three bits
that we care about are flag bits (at least on 64-bit machines):

chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Size of previous chunk, if unallocated (P clear)  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Size of chunk, in bytes |A|M|P|
  mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| User data starts here...  .

The A bit is only used when threading, and hence should always
be zero in our usage.  The M bit only gets set in chunks large
enough to be separately mmap'd, so when it is set P must be 0.
If M is not set then P seems to usually be 1, although it could
be 0.  So the three possibilities for what we can see under
glibc are 000, 001, 010 (the last only occurring for chunks
larger than 128K).  This squares with experimental results on
my machine --- I'd not thought to try sizes above 100K before.

So I'm still inclined to leave 001 and 010 both unused, but the
reason why is different than I thought before.

Going forward, we could commandeer 010 if we need to without losing
very much debuggability, since malloc'ing more than 128K in a chunk
won't happen often.
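(To see the M bit in action, a quick standalone check along the same
lines as the test program posted elsewhere in this thread - note that
reading the word before a malloc'd pointer is undefined behavior in
general and only meaningful here on glibc; the 1 MB size is simply
something above the default 128K mmap threshold.)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static uint64_t
low_bits(void *p)
{
	return *((uint64_t *) ((char *) p - sizeof(uint64_t))) & 7;
}

int
main(void)
{
	void	   *small = malloc(100);
	void	   *big = malloc(1024 * 1024);	/* above the mmap threshold */

	/* expect 001 (P set) for small, 010 (M set) for big on glibc */
	printf("small: %llu, big: %llu\n",
		   (unsigned long long) low_bits(small),
		   (unsigned long long) low_bits(big));
	free(small);
	free(big);
	return 0;
}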

regards, tom lane




Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-06 Thread Andres Freund
Hi,

On 2022-10-06 14:10:46 +0900, Kyotaro Horiguchi wrote:
> +1. FWIW, the attached is an example of what it looks like if we
> avoid file format change.

What about going the other direction - simply removing the name from the
stats entry altogether. I don't actually think we need it anymore. Unless I am
missing something right now - entirely possible! - the danger that
pgstat_acquire_replslot() mentions doesn't actually exist [anymore]. After a
crash we throw away the old stats data and if a slot is dropped while shut
down, we'll not load the slot data at startup.

The attached is a rough prototype, but should be enough for Jaime to test
and for Horiguchi to review / think about.

Amit, I CCed you, since you've thought a bunch about the dangers in this area
too.

Greetings,

Andres Freund
>From f34c6ed510fbe4c079a0c80e9040cac77c60fd19 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 6 Oct 2022 15:50:36 -0700
Subject: [PATCH v3] pgstat: Remove slotname from PgStat_StatReplSlotEntry

This fixes a bug where the slotname is reset as part of
pgstat_reset_replslot() subsequently causing an allocation failure in
pgstat_report_replslot(). It also is architecturally nicer, because having the
name as part of the stats isn't quite right, on account of not actually being
stats :)

As far as I can tell we have actually worked around all the dangers that
led us to keep the name as part of the stats.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/pgstat.h |  2 ++
 src/include/utils/pgstat_internal.h  |  5 +--
 src/backend/utils/activity/pgstat.c  |  2 +-
 src/backend/utils/activity/pgstat_replslot.c | 36 +---
 4 files changed, 14 insertions(+), 31 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ad7334a0d21..2a85fa0369a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -321,7 +321,9 @@ typedef struct PgStat_StatFuncEntry
 
 typedef struct PgStat_StatReplSlotEntry
 {
+#if IN_PG_15_ONLY_UNUSED
 	NameData	slotname;
+#endif
 	PgStat_Counter spill_txns;
 	PgStat_Counter spill_count;
 	PgStat_Counter spill_bytes;
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 40a36028554..627c1389e4c 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -242,7 +242,8 @@ typedef struct PgStat_KindInfo
 	/*
 	 * For variable-numbered stats with named_on_disk. Optional.
 	 */
-	void		(*to_serialized_name) (const PgStatShared_Common *header, NameData *name);
+	void		(*to_serialized_name) (const PgStat_HashKey *key,
+	   const PgStatShared_Common *header, NameData *name);
 	bool		(*from_serialized_name) (const NameData *name, PgStat_HashKey *key);
 
 	/*
@@ -567,7 +568,7 @@ extern void pgstat_relation_delete_pending_cb(PgStat_EntryRef *entry_ref);
  */
 
 extern void pgstat_replslot_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts);
-extern void pgstat_replslot_to_serialized_name_cb(const PgStatShared_Common *header, NameData *name);
+extern void pgstat_replslot_to_serialized_name_cb(const PgStat_HashKey *key, const PgStatShared_Common *header, NameData *name);
 extern bool pgstat_replslot_from_serialized_name_cb(const NameData *name, PgStat_HashKey *key);
 
 
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 609f0b1ad86..1b97597f17c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1367,7 +1367,7 @@ pgstat_write_statsfile(void)
 			/* stats entry identified by name on disk (e.g. slots) */
 			NameData	name;
 
-			kind_info->to_serialized_name(shstats, );
+			kind_info->to_serialized_name(>key, shstats, );
 
 			fputc('N', fpout);
 			write_chunk_s(fpout, >key.kind);
diff --git a/src/backend/utils/activity/pgstat_replslot.c b/src/backend/utils/activity/pgstat_replslot.c
index 2039ac3147f..793f26fc950 100644
--- a/src/backend/utils/activity/pgstat_replslot.c
+++ b/src/backend/utils/activity/pgstat_replslot.c
@@ -82,12 +82,6 @@ pgstat_report_replslot(ReplicationSlot *slot, const PgStat_StatReplSlotEntry *re
 	shstatent = (PgStatShared_ReplSlot *) entry_ref->shared_stats;
 	statent = >stats;
 
-	/*
-	 * Any mismatch should have been fixed in pgstat_create_replslot() or
-	 * pgstat_acquire_replslot().
-	 */
-	Assert(namestrcmp(>slotname, NameStr(slot->data.name)) == 0);
-
 	/* Update the replication slot statistics */
 #define REPLSLOT_ACC(fld) statent->fld += repSlotStat->fld
 	REPLSLOT_ACC(spill_txns);
@@ -124,7 +118,6 @@ pgstat_create_replslot(ReplicationSlot *slot)
 	 * if we previously crashed after dropping a slot.
 	 */
 	memset(>stats, 0, sizeof(shstatent->stats));
-	namestrcpy(>stats.slotname, NameStr(slot->data.name));
 
 	pgstat_unlock_entry(entry_ref);
 }
@@ -135,27 +128,13 @@ pgstat_create_replslot(ReplicationSlot *slot)
 void
 pgstat_acquire_replslot(ReplicationSlot *slot)
 {
-	PgStat_EntryRef 

Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread David Rowley
On Fri, 7 Oct 2022 at 10:57, Tom Lane  wrote:
> I poked at this some more by creating a function that intentionally
> does pfree(malloc(N)) for various values of N.
>
> RHEL8, x86_64: the low-order nibble of the header is consistently 0001.
>
> macOS 12.6, arm64: the low-order nibble is consistently 0000.
>
> FreeBSD 13.0, arm64: Usually the low-order nibble is 0000 or 0001,
> but for some smaller values of N it sometimes comes up as 0010.
>
> NetBSD 9.2, amd64: results similar to FreeBSD.

Out of curiosity I tried using the attached on a Windows machine and got:

0: 130951
1: 131061
2: 133110
3: 129053
4: 131061
5: 131067
6: 131070
7: 131203

So it seems pretty much entirely inconsistent on that platform.

Also, on an Ubuntu machine I didn't get the consistent 0001 as you got
on your RHEL machine. There were a very small number of 010's there
too:

0: 0
1: 1048569
2: 7
3: 0
4: 0
5: 0
6: 0
7: 0

Despite Windows not being very consistent here, I think it's a useful
change: if our most common platform (Linux/glibc) is fairly consistent,
that'll give us wide coverage for tracking down any buggy code.

In any case, even on Windows (assuming it's completely random) we'll
have a 5 out of 8 chance of getting a nice error message if there are
any bad pointers being passed.

David
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long uint64;
int main(void)
{
size_t count[8] = {0};

size_t n;

for (n = 0; n < 1024*1024; n++)
{
void  *ptr = (void *) malloc(n);
uint64 header;

if (!ptr)
{
fprintf(stderr, "OOM\n");
return -1;
}
header = *((uint64 *) ((char *) ptr - sizeof(uint64)));
count[header & 7]++;
free(ptr);
}
for (n = 0; n <= 7; n++)
printf("%zd: %zd\n", n, count[n]);
return 0;
}


Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-10-06 Thread Melanie Plageman
v31 failed in CI, so
I've attached v32 which has a few issues fixed:
- addressed some compiler warnings I hadn't noticed locally
- autovac launcher and worker do indeed use bulkread strategy if they
  end up starting before critical indexes have loaded and end up doing a
  sequential scan of some catalog tables, so I have changed the
  restrictions on BackendTypes allowed to track IO Operations in
  IOCONTEXT_BULKREAD
- changed the name of the column "fsynced" to "files_synced" to make it
  more clear what unit it is in (and that the unit differs from that of
  the "unit" column)

In an off-list discussion with Andres, he mentioned that he thought
buffers reused by a BufferAccessStrategy should be split from buffers
"acquired" and that "acquired" should be renamed "clocksweeps".

I have started doing this, but for BufferAccessStrategy IO there are a
few choices about how we want to count the clocksweeps:

Currently the following situations are counted under the following
IOContexts and IOOps:

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_ACQUIRE
- reuse a buffer from the ring

IOCONTEXT_SHARED, IOOP_ACQUIRE
- add a buffer to the strategy ring initially
- add a new shared buffer to the ring when all the existing buffers in
  the ring are pinned

And in the new paradigm, I think these are two good options:

1)
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
- add a buffer to the strategy ring initially
- add a new shared buffer to the ring when all the existing buffers in
  the ring are pinned

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
- reuse a buffer from the ring

2)
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
- add a buffer to the strategy ring initially

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
- reuse a buffer from the ring

IOCONTEXT SHARED, IOOP_CLOCKSWEEP
- add a new shared buffer to the ring when all the existing buffers in
  the ring are pinned

However, if we want to differentiate between buffers initially added to
the ring and buffers taken from shared buffers and added to the ring
because all strategy ring buffers are pinned or have a usage count above
one, then we would need to either do so inside of GetBufferFromRing() or
propagate this distinction out somehow (easy enough if we care to do
it).
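(To make the bookkeeping in option 2 concrete, here is a standalone toy
model - not patch code; all names here are illustrative stand-ins for the
patch's IOContext/IOOp counters and the "ring" is faked. Reuses are
counted in the strategy context, while the fallback sweep of shared
buffers is counted under a shared context.)

#include <stdio.h>
#include <stdbool.h>

enum {CTX_STRATEGY, CTX_SHARED, NCTX};
enum {OP_CLOCKSWEEP, OP_REUSE, NOPS};

static long counts[NCTX][NOPS];

/* pretend ring lookup: even ticks find a reusable ring buffer */
static bool
ring_buffer_available(int tick)
{
	return (tick % 2) == 0;
}

static void
get_strategy_buffer(int tick)
{
	if (ring_buffer_available(tick))
		counts[CTX_STRATEGY][OP_REUSE]++;	/* reused a ring buffer */
	else
		counts[CTX_SHARED][OP_CLOCKSWEEP]++;	/* ring buffers all pinned:
												 * sweep shared buffers */
}

int
main(void)
{
	for (int tick = 0; tick < 10; tick++)
		get_strategy_buffer(tick);
	printf("strategy reuses: %ld, shared clocksweeps: %ld\n",
		   counts[CTX_STRATEGY][OP_REUSE],
		   counts[CTX_SHARED][OP_CLOCKSWEEP]);
	return 0;
}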

There are other combinations that I could come up with a justification
for as well, but I wanted to know what other people thought made sense
(and would make sense to users).

- Melanie
From d5859f08316a185f8fb889da8552422d9b609dd1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 6 Oct 2022 12:23:38 -0400
Subject: [PATCH v32 2/3] Aggregate IO operation stats per BackendType

Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.

The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.

Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.

Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and
before reading in the existing stats file to shared memory.

Author: Melanie Plageman 
Reviewed-by: Andres Freund 
Reviewed-by: Justin Pryzby 
Reviewed-by: Kyotaro Horiguchi 
Reviewed-by: Maciek Sakrejda 
Reviewed-by: Lukas Fittl 
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml  |   2 +
 src/backend/utils/activity/pgstat.c   |  35 
 src/backend/utils/activity/pgstat_bgwriter.c  |   7 +-
 .../utils/activity/pgstat_checkpointer.c  |   7 +-
 src/backend/utils/activity/pgstat_io_ops.c| 158 ++
 src/backend/utils/activity/pgstat_relation.c  |  15 +-
 src/backend/utils/activity/pgstat_shmem.c |   4 +
 src/backend/utils/activity/pgstat_wal.c   |   4 +-
 src/backend/utils/adt/pgstatfuncs.c   |   4 +-
 src/include/miscadmin.h   |   2 +
 src/include/pgstat.h  |  83 -
 src/include/utils/pgstat_internal.h   |  36 
 src/tools/pgindent/typedefs.list  |   3 +
 13 files changed, 353 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 342b20ebeb..14dfd650f8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5360,6 +5360,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 the pg_stat_bgwriter
  

Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
I wrote:
> I also avoided using 001: based on my work with converting guc.c to use
> palloc [1], it seems that pfree'ing a malloc-provided pointer is likely
> to see 001 a lot, at least on 64-bit glibc platforms.

I poked at this some more by creating a function that intentionally
does pfree(malloc(N)) for various values of N.

RHEL8, x86_64: the low-order nibble of the header is consistently 0001.

macOS 12.6, arm64: the low-order nibble is consistently 0000.

FreeBSD 13.0, arm64: Usually the low-order nibble is 0000 or 0001,
but for some smaller values of N it sometimes comes up as 0010.

NetBSD 9.2, amd64: results similar to FreeBSD.

I still haven't looked into anybody's source code, but based on these
results I'm inclined to leave both 001 and 010 IDs unused for now.
That'll help the GUC malloc -> palloc transition tremendously, because
people will get fairly clear errors rather than weird assertions
and/or memory corruption.  That'll leave us in a situation where only
one more context ID can be assigned without risk of reducing our error
detection ability, but I'm content with that for now.

regards, tom lane




Re: [Commitfest 2022-09] Date is Over.

2022-10-06 Thread Greg Stark
FWIW, I'm going through some patches looking for ones to review, and
I'm finding that the patches I'm seeing actually did get reviews, some of
them months ago.

If there was any substantial feedback since the last patch was posted I
would say you should change the status to Waiting on Author when moving it
forward rather than leaving it as Needs Review.

Ideally there should be very few patches moved to the next commitfest as
Needs Review - only patches that have not been getting attention, where
the author is blocked waiting on feedback.


Avoid mix char with bool type in comparisons

2022-10-06 Thread Ranier Vilela
Hi,

GIN Indexes:

It defines a char type, GinTernaryValue, with 3 values:
#define GIN_FALSE 0 /* item is not present / does not match */
#define GIN_TRUE 1 /* item is present / matches */
#define GIN_MAYBE 2 /* don't know if item is present / don't know if matches */

So, the behavior of any test of this GinTernaryValue is:

1. if (key->entryRes[j]) evaluates false for GIN_FALSE
2. if (key->entryRes[j]) evaluates true for GIN_TRUE
3. if (key->entryRes[j]) evaluates true for GIN_MAYBE

So GIN matches can fail with GIN_MAYBE, or have I missed something?
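(A standalone illustration of the point - ordinary C, not PostgreSQL
code; the three defines are copied from above. With the ternary stored in
a char, both GIN_TRUE and GIN_MAYBE are truthy, so a bare
if (entryRes[j]) test deliberately treats MAYBE the same as TRUE.)

#include <stdio.h>

#define GIN_FALSE 0
#define GIN_TRUE  1
#define GIN_MAYBE 2

int
main(void)
{
	char		vals[] = {GIN_FALSE, GIN_TRUE, GIN_MAYBE};

	for (int i = 0; i < 3; i++)
		printf("value %d is %s\n", vals[i], vals[i] ? "truthy" : "falsy");
	return 0;
}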

regards,
Ranier Vilela


avoid_mix_bool_with_char_comparator.patch
Description: Binary data


Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
David Rowley  writes:
> However, maybe you've left it this way as you feel it's a decision
> that must be made in the future, perhaps based on how difficult it
> would be to free up another bit?

Yeah, pretty much.  I think it'll be a long time before we run out
of memory context IDs, and it's hard to anticipate the tradeoffs
that will matter at that time.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread David Rowley
On Fri, 7 Oct 2022 at 09:05, Tom Lane  wrote:
>
> Here's a v2 incorporating discussed changes.
>
> In reordering enum MemoryContextMethodID, I arranged to avoid using
> 000 and 111 as valid IDs, since those bit patterns will appear in
> zeroed and wipe_mem'd memory respectively.  Those should probably be
> more-or-less-permanent exclusions, so I added comments about it.

I'm just considering some future developer here that is writing a new
MemoryContext type and there's no more space left and she or he needs
to either use 000 or 111. I think if that was me, I might be unsure if
I should be looking to expand the bit-space to make room. I might
think that based on the word "avoid" in:

> + MCTX_UNUSED1_ID, /* avoid: 000 occurs in never-used memory */
> + MCTX_UNUSED5_ID /* avoid: 111 occurs in wipe_mem'd memory */

but the final sentence in:

> + * dummy entries for unused IDs in the mcxt_methods[] array.  We also try
> + * to avoid using bit-patterns as valid IDs if they are likely to occur in
> + * garbage data.

leads me to believe we're just *trying* to avoid using these bit-patterns.

Also, the comment in mcxt_methods[] might make me believe that it's ok
for me to use them if I really need to.

> + * Unused (as yet) IDs should have dummy entries here.  This allows us to

Based on these comments, I'm not quite sure if I should be completely
avoiding using 000 and 111 or I should just use those last when there
are no other free slots in the array. It might be quite a long time
before someone is in this situation, so should we be more clear?

However, maybe you've left it this way as you feel it's a decision
that must be made in the future, perhaps based on how difficult it
would be to free up another bit?

David




Re: shadow variables - pg15 edition

2022-10-06 Thread David Rowley
On Thu, 6 Oct 2022 at 20:32, Alvaro Herrera  wrote:
>
> On 2022-Oct-06, David Rowley wrote:
> > I didn't want to do it that way because all this code is in a while
> > loop and the outer "now" will be reused after it's set by the code
> > above.  It's not really immediately obvious to me what repercussions
> > that would have, but it didn't seem worth taking any risks.
>
> No, it's re-initialized to zero every time through the loop, so setting
> it to something else at the bottom doesn't have any further effect.

Oh yeah, you're right.

> If it were *not* reinitialized every time through the loop, then what
> would happen is that every iteration in the loop (and each operation
> within) would see exactly the same value of "now", because it's only set
> "lazily" (meaning, if already set, don't change it.)

Going by my misread, that's what I was afraid of changing, but now,
seeing that now = 0 at the start of each loop, I understand that
pg_time_now_lazy() will get an up-to-date time on each iteration.
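(For readers following along, the lazily-initialized timestamp pattern
under discussion looks roughly like this - a simplified sketch modeled on
pgbench's pg_time_now_lazy(), with a stubbed clock so it compiles
standalone.)

#include <stdio.h>
#include <stdint.h>
#include <time.h>

typedef int64_t pg_time_usec_t;

static pg_time_usec_t
pg_time_now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (pg_time_usec_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* fetch the time only if it hasn't been fetched yet */
static void
pg_time_now_lazy(pg_time_usec_t *now)
{
	if (*now == 0)
		*now = pg_time_now();
}

int
main(void)
{
	for (int i = 0; i < 3; i++)
	{
		pg_time_usec_t now = 0; /* re-zeroed each iteration, so each loop
								 * sees a fresh timestamp on first use */

		pg_time_now_lazy(&now);
		printf("iteration %d: %lld\n", i, (long long) now);
	}
	return 0;
}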

I'm happy if you want to change it to use the outer scoped variable
instead of the now2 one.

David




Re: [PATCH v1] [meson] fix some typo to make it more readable

2022-10-06 Thread Andres Freund
Hi,

On 2022-10-06 11:06:06 +0800, Junwang Zhao wrote:
> Seems there are some typo in file src/backend/meson.build comment, pls
> have a look.

Thanks! I editorialized the first sentence a bit more and pushed this.

Regards,

Andres




Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
Here's a v2 incorporating discussed changes.

In reordering enum MemoryContextMethodID, I arranged to avoid using
000 and 111 as valid IDs, since those bit patterns will appear in
zeroed and wipe_mem'd memory respectively.  Those should probably be
more-or-less-permanent exclusions, so I added comments about it.

I also avoided using 001: based on my work with converting guc.c to use
palloc [1], it seems that pfree'ing a malloc-provided pointer is likely
to see 001 a lot, at least on 64-bit glibc platforms.  I've not stuck
my nose into the glibc sources to see how consistent that might be,
but it definitely recurred several times while I was chasing down
places needing adjustment in that patch.

I'm not sure if there are any platform-dependent reasons to avoid
other bit-patterns, but we do still have a little bit of flexibility
left here if anyone has candidates.

regards, tom lane

[1] https://www.postgresql.org/message-id/2982579.1662416866%40sss.pgh.pa.us

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index c80236d285..b1a3c74830 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -32,6 +32,11 @@
 #include "utils/memutils_internal.h"
 
 
+static void BogusFree(void *pointer);
+static void *BogusRealloc(void *pointer, Size size);
+static MemoryContext BogusGetChunkContext(void *pointer);
+static Size BogusGetChunkSpace(void *pointer);
+
 /*
  *	  GLOBAL MEMORY			 *
  */
@@ -74,10 +79,42 @@ static const MemoryContextMethods mcxt_methods[] = {
 	[MCTX_SLAB_ID].get_chunk_context = SlabGetChunkContext,
 	[MCTX_SLAB_ID].get_chunk_space = SlabGetChunkSpace,
 	[MCTX_SLAB_ID].is_empty = SlabIsEmpty,
-	[MCTX_SLAB_ID].stats = SlabStats
+	[MCTX_SLAB_ID].stats = SlabStats,
 #ifdef MEMORY_CONTEXT_CHECKING
-	,[MCTX_SLAB_ID].check = SlabCheck
+	[MCTX_SLAB_ID].check = SlabCheck,
 #endif
+
+	/*
+	 * Unused (as yet) IDs should have dummy entries here.  This allows us to
+	 * fail cleanly if a bogus pointer is passed to pfree or the like.  It
+	 * seems sufficient to provide routines for the methods that might get
+	 * invoked from inspection of a chunk (see MCXT_METHOD calls below).
+	 */
+
+	[MCTX_UNUSED1_ID].free_p = BogusFree,
+	[MCTX_UNUSED1_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED1_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED1_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED2_ID].free_p = BogusFree,
+	[MCTX_UNUSED2_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED2_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED2_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED3_ID].free_p = BogusFree,
+	[MCTX_UNUSED3_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED3_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED3_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED4_ID].free_p = BogusFree,
+	[MCTX_UNUSED4_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED4_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED4_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED5_ID].free_p = BogusFree,
+	[MCTX_UNUSED5_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED5_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED5_ID].get_chunk_space = BogusGetChunkSpace,
 };
 
 /*
@@ -125,6 +162,77 @@ static void MemoryContextStatsPrint(MemoryContext context, void *passthru,
 #define MCXT_METHOD(pointer, method) \
 	mcxt_methods[GetMemoryChunkMethodID(pointer)].method
 
+/*
+ * GetMemoryChunkMethodID
+ *		Return the MemoryContextMethodID from the uint64 chunk header which
+ *		directly precedes 'pointer'.
+ */
+static inline MemoryContextMethodID
+GetMemoryChunkMethodID(const void *pointer)
+{
+	uint64		header;
+
+	/*
+	 * Try to detect bogus pointers handed to us, poorly though we can.
+	 * Presumably, a pointer that isn't MAXALIGNED isn't pointing at an
+	 * allocated chunk.
+	 */
+	Assert(pointer == (const void *) MAXALIGN(pointer));
+
+	header = *((const uint64 *) ((const char *) pointer - sizeof(uint64)));
+
+	return (MemoryContextMethodID) (header & MEMORY_CONTEXT_METHODID_MASK);
+}
+
+/*
+ * GetMemoryChunkHeader
+ *		Return the uint64 chunk header which directly precedes 'pointer'.
+ *
+ * This is only used after GetMemoryChunkMethodID, so no need for error checks.
+ */
+static inline uint64
+GetMemoryChunkHeader(const void *pointer)
+{
+	return *((const uint64 *) ((const char *) pointer - sizeof(uint64)));
+}
+
+/*
+ * Support routines to trap use of invalid memory context method IDs
+ * (from calling pfree or the like on a bogus pointer).  As a possible
+ * aid in debugging, we report the header word along with the pointer
+ * address (if we got here, there must be an accessible header word).
+ */
+static void
+BogusFree(void *pointer)
+{
+	elog(ERROR, "pfree called with invalid pointer %p (header 0x%016llx)",
+		 pointer, (long long) 

Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
Andres Freund  writes:
> On 2022-10-06 15:10:44 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> Maybe worth printing the method ID as well?

>> I doubt it'd be useful.

> I was thinking it could be useful to see whether the bits are likely to be the
> result of wipe_mem(). But I guess for that we should print the whole byte,
> rather than just the method. Perhaps not worth it.

I think printing the whole int64 header word would be appropriate if
we were hoping for something like that.  Still not sure if it's useful.
On the other hand, if control gets there then you are probably in need
of debugging help ...

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Andres Freund
Hi,

On 2022-10-06 15:10:44 -0400, Tom Lane wrote:
> Andres Freund  writes:
> >> +  elog(ERROR, "pfree called with invalid pointer %p", pointer);
>
> > Maybe worth printing the method ID as well?
>
> I doubt it'd be useful.

I was thinking it could be useful to see whether the bits are likely to be the
result of wipe_mem(). But I guess for that we should print the whole byte,
rather than just the method. Perhaps not worth it.

Greetings,

Andres Freund




Re: Lambda expressions (was Re: BUG #15471)

2022-10-06 Thread Paul Ramsey
On Tue, Oct 30, 2018 at 3:20 PM Andres Freund  wrote:
>
> Hi,
>
> On 2018-10-30 16:54:45 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
> > >> Well, a Lambda expression is not something that can be optimized away
> > >> (unless perhaps you can get rid of the need for any of its output Params)
> > >> so I don't see how any of its subexpressions would ever wind up split out
> > >> from inside it.
> >
> > > Isn't that a pretty fundamental requirement for the postgis (and I
> > > presume lots of other) usecases?  What they have is wrapper function
> > > functions like ST_DWithin(geometry, geometry, float) that roughly expand
> > > to something (simplified) like
> >
> > > SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);

Just an FYI on our current state of play. Since Pg12 / PostGIS 3 we
have been using Tom's support function API to slide an index operator
into function calls that we want to have index support (ST_DWithin,
ST_Intersects, ST_Contains, etc, etc). So we are a lot less sensitive
to the vagaries of inlining than we used to be (there's still some of
this in our raster sub-module).
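(For readers unfamiliar with the mechanism: a rough sketch of what such a
support function looks like. The request/response protocol comes from
nodes/supportnodes.h; the function name is hypothetical, and the actual
construction of the indexable clause is elided, so this is illustrative
rather than PostGIS's real code.)

#include "postgres.h"
#include "fmgr.h"
#include "nodes/supportnodes.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(st_dwithin_supportfn);

Datum
st_dwithin_supportfn(PG_FUNCTION_ARGS)
{
	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
	Node	   *ret = NULL;

	if (IsA(rawreq, SupportRequestIndexCondition))
	{
		SupportRequestIndexCondition *req =
			(SupportRequestIndexCondition *) rawreq;

		/*
		 * Here one would build an OpExpr such as "geom && ST_Expand(other,
		 * dist)" over the indexed argument and return it in a one-element
		 * List; leaving req->lossy set to true makes the executor recheck
		 * the exact condition (the original function call).
		 */
		/* ret = (Node *) list_make1(constructed_opexpr); */
	}

	PG_RETURN_POINTER(ret);
}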

Thanks always!

P.


> >
> > > where && is the overlaps operator, and then _ST_DWithin is the accurate
> > > computation. That allows a quick bounding-box (or similar) search via
> > > index, and then an accurate re-check.   But $2 might be the result of a
> > > function (with constant input), and that'll currently prevent the SQL
> > > function from being inlined when the function is expensive. And the
> > > postgis folks *want* its functions be marked expensive, because they
> > > really are - but getting index searches is also important.
> >
> > Hm.  But if we're trying to avoid evaluating $2 twice, how would you
> > expect to allow the ST_Expand and _ST_DWithin calls to get separated?
> > They can't be allowed to get very far apart in the tree, ISTM.
>
> I think postgis would be OK just tacking a 'force-inline this' on
> functions like ST_DWithin, and live with the redundant expansion. Right
> now they can either have accurate costs (and thus trigger
> e.g. parallelism / avoid plans with more costs) but no index scans, or
> the reverse.  While not optimal, it'd be better than what's there now -
> but it's still awfully crummy.
>
> And yes, they can't get allowed to get too far apart. That's the problem
> I was trying - and obviously failing - to describe in the lounge, btw
> :).  If we allowed the LET() to be split at the baserel level (around
> the match_restriction_clauses_to_index() in create_index_paths()), it
> would probably be ok, but that'd be architecturally somewhat
> invasive.
>
>
> > The patch as I had it also dealt with another limitation on function
> > inlining, which was the restrictions on what you can do with strict
> > functions, by means of allowing a LambdaExpr to be "strict"; that
> > is short-circuit to a NULL result if any of the Param values turn
> > out null.
>
> Right, and similar with volatile.
>
>
> > It doesn't seem possible to do what you're talking about in that case
> > either.  Maybe the PostGIS guys don't care, since they're probably OK
> > with not marking their wrapper functions strict, but I'm not sure that
> > that's the whole universe we should be worried about.
>
> I agree, but we also should take care not to regress cases of inlining
> like postgis', which currently can be separated.  If we'd only create the
> new LET node in cases we previously couldn't inline that's obviously not
> a problem.
>
> Greetings,
>
> Andres Freund
>




Re: meson PGXS compatibility

2022-10-06 Thread Andres Freund
Hi,

On 2022-10-06 11:34:26 +0200, Peter Eisentraut wrote:
> On 05.10.22 22:07, Andres Freund wrote:
> > My colleague Bilal has set up testing and verified that a few extensions
> > build with the pgxs compatibility layer, on Linux at least. Currently
> > pg_qualstats, pg_cron, hypopg, orafce, postgis, pg_partman work. He also
> > tested pgbouncer, but for him that failed both with autoconf and meson
> > generated pgxs.
> 
> pgbouncer doesn't use pgxs.

Ah, right. It'd still be interesting to make sure it works, but looks like the
only integration point is pg_config --includedir and pg_config --libdir, so
it should...

Greetings,

Andres Freund




Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
Andres Freund  writes:
> Yea, that makes sense. I wouldn't get rid of the MAXALIGN Assert though - it's
> not replaced by the unused mcxt stuff afaics.

OK.

>> +elog(ERROR, "pfree called with invalid pointer %p", pointer);

> Maybe worth printing the method ID as well?

I doubt it'd be useful.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Andres Freund
Hi,

On 2022-10-06 14:19:21 -0400, Tom Lane wrote:
> One more thing: based on what I saw in working with my pending guc.c
> changes, the assertions in GetMemoryChunkMethodID are largely useless
> for detecting bogus pointers.  I think we should do something more
> like the attached, which will result in a clean failure if the method
> ID bits are invalid.

Yea, that makes sense. I wouldn't get rid of the MAXALIGN Assert though - it's
not replaced by the unused mcxt stuff afaics.


> I'm a little tempted also to rearrange the MemoryContextMethodID enum
> so that likely bit patterns like 000 are not valid IDs.

Yea, I was suggesting that during a review as well. We can still relax it
later if we run out of bits.


> +/*
> + * Support routines to trap use of invalid memory context method IDs
> + * (from calling pfree or the like on a bogus pointer).
> + */
> +static void
> +BogusFree(void *pointer)
> +{
> + elog(ERROR, "pfree called with invalid pointer %p", pointer);
> +}

Maybe worth printing the method ID as well?

Greetings,

Andres Freund




Re: list of acknowledgments for PG15

2022-10-06 Thread Tom Lane
Alvaro Herrera  writes:
> (Also: I think it would be nice to have people's names that are
> originally in scripts other than Latin to appear in both scripts.)

That'd move the goalposts for the docs toolchain rather a long way,
I fear.

As for the point originally made, I'm not sure whether Peter has a
consistent rule for which release cycle people get acknowledged in.
It may be that we're already into the time frame in which Nakamori-san
should be listed in PG v16 acknowledgments instead.  I have no objection
to adding him if we're still in the v15 frame, though.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
One more thing: based on what I saw in working with my pending guc.c
changes, the assertions in GetMemoryChunkMethodID are largely useless
for detecting bogus pointers.  I think we should do something more
like the attached, which will result in a clean failure if the method
ID bits are invalid.

I'm a little tempted also to rearrange the MemoryContextMethodID enum
so that likely bit patterns like 000 are not valid IDs.

While I didn't change it here, I wonder why GetMemoryChunkMethodID is
publicly exposed at all.  AFAICS it could be static inside mcxt.c.

regards, tom lane

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index c80236d285..e82d9b21ba 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -32,6 +32,11 @@
 #include "utils/memutils_internal.h"
 
 
+static void BogusFree(void *pointer);
+static void *BogusRealloc(void *pointer, Size size);
+static MemoryContext BogusGetChunkContext(void *pointer);
+static Size BogusGetChunkSpace(void *pointer);
+
 /*
  *	  GLOBAL MEMORY			 *
  */
@@ -74,10 +79,42 @@ static const MemoryContextMethods mcxt_methods[] = {
 	[MCTX_SLAB_ID].get_chunk_context = SlabGetChunkContext,
 	[MCTX_SLAB_ID].get_chunk_space = SlabGetChunkSpace,
 	[MCTX_SLAB_ID].is_empty = SlabIsEmpty,
-	[MCTX_SLAB_ID].stats = SlabStats
+	[MCTX_SLAB_ID].stats = SlabStats,
 #ifdef MEMORY_CONTEXT_CHECKING
-	,[MCTX_SLAB_ID].check = SlabCheck
+	[MCTX_SLAB_ID].check = SlabCheck,
 #endif
+
+	/*
+	 * Unused (as yet) IDs should have dummy entries here.  This allows us to
+	 * fail cleanly if a bogus pointer is passed to pfree or the like.  It
+	 * seems sufficient to provide routines for the methods that might get
+	 * invoked from inspection of a chunk (see MCXT_METHOD calls below).
+	 */
+
+	[MCTX_UNUSED3_ID].free_p = BogusFree,
+	[MCTX_UNUSED3_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED3_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED3_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED4_ID].free_p = BogusFree,
+	[MCTX_UNUSED4_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED4_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED4_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED5_ID].free_p = BogusFree,
+	[MCTX_UNUSED5_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED5_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED5_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED6_ID].free_p = BogusFree,
+	[MCTX_UNUSED6_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED6_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED6_ID].get_chunk_space = BogusGetChunkSpace,
+
+	[MCTX_UNUSED7_ID].free_p = BogusFree,
+	[MCTX_UNUSED7_ID].realloc = BogusRealloc,
+	[MCTX_UNUSED7_ID].get_chunk_context = BogusGetChunkContext,
+	[MCTX_UNUSED7_ID].get_chunk_space = BogusGetChunkSpace,
 };
 
 /*
@@ -125,6 +162,34 @@ static void MemoryContextStatsPrint(MemoryContext context, void *passthru,
 #define MCXT_METHOD(pointer, method) \
 	mcxt_methods[GetMemoryChunkMethodID(pointer)].method
 
+/*
+ * Support routines to trap use of invalid memory context method IDs
+ * (from calling pfree or the like on a bogus pointer).
+ */
+static void
+BogusFree(void *pointer)
+{
+	elog(ERROR, "pfree called with invalid pointer %p", pointer);
+}
+
+static void *
+BogusRealloc(void *pointer, Size size)
+{
+	elog(ERROR, "repalloc called with invalid pointer %p", pointer);
+}
+
+static MemoryContext
+BogusGetChunkContext(void *pointer)
+{
+	elog(ERROR, "GetMemoryChunkContext called with invalid pointer %p", pointer);
+}
+
+static Size
+BogusGetChunkSpace(void *pointer)
+{
+	elog(ERROR, "GetMemoryChunkSpace called with invalid pointer %p", pointer);
+}
+
 
 /*
  *	  EXPORTED ROUTINES		 *
diff --git a/src/include/utils/memutils_internal.h b/src/include/utils/memutils_internal.h
index 377cee7a84..797409ee94 100644
--- a/src/include/utils/memutils_internal.h
+++ b/src/include/utils/memutils_internal.h
@@ -77,12 +77,20 @@ extern void SlabCheck(MemoryContext context);
  *		The maximum value for this enum is constrained by
  *		MEMORY_CONTEXT_METHODID_MASK.  If an enum value higher than that is
  *		required then MEMORY_CONTEXT_METHODID_BITS will need to be increased.
+ *		For robustness, ensure that MemoryContextMethodID has a value for
+ *		each possible bit-pattern of MEMORY_CONTEXT_METHODID_MASK, and make
+ *		dummy entries for unused IDs in the mcxt_methods[] array.
  */
 typedef enum MemoryContextMethodID
 {
 	MCTX_ASET_ID,
 	MCTX_GENERATION_ID,
 	MCTX_SLAB_ID,
+	MCTX_UNUSED3_ID,
+	MCTX_UNUSED4_ID,
+	MCTX_UNUSED5_ID,
+	MCTX_UNUSED6_ID,
+	MCTX_UNUSED7_ID
 } MemoryContextMethodID;
 
 /*
@@ -110,20 +118,11 @@ extern void MemoryContextCreate(MemoryContext node,
  *		directly precedes 

[PATCH] Check system cache invalidations before each command in transaction

2022-10-06 Thread Vasya
Hello guys.
In the previous discussion [1] we found out that while we are in a
transaction, a function definition is not invalidated if it was redefined
in another session. Here is a patch to fix this. Also, I did a small
performance impact measurement (test.sh in attachment) on my home PC
with an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz. The result is (each
transaction is 10 million calls to functions):

- without patch
latency average = 37087.639 ms
tps = 0.026963

- with patch
latency average = 38793.125 ms
tps = 0.025778

What do you think about it, guys?

[1] 
https://www.postgresql.org/message-id/flat/1205251664297977%40mail.yandex.ru



From 2a9bd029dcb180b01e0274ae6ef18faf647ff777 Mon Sep 17 00:00:00 2001
From: Miha 
Date: Thu, 6 Oct 2022 19:13:26 +0300
Subject: [PATCH] Check system cache invalidations before each command in
 transaction

---
 src/backend/access/transam/xact.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c1ffbd89b8..c7593b6c55 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2939,14 +2939,22 @@ StartTransactionCommand(void)

 			/*
 			 * We are somewhere in a transaction block or subtransaction and
-			 * about to start a new command.  For now we do nothing, but
-			 * someday we may do command-local resource initialization. (Note
-			 * that any needed CommandCounterIncrement was done by the
-			 * previous CommitTransactionCommand.)
+			 * about to start a new command. (Note that any needed
+			 * CommandCounterIncrement was done by the previous
+			 * CommitTransactionCommand.)
 			 */
 		case TBLOCK_INPROGRESS:
-		case TBLOCK_IMPLICIT_INPROGRESS:
 		case TBLOCK_SUBINPROGRESS:
+			AcceptInvalidationMessages();
+			break;
+
+			/*
+			 * We are somewhere in implicit transaction block. For now we do
+			 * nothing, but someday we may do command-local resource
+			 * initialization. (Note that any needed  CommandCounterIncrement
+			 * was done by the previous CommitTransactionCommand.)
+			 */
+		case TBLOCK_IMPLICIT_INPROGRESS:
 			break;

 			/*
--
2.25.1



test.sh
Description: application/shellscript


Re: use has_privs_of_role() for pg_hba.conf

2022-10-06 Thread Nathan Bossart
On Thu, Oct 06, 2022 at 07:33:46AM -0400, Joe Conway wrote:
> On 10/6/22 04:09, Michael Paquier wrote:
>> This patch looks simple, but it is a very sensitive area so I think
>> that we should be really careful.  pg_hba.conf does not have a lot of
>> test coverage, so I'd really prefer if we add something to see the
>> difference of behavior and check the behavior that we are switching
>> here.
> 
> Agreed

Here is a new version of the patch with a test.

>> Joe, you are registered as a reviewer and committer of this patch, by
>> the way.  Are you planning to look at it?
> 
> I am meaning to get to it, but as you say wanted to spend some time to
> understand the nuances and life keeps getting in the way. I will try to
> prioritize it over the next week.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 8ab3b80b61790045277f5c9fbc6214cad5c14b58 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 1 Apr 2022 11:40:51 -0700
Subject: [PATCH v2 1/1] Use has_privs_of_role for samerole, samegroup, and +
 in pg_hba.conf.

6198420 ensured that has_privs_of_role() is used for predefined
roles, which means that the role privilege inheritance hierarchy is
checked instead of mere role membership.  However, inheritance is
still not respected for pg_hba.conf.  This change alters the
authentication logic to consider role privileges instead of just
role membership.

Do not back-patch.

Author: Nathan Bossart
---
 doc/src/sgml/client-auth.sgml  |  18 ++---
 src/backend/libpq/hba.c|  19 ++---
 src/backend/utils/adt/acl.c|  23 ++
 src/include/utils/acl.h|   1 +
 src/test/authentication/meson.build|   1 +
 src/test/authentication/t/004_privs.pl | 101 +
 6 files changed, 145 insertions(+), 18 deletions(-)
 create mode 100644 src/test/authentication/t/004_privs.pl

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index c6f1b70fd3..b06b57f169 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -221,12 +221,12 @@ hostnogssenc  database  usersameuser specifies that the record
matches if the requested database has the same name as the
requested user.  The value samerole specifies that
-   the requested user must be a member of the role with the same
+   the requested user must have privileges of the role with the same
name as the requested database.  (samegroup is an
obsolete but still accepted spelling of samerole.)
-   Superusers are not considered to be members of a role for the
-   purposes of samerole unless they are explicitly
-   members of the role, directly or indirectly, and not just by
+   Superusers are not considered to have privileges of a role for the
+   purposes of samerole unless they explicitly have
+   privileges of the role, directly or indirectly, and not just by
virtue of being a superuser.
The value replication specifies that the record
matches if a physical replication connection is requested, however, it
@@ -252,10 +252,10 @@ hostnogssenc  database  user+.
(Recall that there is no real distinction between users and groups
in PostgreSQL; a + mark really means
-   match any of the roles that are directly or indirectly members
+   match any of the roles that directly or indirectly have privileges
of this role, while a name without a + mark matches
only that specific role.) For this purpose, a superuser is only
-   considered to be a member of a role if they are explicitly a member
+   considered to have privileges of a role if they explicitly have privileges
of the role, directly or indirectly, and not just by virtue of
being a superuser.
Multiple user names can be supplied by separating them with commas.
@@ -788,9 +788,9 @@ hostall all 192.168.0.0/16  ident map=omicro
 # If these are the only three lines for local connections, they will
 # allow local users to connect only to their own databases (databases
 # with the same name as their database user name) except for administrators
-# and members of role "support", who can connect to all databases.  The file
-# $PGDATA/admins contains a list of names of administrators.  Passwords
-# are required in all cases.
+# and roles with privileges of role "support", who can connect to all
+# databases.  The file $PGDATA/admins contains a list of names of
+# administrators.  Passwords are required in all cases.
 #
 # TYPE  DATABASEUSERADDRESS METHOD
 local   sameuserall md5
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 4637426d62..faa675f4af 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -547,13 +547,13 @@ tokenize_auth_file(const char *filename, FILE *file, List **tok_lines,
 
 
 /*

Re: list of acknowledgments for PG15

2022-10-06 Thread Alvaro Herrera
On 2022-Oct-07, Fujii Masao wrote:

> On 2022/09/08 21:13, Peter Eisentraut wrote:
> > 
> > Attached is the plain-text list of acknowledgments for the PG15 release 
> > notes, current through REL_15_BETA4.  Please check for problems such as 
> > wrong sorting, duplicate names in different variants, or names in the wrong 
> > order etc.  (Note that the current standard is given name followed by 
> > surname, independent of cultural origin.)
> 
> I'd propose to add "Tatsuhiro Nakamori", whose patch was back-patched to v15
> last week, to the list. Thoughts?
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=249b0409b181311bb1c375311e43eb767b5c3bdd

I agree, he has made some other contributions in the list, even if his
email does not yet show up in the git log.

(I think it would be good to have people's full name when writing the
commit messages, too ...)

(Also: I think it would be nice to have people's names that are
originally in scripts other than Latin to appear in both scripts.)

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-10-06 Thread Melanie Plageman
v31 attached.
I've also addressed the failing test mentioned by Andres in [1].

On Fri, Sep 30, 2022 at 7:18 PM Lukas Fittl  wrote:
>
> On Tue, Sep 27, 2022 at 11:20 AM Melanie Plageman  
> wrote:
>
> First of all, I'm excited about this patch, and I think it will be a big help 
> to understand better which part of Postgres is producing I/O (and why).
>

Thanks! I'm happy to hear that.

> I've paired up with Maciek (CCed) on a review of this patch and had a few 
> comments, focused on the user experience:
>

Thanks for taking the time to review!

> The term "strategy" as an "io_context" is hard to understand, as its not a 
> concept an end-user / DBA would be familiar with. Since this comes from 
> BufferAccessStrategyType (i.e. anything not NULL/BAS_NORMAL is treated as 
> "strategy"), maybe we could instead split this out into the individual 
> strategy types? i.e. making "strategy" three different I/O contexts instead: 
> "shared_bulkread", "shared_bulkwrite" and "shared_vacuum", retaining "shared" 
> to mean NULL / BAS_NORMAL.

I have split strategy out into "vacuum", "bulkread", and "bulkwrite". I
thought it was less clear with shared as a prefix. If we were to have
BufferAccessStrategies in the future which acquire local buffers (for
example), we could start prefixing the columns to differentiate.

This opened up some new questions about which BufferAccessStrategies
will be employed by which BackendTypes and which IOOps will be valid in
a given BufferAccessStrategy.

I've excluded IOCONTEXT_BULKREAD and IOCONTEXT_BULKWRITE for the autovacuum
worker -- though those may not be inherently invalid, they don't seem to
happen now and would only add extra rows to the view.

I've also disallowed IOOP_EXTEND for IOCONTEXT_BULKREAD.
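
As a rough sketch, the validity rules described above could live in a helper
like this (the function name is an assumption for illustration, not the
patch's actual code):

static bool
pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
{
	/* A bulkread strategy never extends a relation. */
	if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
		return false;

	/* Autovacuum workers are expected to use only the vacuum strategy. */
	if (bktype == B_AUTOVAC_WORKER &&
		(io_context == IOCONTEXT_BULKREAD || io_context == IOCONTEXT_BULKWRITE))
		return false;

	return true;
}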

> Separately, could we also track buffer hits without incurring extra overhead? 
> (not just allocs and reads) -- Whilst we already have shared read and hit 
> counters in a few other places, this would help make the common "What's my 
> cache hit ratio" question more accurate to answer in the presence of 
> different shared buffer access strategies. Tracking hits could also help for 
> local buffers (e.g. to tune temp_buffers based on seeing a low cache hit 
> ratio).

I've started tracking hits and added "hit" to the view.
I added IOOP_HIT and IOOP_ACQUIRE to those IOOps disallowed for
checkpointer and bgwriter.

I have added tests for hit, but I'm not sure I can keep them. It seems
like they might fail if the blocks are evicted between the first and
second time I try to read them.

> Additionally, some minor notes:
>
> - Since the stats are counting blocks, it would make sense to prefix the view 
> columns with "blks_", and word them in the past tense (to match current 
> style), i.e. "blks_written", "blks_read", "blks_extended", "blks_fsynced" 
> (realistically one would combine this new view with other data e.g. from 
> pg_stat_database or pg_stat_statements, which all use the "blks_" prefix, and 
> stop using pg_stat_bgwriter for this which does not use such a prefix)

I have changed the column names to be in the past tense.

There are no columns equivalent to "dirty" or "misses" from the other
views containing information on buffer hits/block reads/writes/etc. I'm
not sure whether or not those make sense in this context.

Because we want to add non-block-oriented IO in the future (like
temporary file IO) to this view and want to use the same "read",
"written", "extended" columns, I would prefer not to prefix the columns
with "blks_". I have added a column "unit" which would contain the unit
in which read, written, and extended are in. Unfortunately, fsyncs are
not per block, so "unit" doesn't really work for this. I documented
this.

The most correct thing to do to accommodate block-oriented and
non-block-oriented IO would be to specify all the values in bytes.
However, I would like this view to be usable visually (as opposed to
just in scripts and by tools). The only current value of unit is
"block_size" which could potentially be combined with the value of the
GUC to get bytes.

I've hard-coded the string "block_size" into the view generation
function pg_stat_get_io(), so, if this idea makes sense, perhaps I
should do something better there.

> - "alloc" as a name doesn't seem intuitive (and it may be confused with 
> memory allocations) - whilst this is already named this way in 
> pg_stat_bgwriter, it feels like this is an opportunity to eventually 
> deprecate the column there and make this easier to understand - specifically, 
> maybe we can clarify that this means buffer *acquisitions*? (either by 
> renaming the field to "blks_acquired", or clarifying in the documentation)

I have renamed it to acquired. It doesn't overlap completely with
buffers_alloc in pg_stat_bgwriter, so I didn't mention that in the docs.

> - Assuming we think this view could realistically cover all I/O produced by 
> Postgres in the future (thus warranting the name "pg_stat_io"), it may be 
> best to have an 

Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
David Rowley  writes:
> I did see that PostGIS does use
> MemoryContextContains(), though I didn't look at their code to figure
> out if they're always passing it a pointer to an allocated chunk.

As far as I can tell from a cursory look, they should be able to use
the GetMemoryChunkContext workaround, because they are just doing this
with SHARED_GSERIALIZED objects, which seem to always be palloc'd.
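
In other words, a minimal sketch of the workaround, valid only when the
pointer is known to be a palloc'd chunk:

static bool
chunk_is_in_context(void *ptr, MemoryContext context)
{
	/* GetMemoryChunkContext() is only safe on genuine chunk pointers. */
	return GetMemoryChunkContext(ptr) == context;
}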

regards, tom lane




Re: Support logical replication of DDLs

2022-10-06 Thread Alvaro Herrera
On 2022-Oct-06, Zheng Li wrote:

> Some tweaking is made in deparse_drop_command in order to make DROP
> TRANSFORM deparsing work. This is because the objidentity captured in
> currentEventTriggerState->SQLDropList contains the keyword 'on', for
> example "for typename on language lang", but the keyword 'on' is not
> needed in the current DROP TRANSFORM syntax. So we need to remove the
> 'on' keyword in objidentity. I'm not sure if this is the best way to
> handle it, maybe we can consider directly modifying what's captured in
> currentEventTriggerState->SQLDropList
> so we don't have the "on" keyword to begin with?

The exact output format for identity is not set in stone; we should only
set it in stone once we have an actual working case for them.  This is
the first such use, so it seems OK to make minor modifications (such as
removing an undesirable ON) if it's a reasonable change and allows
consumer code to be more easily written.

So, +1 to dropping ON here.  However, if there are further strings that
need to be modified, let's see what they are.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
 It does it in a really, really complicated way
 why does it need to be complicated?
 Because it's MakeMaker.




Re: pgsql: Avoid improbable PANIC during heap_update.

2022-10-06 Thread Alvaro Herrera
On 2022-Sep-29, Jaime Casanova wrote:

> This doesn't look so improbable, because I saw it at least 3 times with
> v15beta4.

To further the case of the not-so-low-probability, we have customers
that are hitting this about once per day, with Postgres 14 ... so their
systems are crashing all the time :-(  We've collected a bunch of
backtraces, and while I didn't analyze all of them, I hear that they all
look related to this fix.

Not good.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Maybe there's lots of data loss but the records of data loss are also lost.
(Lincoln Yeoh)




Re: list of acknowledgments for PG15

2022-10-06 Thread Fujii Masao



On 2022/09/08 21:13, Peter Eisentraut wrote:


Attached is the plain-text list of acknowledgments for the PG15 release notes, 
current through REL_15_BETA4.  Please check for problems such as wrong sorting, 
duplicate names in different variants, or names in the wrong order etc.  (Note 
that the current standard is given name followed by surname, independent of 
cultural origin.)


I'd propose to add "Tatsuhiro Nakamori", whose patch was back-patched to v15
last week, to the list. Thoughts?

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=249b0409b181311bb1c375311e43eb767b5c3bdd

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From 74a89cac092969208067143d25743b40ddbfbfb0 Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Sat, 1 Oct 2022 12:51:45 +0900
Subject: [PATCH] Update list of acknowledgments in release notes.

---
 doc/src/sgml/release-15.sgml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/doc/src/sgml/release-15.sgml b/doc/src/sgml/release-15.sgml
index 9b752e26f2..7b18bfefa2 100644
--- a/doc/src/sgml/release-15.sgml
+++ b/doc/src/sgml/release-15.sgml
@@ -3900,6 +3900,7 @@ Author: Etsuro Fujita 
 Takamichi Osumi
 Takayuki Tsunakawa
 Takeshi Ideriha
+Tatsuhiro Nakamori
 Tatsuhito Kasahara
 Tatsuo Ishii
 Tatsuro Yamada
-- 
2.37.1



Re: Testing DDL Deparser

2022-10-06 Thread Alvaro Herrera
Hello

Overall, many thanks for working on this.  I hope that the objectives
can be fulfilled, so that we can have dependable DDL replication soon.

I haven't read the patch at all, so I can't comment on what you've done,
but I have comments to some of your questions:

On 2022-Oct-05, Runqi Tian wrote:

> I came up with some ideas during the investigation and want to collect
> some feedback:
> 1, Currently we want to utilize the test cases from regression tests.
> However you will find that many test cases end with DROP commands. In
> current deparser testing approach proposed in [2] and [3], we compare
> the pg_dump schema results between the original SQL scripts and
> deparser generated commands. Because most test cases end with DROP
> command, the schema will not be shown in pg_dump, so the test coverage
> is vastly reduced. Any suggestion to this problem?

The whole reason some objects are *not* dropped is precisely so that
this type of testing has something to work on.  If we find that there
are object types that would be good to have in order to increase
coverage, what we can do is change the .sql files to not drop those
objects.  This should be as minimal as possible (i.e. we don't need tons
of tables that are all essentially identical, just a representative
bunch of objects of different types).

> 2, We found that DROP commands are not returned by the
> pg_event_trigger_ddl_commands() function in a ddl_command_end trigger,
> even though the ddl_command_end trigger itself fires for them.
> Currently, we catch DROP commands in a sql_drop trigger. It’s unclear
> why the pg_event_trigger_ddl_commands() function is designed to not
> return DROP commands.

Yeah, the idea is that a DDL processor needs to handle both the DROP and
the other cases separately in these two event types.  As I recall, we
needed to handle DROP separately because there was no way to collect the
necessary info otherwise.

> 3, For DDL commands unsupported by the deparser, the current
> implementation just skips them silently, so we cannot detect
> unsupported DDL commands easily. Customers may also want
> deparser-related features like logical replication to execute in a
> strict mode, so that the system can warn them when the deparser cannot
> deparse some DDL command. So I propose to introduce a new GUC such as
> “StopOnDeparserUnsupportedCommand = true/false” to allow the deparser
> to execute in strict mode, in which an unsupported DDL command will
> raise an error.

No opinion on this.  I don't think the deparser should be controlled by
individual GUCs, since it will probably have multiple simultaneous uses.
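
For concreteness, the strict mode proposed in (3) would amount to something
like the following (the GUC name comes from the proposal; the helper and the
error message are hypothetical):

bool		StopOnDeparserUnsupportedCommand = false;	/* proposed GUC */

/* Called where the deparser currently skips a command silently. */
static void
report_unsupported_command(const char *tag)
{
	if (StopOnDeparserUnsupportedCommand)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("deparsing of DDL command \"%s\" is not supported", tag)));

	/* Otherwise fall through and skip it, as the code does today. */
}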

> 4, We found that the event trigger function
> pg_event_trigger_ddl_commands() only returns subcommands, and deparser
> is deparsing subcommands returned by this function. The deparser works
> on subcommand level by using this function, but the deparser is
> designed to deparse the complete command to JSON output. So there is a
> mismatch here, what do you think about this problem? Should the
> deparser work at subcommand level? Or should we provide some event
> trigger function which can return the complete command instead of
> subcommands?

Not clear on what this means.  Are you talking about ALTER TABLE
subcommands?  If so, then what you have to do (ISTM) is construct a
single ALTER TABLE command containing several subcommands, when that is
what is being deparsed; the reason for the user to submit several
subcommands is to apply optimizations such as avoiding multiple table
rewrites for several operations, when they can all share a single table
rewrite.  Therefore I think replaying such a command should try and do
the same, if at all possible.


-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Crear es tan difícil como ser libre" (Elsa Triolet)




Re: [Commitfest 2022-09] Date is Over.

2022-10-06 Thread Ibrar Ahmed
On Wed, Oct 5, 2022 at 3:01 PM Julien Rouhaud  wrote:

> Hi,
>
> On Wed, Oct 05, 2022 at 02:50:58PM +0500, Ibrar Ahmed wrote:
> > On Wed, 5 Oct 2022 at 1:43 PM, Alvaro Herrera 
> > wrote:
> >
> > > On 2022-Oct-03, Ibrar Ahmed wrote:
> > >
> > > > The date of the current commitfest is over, here is the current
> status of
> > > > the  "September 2022 commitfest."
> > > > There were 296 patches in the commitfest and 58 got committed.
> > >
> > > Are you moving the open patches to the next commitfest, closing some as
> > > RwF, etc?  I'm not clear what the status is, for the November
> commitfest.
> >
> >
> > I am also not clear on whether I should move them or wait till November.
> > Can anybody guide me?
>
> The CF should be marked as closed, and its entries fully processed within
> a few
> days after its final day.
>
> The general rule is that patches that have been waiting on authors for 2
> weeks
> or more without any answer from the author(s) should get returned with
> feedback with some message to the author, and the rest is moved to the next
> commitfest.  If you have the time to look at some patches and see if they
> need
> something else, that's always better.
>
> Thanks for your response; I will do that in two days.


-- 

Ibrar Ahmed.
Senior Software Engineer, PostgreSQL Consultant.


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-10-06 Thread Bruce Momjian
On Thu, Oct  6, 2022 at 01:33:33PM +0530, Bharath Rupireddy wrote:
> On Thu, Oct 6, 2022 at 2:30 AM Bruce Momjian  wrote:
> >
> > As I highlighted above, by default you notify the administrator that a
> > synchronous replica is not responding and then ignore it.  If it becomes
> > responsive again, you notify the administrator again and add it back as
> > a synchronous replica.
> >
> > > command in any form may pose security risks. I'm not sure at this
> > > point how this new timeout is going to work alongside
> > > wal_sender_timeout.
> >
> > We have archive_command, so I don't see a problem with another shell
> > command.
> 
> Why do we need a new command to inform the admin/user about a sync
> replication being ignored (from sync quorum) for not responding or
> acknowledging for a certain amount of time in SyncRepWaitForLSN()?
> Can't we just add an extra column or use existing sync_state in
> pg_stat_replication()? We can either introduce a new state such as
> temporary_async or just use the existing state 'potential' [1]. A
> problem is that the server has to be monitored for this extra, new
> state. If we do this, we don't need another command to report.

Yes, that is a good point.  I assumed people would want notification
immediately rather than waiting for monitoring to notice it.  Consider
if you monitor every five seconds but the primary loses sync and goes
down during that five-second interval --- there would be no way to know
if sync stopped and reported committed transactions to the client before
the primary went down.  I would love to just rely on monitoring but I am
not sure that is sufficient for this use-case.

Of course, if email is being sent it might be still in the email queue
when the primary goes down, but I guess if I was doing it I would make
sure the email was delivered _before_ returning.  The point is that we
would not disable the sync and acknowledge the commit to the client
until the notification command returns success --- that kind of
guarantee is hard to do with monitoring.
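
In pseudo-C, the ordering being described might look like this (every name
below is hypothetical, for illustration only):

if (!WaitForSyncRepAck(commit_lsn, sync_replica_timeout))
{
	/* Block until the administrator is known to have been told. */
	while (RunNotifyCommand(replica_unresponsive_command) != 0)
		pg_usleep(1000000L);	/* retry until the command succeeds */

	/* Only then degrade the standby to async and ack the commit. */
	DemoteStandbyToAsync();
}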

These are good discussions to have --- maybe I am wrong.

> > > > Once we have that, we can consider removing the cancel ability while
> > > > waiting for synchronous replicas (since we have the timeout) or make it
> > > > optional.  We can also consider how to notify the administrator during
> > > > query cancel (if we allow it), backend abrupt exit/crash, and
> > >
> > > Yeah. If we have the
> > > timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> > > the users can then choose to disable processing query cancels/proc
> > > dies while waiting for sync replication in SyncRepWaitForLSN().
> >
> > Yes.  We might also change things so a query cancel that happens during
> > sychronous replica waiting can only be done by an administrator, not the
> > session owner.  Again, lots of design needed here.
> 
> Yes, we need infrastructure to track who issued the query cancel or
> proc die and so on. IMO, it's not a good way to allow/disallow query
> cancels or CTRL+C based on role types - superusers or users with
> replication roles or users who are members of any of predefined roles.
> 
> In general, it is the walsender serving sync standby that has to mark
> itself as async standby by removing itself from
> synchronous_standby_names, reloading config variables and waking up
> the backends that are waiting in syncrep wait queue for it to update
> LSN.
> 
> And, the new auto removal timeout should always be set to less than
> wal_sender_timeout.
> 
> All that said, imagine we have
> timeout-and-auto-removal-of-standby-from-sync-standbys-list solution
> in one or the other forms with auto removal timeout set to 5 minutes,
> any of following can happen:
> 
> 1) query is stuck waiting for sync standby ack in SyncRepWaitForLSN(),
> no query cancel or proc die interrupt is arrived, the sync standby is
> made an async standby after the timeout, i.e. 5 minutes.
> 2) query is stuck waiting for sync standby ack in SyncRepWaitForLSN(),
> say for about 3 minutes, then a query cancel or proc die interrupt
> arrives. Should we immediately process it or wait for the timeout to
> happen (2 more minutes) and then process the interrupt? If we
> immediately process the interrupts, then the
> locally-committed-but-not-replicated-to-sync-standby problems
> described upthread [2] are left unresolved.

I have a feeling once we have the timeout, we would disable query cancel
when we are in this stage since it is canceling a committed query.  The
timeout would cancel the sync but at least the administrator would know.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Reducing the chunk header sizes on all memory context types

2022-10-06 Thread Tom Lane
David Rowley  writes:
> On Wed, 5 Oct 2022 at 04:55, Tom Lane  wrote:
>> After studying the existing usages of MemoryContextContains, I think
>> there is a better answer, which is to just nuke them.

> I was under the impression you wanted to keep that function around in
> cassert builds for some of the guc.c changes you were making.

I would've liked to have it, but for that purpose an unreliable version
of MemoryContextContains is probably little help.  In any case, as you
mention, I can do something with GetMemoryChunkContext() instead.

>> As far as I can tell, the datumCopy steps associated with aggregate
>> finalfns are basically useless.  They only serve to prevent
>> returning a pointer directly to the aggregate's transition value
>> (or, perhaps, to a portion thereof).  But what's wrong with that?
>> It'll last as long as we need it to.  Maybe there was a reason
>> back before we required finalfns to not scribble on the transition
>> values, but those days are gone.

> Yeah, I wondered the same thing. I couldn't see a situation where the
> aggregate context would disappear.

I have a feeling that we might once have aggressively reset the
aggregate context ... but we don't anymore.

>> The one place where we actually need the conditional datumCopy is
>> with window functions, and even there I don't think we need it
>> in simple cases with only one window function.  The case that is
>> hazardous is where multiple window functions are sharing a
>> WindowObject.  So I'm content to optimize the single-window-function
>> case and just always copy if there's more than one.  (Sadly, there
>> is no existing regression test that catches this, so I added one.)

> I was unsure what window functions might exist out in the wild, so I'd
> added some code to pass along the return type information so that any
> extensions which need to make a copy can do so.  However, maybe it's
> better just to wait to see if anyone complains about that before we go
> to the trouble.

I'd originally feared that a window function might return a pointer
into the WindowObject's tuplestore, which we manipulate immediately
after the window function returns.  However, AFAICS the APIs we
provide don't have any such hazard.  The actual hazard is that we
might get a pointer into one of the temp slots, which are independent
storage because we tell them to copy the source tuple.  (Maybe a comment
about that would be appropriate.)

> I've looked at your patches and don't see any problems. Our findings
> seem to be roughly the same. i.e the datumCopy is mostly useless.

Cool, I'll push in a little bit.

> Maybe it's worth doing;

> #define MemoryContextContains(c, p) (GetMemoryChunkContext(p) == (c))

> in memutils.h? or are we better to force extension authors to
> re-evaluate their code in case anyone is passing memory that's not
> pointing directly to a palloc'd chunk?

I think the latter.  The fact that MemoryContextContains was (mostly)
safe on arbitrary pointers was an important part of its API IMO.
I'm okay with giving up that property to reduce chunk overhead,
but we'll do nobody any service by pretending we still have it.

regards, tom lane




Re: Transparent column encryption

2022-10-06 Thread Andres Freund
Hi,

On 2022-10-06 16:25:51 +0200, Peter Eisentraut wrote:
> On 02.10.22 09:16, Andres Freund wrote:
> > On 2022-09-27 15:51:25 +0200, Peter Eisentraut wrote:
> > > Updated version with meson build system support added (for added files and
> > > new tests).
> > 
> > This fails on windows: https://cirrus-ci.com/task/6151847080624128
> > 
> > https://api.cirrus-ci.com/v1/artifact/task/6151847080624128/testrun/build/testrun/column_encryption/001_column_encryption/log/regress_log_001_column_encryption
> > 
> > psql error: stderr: 'OPENSSL_Uplink(7FFC165CBD50,08): no 
> > OPENSSL_Applink'
> 
> What in the world is that about?  What scant information I could find
> suggests that it has something to do with building a "release" build against
> an "debug" build of the openssl library, or vice versa.  But this patch
> doesn't introduce any use of openssl that we haven't seen before.

It looks to me that one needs to compile, in some form, openssl/applink.c and
link it to the application. No idea why that'd be required now and not
earlier.
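
(For what it's worth, OpenSSL's documented fix on Windows is to compile that
shim into the application exactly once, e.g.:

/* Windows only: provides the OPENSSL_Uplink/Applink glue table. */
#include <openssl/applink.c>

where the file comes from the OpenSSL installation itself.)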

Greetings,

Andres Freund




Re: request clarification on pg_restore documentation

2022-10-06 Thread Bruce Momjian
On Thu, Sep 29, 2022 at 02:30:13PM +, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
> 
> Page: https://www.postgresql.org/docs/14/app-pgrestore.html
> Description:
> 
> pg_restore seems to have two ways to restore data:
> 
> --section=data 
> 
> or
> 
>  --data-only
> 
> There is this cryptic warning that --data-only is "similar to but for
> historical reasons different from" --section=data
> 
> But there is no further explanation of what those differences are or what
> might be missed or different in your restore if you pick one option or the
> other. Maybe one or the other option is the "preferred current way" and one
> is the "historical way" or they are aimed at different types of use cases,
> but that's not clear.

[Thread moved from docs to hackers because there are behavioral issues.]

Very good question.  I dug into this and found this commit which says
--data-only and --section=data were equivalent:

commit a4cd6abcc9
Author: Andrew Dunstan 
Date:   Fri Dec 16 19:09:38 2011 -0500

Add --section option to pg_dump and pg_restore.

    Valid values are pre-data, data and post-data. The option can be
given more than once. --schema-only is equivalent to
--> --section=pre-data --section=post-data. --data-only is equivalent
--> to --section=data.

and then this commit which says they are not:

commit 4317e0246c
Author: Tom Lane 
Date:   Tue May 29 23:22:14 2012 -0400

    Rewrite --section option to decouple it from --schema-only/--data-only.

--> The initial implementation of pg_dump's --section option supposed that
--> the existing --schema-only and --data-only options could be made
--> equivalent to --section settings.  This is wrong, though, due to dubious
    but long since set-in-stone decisions about where to dump SEQUENCE SET
    items, as seen in bug report from Martin Pitt.  (And I'm not totally
    convinced there weren't other bugs, either.)  Undo that coupling and
    instead drive --section filtering off current-section state tracked as
    we scan through the TOC list to call _tocEntryRequired().

    To make sure those decisions don't shift around and hopefully save a few
    cycles, run _tocEntryRequired() only once per TOC entry and save the
    result in a new TOC field.  This required minor rejiggering of ACL
    handling but also allows a far cleaner implementation of
    inhibit_data_for_failed_table.

    Also, to ensure that pg_dump and pg_restore have the same behavior with
    respect to the --section switches, add _tocEntryRequired() filtering to
    WriteToc() and WriteDataChunks(), rather than trying to implement
    section filtering in an entirely orthogonal way in dumpDumpableObject().
    This required adjusting the handling of the special ENCODING and
    STDSTRINGS items, but they were pretty weird before anyway.

and this commit which made them closer:

commit 5a39114fe7
Author: Tom Lane 
Date:   Fri Oct 26 12:12:42 2012 -0400

    In pg_dump, dump SEQUENCE SET items in the data not pre-data section.

    Represent a sequence's current value as a separate TableDataInfo
    dumpable object, so that it can be dumped within the data section of
--> the archive rather than in pre-data.  This fixes an undesirable
--> inconsistency between the meanings of "--data-only" and
--> "--section=data", and also fixes dumping of sequences that are marked
    as extension configuration tables, as per a report from Marko Kreen
    back in July.  The main cost is that we do one more SQL query per
    sequence, but that's probably not very meaningful in most databases.

Looking at the restore code, I see --data-only disabling triggers, while
--section=data doesn't.  I also tested --data-only vs. --section=data in
pg_dump for the regression database and saw the only differences as the
creation and comments on large objects, e.g.,

-- Name: 2121; Type: BLOB; Schema: -; Owner: postgres
--

SELECT pg_catalog.lo_create('2121');


ALTER LARGE OBJECT 2121 OWNER TO postgres;

--
-- Name: LARGE OBJECT 2121; Type: COMMENT; Schema: -; Owner: postgres
--

COMMENT ON LARGE OBJECT 2121 IS 'testing comments';

but the large object _data_ was dumped in both cases.

So, where does this leave us?  We know we need --section=data because
the pre/post-data options are clearly useful, so why would someone use
--data-only vs. --section=data?  We don't document why to use one rather
than the other, so the --data-only option looks useless to me.  Do we

Re: Transparent column encryption

2022-10-06 Thread Peter Eisentraut

On 02.10.22 09:16, Andres Freund wrote:

On 2022-09-27 15:51:25 +0200, Peter Eisentraut wrote:

Updated version with meson build system support added (for added files and
new tests).


This fails on windows: https://cirrus-ci.com/task/6151847080624128

https://api.cirrus-ci.com/v1/artifact/task/6151847080624128/testrun/build/testrun/column_encryption/001_column_encryption/log/regress_log_001_column_encryption

psql error: stderr: 'OPENSSL_Uplink(7FFC165CBD50,08): no OPENSSL_Applink'


What in the world is that about?  What scant information I could find 
suggests that it has something to do with building a "release" build 
against an "debug" build of the openssl library, or vice versa.  But 
this patch doesn't introduce any use of openssl that we haven't seen before.






RE: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread houzj.f...@fujitsu.com
On Thursday, October 6, 2022 9:00 PM Kuroda, Hayato/黒田 隼人 
 wrote:
> 
> Dear Hou,
> 
> > Thanks for the suggestion.
> >
> > I tried to add a WaitLatch, but it seems to affect the performance
> > because the latch might not be set when the leader sends a message to
> > the parallel apply worker, which means it will wait until the timeout.
> 
> Yes, currently the leader does not notify anything.
> To handle that, the leader must set a latch in parallel_apply_send_data().
> It can be done if the leader accesses
> winfo->shared->logicalrep_worker_slot_no and sets a latch for
> LogicalRepCtxStruct->worker[slot_no].

Thanks for the suggestion. I think we could do that, but I feel it's not great
to set the latch frequently. Besides, to access LogicalRepCtxStruct->worker[]
we would need to hold a lock, which might also bring some overhead.

Best regards,
Hou zj


Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-06 Thread Jonathan S. Katz

On 10/6/22 1:10 AM, Kyotaro Horiguchi wrote:

At Thu, 6 Oct 2022 13:44:43 +0900, Michael Paquier  wrote 
in

On Wed, Oct 05, 2022 at 11:24:57PM -0400, Jonathan S. Katz wrote:

On 10/5/22 8:44 PM, Andres Freund wrote:

I have two ideas for how to fix it. As a design constraint, I'd be interested
in the RMT's opinion on the following:
Is a cleaner fix that changes the stats format (i.e. existing stats will be
thrown away when upgrading) or one that doesn't change the stats format
preferable?


[My opinion, will let Michael/John chime in]

Ideally we would keep the stats on upgrade -- I think that's the better user
experience.


The release has not happened yet, so I would be fine to remain
flexible and bump again PGSTAT_FILE_FORMAT_ID so as we have the
cleanest approach in place for the release and the future.


Yes, I agree with this.


 At the

end, we would throw automatically the data of a file that's marked
with a version that does not match with what we expect at load time,
so there's a limited impact on the user experience, except, well,
losing these stats, of course.


I'm fine with this.


+1. FWIW, the attached is an example of what it looks like if we
avoid a file format change.


Thanks for the quick turnaround. I'll let others chime in on the review.

Thanks,

Jonathan





ps command does not show walsender's connected db

2022-10-06 Thread bt22nakamorit

Hi,

When a walsender process is invoked for logical replication, the walsender
is connected to a specific database.
Naturally, one would want the name of the connected database to show up in
the walsender's ps command entry.
In detail, running ps aux during logical replication currently shows
results like the following:

postgres=# \! ps aux | grep postgres;
...
ACTC-I\+ 14575  0.0  0.0 298620 14228 ?Ss   18:22   0:00 
postgres: walsender ACTC-I\nakamorit [local] S


However, since the walsender is connected to a database in logical
replication, it should show the database name, as in the following:

postgres=# \! ps aux | grep postgres
...
ACTC-I\+ 15627  0.0  0.0 298624 13936 ?Ss   15:45   0:00 
postgres: walsender ACTC-I\nakamorit postgres


Showing the database name should not apply in streaming replication,
though, since the walsender process there is not connected to any
specific database.


The attached patch adds the name of the connected database to the
walsender's ps entry in logical replication, but not in streaming
replication.


Thoughts?

Tatsuhiro Nakamori

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 383bc4776e..6cbb69767c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4187,7 +4187,9 @@ BackendStartup(Port *port)
 	if (pid == 0)/* child */
 	{
 		free(bn);
-
+		printf("fork child process\n");
+		printf("	am_walsender: %d\n", am_walsender);
+		printf("	am_db_walsender: %d\n", am_db_walsender);
 		/* Detangle from postmaster */
 		InitPostmasterChild();
 
@@ -4451,7 +4453,7 @@ BackendInitialize(Port *port)
 	if (am_walsender)
 		appendStringInfo(&ps_data, "%s ", GetBackendTypeDesc(B_WAL_SENDER));
 	appendStringInfo(&ps_data, "%s ", port->user_name);
-	if (!am_walsender)
+	if (!am_walsender || am_db_walsender)
 		appendStringInfo(&ps_data, "%s ", port->database_name);
 	appendStringInfoString(&ps_data, port->remote_host);
 	if (port->remote_port[0] != '\0')


Harmonize parameter names in Win32

2022-10-06 Thread Ranier Vilela
Hi,

Like the "harmonize parameter names" series of commits (20e69da and
others), this tries to harmonize more parameter names, mainly in Win32
and some other files.

regards,
Ranier Vilela


harmonize_parameters_names_win32.patch
Description: Binary data


RE: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread kuroda.hay...@fujitsu.com
Dear Hou,

> Thanks for the suggestion.
> 
> I tried to add a WaitLatch, but it seems to affect the performance
> because the latch might not be set when the leader sends a
> message to the parallel apply worker, which means it will wait until
> the timeout.

Yes, currently the leader does not notify anything.
To handle that, the leader must set a latch in parallel_apply_send_data().
It can be done if the leader accesses winfo->shared->logicalrep_worker_slot_no
and sets a latch for LogicalRepCtxStruct->worker[slot_no].
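
A rough sketch of that suggestion, using the names from this thread (the
locking details are an assumption):

/* In parallel_apply_send_data(), after queuing the message: */
int		slot_no = winfo->shared->logicalrep_worker_slot_no;

LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
if (LogicalRepCtx->workers[slot_no].proc != NULL)
	SetLatch(&LogicalRepCtx->workers[slot_no].proc->procLatch);
LWLockRelease(LogicalRepWorkerLock);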


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Allow foreign keys to reference a superset of unique columns

2022-10-06 Thread Peter Eisentraut

On 28.09.22 00:39, Kaiting Chen wrote:
What other semantics and edge cases regarding this proposal should I 
consider?


I'm not as pessimistic as others that it couldn't be made to work.  But 
it's the job of this proposal to figure this out.  Implementing it is 
probably not that hard in the end, but working out the specification in 
all details is the actual job.






RE: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread kuroda.hay...@fujitsu.com
Dear Hou,

I put comments for v35-0001.

01. catalog.sgml

```
+   Controls how to handle the streaming of in-progress transactions:
+   f = disallow streaming of in-progress transactions,
+   t = spill the changes of in-progress transactions to
+   disk and apply at once after the transaction is committed on the
+   publisher,
+   p = apply changes directly using a parallel apply
+   worker if available (same as 't' if no worker is available)
```

I'm not sure why 't' means "spill the changes to file". Is it a
compatibility issue?

~~~
02. applyworker.c - parallel_apply_stream_abort

The argument abort_data is not modified in the function. Maybe "const" modifier 
should be added.
(Other functions should be also checked...)

~~~
03. applyparallelworker.c - parallel_apply_find_worker

```
+   ParallelApplyWorkerEntry *entry = NULL;
```

This may not have to be initialized here.

~~~
04. applyparallelworker.c - HandleParallelApplyMessages

```
+   static MemoryContext hpm_context = NULL;
```

I think "hpm" means "handle parallel message", so it should be "hpam".

~~~
05. launcher.c - logicalrep_worker_launch()

```
if (is_subworker)
snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication 
parallel worker");
else
snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication 
worker");
```

I'm not sure why there are only two bgw_type values even though there are
three types of apply workers. Is it for compatibility?

~~~
06. launcher.c - logicalrep_worker_stop_by_slot

An assertion like Assert(slot_no >=0 && slot_no < 
max_logical_replication_workers) should be added at the top of this function.

~~~
07. launcher.c - logicalrep_worker_stop_internal

```
+/*
+ * Workhorse for logicalrep_worker_stop(), logicalrep_worker_detach() and
+ * logicalrep_worker_stop_by_slot(). Stop the worker and wait for it to die.
+ */
+static void
+logicalrep_worker_stop_internal(LogicalRepWorker *worker)
```

I think logicalrep_worker_stop_internal() may not be the "workhorse" for
logicalrep_worker_detach(). In that function, the internal function is only
called for parallel apply workers, and it is not the main part of the
detach logic.

~~~
08. worker.c - handle_streamed_transaction()

```
+   TransactionId current_xid = InvalidTransactionId;
```

This initialization is not needed. This is not used in non-streaming mode;
otherwise, it is assigned before use.

~~~
09. worker.c - handle_streamed_transaction()

```
+   case TRANS_PARALLEL_APPLY:
+   /* Define a savepoint for a subxact if needed. */
+   parallel_apply_start_subtrans(current_xid, stream_xid);
+   return false;
```

Based on the other case blocks, Assert(am_parallel_apply_worker()) could be
added at the top of this part.
The same suggestion applies to other switch-case statements.

~~~
10. worker.c - apply_handle_stream_start

```
+ *
+ * XXX We can avoid sending pair of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one
+ * transaction-at-a-time. However, it is not clear whether that is worth the
+ * effort because it is sent after logical_decoding_work_mem changes.
```

I can understand that the START message is not needed, but is STOP really
removable? If the leader does not send STOP to its child, doesn't it lose
the chance to change the worker state to IDLE_IN_TRANSACTION?

~~~
11. worker.c - apply_handle_stream_start

Currently the number of received chunks is not counted, but it could be if a
variable "nchunks" were defined and incremented in apply_handle_stream_start().
This info may be useful for determining an appropriate
logical_decoding_work_mem for workloads. What do you think?

~~~
12. worker.c - get_transaction_apply_action

{} are not needed.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Record SET session in VariableSetStmt

2022-10-06 Thread Drouvot, Bertrand




On 10/6/22 2:28 PM, Julien Rouhaud wrote:

On Thu, Oct 06, 2022 at 02:19:32PM +0200, Drouvot, Bertrand wrote:


On 10/6/22 1:18 PM, Julien Rouhaud wrote:


so
nothing should rely on how exactly someone spelled it.  This is also the case
for our core jumbling code, where we guarantee (or at least try to) that two
semantically identical statements will get the same queryid, and therefore
don't distinguish eg. LIKE vs ~~.


Agree, but on the other hand currently SET and SET SESSION are recorded with
distinct queryid:

postgres=# select calls, query, queryid from pg_stat_statements;
  calls |query|   queryid
---+-+--
  2 | select calls, query from pg_stat_statements | -6345508659980235519
  1 | set session enable_seqscan=1| -3921418831612111986
  1 | create extension pg_stat_statements | -1739183385080879393
  1 | set enable_seqscan=1|  7925920505912025406
(4 rows)

and this behavior would change with the Jumbling work in progress in [1]
(mentioned up-thread) if we don't record "SET SESSION".

I think that would make sense to keep the same behavior, what do you think?


It's because until now jumbling of utility statements was just hashing the
query text, which is quite terrible.  This also implied getting different
queryids for things like this:

=# select query, queryid from pg_stat_statements where query like '%work_mem%';;
  query |   queryid
---+--
  SeT work_mem = 123465 | -1114638544275583196
  Set work_mem = 123465 | -1966597613643458788
  SET work_mem = 123465 |  4161441071081149574
  seT work_mem = 123465 |  8327271737593275474
(4 rows)

If we move to a real jumbling of VariableSetStmt, we should keep the rules
consistent with the rest of the jumble code and ignore an explicit "SESSION" in
the original command.


Understood, so I agree that it makes sense to keep the jumbling behavior 
consistent and so keep the same queryid for statements that are 
semantically identical.


Thanks for your feedback!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Record SET session in VariableSetStmt

2022-10-06 Thread Julien Rouhaud
On Thu, Oct 06, 2022 at 02:19:32PM +0200, Drouvot, Bertrand wrote:
>
> On 10/6/22 1:18 PM, Julien Rouhaud wrote:
>
> > so
> > nothing should rely on how exactly someone spelled it.  This is also the 
> > case
> > for our core jumbling code, where we guarantee (or at least try to) that two
> > semantically identical statements will get the same queryid, and therefore
> > don't distinguish eg. LIKE vs ~~.
>
> Agree, but on the other hand currently SET and SET SESSION are recorded with
> distinct queryid:
>
> postgres=# select calls, query, queryid from pg_stat_statements;
>  calls |query|   queryid
> ---+-+--
>  2 | select calls, query from pg_stat_statements | -6345508659980235519
>  1 | set session enable_seqscan=1| -3921418831612111986
>  1 | create extension pg_stat_statements | -1739183385080879393
>  1 | set enable_seqscan=1|  7925920505912025406
> (4 rows)
>
> and this behavior would change with the Jumbling work in progress in [1]
> (mentioned up-thread) if we don't record "SET SESSION".
>
> I think that would make sense to keep the same behavior, what do you think?

It's because until now jumbling of utility statements was just hashing the
query text, which is quite terrible.  This also implied getting different
queryids for things like this:

=# select query, queryid from pg_stat_statements where query like '%work_mem%';;
 query |   queryid
---+--
 SeT work_mem = 123465 | -1114638544275583196
 Set work_mem = 123465 | -1966597613643458788
 SET work_mem = 123465 |  4161441071081149574
 seT work_mem = 123465 |  8327271737593275474
(4 rows)

If we move to a real jumbling of VariableSetStmt, we should keep the rules
consistent with the rest of the jumble code and ignore an explicit "SESSION" in
the original command.
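
A minimal sketch of such jumbling, using queryjumble.c's existing APP_JUMB
machinery (the function itself is hypothetical):

static void
JumbleVariableSetStmt(JumbleState *jstate, VariableSetStmt *stmt)
{
	APP_JUMB(stmt->kind);
	if (stmt->name)
		APP_JUMB_STRING(stmt->name);

	/*
	 * SET and SET SESSION are semantically identical, so no "session"
	 * flag is jumbled; only SET LOCAL (is_local) changes the semantics.
	 */
	APP_JUMB(stmt->is_local);
}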




Re: Move backup-related code to xlogbackup.c/.h

2022-10-06 Thread Bharath Rupireddy
On Thu, Oct 6, 2022 at 4:50 AM Andres Freund  wrote:
>
> I'm doubtful it's a good idea to expose these to outside of xlog.c - they are
> very low level, and it's very easy to break stuff by using them wrongly.

Hm. Here's the v3 patch set without exposing WAL insert lock related
functions. Please have a look.

On Thu, Oct 6, 2022 at 4:22 AM Nathan Bossart  wrote:
>
> Can we also replace the relevant code with calls to these functions in
> 0001?  That way, we can more easily review the changes you are making to
> this code separately from the large patch that just moves the code.

Done. Please have a look at 0001.

On Wed, Oct 5, 2022 at 11:28 PM Alvaro Herrera  wrote:
>
> On 2022-Oct-05, Michael Paquier wrote:
>
> > And FWIW, the SQL interfaces for pg_backup_start() and
> > pg_backup_stop() could stay in xlogfuncs.c.  This has the advantage to
> > centralize in the same file all the SQL-function-specific checks.
>
> As I recall, that has the disadvantage that the API exposure is a bit
> higher -- I mean, with the patch as originally posted, there was less
> cross-inclusion of header files, but that is gone in the version Bharath
> posted in reply to this.  I'm not sure if that's caused by *this*
> comment, or even that it's terribly significant, but it seems worth
> considering at least.

FWIW, I'm attaching the 0003 patch for moving backup functions from
xlogfuncs.c to xlogbackup.c. It's natural to have them there when
we're moving backup-related things, and this also reduces the backup
code footprint. We can leave xlogfuncs.c for WAL-related SQL-callable
functions.

Please review the attached v3 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 2ab3934f28fc986904afe8131aaa639713e4836c Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 6 Oct 2022 11:01:22 +
Subject: [PATCH v3] Add functions for xlogbackup.c to call back into xlog.c

---
 src/backend/access/transam/xlog.c | 191 --
 src/include/access/xlog.h |  15 +++
 2 files changed, 142 insertions(+), 64 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 27085b15a8..3af9cc86c0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8315,10 +8315,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 	 * forcePageWrites, to ensure adequate interlocking against
 	 * XLogInsertRecord().
 	 */
-	WALInsertLockAcquireExclusive();
-	XLogCtl->Insert.runningBackups++;
-	XLogCtl->Insert.forcePageWrites = true;
-	WALInsertLockRelease();
+	SetXLogBackupRelatedInfo();
 
 	/* Ensure we release forcePageWrites if fail below */
 	PG_ENSURE_ERROR_CLEANUP(pg_backup_start_callback, (Datum) 0);
@@ -8356,6 +8353,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 		do
 		{
 			bool		checkpointfpw;
+			ControlFileData	*ControlFile;
+
 
 			/*
 			 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
@@ -8384,6 +8383,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 			 * to restore starting from the checkpoint is precisely the REDO
 			 * pointer.
 			 */
+			ControlFile = GetControlFile();
+
 			LWLockAcquire(ControlFileLock, LW_SHARED);
 			state->checkpointloc = ControlFile->checkPoint;
 			state->startpoint = ControlFile->checkPointCopy.redo;
@@ -8400,9 +8401,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
  * (i.e., since last restartpoint used as backup starting
  * checkpoint) contain full-page writes.
  */
-SpinLockAcquire(&XLogCtl->info_lck);
-recptr = XLogCtl->lastFpwDisableRecPtr;
-SpinLockRelease(&XLogCtl->info_lck);
+recptr = GetlastFpwDisableRecPtr();
 
 if (!checkpointfpw || state->startpoint <= recptr)
 	ereport(ERROR,
@@ -8435,13 +8434,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 			 * taking a checkpoint right after another is not that expensive
 			 * either because only few buffers have been dirtied yet.
 			 */
-			WALInsertLockAcquireExclusive();
-			if (XLogCtl->Insert.lastBackupStart < state->startpoint)
-			{
-XLogCtl->Insert.lastBackupStart = state->startpoint;
-gotUniqueStartpoint = true;
-			}
-			WALInsertLockRelease();
+			gotUniqueStartpoint = SetlastBackupStart(state->startpoint);
 		} while (!gotUniqueStartpoint);
 
 		/*
@@ -8546,16 +8539,7 @@ static void
 pg_backup_start_callback(int code, Datum arg)
 {
 	/* Update backup counters and forcePageWrites on failure */
-	WALInsertLockAcquireExclusive();
-
-	Assert(XLogCtl->Insert.runningBackups > 0);
-	XLogCtl->Insert.runningBackups--;
-
-	if (XLogCtl->Insert.runningBackups == 0)
-	{
-		XLogCtl->Insert.forcePageWrites = false;
-	}
-	WALInsertLockRelease();
+	ResetXLogBackupRelatedInfo(NULL);
 }
 
 /*
@@ -8592,6 +8576,7 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 	int			
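
Judging from the code being replaced, GetlastFpwDisableRecPtr() in xlog.c
presumably amounts to the following, with the other wrappers following the
same pattern:

XLogRecPtr
GetlastFpwDisableRecPtr(void)
{
	XLogRecPtr	recptr;

	SpinLockAcquire(&XLogCtl->info_lck);
	recptr = XLogCtl->lastFpwDisableRecPtr;
	SpinLockRelease(&XLogCtl->info_lck);

	return recptr;
}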

Re: Record SET session in VariableSetStmt

2022-10-06 Thread Drouvot, Bertrand

Hi,

On 10/6/22 1:18 PM, Julien Rouhaud wrote:

Hi,

On Thu, Oct 06, 2022 at 12:57:17PM +0200, Drouvot, Bertrand wrote:

Hi hackers,

"SET local" is currently recorded in VariableSetStmt (with the boolean
is_local) but "SET session" is not.

Please find attached a patch proposal to also record "SET session" so that
VariableSetStmt records all the cases.

Remark: Recording "SET session" will also help for the Jumbling work being
done in [1].


I don't think it's necessary.  SET and SET SESSION are semantically the same


Thanks for your feedback!


so
nothing should rely on how exactly someone spelled it.  This is also the case
for our core jumbling code, where we guarantee (or at least try to) that two
semantically identical statements will get the same queryid, and therefore
don't distinguish eg. LIKE vs ~~.


Agree, but on the other hand currently SET and SET SESSION are recorded 
with distinct queryid:


postgres=# select calls, query, queryid from pg_stat_statements;
 calls |query|   queryid
---+-+--
 2 | select calls, query from pg_stat_statements | -6345508659980235519
 1 | set session enable_seqscan=1| -3921418831612111986
 1 | create extension pg_stat_statements | -1739183385080879393
 1 | set enable_seqscan=1|  7925920505912025406
(4 rows)

and this behavior would change with the Jumbling work in progress in [1] 
(mentioned up-thread) if we don't record "SET SESSION".


I think that would make sense to keep the same behavior, what do you think?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2022-10-06 Thread Aleksander Alekseev
Maxim,

> Here is a rebased version of the patch set.

This is the wrong thread / CF entry. Please see
http://cfbot.cputube.org/ and https://commitfest.postgresql.org/ and
the first email in the thread.

-- 
Best regards,
Aleksander Alekseev




RE: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread houzj.f...@fujitsu.com
On Thursday, October 6, 2022 6:54 PM Kuroda, Hayato/黒田 隼人 
 wrote:
> 
> Dear Amit,
> 
> > Can't we use WaitLatch in the case of SHM_MQ_WOULD_BLOCK as we are
> > using it for the same case at some other place in the code? We can use
> > the same nap time as we are using in the leader apply worker.
> 
> I'm not sure whether such a short nap time is needed or not,
> because unlike the leader apply worker, parallel apply workers do not have a
> timeout like wal_receiver_timeout, so they do not have to check so frequently
> and send feedback to the publisher.
> But basically I agree that we can use same logic as leader.

Thanks for the suggestion.

I tried to add a WaitLatch, but it seems to affect the performance
because the latch might not be set when the leader sends a
message to the parallel apply worker, which means it will wait until
the timeout.

I feel we'd better change it back to sync mode and do the ProcessConfigFile()
after receiving the message and before applying the change, which also seems
to address the problem.
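
For reference, the pattern Amit points at elsewhere in the tree looks
roughly like this (the wait-event name and nap time are illustrative):

for (;;)
{
	shm_mq_result res;
	Size		len;
	void	   *data;

	res = shm_mq_receive(mqh, &len, &data, true);	/* nowait */

	if (res == SHM_MQ_SUCCESS)
		break;
	else if (res == SHM_MQ_WOULD_BLOCK)
	{
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L, WAIT_EVENT_LOGICAL_APPLY_MAIN);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
	else
		ereport(ERROR,
				(errmsg("lost connection to the leader apply worker")));
}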

Best regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread houzj.f...@fujitsu.com


> -Original Message-
> From: Masahiko Sawada 
> Sent: Thursday, October 6, 2022 4:07 PM
> To: Hou, Zhijie/侯 志杰 
> Cc: Amit Kapila ; Wang, Wei/王 威
> ; Peter Smith ; Dilip
> Kumar ; Shi, Yu/侍 雨 ;
> PostgreSQL Hackers 
> Subject: Re: Perform streaming logical transactions by background workers and
> parallel apply
> 
> On Tue, Sep 27, 2022 at 9:26 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Saturday, September 24, 2022 7:40 PM Amit Kapila
>  wrote:
> > >
> > > On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila 
> > > wrote:
> > > >
> > > > On Thu, Sep 22, 2022 at 8:59 AM wangw.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > >
> > > > Few comments on v33-0001
> > > > ===
> > > >
> > >
> > > Some more comments on v33-0001
> > > =
> > > 1.
> > > + /* Information from the corresponding LogicalRepWorker slot. */
> > > + uint16 logicalrep_worker_generation;
> > > +
> > > + int logicalrep_worker_slot_no;
> > > +} ParallelApplyWorkerShared;
> > >
> > > Both these variables are read/changed by leader/parallel workers without
> > > using any lock (mutex). It seems currently there is no problem because of
> the
> > > way the patch is using in_parallel_apply_xact but I think it won't be a 
> > > good
> idea
> > > to rely on it. I suggest using mutex to operate on these variables and 
> > > also
> check
> > > if the slot_no is in a valid range after reading it in 
> > > parallel_apply_free_worker,
> > > otherwise error out using elog.
> >
> > Changed.
> >
> > > 2.
> > >  static void
> > >  apply_handle_stream_stop(StringInfo s)
> > >  {
> > > - if (!in_streamed_transaction)
> > > + ParallelApplyWorkerInfo *winfo = NULL; TransApplyAction apply_action;
> > > +
> > > + if (!am_parallel_apply_worker() &&
> > > + (!in_streamed_transaction && !stream_apply_worker))
> > >   ereport(ERROR,
> > >   (errcode(ERRCODE_PROTOCOL_VIOLATION),
> > >   errmsg_internal("STREAM STOP message without STREAM START")));
> > >
> > > This check won't be able to detect missing stream start messages for 
> > > parallel
> > > apply workers apart from the first pair of start/stop. I thought of adding
> > > in_remote_transaction check along with
> > > am_parallel_apply_worker() to detect the same but that also won't work
> > > because the parallel worker doesn't reset it at the stop message.
> > > Another possibility is to introduce yet another variable for this but that
> doesn't
> > > seem worth it. I would like to keep this check simple.
> > > Can you think of any better way?
> >
> > I feel we can reuse in_streamed_transaction in the parallel apply worker to
> > simplify the check there. I tried to set this flag in the parallel apply
> > worker when the stream starts and reset it when the stream stops so that we
> > can directly check this flag for duplicate stream start messages and other
> > related things.
> >
> > > 3. I think we can skip sending start/stop messages from the leader to the
> > > parallel worker because unlike apply worker it will process only one
> > > transaction-at-a-time. However, it is not clear whether that is worth the
> effort
> > > because it is sent after logical_decoding_work_mem changes. For now, I 
> > > have
> > > added a comment for this in the attached patch but let me if I am missing
> > > something or if I am wrong.
> >
> > I think the suggested comments look good.
> >
> > > 4.
> > > postgres=# select pid, leader_pid, application_name, backend_type from
> > > pg_stat_activity;
> > >   pid  | leader_pid | application_name | backend_type
> > > ---++--+--
> > >  27624 ||  | logical replication launcher
> > >  17336 || psql | client backend
> > >  26312 ||  | logical replication worker
> > >  26376 || psql | client backend
> > >  14004 ||  | logical replication worker
> > >
> > > Here, the second worker entry is for the parallel worker. Isn't it better 
> > > if we
> > > distinguish this by keeping type as a logical replication parallel 
> > > worker? I
> think
> > > for this you need to change bgw_type in logicalrep_worker_launch().
> >
> > Changed.
> >
> > > 5. Can we name parallel_apply_subxact_info_add() as
> > > parallel_apply_start_subtrans()?
> > >
> > > Apart from the above, I have added/edited a few comments and made a few
> > > other cosmetic changes in the attached.
> >
> 
> While looking at v35 patch, I realized that there are some cases where
> the logical replication gets stuck depending on partitioned table
> structure. For instance, there are following tables, publication, and
> subscription:
> 
> * On publisher
> create table p (c int) partition by list (c);
> create table c1 partition of p for values in (1);
> create table c2 (c int);
> create publication test_pub for table p, c1, c2 with
> (publish_via_partition_root = 'true');
> 
> * On subscriber
> create table p 

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Amit Kapila
On Fri, Sep 30, 2022 at 1:56 PM Peter Smith  wrote:
>
> Here are my review comments for the v35-0001 patch:
>
> ==
>
> 3. GENERAL
>
> (this comment was written after I wrote all the other ones below so
> there might be some unintended overlaps...)
>
> I found the mixed use of the same member names having different
> meanings to be quite confusing.
>
> e.g.1
> PGOutputData 'streaming' is now a single char internal representation
> of the subscription parameter streaming mode ('f','t','p')
> - bool streaming;
> + char streaming;
>
> e.g.2
> WalRcvStreamOptions 'streaming' is a C string version of the
> subscription streaming mode ("on", "parallel")
> - bool streaming; /* Streaming of large transactions */
> + char*streaming; /* Streaming of large transactions */
>
> e.g.3
> SubOpts 'streaming' is again like the first example - a single char
> for the mode.
> - bool streaming;
> + char streaming;
>
>
> IMO everything would become much simpler if you did:
>
> 3a.
> Rename "char streaming;" -> "char streaming_mode;"
>
> 3b.
> Re-design the "char *streaming;" code to also use the single char
> notation, then also call that member 'streaming_mode'. Then everything
> will be consistent.
>

Won't this impact the previous version publisher which already uses
on/off? We may need to maintain multiple values which would be
confusing.
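
For illustration, a sketch of what maintaining both spellings could look
like (the helper name and the degrade-to-"on" behavior are hypothetical,
not from the patch):

static const char *
streaming_mode_to_wire(char mode, bool publisher_knows_parallel)
{
	switch (mode)
	{
		case 'f':
			return "off";
		case 'p':
			/* an older publisher only understands on/off */
			return publisher_knows_parallel ? "parallel" : "on";
		case 't':
		default:
			return "on";
	}
}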

>
> 9. - parallel_apply_can_start
>
> +/*
> + * Returns true, if it is allowed to start a parallel apply worker, false,
> + * otherwise.
> + */
> +static bool
> +parallel_apply_can_start(TransactionId xid)
>
> (The commas are strange)
>
> SUGGESTION
> Returns true if it is OK to start a parallel apply worker, false otherwise.
>

+1 for this.
>
> 28. - logicalrep_worker_detach
>
> + /* Stop the parallel apply workers. */
> + if (am_leader_apply_worker())
> + {
>
> Should that comment rather say like below?
>
> /* If this is the leader apply worker then stop all of its parallel
> apply workers. */
>

I think this would be just saying what is apparent from the code, so
not sure if it is an improvement.

>
> 38. - apply_handle_commit_prepared
>
> + *
> + * Note that we don't need to wait here if the transaction was prepared in a
> + * parallel apply worker. Because we have already waited for the prepare to
> + * finish in apply_handle_stream_prepare() which will ensure all the 
> operations
> + * in that transaction have happened in the subscriber and no concurrent
> + * transaction can create deadlock or transaction dependency issues.
>   */
>  static void
>  apply_handle_commit_prepared(StringInfo s)
>
> "worker. Because" -> "worker because"
>

I think this will make this line too long. Can we think of breaking it
in some way?

>
> 43.
>
>   /*
> - * Initialize the worker's stream_fileset if we haven't yet. This will be
> - * used for the entire duration of the worker so create it in a permanent
> - * context. We create this on the very first streaming message from any
> - * transaction and then use it for this and other streaming transactions.
> - * Now, we could create a fileset at the start of the worker as well but
> - * then we won't be sure that it will ever be used.
> + * For the first stream start, check if there is any free parallel apply
> + * worker we can use to process this transaction.
>   */
> - if (MyLogicalRepWorker->stream_fileset == NULL)
> + if (first_segment)
> + parallel_apply_start_worker(stream_xid);
>
> This comment update seems misleading. The
> parallel_apply_start_worker() isn't just checking if there is a free
> worker. All that free worker logic stuff is *inside* the
> parallel_apply_start_worker() function, so maybe no need to mention
> about it here at the caller.
>

It will be good to have some comments here instead of completely removing it.

>
> 39. - apply_handle_stream_abort
>
> + /* We receive abort information only when we can apply in parallel. */
> + if (MyLogicalRepWorker->parallel_apply)
> + read_abort_info = true;
>
> 44a.
> SUGGESTION
> We receive abort information only when the publisher can support parallel 
> apply.
>

The existing comment seems better to me in this case.

>
> 55. - LogicalRepWorker
>
> + /* Indicates whether apply can be performed parallelly. */
> + bool parallel_apply;
> +
>
> 55a.
> "parallelly" - ?? is there a better way to phrase this? IMO that is an
> uncommon word.
>

How about ".. can be performed in parallel."?

> ~
>
> 55b.
> IMO this member name should be named slightly different to give a
> better feel for what it really means.
>
> Maybe something like one of:
> "parallel_apply_ok"
> "parallel_apply_enabled"
> "use_parallel_apply"
> etc?
>

The extra word doesn't seem to be useful here.

> 58.
>
> --- fail - streaming must be boolean
> +-- fail - streaming must be boolean or 'parallel'
>  CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> false, streaming = foo);
>
> I think there are tests already for explicitly create/set the
> subscription 

Re: use has_privs_of_role() for pg_hba.conf

2022-10-06 Thread Joe Conway

On 10/6/22 04:09, Michael Paquier wrote:
> On Mon, Apr 04, 2022 at 07:25:51AM -0700, Nathan Bossart wrote:
>> On Mon, Apr 04, 2022 at 09:36:13AM -0400, Joshua Brindle wrote:
>>> Good catch, I think this is a logical followup to the previous
>>> has_privs_of_role patch.
>>>
>>> Reviewed and +1
>>
>> Thanks!  I created a commitfest entry for this:
>
> This patch looks simple, but it is a very sensitive area so I think
> that we should be really careful.  pg_hba.conf does not have a lot of
> test coverage, so I'd really prefer if we add something to see the
> difference of behavior and check the behavior that we are switching
> here.

Agreed

> Joe, you are registered as a reviewer and committer of this patch, by
> the way.  Are you planning to look at it?

I am meaning to get to it, but as you say wanted to spend some time to
understand the nuances and life keeps getting in the way. I will try to
prioritize it over the next week.


Joe
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: fix comment typo in xlogprefetcher.c

2022-10-06 Thread Michael Paquier
On Thu, Oct 06, 2022 at 08:12:37AM +, kato-...@fujitsu.com wrote:
> I found a comment typo in xlogprefetcher.c.
> Any thoughts?

Fixed, thanks.
--
Michael


signature.asc
Description: PGP signature


Re: Record SET session in VariableSetStmt

2022-10-06 Thread Julien Rouhaud
Hi,

On Thu, Oct 06, 2022 at 12:57:17PM +0200, Drouvot, Bertrand wrote:
> Hi hackers,
>
> "SET local" is currently recorded in VariableSetStmt (with the boolean
> is_local) but "SET session" is not.
>
> Please find attached a patch proposal to also record "SET session" so that
> VariableSetStmt records all the cases.
>
> Remark: Recording "SET session" will also help for the Jumbling work being
> done in [1].

I don't think it's necessary.  SET and SET SESSION are semantically the same so
nothing should rely on how exactly someone spelled it.  This is also the case
for our core jumbling code, where we guarantee (or at least try to) that two
semantically identical statements will get the same queryid, and therefore
don't distinguish e.g. LIKE vs ~~.




Record SET session in VariableSetStmt

2022-10-06 Thread Drouvot, Bertrand

Hi hackers,

"SET local" is currently recorded in VariableSetStmt (with the boolean 
is_local) but "SET session" is not.


Please find attached a patch proposal to also record "SET session" so 
that VariableSetStmt records all the cases.


Remark: Recording "SET session" will also help for the Jumbling work 
being done in [1].


[1]: 
https://www.postgresql.org/message-id/66be1104-164f-dcb8-6c43-f03a68a139a7%40gmail.com


Looking forward to your feedback,

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

commit 2401087366432b0a520ac45947c4584924099617
Author: bdrouvotAWS 
Date:   Thu Oct 6 10:20:18 2022 +

v1-0001-record-set-session-in-VariableSetStmt.patch

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 94d5142a4a..809e617d18 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1563,6 +1563,7 @@ VariableSetStmt:
VariableSetStmt *n = $2;
 
n->is_local = false;
+   n->is_session = false;
$$ = (Node *) n;
}
| SET LOCAL set_rest
@@ -1570,6 +1571,7 @@ VariableSetStmt:
VariableSetStmt *n = $3;
 
n->is_local = true;
+   n->is_session = false;
$$ = (Node *) n;
}
| SET SESSION set_rest
@@ -1577,6 +1579,7 @@ VariableSetStmt:
VariableSetStmt *n = $3;
 
n->is_local = false;
+   n->is_session = true;
$$ = (Node *) n;
}
;
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 633e7671b3..c3f33e7c4a 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2228,6 +2228,7 @@ typedef struct VariableSetStmt
char   *name;   /* variable to be set */
List   *args;   /* List of A_Const nodes */
boolis_local;   /* SET LOCAL? */
+   boolis_session; /* SET SESSION? */
 } VariableSetStmt;
 
 /* --


RE: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread kuroda.hay...@fujitsu.com
Dear Amit,

> Can't we use WaitLatch in the case of SHM_MQ_WOULD_BLOCK as we are
> using it for the same case at some other place in the code? We can use
> the same nap time as we are using in the leader apply worker.

I'm not sure whether such a short nap time is needed or not, because
unlike the leader apply worker, parallel apply workers do not have a
timeout like wal_receiver_timeout, so they do not have to check so
frequently and send feedback to the publisher.
But basically I agree that we can use the same logic as the leader.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: [PATCH] Compression dictionaries for JSONB

2022-10-06 Thread Aleksander Alekseev
Hi hackers,

> For the record, Nikita and I agreed offlist that Nikita will join this
> effort as a co-author in order to implement the suggested improvements
> (and perhaps some improvements that were not suggested yet). Meanwhile
> I'm going to keep the current version of the patch up to date with the
> `master` branch.

Here is an updated patch with added Meson support.

-- 
Best regards,
Aleksander Alekseev


v8-0001-Compression-dictionaries-for-JSONB.patch
Description: Binary data


Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Amit Kapila
On Thu, Sep 29, 2022 at 3:20 PM kuroda.hay...@fujitsu.com
 wrote:
>
> Dear Hou,
>
> Thanks for updating patch. I will review yours soon, but I reply to your 
> comment.
>
> > > 04. applyparallelworker.c - LogicalParallelApplyLoop()
> > >
> > > ```
> > > +   shmq_res = shm_mq_receive(mqh, &len, &data, false);
> > > ...
> > > +   if (ConfigReloadPending)
> > > +   {
> > > +   ConfigReloadPending = false;
> > > +   ProcessConfigFile(PGC_SIGHUP);
> > > +   }
> > > ```
> > >
> > >
> > > Here the parallel apply worker waits to receive messages and after 
> > > dispatching
> > > it ProcessConfigFile() is called.
> > > It means that .conf will be not read until the parallel apply worker 
> > > receives new
> > > messages and apply them.
> > >
> > > It may be problematic when users change log_min_message to debugXXX for
> > > debugging but the streamed transaction rarely come.
> > > They expected that detailed description appears on the log from next
> > > streaming chunk, but it does not.
> > >
> > > This does not occur in leader worker when it waits messages from 
> > > publisher,
> > > because it uses libpqrcv_receive(), which works asynchronously.
> > >
> > > I 'm not sure whether it should be documented that the evaluation of GUCs 
> > > may
> > > be delayed, how do you think?
> >
> > I changed the shm_mq_receive to asynchronous mode which is also consistent
> > with
> > what we did for Gather node when reading data from parallel query workers.
>
> I checked your implementation, but it seemed that the parallel apply worker 
> will not sleep
> even if there are no messages or signals. It might be very inefficient.
>
> In the gather node - gather_readnext() - the same way is used, but I think
> there is a premise that the wait-time is short because it is related to only
> one gather node. In terms of the parallel apply worker, however, we cannot
> predict the wait-time because it depends on the streamed transactions. If
> such transactions rarely come, parallel apply workers may consume a lot of
> CPU time.
>
> I think we should wait during short time or until leader notifies, if 
> shmq_res == SHM_MQ_WOULD_BLOCK.
> How do you think?
>

Can't we use WaitLatch in the case of SHM_MQ_WOULD_BLOCK as we are
using it for the same case at some other place in the code? We can use
the same nap time as we are using in the leader apply worker.
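
For illustration, a minimal sketch of that loop shape (the dispatch
function is hypothetical; the 1000ms nap and the LogicalApplyMain wait
event are borrowed from the leader's loop):

/* assumes the usual backend headers: storage/latch.h, storage/shm_mq.h,
 * miscadmin.h */
for (;;)
{
	shm_mq_result res;
	Size		len;
	void	   *data;

	/* Try to read a message without blocking. */
	res = shm_mq_receive(mqh, &len, &data, true);

	if (res == SHM_MQ_SUCCESS)
	{
		apply_dispatch_message(data, len);	/* hypothetical dispatcher */
		continue;
	}
	else if (res == SHM_MQ_DETACHED)
		break;					/* leader is gone */

	/* SHM_MQ_WOULD_BLOCK: sleep until signalled or the nap expires. */
	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 1000L /* same nap time as the leader apply worker */,
					 WAIT_EVENT_LOGICAL_APPLY_MAIN);
	ResetLatch(MyLatch);

	/* GUC reloads and other interrupts now get serviced promptly. */
	if (ConfigReloadPending)
	{
		ConfigReloadPending = false;
		ProcessConfigFile(PGC_SIGHUP);
	}
	CHECK_FOR_INTERRUPTS();
}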

-- 
With Regards,
Amit Kapila.




Re: meson PGXS compatibility

2022-10-06 Thread Peter Eisentraut

On 05.10.22 22:07, Andres Freund wrote:
> My colleague Bilal has set up testing and verified that a few extensions build
> with the pgxs compatibility layer, on linux at least. Currently pg_qualstats,
> pg_cron, hypopg, orafce, postgis, pg_partman work. He also tested pgbouncer,
> but for him that failed both with autoconf and meson generated pgxs.

pgbouncer doesn't use pgxs.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-06 Thread John Naylor
On Thu, Oct 6, 2022 at 2:53 PM Masahiko Sawada 
wrote:
>
> On Wed, Oct 5, 2022 at 6:40 PM John Naylor 
wrote:
> >
> > This wasn't the focus of your current email, but while experimenting
with v6 I had another thought about local allocation: If we use the default
slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit,
right? If so, since aset and DSA also waste at least a few hundred bytes,
we could store a useless 256-byte slot array within node256. That way,
node128 and node256 share the same start of pointers/values array, so there
would be one less branch for getting that address. In v6,
rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses
a jump table for 5 kinds but not for 4), but possibly should be, and the
smaller the better.
>
> It would be good for performance but I'm a bit concerned that it's
> highly optimized to the design of aset and DSA. Since size 2088 will
> be currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
> we introduce a new class of 2304 (=2048 + 256) bytes we cannot store a
> useless 256-byte array and the assumption will be broken.

A new DSA class is hypothetical. A better argument against my idea is that
SLAB_DEFAULT_BLOCK_SIZE is arbitrary. FWIW, I looked at the prototype just
now and the slab block sizes are:

Max(pg_nextpower2_32((MAXALIGN(inner_class_info[i].size) + 16) * 32), 1024)

...which would be 128kB for nodemax: with 8-byte MAXALIGN and the 2088-byte
node size mentioned above, (2088 + 16) * 32 = 67328, which rounds up to the
next power of two, 131072. I'm curious about the difference.

> > One concern is, handling both local and dsa cases in the same code
requires more (predictable) branches and reduces code density. That might
be a reason in favor of templating to handle each case in its own
translation unit.
>
> Right. We also need to support locking for shared radix tree, which
> would require more branches.

Hmm, now it seems we'll likely want to template local vs. shared as a later
step...
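
To make the layout idea quoted above concrete, here is a rough sketch
(struct and field names are hypothetical, not the prototype's; assumes
PG's c.h typedefs and StaticAssertDecl):

typedef struct rt_node
{
	uint8		kind;			/* node4/16/32/128/256 */
	uint16		count;			/* number of children */
} rt_node;

typedef struct rt_node_128
{
	rt_node		base;
	uint8		slot_idx[256];	/* chunk -> index into values[] */
	uint64		values[128];
} rt_node_128;

typedef struct rt_node_256
{
	rt_node		base;
	uint8		unused_slots[256];	/* wasted on purpose, to line up offsets */
	uint64		values[256];	/* indexed directly by chunk */
} rt_node_256;

/* both values arrays start at the same offset, so a reader needs no
 * branch (or out-of-line call) to locate them */
StaticAssertDecl(offsetof(rt_node_128, values) == offsetof(rt_node_256, values),
				 "values arrays must share an offset");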

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Query Jumbling for CALL and SET utility statements

2022-10-06 Thread Drouvot, Bertrand

Hi,

On 10/6/22 8:39 AM, Michael Paquier wrote:
> On Mon, Sep 19, 2022 at 08:29:22AM +0200, Drouvot, Bertrand wrote:
>> Please find attached v6 taking care of the remarks mentioned above.
>
> +            case T_VariableSetStmt:
> +                {
> +                    VariableSetStmt *stmt = (VariableSetStmt *) node;
> +
> +                    /* stmt->name is NULL for RESET ALL */
> +                    if (stmt->name)
> +                    {
> +                        APP_JUMB(stmt->kind);
> +                        APP_JUMB_STRING(stmt->name);
> +                        JumbleExpr(jstate, (Node *) stmt->args);
> +                    }
> +                }
> +                break;
>
> Hmm.  If VariableSetStmt->is_local is not added to the jumble, then
> aren't "SET foo = $1" and "SET LOCAL foo = $1" counted as the same
> query?

Good catch, thanks!
While at it let's also jumble "SET SESSION foo =". For this one, we
would need to record another bool in VariableSetStmt: I'll create a
dedicated patch for that.
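
Presumably the follow-up amounts to something like this sketch (not the
actual patch; the is_session field is the new bool mentioned above):

			case T_VariableSetStmt:
				{
					VariableSetStmt *stmt = (VariableSetStmt *) node;

					/* stmt->name is NULL for RESET ALL */
					if (stmt->name)
					{
						APP_JUMB(stmt->kind);
						APP_JUMB_STRING(stmt->name);
						APP_JUMB(stmt->is_local);	/* SET vs. SET LOCAL */
						APP_JUMB(stmt->is_session); /* the proposed new bool */
						JumbleExpr(jstate, (Node *) stmt->args);
					}
				}
				break;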




> I am not seeing SAVEPOINT, RELEASE, ROLLBACK .. TO SAVEPOINT
> mentioned on this thread.  Would these be worth considering in what
> gets compiled?  That would cover the remaining bits of
> TransactionStmt.  The ODBC driver abuses of savepoints, for example,
> so this could be useful for monitoring purposes in such cases.


Agree. I'll look at those too.



> As the code stands, it could be cleaner to check
> IsJumbleUtilityAllowed() in compute_utility_query_id(), falling back
> to a default in JumbleQuery().  Not that what your patch does is
> incorrect, of course.


Will look at it.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Testing DDL Deparser

2022-10-06 Thread Runqi Tian
Hello:

I’m working on developing a testing harness for the DDL Deparser being
worked on in [1]; please apply the patches in [1] before applying this
patch. I think the testing harness needs to achieve the following
goals:

1. The deparsed JSON output is as expected.
2. The SQL commands re-formed from deparsed JSON should make the same
schema change as the original SQL commands.
3. Any DDL change without modifying the deparser should fail the
testing harness.

Based on these 3 goals, we think the deparser testing harness should
have 2 parts: the first part is unit testing to cover the first two
goals and the second part is integrating deparser test with pg_regress
to cover the third goal.

I think the unit test part can be based on
https://git.postgresql.org/gitweb/?p=postgresql.git;a=tree;f=src/test/modules/test_ddl_deparse;hb=HEAD.
We can improve upon this test by using it to validate the output JSON
and the re-formed SQL from the deparsed JSON.

[2] and [3] proposed using regression tests to test the deparser and
provided an implementation using the TAP framework. I made some changes
which enable testing any SQL file under the folder provided in the
$inputdir variable. This implementation enables us to run test cases
under the regression tests folder, or just any test cases using a SQL
file.

I came up with some ideas during the investigation and want to collect
some feedback:
1, Currently we want to utilize the test cases from regression tests.
However you will find that many test cases end with DROP commands. In
the current deparser testing approach proposed in [2] and [3], we
compare the pg_dump schema results between the original SQL scripts and
the deparser-generated commands. Because most test cases end with a
DROP command, the schema will not be shown in pg_dump, so the test
coverage is vastly reduced. Any suggestion for this problem?

2, We found that DROP commands are not returned by the
pg_event_trigger_ddl_commands() function in the ddl_command_end
trigger, even though the DROP is caught by the ddl_command_end trigger.
Currently, we catch DROP commands in the sql_drop trigger. It’s unclear
why pg_event_trigger_ddl_commands() is designed to not return DROP
commands.

3, For DDL commands unsupported by the deparser, the current
implementation just skips them silently, so we cannot detect
unsupported DDL commands easily. Customers may also want the
deparser-related features like logical replication to be executed in a
strict mode, so that the system can warn them when the deparser cannot
deparse some DDL command. So I propose to introduce a new GUC such as
“StopOnDeparserUnsupportedCommand = true/false” to allow the deparser
to execute in strict mode, in which an unsupported DDL command will
raise an error. A minimal sketch of what this could look like follows.
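
(Sketch only; the GUC name, category, and call site are hypothetical.)

bool		StopOnDeparserUnsupportedCommand = false;

/* entry for ConfigureNamesBool[] in guc.c */
{
	{"stop_on_deparser_unsupported_command", PGC_USERSET, REPLICATION_SENDING,
		gettext_noop("Raises an error when the DDL deparser cannot deparse a command."),
		NULL
	},
	&StopOnDeparserUnsupportedCommand,
	false,
	NULL, NULL, NULL
},

/* and at the deparse call site, instead of skipping silently: */
if (deparsed == NULL && StopOnDeparserUnsupportedCommand)
	ereport(ERROR,
			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
			 errmsg("DDL command is not supported by the deparser")));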

4, We found that the event trigger function
pg_event_trigger_ddl_commands() only returns subcommands, and the
deparser deparses the subcommands returned by this function. So the
deparser works at the subcommand level when using this function, yet it
is designed to deparse the complete command to JSON output. There is a
mismatch here; what do you think about this problem? Should the
deparser work at the subcommand level? Or should we provide some event
trigger function which can return the complete command instead of
subcommands?

Your feedback is appreciated.

Regards,
Runqi Tian
Amazon RDS/Aurora for PostgreSQL

[1] 
https://www.postgresql.org/message-id/CALDaNm0VnaCg__huSDW%3Dn%3D_rSGGES90cpOtqwZeWnA6muoz3oA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAD21AoBVCoPPRKvU_5-%3DwEXsa92GsNJFJOcYyXzvoSEJCx5dKw%40mail.gmail.com
[3] https://www.postgresql.org/message-id/20150215044814.gl3...@alvh.no-ip.org
diff --git a/src/backend/commands/ddl_deparse.c b/src/backend/commands/ddl_deparse.c
index 5bfa800fc2..8cb946f7f1 100755
--- a/src/backend/commands/ddl_deparse.c
+++ b/src/backend/commands/ddl_deparse.c
@@ -6508,6 +6508,42 @@ deparse_DefElem(DefElem *elem, bool is_reset)
 	return set;
 }
 
+/*
+ * Given object_identity and object_type of dropped object, return a JSON representation of DROP command.
+ */
+Datum
+deparse_drop_ddl(PG_FUNCTION_ARGS)
+{
+	text	   *objidentity = PG_GETARG_TEXT_P(0);
+	const char	   *objidentity_str = text_to_cstring(objidentity);
+	text	   *objecttype = PG_GETARG_TEXT_P(1);
+	const char	   *objecttype_str = text_to_cstring(objecttype);
+
+	char		   *command;
+
+	/* A constraint is part of an ALTER TABLE command; no separate DROP is needed. */
+	if (strcmp(objecttype_str, "table constraint") == 0) {
+		PG_RETURN_NULL();
+	} else if (strcmp(objecttype_str, "toast table") == 0) {
+		objecttype_str = "table";
+	}  else if (strcmp(objecttype_str, "default value") == 0) {
+		PG_RETURN_NULL();
+	} else if (strcmp(objecttype_str, "operator of access method") == 0) {
+		PG_RETURN_NULL();
+	} else if (strcmp(objecttype_str, "function of access method") == 0) {
+		PG_RETURN_NULL();
+	} else if (strcmp(objecttype_str, "table column") == 0) {
+		PG_RETURN_NULL();
+	}
+
+	command = deparse_drop_command(objidentity_str, 

Re: Query Jumbling for CALL and SET utility statements

2022-10-06 Thread Drouvot, Bertrand

Hi,

On 9/26/22 12:40 PM, Drouvot, Bertrand wrote:
> let's add it in V7 attached
> (that's safer should the code change later on).

Attached a tiny rebase needed due to 249b0409b1.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 9ac5c87c3a..5d3330a87d 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -311,7 +311,7 @@ FROM pg_stat_statements ORDER BY query COLLATE "C";
  wal_records > $2 as wal_records_generated,   +|   |  |
 |   | 
  wal_records >= rows as wal_records_ge_rows   +|   |  |
 |   | 
  FROM pg_stat_statements ORDER BY query COLLATE "C"|   |  |
 |   | 
- SET pg_stat_statements.track_utility = FALSE  | 1 |0 | f  
 | f | t
+ SET pg_stat_statements.track_utility = $1 | 1 |0 | f  
 | f | t
  UPDATE pgss_test SET b = $1 WHERE a > $2  | 1 |3 | t  
 | t | t
 (7 rows)
 
@@ -462,6 +462,111 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER 
BY query COLLATE "C";
  SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C" 
| 0 |0
 (6 rows)
 
+-- PL/pgSQL procedure and pg_stat_statements.track = top
+CREATE PROCEDURE MINUS_TWO(i INTEGER) AS $$
+DECLARE
+  r INTEGER;
+BEGIN
+  SELECT (i - 1 - 1.0)::INTEGER INTO r;
+END; $$ LANGUAGE plpgsql;
+CREATE PROCEDURE SUM_TWO(i INTEGER, j INTEGER) AS $$
+DECLARE
+  r INTEGER;
+BEGIN
+  SELECT (j + j)::INTEGER INTO r;
+END; $$ LANGUAGE plpgsql;
+SET pg_stat_statements.track = 'top';
+SET pg_stat_statements.track_utility = FALSE;
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+CALL MINUS_TWO(3);
+CALL MINUS_TWO(7);
+CALL SUM_TWO(3, 8);
+CALL SUM_TWO(7, 5);
+SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
+query 
| calls | rows 
+--+---+--
+ CALL MINUS_TWO($1)   
| 2 |0
+ CALL SUM_TWO($1, $2) 
| 2 |0
+ SELECT pg_stat_statements_reset()
| 1 |1
+ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C" 
| 0 |0
+(4 rows)
+
+--
+-- pg_stat_statements.track = all
+--
+SET pg_stat_statements.track = 'all';
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+-- PL/pgSQL procedure and pg_stat_statements.track = all
+-- we drop and recreate the procedures to avoid any caching funnies
+DROP PROCEDURE MINUS_TWO(INTEGER);
+DROP PROCEDURE SUM_TWO(INTEGER, INTEGER);
+CREATE PROCEDURE MINUS_TWO(i INTEGER) AS $$
+DECLARE
+  r INTEGER;
+BEGIN
+  SELECT (i - 1 - 1.0)::INTEGER INTO r;
+END; $$ LANGUAGE plpgsql;
+CREATE PROCEDURE SUM_TWO(i INTEGER, j INTEGER) AS $$
+DECLARE
+  r INTEGER;
+BEGIN
+  SELECT (j + j)::INTEGER INTO r;
+END; $$ LANGUAGE plpgsql;
+CALL MINUS_TWO(3);
+CALL MINUS_TWO(7);
+CALL SUM_TWO(3, 8);
+CALL SUM_TWO(7, 5);
+SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
+query 
| calls | rows 
+--+---+--
+ CALL MINUS_TWO($1)   
| 2 |0
+ CALL SUM_TWO($1, $2) 
| 2 |0
+ SELECT (i - $2 - $3)::INTEGER
| 2 |2
+ SELECT (j + j)::INTEGER  
| 2 |2
+ SELECT pg_stat_statements_reset()
| 1 |1
+ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C" 
| 0 |0
+(6 rows)
+
+SET pg_stat_statements.track_utility = TRUE;
+-- SET
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+set enable_seqscan=false;
+set enable_seqscan=true;
+set seq_page_cost=2.0;
+set seq_page_cost=1.0;
+set enable_seqscan to default;
+set seq_page_cost to default;
+reset seq_page_cost;
+reset enable_seqscan;
+SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE 

Re: Support logical replication of DDLs

2022-10-06 Thread Peter Smith
The patches here are quite large, so for this review post, I have only
done a quick check for cosmetic stuff in the comments of patch
v24-0001.

~

I did this mostly just by cutting/pasting the whole patch text into a
grammar/spell-checker to see what it reported. Please use the same
strategy prior to posting future patch versions, because it will be
way more efficient/easier for the author to spend a few minutes to fix
everything like this up-front before posting, rather than getting a
flood of review comments to deal with (like this post) about such
stuff.

(BTW, most of these suggestions are just verbatim what my
grammar/spell-checker told me)

==

1. Commit comment

(Note #2) Note that, for ATTACH/DETACH PARTITION, we haven't added
extra logic on
subscriber to handle the case where the table on publisher is a PARTITIONED
TABLE while the target table on subcriber side is NORMAL table. We will
research this more and improve this later.

SUGGESTION (minor changes + fix typo)
(Note #2) Note that for ATTACH/DETACH PARTITION we haven't added extra logic on
the subscriber to handle the case where the table on the publisher is
a PARTITIONED
TABLE while the target table on the subscriber is a NORMAL table. We will
research this more and improve it later.

==

2. GENERAL - uppercase the comments

Please make all your single-line comments start with uppercase for consistency.

Here are some examples to fix:

Line  303: + /* add the "ON table" clause */
Line  331: + /* add the USING clause, if any */
Line  349: + /* add the WITH CHECK clause, if any */
Line  653: + /* otherwise, WITH TZ is added by typmod. */
Line  663: + /* otherwise, WITH TZ is added by typmode. */
Line 1946: + /* build list of privileges to grant/revoke */
Line 2017: + /* target objects.  We use object identities here */
Line 2041: + /* list of grantees */
Line 2053: + /* the wording of the grant option is variable ... */
Line 2632: + /* skip this one; we add one unconditionally below */
Line 2660: + /* done */
Line 2768: + /* add HANDLER clause */
Line 2780: + /* add VALIDATOR clause */
Line 2792: + /* add an OPTIONS clause, if any */
Line 2845: + /* add HANDLER clause */
Line 2859: + /* add VALIDATOR clause */
Line 2877: + /* add an OPTIONS clause, if any */
Line 5024: + /* add the rest of the stuff */
Line 5040: + /* add the rest of the stuff */
Line 5185: + /* a new constraint has a name and definition */
Line 5211: + /* done */
Line 6124: + /* add a CONCURRENTLY clause */
Line 6127: + /* add the matview name */
Line 6131: + /* add a WITH NO DATA clause */
Line 6302: + /* reloptions */
Line 6310: + /* tablespace */
Line 6592: + /* deconstruct the name list */
Line 6636: + /* deconstruct the name list */
Line 6725: + /* obtain object tuple */
Line 6729: + /* obtain namespace */

--

3.  Grammar/Spelling

3a. - format_type_detailed

+ * - nspid is the schema OID.  For certain SQL-standard types which have weird
+ *   typmod rules, we return InvalidOid; caller is expected to not schema-
+ *   qualify the name nor add quotes to the type name in this case.

"caller" -> "the caller"

~

3b. - format_type_detailed

+ * - typemodstr is set to the typemod, if any, as a string with parens

"parens" -> "parentheses"

~

3c. - format_type_detailed

+ else
+ /* otherwise, WITH TZ is added by typmode. */
+ *typname = pstrdup("TIME");

"typmode" -> "typmod" ?

~

3d. - new_objtree_for_qualname

+ * A helper routine to setup %{}D and %{}O elements.

"setup" -> "set up"

~

3e. - new_objtree_for_type

+/*
+ * A helper routine to setup %{}T elements.
+ */
+static ObjTree *
+new_objtree_for_type(Oid typeId, int32 typmod)

"setup" -> "set up"

~

3f. - pg_get_indexdef_detailed
+/*
+ * Return an index definition, split in several pieces.
+ *
+ * A large amount of code is duplicated from  pg_get_indexdef_worker, but
+ * control flow is different enough that it doesn't seem worth keeping them
+ * together.
+ */
+static void
+pg_get_indexdef_detailed(Oid indexrelid,

"split in" -> "split into"

~

3g. - ddl_deparse_to_json

+ * The command is expanded fully, so that there are no ambiguities even in the
+ * face of search_path changes.
+ */
+Datum
+ddl_deparse_to_json(PG_FUNCTION_ARGS)

"fully, so" -> "fully so"

~

3h. -deparse_AlterFunction

+ * Given a function OID and the parsetree that created it, return the JSON
+ * blob representing the alter command.
+ */

"the parsetree" -> "the parse tree"

~

3i. - deparse_AlterObjectSchemaStmt

+/*
+ * deparse an ALTER ... SET SCHEMA command.
+ */
+static ObjTree *
+deparse_AlterObjectSchemaStmt(ObjectAddress address, Node *parsetree,

"deparse" -> "Deparse"

~

3j. deparse_AlterObjectSchemaStmt

+ /*
+ * Since the command has already taken place from the point of view of
+ * catalogs, getObjectIdentity returns the object name with the already
+ * changed schema.  The output of our deparsing must return the original
+ * schema name however, so we chop the schema name off the identity string
+ * and then 

fix comment typo in xlogprefetcher.c

2022-10-06 Thread kato-...@fujitsu.com
Hello

I found a comment typo in xlogprefetcher.c.
Any thoughts?

--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -19,7 +19,7 @@
  * avoid a second buffer mapping table lookup.
  *
  * Currently, only the main fork is considered for prefetching.  Currently,
- * prefetching is only effective on systems where BufferPrefetch() does
+ * prefetching is only effective on systems where PrefetchBuffer() does
  * something useful (mainly Linux).
  *
  *-

regards sho kato


fix-typo-comment.patch
Description: fix-typo-comment.patch


Re: use has_privs_of_role() for pg_hba.conf

2022-10-06 Thread Michael Paquier
On Mon, Apr 04, 2022 at 07:25:51AM -0700, Nathan Bossart wrote:
> On Mon, Apr 04, 2022 at 09:36:13AM -0400, Joshua Brindle wrote:
>> Good catch, I think this is a logical followup to the previous
>> has_privs_of_role patch.
>> 
>> Reviewed and +1
> 
> Thanks!  I created a commitfest entry for this:

This patch looks simple, but it is a very sensitive area so I think
that we should be really careful.  pg_hba.conf does not have a lot of
test coverage, so I'd really prefer if we add something to see the
difference of behavior and check the behavior that we are switching
here.  What I have just committed in 051b096 would help a bit here,
actually, and changing pg_hba.conf rules with rule reload is cheap.

Joe, you are registered as a reviewer and committer of this patch, by
the way.  Are you planning to look at it?
--
Michael


signature.asc
Description: PGP signature


Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Masahiko Sawada
On Tue, Sep 27, 2022 at 9:26 PM houzj.f...@fujitsu.com
 wrote:
>
> On Saturday, September 24, 2022 7:40 PM Amit Kapila  
> wrote:
> >
> > On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila 
> > wrote:
> > >
> > > On Thu, Sep 22, 2022 at 8:59 AM wangw.f...@fujitsu.com
> > >  wrote:
> > > >
> > >
> > > Few comments on v33-0001
> > > ===
> > >
> >
> > Some more comments on v33-0001
> > =
> > 1.
> > + /* Information from the corresponding LogicalRepWorker slot. */
> > + uint16 logicalrep_worker_generation;
> > +
> > + int logicalrep_worker_slot_no;
> > +} ParallelApplyWorkerShared;
> >
> > Both these variables are read/changed by leader/parallel workers without
> > using any lock (mutex). It seems currently there is no problem because of 
> > the
> > way the patch is using in_parallel_apply_xact but I think it won't be a 
> > good idea
> > to rely on it. I suggest using mutex to operate on these variables and also 
> > check
> > if the slot_no is in a valid range after reading it in 
> > parallel_apply_free_worker,
> > otherwise error out using elog.
>
> Changed.
>
> > 2.
> >  static void
> >  apply_handle_stream_stop(StringInfo s)
> >  {
> > - if (!in_streamed_transaction)
> > + ParallelApplyWorkerInfo *winfo = NULL; TransApplyAction apply_action;
> > +
> > + if (!am_parallel_apply_worker() &&
> > + (!in_streamed_transaction && !stream_apply_worker))
> >   ereport(ERROR,
> >   (errcode(ERRCODE_PROTOCOL_VIOLATION),
> >   errmsg_internal("STREAM STOP message without STREAM START")));
> >
> > This check won't be able to detect missing stream start messages for 
> > parallel
> > apply workers apart from the first pair of start/stop. I thought of adding
> > in_remote_transaction check along with
> > am_parallel_apply_worker() to detect the same but that also won't work
> > because the parallel worker doesn't reset it at the stop message.
> > Another possibility is to introduce yet another variable for this but that 
> > doesn't
> > seem worth it. I would like to keep this check simple.
> > Can you think of any better way?
>
> I feel we can reuse the in_streamed_transaction in parallel apply worker to
> simplify the check there. I tried to set this flag in parallel apply worker
> when stream starts and reset it when stream stop so that we can directly check
> this flag for duplicate stream start message and other related things.
>
> > 3. I think we can skip sending start/stop messages from the leader to the
> > parallel worker because unlike apply worker it will process only one
> > transaction-at-a-time. However, it is not clear whether that is worth the 
> > effort
> > because it is sent after logical_decoding_work_mem changes. For now, I have
> > added a comment for this in the attached patch but let me if I am missing
> > something or if I am wrong.
>
> I think the suggested comments look good.
>
> > 4.
> > postgres=# select pid, leader_pid, application_name, backend_type from
> > pg_stat_activity;
> >   pid  | leader_pid | application_name | backend_type
> > ---++--+--
> >  27624 ||  | logical replication launcher
> >  17336 || psql | client backend
> >  26312 ||  | logical replication worker
> >  26376 || psql | client backend
> >  14004 ||  | logical replication worker
> >
> > Here, the second worker entry is for the parallel worker. Isn't it better 
> > if we
> > distinguish this by keeping type as a logical replication parallel worker? 
> > I think
> > for this you need to change bgw_type in logicalrep_worker_launch().
>
> Changed.
>
> > 5. Can we name parallel_apply_subxact_info_add() as
> > parallel_apply_start_subtrans()?
> >
> > Apart from the above, I have added/edited a few comments and made a few
> > other cosmetic changes in the attached.
>

While looking at v35 patch, I realized that there are some cases where
the logical replication gets stuck depending on partitioned table
structure. For instance, there are following tables, publication, and
subscription:

* On publisher
create table p (c int) partition by list (c);
create table c1 partition of p for values in (1);
create table c2 (c int);
create publication test_pub for table p, c1, c2 with
(publish_via_partition_root = 'true');

* On subscriber
create table p (c int) partition by list (c);
create table c1 partition of p for values in (2);
create table c2 partition of p for values in (1);
create subscription test_sub connection 'port=5551 dbname=postgres'
publication test_pub with (streaming = 'parallel', copy_data =
'false');

Note that while both the publisher and the subscriber have tables with
the same names, the partition structure is different and rows go to a
different table on the subscriber (e.g., a row with c=1 will go to the
c2 table on the subscriber). If two concurrent transactions are
executed as follows, the apply worker 

Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-10-06 Thread Bharath Rupireddy
On Thu, Oct 6, 2022 at 2:30 AM Bruce Momjian  wrote:
>
> As I highlighted above, by default you notify the administrator that a
> synchronous replica is not responding and then ignore it.  If it becomes
> responsive again, you notify the administrator again and add it back as
> a synchronous replica.
>
> > command in any form may pose security risks. I'm not sure at this
> > point how this new timeout is going to work alongside
> > wal_sender_timeout.
>
> We have archive_command, so I don't see a problem with another shell
> command.

Why do we need a new command to inform the admin/user about a sync
replica being ignored (removed from the sync quorum) for not responding
or acknowledging for a certain amount of time in SyncRepWaitForLSN()?
Can't we just add an extra column or use the existing sync_state in
pg_stat_replication? We can either introduce a new state such as
temporary_async or just use the existing state 'potential' [1]. A
problem is that the server has to be monitored for this extra, new
state. If we do this, we don't need another command to report.
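
For illustration, the monitoring-side change could be as small as one
more state (sketch only; the enum name and where the demotion happens
are hypothetical):

typedef enum WalSndSyncState
{
	SYNC_STATE_ASYNC,			/* "async" */
	SYNC_STATE_POTENTIAL,		/* "potential" */
	SYNC_STATE_SYNC,			/* "sync" */
	SYNC_STATE_QUORUM,			/* "quorum" */
	SYNC_STATE_TEMP_ASYNC		/* "temporary_async": dropped from the
								 * quorum after an ack timeout (proposed) */
} WalSndSyncState;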

> > I'm thinking about the possible options that an admin has to get out
> > of this situation:
> > 1) Removing the standby from synchronous_standby_names.
>
> Yes, see above.  We might need a read-only GUC that reports which
> synchronous replicas are active.  As you can see, there is a lot of API
> design required here, but this is the most effective approach.

If we use the above approach to report via pg_stat_replication(), we
don't need this.

> > > Once we have that, we can consider removing the cancel ability while
> > > waiting for synchronous replicas (since we have the timeout) or make it
> > > optional.  We can also consider how to notify the administrator during
> > > query cancel (if we allow it), backend abrupt exit/crash, and
> >
> > Yeah. If we have the
> > timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> > the users can then choose to disable processing query cancels/proc
> > dies while waiting for sync replication in SyncRepWaitForLSN().
>
> Yes.  We might also change things so a query cancel that happens during
> synchronous replica waiting can only be done by an administrator, not the
> session owner.  Again, lots of design needed here.

Yes, we need infrastructure to track who issued the query cancel or
proc die and so on. IMO, it's not a good approach to allow/disallow
query cancels or CTRL+C based on role types - superusers or users with
replication roles or users who are members of any of the predefined
roles.

In general, it is the walsender serving the sync standby that has to
mark itself as an async standby by removing itself from
synchronous_standby_names, reloading config variables, and waking up
the backends that are waiting in the syncrep wait queue for it to
update the LSN.

And, the new auto removal timeout should always be set to less than
wal_sender_timeout.

All that said, imagine we have the
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution
in one form or another, with the auto removal timeout set to 5 minutes.
Any of the following can happen:

1) A query is stuck waiting for the sync standby's ack in
SyncRepWaitForLSN(), no query cancel or proc die interrupt arrives, and
the sync standby is demoted to an async standby after the timeout,
i.e., 5 minutes.
2) A query is stuck waiting for the sync standby's ack in
SyncRepWaitForLSN(), say for about 3 minutes, and then a query cancel
or proc die interrupt arrives. Should we process it immediately, or
wait for the timeout to happen (2 more minutes) and then process the
interrupt? If we process the interrupts immediately, then the
locally-committed-but-not-replicated-to-sync-standby problems
described upthread [2] are left unresolved.

[1] 
https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-VIEW
sync_state text
Synchronous state of this standby server. Possible values are:
async: This standby server is asynchronous.
potential: This standby server is now asynchronous, but can
potentially become synchronous if one of current synchronous ones
fails.
sync: This standby server is synchronous.
quorum: This standby server is considered as a candidate for quorum standbys.

[2] 
https://www.postgresql.org/message-id/CALj2ACXmMWtpmuT-%3Dv8F%2BLk4QCbdkeN%2ByHKXeRGKFfjG96YbKA%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



