Re: Assert in pageinspect with NULL pages

2022-02-22 Thread Daria Lepikhova



18.02.2022 08:00, Justin Pryzby wrote:

BRIN can also crash if passed a non-brin index.

I've been sitting on this one for a while.  Feel free to include it in your
patchset.

commit 08010a6037fc4e24a9ba05e5386e766f4310d35e
Author: Justin Pryzby 
Date:   Tue Jan 19 00:25:15 2021 -0600

 pageinspect: brin_page_items(): check that given relation is not only an 
index, but a brin one

diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c
index f1e64a39ef2..3de6dd943ef 100644
--- a/contrib/pageinspect/brinfuncs.c
+++ b/contrib/pageinspect/brinfuncs.c
@@ -16,6 +16,7 @@
  #include "access/brin_tuple.h"
  #include "access/htup_details.h"
  #include "catalog/index.h"
+#include "catalog/pg_am.h"
  #include "catalog/pg_type.h"
  #include "funcapi.h"
  #include "lib/stringinfo.h"
@@ -59,7 +60,7 @@ brin_page_type(PG_FUNCTION_ARGS)
if (raw_page_size != BLCKSZ)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-errmsg("input page too small"),
+errmsg("input page wrong size"),
 errdetail("Expected size %d, got %d",
   BLCKSZ, raw_page_size)));
  
@@ -97,7 +98,7 @@ verify_brin_page(bytea *raw_page, uint16 type, const char *strtype)

if (raw_page_size != BLCKSZ)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-errmsg("input page too small"),
+errmsg("input page wrong size"),
 errdetail("Expected size %d, got %d",
   BLCKSZ, raw_page_size)));
  
@@ -169,7 +170,14 @@ brin_page_items(PG_FUNCTION_ARGS)

MemoryContextSwitchTo(oldcontext);
  
  	indexRel = index_open(indexRelid, AccessShareLock);

-   bdesc = brin_build_desc(indexRel);
+
+   /* Must be a BRIN index */
+   if (indexRel->rd_rel->relkind != RELKIND_INDEX ||
+   indexRel->rd_rel->relam != BRIN_AM_OID)
+   ereport(ERROR,
+   (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+errmsg("\"%s\" is not a BRIN index",
+RelationGetRelationName(indexRel))));
  
  	/* minimally verify the page we got */

page = verify_brin_page(raw_page, BRIN_PAGETYPE_REGULAR, "regular");
@@ -178,6 +186,7 @@ brin_page_items(PG_FUNCTION_ARGS)
 * Initialize output functions for all indexed datatypes; simplifies
 * calling them later.
 */
+   bdesc = brin_build_desc(indexRel);
columns = palloc(sizeof(brin_column_state *) * 
RelationGetDescr(indexRel)->natts);
for (attno = 1; attno <= bdesc->bd_tupdesc->natts; attno++)
{



Thanks! This is a very valuable note. I think I will definitely add it 
to the next version of the patch, as soon as I deal with the error 
handling.


--

Daria Lepikhova
Postgres Professional





Re: Assert in pageinspect with NULL pages

2022-02-22 Thread Daria Lepikhova

First of all, thanks for the review and remarks!

18.02.2022 08:02, Michael Paquier wrote:

On Thu, Feb 17, 2022 at 05:40:41PM +0800, Julien Rouhaud wrote:

About the patch, it's incorrectly using a hardcoded 8192 block-size rather than
using the computed :block_size (or computing one when it's not already the
case).

Well, the tests of pageinspect would already fail when using a
different page size than 8k, like the checksum ones :)

Anyway, I agree with your point that this is not a reason to not
make the tests more portable if we can do so easily.


Fixed.


I'm also wondering if it wouldn't be better to return NULL rather than throwing
an error for uninitialized pages.

Agreed that this is a sensible choice.  NULL would be helpful for the
case where one compiles all the checksums of a relation with a full
scan based on the relation size, for example.  All these behaviors
ought to be documented properly, as well.  For SRFs, this should just
return no rows rather than generating an error.


So, I tried to implement this remark. However, replacing the thrown 
error with a NULL return reproduced the problem that Alexander Lakhin 
mentioned, and I need a little more time to figure it out.



And one more addition. In the previous version of the patch, I forgot to 
add tests for the gist index, but the described problem is also relevant 
for it.


--
Daria Lepikhova
Postgres Professional
From 7dbfa7f01bf485d2812b24af012721b4a1e1e785 Mon Sep 17 00:00:00 2001
From: "d.lepikhova" 
Date: Tue, 22 Feb 2022 16:25:31 +0500
Subject: [PATCH] Add checks for null pages into pageinspect

---
 contrib/pageinspect/brinfuncs.c   | 10 ++
 contrib/pageinspect/btreefuncs.c  | 10 ++
 contrib/pageinspect/expected/brin.out | 13 +
 contrib/pageinspect/expected/btree.out|  5 +
 contrib/pageinspect/expected/checksum.out |  7 +++
 contrib/pageinspect/expected/gin.out  |  7 +++
 contrib/pageinspect/expected/gist.out |  5 +
 contrib/pageinspect/expected/hash.out |  9 +
 contrib/pageinspect/rawpage.c | 10 ++
 contrib/pageinspect/sql/brin.sql  |  6 ++
 contrib/pageinspect/sql/btree.sql |  3 +++
 contrib/pageinspect/sql/checksum.sql  |  3 +++
 contrib/pageinspect/sql/gin.sql   |  5 +
 contrib/pageinspect/sql/gist.sql  |  5 +
 contrib/pageinspect/sql/hash.sql  |  6 ++
 15 files changed, 104 insertions(+)

diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c
index 50892b5cc20..a2fee13e2d1 100644
--- a/contrib/pageinspect/brinfuncs.c
+++ b/contrib/pageinspect/brinfuncs.c
@@ -63,6 +63,10 @@ brin_page_type(PG_FUNCTION_ARGS)
  errdetail("Expected size %d, got %d",
 		   BLCKSZ, raw_page_size)));
 
+	/* check that the page is initialized before accessing any fields */
+	if (PageIsNew(page))
+		PG_RETURN_NULL();
+
 	switch (BrinPageType(page))
 	{
 		case BRIN_PAGETYPE_META:
@@ -103,6 +107,12 @@ verify_brin_page(bytea *raw_page, uint16 type, const char *strtype)
 
 	page = VARDATA(raw_page);
 
+	/* check that the page is initialized before accessing any fields */
+	if (PageIsNew(page))
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page is empty")));
+
 	/* verify the special space says this page is what we want */
 	if (BrinPageType(page) != type)
 		ereport(ERROR,
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 03debe336ba..5ea7af5b998 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -519,6 +519,12 @@ bt_page_items_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
 		UnlockReleaseBuffer(buffer);
 		relation_close(rel, AccessShareLock);
 
+		/* check that the page is initialized before accessing any fields */
+		if (PageIsNew(uargs->page))
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+	errmsg("input page is empty")));
+
 		uargs->offset = FirstOffsetNumber;
 
 		opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);
@@ -615,6 +621,10 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs->page = VARDATA(raw_page);
 
+		/* check that the page is initialized before accessing any fields */
+		if (PageIsNew(uargs->page))
+			PG_RETURN_NULL();
+
 		uargs->offset = FirstOffsetNumber;
 
 		opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);
diff --git a/contrib/pageinspect/expected/brin.out b/contrib/pageinspect/expected/brin.out
index 71eb190380c..2839ee9435b 100644
--- a/contrib/pageinspect/expected/brin.out
+++ b/contrib/pageinspect/expected/brin.out
@@ -48,4 +48,17 @@ SELECT * FROM brin_page_items(get_raw_page('test1_a_idx', 2), 'test1_a_idx')
   1 |  0 |  1 | f| f| f   | {1 .. 1}
 (1 row)
 
+-- Check that all functions fail gracefully with an empty page input
+select brin_page_type(repeat(E'\\000', 

Re: Failed transaction statistics to measure the logical replication progress

2022-02-22 Thread Amit Kapila
On Tue, Feb 22, 2022 at 6:45 AM tanghy.f...@fujitsu.com
 wrote:
>
> I found a problem when using it. When a replication worker exits, the
> transaction stats should be sent to stats collector if they were not sent 
> before
> because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats weren't
> updated as expected.
>
> I looked into it and found that the replication worker would send the
> transaction stats (if any) before it exits. But it got invalid subid in
> pgstat_send_subworker_xact_stats(), which led to the following result:
>
> postgres=# select pg_stat_get_subscription_worker(0, null);
>  pg_stat_get_subscription_worker
> ---------------------------------
>  (0,,2,0,00,"",)
> (1 row)
>
> I think that's because subid has already been cleaned when trying to send the
> stats. I printed the value of before_shmem_exit_list, the functions in this 
> list
> would be called in shmem_exit() when the worker exits.
> logicalrep_worker_onexit() would clean up the worker info (including subid), 
> and
> pgstat_shutdown_hook() would send stats if any.  logicalrep_worker_onexit() 
> was
> called before calling pgstat_shutdown_hook().
>

Yeah, I think that is a problem and maybe we can think of solving it
by sending the stats via logicalrep_worker_onexit before subid is
cleared but not sure that is a good idea. I feel we need to go back to
the idea of v21 for sending stats instead of using pgstat_report_stat.
I think the one thing which we could improve is to avoid trying to
send it each time before receiving each message by walrcv_receive and
rather try to send it before we try to wait (WaitLatchOrSocket).
Trying after each message doesn't seem to be required and could lead
to some overhead as well. What do you think?

-- 
With Regards,
Amit Kapila.




Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit

2022-02-22 Thread Etsuro Fujita
On Tue, Feb 22, 2022 at 1:03 AM Fujii Masao  wrote:
> On 2022/02/21 14:45, Etsuro Fujita wrote:
> > On Fri, Feb 18, 2022 at 1:46 AM Fujii Masao  
> > wrote:
> >> I reviewed the 0001 patch. It looks good to me except the following minor 
> >> things. If these are addressed, I think that the 0001 patch can be marked 
> >> as ready for committer.

> >> +* Also determine to commit (sub)transactions opened on the remote 
> >> server
> >> +* in parallel at (sub)transaction end.
> >>
> >> Like the comment "Determine whether to keep the connection ...", 
> >> "determine to commit" should be "determine whether to commit"?
> >
> > Agreed.  I’ll change it as such.

Done.

> Thanks! If that's updated, IMO it's ok to commit the 0001 patch.

Cool!  Attached is an updated patch.  Changes other than that:
1) I added the curlevel parameter to
pgfdw_finish_pre_subcommit_cleanup() to avoid doing
GetCurrentTransactionNestLevel() there, as proposed, and 2) tweaked
comments a bit further, mostly for/in
pgfdw_finish_pre_commit_cleanup() and
pgfdw_finish_pre_subcommit_cleanup().  Barring objections, I’ll commit
the patch.

Thanks for reviewing!

Best regards,
Etsuro Fujita


v5-0001-postgres-fdw-Add-support-for-parallel-commit.patch
Description: Binary data


Handle infinite recursion in logical replication setup

2022-02-22 Thread vignesh C
Hi,

In logical replication, the walsender currently sends both the data that
is generated locally and the data that was replicated from other
instances. This results in infinite recursion in a circular logical
replication setup.
Here the user is trying to set up 2-way replication, with node1
publishing data to node2 and node2 publishing data to node1, so that
DML operations can be performed from either node; it acts as a
2-way multi-master replication setup.

This problem can be reproduced with the following steps:
-- Instance 1
create table t1(c1 int);
create publication pub1 for table t1;

-- Instance 2
create table t1(c1 int);
create publication pub2 for table t1;
create subscription sub1 CONNECTION 'dbname=postgres port=5432'
publication pub1;

-- Instance 1
create subscription sub2 CONNECTION 'dbname=postgres port=5433'
publication pub2; insert into t1 values(10);

In this scenario, the Walsender in publisher pub1 sends data to the
apply worker in subscriber sub1, the apply worker in sub1 maps the
data to local tables and applies the individual changes as they are
received. Then the Walsender in publisher pub2 sends data to the apply
worker in subscriber sub2, the apply worker in sub2 maps the data to
local tables and applies the individual changes as they are received.
This process repeats infinitely.

Currently we do not differentiate between locally generated data and
replicated data; we send both, which causes the infinite recursion.

We could see that the record count increased significantly within a short time:
select count(*) from t1;
  count
--
 400
(1 row)

If the table had primary key constraint, we could notice that the
first insert is successful and when the same insert is sent back, the
insert fails because of constraint error:
2022-02-23 09:28:43.592 IST [14743] ERROR:  duplicate key value
violates unique constraint "t1_pkey"
2022-02-23 09:28:43.592 IST [14743] DETAIL:  Key (c1)=(10) already exists.
2022-02-23 09:28:43.592 IST [14743] CONTEXT:  processing remote data
during "INSERT" for replication target relation "public.t1" in
transaction 727 at 2022-02-23 09:28:43.406738+05:30
2022-02-23 09:28:43.593 IST [14678] LOG:  background worker "logical
replication worker" (PID 14743) exited with exit code 1
2022-02-23 09:28:48.608 IST [14745] LOG:  logical replication apply
worker for subscription "sub2" has started
2022-02-23 09:28:48.624 IST [14745] ERROR:  duplicate key value
violates unique constraint "t1_pkey"
2022-02-23 09:28:48.624 IST [14745] DETAIL:  Key (c1)=(10) already exists.
2022-02-23 09:28:48.624 IST [14745] CONTEXT:  processing remote data
during "INSERT" for replication target relation "public.t1" in
transaction 727 at 2022-02-23 09:28:43.406738+05:30
2022-02-23 09:28:48.626 IST [14678] LOG:  background worker "logical
replication worker" (PID 14745) exited with exit code 1

The same problem can occur in any circular setup with more nodes, e.g.
3 or 4 nodes: a) node1 publishing to node2, b) node2 publishing to
node3, c) node3 publishing back to node1.

Here there are two problems for the user: a) incremental
synchronization, where the walsender sends both local and replicated
data, and b) initial table synchronization, where the COPY command
sends both local and replicated data.

The first problem, incremental synchronization of tables by the
walsender, can be solved as follows:
Currently, locally generated data has no replication origin associated
with it, while data that originated on another instance does. We could
use this information to differentiate locally generated data from
replicated data and send only the locally generated data. This
"only_local" behavior could be provided as an option while the
subscription is created:
ex: CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres port=5433'
PUBLICATION pub1 with (only_local = on);

I have attached a basic patch for this; if the idea is accepted, I
will work further to test more scenarios, add documentation, and post
an updated patch.
The second problem is the initial table synchronization, which copies
both local data and replicated data using the COPY command.

Let us consider the following scenario:
a) node1 publishing to node2 b) node2 publishing to node1. Here in
this case node1 will have replicated data from node2 and vice versa.

Now suppose the user wants node3 to subscribe to data from node2. The
user will have to create a subscription on node3 to get the data from
node2. During table synchronization we send the complete table data
from node2 to node3. Node2 will have its own local data and also data
replicated from node1. Currently we don't have a way to differentiate
between locally generated data and replicated data in the heap, which
will cause the infinite recursion described above.

To handle this, if the user has specified the only_local option, we
could throw a warning or error out while creating the subscription; in
this case, we could have a 

RE: logical replication empty transactions

2022-02-22 Thread wangw.f...@fujitsu.com
On Wed, Feb 23, 2022 at 10:58 PM Ajin Cherian  wrote:
>
A few comments on v19-0001:

1. I think we should adjust the alignment format.
git am ../v19-0001-Skip-empty-transactions-for-logical-replication.patch
.git/rebase-apply/patch:197: indent with spaces.
* Before we send schema, make sure that STREAM START/BEGIN/BEGIN PREPARE
.git/rebase-apply/patch:198: indent with spaces.
* is sent. If not, send now.
.git/rebase-apply/patch:199: indent with spaces.
*/
.git/rebase-apply/patch:201: indent with spaces.
   pgoutput_send_stream_start(ctx, toptxn);
.git/rebase-apply/patch:204: indent with spaces.
   pgoutput_begin(ctx, toptxn);
warning: 5 lines add whitespace errors.

2. Structure member initialization.
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+   PGOutputTxnData*txndata = MemoryContextAllocZero(ctx->context,
+   
 sizeof(PGOutputTxnData));
+
+   txndata->sent_begin_txn = false;
+   txn->output_plugin_private = txndata;
+}
Do we need to set sent_stream_start and sent_any_stream to false here?

3. Maybe we should add Assert(txndata) like function pgoutput_commit_txn in
other functions.

4. In addition, I think we should keep a unified style.
a). log style (maybe first one is better.)
First style  : "Skipping replication of an empty transaction in XXX"
Second style : "skipping replication of an empty transaction"
b) flag name (maybe second one is better.)
First style  : variable "sent_begin_txn" in function pgoutput_stream_*.
Second style : variable "skip" in function pgoutput_commit_txn.


Regards,
Wang wei


Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread Peter Smith
Below are my review comments just for the v1 patch test code.

==

1. "table sync error" versus "tablesync error"

+# Wait for the table sync error to be reported.
+$node_subscriber->poll_query_until(
+ 'postgres',
+ qq[
+SELECT apply_error_count = 0 AND sync_error_count > 0
+FROM pg_stat_subscription_activity
+WHERE subname = 'tap_sub'
+]) or die "Timed out while waiting for table sync error";
+
+# Truncate test_tab1 so that table sync can continue.
+$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;");
+
 # Wait for initial table sync for test_tab1 to finish.

IMO all these "table sync" should be changed to "tablesync", because a
table "sync error" sounds like something completely different to a
"tablesync error".

SUGGESTIONS
- "Wait for the table sync error to be reported." --> "Wait for the
tablesync error to be reported."
- "Timed out while waiting for table sync error" --> "Timed out while
waiting for tablesync error"
- "Truncate test_tab1 so that table sync can continue." -->  "Truncate
test_tab1 so that the tablesync worker can run to completion."
- "Wait for initial table sync for test_tab1 to finish." --> "Wait for
the tablesync worker of test_tab1 to finish."

~~~

2. Unnecessary INSERT VALUES (2)?

(I think this is a duplicate of what [Osumi] #6 reported)

+# Insert more data to test_tab1 on the subscriber and then on the
publisher, raising an
+# error on the subscriber due to violation of the unique constraint
on test_tab1.
+$node_subscriber->safe_psql('postgres', "INSERT INTO test_tab1 VALUES (2)");

Why does the test do INSERT data (2)? There is already data (1) from
the tablesync which will cause an apply worker PK violation when
another VALUES (1) is published.

Note, the test comment also needs to change...

~~~

3. Wait for the apply worker error

+# Wait for the apply error to be reported.
+$node_subscriber->poll_query_until(
+ 'postgres',
+ qq[
+SELECT apply_error_count > 0 AND sync_error_count > 0
+FROM pg_stat_subscription_activity
+WHERE subname = 'tap_sub'
+]) or die "Timed out while waiting for apply error";

This test is only for apply worker errors. So why is the test SQL
checking "AND sync_error_count > 0"?

(This is similar to what [Tang] #2 reported, but I think she was
referring to the other tablesync test)

~~~

4. Wrong worker?

(looks like a duplicate of what [Osumi] #5 already)

+
+# Truncate test_tab1 so that table sync can continue.
+$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;");

 $node_subscriber->stop('fast');
 $node_publisher->stop('fast');

Cut/paste error? Aren't you doing TRUNCATE here so the apply worker
can continue, not the tablesync worker (which already completed)?

"Truncate test_tab1 so that table sync can continue." --> "Truncate
test_tab1 so that the apply worker can continue."

--
[Osumi] 
https://www.postgresql.org/message-id/CAD21AoBRt%3DcyKsZP83rcMkHnT498gHH0TEP34fZBrGCxT-Ahwg%40mail.gmail.com
[Tang] 
https://www.postgresql.org/message-id/TYCPR01MB612840D018FEBD38268CC83BFB3C9%40TYCPR01MB6128.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Allow file inclusion in pg_hba and pg_ident files

2022-02-22 Thread Julien Rouhaud
Hi,

I recently had to work on some internal tool which, among other things, has to
merge pg_hba.conf and pg_ident.conf files between an existing (possibly
already updated) instance and some external repository.

My biggest concern is that any issue in either the external repository content
or the tool could leave the instance entirely unreachable.  There are some
attempts at protection to avoid that, but there really isn't any practical
way to make sure that certain entries are left alone, or not hidden, no
matter what.

To address that, I'd like to propose the possibility to include files in hba
and ident configuration files.  This was already discussed in the past, and in
my understanding this is mostly wanted, though some people expressed concerns
about a use case that wouldn't rely on thousands of entries.

In my case, there shouldn't be more than a dozen generated pg_hba/pg_ident
lines.  All I want is to have my main hba (and ident) files do something like:

include hba_forbid_non_ssl.conf
include hba_superuser.conf
include hba_replication.conf
include hba_application_generated.conf

So that the tool has a way to make sure that it won't prevent dba login or
replication, or allow unsecure connections, and can simply rewrite its target
file rather than merging it, sorting it with weird rules and deciding which
lines are ok to be removed and which one aren't.

I'm attaching a patchset that implements $SUBJECT:

0001 adds a new pg_ident_file_mappings view, which is basically the same as
pg_hba_file_rules view but for mappings.  It's probably already useful, for
instance if you need to tweak some regexp.

0002 adds a new "include" directive to both auth files, which will include the
given file at this exact position.  I chose to do this is in the tokenization
rather than the parsing, as it makes things simpler in general, and also allows
a single code path for both files.  It also adds 2 new columns to the
pg_hba_file_rules and pg_ident_file_mappings views: file_name and
(rule|mapping)_number, so it's possible to identify where a rule is coming from
and more importantly which one has priority over the other.  Note that I only
added "include", but maybe people would want something like include_dir or
include_if_exists too.

Both patches have updated documentation, but no test yet.  If there's an
agreement on the feature, I plan to add some TAP test for it.

Finally I also added 0003, which is a POC for a new pg_hba_matches() function,
that can help DBA to understand why their configuration isn't working as they
expect.  This only to start the discussion on that topic, the code is for now
really hackish, as I don't know how much this is wanted and/or if some other
behavior would be better, and there's also no documentation or test.  The
function for now only takes an optional inet (null means unix socket), the
target role and an optional ssl flag and returns the file, line and raw line
matching if any, or null.  For instance:

=# select * from pg_hba_matches('127.0.0.1'::inet, 'postgres');
-[ RECORD 1 ]-------------------------------------
file_name | /tmp/pgbuild/toto.conf
line_num  | 5
raw_line  | host    all    all    127.0.0.1/32    trust

I'm wondering for instance if it would be better to (optionally?) keep
iterating over the lines and print all lines that would have matched and in
which order, or if the function should return the parsed line information
rather than the raw line, although it could make the output worse if using huge
secondary auth files.  I'm also not sure if the various possible flags (ssl,
gss) should be explicitly set or if the function should try various
permutations on its own.

Note that this function only works against the current configuration files, not
the one currently active.  As far as I can see PostgresMain gets rid of
PostmasterContext, which is where they currently live, so they don't exist
anymore in the backends.  To make it work we would have to always keep those in
each backend, but also reload them along with the main config file on SIGHUP.
Except on Windows (and -DEXEC_BACKEND), as the current version of auth files
are always used anyway.  So would it make sense to add the possibility to use
the loaded files instead, given the required extra cost and the fact that it
wouldn't make sense on Windows?  If yes, I guess that we should also allow that
in the pg_hba / pg_ident views.

While at it, I noticed this comment added in de16ab72388:

> The pg_hba.conf file is read on start-up and when
> the main server process receives a
> SIGHUP
> signal. If you edit the file on an
> active system, you will need to signal the postmaster
> (using pg_ctl reload, calling the SQL function
> pg_reload_conf(), or using kill
> -HUP) to make it re-read the file.
> [...]
> The preceding statement is not true on Microsoft Windows: there, any
> changes in the pg_hba.conf file are immediately
> applied by subsequent 

Re: pg_stat_get_replication_slot and pg_stat_get_subscription_worker incorrectly marked as proretset

2022-02-22 Thread Amit Kapila
On Tue, Feb 22, 2022 at 8:15 AM Michael Paquier  wrote:
>
> On Mon, Feb 21, 2022 at 05:00:43PM +0530, Amit Kapila wrote:
> > On Mon, Feb 21, 2022 at 4:56 PM Michael Paquier  wrote:
> >
> > It is still under discussion. I'll take the necessary action along
> > with other things related to that view based on the conclusion on that
> > thread.
>
> Sounds good to me.  The patch you are proposing upthread is OK for
> me.
>

Thanks, so you are okay with me pushing that patch just to HEAD. We
don't want to backpatch this to 14 as this is a catalog change and
won't cause any user-visible issue, is that correct?

-- 
With Regards,
Amit Kapila.




Re: Frontend error logging style

2022-02-22 Thread Andres Freund
On 2022-02-22 22:44:25 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > What about adding a pg_fatal() that's pg_log_fatal() + exit()? That keeps
> > pg_log_* stuff "log only", but adds something adjacent enough to hopefully
> > reduce future misunderstandings?
> 
> I'd be okay with that, except that pg_upgrade already has a pg_fatal
> (because it has its *own* logging system, just in case you thought
> this wasn't enough of a mess yet).  I'm in favor of aligning
> pg_upgrade's logging with the rest, but I'd hoped to leave that for
> later.  Making the names collide would be bad even as a short-term
> thing, I fear.

I guess we could rename pg_upgrade's out of the way...


> I'm not against choosing some name other than pg_log_fatal, but that
> particular suggestion has got conflicts.  Got any other ideas?

Maybe pg_fatal_exit(), pg_exit_fatal() or pg_fatal_error()?




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread Amit Kapila
On Tue, Feb 22, 2022 at 8:02 PM Masahiko Sawada  wrote:
>
> On Tue, Feb 22, 2022 at 6:53 PM Amit Kapila  wrote:
> >
>
> > 3. Can we send error stats pgstat_report_stat() as that will be called
> > via proc_exit() path. We can set the phase (apply/sync) in
> > apply_error_callback_arg and then use that to send the appropriate
> > message. I think this will obviate the need for try..catch.
>
> If we use pgstat_report_stat() to send subscription stats messages,
> all processes end up going through that path. It might not bring
> overhead in practice but I'd like to avoid it.
>

I am not sure about overhead but I see another problem if we use that
approach. In the exit path, logicalrep_worker_onexit() will get called
before pgstat_report_stat() and that will clear the
MyLogicalRepWorker->subid, so we won't know the id for which to send
stats. So, the way patch is doing seems reasonable to me unless
someone has better ideas.

-- 
With Regards,
Amit Kapila.




Re: Frontend error logging style

2022-02-22 Thread Tom Lane
Andres Freund  writes:
> What about adding a pg_fatal() that's pg_log_fatal() + exit()? That keeps
> pg_log_* stuff "log only", but adds something adjacent enough to hopefully
> reduce future misunderstandings?

I'd be okay with that, except that pg_upgrade already has a pg_fatal
(because it has its *own* logging system, just in case you thought
this wasn't enough of a mess yet).  I'm in favor of aligning
pg_upgrade's logging with the rest, but I'd hoped to leave that for
later.  Making the names collide would be bad even as a short-term
thing, I fear.

Looks like libpq_pipeline.c has its own pg_fatal, too.

I'm not against choosing some name other than pg_log_fatal, but that
particular suggestion has got conflicts.  Got any other ideas?

regards, tom lane




Re: Frontend error logging style

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 18:23:19 -0500, Tom Lane wrote:
> Robert Haas  writes:
> > On Fri, Nov 19, 2021 at 5:17 AM Peter Eisentraut
> >  wrote:
> >> If people want to do that kind of thing (I'm undecided whether the
> >> complexity is worth it), then make it a different API.  The pg_log_*
> >> calls are for writing formatted output.  They normalized existing
> >> hand-coded patterns at the time.  We can wrap another API on top of them
> >> that does flow control and output.  The pg_log_* stuff is more on the
> >> level of syslog(), which also just outputs stuff.  Nobody is suggesting
> >> that syslog(LOG_EMERG) should exit the program automatically.  But you
> >> can wrap higher-level APIs such as ereport() on top of that that might
> >> do that.

That'd maybe be a convincing argument in a project that didn't have
elog(FATAL).


> > Yeah, that might be a way forward.
> 
> This conversation seems to have tailed off without full resolution,
> but I observe that pretty much everyone except Peter is on board
> with defining pg_log_fatal as pg_log_error + exit(1).  So I think
> we should just do that, unless Peter wants to produce a finished
> alternative proposal.

What about adding a pg_fatal() that's pg_log_fatal() + exit()? That keeps
pg_log_* stuff "log only", but adds something adjacent enough to hopefully
reduce future misunderstandings?

To further decrease the chance of such mistakes, we could rename
pg_log_fatal() to something else. Or add a pg_log_fatal() function that's
marked pg_nodiscard(), so that each callsite explicitly has to be (void)
pg_log_fatal().


> I've now gone through and fleshed out the patch I sketched upthread.
> This exercise convinced me that we absolutely should do something
> very like this, because
> 
> (1) As it stands, this patch removes a net of just about 1000 lines
> of code.  So we clearly need *something*.

> (2) The savings would be even more, except that people have invented
> macros for pg_log_error + exit(1) in at least four places already.
> (None of them quite the same of course.)  Here I've just fixed those
> macro definitions to use pg_log_fatal.  In the name of consistency, we
> should probably get rid of those macros in favor of using pg_log_fatal
> directly; but I've not done so here, since this patch is eyewateringly
> long and boring already.

Agreed.


I don't care that much about the naming here.


For a moment I was wondering whether there'd be a benefit for having the "pure
logging" stuff be in a different static library than the erroring variant. So
that e.g. libpq's check for exit() doesn't run into trouble. But then I
remembered that libpq doesn't link against common...

Greetings,

Andres Freund




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread Peter Smith
Hi. Below are my review comments for the v1 patch.

==

1. Commit message - wording

As the result of the discussion, we've concluded that the stats
collector is not an appropriate place to store the error information of
subscription workers.

SUGGESTION
It was decided (refer to the Discussion link below) that the stats
collector is not an appropriate place to store the error information of
subscription workers.

~~~

2. Commit message - wording

This commits changes the view so that it stores only statistics
counters: apply_error_count and sync_error_count.

SUGGESTION
"This commits changes the view" --> "This patch changes the
pg_stat_subscription_workers view"

~~~

3. Commit message - wording

Removing these error details, since we don't need to record the
error information for apply workers and tablesync workers separately,
the view now has one entry per subscription.

DID THIS MEAN...
After removing these error details, there is no longer separate
information for apply workers and tablesync workers, so the view now
has only one entry per subscription.

--

But anyway, that is not entirely true either because those counters
are separate information for the different workers, right?

~~

4. Commit message - wording

Also, it changes the view name to pg_stat_subscription_activity
since the word "worker" is an implementation detail that we use one
worker for one tablesync.

SUGGESTION
"Also, it changes" --> "The patch also changes" ...

~~~

5. doc/src/sgml/monitoring.sgml - wording

   
-   The pg_stat_subscription_workers view will contain
-   one row per subscription worker on which errors have occurred, for workers
-   applying logical replication changes and workers handling the initial data
-   copy of the subscribed tables.  The statistics entry is removed when the
+   The pg_stat_subscription_activity view will contain
+   one row per subscription.  The statistics entry is removed when the
corresponding subscription is dropped.
   

SUGGESTION
"The statistics entry is removed" --> "This row is removed"

~~~

6.  doc/src/sgml/monitoring.sgml - why two counters?

Please forgive this noob question...

I see there are 2 error_count columns (one for each kind of worker)
but I was wondering why it is useful for users to be able to
distinguish if the error came from the tablesync workers or from the
apply workers? Do you have any example?

Also, IIRC sometimes the tablesync might actually do a few "apply"
changes itself... so the distinction may become a bit fuzzy...

~~~

7. src/backend/postmaster/pgstat.c - comment

@@ -1313,13 +1312,13 @@ pgstat_vacuum_stat(void)
  }

  /*
- * Repeat for subscription workers.  Similarly, we needn't bother in the
- * common case where no subscription workers' stats are being collected.
+ * Repeat for subscription.  Similarly, we needn't bother in the common
+ * case where no subscription stats are being collected.
  */

typo?

"Repeat for subscription." --> "Repeat for subscriptions."

~~~

8. src/backend/postmaster/pgstat.c

@@ -3000,32 +2968,29 @@ pgstat_fetch_stat_funcentry(Oid func_id)

 /*
  * -
- * pgstat_fetch_stat_subworker_entry() -
+ * pgstat_fetch_stat_subentry() -
  *
  * Support function for the SQL-callable pgstat* functions. Returns
- * the collected statistics for subscription worker or NULL.
+ * the collected statistics for subscription or NULL.
  * -
  */
-PgStat_StatSubWorkerEntry *
-pgstat_fetch_stat_subworker_entry(Oid subid, Oid subrelid)
+PgStat_StatSubEntry *
+pgstat_fetch_stat_subentry(Oid subid)

There seems some kind of inconsistency because the member name is
called "subscriptions" but sometimes it seems singular.

Some places (e.g. pgstat_vacuum_stat) will iterate multiple results,
but then other places (like this function) just return to a single
"subscription" (or "entry").

I suspect all the code may be fine; probably it is just some
inconsistent (singular/plural) comments that have confused things a
bit.

~~~

9. src/backend/replication/logical/worker.c - subscription id

+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MyLogicalRepWorker->subid, false);

and

+ /* Report the worker failed during the application of the change */
+ pgstat_report_subscription_error(MyLogicalRepWorker->subid, true);


Why don't these use MySubscription->oid instead of MyLogicalRepWorker->subid?

~~~

10. src/include/pgstat.h - enum order

@@ -84,8 +84,8 @@ typedef enum StatMsgType
  PGSTAT_MTYPE_REPLSLOT,
  PGSTAT_MTYPE_CONNECT,
  PGSTAT_MTYPE_DISCONNECT,
+ PGSTAT_MTYPE_SUBSCRIPTIONERROR,
  PGSTAT_MTYPE_SUBSCRIPTIONPURGE,
- PGSTAT_MTYPE_SUBWORKERERROR,
 } StatMsgType;

This change rearranges the enum order. Maybe it is safer not to do this?

~~~

11. src/include/pgstat.h

@@ -767,8 +747,8 @@ typedef union PgStat_Msg
  PgStat_MsgReplSlot msg_replslot;
  PgStat_MsgConnect msg_connect;
  PgStat_MsgDisconnect msg_disconnect;
+ PgStat_MsgSubscriptionError msg_subscriptionerror;
  

RE: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread tanghy.f...@fujitsu.com
On Tue, Feb 22, 2022 1:45 PM Masahiko Sawada  wrote:
> 
> I've attached a patch that changes pg_stat_subscription_workers view.
> It removes non-cumulative values such as error details such as
> error-XID and the error message from the view, and consequently the
> view now has only cumulative statistics counters: apply_error_count
> and sync_error_count. Since the new view name is under discussion I
> temporarily chose pg_stat_subscription_activity.
> 

Thanks for your patch.

Few comments:

1.
+   apply_error_count uint8
...
+   sync_error_count uint8

It seems that Postgres has no data type named uint8, should we change it to
bigint?

2.
+# Wait for the table sync error to be reported.
+$node_subscriber->poll_query_until(
+   'postgres',
+   qq[
+SELECT apply_error_count = 0 AND sync_error_count > 0
+FROM pg_stat_subscription_activity
+WHERE subname = 'tap_sub'
+]) or die "Timed out while waiting for table sync error";

We want to check table sync error here, but do we need to check
"apply_error_count = 0"? I am not sure if it is possible that the apply worker
has an unexpected error, which would cause this test to fail.

Regards,
Tang


Re: logical replication empty transactions

2022-02-22 Thread Ajin Cherian
On Thu, Feb 17, 2022 at 9:42 PM Amit Kapila  wrote:
>
> On Mon, Jan 31, 2022 at 6:18 PM Ajin Cherian  wrote:
> >
>
> Few comments:
> =
> > 1. Is there any particular reason why the patch is not skipping empty xacts
> for streaming (in-progress) transactions as noted in the commit
> message as well?
>

I have added support for skipping streaming transactions.

> 2.
> +static void
> +pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
> +{
>   bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
> + PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
> +
> + Assert(txndata);
>
> I think here you can add an assert for sent_begin_txn to be always false?
>

Added.

> 3.
> +/*
> + * Send BEGIN.
> + * This is where the BEGIN is actually sent. This is called
> + * while processing the first change of the transaction.
> + */
>
> Have an empty line between the first two lines to ensure consistency
> with nearby comments. Also, the formatting of these lines appears
> awkward, either run pgindent or make sure lines are not too short.
>

Changed.

> 4. Do we really need to make any changes in PREPARE
> transaction-related functions if can't skip in that case? I think you
> can have a check if the output plugin private variable is not set then
> ignore special optimization for sending begin.
>

I have modified this as well.

I have also rebased the patch after it did not apply due to a new commit.

I will next work on testing and improving the keepalive logic while
skipping transactions.

regards,
Ajin Cherian
Fujitsu Australia


v19-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-22 Thread Imseih (AWS), Sami
>Ooh, nice find and diagnosis.  I can confirm that the test fails as you
>described without the code fix, and doesn't fail with it.

>I attach the same patch, with the test file put in its final place
>rather than as a patch.  Due to recent xlog.c changes this need a bit of
>work to apply to back branches; I'll see about getting it in all
>branches soon.

Thank you for reviewing!

Sami




Re: catalog access with reset GUCs during parallel worker startup

2022-02-22 Thread Andres Freund
Hi,

I think we need to do something about this, because afaics this has data
corruption risks, albeit with a narrow window.

E.g. consider what happens if the value of wal_sync_method is reset to the
boot value and one of the catalog accesses in GUC check hooks causes WAL
writes (e.g. due to pruning, buffer replacement). One backend doing
fdatasync() and the rest O_DSYNC, or vice versa, won't lead to reliable
durability.


On 2022-02-09 21:47:09 -0500, Robert Haas wrote:
> I remember trying that, and if I remember correctly, it broke with
> core GUCs, without any need for extensions in the picture. I don't
> remember the details too well, unfortunately, but I think it had to do
> with the behavior of some of the check and/or assign hooks.
>
> It's probably time to try it again, because (a) maybe things are
> different now, or (b) maybe I did it wrong, and in any event (c) I
> can't really remember why it didn't work and we probably want to know
> that.

The tests fail:
+ERROR:  invalid value for parameter "session_authorization": "andres"
+CONTEXT:  while setting parameter "session_authorization" to "andres"
+parallel worker
+while scanning relation "public.pvactst"

but that's easily worked around. In fact, there's already code to do so, it
just is lacking awareness of session_authorization:

static bool
can_skip_gucvar(struct config_generic *gconf)
...
 * Role must be handled specially because its current value can be an
 * invalid value (for instance, if someone dropped the role since we set
 * it).  So if we tried to serialize it normally, we might get a failure.
 * We skip it here, and use another mechanism to ensure the worker has the
 * right value.

Don't those reasons apply just as well to session_authorization as they do to
role?

If I skip session_authorization in can_skip_gucvar() and move
RestoreGUCState() out of the transaction, all tests pass.

Unsurprisingly we get bogus results for parallel accesses to
session_authorization:

postgres[3312622][1]=> SELECT current_setting('session_authorization');
┌─┐
│ current_setting │
├─┤
│ frak│
└─┘
(1 row)

postgres[3312622][1]=> SET force_parallel_mode = on;
SET
postgres[3312622][1]=> SELECT current_setting('session_authorization');
┌─┐
│ current_setting │
├─┤
│ andres  │
└─┘
(1 row)

but that could be remedied just as already done for role.

Greetings,

Andres Freund




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread Masahiko Sawada
On Wed, Feb 23, 2022 at 11:14 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-02-22 14:45:19 +0900, Masahiko Sawada wrote:
> > I've attached a patch that changes pg_stat_subscription_workers view.
>
> Thanks for working on this!
>
> Why are the stats stored in the per-database stats file / as a second level
> below the database? While they're also associated with a database, it's a
> global catalog, so it seems to make more sense to have them "live" globally as
> well?

Good point. The reason why we used to use per-database stats file is
that we were storing some relation information there. But now that we
don't need to have such information, it makes more sense to have them
live globally. I'll change the patch accordingly.

>
> Not just from an aesthetical perspective, but there might also be cases where
> it's useful to send stats from the stats launcher. E.g. the number of times
> the launcher couldn't start a worker because the max numbers of workers was
> already active or such.

Good idea.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: shared_preload_libraries = 'pg_stat_statements' failing with installcheck (compute_query_id = auto)

2022-02-22 Thread Justin Pryzby
On Tue, Feb 22, 2022 at 11:04:16AM +0900, Michael Paquier wrote:
> On Fri, Feb 18, 2022 at 05:38:56PM +0800, Julien Rouhaud wrote:
> > On Fri, Feb 18, 2022 at 05:22:36PM +0900, Michael Paquier wrote:
> >> So, I have been looking at this problem, and I don't see a problem in
> >> doing something like the attached, where we add a "regress" mode to
> >> compute_query_id that is a synonym of "auto". Or, in short, we have
> >> the default of letting a module decide if a query ID can be computed
> >> or not, at the exception that we hide its result in EXPLAIN outputs.

> Okay, thanks.  I have worked on that today and applied the patch down

While rebasing, I noticed that this patch does part of what I added in another
thread.

https://commitfest.postgresql.org/37/3409/
| explain_regress, explain(MACHINE), and default to explain(BUFFERS)

-   if (es->verbose && plannedstmt->queryId != UINT64CONST(0))
+   if (es->verbose && plannedstmt->queryId != UINT64CONST(0) && es->machine)
    {
        /*
         * Output the queryid as an int64 rather than a uint64 so we match

Apparently, I first wrote this two years ago today.

-- 
Justin




Re: Use generation context to speed up tuplesorts

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-18 12:10:51 +1300, David Rowley wrote:
> The other way I thought to fix it was by changing the logic for when
> generation blocks are freed.  In the problem case mentioned above, the
> block being freed is the current block (which was just allocated).  I
> made some changes to adjust this behaviour so that we no longer free
> the block when the final chunk is pfree()'d. Instead, that now lingers
> and can be reused by future allocations, providing they fit inside it.

That makes sense to me, as long as we keep just one such block.


> The problem I see with this method is that there still could be some
> pathological case that causes us to end up storing just a single tuple per
> generation block.

Crazy idea: Detect the situation, and recompact. Create a new context, copy
all the tuples over, delete the old context. That could be a win even in less
adversarial situations than "a single tuple per generation block".


Greetings,

Andres Freund




Re: small development tip: Consider using the gold linker

2022-02-22 Thread Peter Geoghegan
On Tue, Feb 22, 2022 at 4:26 PM Andres Freund  wrote:
> I've switched back and forth between gold and lld a couple times. Eventually I
> always get frustrated by lld causing slowdowns and other issues for gdb. Did
> you ever hit one of those?

Now that you mention it...gdb was very slow the last few times I used
it, which wasn't for long enough for it to really matter. This must
have been why. I might have to rescind my recommendation of lld.

-- 
Peter Geoghegan




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 14:45:19 +0900, Masahiko Sawada wrote:
> I've attached a patch that changes pg_stat_subscription_workers view.

Thanks for working on this!

Why are the stats stored in the per-database stats file / as a second level
below the database? While they're also associated with a database, it's a
global catalog, so it seems to make more sense to have them "live" globally as
well?

Not just from an aesthetical perspective, but there might also be cases where
it's useful to send stats from the stats launcher. E.g. the number of times
the launcher couldn't start a worker because the max numbers of workers was
already active or such.

Greetings,

Andres Freund




Re: row filtering for logical replication

2022-02-22 Thread Amit Kapila
On Tue, Feb 22, 2022 at 4:47 AM Euler Taveira  wrote:
>
> On Thu, Feb 17, 2022, at 3:36 AM, Amit Kapila wrote:
>
> As there is a new version, I would like to wait for a few more days
> before committing. I am planning to commit this early next week (by
> Tuesday) unless others or I see any more things that can be improved.
>
> Amit, I don't have additional comments or suggestions. Let's move on. Next
> topic. :-)
>

Pushed!

-- 
With Regards,
Amit Kapila.




Re: Race conditions in 019_replslot_limit.pl

2022-02-22 Thread Andres Freund
Hi,


I think I did find a bug related to the test, but afaics not the cause of the
test failures we're seeing. See
https://www.postgresql.org/message-id/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de

I don't think it's related to the problem of this thread, because the logs of
primary3 don't have a single mention of

ereport(LOG,
        (errmsg("terminating process %d to release replication slot \"%s\"",
                active_pid, NameStr(slotname))));

On 2022-02-18 15:14:15 -0800, Andres Freund wrote:
> I'm running out of ideas for how to try to reproduce this. I think we might
> need some additional debugging information to get more information from the
> buildfarm.

> I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing --verbose
> to pg_basebackup in $node_primary3->backup(...).
>
> It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
> ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().

Planning to commit something like the attached.

Greetings,

Andres Freund
From afdeff10526e29e3fc63b18c08100458780489d9 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 22 Feb 2022 18:02:34 -0800
Subject: [PATCH v1] Add temporary debug info to help debug
 019_replslot_limit.pl failures.

I have not been able to reproduce the occasional failures of
019_replslot_limit.pl we are seeing in the buildfarm and not for lack of
trying. The additional logging and increased log level will hopefully help.

Will be reverted once the cause is identified.

Discussion: https://postgr.es/m/20220218231415.c4plkp4i3reqc...@alap3.anarazel.de
---
 src/backend/replication/slot.c| 21 +
 src/bin/pg_basebackup/pg_basebackup.c | 10 +-
 src/test/recovery/t/019_replslot_limit.pl |  5 -
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 5da5fa825a2..3d39fddaaef 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -177,6 +177,10 @@ ReplicationSlotInitialize(void)
 static void
 ReplicationSlotShmemExit(int code, Datum arg)
 {
+	/* temp debugging aid to analyze 019_replslot_limit failures */
+	elog(DEBUG3, "replication slot exit hook, %s active slot",
+		 MyReplicationSlot != NULL ? "with" : "without");
+
 	/* Make sure active replication slots are released */
 	if (MyReplicationSlot != NULL)
 		ReplicationSlotRelease();
@@ -554,6 +558,9 @@ ReplicationSlotCleanup(void)
 	Assert(MyReplicationSlot == NULL);
 
 restart:
+	/* temp debugging aid to analyze 019_replslot_limit failures */
+	elog(DEBUG3, "temporary replication slot cleanup: begin");
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -579,6 +586,8 @@ restart:
 	}
 
 	LWLockRelease(ReplicationSlotControlLock);
+
+	elog(DEBUG3, "temporary replication slot cleanup: done");
 }
 
 /*
@@ -1284,6 +1293,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 (void) kill(active_pid, SIGTERM);
 last_signaled_pid = active_pid;
 			}
+			else
+			{
+/* temp debugging aid to analyze 019_replslot_limit failures */
+elog(DEBUG3, "not signalling process %d during invalidation of slot \"%s\"",
+	 active_pid, NameStr(slotname));
+			}
 
 			/* Wait until the slot is released. */
 			ConditionVariableSleep(&s->active_cv,
@@ -1347,6 +1362,10 @@ InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno)
 	XLogSegNoOffsetToRecPtr(oldestSegno, 0, wal_segment_size, oldestLSN);
 
 restart:
+	/* temp debugging aid to analyze 019_replslot_limit failures */
+	elog(DEBUG3, "begin invalidating obsolete replication slots older than %X/%X",
+		 LSN_FORMAT_ARGS(oldestLSN));
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 	for (int i = 0; i < max_replication_slots; i++)
 	{
@@ -1372,6 +1391,8 @@ restart:
 		ReplicationSlotsComputeRequiredLSN();
 	}
 
+	elog(DEBUG3, "done invalidating obsolete replication slots");
+
 	return invalidated;
 }
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 08b07d5a06e..8c77c533e64 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -700,8 +700,16 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 	bgchild = fork();
 	if (bgchild == 0)
 	{
+		int			ret;
+
 		/* in child process */
-		exit(LogStreamerMain(param));
+		ret = LogStreamerMain(param);
+
+		/* temp debugging aid to analyze 019_replslot_limit failures */
+		if (verbose)
+			pg_log_info("log streamer with pid %d exiting", getpid());
+
+		exit(ret);
 	}
 	else if (bgchild < 0)
 	{
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 4257bd4d35a..0c9da9bf272 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -316,13 +316,16 @@ 

Re: Race condition in InvalidateObsoleteReplicationSlots()

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 17:48:55 -0800, Andres Freund wrote:
> I think the minor changes might unfortunately have introduced a race? Before
> the patch just used ConditionVariableSleep(), but now it also has a
> ConditionVariablePrepareToSleep().  Without re-checking the sleep condition
> until
> /* Wait until the slot is released. */
> ConditionVariableSleep(&s->active_cv,
>                        WAIT_EVENT_REPLICATION_SLOT_DROP);
> 
> which directly violates what ConditionVariablePrepareToSleep() documents:
> 
>  * This can optionally be called before entering a test/sleep loop.
>  * Doing so is more efficient if we'll need to sleep at least once.
>  * However, if the first test of the exit condition is likely to succeed,
>  * it's more efficient to omit the ConditionVariablePrepareToSleep call.
>  * See comments in ConditionVariableSleep for more detail.
>  *
>  * Caution: "before entering the loop" means you *must* test the exit
>  * condition between calling ConditionVariablePrepareToSleep and calling
>  * ConditionVariableSleep.  If that is inconvenient, omit calling
>  * ConditionVariablePrepareToSleep.
> 
> 
> Afaics this means we can potentially sleep forever if the prior owner of the
> slot releases it before the ConditionVariablePrepareToSleep().
> 
> There's a comment that's mentioning this danger:
> 
> /*
>  * Prepare the sleep on the slot's condition variable before
>  * releasing the lock, to close a possible race condition if the
>  * slot is released before the sleep below.
>  */
>   ConditionVariablePrepareToSleep(&s->active_cv);
> 
>   LWLockRelease(ReplicationSlotControlLock);
> 
> but afaics that is bogus, because releasing a slot doesn't take
> ReplicationSlotControlLock. That just protects against the slot being dropped,
> not against it being released.
> 
> We can ConditionVariablePrepareToSleep() here, but we'd have to do it earlier,
> before the checks at the start of the while loop.

Not at the start of the while loop, outside of the while loop. Doing it in the
loop body doesn't make sense, even if it's at the top. Each
ConditionVariablePrepareToSleep() will unregister itself:

/*
 * If some other sleep is already prepared, cancel it; this is necessary
 * because we have just one static variable tracking the prepared sleep,
 * and also only one cvWaitLink in our PGPROC.  It's okay to do this
 * because whenever control does return to the other test-and-sleep loop,
 * its ConditionVariableSleep call will just re-establish that sleep as
 * the prepared one.
 */
if (cv_sleep_target != NULL)
ConditionVariableCancelSleep();

The intended use is documented in this comment:

 * This should be called in a predicate loop that tests for a specific exit
 * condition and otherwise sleeps, like so:
 *
 *   ConditionVariablePrepareToSleep(cv);  // optional
 *   while (condition for which we are waiting is not true)
 *   ConditionVariableSleep(cv, wait_event_info);
 *   ConditionVariableCancelSleep();

Greetings,

Andres Freund




Re: Time to drop plpython2?

2022-02-22 Thread Tom Lane
Mark Wong  writes:
> Take 3. :)

> I've upgraded everyone to the v14 buildfarm scripts and made sure the
> --test passed on HEAD on each one.  So I hopefully didn't miss any
> (other than the one EOL OpenSUSE version that I will plan on upgrading.)

Thanks!

However ... it seems like most of your animals haven't run at all
in the last couple of days.  Did you maybe forget to re-enable
their cron jobs after messing with them, or something like that?

regards, tom lane




Re: Race condition in InvalidateObsoleteReplicationSlots()

2022-02-22 Thread Andres Freund
Hi,

On 2021-06-11 12:27:57 -0400, Álvaro Herrera wrote:
> On 2021-Jun-10, Álvaro Herrera wrote:
> 
> > Here's a version that I feel is committable (0001).  There was an issue
> > when returning from the inner loop, if in a previous iteration we had
> > released the lock.  In that case we need to return with the lock not
> > held, so that the caller can acquire it again, but weren't.  This seems
> > pretty hard to hit in practice (I suppose somebody needs to destroy the
> > slot just as checkpointer killed the walsender, but before checkpointer
> > marks it as its own) ... but if it did happen, I think checkpointer
> > would block with no recourse.  Also added some comments and slightly
> > restructured the code.
> > 
> > There are plenty of conflicts in pg13, but it's all easy to handle.
> 
> Pushed, with additional minor changes.

I stared at this code, due to [1], and I think I found a bug. I think it's not
the cause of the failures in that thread, but we probably should still do
something about it.

I think the minor changes might unfortunately have introduced a race? Before
the patch just used ConditionVariableSleep(), but now it also has a
ConditionVariablePrepareToSleep().  Without re-checking the sleep condition
until
/* Wait until the slot is released. */
ConditionVariableSleep(&s->active_cv,
                       WAIT_EVENT_REPLICATION_SLOT_DROP);

which directly violates what ConditionVariablePrepareToSleep() documents:

 * This can optionally be called before entering a test/sleep loop.
 * Doing so is more efficient if we'll need to sleep at least once.
 * However, if the first test of the exit condition is likely to succeed,
 * it's more efficient to omit the ConditionVariablePrepareToSleep call.
 * See comments in ConditionVariableSleep for more detail.
 *
 * Caution: "before entering the loop" means you *must* test the exit
 * condition between calling ConditionVariablePrepareToSleep and calling
 * ConditionVariableSleep.  If that is inconvenient, omit calling
 * ConditionVariablePrepareToSleep.


Afaics this means we can potentially sleep forever if the prior owner of the
slot releases it before the ConditionVariablePrepareToSleep().

There's a comment that's mentioning this danger:

/*
 * Prepare the sleep on the slot's condition variable before
 * releasing the lock, to close a possible race condition if the
 * slot is released before the sleep below.
 */
ConditionVariablePrepareToSleep(&s->active_cv);

LWLockRelease(ReplicationSlotControlLock);

but afaics that is bogus, because releasing a slot doesn't take
ReplicationSlotControlLock. That just protects against the slot being dropped,
not against it being released.

We can ConditionVariablePrepareToSleep() here, but we'd have to do it earlier,
before the checks at the start of the while loop.

Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/20220218231415.c4plkp4i3reqcwip%40alap3.anarazel.de




RE: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread osumi.takami...@fujitsu.com
On Tuesday, February 22, 2022 11:47 PM Masahiko Sawada  
wrote:
> On Tue, Feb 22, 2022 at 9:22 PM osumi.takami...@fujitsu.com
>  wrote:
> > (4) doc/src/sgml/monitoring.sgml
> >
> >   <entry role="catalog_table_entry"><para role="column_definition">
> > -   <structfield>last_error_time</structfield> <type>timestamp with time zone</type>
> > +   <structfield>sync_error_count</structfield> <type>uint8</type>
> >
> >
> > -   Last time at which this error occurred
> > +   Number of times the error occurred during the initial data
> > +   copy
> >
> >
> > I supposed it might be better to use "initial data sync"
> > or "initial data synchronization", rather than "initial data copy".
> 
> "Initial data synchronization" sounds like the whole table synchronization
> process including COPY and applying changes to catch up. But
> sync_error_count is incremented only during COPY so I used "initial data 
> copy".
> What do you think?
Okay. Please keep it as is.


Best Regards,
Takamichi Osumi



Re: pgcrypto: Remove internal padding implementation

2022-02-22 Thread Jacob Champion
On Mon, 2022-02-21 at 11:42 +0100, Peter Eisentraut wrote:
> On 15.02.22 00:07, Jacob Champion wrote:
> > After this patch, bad padding is no longer ignored during decryption,
> > and encryption without padding now requires the input size to be a
> > multiple of the block size. To see the difference you can try the
> > following queries with and without the patch:
> > 
> >  select encrypt_iv('foo', '0123456', 'abcd', 'aes/pad:none');
> > >  select encode(decrypt_iv('\xa21a9c15231465964e3396d32095e67eb52bab05f556a581621dee1b85385789', '0123456', 'abcd', 'aes'), 'escape');
> 
> Interesting.  I have added test cases about this.  Could you describe 
> how you arrived at the second test case?

Sure -- that second ciphertext is the result of running

SELECT encrypt_iv('abcdefghijklmnopqrstuvwxyz', '0123456', 'abcd', 'aes');

and then incrementing the last byte of the first block (i.e. the 16th
byte) to corrupt the padding in the last block.

> > Both changes seem correct to me. I can imagine some system out there
> > being somehow dependent on the prior decryption behavior to avoid a
> > padding oracle -- but if that's a concern, hopefully you're not using
> > unauthenticated encryption in the first place? It might be worth a note
> > in the documentation.
> 
> Hmm.  I started reading up on "padding oracle attack".  I don't 
> understand it well enough yet to know whether this is a concern here.

I *think* the only reasonable way to prevent that attack is by
authenticating with a MAC or an authenticated cipher, to avoid running
decryption on corrupted ciphertext in the first place. But I'm out of
my expertise here.

> Is there any reasonable meaning of the previous behaviors?

I definitely don't think the previous behavior was reasonable. It's
possible that someone built a strange system on top of that
unreasonable behavior, but I hope not.

> Does bad padding even give correct answers on decryption?

No -- or rather, I'm not really sure how to define "correct answer" for
garbage input. I especially don't like that two different ciphertexts
can silently decrypt to the same plaintext.

> What does encryption without padding even do on incorrect input sizes?

Looks like it's being silently zero-padded by the previous code, which
doesn't seem very helpful to me. The documentation says "data must be
multiple of cipher block size" for the pad:none case, though I suppose
it doesn't say what happens if you ignore the "must".

--Jacob


Re: small development tip: Consider using the gold linker

2022-02-22 Thread Andres Freund
Hi,

On 2022-01-13 19:24:09 -0800, Peter Geoghegan wrote:
> On Fri, Jul 20, 2018 at 8:16 PM Peter Geoghegan  wrote:
> > On Mon, Jul 9, 2018 at 4:40 PM, Andres Freund  wrote:
> > > FWIW, I've lately noticed that I spend a fair time waiting for the
> > > linker during edit-compile-test cycles.  Due to an independent issue I
> > > just used the gold linker, and the speedup is quite noticable.
> > > For me just adding '-fuse-ld=gold' to CFLAGS works.
> >
> > I tried this out today. It makes quite a noticeable difference for me.
> > Thanks for the tip.
> 
> The 2022 version of the same tip: switching over to LLVM lld seems to
> offer a further significant speedup over gold. Not surprising, since
> lld is specifically promoted as a linker that is faster than gold [1].

I've switched back and forth between gold and lld a couple times. Eventually I
always get frustrated by lld causing slowdowns and other issues for gdb. Did
you ever hit one of those?


> Again, this appears to just be a case of installing lld, and adding
> '-fuse-ld=lld' to CFLAGS -- a very easy switch to make. Link time is
> still a noticeable contributor to overall build time for me, even with
> gold (I use ccache, of course).

Yea. gold also isn't really maintained anymore, so it'd be nice to move on...


> There is another linker called mold [2]. It claims to be faster than
> lld (which was faster than gold, which was faster than ld). I didn't
> bother trying it out.

Yea, it didn't quite seem there yet last time I looked, and lld seems plenty
fast for our purposes.

Greetings,

Andres Freund




Re: postgres_fdw: using TABLESAMPLE to collect remote sample

2022-02-22 Thread Tomas Vondra
On 2/22/22 21:12, Tomas Vondra wrote:
> On 2/22/22 01:36, Fujii Masao wrote:
>>
>>
>> On 2022/02/18 22:28, Tomas Vondra wrote:
>>> Hi,
>>>
>>> here's a slightly updated version of the patch series.
>>
>> Thanks for updating the patches!
>>
>>
>>> The 0001 part
>>> adds tracking of server_version_num, so that it's possible to enable
>>> other features depending on it.
>>
>> Like configure_remote_session() does, can't we use PQserverVersion()
>> instead of implementing new function GetServerVersion()?
>>
> 
> Ah! My knowledge of libpq is limited, so I wasn't sure this function
> exists. It'll simplify the patch.
> 

And here's the slightly simplified patch, without the part adding the
unnecessary GetServerVersion() function.
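For context, the sampling scheme the attached patch describes (derive a sampling fraction from reltuples, then reservoir-sample the rows that come back) can be sketched roughly as follows. The function names and the 10% inflation factor are illustrative assumptions, not the patch's actual values:

```python
import random

def remote_sample_fraction(reltuples, targrows, inflation=1.1):
    # fraction of the remote table to sample, inflated a bit in case the
    # table shrank since reltuples was last updated
    if reltuples <= 0:
        return 1.0
    return min(1.0, (targrows / reltuples) * inflation)

def reservoir_sample(rows, targrows, rng=random):
    # classic Algorithm R: keeps a uniform sample of size targrows even if
    # the remote sampling returned more rows than requested (table grew)
    sample = []
    for i, row in enumerate(rows):
        if i < targrows:
            sample.append(row)
        else:
            j = rng.randrange(i + 1)
            if j < targrows:
                sample[j] = row
    return sample
```

The fraction then gets interpolated into either `TABLESAMPLE SYSTEM(...)` or `WHERE random() < ...` depending on the remote server version.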


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 089ecf8c9ee44b89becfbcfb9c7a0ddc6c8f197a Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 18 Feb 2022 00:33:19 +0100
Subject: [PATCH] postgres_fdw: sample data on remote node for ANALYZE

When performing ANALYZE on a foreign table, we need to collect a sample
of rows. Until now, we simply read all data from the remote node and
built the sample locally. That is very expensive, especially in terms of
network traffic etc. But it's possible to move the sampling to the
remote node, and use either TABLESAMPLE or simply random() to transfer
a much smaller amount of data.

So we run either

   SELECT * FROM table TABLESAMPLE SYSTEM(fraction)

or

  SELECT * FROM table WHERE random() < fraction

depending on the server version (TABLESAMPLE is supported since 9.5).

To do that, we need to determine what fraction of the table to sample.
We rely on reltuples (fetched from the remote node) to be sufficiently
accurate and up to date, and calculate the fraction based on that. We
increase the sample size a bit (in case the table shrunk), and we still
do the reservoir sampling (in case it grew).

Using tsm_system_rows would allow specifying sample size in rows,
without determining sampling rate. But the sampling method may not be
installed, and we'd still have to determine the relation size.

This adds 'sample' option for remote servers / tables. By default, it's
set to 'true' which enables remote sampling. Setting it to 'false' uses
the old approach of fetching everything and sampling locally.
---
 contrib/postgres_fdw/deparse.c  | 152 
 contrib/postgres_fdw/option.c   |   7 +-
 contrib/postgres_fdw/postgres_fdw.c | 142 +-
 contrib/postgres_fdw/postgres_fdw.h |   7 ++
 4 files changed, 304 insertions(+), 4 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index bf12eac0288..32f2c0d5fb3 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -2267,6 +2267,26 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
 	appendStringInfo(buf, "::pg_catalog.regclass) / %d", BLCKSZ);
 }
 
+/*
+ * Construct SELECT statement to acquire the number of rows of given relation.
+ *
+ * Note: Maybe this should compare relpages and current relation size
+ * and adjust reltuples accordingly?
+ */
+void
+deparseAnalyzeTuplesSql(StringInfo buf, Relation rel)
+{
+	StringInfoData relname;
+
+	/* We'll need the remote relation name as a literal. */
+	initStringInfo(&relname);
+	deparseRelation(&relname, rel);
+
+	appendStringInfoString(buf, "SELECT reltuples FROM pg_catalog.pg_class WHERE oid = ");
+	deparseStringLiteral(buf, relname.data);
+	appendStringInfoString(buf, "::pg_catalog.regclass");
+}
+
 /*
  * Construct SELECT statement to acquire sample rows of given relation.
  *
@@ -2328,6 +2348,138 @@ deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
 	deparseRelation(buf, rel);
 }
 
+/*
+ * Construct SELECT statement to acquire sample rows of given relation,
+ * by sampling a fraction of the table using TABLESAMPLE SYSTEM.
+ *
+ * SELECT command is appended to buf, and list of columns retrieved
+ * is returned to *retrieved_attrs.
+ *
+ * Note: We could allow selecting system/bernoulli, and maybe even the
+ * optional TSM modules (especially tsm_system_rows would help).
+ */
+void
+deparseAnalyzeTableSampleSql(StringInfo buf, Relation rel, List **retrieved_attrs, double sample_frac)
+{
+	Oid			relid = RelationGetRelid(rel);
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	int			i;
+	char	   *colname;
+	List	   *options;
+	ListCell   *lc;
+	bool		first = true;
+
+	*retrieved_attrs = NIL;
+
+	appendStringInfoString(buf, "SELECT ");
+	for (i = 0; i < tupdesc->natts; i++)
+	{
+		/* Ignore dropped columns. */
+		if (TupleDescAttr(tupdesc, i)->attisdropped)
+			continue;
+
+		if (!first)
+			appendStringInfoString(buf, ", ");
+		first = false;
+
+		/* Use attribute name or column_name option. */
+		colname = NameStr(TupleDescAttr(tupdesc, i)->attname);
+		options = GetForeignColumnOptions(relid, i + 1);
+
+		foreach(lc, options)
+		{
+			DefElem*def = (DefElem *) lfirst(lc);
+
+			if 

Re: logical decoding and replication of sequences

2022-02-22 Thread Tomas Vondra



On 2/19/22 06:33, Amit Kapila wrote:
> On Sat, Feb 19, 2022 at 7:48 AM Tomas Vondra
>  wrote:
>>
>> Do we need to handle both FOR ALL TABLES and FOR ALL SEQUENCES at the
>> same time? That seems like a reasonable thing people might want.
>>
> 
> +1. That seems reasonable to me as well. I think the same will apply
> to  FOR ALL TABLES IN SCHEMA and FOR ALL SEQUENCES IN SCHEMA.
> 

It already works for "IN SCHEMA" because that's handled as a publication
object, but FOR ALL TABLES and FOR ALL SEQUENCES are defined directly in
CreatePublicationStmt.

Which also means you can't do ALTER PUBLICATION and change it to FOR ALL
TABLES. Which is a bit annoying, but OK. It's a bit weird that FOR ALL TABLES
is mentioned in the docs for ALTER PUBLICATION as if it were supported.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




wal_compression=zstd

2022-02-22 Thread Justin Pryzby
Since 4035cd5d4, wal_compression=lz4 is supported.  I think pg15 should also
support wal_compression=zstd.  There are free bits in the WAL header.

The last message on that thread includes a patch doing just that, which I've
rebased.
https://www.postgresql.org/message-id/ynqwd2gsmrnqw...@paquier.xyz

It might be nice to also support wal_compression=zstd-level, but that seems
optional and could wait for pg16...

In [0], I showed a case where lz4 is just as fast as uncompressed data, and
writes ~60% as much.  zstd writes half as much as LZ4 and 12% slower.
[0] https://www.postgresql.org/message-id/20210614012412.GA31772%40telsasoft.com

I am not claiming that zstd is generally better for WAL.  Rather, if we're
going to support alternate compression methods, it's nice to give a couple
options (in addition to pglz).  Some use cases would certainly suffer from
slower WAL.  But providing the option will benefit some others.  Note that a
superuser can set wal_compression for a given session - this would probably
benefit bulk-loading in an otherwise OLTP environment.

As written, this patch uses zstd level=1 (whereas ZSTD's default compression
level is 6).
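The level-versus-size tradeoff being discussed can be illustrated quickly. Since zstd isn't in the Python standard library, this sketch uses zlib purely as a stand-in; the point is only that higher levels spend more CPU for (usually) smaller output, which is why a WAL-oriented default leans toward level 1:

```python
import zlib

# an 8 kB "page" of fairly compressible data, standing in for a
# full-page image written to WAL
page = bytes(range(256)) * 32

# compressed size at a fast, a default, and a slow level
sizes = {level: len(zlib.compress(page, level)) for level in (1, 6, 9)}
```

For this input, every level shrinks the page, and higher levels never produce larger output than lower ones.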

0001 adds support for zstd
0002 makes more efficient our use of bits in the WAL header
0003 makes it the default compression, to exercise on CI (windows will fail).
0004 adds support for zstd-level
0005-6 are needed to allow recovery tests to pass when wal compression is 
enabled;
0007 (not included) also adds support for zlib.  I'm of the impression nobody
 cares about this, otherwise it would've been included 5-10 years ago.

Only 0001 should be reviewed for pg15 - the others are optional/future work.
>From 9253013c789ffb121272bfeeaa9dcdebbef79ced Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 18 Feb 2022 22:54:18 -0600
Subject: [PATCH 1/6] add wal_compression=zstd

---
 doc/src/sgml/config.sgml  |  9 +++---
 doc/src/sgml/installation.sgml| 10 +++
 src/backend/access/transam/xloginsert.c   | 30 ++-
 src/backend/access/transam/xlogreader.c   | 20 +
 src/backend/utils/misc/guc.c  |  3 ++
 src/backend/utils/misc/postgresql.conf.sample |  2 +-
 src/bin/pg_waldump/pg_waldump.c   |  2 ++
 src/include/access/xlog.h |  3 +-
 src/include/access/xlogrecord.h   |  5 +++-
 9 files changed, 76 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7ed8c82a9d..97740c7b66 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3154,10 +3154,11 @@ include_dir 'conf.d'
 server compresses full page images written to WAL when
  is on or during a base backup.
 A compressed page image will be decompressed during WAL replay.
-The supported methods are pglz and
-lz4 (if PostgreSQL was
-compiled with --with-lz4). The default value is
-off. Only superusers can change this setting.
+The supported methods are pglz, and
+(if configured when PostgreSQL was built) 
+lz4 and zstd.
+The default value is off.
+Only superusers can change this setting.

 

diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 0f74252590..d32767b229 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -271,6 +271,14 @@ su - postgres
  
 
 
+
+ 
+  The ZSTD library can be used to enable
+  compression using that method; see
+  .
+ 
+
+
 
  
   To build the PostgreSQL documentation,
@@ -988,6 +996,8 @@ build-postgresql:

 
  Build with ZSTD compression support.
+ This enables use of ZSTD for
+ compression of WAL data.
 

   
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c260310c4c..847ce0c3fc 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -44,9 +44,17 @@
 #define LZ4_MAX_BLCKSZ		0
 #endif
 
+#ifdef USE_ZSTD
+#include 
+#define ZSTD_MAX_BLCKSZ		ZSTD_COMPRESSBOUND(BLCKSZ)
+#else
+#define ZSTD_MAX_BLCKSZ		0
+#endif
+
+/* Buffer size required to store a compressed version of backup block image */
 #define PGLZ_MAX_BLCKSZ		PGLZ_MAX_OUTPUT(BLCKSZ)
 
-#define COMPRESS_BUFSIZE	Max(PGLZ_MAX_BLCKSZ, LZ4_MAX_BLCKSZ)
+#define COMPRESS_BUFSIZE	Max(Max(PGLZ_MAX_BLCKSZ, LZ4_MAX_BLCKSZ), ZSTD_MAX_BLCKSZ)
 
 /*
  * For each block reference registered with XLogRegisterBuffer, we fill in
@@ -695,6 +703,14 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 #endif
 		break;
 
+	case WAL_COMPRESSION_ZSTD:
+#ifdef USE_ZSTD
+		bimg.bimg_info |= BKPIMAGE_COMPRESS_ZSTD;
+#else
+		elog(ERROR, "ZSTD is not supported by this build");
+#endif
+		break;
+
 	case WAL_COMPRESSION_NONE:
 		Assert(false);	/* cannot happen */
 	

Re: making pg_regress less noisy by removing boilerplate

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 08:55:23 -0500, Andrew Dunstan wrote:
> I'm pretty sure all my Windows machines with buildfarm animals are
> sufficiently modern except the XP machine, which only builds release 10
> nowadays.

Cool.


> What's involved in moving to require Unix socket support?

It works today (the CI scripts use it on windows for example).

But it's awkward because make_temp_sockdir() defaults to /tmp/ if TMPDIR isn't
set. Which it is not by default on windows. There's PG_REGRESS_SOCK_DIR, which
kind of works for the msvc build, because pg_regress tests aren't run
concurrently (whereas tap tests can be run concurrently with
PROVE_FLAGS=-j).

I think we just make make_temp_sockdir() a tad smarter. Perhaps by lifting the
code in src/bin/psql/command.c:do_edit() to src/port/path.c or such?
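As a sketch of what a "tad smarter" lookup order might look like (the environment-variable precedence here is an assumption for illustration, not the actual pg_regress logic, and the real code would of course be C):

```python
import os
import tempfile

def make_temp_sockdir():
    # hypothetical precedence: an explicit PG_REGRESS_SOCK_DIR wins,
    # then TMPDIR, then the platform default -- rather than hardcoding /tmp
    base = (os.environ.get("PG_REGRESS_SOCK_DIR")
            or os.environ.get("TMPDIR")
            or tempfile.gettempdir())
    # a per-run directory avoids collisions between concurrent test runs
    return tempfile.mkdtemp(prefix="pg_regress-", dir=base)
```

Each concurrent run gets its own socket directory, so TAP tests run with a parallel prove no longer step on each other.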

Greetings,

Andres Freund




Using operators to do query hints

2022-02-22 Thread Greg Stark
I've been playing with an idea I had a while back. Basically that it
would be useful to have some "noop" operators that are used purely to
influence the planner.

For context I've suggested in the past that there are two categories of hints:

1 Hints that override the planner's decisions with explicit planning
decisions. These we don't like for a variety of reasons, notably
because users don't have very good knowledge of all the plan types
available and new plan types come out in the future.

2 Hints that just influence the estimates that the planner uses to
make its decisions. These seem more acceptable as it amounts to
admitting the users know the distribution of their data and the
behaviour of their functions better than the statistics. Also, there
are plenty of cases where the statistics are really just wild guesses.
They still allow the planner to make decisions about what join orders
or types and so on given the updated data.

So I thought adding some noop operators that took a boolean on one
side and a float on the other and simply returned the boolean at
run-time but used the float (which would have to be a constant) from
the lhs as the selectivity would be useful.
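As a rough model of why that float matters: the planner multiplies clause selectivities into its row estimate, and that estimate is what a noop selectivity operator would override. A toy sketch (the 0.099 and 1e-06 figures are illustrative, loosely mirroring the EXPLAIN examples further down):

```python
def estimate_rows(reltuples, selectivities):
    # planner-style row estimate: clause selectivities multiply, and the
    # result is clamped to at least one row
    est = float(reltuples)
    for sel in selectivities:
        est *= sel
    return max(1, round(est))
```

So stacking a hinted selectivity of 1e-06 on top of the planner's own guess collapses the estimate to a single row, which is what downstream plan nodes then see.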

In practice it's usable but not nearly as widely useful as I expected.

1) The boolean being passed through has to be a real clause with real
fields. Otherwise the planner is too clever and turns it into a
one-time filter which doesn't affect the plan :/

2) If it's a clause that is potentially going to be an index
condition then you need to repeat the clause outside the selectivity
noop operator. Otherwise the selectivity operator hides it from the
planner.

3) It doesn't work on joins at all. Joins are done using the join
selectivity function on the join clause's operator. There's no way to
construct a simple noop wrapper that would still work with that, as
far as I can see.

Nonetheless what I have does seem to be somewhat handy for simple cases.

I added some simple SQL wrapper functions such as pg_unlikely():

postgres=# explain select * from test where i<100;
 QUERY PLAN
-
 Index Only Scan using i on test  (cost=0.28..10.01 rows=99 width=4)
   Index Cond: (i < 100)
(2 rows)

postgres=# explain select * from test where pg_unlikely(i<100);
 QUERY PLAN

 Index Only Scan using i on test  (cost=0.28..10.50 rows=1 width=4)
   Index Cond: (i < 100)
   Filter: ('1e-06'::double precision %%% (i < 100))
(3 rows)

However this doesn't really do what you might really be hoping.
specifically, it doesn't actually affect the planner's choice of the
index scan in that query. It affects the estimate of the result of the
scan and that might be useful for subsequent nodes of the plan. But
not for the index scan itself:

postgres=# explain select * from test where pg_unlikely(i<500);
 QUERY PLAN
-
 Seq Scan on test  (cost=0.00..22.50 rows=1 width=4)
   Filter: ((i < 500) AND ('1e-06'::double precision %%% (i < 500)))
(2 rows)


So. I dunno. This feels like it's something that could be quite
handy but it's not as good as I had hoped and I think I'm out of ideas
for making it more powerful.


-- 
greg




Re: Time to drop plpython2?

2022-02-22 Thread Mark Wong
On Mon, Feb 21, 2022 at 03:28:35PM -0500, Tom Lane wrote:
> Mark Wong  writes:
> > On Sat, Feb 19, 2022 at 08:22:29AM -0800, Andres Freund wrote:
> >> Unfortunately it looks like it wasn't quite enough. All, or nearly all, 
> >> your
> >> animals that ran since still seem to be failing in the same spot...
> 
> > Oops, made another pass for python3 dev libraries.
> 
> You might need to do one more thing, which is manually blow away the cache
> files under $BUILDFARM/accache.  For example, on demoiselle everything
> looks fine in HEAD, but the back branches are failing like this:
> 
> checking for python... (cached) /usr/bin/python
> ./configure: line 10334: /usr/bin/python: No such file or directory
> configure: using python 
> ./configure: line 10342: test: : integer expression expected
> checking for Python sysconfig module... no
> configure: error: sysconfig module not found
> 
> Very recent versions of the buildfarm script will discard accache
> automatically after a configure or make failure, but I think the
> REL_11 you're running here doesn't have that defense.  It'll only
> flush accache after a change in the configure script in git.

Take 3. :)

I've upgraded everyone to the v14 buildfarm scripts and made sure the
--test passed on HEAD on each one.  So I hopefully didn't miss any
(other than the one EOL OpenSUSE version that I will plan on upgrading.)

Regards,
Mark




Re: C++ Trigger Framework

2022-02-22 Thread Nathan Bossart
On Tue, Feb 22, 2022 at 04:02:39PM +0200, Shmuel Kamensky wrote:
> I'm interested in creating a Postgres extension that would enable
> developers to write triggers in (modern) C++. Does anyone know if there is
> already some sort of translation wrapper between the native Postgres C
> API's and C++? Or if there's already a project that allows for writing
> triggers in C++ with ease?
> I see that https://github.com/clkao/plv8js-clkao/blob/master/plv8_type.cc
> does implement an abstraction of sorts, but it's specific to v8 types and
> is not genericized as a way of interacting with Postgres C API's from C++
> from *an*y C++ code.
> 
> Can you imagine any potential technical challenges I may encounter (e.g.
> massaging postgres' custom allocator to work with C++'s new and delete
> operators, or unresolvable compiler incompatibilities)?

This might answer your questions:

https://www.postgresql.org/docs/devel/xfunc-c.html#EXTEND-CPP

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: TAP output format in pg_regress

2022-02-22 Thread Daniel Gustafsson
> On 22 Feb 2022, at 18:13, Andres Freund  wrote:
> On 2022-02-22 15:10:11 +0100, Daniel Gustafsson wrote:

>> The errorpaths that exit(2) the testrun should be converted to "bail out" 
>> lines
>> when running with TAP output, but apart from that I think it's fairly spec
>> compliant.
> 
> I'd much rather not use BAIL - I haven't gotten around to doing anything about
> it, but I really want to get rid of nearly all our uses of bail:

Point.  We already error out on stderr in pg_regress so we could probably make
die() equivalent output to keep the TAP parsing consistent.  At any rate,
awaiting the conclusions on the bail thread and simply (for some value of)
replicating that in this patch is probably the best option?

--
Daniel Gustafsson   https://vmware.com/





Re: bailing out in tap tests nearly always a bad idea

2022-02-22 Thread Andres Freund
On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote:
> I'll be surprised if we can't come up with something cleaner than that.

Suggestions?




C++ Trigger Framework

2022-02-22 Thread Shmuel Kamensky
Hello all, this is my first time posting here (I first posted on the IRC
but didn't get any response), so let me know if there's a different
procedure for asking questions.

I'm interested in creating a Postgres extension that would enable
developers to write triggers in (modern) C++. Does anyone know if there is
already some sort of translation wrapper between the native Postgres C
API's and C++? Or if there's already a project that allows for writing
triggers in C++ with ease?
I see that https://github.com/clkao/plv8js-clkao/blob/master/plv8_type.cc
does implement an abstraction of sorts, but it's specific to v8 types and
is not genericized as a way of interacting with Postgres C API's from C++
from *an*y C++ code.

Can you imagine any potential technical challenges I may encounter (e.g.
massaging postgres' custom allocator to work with C++'s new and delete
operators, or unresolvable compiler incompatibilities)?

Thanks for any input :-)


Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-22 Thread Alvaro Herrera
On 2022-Feb-22, Imseih (AWS), Sami wrote:

> On 13.5 a wal flush PANIC is encountered after a standby is promoted.
> 
> With debugging, it was found that when a standby skips a missing
> continuation record on recovery, the missingContrecPtr is not
> invalidated after the record is skipped. Therefore, when the standby
> is promoted to a primary it writes an overwrite_contrecord with an LSN
> of the missingContrecPtr, which is now in the past. On flush time,
> this causes a PANIC. From what I can see, this failure scenario can
> only occur after a standby is promoted.

Ooh, nice find and diagnosis.  I can confirm that the test fails as you
described without the code fix, and doesn't fail with it.

I attach the same patch, with the test file put in its final place
rather than as a patch.  Due to recent xlog.c changes this needs a bit of
work to apply to back branches; I'll see about getting it in all
branches soon.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg0.php
>From 7948d1acdcd25b5b425c7804fad0d46cfb4b14b0 Mon Sep 17 00:00:00 2001
From: "Sami Imseih (AWS)" 
Date: Tue, 22 Feb 2022 19:09:36 +
Subject: [PATCH v2] Fix "missing continuation record" after standby promotion

Fix a condition where a recently promoted standby attempts to write an
OVERWRITE_RECORD with an LSN of the previously read aborted record.

Author: Sami Imseih 
Discussion: https://postgr.es/m/44d259de-7542-49c4-8a52-2ab01534d...@amazon.com
---
 src/backend/access/transam/xlog.c |  16 ++-
 .../t/029_overwrite_contrecord_promotion.pl   | 111 ++
 2 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/029_overwrite_contrecord_promotion.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..62e942f41a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5423,11 +5423,25 @@ StartupXLOG(void)
 	 * made it through and start writing after the portion that persisted.
 	 * (It's critical to first write an OVERWRITE_CONTRECORD message, which
 	 * we'll do as soon as we're open for writing new WAL.)
+	 *
+	 * If the last wal record is ahead of the missing contrecord, this is a
+	 * recently promoted primary and we should not write an overwrite
+	 * contrecord.
 	 */
 	if (!XLogRecPtrIsInvalid(missingContrecPtr))
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
-		EndOfLog = missingContrecPtr;
+		if (endOfRecoveryInfo->lastRec < missingContrecPtr)
+		{
+			elog(DEBUG2, "setting end of wal to missing continuation record %X/%X",
+ LSN_FORMAT_ARGS(missingContrecPtr));
+			EndOfLog = missingContrecPtr;
+		}
+		else
+		{
+			elog(DEBUG2, "resetting aborted record");
+			abortedRecPtr = InvalidXLogRecPtr;
+		}
 	}
 
 	/*
diff --git a/src/test/recovery/t/029_overwrite_contrecord_promotion.pl b/src/test/recovery/t/029_overwrite_contrecord_promotion.pl
new file mode 100644
index 00..ea4ebb32c0
--- /dev/null
+++ b/src/test/recovery/t/029_overwrite_contrecord_promotion.pl
@@ -0,0 +1,111 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Tests for resetting the "aborted record" after a promotion.
+
+use strict;
+use warnings;
+
+use FindBin;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test: Create a physical replica that's missing the last WAL file,
+# then restart the primary to create a divergent WAL file and observe
+# that the replica resets the "aborted record" after a promotion.
+
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init(allows_streaming => 1);
+# We need these settings for stability of WAL behavior.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+wal_keep_size = 1GB
+log_min_messages = DEBUG2
+));
+$node->start;
+
+$node->safe_psql('postgres', 'create table filler (a int, b text)');
+
+# Now consume all remaining room in the current WAL segment, leaving
+# space enough only for the start of a largish record.
+$node->safe_psql(
+	'postgres', q{
+DO $$
+DECLARE
+wal_segsize int := setting::int FROM pg_settings WHERE name = 'wal_segment_size';
+remain int;
+iters  int := 0;
+BEGIN
+LOOP
+INSERT into filler
+select g, repeat(md5(g::text), (random() * 60 + 1)::int)
+from generate_series(1, 10) g;
+
+remain := wal_segsize - (pg_current_wal_insert_lsn() - '0/0') % wal_segsize;
+IF remain < 2 * setting::int from pg_settings where name = 'block_size' THEN
+RAISE log 'exiting after % iterations, % bytes to end of WAL segment', iters, remain;
+EXIT;
+END IF;
+iters := iters + 1;
+END LOOP;
+END
+$$;
+});
+

Re: postgres_fdw: using TABLESAMPLE to collect remote sample

2022-02-22 Thread Tomas Vondra
On 2/22/22 01:36, Fujii Masao wrote:
> 
> 
> On 2022/02/18 22:28, Tomas Vondra wrote:
>> Hi,
>>
>> here's a slightly updated version of the patch series.
> 
> Thanks for updating the patches!
> 
> 
>> The 0001 part
>> adds tracking of server_version_num, so that it's possible to enable
>> other features depending on it.
> 
> Like configure_remote_session() does, can't we use PQserverVersion()
> instead of implementing new function GetServerVersion()?
> 

Ah! My knowledge of libpq is limited, so I wasn't sure this function
exists. It'll simplify the patch.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: bailing out in tap tests nearly always a bad idea

2022-02-22 Thread Andrew Dunstan


On 2/22/22 13:19, Andres Freund wrote:
> Hi,
>
> On 2022-02-22 09:28:37 -0800, Andres Freund wrote:
>> On 2022-02-14 09:46:20 -0800, Andres Freund wrote:
 t/die-immediately.t  aieee at t/die-immediately.t line 1.
 t/die-immediately.t  Dubious, test returned 255 (wstat 65280, 0xff00)
 No subtests run
>>> Hm. Looks different when a test is including our test helpers.
>>>
>>> t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
>>> No subtests run
>>> t/000_die_test_more.pl .. error at t/000_die_test_more.pl line 2.
>>> t/000_die_test_more.pl .. Dubious, test returned 255 (wstat 65280, 
>>> 0xff00)
>>> No subtests run
>>>
>>> So I think we broke something... I think it's the log file redirection stuff
>>> in INIT.
>> Any chance somebody with more perl knowledge could look into this? Clearly 
>> our
>> effort to copy all the output the original file descriptors isn't successful
>> here.
>>
>> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use 
>> PostgreSQL::Test::Utils; die 'blorb';"
>> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 
>> 'blorb';"
>> blorb at -e line 1.
> So the problem is that die just outputs to stderr, but we've decided to have
> stderr connected to the output going to tap for some reason. I guess that
> prevents us from messing up the tap output, but it's still weird. Because of
> that redirection, die printing to stderr isn't visible to tap.
>
> Seems we can use black magic to "fix" that... Putting the following into
> INIT{} seems to do the trick:
>
>   # Because of the above redirection the tap output wouldn't contain
>   # information about tests failing due to die etc. Fix that by also
>   # printing the failure to the original stderr.
>   $SIG{__DIE__} = sub
>   {
>   # Don't redirect if it's a syntax error ($^S is undefined) or 
> in an
>   # eval block ($^S == 1).
>   if (defined $^S and $^S == 0)
>   {
>   print $orig_stderr "$_[0]\n";
>   #fail("died: $_[0]");
>   #done_testing();
>   }
>   };
>
>
> $ cat /tmp/die_pg_utils.pl
> use PostgreSQL::Test::Utils;
> use Test::More;
>
> ok(1, "foo");
> die 'blorb';
> done_testing();
>
> With the print we get something like:
>
> $ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl
> ok 1 - foo
> blorb at /tmp/die_pg_utils.pl line 5.
>
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 25 just after 1.
>
> With the fail() and done_testing() we get
>
> $ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl
> ok 1 - foo
> not ok 2 - died: blorb at /tmp/die_pg_utils.pl line 5.
> #
> #   Failed test 'died: blorb at /tmp/die_pg_utils.pl line 5.
> # '
> #   at /home/andres/src/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm 
> line 235.
> 1..2
> # Looks like your test exited with 25 just after 2.
>
>
> The latter output doesn't confuse with stuff about plans and exit codes. But
> contains redundant messages (whatever) and it doesn't seem particularly clean
> to hijack a "test number".
>



I'll be surprised if we can't come up with something cleaner than that.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: remove more archiving overhead

2022-02-22 Thread Nathan Bossart
On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
> In my testing, I found that when I killed the server just before unlink()
> during WAL recyling, I ended up with links to the same file in pg_wal after
> restarting.  My latest test produced links to the same file for the current
> WAL file and the next one.  Maybe WAL recyling should use durable_rename()
> instead of durable_rename_excl().
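The failure window is easy to see in miniature: `durable_rename_excl()` installs the file with link() and then unlink(), so a crash between the two calls leaves both names pointing at the same inode. A Python sketch on POSIX (the file names are illustrative):

```python
import os
import tempfile

d = tempfile.mkdtemp()
src = os.path.join(d, "xlogtemp.123")              # recycled segment, temp name
dst = os.path.join(d, "000000010000000000000001")  # target WAL segment name
with open(src, "wb") as f:
    f.write(b"\0" * 8192)

# durable_rename_excl() does link(src, dst) then unlink(src); simulate a
# crash that happens after the link() but before the unlink():
os.link(src, dst)

# both names now refer to the same inode -- the state seen in pg_wal
assert os.stat(src).st_nlink == 2
assert os.path.samefile(src, dst)
```

A plain rename() is atomic on POSIX filesystems, so durable_rename() never exposes this intermediate two-link state.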

Here is an updated patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e080d493985cf0a203122c1e07952b0f765019e4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 22 Feb 2022 11:39:28 -0800
Subject: [PATCH v2 1/1] reduce archiving overhead

---
 src/backend/access/transam/xlog.c | 10 ++
 src/backend/postmaster/pgarch.c   | 14 +-
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..2ad047052f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3334,13 +3334,15 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
+	 * Perform the rename.  Ideally, we'd use link() and unlink() to avoid
+	 * overwriting an existing file (there shouldn't be one).  However, that
+	 * approach opens up the possibility that pg_wal will contain multiple hard
+	 * links to the same WAL file after a crash.
 	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index d916ed39a8..641297e9f5 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -746,7 +746,19 @@ pgarch_archiveDone(char *xlog)
 
 	StatusFilePath(rlogready, xlog, ".ready");
 	StatusFilePath(rlogdone, xlog, ".done");
-	(void) durable_rename(rlogready, rlogdone, WARNING);
+
+	/*
+	 * To avoid extra overhead, we don't durably rename the .ready file to
+	 * .done.  Archive commands and libraries must already gracefully handle
+	 * attempts to re-archive files (e.g., if the server crashes just before
+	 * this function is called), so it should be okay if the .ready file
+	 * reappears after a crash.
+	 */
+	if (rename(rlogready, rlogdone) < 0)
+		ereport(WARNING,
+(errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+		rlogready, rlogdone)));
 }
 
 
-- 
2.25.1



[BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-22 Thread Imseih (AWS), Sami
On 13.5, a WAL flush PANIC is encountered after a standby is promoted.

With debugging, it was found that when a standby skips a missing continuation 
record during recovery, the missingContrecPtr is not invalidated after the record 
is skipped. Therefore, when the standby is promoted to a primary, it writes an 
overwrite_contrecord with the LSN of the missingContrecPtr, which is now in the 
past. At flush time, this causes a PANIC. From what I can see, this failure 
scenario can only occur after a standby is promoted.

The overwrite_contrecord was introduced in 13.5 with 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24.

Attached is a patch and a TAP test to handle this condition. The patch ensures 
that an overwrite_contrecord is only created if the missingContrecPtr is ahead 
of the last WAL record.

To reproduce:
Run the new TAP test recovery/t/029_overwrite_contrecord_promotion.pl without 
the attached patch.

2022-02-22 18:38:15.526 UTC [31138] LOG:  started streaming WAL from primary at 
0/200 on timeline 1
2022-02-22 18:38:15.535 UTC [31105] LOG:  successfully skipped missing 
contrecord at 0/1FFC620, overwritten at 2022-02-22 18:38:15.136482+00
2022-02-22 18:38:15.535 UTC [31105] CONTEXT:  WAL redo at 0/228 for 
XLOG/OVERWRITE_CONTRECORD: lsn 0/1FFC620; time 2022-02-22 18:38:15.136482+00
…
…..
2022-02-22 18:38:15.575 UTC [31103] PANIC:  xlog flush request 0/201EC70 is not 
satisfied --- flushed only to 0/288
2022-02-22 18:38:15.575 UTC [31101] LOG:  checkpointer process (PID 31103) was 
terminated by signal 6: Aborted
….
…..

With the patch, running the same tap test succeeds and a PANIC is not observed.

Thanks

Sami Imseih
Amazon Web Services





0001-Fix-missing-continuation-record-after-standby-promot.patch
Description: 0001-Fix-missing-continuation-record-after-standby-promot.patch


Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-22 Thread Matthias van de Meent
On Tue, 22 Feb 2022 at 07:39, Nitin Jadhav
 wrote:
>
> > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > - can be stored without any problem. 'checkpoint or restartpoint?'
> > > (boolean) - can be stored as an integer value like
> > > PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
> > > PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
> > > start time in stat_progress, timestamp fits in 64 bits) - As
> > > Timestamptz is of type int64 internally, so we can store the timestamp
> > > value in the progress parameter and then expose a function like
> > > 'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
> > > Timestamptz) as argument and then returns string representing the
> > > elapsed time.
> >
> > No need to use a string there; I think exposing the checkpoint start
> > time is good enough. The conversion of int64 to timestamp[tz] can be
> > done in SQL (although I'm not sure that exposing the internal bitwise
> > representation of Interval should be exposed to that extent) [0].
> > Users can then extract the duration interval using now() - start_time,
> > which also allows the user to use their own preferred formatting.
>
> The reason for showing the elapsed time rather than exposing the
> timestamp directly is in case of checkpoint during shutdown and
> end-of-recovery, I am planning to log a message in server logs using
> 'log_startup_progress_interval' infrastructure which displays elapsed
time. So, to keep both behaviours consistent, I am displaying elapsed
time here. I feel that elapsed time gives a quicker feel of the
> progress. Kindly let me know if you still feel just exposing the
> timestamp is better than showing the elapsed time.

At least for pg_stat_progress_checkpoint, storing only a timestamp in
the pg_stat storage (instead of repeatedly updating the field as a
duration) seems to provide much more precise measures of 'time
elapsed' for other sessions if one step of the checkpoint is taking a
long time.

I understand the want to integrate the log-based reporting in the same
API, but I don't think that is necessarily the right approach:
pg_stat_progress_* has low-overhead infrastructure specifically to
ensure that most tasks will not run much slower while reporting, never
waiting for locks. Logging, however, needs to take locks (if only to
prevent concurrent writes to the output file at a kernel level) and
thus has a non-negligible overhead, which makes it poorly suited to
precise and very frequent statistics updates.

So, although similar in nature, I don't think it is smart to use the
exact same infrastructure between pgstat_progress*-based reporting and
log-based progress reporting, especially if your logging-based
progress reporting is not intended to be a debugging-only
configuration option similar to log_min_messages=DEBUG[1..5].

- Matthias




Re: bailing out in tap tests nearly always a bad idea

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 09:28:37 -0800, Andres Freund wrote:
> On 2022-02-14 09:46:20 -0800, Andres Freund wrote:
> > > t/die-immediately.t  aieee at t/die-immediately.t line 1.
> > > t/die-immediately.t  Dubious, test returned 255 (wstat 65280, 0xff00)
> > > No subtests run
> >
> > Hm. Looks different when a test is including our test helpers.
> >
> > t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
> > No subtests run
> > t/000_die_test_more.pl .. error at t/000_die_test_more.pl line 2.
> > t/000_die_test_more.pl .. Dubious, test returned 255 (wstat 65280, 
> > 0xff00)
> > No subtests run
> >
> > So I think we broke something... I think it's the log file redirection stuff
> > in INIT.
>
> Any chance somebody with more perl knowledge could look into this? Clearly our
> effort to copy all the output to the original file descriptors isn't successful
> here.
>
> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use 
> PostgreSQL::Test::Utils; die 'blorb';"
> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 
> 'blorb';"
> blorb at -e line 1.

So the problem is that die just outputs to stderr, but we've decided to have
stderr connected to the output going to tap for some reason. I guess that
prevents us from messing up the tap output, but it's still weird. Because of
that redirection, die printing to stderr isn't visible to tap.

Seems we can use black magic to "fix" that... Putting the following into
INIT{} seems to do the trick:

# Because of the above redirection the tap output wouldn't contain
# information about tests failing due to die etc. Fix that by also
# printing the failure to the original stderr.
$SIG{__DIE__} = sub
{
	# Don't redirect if it's a syntax error ($^S is undefined) or in an
	# eval block ($^S == 1).
	if (defined $^S and $^S == 0)
	{
		print $orig_stderr "$_[0]\n";
		#fail("died: $_[0]");
		#done_testing();
	}
};


$ cat /tmp/die_pg_utils.pl
use PostgreSQL::Test::Utils;
use Test::More;

ok(1, "foo");
die 'blorb';
done_testing();

With the print we get something like:

$ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl
ok 1 - foo
blorb at /tmp/die_pg_utils.pl line 5.

# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 25 just after 1.

With the fail() and done_testing() we get

$ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl
ok 1 - foo
not ok 2 - died: blorb at /tmp/die_pg_utils.pl line 5.
#
#   Failed test 'died: blorb at /tmp/die_pg_utils.pl line 5.
# '
#   at /home/andres/src/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 
235.
1..2
# Looks like your test exited with 25 just after 2.


The latter output doesn't confuse the reader with messages about plans and
exit codes, but it contains redundant messages and it doesn't seem
particularly clean to hijack a "test number".

Greetings,

Andres Freund




Re: Documentation about PL transforms

2022-02-22 Thread Chapman Flack
On 02/21/22 16:09, Chapman Flack wrote:
> On 02/07/22 15:14, Chapman Flack wrote:
>> I'll work on some doc patches.
> 
> It seems a bit of an impedance mismatch that there is a get_func_trftypes
> producing a C Oid[] (with its length returned separately), while
> get_transform_fromsql and get_transform_tosql both expect a List.
> ...
> Would it be reasonable to deprecate get_func_trftypes and instead
> provide a get_call_trftypes

It would have been painful to write documentation of get_func_trftypes
saying its result isn't what get_transform_{from,to}sql expect, so
patch 1 does add a get_call_trftypes that returns a List *.

Patch 2 then updates the docs as discussed in this thread. It turned out
plhandler.sgml was kind of a long monolith of text even before adding
transform information, so I broke it into sections first. This patch adds
the section markup without reindenting, so the changes aren't obscured.

The chapter had also fallen behind the introduction of procedures, so
I have changed many instances of 'function' to the umbrella term
'routine'.

Patch 3 simply reindents for the new section markup and rewraps.
Whitespace only.

Patch 4 updates src/test/modules/plsample to demonstrate handling of
transforms (and to add some more comments generally).

I'll add this to the CF.

Regards,
-Chap
>From 7e6027cdcf1dbf7a10c0bad5059816cef762b1b9 Mon Sep 17 00:00:00 2001
From: Chapman Flack 
Date: Mon, 21 Feb 2022 20:56:28 -0500
Subject: [PATCH 1/4] Warmup: add a get_call_trftypes function

The existing get_func_trftypes function produces an Oid[], where
both existing get_transform_{from,to}sql functions that depend
on the result expect a List*.

Rather than writing documentation awkwardly describing functions
that won't play together, add a get_call_trftypes function that
returns List*. (The name get_call_... to distinguish from
get_func_... follows the naming used in funcapi.h for a function
returning information about either a function or a procedure.)
---
 src/backend/utils/cache/lsyscache.c | 18 ++
 src/include/funcapi.h   |  5 +
 src/include/utils/lsyscache.h   |  1 +
 3 files changed, 24 insertions(+)

diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index feef999..a49ccad 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -2107,6 +2107,24 @@ get_transform_tosql(Oid typid, Oid langid, List *trftypes)
 		return InvalidOid;
 }
 
+/*
+ * get_call_trftypes
+ *
+ *		A helper function that does not itself query the transform cache, but
+ *		constructs the transform-type List expected by the functions above.
+ */
+List *
+get_call_trftypes(HeapTuple procTup)
+{
+	Datum		protrftypes;
+	bool		isNull;
+
+	protrftypes = SysCacheGetAttr(PROCOID, procTup,
+								  Anum_pg_proc_protrftypes,
+								  &isNull);
+	return isNull ? NIL : oid_array_to_list(protrftypes);
+}
+
 
 /*-- TYPE CACHE --		 */
 
diff --git a/src/include/funcapi.h b/src/include/funcapi.h
index ba927c2..7c61560 100644
--- a/src/include/funcapi.h
+++ b/src/include/funcapi.h
@@ -175,7 +175,12 @@ extern int	get_func_arg_info(HeapTuple procTup,
 extern int	get_func_input_arg_names(Datum proargnames, Datum proargmodes,
 	 char ***arg_names);
 
+/*
+ * A deprecated earlier version of get_call_trftypes (in lsyscache.h).
+ * That version produces a List, which is the form downstream functions expect.
+ */
 extern int	get_func_trftypes(HeapTuple procTup, Oid **p_trftypes);
+
 extern char *get_func_result_name(Oid functionId);
 
 extern TupleDesc build_function_result_tupdesc_d(char prokind,
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index b8dd27d..93b19e7 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -139,6 +139,7 @@ extern char get_rel_relkind(Oid relid);
 extern bool get_rel_relispartition(Oid relid);
 extern Oid	get_rel_tablespace(Oid relid);
 extern char get_rel_persistence(Oid relid);
+extern List *get_call_trftypes(HeapTuple procTup);
 extern Oid	get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
 extern Oid	get_transform_tosql(Oid typid, Oid langid, List *trftypes);
 extern bool get_typisdefined(Oid typid);
-- 
2.7.3

>From 447a4aaecdf3a23a47c11ad741b49f81475e34e3 Mon Sep 17 00:00:00 2001
From: Chapman Flack 
Date: Mon, 21 Feb 2022 20:59:32 -0500
Subject: [PATCH 2/4] Update PL handler implementation docs

The original purpose was to add information on support for
CREATE TRANSFORM (which must be explicitly coded in any PL
implementation intending to support it). But the plhandler
section was about as long as a monolith of text ought to be,
even before adding transform information, so reorganized
first into sections.

Front-loaded with short descriptions of the three possible
functions (call handler, validator, inline handler) registered
with CREATE LANGUAGE. The latter two were afterthoughts in historical
sequence, but the docs 

Re: remove more archiving overhead

2022-02-22 Thread Nathan Bossart
On Mon, Feb 21, 2022 at 05:19:48PM -0800, Nathan Bossart wrote:
> I also spent some time investigating whether durably renaming the archive
> status files was even necessary.  In theory, it shouldn't be.  If a crash
> happens before the rename is persisted, we might try to re-archive files,
> but your archive_command/archive_library should handle that.  If the file
> was already recycled or removed, the archiver will skip it (thanks to
> 6d8727f).  But when digging further, I found that WAL file recyling uses
> durable_rename_excl(), which has the following note:
> 
>* Note that a crash in an unfortunate moment can leave you with two 
> links to
>* the target file.
> 
> IIUC this means that in theory, a crash at an unfortunate moment could
> leave you with a .ready file, the file to archive, and another link to that
> file with a "future" WAL filename.  If you re-archive the file after it has
> been reused, you could end up with corrupt WAL archives.  I think this
> might already be possible today.  Syncing the directory after every rename
> probably reduces the likelihood quite substantially, but if
> RemoveOldXlogFiles() quickly picks up the .done file and attempts
> recycling before durable_rename() calls fsync() on the directory,
> presumably the same problem could occur.

In my testing, I found that when I killed the server just before unlink()
during WAL recycling, I ended up with links to the same file in pg_wal after
restarting.  My latest test produced links to the same file for the current
WAL file and the next one.  Maybe WAL recycling should use durable_rename()
instead of durable_rename_excl().

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: bailing out in tap tests nearly always a bad idea

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-14 09:46:20 -0800, Andres Freund wrote:
> > t/die-immediately.t  aieee at t/die-immediately.t line 1.
> > t/die-immediately.t  Dubious, test returned 255 (wstat 65280, 0xff00)
> > No subtests run
> 
> Hm. Looks different when a test is including our test helpers.
> 
> t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
> No subtests run
> t/000_die_test_more.pl .. error at t/000_die_test_more.pl line 2.
> t/000_die_test_more.pl .. Dubious, test returned 255 (wstat 65280, 0xff00)
> No subtests run
> 
> So I think we broke something... I think it's the log file redirection stuff
> in INIT.

Any chance somebody with more perl knowledge could look into this? Clearly our
effort to copy all the output to the original file descriptors isn't successful
here.

$ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use 
PostgreSQL::Test::Utils; die 'blorb';"
$ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 
'blorb';"
blorb at -e line 1.

(note that the former will create a tmp_check in your current directory)

Greetings,

Andres Freund




Re: TAP output format in pg_regress

2022-02-22 Thread Andres Freund
Hi,

Thanks for the updated version!

On 2022-02-22 15:10:11 +0100, Daniel Gustafsson wrote:
> The errorpaths that exit(2) the testrun should be converted to "bail out" 
> lines
> when running with TAP output, but apart from that I think it's fairly spec
> compliant.

I'd much rather not use BAIL - I haven't gotten around to doing anything about
it, but I really want to get rid of nearly all our uses of bail:

https://www.postgresql.org/message-id/20220213232249.7sevhlioapydla37%40alap3.anarazel.de

Greetings,

Andres Freund




Re: CREATEROLE and role ownership hierarchies

2022-02-22 Thread Joshua Brindle
On Thu, Feb 17, 2022 at 12:40 PM Robert Haas  wrote:
>
> On Mon, Jan 31, 2022 at 1:57 PM Joshua Brindle
>  wrote:
> > This is precisely the use case I am trying to accomplish with this
> > patchset, roughly:
> >
> > - An automated bot that creates users and adds them to the employees role
> > - Bot cannot access any employee (or other roles) table data
> > - Bot cannot become any employee
> > - Bot can disable the login of any employee
> >
> > Yes there are attack surfaces around the fringes of login, etc but
> > those can be mitigated with certificate authentication. My pg_hba
> > would require any role in the employees role to use cert auth.
> >
> > This would adequately mitigate many threats while greatly enhancing
> > user management.
>
> So, where do we go from here?
>
> I've been thinking about this comment a bit. On the one hand, I have
> some reservations about the phrase "the use case I am trying to
> accomplish with this patchset," because in the end, this is not your
> patch set. It's not reasonable to complain that a patch someone else
> wrote doesn't solve your problem; of course everyone writes patches to
> solve their own problems, or those of their employer, not other
> people's problems. And that's as it should be, else we will have few
> contributors. On the other hand, to the extent that this patch set
> makes things worse for a reasonable use case which you have in mind,
> that's an entirely legitimate complaint.

Yes, absolutely. It is my understanding that generally a community
consensus is attempted; I was throwing my (and Crunchy's) use case out
there as a possible goal, and I have spent time reviewing and testing
the patch, so I think that is fair. Obviously I am not in a position
to stipulate hard requirements.

> After a bit of testing, it seems to me that as things stand today,
> things are nearly perfect for the use case that you have in mind. I
> would be interested to know whether you agree. If I set up an account
> and give it CREATEROLE, it can create users, and it can put them into
> the employees role, but it can't SET ROLE to any of those accounts. It
> can also ALTER ROLE ... NOLOGIN on any of those accounts. The only gap
> I see is that there are certain role-based flags which the CREATEROLE
> account cannot set: SUPERUSER, BYPASSRLS, REPLICATION. You might
> prefer a system where your bot account had the option to grant those
> privileges also, and I think that's a reasonable thing to want.

I believe the only issue in the existing patchset was that membership in
employees was required for the Bot, but I can apply the current patchset
and test it out more in a bit.

> However, I *also* think it's reasonable to want an account that can
> create roles but can't give to those roles membership in roles that it
> does not itself possess. Likewise, I think it's reasonable to want an
> account that can only drop roles which that account itself created.
> These kinds of requirements stem from a different use case than what
> you are talking about here, but they seem like fine things to want,
> and as far as I know we have pretty broad agreement that they are
> reasonable. It seems extremely difficult to make a convincing argument
> that this is not a thing which anyone should want to block:
>
> rhaas=> create role bob role pg_execute_server_program;
> CREATE ROLE
>
> Honestly, that seems like a big yikes from here. How is it OK to block
> "create role bob superuser" yet allow that command? I'm inclined to
> think that's just broken. Even if the role were pg_write_all_data
> rather than pg_execute_server_program, it's still a heck of a lot of
> power to be handing out, and I don't see how anyone could make a
> serious argument that we shouldn't have an option to restrict that.

Yes, agreed 100%. To be clear, I do not want Bot in the above use case
to be able to add any role other than employees to new roles it
creates. So we are in complete agreement there, the only difference is
that I do not want Bot to be able to become those roles (or use any
access granted via those roles); its only job is to manage roles, not
look at data.

> Let me separate the two features that I just mentioned and talk about
> them individually:
>
> 1. Don't allow a CREATEROLE user to give out membership in groups
> which that user does not possess. Leaving aside the details of any
> previously-proposed patches and just speaking theoretically, how can
> this be implemented? I can think of a few ideas. We could (1A) just
> change CREATEROLE to work that way, but IIUC that would break the use
> case you outline here, so I guess that's off the table unless I am
> misunderstanding the situation. We could also (1B) add a second role
> attribute with a different name, like, err, CREATEROLEWEAKLY, that
> behaves in that way, leaving the existing one untouched. But we could
> also take it a lot further, because someone might want to let an
> account hand out a set of privileges which 

Re: Patch a potential memory leak in describeOneTableDetails()

2022-02-22 Thread Tom Lane
Kyotaro Horiguchi  writes:
> Anyway, don't we use the -ftabstop option flag to silence compiler?
> The warning is contradicting our convention. The attached adds that
> flag.

No, we use pgindent to not have such inconsistent indentation.

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-22 Thread Ashutosh Sharma
I'm not sure about the current status, but I found this while playing
around with the latest changes a bit, so I thought I would share it here.

+  
+   strategy
+   
+
+ This is used for copying the database directory.  Currently, we have
+ two strategies the WAL_LOG_BLOCK and the

Isn't it wal_log instead of wal_log_block?

I think when users input a wrong strategy with the createdb command, we
should provide a hint message showing the allowed values for the strategy
option along with the error message. This will be helpful for users.

On Tue, Feb 15, 2022 at 5:19 PM Dilip Kumar  wrote:
>
> On Tue, Feb 15, 2022 at 2:01 AM Maciek Sakrejda  wrote:
> >
> Here is the updated version of the patch, the changes are 1) Fixed
> review comments given by Robert and one open comment from Ashutosh.
> 2) Preserved the old create db method. 3) As agreed upthread for now
> we are using the new strategy only for createdb not for movedb so I
> have removed the changes in ForgetDatabaseSyncRequests() and
> DropDatabaseBuffers().  3) Provided a database creation strategy
> option as of now I have kept it as below.
>
> CREATE DATABASE ... WITH (STRATEGY = WAL_LOG);  -- default if
> option is omitted
> CREATE DATABASE ... WITH (STRATEGY = FILE_COPY);
>
> I have updated the document but I was not sure how much internal
> information to be exposed to the user so I will work on that based on
> feedback from others.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread Masahiko Sawada
On Tue, Feb 22, 2022 at 9:22 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Tuesday, February 22, 2022 2:45 PM Masahiko Sawada  
> wrote:
> > I've attached a patch that changes pg_stat_subscription_workers view.
> > It removes non-cumulative values such as error details such as error-XID and
> > the error message from the view, and consequently the view now has only
> > cumulative statistics counters: apply_error_count and sync_error_count. 
> > Since
> > the new view name is under discussion I temporarily chose
> > pg_stat_subscription_activity.
> Hi, thank you for sharing the patch.
>
>
> Few minor comments for v1.

Thank you for the comments!

>
> (1) commit message's typo
>
> This commits changes the view so that it stores only statistics
> counters: apply_error_count and sync_error_count.
>
> "This commits" -> "This commit"

Will fix.

>
> (2) minor improvement suggestion for the commit message
>
> I suggest that we touch the commit id 8d74fc9
> that introduces the pg_stat_subscription_workers
> in the commit message, for better traceability. Below is an example.
>
> From:
> As the result of the discussion, we've concluded that the stats
> collector is not an appropriate place to store the error information of
> subscription workers.
>
> To:
> As the result of the discussion about the view introduced by 8d74fc9,...

Okay, will add the commit reference.

>
> (3) doc/src/sgml/logical-replication.sgml
>
> Kindly refer to commit id 85c61ba for the detail.
> You forgot "the" in the below sentence.
>
> @@ -346,8 +346,6 @@
>
> A conflict will produce an error and will stop the replication; it must be
> resolved manually by the user.  Details about the conflict can be found in
> -   
> -   pg_stat_subscription_workers and the
> subscriber's server log.
>
>
> From:
> subscriber's server log.
> to:
> the subscriber's server log.

Will fix.

>
> (4) doc/src/sgml/monitoring.sgml
>
>   
>
> -   last_error_time timestamp with time 
> zone
> +   sync_error_count uint8
>
>
> -   Last time at which this error occurred
> +   Number of times the error occurred during the initial data copy
>
>
> I supposed it might be better to use "initial data sync"
> or "initial data synchronization", rather than "initial data copy".

"Initial data synchronization" sounds like the whole table
synchronization process including COPY and applying changes to catch
up. But sync_error_count is incremented only during COPY so I used
"initial data copy". What do you think?

>
> (5) src/test/subscription/t/026_worker_stats.pl
>
> +# Truncate test_tab1 so that table sync can continue.
> +$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;");
>
> The second truncate is for apply, isn't it? Therefore, kindly change
>
> From:
> Truncate test_tab1 so that table sync can continue.
> To:
> Truncate test_tab1 so that apply can continue.

Right, will fix.

>
> (6) src/test/subscription/t/026_worker_stats.pl
>
> +# Insert more data to test_tab1 on the subscriber and then on the publisher, 
> raising an
> +# error on the subscriber due to violation of the unique constraint on 
> test_tab1.
> +$node_subscriber->safe_psql('postgres', "INSERT INTO test_tab1 VALUES (2)");
>
> Did we need this insert ?
> If you want to indicate the apply is working okay after the error of table 
> sync is solved,
> waiting for the max value in the test_tab1 becoming 2 on the subscriber by 
> polling query
> would work. But, I was not sure if this is essentially necessary for the 
> testing purpose.

You're right, it's not necessary. Also, it seems better to change the
TAP test file name from 026_worker_stats.pl to 026_stats.pl. Will
incorporate these changes.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-22 Thread Ashutosh Sharma
+/* Kinds of checkpoint (as advertised via PROGRESS_CHECKPOINT_KIND) */
+#define PROGRESS_CHECKPOINT_KIND_WAL		0
+#define PROGRESS_CHECKPOINT_KIND_TIME		1
+#define PROGRESS_CHECKPOINT_KIND_FORCE		2
+#define PROGRESS_CHECKPOINT_KIND_UNKNOWN	3

On what basis have you classified the above into the various types of
checkpoints? AFAIK, the first two types are based on what triggered
the checkpoint (whether it was the checkpoint_timeout or max_wal_size
setting), while the third type indicates a forced checkpoint that can
happen when the checkpoint is triggered for various reasons, e.g.
during createdb or dropdb. It is quite possible that both the
PROGRESS_CHECKPOINT_KIND_TIME and PROGRESS_CHECKPOINT_KIND_FORCE flags
are set for the checkpoint because multiple checkpoint requests are
processed at one go, so what type of checkpoint would that be?

+	 */
+	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+	{
+		pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+		checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+										 PROGRESS_CHECKPOINT_PHASE_INIT);
+		if (flags & CHECKPOINT_CAUSE_XLOG)
+			checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+											 PROGRESS_CHECKPOINT_KIND_WAL);
+		else if (flags & CHECKPOINT_CAUSE_TIME)
+			checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+											 PROGRESS_CHECKPOINT_KIND_TIME);
+		else if (flags & CHECKPOINT_FORCE)
+			checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+											 PROGRESS_CHECKPOINT_KIND_FORCE);
+		else
+			checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+											 PROGRESS_CHECKPOINT_KIND_UNKNOWN);
+	}
+}

--
With Regards,
Ashutosh Sharma.

On Thu, Feb 10, 2022 at 12:23 PM Nitin Jadhav
 wrote:
>
> > > We need at least a trace of the number of buffers to sync (num_to_scan) 
> > > before the checkpoint start, instead of just emitting the stats at the 
> > > end.
> > >
> > > Bharat, it would be good to show the buffers synced counter and the total 
> > > buffers to sync, checkpointer pid, substep it is running, whether it is 
> > > on target for completion, checkpoint_Reason
> > > (manual/times/forced). BufferSync has several variables tracking the sync 
> > > progress locally, and we may need some refactoring here.
> >
> > I agree to provide above mentioned information as part of showing the
> > progress of current checkpoint operation. I am currently looking into
> > the code to know if any other information can be added.
>
> Here is the initial patch to show the progress of checkpoint through
> pg_stat_progress_checkpoint view. Please find the attachment.
>
> The information added to this view are pid - process ID of a
> CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> for checkpoint (values can be wal, time or force), phase - indicates
> the current phase of checkpoint operation, total_buffer_writes - total
> number of buffers to be written, buffers_processed - number of buffers
> processed, buffers_written - number of buffers written,
> total_file_syncs - total number of files to be synced, files_synced -
> number of files synced.
>
> There are many operations happen as part of checkpoint. For each of
> the operation I am updating the phase field of
> pg_stat_progress_checkpoint view. The values supported for this field
> are initializing, checkpointing replication slots, checkpointing
> snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> buffers, performing sync requests, performing two phase checkpoint,
> recycling old XLOG files and Finalizing. In case of checkpointing
> buffers phase, the fields total_buffer_writes, buffers_processed and
> buffers_written shows the detailed progress of writing buffers. In
> case of performing sync requests phase, the fields total_file_syncs
> and files_synced shows the detailed progress of syncing files. In
> other phases, only the phase field is getting updated and it is
> difficult to show the progress because we do not get the total number
> of files count without traversing the directory. It is not worth
> calculating that, as it affects the performance of the checkpoint. I also
> gave a thought to just mentioning the number of files processed, but this
> won't give meaningful progress information (it can be treated as
> statistics). Hence just updating the phase field in those scenarios.
>
> Apart from above fields, I am planning to add few more fields to the
> view in the next patch. That is, process ID of the backend process
> which triggered 

Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread Masahiko Sawada
On Tue, Feb 22, 2022 at 6:53 PM Amit Kapila  wrote:
>
> On Tue, Feb 22, 2022 at 11:15 AM Masahiko Sawada  
> wrote:
> >
> > I've attached a patch that changes pg_stat_subscription_workers view.
> > It removes non-cumulative values, such as the error details (error-XID
> > and the error message), from the view, and consequently the
> > view now has only cumulative statistics counters: apply_error_count
> > and sync_error_count. Since the new view name is under discussion I
> > temporarily chose pg_stat_subscription_activity.
> >
>
> Few comments:
> =

Thank you for the comments!

> 1.
> --- a/src/backend/catalog/system_functions.sql
> +++ b/src/backend/catalog/system_functions.sql
> @@ -637,11 +637,9 @@ REVOKE EXECUTE ON FUNCTION
> pg_stat_reset_single_table_counters(oid) FROM public;
>
>  REVOKE EXECUTE ON FUNCTION
> pg_stat_reset_single_function_counters(oid) FROM public;
>
> -REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;
> -
> -REVOKE EXECUTE ON FUNCTION pg_stat_reset_subscription_worker(oid) FROM 
> public;
> +REVOKE EXECUTE ON FUNCTION
> pg_stat_reset_single_subscription_counters(oid) FROM public;
>
> -REVOKE EXECUTE ON FUNCTION pg_stat_reset_subscription_worker(oid,
> oid) FROM public;
> +REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;
>
> Is there a need to change anything about
> pg_stat_reset_replication_slot() in this patch?

It doesn't change pg_stat_reset_replication_slot(); it just changes
the order so as to put the modified function
pg_stat_reset_single_subscription_counters() closer to other similar
functions such as pg_stat_reset_single_function_counters().

>
> 2. Do we still need to use LATERAL in the view's query?

There are some functions that use LATERAL in a similar way, but there
seems to be no need to put LATERAL before the function call. Will remove.

> 3. Can we send error stats pgstat_report_stat() as that will be called
> via proc_exit() path. We can set the phase (apply/sync) in
> apply_error_callback_arg and then use that to send the appropriate
> message. I think this will obviate the need for try..catch.

If we use pgstat_report_stat() to send subscription stats messages,
all processes end up going through that path. It might not add
overhead in practice but I'd like to avoid it. Also, since the apply
worker calls pgstat_report_stat() at the end of each transaction,
we might need to change pgstat_report_stat() so that it doesn't send
the subscription messages when it gets called at the end of the
transaction. I think it's likely that PG_TRY() and PG_CATCH() will be
added anyway, for example when the disable_on_error feature or the
error-detail storage feature is introduced, so obviating the need for
them at this point would not gain much.
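As a toy sketch of the trade-off being discussed — a worker notes its current phase, and the error handler bumps the matching cumulative counter — here is a Python analogue of the PG_TRY()/PG_CATCH() pattern (only the counter names mirror the proposed view; everything else is invented for illustration):

```python
# Counters corresponding to the proposed view's cumulative columns.
counters = {"apply_error_count": 0, "sync_error_count": 0}

def run_in_phase(phase, action):
    """Run an action; on error, bump the counter for the current phase
    (the analogue of setting apply_error_callback_arg's phase and
    reporting stats in the PG_CATCH() branch)."""
    try:
        action()
    except Exception:
        counters[f"{phase}_error_count"] += 1
        # a real worker would PG_RE_THROW(); swallowed here for the demo

def failing():
    raise RuntimeError("duplicate key value violates unique constraint")

run_in_phase("sync", failing)
run_in_phase("apply", failing)
run_in_phase("apply", failing)
print(counters)  # {'apply_error_count': 2, 'sync_error_count': 1}
```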


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: psql - add SHOW_ALL_RESULTS option

2022-02-22 Thread Peter Eisentraut
I wrote a few more small tests for psql to address uncovered territory 
in SendQuery() especially:


- \timing
- client encoding handling
- notifications

What's still missing is:

- \watch
- pagers

For \watch, I think one would need something like the current cancel 
test (since you need to get the psql pid to send a signal to stop the 
watch).  It would work in principle, but it will require more work to 
refactor the cancel test.


For pagers, I don't know.  It would be pretty easy to write a simple
script that acts as a pass-through pager and check that it is called.
There were some discussions earlier in the thread that some version of
some patch had broken some use of pagers.  Does anyone remember details?
Anything worth testing specifically?

From 6e75c1bec73f2128b94131305e6a37b97257f7c3 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 22 Feb 2022 13:42:38 +0100
Subject: [PATCH 1/2] Improve some psql test code

Split psql_like() into two functions psql_like() and psql_fails_like()
and make them mirror the existing command_like() and
command_fails_like() more closely.  In particular, follow the
universal convention that the test name is the last argument.
---
 src/bin/psql/t/001_basic.pl | 59 ++---
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/src/bin/psql/t/001_basic.pl b/src/bin/psql/t/001_basic.pl
index ba3dd846ba..f416e0ab5e 100644
--- a/src/bin/psql/t/001_basic.pl
+++ b/src/bin/psql/t/001_basic.pl
@@ -12,40 +12,36 @@
 program_version_ok('psql');
 program_options_handling_ok('psql');
 
-my ($stdout, $stderr);
-my $result;
-
-# Execute a psql command and check its result patterns.
+# Execute a psql command and check its output.
 sub psql_like
 {
local $Test::Builder::Level = $Test::Builder::Level + 1;
 
-   my $node= shift;
-   my $test_name   = shift;
-   my $query   = shift;
-   my $expected_stdout = shift;
-   my $expected_stderr = shift;
+   my ($node, $sql, $expected_stdout, $test_name) = @_;
+
+   my ($ret, $stdout, $stderr) = $node->psql('postgres', $sql);
+
+   is($ret,0,  "$test_name: exit code 0");
+   is($stderr, '', "$test_name: no stderr");
+   like($stdout, $expected_stdout, "$test_name: matches");
+
+   return;
+}
+
+# Execute a psql command and check that it fails and check the stderr.
+sub psql_fails_like
+{
+   local $Test::Builder::Level = $Test::Builder::Level + 1;
 
-   die "cannot specify both expected stdout and stderr here"
- if (defined($expected_stdout) && defined($expected_stderr));
+   my ($node, $sql, $expected_stderr, $test_name) = @_;
 
# Use the context of a WAL sender, some of the tests rely on that.
my ($ret, $stdout, $stderr) = $node->psql(
-   'postgres', $query,
-   on_error_die => 0,
+   'postgres', $sql,
replication  => 'database');
 
-   if (defined($expected_stdout))
-   {
-   is($ret,0,  "$test_name: expected result code");
-   is($stderr, '', "$test_name: no stderr");
-   like($stdout, $expected_stdout, "$test_name: stdout matches");
-   }
-   if (defined($expected_stderr))
-   {
-   isnt($ret, 0, "$test_name: expected result code");
-   like($stderr, $expected_stderr, "$test_name: stderr matches");
-   }
+   isnt($ret, 0, "$test_name: exit code not 0");
+   like($stderr, $expected_stderr, "$test_name: matches");
 
return;
 }
@@ -53,6 +49,9 @@ sub psql_like
 # test --help=foo, analogous to program_help_ok()
 foreach my $arg (qw(commands variables))
 {
+   my ($stdout, $stderr);
+   my $result;
+
$result = IPC::Run::run [ 'psql', "--help=$arg" ], '>', \$stdout, '2>',
  \$stderr;
ok($result, "psql --help=$arg exit code 0");
@@ -70,15 +69,15 @@ sub psql_like
 });
 $node->start;
 
-psql_like($node, '\copyright', '\copyright', qr/Copyright/, undef);
-psql_like($node, '\help without arguments', '\help', qr/ALTER/, undef);
-psql_like($node, '\help with argument', '\help SELECT', qr/SELECT/, undef);
+psql_like($node, '\copyright', qr/Copyright/, '\copyright');
+psql_like($node, '\help', qr/ALTER/, '\help without arguments');
+psql_like($node, '\help SELECT', qr/SELECT/, '\help with argument');
 
 # Test clean handling of unsupported replication command responses
-psql_like(
+psql_fails_like(
$node,
-   'handling of unexpected PQresultStatus',
'START_REPLICATION 0/0',
-   undef, qr/unexpected PQresultStatus: 8$/);
+   qr/unexpected PQresultStatus: 8$/,
+   'handling of unexpected PQresultStatus');
 
 done_testing();
-- 
2.35.1

From 189e977e47c505230195551833e0a61ec71dced3 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 22 Feb 2022 14:20:05 +0100
Subject: [PATCH 2/2] psql: Additional tests

Add a few TAP tests for things that happen while a user query is 

Re: TAP output format in pg_regress

2022-02-22 Thread Daniel Gustafsson
> On 22 Feb 2022, at 00:08, Daniel Gustafsson  wrote:

> I'll fix that.

The attached v3 fixes that thinko and cleans up a lot of the output which
isn't diagnostic per the TAP spec, to make it less noisy.  It also fixes tag
support used in the ECPG tests and makes a few small cleanups.  There is a
blank line printed which needs to be fixed, but I'm running out of time and
wanted to get a non-broken version on the list before putting it aside for
today.

The errorpaths that exit(2) the testrun should be converted to "bail out" lines
when running with TAP output, but apart from that I think it's fairly spec
compliant.

--
Daniel Gustafsson   https://vmware.com/



v3-0001-pg_regress-TAP-output-format.patch
Description: Binary data


Re: making pg_regress less noisy by removing boilerplate

2022-02-22 Thread Andrew Dunstan


On 2/21/22 12:31, Andres Freund wrote:
>
> We still have a few issues with ports conflicts on windows. We should really
> consider just desupporting all windows versions without unix socket
> support. But until then it seems useful to be able to figure out random
> failures.
>

I'm pretty sure all my Windows machines with buildfarm animals are
sufficiently modern except the XP machine, which only builds release 10
nowadays.

What's involved in moving to require Unix socket support?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: External data files possible?

2022-02-22 Thread Aleksander Alekseev
Hi Chris,

> Breaking up the data into pages with page headers means a lot of extra work 
> [...]. It would be much better to store the index in a set of external data 
> files. This seems possible so long as I put the files under the database's 
> directory and name things properly.


Just curious, what is your index for, and how are you going to handle
crash recovery?

-- 
Best regards,
Aleksander Alekseev




RE: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread osumi.takami...@fujitsu.com
On Tuesday, February 22, 2022 2:45 PM Masahiko Sawada  
wrote:
> I've attached a patch that changes pg_stat_subscription_workers view.
> It removes non-cumulative values, such as the error details (error-XID and
> the error message), from the view, and consequently the view now has only
> cumulative statistics counters: apply_error_count and sync_error_count. Since
> the new view name is under discussion I temporarily chose
> pg_stat_subscription_activity.
Hi, thank you for sharing the patch.


Few minor comments for v1.

(1) commit message's typo

This commits changes the view so that it stores only statistics
counters: apply_error_count and sync_error_count.

"This commits" -> "This commit"

(2) minor improvement suggestion for the commit message

I suggest that we mention commit 8d74fc9, which introduced
pg_stat_subscription_workers, in the commit message for better
traceability. Below is an example.

From:
As the result of the discussion, we've concluded that the stats
collector is not an appropriate place to store the error information of
subscription workers.

To:
As the result of the discussion about the view introduced by 8d74fc9,...

(3) doc/src/sgml/logical-replication.sgml

Kindly refer to commit id 85c61ba for the details.
You forgot "the" in the sentence below.

@@ -346,8 +346,6 @@
   
A conflict will produce an error and will stop the replication; it must be
resolved manually by the user.  Details about the conflict can be found in
-   
-   pg_stat_subscription_workers and the
subscriber's server log.
   

From:
subscriber's server log.
to:
the subscriber's server log.

(4) doc/src/sgml/monitoring.sgml

  
   
-   last_error_time timestamp with time 
zone
+   sync_error_count uint8
   
   
-   Last time at which this error occurred
+   Number of times the error occurred during the initial data copy
   

I suppose it might be better to use "initial data sync"
or "initial data synchronization" rather than "initial data copy".

(5) src/test/subscription/t/026_worker_stats.pl

+# Truncate test_tab1 so that table sync can continue.
+$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;");

The second truncate is for apply, isn't it? Therefore, kindly change

From:
Truncate test_tab1 so that table sync can continue.
To:
Truncate test_tab1 so that apply can continue.

(6) src/test/subscription/t/026_worker_stats.pl

+# Insert more data to test_tab1 on the subscriber and then on the publisher, 
raising an
+# error on the subscriber due to violation of the unique constraint on 
test_tab1.
+$node_subscriber->safe_psql('postgres', "INSERT INTO test_tab1 VALUES (2)");

Is this insert needed?
If you want to show that apply is working okay after the table sync error
is resolved, waiting for the max value in test_tab1 to become 2 on the
subscriber via a polling query would work. But I was not sure if this is
essentially necessary for the testing purpose.


Best Regards,
Takamichi Osumi



Re: Support logical replication of DDLs

2022-02-22 Thread Marcos Pegoraro
Em seg., 21 de fev. de 2022 às 13:13, Zheng Li 
escreveu:

>
> 2. Table level
> Allows DDLs on the published tables to be replicated except for
> certain edge cases.
>
Think about how to handle triggers and functions with the same name but
different purposes.

Publisher
create function public.audit() returns trigger language plpgsql as $$
begin
  new.Audit_User = current_user();
  new.Audit_Date_Time = now();
  return new;
end;$$
create trigger audit before insert or update on foo for each row execute
procedure public.audit();

Subscriber
create function public.audit() returns trigger language plpgsql as $$
begin
  insert into Audit(Audit_Date_Time, Audit_User, Schema_Name, Table_Name,
Audit_Action, Field_Values) values(new.Audit_ts, new.Audit_User,
tg_table_schema, tg_table_name, tg_op, row_to_json(case when tg_op =
'DELETE' then old.* else new.* end));
  return null;
end;$$
create trigger audit after insert or update or delete on foo for each row
execute procedure public.audit();

regards,
Marcos


Re: Support logical replication of DDLs

2022-02-22 Thread Aleksander Alekseev
Hi Zheng,

> I’m working on a patch to support logical replication of data
> definition language statements (DDLs).

That's great!

> However, there are still many edge cases to sort out because not every
> DDL statement can/should be replicated.

Maybe the first implementation shouldn't be perfect as long as known
limitations are documented and the future improvements are unlikely to
break anything for the users. Committing an MVP and iterating on this is
much simpler in terms of development and code review. Also, it will deliver
value to the users and give us feedback sooner.

> 1. DDL involving multiple tables where only some tables are replicated,
e.g.
>
> DROP TABLE replicated_foo, notreplicated_bar;
>

I would add DROP TABLE ... CASCADE to the list. Also, let's not forget that
PostgreSQL supports table inheritance and table partitioning.

> 2. Any DDL that calls a volatile function, such as NOW() or RAND(), is
> likely to generate a different value on each replica. It is possible
> to work around these issues—for example, the publisher can replace any
> volatile function calls with a fixed return value when the statement
> is logged so that the subscribers all get the same value. We will have
> to consider some other cases.

That would make sense.
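The substitution idea could be sketched like this (the function list, the regex-based rewriting, and the values are all invented for illustration; a real implementation would operate on the parse tree, not the SQL text):

```python
import re

def freeze_volatile(ddl, evaluated):
    """Replace calls to volatile functions in a DDL string with the
    constants the publisher evaluated, so all subscribers get the same
    values.  `evaluated` maps function names to SQL literals."""
    for func, value in evaluated.items():
        ddl = re.sub(rf"{func}\(\)", value, ddl, flags=re.IGNORECASE)
    return ddl

ddl = "CREATE TABLE t (id int, created timestamptz DEFAULT now())"
frozen = freeze_volatile(ddl, {"now": "'2022-02-22 10:00:00+00'"})
print(frozen)
# CREATE TABLE t (id int, created timestamptz DEFAULT '2022-02-22 10:00:00+00')
```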

> [...]
> Whether a DDL should be replicated also depends on what granularity do
> we define DDL replication. For example, we can define DDL replication
> on these levels:
>
> 1. Database level
> Allows all DDLs for a database to be replicated except for certain
> edge cases (refer to the edge cases mentioned above).
> This is likely a major use case, such as in online major version upgrade.

To my knowledge, this is not a primary use case for logical replication.
Also, I suspect that implementing it may be a bit challenging. What if we
focus on table-level replication for now?

-- 
Best regards,
Aleksander Alekseev


Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread Amit Kapila
On Tue, Feb 22, 2022 at 11:15 AM Masahiko Sawada  wrote:
>
> I've attached a patch that changes pg_stat_subscription_workers view.
> It removes non-cumulative values, such as the error details (error-XID
> and the error message), from the view, and consequently the
> view now has only cumulative statistics counters: apply_error_count
> and sync_error_count. Since the new view name is under discussion I
> temporarily chose pg_stat_subscription_activity.
>

Few comments:
=
1.
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -637,11 +637,9 @@ REVOKE EXECUTE ON FUNCTION
pg_stat_reset_single_table_counters(oid) FROM public;

 REVOKE EXECUTE ON FUNCTION
pg_stat_reset_single_function_counters(oid) FROM public;

-REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;
-
-REVOKE EXECUTE ON FUNCTION pg_stat_reset_subscription_worker(oid) FROM public;
+REVOKE EXECUTE ON FUNCTION
pg_stat_reset_single_subscription_counters(oid) FROM public;

-REVOKE EXECUTE ON FUNCTION pg_stat_reset_subscription_worker(oid,
oid) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;

Is there a need to change anything about
pg_stat_reset_replication_slot() in this patch?

2. Do we still need to use LATERAL in the view's query?

3. Can we send error stats pgstat_report_stat() as that will be called
via proc_exit() path. We can set the phase (apply/sync) in
apply_error_callback_arg and then use that to send the appropriate
message. I think this will obviate the need for try..catch.

-- 
With Regards,
Amit Kapila.




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 01:11:21 -0800, Andres Freund wrote:
> I've started to work on a few debugging aids to find problem like
> these. Attached are two WIP patches:

Forgot to attach. Importantly, it also includes a TAP test for several of
these issues.

Greetings,

Andres Freund
>From 0bc64874f8e5faae9a38731a83aa7b001095cc35 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 21 Feb 2022 15:44:02 -0800
Subject: [PATCH v1 1/4] WIP: test for file reuse dangers around database and
 tablespace commands.

---
 src/test/recovery/t/029_relfilenode_reuse.pl | 233 +++
 1 file changed, 233 insertions(+)
 create mode 100644 src/test/recovery/t/029_relfilenode_reuse.pl

diff --git a/src/test/recovery/t/029_relfilenode_reuse.pl b/src/test/recovery/t/029_relfilenode_reuse.pl
new file mode 100644
index 000..22d8e85614c
--- /dev/null
+++ b/src/test/recovery/t/029_relfilenode_reuse.pl
@@ -0,0 +1,233 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Basename;
+
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', q[
+allow_in_place_tablespaces = true
+log_connections=on
+# to avoid "repairing" corruption
+full_page_writes=off
+log_min_messages=debug2
+autovacuum_naptime=1s
+shared_buffers=1MB
+]);
+$node_primary->start;
+
+
+# Create streaming standby linking to primary
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(300);
+
+my %psql_primary = (stdin => '', stdout => '', stderr => '');
+$psql_primary{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_primary->connstr('postgres') ],
+	'<',
+	\$psql_primary{stdin},
+	'>',
+	\$psql_primary{stdout},
+	'2>',
+	\$psql_primary{stderr},
+	$psql_timeout);
+
+my %psql_standby = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_standby->connstr('postgres') ],
+	'<',
+	\$psql_standby{stdin},
+	'>',
+	\$psql_standby{stdout},
+	'2>',
+	\$psql_standby{stderr},
+	$psql_timeout);
+
+
+# Create template database with a table that we'll update, to trigger dirty
+# rows. Using a template database + preexisting rows makes it a bit easier to
+# reproduce, because there's no cache invalidations generated.
+
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db_template OID = 5;");
+$node_primary->safe_psql('conflict_db_template', q[
+CREATE TABLE large(id serial primary key, dataa text, datab text);
+INSERT INTO large(dataa, datab) SELECT g.i::text, 1 FROM generate_series(1, 4000) g(i);]);
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;");
+
+$node_primary->safe_psql('postgres', q[
+CREATE EXTENSION pg_prewarm;
+CREATE TABLE replace_sb(data text);
+INSERT INTO replace_sb(data) SELECT random()::text FROM generate_series(1, 15000);]);
+
+# Use longrunning transactions, so that AtEOXact_SMgr doesn't close files
+send_query_and_wait(
+	\%psql_primary,
+	q[BEGIN;],
+	qr/BEGIN/m);
+send_query_and_wait(
+	\%psql_standby,
+	q[BEGIN;],
+	qr/BEGIN/m);
+
+# Cause lots of dirty rows in shared_buffers
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 1;");
+
+# Now do a bunch of work in another database. That will end up needing to
+# write back dirty data from the previous step, opening the relevant file
+# descriptors
+cause_eviction(\%psql_primary, \%psql_standby);
+
+# drop and recreate database
+$node_primary->safe_psql('postgres', "DROP DATABASE conflict_db;");
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;");
+
+verify($node_primary, $node_standby, 1,
+	   "initial contents as expected");
+
+# Again cause lots of dirty rows in shared_buffers, but use a different update
+# value so we can check everything is OK
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 2;");
+
+# Again cause a lot of IO. That'll again write back dirty data, but uses (XXX
+# adjust after bugfix) the already opened file descriptor.
+# FIXME
+cause_eviction(\%psql_primary, \%psql_standby);
+
+verify($node_primary, $node_standby, 2,
+	   "update to reused relfilenode (due to DB oid conflict) is not lost");
+
+
+$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
+
+verify($node_primary, $node_standby, 3,
+	   "restored 

Re: pg_receivewal.exe unhandled exception in zlib1.dll

2022-02-22 Thread Juan José Santamaría Flecha
Please excuse my late reply.

On Wed, Feb 16, 2022 at 12:20 PM Juan José Santamaría Flecha <
juanjo.santama...@gmail.com> wrote:

> On Tue, Feb 15, 2022 at 5:25 PM Andres Freund  wrote:
>
>>
>> FWIW, I've looked at using either  vcpkg   or conan to more easily
>> install /
>>
> build dependencies on windows, in the right debug / release mode.
>>
>> The former was easier, but unfortunately the zlib, lz4, ... packages don't
>> include the gzip, lz4 etc binaries right now. That might be easy enough
>> to fix
>> with an option. It does require building the dependencies locally.
>>
>> Conan seemed to be a nicer architecture, but there's no builtin way to
>> update
>> to newer dependency versions, and non-versioned dependencies don't
>> actually
>> work. It has pre-built dependencies for the common cases.
>>
>> I will have to do some testing with Conan and see how both compare.
>

I have found a couple of additional issues when trying to build using
Conan: all libraries are static, and there aren't any prebuilt packages for
MSVC 2022 yet. Also, for some specific packages:

- zstd doesn't include the binaries either.
- gettext packages only provides binaries, no libraries, nor headers.
- openssl builds the binaries, but is missing some objects in the
libraries (error LNK2019: unresolved external symbol).
- libxml builds the binaries, but is missing some objects in the
libraries (error LNK2019: unresolved external symbol).

Regards,

Juan José Santamaría Flecha


Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-10 14:26:59 -0800, Andres Freund wrote:
> On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> > It seems like I should go ahead and do that today, and we can study
> > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
> 
> Yes.

I wrote a test to show the problem. While looking at the code I unfortunately
found that CREATE DATABASE ... OID isn't the only problem. Two consecutive
ALTER DATABASE ... SET TABLESPACE also can cause corruption.

The test doesn't work on < 15 (due to PostgresNode renaming and use of
allow_in_place_tablespaces). But manually it's unfortunately quite
reproducible :(

Start a server with

shared_buffers = 1MB
autovacuum=off
bgwriter_lru_maxpages=1
bgwriter_delay=1

these are just to give more time / to require smaller tables.


Start two psql sessions

s1: \c postgres
s1: CREATE DATABASE conflict_db;
s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg';
s1: CREATE EXTENSION pg_prewarm;

s2: \c conflict_db
s2: CREATE TABLE large(id serial primary key, dataa text, datab text);
s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 
4000) g(i);
s2: UPDATE large SET datab = 1;

-- prevent AtEOXact_SMgr
s1: BEGIN;

-- Fill shared_buffers with other contents. This needs to write out the prior 
dirty contents
-- leading to opening smgr rels / file descriptors
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE 
pg_relation_filenode(oid) != 0;

s2: \c postgres
-- unlinks the files
s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace;
-- now new files with the same relfilenode exist
s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default;
s2: \c conflict_db
-- dirty buffers again
s2: UPDATE large SET datab = 2;

-- this again writes out the dirty buffers. But because nothing forced the smgr 
handles to be closed,
-- fds pointing to the *old* file contents are used. I.e. we'll write to the 
wrong file
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE 
pg_relation_filenode(oid) != 0;

-- verify that everything is [not] ok
s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10;

┌───┬───┐
│ datab │ count │
├───┼───┤
│ 1 │  2117 │
│ 2 │67 │
└───┴───┘
(2 rows)


oops.
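The failure mode can be reproduced in miniature outside PostgreSQL: a process keeps a file descriptor open while the file at that path is unlinked and recreated (as the two SET TABLESPACE moves above do for a relfilenode), and writes through the stale descriptor silently go to the old inode (illustrative Python, POSIX-only):

```python
import os
import tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, "relfilenode_16384")

with open(path, "w") as f:
    f.write("old contents")
fd = os.open(path, os.O_WRONLY)  # the "smgr" keeps this fd cached

os.unlink(path)                  # the move unlinks the relation file
with open(path, "w") as f:       # ...and a new file reuses the path
    f.write("new contents")

os.write(fd, b"dirty page write")  # write-back through the stale fd
os.close(fd)

print(open(path).read())  # still "new contents": the write was lost
```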


I don't yet have a good idea how to tackle this in the backbranches.



I've started to work on a few debugging aids to find problem like
these. Attached are two WIP patches:

1) use fstat(...).st_nlink == 0 to detect dangerous actions on fds with
   deleted files. That seems to work (almost) everywhere. Optionally, on
   linux, use readlink(/proc/$pid/fd/$fd) to get the filename.
2) On linux, iterate over PGPROC to get a list of pids. Then iterate over
   /proc/$pid/fd/ to check for deleted files. This only can be done after
   we've just made sure there's no fds open to deleted files.
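Debugging aid (1) can be sketched in a few lines of Python (POSIX-only; the /proc readlink part is Linux-specific):

```python
import os
import sys
import tempfile

# After unlink, fstat() on a still-open fd reports st_nlink == 0, so a
# check before write-back can catch writes to deleted files.
fd, path = tempfile.mkstemp()
assert os.fstat(fd).st_nlink == 1  # file still linked
os.unlink(path)
assert os.fstat(fd).st_nlink == 0  # deleted but still open: dangerous

# On Linux, /proc/$pid/fd/$fd also yields the original file name, with
# a " (deleted)" suffix for unlinked files.
if sys.platform.startswith("linux"):
    print(os.readlink(f"/proc/self/fd/{fd}"))
os.close(fd)
```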

Greetings,

Andres Freund




Re: Patch a potential memory leak in describeOneTableDetails()

2022-02-22 Thread Daniel Gustafsson
> On 22 Feb 2022, at 07:14, Kyotaro Horiguchi  wrote:

> Anyway, don't we use the -ftabstop option flag to silence the compiler?
> The warning contradicts our convention. The attached adds that
> flag.

Isn't this flag contradicting our convention?  From the docs section you
reference further down:

"Source code formatting uses 4 column tab spacing, with tabs preserved
(i.e., tabs are not expanded to spaces)."

The -ftabstop option exists (to a large extent, if not entirely) to warn when
tabs and spaces are mixed, creating misleading indentation that the author
didn't even notice due to tab-width settings.  ISTM we are better off getting
these warnings than suppressing them.

> By the way, the doc says that:
> 
> https://www.postgresql.org/docs/14/source-format.html
> 
>> The text browsing tools more and less can be invoked as:
>>   more -x4
>>   less -x4
>> to make them show tabs appropriately.
> 
> But AFAICS the "more" command on CentOS doesn't have a -x option, nor
> any option to change the tab width, and I can't find documentation
> suggesting its existence on other platforms.

macOS, FreeBSD and NetBSD all show the less(1) manpage for more(1), which
suggests that more is implemented using less there and thus supports -x (it
does on macOS).  OpenBSD and Solaris more(1) do not have a -x option, but AIX
does have it.  Linux in its various flavors doesn't seem to have it.

The section in question was added to our docs 22 years ago.  To make it
reflect the operating systems currently in use, maybe just not mentioning
more(1) is the easiest?:

"The text browsing tool less can be invoked as less -x4 to make it show
tabs appropriately."

Or perhaps remove that section altogether?

--
Daniel Gustafsson   https://vmware.com/





Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-22 Thread Kyotaro Horiguchi
At Tue, 22 Feb 2022 01:59:45 +0900, Fujii Masao  
wrote in 
> 
> > Of the following, I think we should do (a) and (b) to make future
> > backpatchings easier.
> > a) Use RedoRecPtr and PriorRedoPtr after they are assigned.
> > b) Move assignment to PriorRedoPtr into the ControlFileLock section.
> 
> I failed to understand how (a) and (b) can make the backpatching
> easier. How easy to backpatch seems the same whether we apply (a) and
> (b) or not...

That presumes that the patch applied to master contains (a) and (b).
If it doesn't, those changes are not needed in older branches.


> > c) Skip update of minRecoveryPoint only when the checkpoint gets old.
> 
> Yes.
> 
> 
> > d) Skip call to UpdateCheckPointDistanceEstimate() when RedoRecPtr <=
> >PriorRedoPtr.
> 
> But "RedoRecPtr <= PriorRedoPtr" will never happen, will it? Because a

I didn't believe that it could happen. (So it came from my
conservativeness, or laziness, or both :p) The code dates from 2009,
when StartupXLOG made a concurrent checkpoint with the bgwriter. But as
of at least 9.5, StartupXLOG doesn't directly call CreateCheckPoint, so
I think that won't happen.

So, in short, I agree to remove it or turn it into an Assert().

> restartpoint is skipped at the beginning of CreateRestartPoint() in
> that case. If this understanding is right, the check of "RedoRecPtr <=
> PriorRedoPtr" is not necessary before calling
> UpdateCheckPointDistanceEstimate().
> 
> 
> + ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
> + ControlFile->minRecoveryPointTLI = 0;
> 
> Don't we need to update LocalMinRecoveryPoint and
> LocalMinRecoveryPointTLI after this? Maybe it's not necessary, but
> ISTM that it's safer and better to always update them whether the
> state is DB_IN_ARCHIVE_RECOVERY or not.

Agree that it's safer and tidy.

>   if (flags & CHECKPOINT_IS_SHUTDOWN)
>   ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
> 
> Same as above. IMO it's safer and better to always update the state
> (whether the state is DB_IN_ARCHIVE_RECOVERY or not) if
> CHECKPOINT_IS_SHUTDOWN flag is passed.

That would mean we may exit recovery mode after ShutdownXLOG has called
CreateRestartPoint. I don't think that can happen.  So I'd rather add
Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0) there instead.

I'll post the new version later.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: List of all* PostgreSQL EXTENSIONs in the world

2022-02-22 Thread Aleksander Alekseev
Hi Joel,

> I've compiled a list of all* PostgreSQL EXTENSIONs in the world:

The list is great. Thanks for doing this!

Just curious, is this a one-time experiment, or are you going to keep
the list up to date, similarly to the awesome-* lists on GitHub, or
...?

-- 
Best regards,
Aleksander Alekseev