Re: Handle infinite recursion in logical replication setup

2022-03-15 Thread vignesh C
On Tue, Mar 15, 2022 at 9:55 AM kuroda.hay...@fujitsu.com
 wrote:
>
> Dear Vignesh,
>
> > Thanks for the kind explanation.
> > I read the above and your doc in 0002, and I put some comments.
>
> I forgot a comment about 0002 doc.
>
> 5. create_subscription.sgml - about your example
>
> Three possibilities were listed in the doc,
> but I was not sure about case b).
> In that situation Node1 and Node2 have already become multi-master,
> and the data has already been synced by that time.
> If so, how do we arrive at "there is data present only in one Node"?
> Cases a) and c) seem reasonable.

Your point is valid, modified.

The changes for the same are available in the v5 patch at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3wCf0YcvVo%2BgHMGpupk9K6WKJxCyLUvhPC2GkPKRZUWA%40mail.gmail.com

Regards,
Vignesh




Re: Handle infinite recursion in logical replication setup

2022-03-15 Thread vignesh C
On Tue, Mar 15, 2022 at 7:09 AM kuroda.hay...@fujitsu.com
 wrote:
>
> Dear Vignesh,
>
> Thank you for updating your patch!
>
> > Let's consider an existing Multi master logical replication setup
> > between Node1 and Node2 that is created using the following steps:
> > a) Node1 - Publication publishing employee table - pub1
> > b) Node2 - Subscription subscribing from publication pub1 with
> > publish_local_only - sub1_pub1_node1
> > c) Node2 - Publication publishing employee table - pub2
> > d) Node1 - Subscription subscribing from publication pub2 with
> > publish_local_only - sub2_pub2_node2
> >
> > To create a subscription in node3, we will be using the following steps:
> > a) Node2 - Publication publishing employee table. - pub3
> > b) Node3 - Subscription subscribing from publication in Node2 with
> > publish_local_only - sub3_pub3_node2
> >
> > When we create a subscription in Node3, Node3 will connect to
> > Node2 (this will not be done in Node3) and check if the employee table
> > is present in pg_subscription_rel; in our case Node2 will have the
> > employee table present in pg_subscription_rel (sub1_pub1_node1
> > subscribing to the employee table from pub1 in Node1). As the employee
> > table is being subscribed to in node2 from node1, we will throw an
> > error like the one below:
> > postgres=# create subscription sub2 CONNECTION 'dbname =postgres port
> > = ' publication pub2 with (publish_local_only=on);
> > ERROR:  CREATE/ALTER SUBSCRIPTION with publish_local_only and
> > copy_data as true is not allowed when the publisher might have
> > replicated data, table:public.t1 might have replicated data in the
> > publisher
> > HINT:  Use CREATE/ALTER SUBSCRIPTION with copy_data = off or force
>
> Thanks for the kind explanation.
> I read the above and your doc in 0002, and I put some comments.
>
> 1. alter_subscription.sgml
>
> ```
> -copy_data (boolean)
> +copy_data (boolean | 
> force)
> ```
>
> I thought that it should be written as an enum. For example, the
> huge_pages GUC parameter can accept {on, off, try}, and it has been
> written as an enum.

Modified

> 2. create_subscription.sgml
>
> ```
> -copy_data (boolean)
> +copy_data (boolean | 
> force)
> ```
>
> Same as above.

Modified

> 3. create_subscription.sgml
>
> ```
> +
> + 
> +  If the publication tables were also subscribing data in the 
> publisher
> +  from other publishers, it will affect the
> +  CREATE SUBSCRIPTION based on the value specified
> +  for publish_local_only option. Refer to the
> +   for details.
> + 
> ```
>
> I searched the docs, but the wording "publication tables" is not used elsewhere.
> How about "tables in the publication"?

Modified

> 4. create_subscription.sgml - about your example
>
> In the first section, we should describe the 2-node case in more detail,
> like Amit mentioned in [1]. I thought that Option-3 can be resolved by
> defining subscriptions in both nodes with publish_local_only = true and
> copy_data = force.

I thought the existing information was enough because we have mentioned
that node1 and node2 have a bidirectional replication setup and that the
table data on both nodes will be replicated and synchronized as and when
the DML operations happen. In option-3 we need to create the subscription
with copy_data as force on one node and copy_data as false on the other
node; because both nodes will have the same data, copying the data from
just one of the nodes should be enough.

Thanks for the comments, the attached v5 patch has the changes for the same.

Regards,
Vignesh
From 2e9b9192a1f485d7b1f3c53170626cb49a31ee1e Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Tue, 15 Mar 2022 16:47:30 +0530
Subject: [PATCH v5 1/2] Skip replication of non local data.

Add an option publish_local_only which will subscribe only to the locally
generated data in the publisher node. If the subscriber is created with this
option, the publisher will skip publishing the data that was subscribed
from other nodes. It can be created using the following syntax:
ex: CREATE SUBSCRIPTION sub1 CONNECTION 'dbname =postgres port=' PUBLICATION pub1 with (publish_local_only = on);
---
 contrib/test_decoding/test_decoding.c |  20 +++
 doc/src/sgml/ref/alter_subscription.sgml  |   5 +-
 doc/src/sgml/ref/create_subscription.sgml |  12 ++
 src/backend/catalog/pg_subscription.c |   1 +
 src/backend/catalog/system_views.sql  |   4 +-
 src/backend/commands/subscriptioncmds.c   |  26 +++-
 .../libpqwalreceiver/libpqwalreceiver.c   |   4 +
 src/backend/replication/logical/decode.c  |  15 +-
 src/backend/replication/logical/logical.c |  33 +
 src/backend/replication/logical/worker.c  |   2 +
 src/backend/replication/pgoutput/pgoutput.c   |  45 ++
 src/bin/pg_dump/pg_dump.c |  13 +-
 src/bin/pg_dump/pg_dump.h |   1 +
 src/bin/psql/describe.c   |   8 +-
 src/bin/psql/tab-complete.c 

Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-03-15 Thread Thomas Munro
On Sat, Feb 19, 2022 at 7:01 AM Nathan Bossart  wrote:
> Here is a new patch.  This is essentially the same as v1, but I've spruced
> up the comments and the commit message.

Hi Nathan,

Pushed and back-patched (it's slightly different before 12).  Thanks!




Re: Corruption during WAL replay

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 12:44:49 -0400, Robert Haas  wrote 
in 
> On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
>  wrote:
> > The attached is the fixed version and it surely works with the repro.
> 
> Hi,
> 
> I spent the morning working on this patch and came up with the
> attached version. I wrote substantial comments in RelationTruncate(),
> where I tried to make it more clear exactly what the bug is here, and
> also in storage/proc.h, where I tried to clarify both the use of the
> DELAY_CHKPT_* flags in general terms. If nobody is too sad about this
> version, I plan to commit it.

Thanks for taking this and for your time.  The additional comments
seem to describe the flags more clearly.

storage.c:
+* Make sure that a concurrent checkpoint can't complete while 
truncation
+* is in progress.
+*
+* The truncation operation might drop buffers that the checkpoint
+* otherwise would have flushed. If it does, then it's essential that
+* the files actually get truncated on disk before the checkpoint record
+* is written. Otherwise, if replay begins from that checkpoint, the
+* to-be-truncated buffers might still exist on disk but have older
+* contents than expected, which can cause replay to fail. It's OK for
+* the buffers to not exist on disk at all, but not for them to have the
+* wrong contents.

FWIW, this seems slightly confusing between a buffer and its
contents.  I can read it correctly, so I don't mind if it is natural
enough.

Otherwise all the added/revised comments look fine. Thanks for the
labor.

> I think it should be back-patched, too, but that looks like a bit of a
> pain. I think every back-branch will require different adjustments.

I'll try that; if you are already working on it, please let me know. (It
is more than likely already too late..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Tablesync early exit

2022-03-15 Thread Amit Kapila
On Wed, Mar 16, 2022 at 1:16 AM Greg Stark  wrote:
>
> This patch has been through five CFs without any review. It's a very
> short patch and I'm assuming the only issue is that it's not clear
> whether it's a good idea or not and there are few developers familiar
> with this area of code?
>
> Amit, it looks like you were the one who asked for it to be split off
> from the logical decoding of 2PC patch in [1]. Can you summarize what
> questions remain here? Should we just commit this or is there any
> issue that needs to be debated?
>

Looking closely at this, I am not sure whether this is a good idea or
not. Responded accordingly.

-- 
With Regards,
Amit Kapila.




Re: Tablesync early exit

2022-03-15 Thread Amit Kapila
On Mon, Aug 30, 2021 at 8:50 AM Peter Smith  wrote:
>
> Patch v2 is the same; it only needed re-basing to the latest HEAD.
>

Why do you think it is correct to exit before trying to receive any
message? How will we ensure that the apply worker has processed any
message? At the beginning of the function LogicalRepApplyLoop(),
last_received is the LSN where the copy has finished in the case of a
tablesync worker. I think we need to receive a message before trying
to check whether we have synced with the apply worker or not.

-- 
With Regards,
Amit Kapila.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-15 Thread Dilip Kumar
On Tue, Mar 15, 2022 at 11:09 PM Robert Haas  wrote:
>
> On Tue, Mar 15, 2022 at 1:26 PM Ashutosh Sharma  wrote:
> > > On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma  
> > > wrote:
> > > > Few comments on the latest patch:
> > > >
> > > > -   /* We need to construct the pathname for this database 
> > > > */
> > > > -   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > > > +   if (xlrec->dbid != InvalidOid)
> > > > +   dbpath = GetDatabasePath(xlrec->dbid, 
> > > > xlrec->tsid);
> > > > +   else
> > > > +   dbpath = pstrdup("global");
> > > >
> > > > Do we really need this change? Is GetDatabasePath() alone not capable
> > > > of handling it?
> > >
> > > Well, I mean, that function has a special case for
> > > GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
> > > is 0.
> > >
> >
> > Wouldn't this be true only in case of a shared map file (when dbOid is
> > Invalid and tblspcOid is globaltablespace_oid) or am I missing
> > something?
>
> *facepalm*
>
> Good catch, sorry that I'm slow on the uptake today.
>
> v3 attached.

Thanks Ashutosh and Robert.  The other patches applied cleanly on this
patch; I still generated a new version so that we can find all patches
together.  There are no other changes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From cd0fe403cd54e5dbeb7e17b321bdf0434b509162 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Tue, 15 Mar 2022 09:18:52 +0530
Subject: [PATCH v16 2/6] Extend relmap interfaces

Support new interfaces in relmapper: 1) copying the relmap file from
one database path to another database path, and 2) getting the filenode
from an OID.  We already have RelationMapOidToFilenode for the latter
purpose, but it assumes we are connected to the database for which we
want the mapping.  The new interface does the same, except that it gets
the mapping for the given input database.

These interfaces are required by the next patch, which supports the
WAL-logged creation of databases.
---
 src/backend/utils/cache/relmapper.c | 60 +
 src/include/utils/relmapper.h   |  4 ++-
 2 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4d0718f..5b22dbb 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -252,6 +252,60 @@ RelationMapFilenodeToOid(Oid filenode, bool shared)
 }
 
 /*
+ * RelationMapOidToFilenodeForDatabase
+ *
+ * Same as RelationMapOidToFilenode, but instead of reading the mapping from
+ * the database we are connected to it will read the mapping from the input
+ * database.
+ */
+Oid
+RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId)
+{
+	RelMapFile	map;
+	int			i;
+
+	/* Read the relmap file from the source database. */
+	read_relmap_file(&map, dbpath, false, ERROR);
+
+	/* Iterate over the relmap entries to find the input relation oid. */
+	for (i = 0; i < map.num_mappings; i++)
+	{
+		if (relationId == map.mappings[i].mapoid)
+			return map.mappings[i].mapfilenode;
+	}
+
+	return InvalidOid;
+}
+
+/*
+ * RelationMapCopy
+ *
+ * Copy relmapfile from source db path to the destination db path and WAL log
+ * the operation.
+ */
+void
+RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
+{
+	RelMapFile map;
+
+	/*
+	 * Read the relmap file from the source database.  This function is only
+	 * called during CREATE DATABASE, so elevel can be ERROR.
+	 */
+	read_relmap_file(&map, srcdbpath, false, ERROR);
+
+	/*
+	 * Write map contents into the destination database's relmap file.  No
+	 * sinval is needed because we are creating a new file while creating a
+	 * new database, so no one else can be accessing this file; for the same
+	 * reason we don't need to acquire the RelationMappingLock either.  And
+	 * we don't need to preserve files either, because we are creating a new
+	 * database, so in case of an error the relation files are deleted anyway.
+	 */
+	write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+}
+
+/*
  * RelationMapUpdateMap
  *
  * Install a new relfilenode mapping for the specified relation.
@@ -1031,6 +1085,12 @@ relmap_redo(XLogReaderState *record)
 		 *
 		 * There shouldn't be anyone else updating relmaps during WAL replay,
 		 * but grab the lock to interlock against load_relmap_file().
+		 *
+		 * Note - this WAL is also written for copying the relmap file while
+		 * creating a database.  Therefore, it makes no sense to acquire a
+		 * relmap lock or send sinval.  But if we want to avoid that, then we
+		 * must set an extra flag in WAL.  So let it grab the lock and send
+		 * sinval because there is no harm in that.
 		 */
 		LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
 		write_relmap_file(, false, true, false,
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h

Tab completion for ALTER MATERIALIZED VIEW ... SET ACCESS METHOD

2022-03-15 Thread Yugo NAGATA
Hello hackers,

SET ACCESS METHOD has been supported in ALTER TABLE since commit
b0483263dd. Since that time, SET ACCESS METHOD has also been allowed
in ALTER MATERIALIZED VIEW. Although it is not documented, this works.

I cannot find any reason to prohibit SET ACCESS METHOD in ALTER
MATERIALIZED VIEW, so I think it is better to support this in psql
tab-completion and to document it.

I attached a patch to fix the tab-completion and the documentation
for this syntax. Also, I added a description of the SET TABLESPACE
syntax, which seems to have been overlooked.

Regards,
Yugo Nagata


-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/ref/alter_materialized_view.sgml b/doc/src/sgml/ref/alter_materialized_view.sgml
index 7011a0e7da..cae135c27a 100644
--- a/doc/src/sgml/ref/alter_materialized_view.sgml
+++ b/doc/src/sgml/ref/alter_materialized_view.sgml
@@ -43,6 +43,8 @@ ALTER MATERIALIZED VIEW ALL IN TABLESPACE namecolumn_name SET COMPRESSION compression_method
 CLUSTER ON index_name
 SET WITHOUT CLUSTER
+SET ACCESS METHOD new_access_method
+SET TABLESPACE new_tablespace
 SET ( storage_parameter [= value] [, ... ] )
 RESET ( storage_parameter [, ... ] )
 OWNER TO { new_owner | CURRENT_ROLE | CURRENT_USER | SESSION_USER }
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6957567264..daf4206caa 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2124,7 +2124,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("TO");
 	/* ALTER MATERIALIZED VIEW xxx SET */
 	else if (Matches("ALTER", "MATERIALIZED", "VIEW", MatchAny, "SET"))
-		COMPLETE_WITH("(", "SCHEMA", "TABLESPACE", "WITHOUT CLUSTER");
+		COMPLETE_WITH("(", "ACCESS METHOD", "SCHEMA", "TABLESPACE", "WITHOUT CLUSTER");
 	/* ALTER POLICY  */
 	else if (Matches("ALTER", "POLICY"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_policies);
@@ -2485,6 +2485,13 @@ psql_completion(const char *text, int start, int end)
 	else if (Matches("ALTER", "TYPE", MatchAny, "RENAME", "VALUE"))
 		COMPLETE_WITH_ENUM_VALUE(prev3_wd);
 
+	/*
+	 * If we have ALTER MATERIALIZED VIEW xxx SET ACCESS METHOD, provide a
+	 * list of table AMs.
+	 */
+	else if (Matches("ALTER", "MATERIALIZED", "VIEW", MatchAny, "SET", "ACCESS", "METHOD"))
+		COMPLETE_WITH_QUERY(Query_for_list_of_table_access_methods);
+
 /*
  * ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
  * ANALYZE [ VERBOSE ] [ table_and_columns [, ...] ]


Re: Skipping logical replication transactions on subscriber side

2022-03-15 Thread Amit Kapila
On Wed, Mar 16, 2022 at 7:58 AM Amit Kapila  wrote:
>
> On Wed, Mar 16, 2022 at 6:03 AM Masahiko Sawada  wrote:
> >
> > On Tue, Mar 15, 2022 at 7:18 PM Amit Kapila  wrote:
> > >
> > > On Tue, Mar 15, 2022 at 11:43 AM Masahiko Sawada  
> > > wrote:
> > > >
> > >
> > > 6.
> > > @@ -1583,7 +1649,8 @@ apply_handle_insert(StringInfo s)
> > >   TupleTableSlot *remoteslot;
> > >   MemoryContext oldctx;
> > >
> > > - if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
> > > + if (is_skipping_changes() ||
> > >
> > > Is there a reason to keep the skip_changes check here and in other DML
> > > operations instead of at one central place in apply_dispatch?
> >
> > Since we already have the check of applying the change on the spot at
> > the beginning of the handlers I feel it's better to add
> > is_skipping_changes() to the check than add a new if statement to
> > apply_dispatch, but do you prefer to check it in one central place in
> > apply_dispatch?
> >
>
> I think either way is fine. I just wanted to know the reason, your
> current change looks okay to me.
>
> Some questions/comments
> ==
>

Some cosmetic suggestions:
==
1.
+# Create subscriptions. Both subscription sets disable_on_error to on
+# so that they get disabled when a conflict occurs.
+$node_subscriber->safe_psql(
+ 'postgres',
+ qq[
+CREATE SUBSCRIPTION $subname CONNECTION '$publisher_connstr'
PUBLICATION tap_pub WITH (streaming = on, two_phase = on,
disable_on_error = on);
+]);

I don't understand what you mean by 'Both subscription ...' in the
above comments.

2.
+ # Check the log indicating that successfully skipped the transaction,

How about slightly rephrasing this to: "Check the log to ensure that
the transaction is skipped"?

-- 
With Regards,
Amit Kapila.




Re: Allow file inclusion in pg_hba and pg_ident files

2022-03-15 Thread Julien Rouhaud
Hi,

The cfbot says that the patch doesn't apply anymore, so here's a v3 with the
changes mentioned below.

On Tue, Mar 01, 2022 at 05:19:50PM +0800, Julien Rouhaud wrote:
>
> If you prefer to interleave static and non static function I can change it.

Changed the split so that it does not reorder functions.

> > +#include "utils/guc.h"
> > +//#include "utils/tuplestore.h"
>
> Yes I noticed this one this morning.  I didn't want to send a new patch 
> version
> just for that, but I already fixed it locally.

Included.

> Yes I'm aware of that thread.  I will be happy to change the patch to use
> MakeFuncResultTuplestore() as soon as it lands.  Thanks for the notice though.

Done, with the new SetSingleFuncCall().

> > It could be possible to do installcheck on an instance that has user
> > mappings meaning that this had better be ">= 0", no?
>
> I thought about it, and supposed it would bring a bit more value with the test
> like that.  I can change it if you prefer.

Changed it this way.
From 865b181ea989bb5c52fad2c3c9e20a38aa173cf3 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Tue, 1 Mar 2022 21:45:42 +0800
Subject: [PATCH v3 1/4] Extract view processing code from hba.c

This file is already quite big and a following commit will add yet
another view, so let's move all the view-related code in hba.c into a new
adt/hbafuncs.c.

Author: Julien Rouhaud
Reviewed-by: FIXME
Discussion: https://postgr.es/m/20220223045959.35ipdsvbxcstrhya%40jrouhaud
---
 src/backend/libpq/hba.c  | 462 ++-
 src/backend/utils/adt/Makefile   |   1 +
 src/backend/utils/adt/hbafuncs.c | 423 
 src/include/libpq/hba.h  |  29 ++
 src/tools/pgindent/typedefs.list |   2 +-
 5 files changed, 475 insertions(+), 442 deletions(-)
 create mode 100644 src/backend/utils/adt/hbafuncs.c

diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 90953c38f3..f9843a0b30 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -68,32 +68,6 @@ typedef struct check_network_data
 #define token_is_keyword(t, k) (!t->quoted && strcmp(t->string, k) == 0)
 #define token_matches(t, k)  (strcmp(t->string, k) == 0)
 
-/*
- * A single string token lexed from a config file, together with whether
- * the token had been quoted.
- */
-typedef struct HbaToken
-{
-   char   *string;
-   boolquoted;
-} HbaToken;
-
-/*
- * TokenizedLine represents one line lexed from a config file.
- * Each item in the "fields" list is a sub-list of HbaTokens.
- * We don't emit a TokenizedLine for empty or all-comment lines,
- * so "fields" is never NIL (nor are any of its sub-lists).
- * Exception: if an error occurs during tokenization, we might
- * have fields == NIL, in which case err_msg != NULL.
- */
-typedef struct TokenizedLine
-{
-   List   *fields; /* List of lists of HbaTokens */
-   int line_num;   /* Line number */
-   char   *raw_line;   /* Raw line text */
-   char   *err_msg;/* Error message if any */
-} TokenizedLine;
-
 /*
  * pre-parsed content of HBA config file: list of HbaLine structs.
  * parsed_hba_context is the memory context where it lives.
@@ -138,16 +112,10 @@ static const char *const UserAuthName[] =
 };
 
 
-static MemoryContext tokenize_file(const char *filename, FILE *file,
-  List 
**tok_lines, int elevel);
 static List *tokenize_inc_file(List *tokens, const char *outer_filename,
   const char 
*inc_filename, int elevel, char **err_msg);
 static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
   int elevel, char 
**err_msg);
-static ArrayType *gethba_options(HbaLine *hba);
-static void fill_hba_line(Tuplestorestate *tuple_store, TupleDesc tupdesc,
- int lineno, HbaLine *hba, 
const char *err_msg);
-static void fill_hba_view(Tuplestorestate *tuple_store, TupleDesc tupdesc);
 
 
 /*
@@ -419,7 +387,7 @@ tokenize_inc_file(List *tokens,
}
 
/* There is possible recursion here if the file contains @ */
-   linecxt = tokenize_file(inc_fullname, inc_file, _lines, elevel);
+   linecxt = tokenize_auth_file(inc_fullname, inc_file, _lines, 
elevel);
 
FreeFile(inc_file);
pfree(inc_fullname);
@@ -427,7 +395,7 @@ tokenize_inc_file(List *tokens,
/* Copy all tokens found in the file and append to the tokens list */
foreach(inc_line, inc_lines)
{
-   TokenizedLine *tok_line = (TokenizedLine *) lfirst(inc_line);
+   TokenizedAuthLine *tok_line = (TokenizedAuthLine *) 
lfirst(inc_line);
ListCell   *inc_field;
 
/* If any line has an error, propagate that up to caller */
@@ -458,7 +426,8 @@ 

Re: Assert in pageinspect with NULL pages

2022-03-15 Thread Michael Paquier
On Tue, Mar 15, 2022 at 06:56:46AM -0500, Justin Pryzby wrote:
> On Tue, Mar 15, 2022 at 06:32:44PM +0900, Michael Paquier wrote:
>> +-- Suppress the DETAIL message, to allow the tests to work across various
>> +-- default page sizes.
> 
> I think you mean "various non-default page sizes" or "various page sizes".

Thanks.  Indeed, this sounded a bit weird.  I have been able to look
at all that this morning and backpatched this first part.
--
Michael




Re: Move the "DR_intorel" struct to a more suitable position

2022-03-15 Thread Julien Rouhaud
Hi,

On Wed, Mar 16, 2022 at 11:16:58AM +0800, zk.wang wrote:
> Generally, we should define structs in header files (.h). But I found the
> struct "DR_intorel" in createas.c and it doesn't seem to be defined in the
> proper place. Maybe it should be defined in createas.h.

We put struct declarations in header files when multiple other files need to
know about them.  For DR_intorel it's a private struct that isn't needed outside
createas.c, so it's defined in that file.  You can find a lot of similar usage
in the source.
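
For illustration only, here is a minimal sketch of that convention; the field
list is illustrative, not the exact createas.c definition:

/*
 * A DestReceiver subtype kept private to the .c file that implements it,
 * because no other file needs to know its layout.  "pub" is placed first
 * so a DestReceiver pointer can be cast back to the private type inside
 * the file's own callbacks.
 */
typedef struct
{
	DestReceiver pub;			/* publicly-known callback pointers */
	IntoClause *into;			/* target relation specification */
	CommandId	output_cid;		/* cmin to insert in output tuples */
} DR_intorel;

Only the implementing file ever dereferences the private fields; everything
else just sees the DestReceiver part, so a header entry would add nothing.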

> Besides, this is my first
> contribution. If there are any questions about the submission, please let me know.
> Thank you~ :)

Welcome!  For the record, it's usually better to provide a patch.  You can
refer to https://wiki.postgresql.org/wiki/Submitting_a_Patch for more
information.




Re: Move the "DR_intorel" struct to a more suitable position

2022-03-15 Thread Andres Freund
Hi, 

On March 15, 2022 8:16:58 PM PDT, "zk.wang"  wrote:
>Generally, we should define structs in header files (.h).

Why? It's perfectly sensible to define types in .c files when they're not used 
elsewhere.

Greetings,

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Move the "DR_intorel" struct to a more suitable position

2022-03-15 Thread zk.wang
Generally, we should define structs in header files (.h). But I found the
struct "DR_intorel" in createas.c and it doesn't seem to be defined in the
proper place. Maybe it should be defined in createas.h.
Besides, this is my first contribution. If there are any questions about
the submission, please let me know. Thank you~ :)

modify_createas.h
Description: Binary data


Re: Add 64-bit XIDs into PostgreSQL 15

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 18:48:34 +0300, Maxim Orlov  wrote in 
> Hi Kyotaro!
> 
> 0001:
> >
> >  The XID_FMT has quite a bad impact on the translatability of error
> >  messages.  Commit 3286065651 removed INT64_FORMAT from translatable
> >  texts for that reason.  This re-introduces it in several places.
> >  0001 itself does no harm, but 0005 replaces XID_FMT with
> >  INT64_FORMAT.  Other patches have the same issue, too.
> >
> I do understand your concern, and I wonder how I can do this better? My
> first intention was to replace XID_FMT with %llu and INT64_FORMAT with
> %lld. This should solve the translatability issue, but I'm not sure about
> the portability of this. Will this work on Windows, etc.? Can you advise me
> on the best solution?

Doesn't doing "errmsg("blah blah  %lld ..", (long long) xid)" work?
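
A minimal sketch of that style (the message text and variable name are only
illustrative):

	ereport(ERROR,
			(errmsg("could not complete operation on transaction %llu",
					(unsigned long long) xid)));

The format string then contains no platform-dependent macro such as XID_FMT
or INT64_FORMAT, so translators always see one stable string, and the
explicit cast keeps %llu/%lld portable, since PostgreSQL's own snprintf
handles those length modifiers on every platform, including Windows.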

> We've fixed all the other things mentioned. Thanks!
> 
> Also added two fixes:
> - CF bot was unhappy with pg_upgrade test in v17 because I forgot to add a
> fix for computation of relminmxid during vacuum on a fresh database.
> - Replace frozen or invalid x_min with FrozenTransactionId or
> InvalidTransactionId respectively during tuple conversion to 64xid.
> 
> Reviews are welcome as always! Thanks!

My pleasure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov  
wrote in 
> В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> > Hmm. v8 returns stashed element with original patition index when the
> > element is *not* reused.  But what I saw in the previous test runs is
> > the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> > behaving the same way (or somehow even worse) with the previous
> > version.
> 
> v8 doesn't differ in REMOVE case neither from master nor from
> previous version. It differs in RETURNED case only.
> Or I didn't understand what you mean :(

In v7, HASH_ENTER returns the element stored in DynaHashReuse using
the freelist_idx of the new key.  v8 uses that of the old key (at the
time of HASH_REUSE).  So in the "REUSE->ENTER (elem exists and
returns the stashed)" case the stashed element is returned to its
original partition.  But that is not what I mentioned.

On the other hand, once the stashed element is reused by HASH_ENTER,
it gives the same resulting state as the HASH_REMOVE->HASH_ENTER
(borrow from old partition) case.  I suspect that the frequent
freelist starvation comes from the latter case.

> > get_hash_entry continuously suffer lack of freelist
> > entry. (FWIW, attached are the test-output diff for both master and
> > patched)
> > 
> > master finally allocated 31 fresh elements for a 100s run.
> > 
> > > ALLOCED: 31;; freshly allocated
> > 
> > v8 finally borrowed 33620 times from another freelist and 0 freshly
> > allocated (ah, this version changes that..)
> > Finally v8 results in:
> > 
> > > RETURNED: 50806;; returned stashed elements
> > > BORROWED: 33620;; borrowed from another freelist
> > > REUSED: 1812664;; stashed
> > > ASSIGNED: 1762377  ;; reused
> > >(ALLOCED: 0);; freshly allocated

(I misunderstood that v8 modified get_hash_entry's preference between
allocation and borrowing.)

I re-ran the same check for v7 and it showed a different result.

RETURNED: 1
ALLOCED: 15
BORROWED: 0
REUSED: 505435
ASSIGNED: 505462 (-27)  ## the counters are not locked.

> Is there any measurable performance hit because of the borrowing?
> Looks like "borrowed" happened 1.5% of the time. And that is with 128kB
> shared buffers, which is extremely small. (Or was it 128MB?)

It is intentionally set small to get extremely frequent buffer
replacements.  The point here was that the patch actually can induce
frequent freelist starvation.  And like you, I also doubt the
significance of the performance hit from that.  I just was not sure.

I re-ran the same for v8 and got a result largely different from the
previous trial on the same v8.

RETURNED: 2
ALLOCED: 0
BORROWED: 435
REUSED: 495444
ASSIGNED: 495467 (-23)

Now "BORROWED" happens 0.8% of REUSED.

> Well, I think some spare entries could reduce borrowing if there is
> a need. I'll test on 128MB with spare entries. If there is profit,
> I'll return some, but will keep SharedBufHash fixed.

I don't doubt the benefit of this patch.  And I have now convinced
myself that the downside is negligible compared to the benefit.

> The master branch does fewer freelist manipulations since it tries to
> insert first, and if there is a collision it doesn't delete the victim
> buffer.
> 
> > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > ...
> > > Strange thing: both master and patched version has higher
> > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > than in first october version [1]. But lower tps at higher
> > > connections number (>= 191 clients).
> > > I'll try to bisect on master this unfortunate change.
> > 
> > The reversing of the preference order between freshly-allocation and
> > borrow-from-another-freelist might affect.
> 
> `master` changed its behaviour as well.
> It is not a problem of the patch at all.

Agreed.  So I think we should go in this direction.

There are some last comments on v8.

+ 
HASH_FIXED_SIZE);

Ah, now I understand that this prevented allocation of new elements.
I think this is good to do for SharedBufHash.



+   long        nfree;      /* number of free entries in the list */
HASHELEMENT *freeList;  /* chain of free elements */
 } FreeListData;
 
+#if SIZEOF_LONG == 4
+typedef pg_atomic_uint32 nalloced_store_t;
+typedef uint32 nalloced_value_t;
+#define nalloced_read(a)   (long)pg_atomic_read_u32(a)
+#define nalloced_add(a, v) pg_atomic_fetch_add_u32((a), (uint32)(v))


I don't think nalloced needs to be the same width as long.  On the
platforms with a 32-bit long, any possible degradation from using a
64-bit atomic there doesn't matter anyway.  So can't we always define
the atomic as 64-bit and use the pg_atomic_* functions directly?
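
For illustration, something like this minimal sketch (the wrapper names are
illustrative, only the pg_atomic_* calls are the existing API, and the other
FreeListData fields are as in dynahash.c):

typedef struct
{
	slock_t		mutex;			/* spinlock for this freelist */
	long		nfree;			/* number of free entries in the list */
	HASHELEMENT *freeList;		/* chain of free elements */
	pg_atomic_uint64 nalloced;	/* number of entries ever allocated */
} FreeListData;

static inline uint64
nalloced_read(FreeListData *fl)
{
	return pg_atomic_read_u64(&fl->nalloced);
}

static inline void
nalloced_add(FreeListData *fl, int64 n)
{
	pg_atomic_fetch_add_u64(&fl->nalloced, n);
}

That removes the SIZEOF_LONG switch entirely, at the cost of a 64-bit atomic
on 32-bit platforms, which as said above should not matter.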


+   case HASH_REUSE:
+   if (currBucket != NULL)

Don't we need an assertion on (DynaHashReuse.element == NULL) here?


-   size = add_size(size, BufTableShmemSize(NBuffers + 

Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

2022-03-15 Thread Thomas Munro
On Wed, Mar 2, 2022 at 10:58 AM Andres Freund  wrote:
> On 2022-03-02 06:46:23 +1300, Thomas Munro wrote:
> > From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
> > From: Thomas Munro 
> > Date: Mon, 28 Feb 2022 11:27:05 +1300
> > Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().
>
> LGTM. Would be nice to have this fixed soon, even if it's just to reduce test
> times :)

Thanks for the review.  Pushed to master and 14, with the wait event
moved to the end of the enum for the back-patch.

> > From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro 
> > Date: Tue, 1 Mar 2022 11:38:27 +1300
> > Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().
>
> LGTM.

Pushed as far back as 12.  It could be done for 10 & 11, but indeed
the code starts getting quite different back there, and since there
are no field reports, I think that's OK for now.

A simple repro, for the record: run installcheck with
shared_buffers=256kB, and then partway through, kill -STOP
$checkpointer to simulate being stalled on IO for a while.  Backends
will soon start waiting for the checkpointer to drain the queue while
dropping relations.  This state was invisible to pg_stat_activity, and
hangs forever if you kill the postmaster and CONT the checkpointer.

> > From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro 
> > Date: Tue, 1 Mar 2022 17:34:43 +1300
> > Subject: [PATCH v2 3/3] Use condition variable to wait when sync request 
> > queue
> >  is full.

> [review]

I'll come back to 0003 (condition variable-based improvement) a bit later.




RE: Logical replication timeout problem

2022-03-15 Thread wangw.f...@fujitsu.com
On Wed, Mar 9, 2022 at 2:45 PM Masahiko Sawada  wrote:
>
Thanks for your comments.

> On Wed, Mar 9, 2022 at 10:26 AM I wrote:
> > On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada 
> wrote:
> > > I've looked at the patch and have a question:
> > Thanks for your review and comments.
> >
> > > +void
> > > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> > > +static int skipped_changes_count = 0;
> > > +
> > > +/*
> > > + * skipped_changes_count is reset when processing changes that do
> not
> > > + * need to be skipped.
> > > + */
> > > +if (!skipped)
> > > +{
> > > +skipped_changes_count = 0;
> > > +return;
> > > +}
> > > +
> > > +/*
> > > + * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> > > changes, try to send a
> > > + * keepalive message.
> > > + */
> > > +#define SKIPPED_CHANGES_THRESHOLD 1
> > > +
> > > +if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> > > +{
> > > +/* Try to send a keepalive message. */
> > > +OutputPluginUpdateProgress(ctx, true);
> > > +
> > > +/* After trying to send a keepalive message, reset the 
> > > flag. */
> > > +skipped_changes_count = 0;
> > > +}
> > > +}
> > >
> > > Since we send a keepalive after continuously skipping 1 changes, the
> > > originally reported issue can still occur if skipping 1 changes took 
> > > more
> than
> > > the timeout and the walsender didn't send any change while that, is that
> right?
> > Yes, theoretically so.
> > But after testing, I think this value should be conservative enough not to
> reproduce
> > this bug.
> 
> But it really depends on the workload, the server condition, and the
> timeout value, right? The logical decoding might involve much disk I/O
> to spill/load intermediate data, and the system might be under a
> high-load condition. Why don't we check both the count and the time?
> That is, I think we can send a keep-alive either if we skipped 1
> changes or if we didn't send anything for wal_sender_timeout / 2.
Yes, you are right.
Do you mean that when skipping every change, we should check whether it has
been more than (wal_sender_timeout / 2) since we last sent anything?
IIUC, I tried to send keep-alive messages based on time before [1], but after
testing, I found that it brings a slight overhead. So I am not sure whether,
in a function (pgoutput_change) that is invoked frequently, this kind of
overhead should be introduced.
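
For reference, a minimal sketch of the combined count-plus-time check being
discussed (this is not the actual patch; the last_keepalive_time handling is
illustrative, while SKIPPED_CHANGES_THRESHOLD, wal_sender_timeout and the
two-argument OutputPluginUpdateProgress are as in the patch quoted above):

static void
SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped)
{
	static int	skipped_changes_count = 0;
	static TimestampTz last_keepalive_time = 0;
	TimestampTz now;

	/* Reset the counter when a change is actually sent. */
	if (!skipped)
	{
		skipped_changes_count = 0;
		return;
	}

	now = GetCurrentTimestamp();
	if (last_keepalive_time == 0)
		last_keepalive_time = now;

	/*
	 * Try to send a keepalive either after skipping
	 * SKIPPED_CHANGES_THRESHOLD changes in a row, or when nothing has been
	 * sent for half of wal_sender_timeout (skip the time check when the
	 * timeout is disabled).
	 */
	if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD ||
		(wal_sender_timeout > 0 &&
		 TimestampDifferenceExceeds(last_keepalive_time, now,
									wal_sender_timeout / 2)))
	{
		OutputPluginUpdateProgress(ctx, true);
		skipped_changes_count = 0;
		last_keepalive_time = now;
	}
}

The extra GetCurrentTimestamp() call per skipped change is the overhead in
question; it could also be taken only every N skipped changes if that turns
out to be measurable.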

> Also, the patch changes the current behavior of wal senders; with the
> patch, we send keep-alive messages even when wal_sender_timeout = 0.
> But I'm not sure it's a good idea. The subscriber's
> wal_receiver_timeout might be lower than wal_sender_timeout. Instead,
> I think it's better to periodically check replies and send a reply to
> the keep-alive message sent from the subscriber if necessary, for
> example, every 1 skipped changes.
Sorry, I could not follow what you said. I am not sure; do you mean the
following?
1. When we haven't sent anything for (wal_sender_timeout / 2), or we have
skipped SKIPPED_CHANGES_THRESHOLD changes continuously, we invoke the function
WalSndKeepalive in the function WalSndUpdateProgress and send a keepalive
message to the subscriber, requesting an immediate reply.
2. If, after sending a keepalive message, that many changes are then skipped
continuously again, we need to handle the reply from the subscriber side when
processing the change that crosses the threshold. The handling approach is to
reply to the confirmation message from the subscriber.

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Please let me know if I understand wrong.

Regards,
Wang wei


Re: Skipping logical replication transactions on subscriber side

2022-03-15 Thread Amit Kapila
On Wed, Mar 16, 2022 at 7:58 AM Amit Kapila  wrote:
>
> On Wed, Mar 16, 2022 at 6:03 AM Masahiko Sawada  wrote:
> >
> > On Tue, Mar 15, 2022 at 7:18 PM Amit Kapila  wrote:
> > >
> > > On Tue, Mar 15, 2022 at 11:43 AM Masahiko Sawada  
> > > wrote:
> > > >
> > >
> > > 6.
> > > @@ -1583,7 +1649,8 @@ apply_handle_insert(StringInfo s)
> > >   TupleTableSlot *remoteslot;
> > >   MemoryContext oldctx;
> > >
> > > - if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
> > > + if (is_skipping_changes() ||
> > >
> > > Is there a reason to keep the skip_changes check here and in other DML
> > > operations instead of at one central place in apply_dispatch?
> >
> > Since we already have the check of applying the change on the spot at
> > the beginning of the handlers I feel it's better to add
> > is_skipping_changes() to the check than add a new if statement to
> > apply_dispatch, but do you prefer to check it in one central place in
> > apply_dispatch?
> >
>
> I think either way is fine. I just wanted to know the reason, your
> current change looks okay to me.
>

I feel it is better to at least add a comment suggesting that we skip
only data modification changes, because the other part of the check
(handle_stream_*) is there in other message handlers as well. It will
make it easier to add a similar check in future message handlers.

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2022-03-15 Thread Amit Kapila
On Tue, Mar 15, 2022 at 7:30 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Tuesday, March 15, 2022 3:13 PM Masahiko Sawada  
> wrote:
> > I've attached an updated version patch.
>
> A couple of minor comments on v14.
>
> (1) apply_handle_commit_internal
>
>
> +   if (is_skipping_changes())
> +   {
> +   stop_skipping_changes();
> +
> +   /*
> +* Start a new transaction to clear the subskipxid, if not 
> started
> +* yet. The transaction is committed below.
> +*/
> +   if (!IsTransactionState())
> +   StartTransactionCommand();
> +   }
> +
>
> I suppose we can move this condition check and stop_skipping_changes() call
> to the inside of the block we enter when IsTransactionState() returns true.
>
> As the comment of apply_handle_commit_internal() mentions,
> it's the helper function for apply_handle_commit() and
> apply_handle_stream_commit().
>
> Then, I couldn't think of a case where either caller doesn't open
> a transaction before the call of apply_handle_commit_internal().
> For applying spooled messages, we call begin_replication_step as well.
> 
> I may be missing something, but the time when we receive a COMMIT message
> without having opened a transaction would be the case of empty transactions
> in which the subscription (and its subscription worker) is not interested.
>

I think when we skip non-streamed transactions we don't start a
transaction. So, if we do what you are suggesting, we will miss
clearing the skip_lsn after skipping the transaction.

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2022-03-15 Thread Amit Kapila
On Wed, Mar 16, 2022 at 6:03 AM Masahiko Sawada  wrote:
>
> On Tue, Mar 15, 2022 at 7:18 PM Amit Kapila  wrote:
> >
> > On Tue, Mar 15, 2022 at 11:43 AM Masahiko Sawada  
> > wrote:
> > >
> >
> > 6.
> > @@ -1583,7 +1649,8 @@ apply_handle_insert(StringInfo s)
> >   TupleTableSlot *remoteslot;
> >   MemoryContext oldctx;
> >
> > - if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
> > + if (is_skipping_changes() ||
> >
> > Is there a reason to keep the skip_changes check here and in other DML
> > operations instead of at one central place in apply_dispatch?
>
> Since we already have the check of applying the change on the spot at
> the beginning of the handlers I feel it's better to add
> is_skipping_changes() to the check than add a new if statement to
> apply_dispatch, but do you prefer to check it in one central place in
> apply_dispatch?
>

I think either way is fine. I just wanted to know the reason, your
current change looks okay to me.

Some questions/comments
==
1. IIRC, earlier, we thought of allowing the use of this option (SKIP)
only for superusers (as this can lead to inconsistent data if not used
carefully), but I don't see that check in the latest patch. What is the
reason for that?

2.
+ /*
+ * Update the subskiplsn of the tuple to InvalidXLogRecPtr.

I think we can change the above part of the comment to "Clear subskiplsn."

3.
+ * Since we already have

Isn't it better to say here: Since we have already ...?

-- 
With Regards,
Amit Kapila.




Re: Unhyphenation of crash-recovery

2022-03-15 Thread Andres Freund
Hi

On March 15, 2022 6:25:09 PM PDT, Kyotaro Horiguchi  
wrote:
>Hello, this is a derived topic from [1], summarized as $SUBJECT.
>
>This just removes useless hyphens from the words
>"(crash|emergency)-recovery". We don't have such wording for "archive
>recovery". This patch fixes non-user-facing texts as well as
>user-facing ones.

I don't see the point of this kind of change. 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 23:16:52 +1300, Thomas Munro  wrote 
in 
> On Tue, Mar 15, 2022 at 10:30 PM Michael Paquier  wrote:
> > So, which one of a relative path or an absolute path do you think
> > would be better for the user?  My preference tends toward the relative
> > path, as we know that all those tablespaces stay in pg_tblspc/ so one
> > can make the difference with normal tablespaces more easily.  The
> > barrier is thin, though :p
> 
> Sounds good to me.

+1. Doesn't the doc need to mention that?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 09:19:13 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> In short, I split out the two topics other than the checkpoint log into
> other threads.

So, this is about the main topic of this thread, adding LSNs to the
checkpoint log.  The other topics have moved to other threads [1], [2],
[3].

I think this is no longer controversial on its own.  So this patch is now
really Ready-for-Committer and is waiting to be picked up.

regards.


[1] 
https://www.postgresql.org/message-id/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com
[2] 
https://www.postgresql.org/message-id/20220316.102900.2003692961119672246.horikyota.ntt%40gmail.com
[3] 
https://www.postgresql.org/message-id/20220316.102509.785466054344164656.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 4c3d40e7aaf29cb905f3561a291d35981d762456 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 4 Mar 2022 13:22:41 +0900
Subject: [PATCH v13] Add checkpoint and redo LSN to LogCheckpointEnd log
 message

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN (end) and REDO LSN (start). It gives more
context while analyzing checkpoint-related issues. pg_controldata
gives the last checkpoint LSN and REDO LSN, but having this info
alongside the log message helps analyze issues that happened
previously, connect the dots, and identify the root cause.
---
 src/backend/access/transam/xlog.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ed16f279b1..b85c76d8f8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6121,7 +6121,8 @@ LogCheckpointEnd(bool restartpoint)
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -6134,14 +6135,21 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0))));
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/*
+		 * ControlFileLock is not required as we are the only
+		 * updater of these variables.
+		 */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo))));
 	else
 		ereport(LOG,
 (errmsg("checkpoint complete: wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -6154,7 +6162,13 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0))));
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/*
+		 * ControlFileLock is not required as we are the only
+		 * updater of these variables.
+		 */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo))));
 }
 
 /*
-- 
2.27.0



Reword "WAL location" as "WAL LSN"

2022-03-15 Thread Kyotaro Horiguchi
Uh.. Sorry, I forgot to change the subject. Resent with the correct
subject.

=
Hello, this is a follow-up topic of [1] (adding LSNs to the checkpoint log).

Many user-facing texts contain wording like "WAL location" or the
like. The attached are WIP patches that change such wordings to "WAL
LSN" or similar.

The attached patches are v13, but they are unchanged from v12.
My latest comments on this version are as follows.

https://www.postgresql.org/message-id/20220209.115204.1794224638476710282.horikyota@gmail.com

> The old 0003 (attached 0004):
> 
> 
> 
> +++ b/src/backend/access/rmgrdesc/xlogdesc.c
> - appendStringInfo(buf, "redo %X/%X; "
> + appendStringInfo(buf, "redo lsn %X/%X; "
> 
> 
> 
> It is shown in the context of a checkpoint record, so I think it is
> not needed and rather lengthens the dump line uselessly.
> 
> 
> 
> +++ b/src/backend/access/transam/xlog.c
> - (errmsg("request to flush past end of generated 
> WAL; request %X/%X, current position %X/%X",
> + (errmsg("request to flush past end of generated 
> WAL; request lsn %X/%X, current lsn %X/%X",
> 
> 
> 
> +++ b/src/backend/replication/walsender.c
> - (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush position of this server %X/%X",
> + (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush LSN of this server %X/%X",
> 
> 
> 
> "WAL" is upper-cased. So it seems rather strange that the "lsn" is
> lower-cased.  In the first place the message doesn't look like a
> user-facing error message and I feel we don't need position or lsn
> there..
> 
> 
> 
> +++ b/src/bin/pg_rewind/pg_rewind.c
> - pg_log_info("servers diverged at WAL location %X/%X on timeline 
> %u",
> + pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u",
> 
> 
> 
> I feel that we don't need "WAL" there.
> 
> 
> 
> +++ b/src/bin/pg_waldump/pg_waldump.c
> - printf(_("  -e, --end=RECPTR   stop reading at WAL location 
> RECPTR\n"));
> + printf(_("  -e, --end=RECPTR   stop reading at WAL LSN RECPTR\n"));
> 
> 
> 
> Mmm.. "WAL LSN RECPTR" looks strange to me.  In the first place I
> don't think "RECPTR" is a user-facing term. Doesn't something like the
> following work?
> 
> 
> 
> + printf(_("  -e, --end=WAL-LSN   stop reading at WAL-LSN\n"));
> 
> 
> 
> Some changes in this patch shorten the main message text of
> fprintf-ish functions.  That allows the succeeding parameters to be
> inlined.
regards.

[1] 
https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From afed063f09b788bc07a0e2665499c3e3c3edc214 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 1 Feb 2022 03:57:04 +
Subject: [PATCH v13 1/2] Change "location" to "lsn" in pg_controldata

---
 src/bin/pg_controldata/pg_controldata.c| 10 +-
 src/test/recovery/t/016_min_consistency.pl |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d94..59f39267df 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -235,9 +235,9 @@ main(int argc, char *argv[])
 		   dbState(ControlFile->state));
 	printf(_("pg_control last modified: %s\n"),
 		   pgctime_str);
-	printf(_("Latest checkpoint location:   %X/%X\n"),
+	printf(_("Latest checkpoint LSN:%X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPoint));
-	printf(_("Latest checkpoint's REDO location:%X/%X\n"),
+	printf(_("Latest checkpoint's REDO LSN: %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo));
 	printf(_("Latest checkpoint's REDO WAL file:%s\n"),
 		   xlogfilename);
@@ -274,13 +274,13 @@ main(int argc, char *argv[])
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->unloggedLSN));
-	printf(_("Minimum recovery ending location: %X/%X\n"),
+	printf(_("Minimum recovery ending LSN:  %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint));
 	printf(_("Min recovery ending loc's timeline:   %u\n"),
 		   ControlFile->minRecoveryPointTLI);
-	printf(_("Backup start location:%X/%X\n"),
+	printf(_("Backup start LSN: %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupStartPoint));
-	printf(_("Backup end location:  %X/%X\n"),
+	printf(_("Backup end LSN:   %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:%s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
diff --git a/src/test/recovery/t/016_min_consistency.pl b/src/test/recovery/t/016_min_consistency.pl

Unhyphenation of crash-recovery

2022-03-15 Thread Kyotaro Horiguchi
Hello, this is a derived topic from [1], summarized as $SUBJECT.

This just removes useless hyphens from the words
"(crash|emergency)-recovery". We don't have such wording for "archive
recovery". This patch fixes non-user-facing texts as well as
user-facing ones.

regards.

[1] 
https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 751acd2510e036a5c484e5768dd9e4e85cd8b2cf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Mar 2022 13:53:38 +0900
Subject: [PATCH v13] Get rid of uses of some hyphenated words

There are uses of both "crash-recovery" and "crash recovery" in the
tree.  All occurrences of "archive recovery" are spelled
unhyphenated.  Unify the similar uses to be unhyphenated.
---
 doc/src/sgml/ref/pg_ctl-ref.sgml  | 2 +-
 doc/src/sgml/ref/postgres-ref.sgml| 2 +-
 src/backend/commands/trigger.c| 2 +-
 src/backend/utils/cache/relcache.c| 2 +-
 src/test/recovery/t/023_pitr_prepared_xact.pl | 2 +-
 src/test/regress/parallel_schedule| 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/pg_ctl-ref.sgml b/doc/src/sgml/ref/pg_ctl-ref.sgml
index 3946fa52ea..7cbe5c3048 100644
--- a/doc/src/sgml/ref/pg_ctl-ref.sgml
+++ b/doc/src/sgml/ref/pg_ctl-ref.sgml
@@ -194,7 +194,7 @@ PostgreSQL documentation
rolled back and clients are forcibly disconnected, then the
server is shut down.  Immediate mode will abort
all server processes immediately, without a clean shutdown.  This choice
-   will lead to a crash-recovery cycle during the next server start.
+   will lead to a crash recovery cycle during the next server start.
   
 
   
diff --git a/doc/src/sgml/ref/postgres-ref.sgml b/doc/src/sgml/ref/postgres-ref.sgml
index 55a3f6c69d..fc22fbc4fe 100644
--- a/doc/src/sgml/ref/postgres-ref.sgml
+++ b/doc/src/sgml/ref/postgres-ref.sgml
@@ -719,7 +719,7 @@ PostgreSQL documentation
is also unwise to send SIGKILL to a server
process  the main postgres process will
interpret this as a crash and will force all the sibling processes
-   to quit as part of its standard crash-recovery procedure.
+   to quit as part of its standard crash recovery procedure.
   
  
 
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index e08bd9a370..ce170b4c6e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -1887,7 +1887,7 @@ RelationBuildTriggers(Relation relation)
 	/*
 	 * Note: since we scan the triggers using TriggerRelidNameIndexId, we will
 	 * be reading the triggers in name order, except possibly during
-	 * emergency-recovery operations (ie, IgnoreSystemIndexes). This in turn
+	 * emergency recovery operations (ie, IgnoreSystemIndexes). This in turn
 	 * ensures that triggers will be fired in name order.
 	 */
 	ScanKeyInit(,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index fccffce572..2419cf5285 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -6598,7 +6598,7 @@ RelationCacheInitFilePostInvalidate(void)
  * Remove the init files during postmaster startup.
  *
  * We used to keep the init files across restarts, but that is unsafe in PITR
- * scenarios, and even in simple crash-recovery cases there are windows for
+ * scenarios, and even in simple crash recovery cases there are windows for
  * the init files to become out-of-sync with the database.  So now we just
  * remove them during startup and expect the first backend launch to rebuild
  * them.  Of course, this has to happen in each database of the cluster.
diff --git a/src/test/recovery/t/023_pitr_prepared_xact.pl b/src/test/recovery/t/023_pitr_prepared_xact.pl
index 39e8a8fa17..0f48c20ed1 100644
--- a/src/test/recovery/t/023_pitr_prepared_xact.pl
+++ b/src/test/recovery/t/023_pitr_prepared_xact.pl
@@ -1,7 +1,7 @@
 
 # Copyright (c) 2021-2022, PostgreSQL Global Development Group
 
-# Test for point-in-time-recovery (PITR) with prepared transactions
+# Test for point-in-time recovery (PITR) with prepared transactions
 use strict;
 use warnings;
 use PostgreSQL::Test::Cluster;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 6d8f524ae9..8251744bbf 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -13,7 +13,7 @@ test: test_setup
 
 # run tablespace by itself, and early, because it forces a checkpoint;
 # we'd prefer not to have checkpoints later in the tests because that
-# interferes with crash-recovery testing.
+# interferes with crash recovery testing.
 test: tablespace
 
 # --
-- 
2.27.0



Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi
Hello, this is a follow-up topic of [1] (add LSNs to checkpint logs).

Many user-facing texts contains wording like "WAL location" or such
like. The attached is WIP patches that fixes such wordings to "WAL
LSN" or alikes.

The attached patches is v13 but it is not changed at all from v12.
My lastest comments on this version are as follows.

https://www.postgresql.org/message-id/20220209.115204.1794224638476710282.horikyota@gmail.com

> The old 0003 (attached 0004):
> 
> 
> 
> +++ b/src/backend/access/rmgrdesc/xlogdesc.c
> - appendStringInfo(buf, "redo %X/%X; "
> + appendStringInfo(buf, "redo lsn %X/%X; "
> 
> 
> 
> It is shown in the context of a checkpoint record, so I think it is
> not needed; it rather lengthens the dump line uselessly.
> 
> 
> 
> +++ b/src/backend/access/transam/xlog.c
> - (errmsg("request to flush past end of generated 
> WAL; request %X/%X, current position %X/%X",
> + (errmsg("request to flush past end of generated 
> WAL; request lsn %X/%X, current lsn %X/%X",
> 
> 
> 
> +++ b/src/backend/replication/walsender.c
> - (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush position of this server %X/%X",
> + (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush LSN of this server %X/%X",
> 
> 
> 
> "WAL" is upper-cased. So it seems rather strange that the "lsn" is
> lower-cased.  In the first place the message doesn't look like a
> user-facing error message and I feel we don't need position or lsn
> there..
> 
> 
> 
> +++ b/src/bin/pg_rewind/pg_rewind.c
> - pg_log_info("servers diverged at WAL location %X/%X on timeline 
> %u",
> + pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u",
> 
> 
> 
> I feel that we don't need "WAL" there.
> 
> 
> 
> +++ b/src/bin/pg_waldump/pg_waldump.c
> - printf(_("  -e, --end=RECPTR   stop reading at WAL location 
> RECPTR\n"));
> + printf(_("  -e, --end=RECPTR   stop reading at WAL LSN RECPTR\n"));
> 
> 
> 
> Mmm.. "WAL LSN RECPTR" looks strange to me.  In the first place I
> don't think "RECPTR" is a user-facing term. Doesn't something like the
> following work?
> 
> 
> 
> + printf(_("  -e, --end=WAL-LSN   stop reading at WAL-LSN\n"));
> 
> 
> 
> Some changes in this patch shorten the main message text of
> fprintf-ish functions.  That allows the succeeding parameters to be
> inlined.
regards.

[1] 
https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From afed063f09b788bc07a0e2665499c3e3c3edc214 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 1 Feb 2022 03:57:04 +
Subject: [PATCH v13 1/2] Change "location" to "lsn" in pg_controldata

---
 src/bin/pg_controldata/pg_controldata.c| 10 +-
 src/test/recovery/t/016_min_consistency.pl |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d94..59f39267df 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -235,9 +235,9 @@ main(int argc, char *argv[])
 		   dbState(ControlFile->state));
 	printf(_("pg_control last modified: %s\n"),
 		   pgctime_str);
-	printf(_("Latest checkpoint location:   %X/%X\n"),
+	printf(_("Latest checkpoint LSN:%X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPoint));
-	printf(_("Latest checkpoint's REDO location:%X/%X\n"),
+	printf(_("Latest checkpoint's REDO LSN: %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo));
 	printf(_("Latest checkpoint's REDO WAL file:%s\n"),
 		   xlogfilename);
@@ -274,13 +274,13 @@ main(int argc, char *argv[])
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->unloggedLSN));
-	printf(_("Minimum recovery ending location: %X/%X\n"),
+	printf(_("Minimum recovery ending LSN:  %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint));
 	printf(_("Min recovery ending loc's timeline:   %u\n"),
 		   ControlFile->minRecoveryPointTLI);
-	printf(_("Backup start location:%X/%X\n"),
+	printf(_("Backup start LSN: %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupStartPoint));
-	printf(_("Backup end location:  %X/%X\n"),
+	printf(_("Backup end LSN:   %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:%s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
diff --git a/src/test/recovery/t/016_min_consistency.pl b/src/test/recovery/t/016_min_consistency.pl
index 5e0655c2a9..4ee20309cd 100644
--- a/src/test/recovery/t/016_min_consistency.pl

Possible corruption by CreateRestartPoint at promotion

2022-03-15 Thread Kyotaro Horiguchi
Hello,  (Cc:ed Fujii-san)

This is a diverged topic from [1], which is summarized as $SUBJECT.

To recap:

While discussing additional LSNs in the checkpoint log message,
Fujii-san pointed out [2] that there is a case where
CreateRestartPoint leaves the database unrecoverable when a concurrent
promotion happens. That corruption is "fixed" by the next checkpoint,
so it is not a severe corruption.

AFAICS, since 9.5 no check(/restart)points run concurrently with a
restartpoint [3].  So I propose to remove the code path, as attached.

regards.


[1] 
https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

[2] 
https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com

[3] 
https://www.postgresql.org/message-id/20220222.174401.765586897814316743.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 57823c5690469cf404de9fbdf37f55d09abaedd5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Mar 2022 13:18:30 +0900
Subject: [PATCH v4] Correctly update control file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup based on the checkpoint that
has just finished in the function.  If the database has exited the
DB_IN_ARCHIVE_RECOVERY state by the time the function is about to
update the control file, the function refrains from updating the file
at all, then proceeds to WAL cleanup using the latest REDO LSN, which
is now inconsistent with the control file.  As a result, the
succeeding cleanup procedure removes more WAL files than the control
file justifies and leaves the database unrecoverable until the next
checkpoint finishes.

Along with that fix, we remove a dead code path for the case where
some other process ran a simultaneous checkpoint.  It seems like just
a preventive measure, but it's no longer useful because we are sure
that checkpoints are performed only by the checkpointer, except in
single-process mode.
---
 src/backend/access/transam/xlog.c | 69 ---
 1 file changed, 44 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ed16f279b1..dd9c564988 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6900,6 +6900,9 @@ CreateRestartPoint(int flags)
 	XLogSegNo	_logSegNo;
 	TimestampTz xtime;
 
+	/* we don't assume concurrent checkpoint/restartpoint to run */
+	Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
 	/* Get a local copy of the last safe checkpoint record. */
	SpinLockAcquire(&XLogCtl->info_lck);
 	lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -6965,7 +6968,7 @@ CreateRestartPoint(int flags)
 
 	/* Also update the info_lck-protected copy */
	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+	XLogCtl->RedoRecPtr = RedoRecPtr;
	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
@@ -6984,7 +6987,10 @@ CreateRestartPoint(int flags)
 	/* Update the process title */
 	update_checkpoint_display(flags, true, false);
 
-	CheckPointGuts(lastCheckPoint.redo, flags);
+	CheckPointGuts(RedoRecPtr, flags);
+
+	/* Update pg_control */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -6992,30 +6998,26 @@ CreateRestartPoint(int flags)
 	 */
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+	Assert (PriorRedoPtr < RedoRecPtr);
+
+	ControlFile->checkPoint = lastCheckPointRecPtr;
+	ControlFile->checkPointCopy = lastCheckPoint;
+
 	/*
-	 * Update pg_control, using current time.  Check that it still shows
-	 * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-	 * this is a quick hack to make sure nothing really bad happens if somehow
-	 * we get here after the end-of-recovery checkpoint.
+	 * Ensure minRecoveryPoint is past the checkpoint record while archive
+	 * recovery is still ongoing.  Normally, this will have happened already
+	 * while writing out dirty buffers, but not necessarily - e.g. because no
+	 * buffers were dirtied.  We do this because a non-exclusive base backup
+	 * uses minRecoveryPoint to determine which WAL files must be included in
+	 * the backup, and the file (or files) containing the checkpoint record
+	 * must be included, at a minimum. Note that for an ordinary restart of
+	 * recovery there's no value in having the minimum recovery point any
+	 * earlier than this anyway, because redo will begin just after the
+	 * checkpoint record.  This is a quick hack to make sure nothing really bad
+	 * happens if somehow we get here after the end-of-recovery checkpoint.
 	 */
-	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-		ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
 	{
-		ControlFile->checkPoint = lastCheckPointRecPtr;
-		ControlFile->checkPointCopy = lastCheckPoint;
-
-		/*
-		 * Ensure minRecoveryPoint is past the checkpoint record.  

Re: Skipping logical replication transactions on subscriber side

2022-03-15 Thread Masahiko Sawada
On Tue, Mar 15, 2022 at 7:18 PM Amit Kapila  wrote:
>
> On Tue, Mar 15, 2022 at 11:43 AM Masahiko Sawada  
> wrote:
> >
>
> 6.
> @@ -1583,7 +1649,8 @@ apply_handle_insert(StringInfo s)
>   TupleTableSlot *remoteslot;
>   MemoryContext oldctx;
>
> - if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
> + if (is_skipping_changes() ||
>
> Is there a reason to keep the skip_changes check here and in other DML
> operations instead of at one central place in apply_dispatch?

Since we already have a check at the beginning of the handlers for
whether to apply the change on the spot, I feel it's better to add
is_skipping_changes() to that check than to add a new if statement to
apply_dispatch. But do you prefer to check it in one central place in
apply_dispatch?
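
For illustration, the central-place alternative would look roughly like
this (just a sketch, not a patch; message_modifies_data() is an invented
helper name for "is this an INSERT/UPDATE/DELETE/TRUNCATE message"):

static void
apply_dispatch(StringInfo s)
{
	LogicalRepMsgType action = pq_getmsgbyte(s);

	/* hypothetical central check instead of per-handler checks */
	if (is_skipping_changes() && message_modifies_data(action))
		return;

	switch (action)
	{
			/* ... existing per-message handlers ... */
	}
}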

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 18:26:26 +0900, Michael Paquier  wrote 
in 
> On Tue, Mar 15, 2022 at 05:23:40PM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 15 Mar 2022 12:19:47 +0530, Bharath Rupireddy 
> >  wrote in 
> >> On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi
> >>  wrote:
> >> 0001 - I don't think you need to do this as UpdateControlFile
> >> (update_controlfile) will anyway update it, no?
> >> + /* Update control file using current time */
> >> + ControlFile->time = (pg_time_t) time(NULL);
> > 
> > Ugh.. Yes. It is a copy-pasto from older versions.  They may have the
> > same copy-pasto..
> 
> This thread has shifted to an entirely different discussion,
> presenting patches that touch code paths unrelated to what was first
> stated.  Shouldn't you create a new thread with a proper $subject to
> attract a more correct audience?  

I have felt the same for some messages now.  I think Fujii-san wanted
to fix the CreateRestartPoint issue before the checkpoint log patch,
but he seems to be busy these days.

Since the CreateRestartPoint issue is orthogonal to the main patch,
I'll separate that part into another thread.

This thread is discussing one other topic, wording in user-facing
texts. This is also orthogonal (3-dimensionally?) to the other two topics.

In short, I split out the two topics other than the checkpoint log into
other threads.

Thanks for cueing me to do that!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Accept IP addresses in server certificate SANs

2022-03-15 Thread Jacob Champion
On Tue, 2022-03-15 at 21:41 +, Jacob Champion wrote:
> Sounds good. I'll work on adding tests for the current behavior, and if
> the committers don't like it, we can change it.

Done in v7, attached.

--Jacob
commit 51052e322f4d4e4f75d3cf3d000a363244696013
Author: Jacob Champion 
Date:   Tue Mar 15 16:02:15 2022 -0700

squash! libpq: allow IP address SANs in server certs

Per discussion on list, add tests for the (counter-intuitive) standing
behavior that matches IP addresses embedded in dNSName SANs.

diff --git a/src/test/ssl/conf/server-ip-in-dnsname.config 
b/src/test/ssl/conf/server-ip-in-dnsname.config
new file mode 100644
index 00..b15649aef7
--- /dev/null
+++ b/src/test/ssl/conf/server-ip-in-dnsname.config
@@ -0,0 +1,18 @@
+# An OpenSSL format CSR config file for creating a server certificate.
+#
+
+[ req ]
+distinguished_name = req_distinguished_name
+req_extensions = v3_req
+prompt = no
+
+[ req_distinguished_name ]
+OU = PostgreSQL test suite
+
+# For Subject Alternative Names
+[ v3_req ]
+subjectAltName = @alt_names
+
+# Normally IP addresses should not go into a dNSName.
+[ alt_names ]
+DNS.1 = 192.0.2.1
diff --git a/src/test/ssl/ssl/server-ip-in-dnsname.crt 
b/src/test/ssl/ssl/server-ip-in-dnsname.crt
new file mode 100644
index 00..78ad8d99c8
--- /dev/null
+++ b/src/test/ssl/ssl/server-ip-in-dnsname.crt
@@ -0,0 +1,70 @@
+Certificate:
+Data:
+Version: 3 (0x2)
+Serial Number: 2315416547509162496 (0x2022031515585200)
+Signature Algorithm: sha256WithRSAEncryption
+Issuer: CN = Test CA for PostgreSQL SSL regression test server certs
+Validity
+Not Before: Mar 15 22:58:52 2022 GMT
+Not After : Jul 31 22:58:52 2049 GMT
+Subject: OU = PostgreSQL test suite
+Subject Public Key Info:
+Public Key Algorithm: rsaEncryption
+RSA Public-Key: (2048 bit)
+Modulus:
+00:ca:67:e5:b3:f5:fc:e7:c1:41:1f:f2:bc:e9:0e:
+07:3c:40:ac:4d:63:d5:84:a1:55:ad:a9:72:3f:3d:
+eb:e0:95:0c:48:15:37:91:e5:bb:3e:ca:1d:4f:c9:
+63:33:59:08:01:db:e5:60:07:ba:f5:f1:29:ea:23:
+e1:38:b9:6d:ed:4a:b2:a3:b4:98:dd:69:f9:13:0d:
+ed:59:42:a4:2a:5b:d0:05:93:cf:8c:fe:23:c4:d4:
+85:26:67:9a:08:21:1e:f7:d6:e1:17:dc:64:c0:9c:
+1e:ad:6f:7a:f5:53:0f:14:7f:70:06:c3:3d:8a:60:
+04:20:f9:17:f4:99:31:1c:8c:0f:0e:41:ec:82:cb:
+99:fb:74:7a:a0:35:9c:2a:9a:bf:2a:81:08:66:6f:
+c7:5a:25:f0:de:41:7d:57:6b:0d:7a:7f:ac:de:7d:
+f7:b9:21:05:64:11:67:c8:3c:e0:56:72:15:95:df:
+a7:0a:79:ed:eb:b4:38:64:03:cd:91:57:a0:42:f6:
+b7:83:95:95:de:bb:2b:98:dc:72:55:95:c4:76:3a:
+14:67:07:8c:2b:f2:aa:ce:3c:3c:23:ce:47:ce:1a:
+9d:9c:23:1a:ca:95:39:2e:b6:e7:4f:c3:58:a0:50:
+3b:82:b2:86:44:d5:7f:40:16:fc:80:86:48:c3:8a:
+90:09
+Exponent: 65537 (0x10001)
+X509v3 extensions:
+X509v3 Subject Alternative Name: 
+DNS:192.0.2.1
+Signature Algorithm: sha256WithRSAEncryption
+ c8:00:01:f4:58:92:84:a5:b3:cd:d1:08:a1:37:a4:85:50:0e:
+ 56:2f:88:b1:b0:10:02:a3:d3:88:00:69:12:9c:a7:8e:79:51:
+ 06:e6:fb:e0:3a:76:b4:86:34:1c:39:c8:2d:23:5a:02:0c:b8:
+ 54:2d:c8:c2:bb:0c:62:21:6e:b2:42:7b:4e:6b:3c:a8:d2:ca:
+ 28:26:bd:21:12:6d:96:7e:2a:6f:22:94:2c:e9:6f:44:d4:e4:
+ dc:b3:ad:b8:04:05:5f:90:0c:2b:e9:64:f6:57:0a:08:00:f7:
+ 10:0f:69:d1:b4:84:21:49:42:11:f4:9e:d0:e5:66:f3:b9:0c:
+ 3b:a6:d8:50:84:17:d6:fd:43:d6:c3:bb:da:8d:e9:74:af:57:
+ 5f:95:b7:b4:0a:0a:79:58:63:5f:a7:59:18:cb:e4:1f:5d:6f:
+ d8:8e:4b:2d:64:43:f5:28:59:be:4c:5e:7c:bb:1a:c8:0e:66:
+ 1b:a1:2f:04:7a:b5:11:10:64:22:c6:28:db:23:40:1f:5e:34:
+ a1:c6:d8:b2:4b:ad:eb:6d:da:68:dc:2d:01:3e:3f:58:c7:65:
+ 6d:5a:2e:14:d2:aa:1e:7f:f8:8b:27:12:41:08:bb:c4:ff:de:
+ fe:76:e3:dd:2b:1d:00:92:43:96:03:d5:6e:7a:79:16:2e:9e:
+ fd:ed:43:3c
+-----BEGIN CERTIFICATE-----
+MIIC/DCCAeSgAwIBAgIIICIDFRVYUgAwDQYJKoZIhvcNAQELBQAwQjFAMD4GA1UE
+Aww3VGVzdCBDQSBmb3IgUG9zdGdyZVNRTCBTU0wgcmVncmVzc2lvbiB0ZXN0IHNl
+cnZlciBjZXJ0czAeFw0yMjAzMTUyMjU4NTJaFw00OTA3MzEyMjU4NTJaMCAxHjAc
+BgNVBAsMFVBvc3RncmVTUUwgdGVzdCBzdWl0ZTCCASIwDQYJKoZIhvcNAQEBBQAD
+ggEPADCCAQoCggEBAMpn5bP1/OfBQR/yvOkOBzxArE1j1YShVa2pcj896+CVDEgV
+N5Hluz7KHU/JYzNZCAHb5WAHuvXxKeoj4Ti5be1KsqO0mN1p+RMN7VlCpCpb0AWT
+z4z+I8TUhSZnmgghHvfW4RfcZMCcHq1vevVTDxR/cAbDPYpgBCD5F/SZMRyMDw5B
+7ILLmft0eqA1nCqavyqBCGZvx1ol8N5BfVdrDXp/rN5997khBWQRZ8g84FZyFZXf
+pwp57eu0OGQDzZFXoEL2t4OVld67K5jcclWVxHY6FGcHjCvyqs48PCPOR84anZwj
+GsqVOS6250/DWKBQO4KyhkTVf0AW/ICGSMOKkAkCAwEAAaMYMBYwFAYDVR0RBA0w

Re: document the need to analyze partitioned tables

2022-03-15 Thread Justin Pryzby
On Mon, Mar 14, 2022 at 05:23:54PM -0400, Robert Haas wrote:
> On Fri, Jan 21, 2022 at 1:31 PM Tomas Vondra  
> wrote:
> > [ new patch ]
> 
> This patch is originally by Justin. The latest version is by Tomas. I
> think the next step is for Justin to say whether he's OK with the
> latest version that Tomas posted. If he is, then I suggest that he
> also mark it Ready for Committer, and that Tomas commit it. If he's
> not, he should say what he wants changed and either post a new version
> himself or wait for Tomas to do that.

Yes, I think it can be Ready.  Done.

I amended some of Tomas' changes (see 0003, attached as txt).

@cfbot: the *.patch file is for your consumption, and the others are only there
to show my changes.

> I think the fact that is classified as a "Bug Fix" in the CommitFest
> application is not particularly good. I would prefer to see it
> classified under "Documentation". I'm prepared to concede that
> documentation can have bugs as a general matter, but nobody's data is
> getting eaten because the documentation wasn't updated. In fact, this
> is the fourth patch from the "bug fix" section I've studied this
> afternoon, and, well, none of them have been back-patchable code
> defects.

In fact, I consider this to be back-patchable to v10.  IMO it's an
omission that this isn't documented.  Not all bugs cause data to be eaten.  If
someone reads the existing documentation, they might conclude that their
partitioned tables don't need to be analyzed, and they would've been better
served by not reading the docs.

-- 
Justin
>From c237d91ec258ebbf24ebd3a38e139777582817f6 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 22 Jul 2021 16:06:18 -0500
Subject: [PATCH] documentation deficiencies for ANALYZE of partitioned tables

This is partially extracted from 1b5617eb844cd2470a334c1d2eec66cf9b39c41a,
which was reverted.
---
 doc/src/sgml/maintenance.sgml | 29 +
 doc/src/sgml/ref/analyze.sgml | 32 +---
 2 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 36f975b1e5b..34d72dba784 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -290,6 +290,15 @@
 to meaningful statistical changes.

 
+   
+Tuples changed in partitions and inheritance children do not trigger
+analyze on the parent table.  If the parent table is empty or rarely
+changed, it may never be processed by autovacuum, and the statistics for
+the inheritance tree as a whole won't be collected. It is necessary to
+run ANALYZE on the parent table manually in order to
+keep the statistics up to date.
+   
+

 As with vacuuming for space recovery, frequent updates of statistics
 are more useful for heavily-updated tables than for seldom-updated
@@ -347,6 +356,19 @@
  ANALYZE commands on those tables on a suitable schedule.
 

+
+   
+
+ The autovacuum daemon does not issue ANALYZE commands
+ for partitioned tables.  Inheritance parents will only be analyzed if the
+ parent itself is changed - changes to child tables do not trigger
+ autoanalyze on the parent table.  If your queries require statistics on
+ parent tables for proper planning, it is necessary to periodically run
+ a manual ANALYZE on those tables to keep the statistics
+ up to date.
+
+   
+
   
 
   
@@ -819,6 +841,13 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
 since the last ANALYZE.

 
+   
+Partitioned tables are not processed by autovacuum.  Statistics
+should be collected by running a manual ANALYZE when it is
+first populated, and again whenever the distribution of data in its
+partitions changes significantly.
+   
+

 Temporary tables cannot be accessed by autovacuum.  Therefore,
 appropriate vacuum and analyze operations should be performed via
diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index c423aeeea5e..8268ba87f63 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -263,9 +263,35 @@ ANALYZE [ VERBOSE ] [ table_and_columns
 
   
-If any of the child tables are foreign tables whose foreign data wrappers
-do not support ANALYZE, those child tables are ignored while
-gathering inheritance statistics.
+For partitioned tables, ANALYZE gathers statistics by
+sampling rows from all partitions; in addition, it will recurse into each
+partition and update its statistics.  Each leaf partition is analyzed only
+once, even with multi-level partitioning.  No statistics are collected for
+only the parent table (without data from its partitions), because with
+partitioning it's guaranteed to be empty.
+  
+
+  
+By contrast, if the table being analyzed has inheritance children,
+ANALYZE gathers two sets of statistics: one on the rows
+ 

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-15 Thread Matthias van de Meent
Apart from registering this in CF 2022-07 last Friday, I've also just
added this issue to the Open Items list for PG15 under "Older bugs
affecting stable branches", as a precaution so we don't lose track of this
issue in the buzz of the upcoming feature freeze.

-Matthias




Re: Estimating HugePages Requirements?

2022-03-15 Thread Nathan Bossart
On Tue, Mar 15, 2022 at 11:02:37PM +0100, Magnus Hagander wrote:
> I think we're talking about two different things here.
> 
> I think the "avoid extra logging" would be worth seeing if we can
> address for 15.

A simple approach could be to just set log_min_messages to PANIC before
exiting.  I've attached a patch for this.  With this patch, we'll still see
a FATAL if we try to use 'postgres -C' for a runtime-computed GUC on a
running server, and there will be no extra output as long as the user sets
log_min_messages to INFO or higher (i.e., not a DEBUG* value).  For
comparison, 'postgres -C' for a non-runtime-computed GUC does not emit
extra output as long as the user sets log_min_messages to DEBUG2 or higher.

> The "able to run on a live cluster" seems a lot bigger and more scary
> and not 15 material.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From cdfd1ad00ca1d8afbfcbafc1f684b5ba9cc43eb6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 15 Mar 2022 15:36:41 -0700
Subject: [PATCH v2 1/1] don't emit shutdown messages for 'postgres -C' with
 runtime-computed GUCs

---
 src/backend/postmaster/postmaster.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 80bb269599..bf48bc6326 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1066,6 +1066,10 @@ PostmasterMain(int argc, char *argv[])
  false, false);
 
 		puts(config_val ? config_val : "");
+
+		/* don't emit shutdown messages */
+		SetConfigOption("log_min_messages", "PANIC", PGC_INTERNAL, PGC_S_OVERRIDE);
+
 		ExitPostmaster(0);
 	}
 
-- 
2.25.1



Re: BufferAlloc: don't take two simultaneous locks

2022-03-15 Thread Yura Sokolov
On Tue, 15/03/2022 at 13:47 +0300, Yura Sokolov wrote:
> On Tue, 15/03/2022 at 16:25 +0900, Kyotaro Horiguchi wrote:
> > Thanks for the new version.
> > 
> > At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov  
> > wrote in 
> > > On Mon, 14/03/2022 at 14:57 +0300, Yura Sokolov wrote:
> > > > On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:
> > > > > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov 
> > > > >  wrote in 
> > > > > > On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> > > > > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > > > > 128kB shared buffers and I saw that get_hash_entry never takes the
> > > > > !element_alloc() path and always allocate a fresh entry, then
> > > > > saturates at 30 new elements allocated at the medium of a 100 seconds
> > > > > run.
> > > > > 
> > > > > Then, I tried the same with the patch, and I am surprized to see that
> > > > > the rise of the number of newly allocated elements didn't stop and
> > > > > went up to 511 elements after the 100 seconds run.  So I found that my
> > > > > concern was valid.  The change in dynahash actually
> > > > > continuously/repeatedly causes lack of free list entries.  I'm not
> > > > > sure how much the impact given on performance if we change
> > > > > get_hash_entry to prefer other freelists, though.
> > > > 
> > > > Well, it is quite strange SharedBufHash is not allocated as
> > > > HASH_FIXED_SIZE. Could you check what happens with this flag set?
> > > > I'll try as well.
> > > > 
> > > > Another way to reduce the observed case is to remember freelist_idx for
> > > > the reused entry. I didn't believe it matters much since entries migrate
> > > > nonetheless, but probably due to some hot buffers there is a tendency to
> > > > crowd a particular freelist.
> > > 
> > > Well, I did both. Everything looks ok.
> > 
> > Hmm. v8 returns the stashed element with the original partition index when the
> > element is *not* reused.  But what I saw in the previous test runs is
> > the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> > behaving the same way (or somehow even worse) with the previous
> > version.
> 
> v8 doesn't differ in the REMOVE case from either master or the
> previous version. It differs in the RETURNED case only.
> Or I didn't understand what you mean :(
> 
> > get_hash_entry continuously suffer lack of freelist
> > entry. (FWIW, attached are the test-output diff for both master and
> > patched)
> > 
> > master finally allocated 31 fresh elements for a 100s run.
> > 
> > > ALLOCED: 31;; freshly allocated
> > 
> > v8 finally borrowed 33620 times from another freelist and 0 freshly
> > allocated (ah, this version changes that..)
> > Finally v8 results in:
> > 
> > > RETURNED: 50806;; returned stashed elements
> > > BORROWED: 33620;; borrowed from another freelist
> > > REUSED: 1812664;; stashed
> > > ASSIGNED: 1762377  ;; reused
> > > (ALLOCED: 0);; freshly allocated
> > 
> > It contains a huge degradation by frequent elog's so they cannot be
> > naively relied on, but it should show what is happening sufficiently.
> 
> Is there any measurable performance hit caused by borrowing?
> It looks like "borrowed" happened 1.5% of the time. And that is on 128kB
> shared buffers, which is extremely small. (Or was it 128MB?)
> 
> Well, I think some spare entries could reduce borrowing if there is
> a need. I'll test on 128MB with spare entries. If there is profit,
> I'll return some, but will keep SharedBufHash fixed.

Well, I added GetMaxBackends spare items, but I don't see a clear
benefit. It is probably a bit better at 128MB shared buffers and
probably a bit worse at 1GB shared buffers (select_only on scale 100).

But that is on an old Xeon X5675. Things will probably change on more
capable hardware; I just don't have access to any at the moment.
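
(For context, keeping SharedBufHash fixed amounts to creating it with the
HASH_FIXED_SIZE flag in InitBufTable(), roughly as in the sketch below;
this is my illustration of the idea, not the actual patch.)

	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
								  size, size,
								  &info,
								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION |
								  HASH_FIXED_SIZE);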

> 
> The master branch does fewer freelist manipulations since it tries to
> insert first, and if there is a collision it doesn't delete the victim
> buffer.
> 

-

regards
Yura





Re: Estimating HugePages Requirements?

2022-03-15 Thread Magnus Hagander
On Tue, Mar 15, 2022 at 3:41 AM Michael Paquier  wrote:
>
> On Mon, Mar 14, 2022 at 10:34:17AM -0700, Nathan Bossart wrote:
> > On Mon, Mar 14, 2022 at 04:26:43PM +0100, Magnus Hagander wrote:
> >> And in fact, the command documented on
> >> https://www.postgresql.org/docs/devel/kernel-resources.html doesn't
> >> actually produce the output that the docs show, it also shows the log
> >> line, in the default config? If we can't fix the extra logging we
> >> should at least have our docs represent reality -- maybe by adding a
> >> "2>/dev/null" entry? But it'd be better to have it not output that log
> >> in the first place...
> >
> > I attached a patch to adjust the documentation for now.  This apparently
> > crossed my mind earlier [0], but I didn't follow through with it for some
> > reason.
>
> Another thing that we can add is -c log_min_messages=fatal, but my
> method is more complicated than a simple redirection, of course :)

Either does work, but yours has more characters :)


> >> (Of course what I'd really want is to be able to run it on a cluster
> >> that's running, but that was discussed downthread so I'm not bringing
> >> that one up for changes now)
> >
> > I think it is worth revisiting the extra logging and the ability to view
> > runtime-computed GUCs on a running server.  Should this be an open item for
> > v15, or do you think it's alright to leave it for the v16 development
> > cycle?
>
> Well, this is a completely new problem as it opens the door of
> potential concurrent access issues with the data directory lock file
> while reading values from the control file.  And that's not mandatory
> to be able to get those estimations without having to allocate a large
> chunk of memory, which was the primary goal discussed upthread as far
> as I recall.  So I would leave that as an item to potentially tackle
> in future versions.

I think we're talking about two different things here.

I think the "avoid extra logging" would be worth seeing if we can
address for 15.

The "able to run on a live cluster" seems a lot bigger and more scary
and not 15 material.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Skip partition tuple routing with constant partition key

2022-03-15 Thread Greg Stark
There are a whole lot of different patches in this thread.

However, this last one https://commitfest.postgresql.org/37/3270/
created by Amit seems like a fairly straightforward optimization that
can be evaluated on its own separately from the others and seems quite
mature. I'm actually inclined to set it to "Ready for Committer".

Incidentally, I did a quick read-through of the patch myself, and the
only question I have is how the parameters of the adaptive algorithm
were chosen. They seem ludicrously conservative to me, and a simple
argument about how expensive an extra check is versus the time saved
in the boolean search should be easy enough to come up with to justify
whatever values make sense.




Re: do only critical work during single-user vacuum?

2022-03-15 Thread Peter Geoghegan
On Wed, Feb 16, 2022 at 12:43 AM John Naylor
 wrote:
> I'll put some effort in finding any way that it might not be robust.
> After that, changing the message and docs is trivial.

It would be great to be able to totally drop the idea of using
single-user mode before Postgres 15 feature freeze. How's that going?

I suggest that we apply the following patch as part of that work. It
adds one last final failsafe check at the point that VACUUM makes a
final decision on rel truncation.

It seems unlikely that the patch will ever make the crucial difference
in a wraparound scenario -- in practice it's very likely that we'd
have triggered the wraparound at that point if we run into trouble
with the target rel's relfrozenxid age. And even if it does get to
that point, it would still be possible for the autovacuum launcher to
launch another autovacuum -- this time around we will avoid rel
truncation, restoring the system to normal operation (i.e. no more
xidStopLimit state).

On the other hand it's possible that lazy_cleanup_all_indexes() will
take a very long time to run, and it runs after the current final
failsafe check. An index AM's amvacuumcleanup() routine can take a
long time to run sometimes, especially with GIN indexes. And so it's
just about possible that we won't have triggered the failsafe by the
time lazy_cleanup_all_indexes() is called, which then spends a long
time doing index cleanup -- long enough for the system to reach
xidStopLimit due to the target rel's relfrozenxid age crossing the
crucial xidStopLimit crossover point.

This patch makes this problem scenario virtually impossible. Right now
I'm only prepared to say it's very unlikely. I don't see a reason to
take any chances, though.
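
For what it's worth, the shape of the change is roughly as below (my sketch
of the idea, not the attached patch; the vacuumlazy.c helper names are from
memory and may not match exactly):

	/*
	 * Re-run the failsafe check immediately before the truncation decision,
	 * so a long amvacuumcleanup() phase cannot leave truncation enabled
	 * after the failsafe threshold has been crossed.
	 */
	lazy_check_wraparound_failsafe(vacrel); /* clears do_rel_truncate if it fires */

	if (should_attempt_truncation(vacrel))
		lazy_truncate_heap(vacrel);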

-- 
Peter Geoghegan


v1-0001-Perform-final-failsafe-check-before-truncation.patch
Description: Binary data


Re: Optimize external TOAST storage

2022-03-15 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 01:03:38AM +0530, davinder singh wrote:
> Regarding the 2nd part of configuring the threshold, Based on our
> experiments, we have fixed it for the attributes with size > 2 * chunk_size.
> The default chunk_size is 2KB and the page size is 8KB. While toasting each
> attribute is divided into chunks, and each page can hold a max of 4 such
> chunks.
> We only need to think about the space used by the last chunk of the
> attribute.
> This means with each value optimization, it might use extra space in the
> range
> (0B,2KB]. I think this extra space is independent of attribute size. So we
> don't
> need to worry about configuring this threshold. Let me know if I missed
> something
> here.

Looking closer, ISTM there are two thresholds here.  The first threshold is
the minimum size of the uncompressed data required to trigger this
optimization.  The purpose behind this threshold is to avoid increasing
disk space usage in return for very little gain in the read path (i.e.,
skipping decompression).  Specifically, this threshold helps avoid
increasing disk space usage when there are many small TOAST attributes.
You've chosen 2x the TOAST chunk size as the threshold.

The second threshold is the compression ratio required to trigger this
optimization.  The purpose behind this threshold is to avoid storing poorly
compressed data so that we can improve performance in the read path (i.e.,
skipping decompression).  You've chosen >=1 TOAST chunk as the threshold.
This means that when I have an attribute with size 3x a TOAST chunk, I'll
only store it compressed when the compressed attribute is 2x a TOAST chunk
or smaller (a 33% improvement).  If I have an attribute with size 100x a
TOAST chunk, I'll store it compressed when the compressed attribute is 99x
a TOAST chunk or smaller (a 1% improvement).

IIUC this means that it is much more likely that you will store a
compressed attribute when the uncompressed data is larger.  The benefit of
this approach is that you know you'll never give up more than one TOAST
chunk of disk space in the name of faster reads.  The downside is that a
huge number of smaller attributes can require much more disk space than if
they were universally compressed.
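
To spell out the rule I'm describing, here is a small self-contained sketch
(my own illustration, not code from the patch; keep_compressed() and the
chunk_size parameter are invented names, with chunk_size standing in for the
TOAST chunk size):

#include <stdbool.h>
#include <stddef.h>

/* Return true if the externalized value should keep its compressed form. */
static bool
keep_compressed(size_t uncompressed_size, size_t compressed_size,
				size_t chunk_size)
{
	/* First threshold: small values never use the optimization. */
	if (uncompressed_size <= 2 * chunk_size)
		return true;

	/*
	 * Second threshold: keep the compressed form only when it saves at
	 * least one full TOAST chunk; otherwise store the value uncompressed
	 * so reads can skip decompression.
	 */
	return compressed_size <= uncompressed_size - chunk_size;
}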

Perhaps this is missing one additional threshold.  Perhaps there should be
a compression ratio for which the optimization is ignored.  If my
uncompressed data is 3x a TOAST chunk and compressing it would save half of
a TOAST chunk of disk space (a ~17% improvement), maybe we should still
store it compressed.  I think this threshold would be most important for
servers with many small TOAST attributes.  I don't know whether it is
necessary to make all of this configurable or whether we should just use
some heuristics to find a good balance.  In any case, I imagine the
settings should look different for the different TOAST compression methods.

Or maybe we should just make the first threshold the configuration option
for this feature.  If you set it to -1, the optimization is never used.  If
you set it to 0, you always try to save on read time at the expense of disk
space.  If you set it to 2x the TOAST chunk size, you might use up to 50%
more disk space for read path optimizations.  If you set it to 100x the
TOAST chunk size, you might use up to 1% more disk space for read
optimizations.  By default, this configuration option could be set
conservatively (e.g., 10x a TOAST chunk, which could lead to a maximum of
10% more disk space).  I suspect the increase in disk space would be much
less than these numbers in most cases, but it might be helpful to think of
these things in terms of the worst case scenario.

I apologize for thinking out loud a bit here, but I hope this gives you
some insight into my perspective on this.  In general, I am skeptical that
we can choose one threshold that will work for all PostgreSQL installations
in the known universe.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Accept IP addresses in server certificate SANs

2022-03-15 Thread Jacob Champion
On Mon, 2022-03-14 at 15:30 +0900, Kyotaro Horiguchi wrote:
> t/003_sslinfo.pl ... 1/? # Tests were run but no plan was declared and 
> done_testing() was not seen.
> # Looks like your test exited with 29 just after 6.
> t/003_sslinfo.pl ... Dubious, test returned 29 (wstat 7424, 0x1d00)
> All 6 subtests passed 
> ...
> Result: FAIL
> 
> The script complains like this:
> 
> ok 6 - ssl_client_cert_present() for connection with cert
> connection error: 'psql: error: connection to server at "127.0.0.1", port 
> 62656 failed: SSL error: tlsv1 alert unknown ca'
> while running 'psql -XAtq -d sslrootcert=ssl/root+server_ca.crt 
> sslmode=require dbname=trustdb hostaddr=127.0.0.1 user=ssltestuser 
> host=localhost -f - -v ON_ERROR_STOP=1' at 
> /home/horiguti/work/worktrees/ipsan/src/test/ssl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
>  line 1873.
> 
> So, psql looks like disliking the ca certificate.  I also will dig
> into that.

Hmm, the sslinfo tests are failing? I wouldn't have expected that based
on the patch changes. Just to confirm -- they pass for you without the
patch?

> > > Mmm. after the end of the loop, rc is non-zero only when the loop has
> > > exited by the break and otherwise rc is zero. Why isn't it equivalent
> > > to setting check_cn to false at the break?
> > 
> > check_cn can be false if rc is zero, too; it means that we found a SAN
> > of the correct type but it didn't match.
> 
> Don't we count unmatched certs as "existed"?  In that case I think we
> don't go to CN.

Unmatched names, you mean? I'm not sure I understand.

If it helps, the two tests that will fail if check_cn is unset only at
the break are

- certificate with both a CN and SANs ignores CN
- certificate with both an IP CN and IP SANs ignores CN

because none of the SANs would match in that case.

> > > Actually the patch searches for a match of IP address connhost from
> > > dNSName SANs even if iPAddress SANs exist.  I think we've not
> > > explicitly defined the behavior in that case.
> > 
> > That's a good point; I didn't change the prior behavior. I feel more
> > comfortable leaving that check, since it is technically possible to
> > push something that looks like an IP address into a dNSName SAN. We
> > should probably make an explicit decision on that, as you say.
> > 
> > But I don't think that contradicts the code comment, does it? The
> > comment is just talking about CN fallback scenarios. If you find a
> > match in a dNSName, there's no reason to fall back to the CN.
> 
> The comment explains the spec correctly. From a practical view, the
> behavior above doesn't seem to make things insecure.  So I don't have
> a strong opinion on the choice of the behaviors.
> 
> The only thing I'm concerned about here is the possibility that the decision
> corners us into some uncomfortable state between the RFC and our spec in
> the future.  On the other hand, changing the behavior could immediately make
> someone uncomfortable.
> 
> So, I'd like to leave it to committers:p

Sounds good. I'll work on adding tests for the current behavior, and if
the committers don't like it, we can change it.

> > > I supposed that we only
> > > be deviant in the case "IP address connhost and no SANs of any type
> > > exists". What do you think about it?
> > 
> > We fall back in the case of "IP address connhost and dNSName SANs
> > exist", which is prohibited by that part of RFC 6125. I don't think we
> > deviate in the case you described; can you explain further?
> 
> In that case, i.e., connhost is IP address and no SANs exist at all,
> we go to CN.  On the other hand in RFC6125:
> 
> rfc6125> In some cases, the URI is specified as an IP address rather
> rfc6125> than a hostname. In this case, the iPAddress subjectAltName
> rfc6125> must be present in the certificate and must exactly match the
> rfc6125> IP in the URI.
> 
> It (seems to me) denies that behavior.  Regardless of the existence of
> other types of SANs, iPAddress is required if connname is an IP
> address.  (That is, it doesn't seem to me that there's any context
> like "if any SANs exists", but I'm not so sure I read it perfectly.)

I see what you mean now. Yes, we deviate there as well (and have done
so for a while now). I think breaking compatibility there would
probably not go over well.

> Thanks.  All behaviors and theier reasons is now clear. So
> 
> Let's leave them for committers for now.

Thank you for the review!

--Jacob


Re: Window Function "Run Conditions"

2022-03-15 Thread Greg Stark
This looks like an awesome addition.

I have one technical question...

Is it possible to actually transform the row_number case into a LIMIT
clause or make the planner support for this case equivalent to it (in
which case we can replace the LIMIT clause planning to transform into
a window function)?

The reason I ask is because the Limit plan node is actually quite a
bit more optimized than the general window function plan node. It
calculates cost estimates based on the limit and can support Top-N
sort nodes.

But the bigger question is whether this patch is ready for a committer
to look at? Were you able to resolve Andy Fan's bug report? Did you
resolve the two questions in the original email?




Re: On login trigger: take three

2022-03-15 Thread Daniel Gustafsson
> On 14 Mar 2022, at 17:10, Andres Freund  wrote:
> 
> Hi,
> 
> On 2022-03-14 14:47:09 +0100, Daniel Gustafsson wrote:
>>> On 14 Mar 2022, at 01:08, Andres Freund  wrote:
>> 
>>> I was thinking that the way to use it would be to specify it as a client
>>> option. Like PGOPTIONS='-c ignore_event_trigger=login' psql.
>> 
>> Attached is a quick PoC/sketch of such a patch, where 0001 adds a guc, 0002 
>> is
>> the previously posted v25 login event trigger patch, and 0003 adapts is to
>> allow ignoring it (0002/0003 split only for illustrative purposes of course).
>> Is this along the lines of what you were thinking?
> 
> Yes.
> 
> Of course there's still the bogus cache invalidation stuff etc that I
> commented on upthread.

Yeah, that was the previously posted v25 from the author (who adopted it from
the original author).  I took the liberty of quickly poking at the review
comments you had left, as well as the ones that I had found, to try and progress
the patch.  0001 should really go in its own thread though, to not hide it from
anyone interested who isn't looking at this thread.

--
Daniel Gustafsson   https://vmware.com/



v27-0002-Add-a-new-login-event-and-login-event-trigger-su.patch
Description: Binary data


v27-0001-Provide-a-GUC-for-temporarily-ignoring-event-tri.patch
Description: Binary data


Re: [PATCH] pgbench: add multiconnect option

2022-03-15 Thread Greg Stark
Hi guys,

It looks like David sent a patch and Fabien sent a followup patch. But
there hasn't been a whole lot of discussion or further patches.

It sounds like there are some basic questions about what the right
interface should be. Are there specific questions that would be
helpful for moving forward?




Re: ExecTypeSetColNames is fundamentally broken

2022-03-15 Thread Robert Haas
On Tue, Dec 7, 2021 at 1:19 PM Tom Lane  wrote:
> So the alternatives I see are to revert what bf7ca1587 tried to do
> here, or to try to make it work that way across-the-board, which
> implies (a) a very much larger amount of work, and (b) breaking
> important behaviors that are decades older than that commit.
> It's not even entirely clear that we could get to complete
> consistency if we went down that path.

Continuing my tour through the "bug fixes" section of the CommitFest,
I came upon this thread. Unfortunately there's not that much I can do
to progress it, because I've already expressed all the opinions that I
have on this thread. If we back-patch Tom's originally proposed fix, I
expect we might get a complaint or two, but the current behavior of
being able to create unreadable tables is indisputably poor, and I'm
not in a position to tell Tom that he has to go write a different fix
instead, or even that such a fix is possible. Unless somebody else
wants to comment, which IMHO would be good, I think it's up to Tom to
make a decision here on how he'd like to proceed and then, probably,
just do it.

Anyone else have thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ICU for global collation

2022-03-15 Thread Daniel Verite
Finnerty, Jim wrote:

> In ICU, the "locale" is just the first part of what we can pass to the
> "locale" parameter in CREATE COLLATION - the part before the optional '@'
> delimiter.  The ICU locale does not include the secondary or tertiary
> properties,

Why not? Please see
https://unicode-org.github.io/icu/userguide/locale

It says that "a locale consists of one or more pieces of ordered
information", the pieces being a language code, a script code, a
country code, a variant code, and keywords.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-15 Thread Nathan Bossart
On Tue, Mar 15, 2022 at 11:04:26AM +0900, Michael Paquier wrote:
> On Mon, Mar 14, 2022 at 03:54:19PM +0530, Bharath Rupireddy wrote:
>> At times, the snapshot or mapping files can be large in number and one
>> some platforms it takes time for checkpoint to process all of them.
>> Having the stats about them in server logs can help us better analyze
>> why checkpoint took a long time and provide a better RCA.
> 
> Do you have any numbers to share regarding that?  Seeing information
> about 1k WAL segments being recycled and/or removed by a checkpoint
> where the operation takes dozens of seconds to complete because we can
> talk about hundreds of gigs worth of files moved around.  If we are
> talking about 100~200 files up to 10~20kB each for snapshot and
> mapping files, the information has less value, worth only a portion of
> one WAL segment.

I don't have specific numbers to share, but as noted elsewhere [0], I
routinely see lengthy checkpoints that spend a lot of time in these cleanup
tasks.

[0] https://postgr.es/m/18ED8B1F-7F5B-4ABF-848D-45916C938BC7%40amazon.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg14 psql broke \d datname.nspname.relname

2022-03-15 Thread David G. Johnston
On Tue, Mar 15, 2022 at 12:31 PM Mark Dilger 
wrote:

>
> > On Mar 15, 2022, at 12:27 PM, Robert Haas  wrote:
> >
> > - Justin Pryzby, who originally discovered the problem, prefers the
> > same behavior that I prefer long-term, but thinks Tom's behavior is
> > better than doing nothing.
> > - Mark Dilger, Isaac Moreland, Garick Hamlin, Alvaro Herrera, and
> > Julien Rouhaud have commented on the thread but have not endorsed
> > either of these dueling proposals.
>
> I vote in favor of committing the patch, though I'd also say it's not
> super important to me.
>
>
I'm on board with leaving the v14 change in place - fixing the bug so that
a matching database name is accepted (the whole copy-from-logs argument is
quite compelling).  I'm not too concerned about psql, since \d is mainly
used interactively, and since the change will result in errors in
pg_dump/pg_restore, the usual due diligence for upgrading should handle the
necessary tweaks should the case arise where bogus/ignore stuff is present.

David J.


Re: pg14 psql broke \d datname.nspname.relname

2022-03-15 Thread Justin Pryzby
On Wed, Oct 13, 2021 at 11:54:26AM -0500, Justin Pryzby wrote:
> It seems unfortunate if names from log messages qualified with datname were 
> now
> rejected.  Like this one:
> 
> | automatic analyze of table "ts.child.cdrs_2021_10_12"...

Mark mentioned this "log message" use case in his proposed commit message, but
I wanted to mention what seems like a more important parallel:

postgres=# SELECT 'postgres.public.postgres_log'::regclass;
regclass | postgres_log

postgres=# SELECT 'not.postgres.public.postgres_log'::regclass;
ERROR:  improper relation name (too many dotted names): 
not.postgres.public.postgres_log
^
postgres=# SELECT 'not.public.postgres_log'::regclass;
ERROR:  cross-database references are not implemented: "not.public.postgres_log"

I think Mark used this as the model behavior for \d for this patch, which
sounds right.  Since the "two dot" case wasn't fixed in 14.1 or 14.2, it seems
better to implement the ultimate, intended behavior now, rather than trying to
exactly match what old versions did.  I'm of the understanding that's what
Mark's patch does, so +1 from me.

I don't know how someone upgrading from an old version would know about the
change, though (rejecting junk prefixes rather than ignoring them).  *If* it
were important, it seems like it'd need to be added to the 14.0 release notes.




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-15 Thread Matthias van de Meent
On Mon, 14 Mar 2022 at 18:14, Andres Freund  wrote:
>
> Hi
>
> A random thought I had while thinking about the size limits: We could use the
> low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
> XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bits of xl_info. Which would
> allow us to e.g. get away from needing Heap2. Which would aesthetically be
> pleasing.

That would be interesting; though out of scope for this bug I'm trying to fix.

> On 2022-03-14 17:57:23 +0100, Matthias van de Meent wrote:
> > I'm not sure whether or not to include this in the test suite, though,
> > as this would require a machine with at least 1GB of memory available
> > for this test alone, and I don't know the current requirements for
> > running the test suite.
>
> We definitely shouldn't require this much RAM for the tests.
>
> It might be worth adding tests exercising edge cases around segment boundaries
> (and perhaps page boundaries) though. E.g. record headers split across pages
> and segments.
>
>
>
> > --- a/src/backend/access/transam/xloginsert.c
> > +++ b/src/backend/access/transam/xloginsert.c
> > @@ -338,10 +338,16 @@ XLogRegisterData(char *data, int len)
> >  {
> >   XLogRecData *rdata;
> >
> > - Assert(begininsert_called);
> > + Assert(begininsert_called && len >= 0 && AllocSizeIsValid(len));
>
> Shouldn't we just make the length argument unsigned?

I've applied that in the attached revision, but I'd like to note that
this makes the fix less straightforward to backpatch, as the changes
to the public function signatures shouldn't be applied in older
versions.

> > - if (num_rdatas >= max_rdatas)
> > + /*
> > +  * Check against max_rdatas; and ensure we don't fill a record with
> > +  * more data than can be replayed
> > +  */
> > + if (unlikely(num_rdatas >= max_rdatas ||
> > +  !AllocSizeIsValid((uint64) mainrdata_len + 
> > (uint64) len)))
> >   elog(ERROR, "too much WAL data");
> > +
> >   rdata = &rdatas[num_rdatas++];
>
> Personally I'd write it as unlikely(num_rdatas >= max_rdatas) || unlikely(...)
> but I doubt if it makes an actual difference to the compiler.

Agreed, updated.

> >   rdata->data = data;
> > @@ -377,7 +383,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
> >   registered_buffer *regbuf;
> >   XLogRecData *rdata;
> >
> > - Assert(begininsert_called);
> > + Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);
> >
> >   /* find the registered buffer struct */
> >   regbuf = &registered_buffers[block_id];
> > @@ -385,8 +391,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int 
> > len)
> >   elog(ERROR, "no block with id %d registered with WAL 
> > insertion",
> >block_id);
> >
> > - if (num_rdatas >= max_rdatas)
> > + /*
> > +  * Check against max_rdatas; and ensure we don't register more data 
> > per
> > +  * buffer than can be handled by the physical record format.
> > +  */
> > + if (unlikely(num_rdatas >= max_rdatas ||
> > +  regbuf->rdata_len + len > UINT16_MAX))
> >   elog(ERROR, "too much WAL data");
> > +
> >   rdata = &rdatas[num_rdatas++];
>
> Given the repeated check it might be worth to just put it in a static inline
> used from the relevant places (which'd generate less code because the same
> line number would be used for all the checks).

The check itself is slightly different in those 3 places; but the
error message is shared. Do you mean to extract the elog() into a
static inline function (as attached), or did I misunderstand?
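
(Concretely, the shared helper boils down to something like this; a sketch
only, with the actual definition in the attached patch:)

	static inline void
	XLogErrorDataLimitExceeded(void)
	{
		elog(ERROR, "too much WAL data");
	}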


-Matthias
From 9e4b70ca096469616fd1320a02cbde129a4943b6 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Fri, 11 Mar 2022 16:16:22 +0100
Subject: [PATCH v3] Add protections in xlog record APIs against large numbers
 and overflows.

Before this, it was possible for an extension to create malicious WAL records
that were too large to replay, or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops.
---
 src/backend/access/transam/xloginsert.c | 56 -
 src/include/access/xloginsert.h |  4 +-
 src/include/access/xlogrecord.h |  4 ++
 3 files changed, 52 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c260310c4c..e87642ccae 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -127,6 +127,17 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
 	   bool *topxid_included);
 static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
 	uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog 

Re: ICU for global collation

2022-03-15 Thread Daniel Verite
Julien Rouhaud wrote:

> > > While on that topic, the doc should probably mention that default ICU
> > > collations can only be deterministic.
> >
> > Well, there is no option to do otherwise, so I'm not sure where/how to
> > mention that.  We usually don't document options that don't exist. ;-)
> 
> Sure, but I'm afraid that users may still be tempted to use ICU locales like
> und-u-ks-level2 from the case_insensitive example in the doc and hope that
> it will work accordingly.

+1.

The CREATE DATABASE doc says this currently:

icu_locale

Specifies the ICU locale ID if the ICU locale provider is used.


ISTM that we need to say explicitly that this locale will be used by
default to compare all collatable strings, except that it's overruled
by a bytewise comparison to break ties in case of equality.

The idea is to describe what the backend will do with the setting
rather than saying that we don't have a nondeterministic option.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: ICU for global collation

2022-03-15 Thread Finnerty, Jim
Can we get some more consistent terminology around the term "locale"?

In ICU, the "locale" is just the first part of what we can pass to the "locale" 
parameter in CREATE COLLATION - the part before the optional '@' delimiter.  
The ICU locale does not include the secondary or tertiary properties, so it is 
usually just the country and the language, e.g. en_US (or en-US), but it can 
also be something like es_TRADITIONAL for traditional Spanish.  

I think it would be an improvement in clarity if we consistently use the term 
'locale' to mean the same thing that ICU means by that term, and not to have 
the thing that we call the "locale" also include collation modifiers, or to 
imply that a locale is the same thing as a collation.

/Jim






Re: Tablesync early exit

2022-03-15 Thread Greg Stark
This patch has been through five CFs without any review. It's a very
short patch and I'm assuming the only issue is that it's not clear
whether it's a good idea or not and there are few developers familiar
with this area of code?

Amit, it looks like you were the one who asked for it to be split off
from the logical decoding of 2PC patch in [1]. Can you summarize what
questions remain here? Should we just commit this or is there any
issue that needs to be debated?

[1] 
https://www.postgresql.org/message-id/CAA4eK1Jxu-3qxtkfA_dKoquQgGZVcB+k9_-yT5=9gdew84t...@mail.gmail.com


On Sat, 6 Mar 2021 at 20:56, Peter Smith  wrote:
>
> Hi hackers.
>
> I propose a small optimization can be added to the tablesync replication code.
>
> This proposal (and simple patch) was first discussed here [1].
>
> Basic idea is the tablesync could/should detect if there is anything
> to do *before* it enters the apply main loop. Calling
> process_sync_tables() before the apply main loop offers a quick way
> out, so the message handling will not be unnecessarily between
> workers. This will be a small optimization.
>
> But also, IMO this is a more natural separation of work. E.g. the tablesync
> worker will finish when the table is synced - not go one extra step...
>
> ~~
>
> This patch was already successfully used for several versions
> (v43-v50) of another 2PC patch [2], but it was eventually removed from
> there because, although it has its own independent value, it was not
> required for that patch series [3].
>
> 
> [1] 
> https://www.postgresql.org/message-id/CAHut%2BPtjk-Qgd3R1a1_tr62CmiswcYphuv0pLmVA-%2B2s8r0Bkw%40mail.gmail.com
> [2] 
> https://www.postgresql.org/message-id/flat/CAHut%2BPsd5nyg-HG6rGO2_5jzXuSA1Eq5%2BB5J2VJo0Q2QWi-1HQ%40mail.gmail.com#1c2683756b32e267d96b7177ba95
> [3] 
> https://www.postgresql.org/message-id/CAA4eK1Jxu-3qxtkfA_dKoquQgGZVcB%2Bk9_-yT5%3D9GDEW84TF%2BA%40mail.gmail.com
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia
>
>


--
greg




Re: Optimize external TOAST storage

2022-03-15 Thread davinder singh
Thanks, Nathan, for the review comments. Please find the updated patch.
On Sun, Mar 13, 2022 at 3:43 AM Nathan Bossart 
wrote:

> Do you think it is worth making this configurable?  I don't think it is
> outside the realm of possibility for some users to care more about disk
> space than read performance.  And maybe the threshold for this optimization
> could differ based on the workload.
>

I think we can break this into two parts.
The first part: if the user cares about disk space more than read
performance, should this optimization be disabled?
That is a good idea, we can try it; let's see what others say.

Regarding the second part, configuring the threshold: based on our
experiments, we have fixed it for attributes with size > 2 * chunk_size.
The default chunk_size is 2KB and the page size is 8KB. While toasting,
each attribute is divided into chunks, and each page can hold a maximum
of 4 such chunks. We only need to think about the space used by the last
chunk of the attribute. This means that for each optimized value, it might
use extra space in the range (0B, 2KB]. I think this extra space is
independent of the attribute size, so we don't need to worry about making
this threshold configurable. Let me know if I missed something here.


>  extern void toast_tuple_externalize(ToastTupleContext *ttc, int attribute,
>  int options);
> +extern void toast_tuple_opt_externalize(ToastTupleContext *ttc, int
> attribute,
> +int options, Datum
> old_toast_value,
> +ToastAttrInfo *old_toast_attr);
>
> Could we bake this into toast_tuple_externalize() so that all existing
> callers benefit from this optimization?  Is there a reason to export both
> functions?  Perhaps toast_tuple_externalize() should be renamed and made
> static, and then this new function could be called
> toast_tuple_externalize() (which would be a wrapper around the internal
> function).
>
This function is used only in heap_toast_insert_or_update(); all existing
callers use only the new function. As you suggested, I have renamed the new
function to be the wrapper and am exporting only the new function.


>
> +/* Sanity check: if data is not compressed then we can proceed as
> usual. */
> +if (!VARATT_IS_COMPRESSED(DatumGetPointer(*value)))
> +toast_tuple_externalize(ttc, attribute, options);
>
> With a --enable-cassert build, this line causes assertion failures in the
> call to GetMemoryChunkContext() from pfree().  'make check' is enough to
> reproduce it.  Specifically, it fails the following assertion:
>
> AssertArg(MemoryContextIsValid(context));
>

Thanks for pointing it out. This failure started because I was not handling
the case where the data is already compressed even before toasting. The
following check verifies whether the data is compressed, but that is not
enough, because we can't optimize the toasting if we didn't get the data
from the source in uncompressed form in the first place.
+/* If data is not compressed then we can proceed as usual. */
+if (!VARATT_IS_COMPRESSED(DatumGetPointer(*value)))

The v1 patch didn't have this problem because it verified whether the data
had been compressed in this toasting round or not. I have changed back to
the earlier version.
+/* If data is not compressed then we can proceed as usual. */
+if (*value == orig_toast_value)



-- 
Regards,
Davinder
EnterpriseDB: http://www.enterprisedb.com


v3_0001_optimize_external_toast_storage.patch
Description: Binary data


Re: pg14 psql broke \d datname.nspname.relname

2022-03-15 Thread Mark Dilger



> On Mar 15, 2022, at 12:27 PM, Robert Haas  wrote:
> 
> - Justin Pryzby, who originally discovered the problem, prefers the
> same behavior that I prefer long-term, but thinks Tom's behavior is
> better than doing nothing.
> - Mark Dilger, Isaac Moreland, Garick Hamlin, Alvaro Herrera, and
> Julien Rouhaud have commented on the thread but have not endorsed
> either of these dueling proposals.

I vote in favor of committing the patch, though I'd also say it's not super 
important to me.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Proposal: Support custom authentication methods using hooks

2022-03-15 Thread samay sharma
Hi,

On Fri, Mar 4, 2022 at 11:15 AM Jacob Champion  wrote:

> On Thu, 2022-03-03 at 11:12 +0100, Peter Eisentraut wrote:
> > At the moment, it is not possible to judge whether the hook interface
> > you have chosen is appropriate.
> >
> > I suggest you actually implement the Azure provider, then make the hook
> > interface, and then show us both and we can see what to do with it.
>
> To add a data point here, I've rebased my OAUTHBEARER experiment [1] on
> top of this patchset. (That should work with Azure's OIDC provider, and
> if it doesn't, I'd like to know why.)
>

Firstly, thanks for doing this. It helps to have another data point and the
feedback you provided is very valuable. I've looked to address it with the
patchset attached to this email.

This patch-set adds the following:

* Allow multiple custom auth providers to be registered (Addressing
feedback from Aleksander and Andrew)
* Modify the test extension to use SCRAM to exchange secrets (Based on
Andres's suggestion)
* Add support for custom auth options to configure provider's behavior (by
exposing a new hook) (Required by OAUTHBEARER)
* Allow custom auth methods to use usermaps. (Required by OAUTHBEARER)


> After the port, here are the changes I still needed to carry in the
> backend to get the tests passing:
>
> - I needed to add custom HBA options to configure the provider.
>

Could you try to rebase your patch to use the options hook and let me know
if it satisfies your requirements?

Please let me know if there's any other feedback.

Regards,
Samay


> - I needed to declare usermap support so that my provider could
> actually use check_usermap().

- I had to modify the SASL mechanism registration to allow a custom
> maximum message length, but I think that's not the job of Samay's
> proposal to fix; it's just a needed improvement to CheckSASLAuth().
>
> Obviously, the libpq frontend still needs to understand how to speak
> the new SASL mechanism. There are third-party SASL implementations that
> are plugin-based, which could potentially ease the pain here, at the
> expense of a major dependency and a very new distribution model.
>
> --Jacob
>
> [1]
> https://www.postgresql.org/message-id/flat/d1b467a78e0e36ed85a09adf979d04cf124a9d4b.camel%40vmware.com
>


v3-0001-Add-support-for-custom-authentication-methods.patch
Description: Binary data


v3-0004-Add-support-for-map-and-custom-auth-options.patch
Description: Binary data


v3-0002-Add-sample-extension-to-test-custom-auth-provider.patch
Description: Binary data


v3-0003-Add-tests-for-test_auth_provider-extension.patch
Description: Binary data


Re: pg14 psql broke \d datname.nspname.relname

2022-03-15 Thread Robert Haas
Continuing my pass through the "bug fixes" section of the CommitFest,
I came upon this patch, which is contested. Here is my attempt to
summarize where things stand. As I understand it:

- Tom wants to revert to the previous behavior of accepting arbitrary
garbage, so that \d slkgjskld.jgdsjhgjklsdhg.saklasgh.foo.bar means \d
foo.bar.
- I want \d mydb.foo.bar to mean \d foo.bar if the dbname is mydb and
report an error otherwise; anything with dots>2 is also an error in my
view.
- Peter Geoghegan agrees with Tom.
- Stephen Frost agrees with me.
- Vik Fearing also agrees with me.
- Justin Pryzby, who originally discovered the problem, prefers the
same behavior that I prefer long-term, but thinks Tom's behavior is
better than doing nothing.
- Mark Dilger, Isaac Moreland, Garick Hamlin, Alvaro Herrera, and
Julien Rouhaud have commented on the thread but have not endorsed
either of these dueling proposals.

By my count, that's probably a vote of 4-2 in favor of my preferred
solution, but it depends on whether you count Justin's vote as +1 for
my preferred solution or maybe +0.75 or +0.50 or something. At any
rate, it's close.

If anyone else would like to take a position, please do so in the next
few days. If there are no more votes, I'm going to proceed with trying
to fix up Mark's patch implementing my preferred solution and getting
it committed.

Thanks,

...Robert




Re: Kerberos delegation support in libpq and postgres_fdw

2022-03-15 Thread Jacob Champion
On Fri, 2022-03-11 at 19:39 -0500, Stephen Frost wrote:
> 
> On Fri, Mar 11, 2022 at 18:55 Jacob Champion  wrote:
> > 
> > [5] says we have to free the proxy credential with GSS_Release_cred();
> > I don't see that happening anywhere, but I may have missed it.
> 
> I’m not sure that it’s really necessary or worthwhile to do that at
> process end since … the process is about to end. I suppose we could
> provide a function that a user could call to ask for it to be
> released sooner if we really wanted..?

Do we have to keep the credential handle around once we've stored it
into the MEMORY: cache, though? Just seems like a leak that someone
will have to plug eventually, even if it doesn't really impact things
now.

> > It seems like there should be significant security implications to
> > allowing delegation across the board. Especially since one postgres_fdw
> > might delegate to another server, and then another... Should this be
> > opt-in, maybe via a connection parameter?
> 
> This is already opt-in- at kinit time a user can decide if they’d
> like a proxy-able ticket or not.  I don’t know that we really need to
> have our own option for it … tho I’m not really against adding such
> an option either.

I don't really have experience with the use case. Is it normal for
kinit users to have to decide once, globally, whether they want
everything they interact with to be able to proxy their credentials? It
just seems like you'd want more fine-grained control over who gets to
masquerade as you.

> > Similarly, it feels a little strange that the server would allow the
> > client to unilaterally force the use of a delegated credential. I think
> > that should be opt-in on the server side too, unless there's some
> > context I'm missing around why that's safe.
> 
> Perhaps you could explain what isn’t safe about accepting a delegated
> credential from a client..?  I am not aware of a risk in accepting
> such a delegated credential.

My initial impression is that this is effectively modifying the USER
MAPPING that the admin has set up. I'd be worried about an open
credential proxy being used to bypass firewall or HBA restrictions, for
instance -- you might not be able to connect as an admin from your
machine, but you might be able to connect by bouncing through a proxy.
(What damage you can do is going to be limited by what the server
extensions can do, of course.)

Another danger might be disclosure/compromise of middlebox secrets? Is
it possible for someone who has one half of the credentials to snoop on
a gssenc connection between the proxy Postgres and the backend
Postgres?

> Even so, I’m not against adding an option… but exactly how would that
> option be configured?  Server level?  On the HBA line?  role level..?

In the OPTIONS for CREATE SERVER, maybe? At least for the FDW case.

> > If I'm reading it right, we're resetting the default credential in the
> > MEMORY cache, so if you're a libpq client doing your own GSSAPI work,
> > I'm guessing you might not be happy with this behavior.
> 
> This is just done on the server side and not the client side..?

Yeah, I misread the patch, sorry.

> > Also, we're
> > globally ignoring whatever ccache was set by an administrator. Can't
> > two postgres_fdw connections from the same backend process require
> > different settings?
> 
> Settings..? Perhaps, but delegated credentials aren’t really 
> settings, so not really sure what you’re suggesting here.

I mean that one backend server might require delegated credentials, and
another might require whatever the admin has already set up in the
ccache, and the user might want to use tables from both servers in the
same session.

> > I notice that gss_store_cred_into() has a companion,
> > gss_acquire_cred_from(). Is it possible to use that to pull out our
> > delegated credential explicitly by name, instead of stomping on the
> > global setup?
> 
> Not really sure what is meant here by global setup..?  Feeling like
> this is a follow on confusion from maybe mixing server vs client
> libpq?

By my reading, the gss_store_cred_into() call followed by
the setenv("KRB5CCNAME", ...) is effectively performing global
configuration for the process. Any KRB5CCNAME already set up by the
server admin is going to be ignored from that point onward. Is that
accurate?

Thanks,
--Jacob


Re: refactoring basebackup.c

2022-03-15 Thread Robert Haas
On Tue, Mar 15, 2022 at 6:33 AM Jeevan Ladhe  wrote:
> I get following error at my end:
>
> $ pg_basebackup -D /tmp/zstd_bk -Ft -Xfetch --compress=server-zstd:7@4
> pg_basebackup: error: could not initiate base backup: ERROR:  could not 
> compress data: Unsupported parameter
> pg_basebackup: removing data directory "/tmp/zstd_bk"
>
> This is mostly because I have the zstd library version v1.4.4, which
> does not have default support for parallel workers. Maybe we should
> have a better error, something that is hinting that the parallelism is
> not supported by the particular build.

I'm not averse to trying to improve that error message, but honestly
I'd consider that to be good enough already to be acceptable. We could
think about trying to add an errhint() telling you that the problem
may be with your libzstd build.

> The regression for pg_verifybackup test 008_untar.pl also fails with a
> similar error. Here, I think we should have some logic in regression to
> skip the test if the parameter is not supported?

Or at least to have the test not fail.

> Also, just a thought, for the versions where parallelism is not
> supported, should we instead just throw a warning and fall back to
> non-parallel behavior?

I don't think so. I think it's better for the user to get an error and
then change their mind and request something we can do.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-15 Thread Robert Haas
On Tue, Mar 15, 2022 at 1:26 PM Ashutosh Sharma  wrote:
> > On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma  
> > wrote:
> > > Few comments on the latest patch:
> > >
> > > -   /* We need to construct the pathname for this database */
> > > -   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > > +   if (xlrec->dbid != InvalidOid)
> > > +   dbpath = GetDatabasePath(xlrec->dbid, 
> > > xlrec->tsid);
> > > +   else
> > > +   dbpath = pstrdup("global");
> > >
> > > Do we really need this change? Is GetDatabasePath() alone not capable
> > > of handling it?
> >
> > Well, I mean, that function has a special case for
> > GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
> > is 0.
> >
>
> Wouldn't this be true only in case of a shared map file (when dbOid is
> Invalid and tblspcOid is globaltablespace_oid) or am I missing
> something?

*facepalm*

Good catch, sorry that I'm slow on the uptake today.

v3 attached.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


relmap-refactor-rmh-v3.patch
Description: Binary data


Re: ICU for global collation

2022-03-15 Thread Robert Haas
On Tue, Mar 15, 2022 at 12:58 PM Peter Eisentraut
 wrote:
> On 14.03.22 19:57, Robert Haas wrote:
> > 1. What will happen if I set the ICU collation to something that
> > doesn't match the libc collation? How bad are the consequences?
>
> These are unrelated, so there are no consequences.

Can you please elaborate on this?

> > 2. If I want to avoid a mismatch between the two, then I will need a
> > way to figure out which libc collation corresponds to a given ICU
> > collation. How do I do that?
>
> You can specify the same name for both.

Hmm. If every name were valid in both systems, I don't think you'd be
proposing two fields.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-15 Thread Ashutosh Sharma
On Tue, Mar 15, 2022 at 10:17 PM Robert Haas  wrote:
>
> On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma  
> wrote:
> >
> > Few comments on the latest patch:
> >
> > -   /* We need to construct the pathname for this database */
> > -   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > +   if (xlrec->dbid != InvalidOid)
> > +   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > +   else
> > +   dbpath = pstrdup("global");
> >
> > Do we really need this change? Is GetDatabasePath() alone not capable
> > of handling it?
>
> Well, I mean, that function has a special case for
> GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
> is 0.
>

Wouldn't this be true only in case of a shared map file (when dbOid is
Invalid and tblspcOid is globaltablespace_oid) or am I missing
something?

--
With Regards,
Ashutosh Sharma.




Re: ICU for global collation

2022-03-15 Thread Peter Eisentraut



On 14.03.22 19:57, Robert Haas wrote:

1. What will happen if I set the ICU collation to something that
doesn't match the libc collation? How bad are the consequences?


These are unrelated, so there are no consequences.


2. If I want to avoid a mismatch between the two, then I will need a
way to figure out which libc collation corresponds to a given ICU
collation. How do I do that?


You can specify the same name for both.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-15 Thread Robert Haas
On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma  wrote:
>
> Few comments on the latest patch:
>
> -   /* We need to construct the pathname for this database */
> -   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> +   if (xlrec->dbid != InvalidOid)
> +   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> +   else
> +   dbpath = pstrdup("global");
>
> Do we really need this change? Is GetDatabasePath() alone not capable
> of handling it?

Well, I mean, that function has a special case for
GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
is 0.

> I think we can shorten these function names to probably
> ScanSourceDBPgClassRel(), ScanSourceDBPgClassTuple() and likewise?

We could, but I don't think it's an improvement.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Fix various spelling errors

2022-03-15 Thread Tom Lane
"Kekalainen, Otto"  writes:
> Please next time use `git am` to import the patch so that author and other
> commit metadata is kept, or if you apply patches manually then commit with 
> `git
> --author` so that original author will be correct in the commit and your name
> will be only in the committer field.

This is not our practice.  We credit authors in the body of the commit
message, but we don't worry about the git metadata.  git is a tool we
happen to be using at the moment, but it doesn't run the project,
and in any case it's far short of being adequate for such a purpose.
What would you do with multiple-author patches, for a start?

regards, tom lane




Re: Corruption during WAL replay

2022-03-15 Thread Robert Haas
On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
 wrote:
> The attached is the fixed version and it surely works with the repro.

Hi,

I spent the morning working on this patch and came up with the
attached version. I wrote substantial comments in RelationTruncate(),
where I tried to make it more clear exactly what the bug is here, and
also in storage/proc.h, where I tried to clarify both the use of the
DELAY_CHKPT_* flags in general terms. If nobody is too sad about this
version, I plan to commit it.

I think it should be back-patched, too, but that looks like a bit of a
pain. I think every back-branch will require different adjustments.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v5-0001-Fix-possible-recovery-trouble-if-TRUNCATE-overlap.patch
Description: Binary data


Re: [PATCH] Fix various spelling errors

2022-03-15 Thread Kekalainen, Otto

On 2022-03-14, 19:47, "Michael Paquier"  wrote:
>It is useful to group that together.  I have gathered everything that
>   looked like a typo or a grammar mistake, and applied the fixes.

That was quick, thanks!

Please next time use `git am` to import the patch so that author and other
commit metadata is kept, or if you apply patches manually then commit with `git
--author` so that original author will be correct in the commit and your name
will be only in the committer field.

This was just a spelling fix so I don't care, but for other people making more
significant contributions, getting the git authorship right and having a PostgreSQL
contribution show up on their GitHub or GitLab profile (or in other tools that
read git commits) can be important, and maybe the only reward/credit for those
doing open source in their free time.

Again, personally I don't care and I don't need Github credits for this, just
stating as general advice for a better contribution process in the future.

- Otto



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-15 Thread Ashutosh Sharma
Few comments on the latest patch:

-   /* We need to construct the pathname for this database */
-   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+   if (xlrec->dbid != InvalidOid)
+   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+   else
+   dbpath = pstrdup("global");

Do we really need this change? Is GetDatabasePath() alone not capable
of handling it?

==

+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+                                                       Oid tbid, Oid dbid,
+                                                       char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+                                           Oid dbid, char *srcpath,
+                                           List *rnodelist, Snapshot snapshot);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);

I think we can shorten these function names to probably
ScanSourceDBPgClassRel(), ScanSourceDBPgClassTuple() and likewise?

--
With Regards,
Ashutosh Sharma.

On Tue, Mar 15, 2022 at 3:24 PM Dilip Kumar  wrote:
>
> On Mon, Mar 14, 2022 at 10:04 PM Robert Haas  wrote:
>
> > I think it would make sense to have two different WAL records e.g.
> > XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's
> > easy to see how this could be generalized to other strategies in the
> > future.
>
> Done that way.  In dbase_desc(), for XLOG_DBASE_CREATE_FILE_COPY I
> have kept the older description i.e. "copy dir" and for
> XLOG_DBASE_CREATE_WAL_LOG it is "create dir", because logically the
> first one is actually copying and the second one is just creating the
> directory.  Do you think we should be using "copy dir file_copy" and
> "copy dir wal_log" in the description as well?
>
> > On Mon, Mar 14, 2022 at 12:04 PM Robert Haas  wrote:
> > > I was looking at 0001 and 0002 again and realized that I swapped the
> > > names load_relmap_file() and read_relmap_file() from what I should
> > > have done. Here's a revised version. With this, read_relmap_file() and
> > > write_relmap_file() become functions that just read and write the file
> > > without touching any global variables, and load_relmap_file() is the
> > > function that reads data from the file and puts it into a global
> > > variable, which seems more sensible than the way I had it before.
>
> Okay, I have included this patch and rebased other patches on top of that.
>
> > Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think
> > 'bool permanent' would be better (note BM_PERMANENT). This would
> > involve reversing true and false.
>
> Okay changed.
>
> > Regarding 0005:
> >
> > +   CREATEDB_WAL_LOG = 0,
> > +   CREATEDB_FILE_COPY = 1
> >
> > I still think you don't need = 0 and = 1 here.
>
> Done
>
> > I'll probably go through and do a pass over the comments once you post
> > the next version of this. There seems to be work needed in a bunch of
> > places, but it probably makes more sense for me to go through and
> > adjust the things that seem to need it rather than listing a bunch of
> > changes for you to make.
>
> Sure, thanks.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com




Re: Add 64-bit XIDs into PostgreSQL 15

2022-03-15 Thread Michael Banck
Hi,

On Mon, Mar 14, 2022 at 01:32:04PM +0300, Aleksander Alekseev wrote:
> IMO v16-0001 and v16-0002 are in pretty good shape and are as much as
> we are going to deliver in PG15. I'm going to change the status of the
> CF entry to "Ready for Committer" somewhere this week unless someone
> believes v16-0001 and/or v16-0002 shouldn't be merged.

Not sure, but if you want more people to look at them, probably best
would be to start a new thread with just the v15-target patches. Right
now, one has to download your tarball, extract it and look at the
patches in there.

I hope v16-0001 and v16-0002 are small enough (I didn't do the above)
that they can just be attached normally?


Michael

-- 
Michael Banck
Team Lead PostgreSQL
Project Manager
Tel.: +49 2166 9901-171
Mail: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Geoff Richardson, Peter Lilley

Our handling of personal data is subject to:
https://www.credativ.de/en/contact/privacy/




Re: BufferAlloc: don't take two simultaneous locks

2022-03-15 Thread Yura Sokolov
В Вт, 15/03/2022 в 13:47 +0300, Yura Sokolov пишет:
> В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > ...
> > > Strange thing: both master and patched version has higher
> > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > than in first october version [1]. But lower tps at higher
> > > connections number (>= 191 clients).
> > > I'll try to bisect on master this unfortunate change.
> > 
> > The reversing of the preference order between freshly-allocation and
> > borrow-from-another-freelist might affect.
> 
> `master` changed its behaviour as well.
> It is not a problem of the patch at all.

Looks like there is no issue: the old commit 2d44dee0281a1abf
behaves similarly to the new one at the moment.

I think something changed in the environment.
I remember there was maintenance downtime in the autumn.
Perhaps the kernel was updated or some sysctl tuning changed.



regards
Yura.





Re: Change the csv log to 'key:value' to facilitate the user to understanding and processing of logs

2022-03-15 Thread Jan Wieck

On 3/15/22 10:12, Andrew Dunstan wrote:


On 3/15/22 09:30, hubert depesz lubaczewski wrote:

On Tue, Mar 15, 2022 at 09:31:19AM +0800, lupeng wrote:

Dear Hackers
When I audit the Postgresql database recently, I found that after
configuring the log type as csv, the output log content is as follows:
"database ""lp_db1"" does not exist","DROP DATABASE
lp_db1;",,"dropdb, dbcommands.c:841","","client backend",,0 It is very
inconvenient to understand the real meaning of each field. And in the
log content," is escaped as "", which is not friendly to regular
expression matching. Therefore, I want to modify the csv log function,
change its format to key:value, assign the content of the non-existing
field to NULL, and at the same time, " will be escaped as  \" in the
log content. After the modification, the above log format is as
follows: Log_time:"2022-03-15 09:17:55.289
CST",User_name:"postgres",Database_name:"lp_db",Process_id:"17995",Remote_host:"192.168.88.130",Remote_port:"38402",Line_number:
"622fe941.464b",PS_display:"DROP
DATABASE",Session_start_timestamp:"2022-03-15 09:17:53
CST",Virtual_transaction_id:"3/2",Transaction_id:"NULL",Error_severity:"ERROR",SQL_state_code
:"3D000",Errmessage:"database \"lp_db1\" does not
exist",Errdetail:"NULL",Errhint:"NULL",Internal_query:"NULL",Internal_pos:"0",Errcontext:"NULL",User_query
:"DROP DATABASE lp_db1;",Cursorpos:"NULL",File_location:"dropdb,
dbcommands.c:841",Application_name:"NULL",Backend_type:"client
backend",Leader_PID:"0",Query_id:"0"

CSV format is well documented
(https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-CSVLOG).

If you want named fields you can wait for pg15 and its jsonlog
(https://www.depesz.com/2022/01/17/waiting-for-postgresql-15-introduce-log_destinationjsonlog/).

I, for one, wouldn't want to have to deal with field names repeated in
every single record.



Indeed. And even if this were a good idea, which it's not, it would be
15 years too late.


Also, the CSV format, while human readable to a degree, wasn't meant for 
direct human consumption. It was meant to be read by programs, and at 
the time, CSV made the most sense.



Regards, Jan




Re: Change the csv log to 'key:value' to facilitate the user to understanding and processing of logs

2022-03-15 Thread Andrew Dunstan


On 3/15/22 09:30, hubert depesz lubaczewski wrote:
> On Tue, Mar 15, 2022 at 09:31:19AM +0800, lupeng wrote:
>> Dear Hackers
>> When I audit the Postgresql database recently, I found that after
>> configuring the log type as csv, the output log content is as follows:
>> "database ""lp_db1"" does not exist","DROP DATABASE
>> lp_db1;",,"dropdb, dbcommands.c:841","","client backend",,0 It is very
>> inconvenient to understand the real meaning of each field. And in the
>> log content," is escaped as "", which is not friendly to regular
>> expression matching. Therefore, I want to modify the csv log function,
>> change its format to key:value, assign the content of the non-existing
>> field to NULL, and at the same time, " will be escaped as  \" in the
>> log content. After the modification, the above log format is as
>> follows: Log_time:"2022-03-15 09:17:55.289
>> CST",User_name:"postgres",Database_name:"lp_db",Process_id:"17995",Remote_host:"192.168.88.130",Remote_port:"38402",Line_number:
>> "622fe941.464b",PS_display:"DROP
>> DATABASE",Session_start_timestamp:"2022-03-15 09:17:53
>> CST",Virtual_transaction_id:"3/2",Transaction_id:"NULL",Error_severity:"ERROR",SQL_state_code
>> :"3D000",Errmessage:"database \"lp_db1\" does not
>> exist",Errdetail:"NULL",Errhint:"NULL",Internal_query:"NULL",Internal_pos:"0",Errcontext:"NULL",User_query
>> :"DROP DATABASE lp_db1;",Cursorpos:"NULL",File_location:"dropdb,
>> dbcommands.c:841",Application_name:"NULL",Backend_type:"client
>> backend",Leader_PID:"0",Query_id:"0"
> CSV format is well documented
> (https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-CSVLOG).
>
> If you want named fields you can wait for pg15 and its jsonlog
> (https://www.depesz.com/2022/01/17/waiting-for-postgresql-15-introduce-log_destinationjsonlog/).
>
> I, for one, wouldn't want to have to deal with field names repeated in
> every single record.
>

Indeed. And even if this were a good idea, which it's not, it would be
15 years too late.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





RE: Skipping logical replication transactions on subscriber side

2022-03-15 Thread osumi.takami...@fujitsu.com
On Tuesday, March 15, 2022 3:13 PM Masahiko Sawada  
wrote:
> I've attached an updated version patch.

A couple of minor comments on v14.

(1) apply_handle_commit_internal


+   if (is_skipping_changes())
+   {
+   stop_skipping_changes();
+
+   /*
+* Start a new transaction to clear the subskipxid, if not 
started
+* yet. The transaction is committed below.
+*/
+   if (!IsTransactionState())
+   StartTransactionCommand();
+   }
+

I suppose we can move this condition check and stop_skipping_changes() call
to the inside of the block we enter when IsTransactionState() returns true.

As the comment of apply_handle_commit_internal() mentions,
it's the helper function for apply_handle_commit() and
apply_handle_stream_commit().

Then, I don't think either caller reaches
apply_handle_commit_internal() without having opened a transaction.
For applying spooled messages, we call begin_replication_step as well.

I may be missing something, but the case where we receive a COMMIT message
without having opened a transaction would be that of empty transactions
in which the subscription (and its apply worker) is not interested.
If this is true, the patch's code currently includes
such cases within the scope of the is_skipping_changes() check.

(2) clear_subscription_skip_lsn's comments.

The comments for this function shouldn't mention
updating the origin state, now that we don't update it.

+/*
+ * Clear subskiplsn of pg_subscription catalog with origin state updated.
+ *


This applies to other comments.

+   /*
+* Update the subskiplsn of the tuple to InvalidXLogRecPtr.  If user has
+* already changed subskiplsn before clearing it we don't update the
+* catalog and don't advance the replication origin state.  
...
+*We can reduce the possibility by
+* logging a replication origin WAL record to advance the origin LSN
+* instead but there is no way to advance the origin timestamp and it
+* doesn't seem to be worth doing anything about it since it's a very 
rare
+* case.
+*/



Best Regards,
Takamichi Osumi



Re: speed up a logical replica setup

2022-03-15 Thread Peter Eisentraut

On 21.02.22 13:09, Euler Taveira wrote:
A new tool called pg_subscriber does this conversion and is tightly 
integrated

with Postgres.


Are we comfortable with the name pg_subscriber?  It seems too general.
Are we planning other subscriber-related operations in the future?  If
so, we should at least make this one use a --create option or
something like that.

 doc/src/sgml/ref/pg_subscriber.sgml

Attached is a patch that reorganizes the man page a bit.  I moved the
description of the steps to the Notes section and formatted it
differently.  I think the steps are interesting but not essential for
the using of the program, so I wanted to get them out of the main
description.

 src/bin/pg_subscriber/pg_subscriber.c

+   if (made_temp_replslot)
+   {
+   conn = connect_database(dbinfo[0].pubconninfo, true);
+   drop_replication_slot(conn, [0], temp_replslot);
+   disconnect_database(conn);
+   }

Temp slots don't need to be cleaned up.

+/*
+ * Obtain the system identifier from control file. It will be used to 
compare
+ * if a data directory is a clone of another one. This routine is used 
locally

+ * and avoids a replication connection.
+ */
+static char *
+get_control_from_datadir(const char *datadir)

This could return uint64 directly, without string conversion.
get_sysid_from_conn() could then convert to uint64 internally.

+   {"verbose", no_argument, NULL, 'v'},

I'm not sure if the --verbose option is all that useful.

+   {"stop-subscriber", no_argument, NULL, 1},

This option doesn't seem to be otherwise supported or documented.

+   pub_sysid = pg_malloc(32);
+   pub_sysid = get_sysid_from_conn(dbinfo[0].pubconninfo);
+   sub_sysid = pg_malloc(32);
+   sub_sysid = get_control_from_datadir(subscriber_dir);

These mallocs don't appear to be of any use.

+   dbname_conninfo = pg_malloc(NAMEDATALEN);

This seems wrong.

Overall, this code could use a little bit more structuring.  There are
a lot of helper functions that don't seem to do a lot and are mostly
duplicate runs-this-SQL-command calls.  But the main() function is
still huge.  There is room for refinement.

 src/bin/pg_subscriber/t/001_basic.pl

Good start, but obviously, we'll need some real test cases here also.

 src/bin/initdb/initdb.c
 src/bin/pg_ctl/pg_ctl.c
 src/common/file_utils.c
 src/include/common/file_utils.h

I recommend skipping this refactoring.  The readfile() function from
pg_ctl is not general enough to warrant the pride of place of a
globally available function.  Note that it is specifically geared
toward some of pg_ctl's requirements, for example that the underlying
file can change while it is being read.

The requirements of pg_subscriber can be satisfied more easily: Just
call pg_ctl to start the server.  You are already using that in
pg_subscriber.  Is there a reason it can't be used here as well?
From a0ad5fdaddc17ef74594bfd3c65c777649d1544b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 15 Mar 2022 14:39:19 +0100
Subject: [PATCH] fixup! Create a new logical replica from a base backup or
 standby server.

---
 doc/src/sgml/ref/pg_subscriber.sgml | 172 
 1 file changed, 99 insertions(+), 73 deletions(-)

diff --git a/doc/src/sgml/ref/pg_subscriber.sgml 
b/doc/src/sgml/ref/pg_subscriber.sgml
index e68a19092e..df63c6a993 100644
--- a/doc/src/sgml/ref/pg_subscriber.sgml
+++ b/doc/src/sgml/ref/pg_subscriber.sgml
@@ -43,79 +43,6 @@ Description
replication connections from the target server (known as subscriber server).
The target server should accept local logical replication connection.
   
-
-  
-   The transformation proceeds in eight steps. First,
-   pg_subscriber checks if the given target data
-   directory has the same system identifier than the source data directory.
-   Since it uses the recovery process as one of the steps, it starts the target
-   server as a replica from the source server. If the system identifier is not
-   the same, pg_subscriber will terminate with an
-   error.
-  
-
-  
-   Second, pg_subscriber checks if the target data
-   directory is used by a standby server. Stop the standby server if it is
-   running. One of the next steps is to add some recovery parameters that
-   requires a server start. This step avoids an error.
-  
-
-  
-   Next, pg_subscriber creates one replication slot
-   for each specified database on the source server. The replication slot name
-   contains a pg_subscriber prefix. These replication slots
-   will be used by the subscriptions in a future step.  Another replication
-   slot is used to get a consistent start location. This consistent LSN will be
-   used (a) as a stopping point in the  parameter and (b) by the subscriptions
-   as a replication starting point. It guarantees that no transaction will be
-   lost.
-  
-
-  
-   Next, write recovery parameters into the target data directory and start the
-   target server. It specifies a LSN (consistent LSN that was obtained in the
-   

Re: Add pg_freespacemap extension sql test

2022-03-15 Thread Dong Wook Lee
On Fri, Mar 11, 2022 at 2:51 PM, Michael Paquier wrote:
>
> On Wed, Mar 09, 2022 at 08:13:15PM +0900, Dong Wook Lee wrote:
> > I agree with you, but I have no good idea how to deal with it.
>
> Well, my guess is that you basically just care about being able to
> detect if there is free space in the map or not, which goes down to
> detecting if pg_freespace() returns 0 or a number strictly higher than
> 0, so wouldn't it be enough to stick some > 0 in your test queries?
> Btw, if you want to test 32-bit builds, gcc allows that by passing
> down -m32.
>
> > Can the Perl TAP test be a good way?
>
> That does not seem necessary here.
> --
> Michael

So, you mean it's not necessary to add cases for negative numbers or
beyond the range?
I just wrote down testable cases, and if it doesn't have a big
advantage, I don't mind not adding that case.
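
For what it's worth, the kind of check being suggested could look something
like this (a sketch, using a hypothetical test table):

VACUUM test_rel;
-- assert only that some free space is reported, rather than an exact
-- number, so the result is stable across different block sizes
SELECT count(*) > 0 AS has_free_space
  FROM pg_freespace('test_rel')
 WHERE avail > 0;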




Re: Change the csv log to 'key:value' to facilitate the user to understanding and processing of logs

2022-03-15 Thread hubert depesz lubaczewski
On Tue, Mar 15, 2022 at 09:31:19AM +0800, lupeng wrote:
> Dear Hackers
> When I audit the Postgresql database recently, I found that after
> configuring the log type as csv, the output log content is as follows:
> "database ""lp_db1"" does not exist","DROP DATABASE
> lp_db1;",,"dropdb, dbcommands.c:841","","client backend",,0 It is very
> inconvenient to understand the real meaning of each field. And in the
> log content," is escaped as "", which is not friendly to regular
> expression matching. Therefore, I want to modify the csv log function,
> change its format to key:value, assign the content of the non-existing
> field to NULL, and at the same time, " will be escaped as  \" in the
> log content. After the modification, the above log format is as
> follows: Log_time:"2022-03-15 09:17:55.289
> CST",User_name:"postgres",Database_name:"lp_db",Process_id:"17995",Remote_host:"192.168.88.130",Remote_port:"38402",Line_number:
> "622fe941.464b",PS_display:"DROP
> DATABASE",Session_start_timestamp:"2022-03-15 09:17:53
> CST",Virtual_transaction_id:"3/2",Transaction_id:"NULL",Error_severity:"ERROR",SQL_state_code
> :"3D000",Errmessage:"database \"lp_db1\" does not
> exist",Errdetail:"NULL",Errhint:"NULL",Internal_query:"NULL",Internal_pos:"0",Errcontext:"NULL",User_query
> :"DROP DATABASE lp_db1;",Cursorpos:"NULL",File_location:"dropdb,
> dbcommands.c:841",Application_name:"NULL",Backend_type:"client
> backend",Leader_PID:"0",Query_id:"0"

CSV format is well documented
(https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-CSVLOG).

If you want named fields you can wait for pg15 and its jsonlog
(https://www.depesz.com/2022/01/17/waiting-for-postgresql-15-introduce-log_destinationjsonlog/).

I, for one, wouldn't want to have to deal with field names repeated in
every single record.

depesz




Re: Change the csv log to 'key:value' to facilitate the user to understanding and processing of logs

2022-03-15 Thread Julien Rouhaud
Hi,

On Tue, Mar 15, 2022 at 09:31:19AM +0800, lupeng wrote:
>
> When I audit the Postgresql database recently, I found that after configuring
> the log type as csv, the output log content is as follows: "database
> ""lp_db1"" does not exist","DROP DATABASE lp_db1;",,"dropdb,
> dbcommands.c:841","","client backend",,0 It is very inconvenient to
> understand the real meaning of each field. And in the log content," is
> escaped as "", which is not friendly to regular expression matching.
> Therefore, I want to modify the csv log function, change its format to
> key:value, assign the content of the non-existing field to NULL, and at the
> same time, " will be escaped as  \" in the log content. After the
> modification, the above log format is as follows: Log_time:"2022-03-15
> 09:17:55.289
> CST",User_name:"postgres",Database_name:"lp_db",Process_id:"17995", [...]

This would make the logs a lot more verbose, and a lot less easy to process if
you process them with tools intended for csv files.

You should consider using the newly introduced jsonlog format (as soon as pg15
is released), which seems closer to what you want.
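
For example, once that is available, switching over is just a configuration
change (a sketch; like csvlog, jsonlog requires the logging collector):

ALTER SYSTEM SET log_destination = 'jsonlog';
SELECT pg_reload_conf();

Each log entry then comes out as a JSON object with named fields, which is
much easier to post-process than positional CSV columns.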




Change the csv log to 'key:value' to facilitate the user to understanding and processing of logs

2022-03-15 Thread lupeng
Dear Hackers

When I audit the Postgresql database recently, I found that after configuring 
the log type as csv, the output log content is as follows: "database ""lp_db1"" 
does not exist","DROP DATABASE lp_db1;",,"dropdb, 
dbcommands.c:841","","client backend",,0 It is very inconvenient to understand 
the real meaning of each field. And in the log content," is escaped as "", 
which is not friendly to regular expression matching. Therefore, I want to 
modify the csv log function, change its format to key:value, assign the content 
of the non-existing field to NULL, and at the same time, " will be escaped as  
\" in the log content. After the modification, the above log format is as 
follows: Log_time:"2022-03-15 09:17:55.289 
CST",User_name:"postgres",Database_name:"lp_db",Process_id:"17995",Remote_host:"192.168.88.130",Remote_port:"38402",Line_number:
 "622fe941.464b",PS_display:"DROP DATABASE",Session_start_timestamp:"2022-03-15 
09:17:53 
CST",Virtual_transaction_id:"3/2",Transaction_id:"NULL",Error_severity:"ERROR",SQL_state_code
 :"3D000",Errmessage:"database \"lp_db1\" does not 
exist",Errdetail:"NULL",Errhint:"NULL",Internal_query:"NULL",Internal_pos:"0",Errcontext:"NULL",User_query
 :"DROP DATABASE lp_db1;",Cursorpos:"NULL",File_location:"dropdb, 
dbcommands.c:841",Application_name:"NULL",Backend_type:"client 
backend",Leader_PID:"0",Query_id:"0"

Regards,

-- 
-lupeng

Re: Can we consider "24 Hours" for "next day" in INTERVAL datatype ?

2022-03-15 Thread Joseph Koshakow
On Tue, Mar 15, 2022 at 3:46 AM Julien Rouhaud  wrote:
> On Tue, Mar 15, 2022 at 12:54:58PM +0530, Prabhat Sahu wrote:
> >
> > Is there any specific purpose we are holding the hours as an increasing
> > number beyond 24 hours also?
>
> Yes, you can't blindly assume that adding 24 hours will always be the same as
> adding a day.  You can just justify_days if you want to force that behavior.

The specific purpose, by the way, at least according to the docs [1],
is daylight saving time:
> Internally interval values are stored as months, days, and microseconds. This 
> is done because
> the number of days in a month varies, and a day can have 23 or 25 hours if a 
> daylight savings
> time adjustment is involved.
Though I suppose leap seconds may also follow similar logic.

[1] 
https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
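
A quick illustration, assuming a session time zone that observes DST
(America/New_York sprang forward on 2022-03-13):

SET timezone = 'America/New_York';
SELECT '2022-03-12 12:00'::timestamptz + interval '1 day';
-- 2022-03-13 12:00:00-04, i.e. only 23 elapsed hours
SELECT '2022-03-12 12:00'::timestamptz + interval '24 hours';
-- 2022-03-13 13:00:00-04, i.e. a full 24 elapsed hours
SELECT justify_hours(interval '27 hours');
-- 1 day 03:00:00, folding complete 24-hour blocks into days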

- Joe Koshakow




Re: Assert in pageinspect with NULL pages

2022-03-15 Thread Justin Pryzby
On Tue, Mar 15, 2022 at 06:32:44PM +0900, Michael Paquier wrote:

> + if (!IS_BRIN(indexRel))
> + ereport(ERROR,
> + (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +  errmsg("\"%s\" is not a %s index",
> + 
> RelationGetRelationName(indexRel), "BRIN")));

If it were me, I'd write this without the extra parens around (errcode()).

> +-- Suppress the DETAIL message, to allow the tests to work across various
> +-- default page sizes.

I think you mean "various non-default page sizes" or "various page sizes".

-- 
Justin




Re: BufferAlloc: don't take two simultaneous locks

2022-03-15 Thread Yura Sokolov
В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> Thanks for the new version.
> 
> At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov  
> wrote in 
> > В Пн, 14/03/2022 в 14:57 +0300, Yura Sokolov пишет:
> > > В Пн, 14/03/2022 в 17:12 +0900, Kyotaro Horiguchi пишет:
> > > > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov 
> > > >  wrote in 
> > > > > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет:
> > > > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > > > 128kB shared buffers and I saw that get_hash_entry never takes the
> > > > !element_alloc() path and always allocate a fresh entry, then
> > > > saturates at 30 new elements allocated at the medium of a 100 seconds
> > > > run.
> > > > 
> > > > Then, I tried the same with the patch, and I am surprized to see that
> > > > the rise of the number of newly allocated elements didn't stop and
> > > > went up to 511 elements after the 100 seconds run.  So I found that my
> > > > concern was valid.  The change in dynahash actually
> > > > continuously/repeatedly causes lack of free list entries.  I'm not
> > > > sure how much the impact given on performance if we change
> > > > get_hash_entry to prefer other freelists, though.
> > > 
> > > Well, it is quite strange SharedBufHash is not allocated as
> > > HASH_FIXED_SIZE. Could you check what happens with this flag set?
> > > I'll try as well.
> > > 
> > > Other way to reduce observed case is to remember freelist_idx for
> > > reused entry. I didn't believe it matters much since entries migrated
> > > netherless, but probably due to some hot buffers there are tention to
> > > crowd particular freelist.
> > 
> > Well, I did both. Everything looks ok.
> 
> Hmm. v8 returns stashed element with original partition index when the
> element is *not* reused.  But what I saw in the previous test runs is
> the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> behaving the same way (or somehow even worse) with the previous
> version.

v8 doesn't differ in REMOVE case neither from master nor from
previous version. It differs in RETURNED case only.
Or I didn't understand what you mean :(

> get_hash_entry continuously suffer lack of freelist
> entry. (FWIW, attached are the test-output diff for both master and
> patched)
> 
> master finally allocated 31 fresh elements for a 100s run.
> 
> > ALLOCED: 31;; freshly allocated
> 
> v8 finally borrowed 33620 times from another freelist and 0 freshly
> allocated (ah, this version changes that..)
> Finally v8 results in:
> 
> > RETURNED: 50806;; returned stashed elements
> > BORROWED: 33620;; borrowed from another freelist
> > REUSED: 1812664;; stashed
> > ASSIGNED: 1762377  ;; reused
> >(ALLOCED: 0);; freshly allocated
> 
> It contains a huge degradation by frequent elog's so they cannot be
> naively relied on, but it should show what is happening sufficiently.

Is there any measurable performance hit caused by the borrowing?
It looks like "borrowed" happened about 1.5% of the time. And that is with
128kB of shared buffers, which is extremely small. (Or was it 128MB?)

Well, I think some spare entries could reduce the borrowing if there is
a need. I'll test on 128MB with spare entries. If there is a benefit,
I'll return some, but will keep SharedBufHash fixed.

The master branch does fewer freelist manipulations since it tries to
insert first, and if there is a collision it doesn't delete the victim
buffer.

> > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> ...
> > Strange thing: both master and patched version has higher
> > peak tps at X5676 at medium connections (17 or 27 clients)
> > than in first october version [1]. But lower tps at higher
> > connections number (>= 191 clients).
> > I'll try to bisect on master this unfortunate change.
> 
> The reversing of the preference order between freshly-allocation and
> borrow-from-another-freelist might affect.

`master` changed its behaviour as well.
It is not a problem of the patch at all.

--

regards
Yura.





Re: refactoring basebackup.c

2022-03-15 Thread Jeevan Ladhe
Thanks for the patch, Dipesh.
I had a look at the patch and also tried to take the backup. I have
following suggestions and observations:

I get following error at my end:

$ pg_basebackup -D /tmp/zstd_bk -Ft -Xfetch --compress=server-zstd:7@4
pg_basebackup: error: could not initiate base backup: ERROR:  could not
compress data: Unsupported parameter
pg_basebackup: removing data directory "/tmp/zstd_bk"

This is mostly because I have the zstd library version v1.4.4, which
does not have default support for parallel workers. Maybe we should
have a better error, something that is hinting that the parallelism is
not supported by the particular build.

The regression for pg_verifybackup test 008_untar.pl also fails with a
similar error. Here, I think we should have some logic in regression to
skip the test if the parameter is not supported?

+   if (ZSTD_isError(ret))
+   elog(ERROR,
+"could not compress data: %s",
+ZSTD_getErrorName(ret));

I think all of this can go on one line, but anyhow we have to improve
the error message here.

Also, just a thought, for the versions where parallelism is not
supported, should we instead just throw a warning and fall back to
non-parallel behavior?

Regards,
Jeevan Ladhe

On Mon, 14 Mar 2022 at 21:41, Dipesh Pandit  wrote:

> Hi,
>
> I tried to implement support for parallel ZSTD compression. The
> library provides an option (ZSTD_c_nbWorkers) to specify the
> number of compression workers. The number of parallel
> workers can be set as part of compression parameter and if this
> option is specified then the library performs parallel compression
> based on the specified number of workers.
>
> User can specify the number of parallel worker as part of
> --compress option by appending an integer value after at sign (@).
> (-Z, --compress=[{client|server}-]{gzip|lz4|zstd}[:LEVEL][@WORKERS])
>
> Please find the attached patch v1 with the above changes.
>
> Note: ZSTD library version 1.5.x supports parallel compression
> by default and if the library version is lower than 1.5.x then
> parallel compression is enabled only the source is compiled with build
> macro ZSTD_MULTITHREAD. If the linked library version doesn't
> support parallel compression then setting the value of parameter
> ZSTD_c_nbWorkers to a value other than 0 will be no-op and
> returns an error.
>
> Thanks,
> Dipesh
>


Re: Skipping logical replication transactions on subscriber side

2022-03-15 Thread Amit Kapila
On Tue, Mar 15, 2022 at 11:43 AM Masahiko Sawada  wrote:
>
> I've attached an updated version patch.
>

Review:
===
1.
+++ b/doc/src/sgml/logical-replication.sgml
@@ -366,15 +366,19 @@ CONTEXT:  processing remote data for replication
origin "pg_16395" during "INSER
transaction, the subscription needs to be disabled temporarily by
ALTER SUBSCRIPTION ... DISABLE first or
alternatively, the
subscription can be used with the
disable_on_error option.
-   Then, the transaction can be skipped by calling the
+   Then, the transaction can be skipped by using
+   ALTER SUBSCRITPION ... SKIP with the finish LSN
+   (i.e., LSN 0/14C0378). After that the replication
+   can be resumed by ALTER SUBSCRIPTION ... ENABLE.
+   Alternatively, the transaction can also be skipped by calling the

Do we really need to disable the subscription for the skip feature? I
think that is required for origin_advance. Also, we can probably say
that the finish LSN could be a prepare LSN, commit LSN, etc. (The full
command sequence being documented is sketched at the end of this review.)

2.
+ /*
+ * Quick return if it's not requested to skip this transaction. This
+ * function is called every start of applying changes and we assume that
+ * skipping the transaction is not used in many cases.
+ */
+ if (likely(XLogRecPtrIsInvalid(MySubscription->skiplsn) ||

The second part of this comment (especially ".. every start of
applying changes ..") sounds slightly odd to me. How about changing it
to: "This function is called for every remote transaction and we
assume that skipping the transaction is not used in many cases."

3.
+
+ ereport(LOG,
+ errmsg("start skipping logical replication transaction which
finished at %X/%X",
...
+ ereport(LOG,
+ (errmsg("done skipping logical replication transaction which
finished at %X/%X",

No need of 'which' in above LOG messages. I think the message will be
clear without the use of which in above message.

4.
+ ereport(LOG,
+ (errmsg("done skipping logical replication transaction which
finished at %X/%X",
+ LSN_FORMAT_ARGS(skip_xact_finish_lsn;
+
+ /* Stop skipping changes */
+ skip_xact_finish_lsn = InvalidXLogRecPtr;

Let's reverse the order of these statements to make them consistent
with the corresponding maybe_start_* function.

5.
+
+ if (myskiplsn != finish_lsn)
+ ereport(WARNING,
+ errmsg("skip-LSN of logical replication subscription \"%s\"
cleared", MySubscription->name),

Shouldn't this be a LOG instead of a WARNING as this will be displayed
only in server logs and by background apply worker?

6.
@@ -1583,7 +1649,8 @@ apply_handle_insert(StringInfo s)
  TupleTableSlot *remoteslot;
  MemoryContext oldctx;

- if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
+ if (is_skipping_changes() ||

Is there a reason to keep the skip_changes check here and in other DML
operations instead of at one central place in apply_dispatch?

7.
+ /*
+ * Start a new transaction to clear the subskipxid, if not started
+ * yet. The transaction is committed below.
+ */
+ if (!IsTransactionState())

I think the second part of the comment: "The transaction is committed
below." is not required.

8.
+ XLogRecPtr subskiplsn; /* All changes which finished at this LSN are
+ * skipped */
+
 #ifdef CATALOG_VARLEN /* variable-length fields start here */
  /* Connection string to the publisher */
  text subconninfo BKI_FORCE_NOT_NULL;
@@ -109,6 +112,8 @@ typedef struct Subscription
  bool disableonerr; /* Indicates if the subscription should be
  * automatically disabled if a worker error
  * occurs */
+ XLogRecPtr skiplsn; /* All changes which finished at this LSN are
+ * skipped */

No need for 'which' in the above comments.

9.
Can we merge 029_disable_on_error in 030_skip_xact and name it as
029_on_error (or 029_on_error_skip_disable or some variant of it)?
Both seem to be related features. I am slightly worried at the pace at
which the number of test files is growing in the subscription tests.
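
For reference (point 1 above), here is a sketch of the command sequence the
documentation describes, assuming the SKIP option syntax in the current
patch version and a hypothetical subscription name:

-- finish LSN taken from the error reported by the apply worker
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SKIP (lsn = '0/14C0378');
ALTER SUBSCRIPTION mysub ENABLE;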

-- 
With Regards,
Amit Kapila.




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-15 Thread Thomas Munro
On Tue, Mar 15, 2022 at 10:30 PM Michael Paquier  wrote:
> So, which one of a relative path or an absolute path do you think
> would be better for the user?  My preference tends toward the relative
> path, as we know that all those tablespaces stay in pg_tblspc/ so one
> can make the difference with normal tablespaces more easily.  The
> barrier is thin, though :p

Sounds good to me.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-15 Thread Dilip Kumar
On Mon, Mar 14, 2022 at 10:04 PM Robert Haas  wrote:

> I think it would make sense to have two different WAL records e.g.
> XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's
> easy to see how this could be generalized to other strategies in the
> future.

Done that way.  In dbase_desc(), for XLOG_DBASE_CREATE_FILE_COPY I
have kept the older description i.e. "copy dir" and for
XLOG_DBASE_CREATE_WAL_LOG it is "create dir", because logically the
first one is actually copying and the second one is just creating the
directory.  Do you think we should be using "copy dir file_copy" and
"copy dir wal_log" in the description as well?

> On Mon, Mar 14, 2022 at 12:04 PM Robert Haas  wrote:
> > I was looking at 0001 and 0002 again and realized that I swapped the
> > names load_relmap_file() and read_relmap_file() from what I should
> > have done. Here's a revised version. With this, read_relmap_file() and
> > write_relmap_file() become functions that just read and write the file
> > without touching any global variables, and load_relmap_file() is the
> > function that reads data from the file and puts it into a global
> > variable, which seems more sensible than the way I had it before.

Okay, I have included this patch and rebased other patches on top of that.

> Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think
> 'bool permanent' would be better (note BM_PERMANENT). This would
> involve reversing true and false.

Okay changed.

> Regarding 0005:
>
> +   CREATEDB_WAL_LOG = 0,
> +   CREATEDB_FILE_COPY = 1
>
> I still think you don't need = 0 and = 1 here.

Done

> I'll probably go through and do a pass over the comments once you post
> the next version of this. There seems to be work needed in a bunch of
> places, but it probably makes more sense for me to go through and
> adjust the things that seem to need it rather than listing a bunch of
> changes for you to make.

Sure, thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 9a4138bd3590d4df887dc09989a8c72715789b65 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Tue, 15 Mar 2022 09:13:21 +0530
Subject: [PATCH v15 1/6] Refactor relmap load and relmap write functions

Currently, the relmap reading and writing interfaces are tightly
coupled with the shared_map and local_map of the database the backend
is connected to.  But the higher-level patches in this set need
interfaces that can read a relmap into caller-supplied memory, accept
the map to be written as an argument, and read a relmap file from a
given database path instead of assuming we are connected to that
database.

So, as part of this patch, refactor the existing code to expose read
and write interfaces that are independent of shared_map and local_map,
without changing any logic.

Author: Robert Haas
---
 src/backend/utils/cache/relmapper.c | 127 ++--
 1 file changed, 64 insertions(+), 63 deletions(-)

diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 4f6811f..c3fef70 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -137,8 +137,10 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode,
 static void merge_map_updates(RelMapFile *map, const RelMapFile *updates,
 			  bool add_okay);
 static void load_relmap_file(bool shared, bool lock_held);
-static void write_relmap_file(bool shared, RelMapFile *newmap,
-			  bool write_wal, bool send_sinval, bool preserve_files,
+static void read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
+			 int elevel);
+static void write_relmap_file(RelMapFile *newmap, bool write_wal,
+			  bool send_sinval, bool preserve_files,
 			  Oid dbid, Oid tsid, const char *dbpath);
 static void perform_relmap_update(bool shared, const RelMapFile *updates);
 
@@ -568,9 +570,9 @@ RelationMapFinishBootstrap(void)
 	Assert(pending_local_updates.num_mappings == 0);
 
 	/* Write the files; no WAL or sinval needed */
-	write_relmap_file(true, &shared_map, false, false, false,
-	  InvalidOid, GLOBALTABLESPACE_OID, NULL);
-	write_relmap_file(false, &local_map, false, false, false,
+	write_relmap_file(&shared_map, false, false, false,
+	  InvalidOid, GLOBALTABLESPACE_OID, "global");
+	write_relmap_file(&local_map, false, false, false,
 	  MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
 }
 
@@ -687,39 +689,48 @@ RestoreRelationMap(char *startAddress)
 }
 
 /*
- * load_relmap_file -- load data from the shared or local map file
+ * load_relmap_file -- load the shared or local map file
  *
- * Because the map file is essential for access to core system catalogs,
- * failure to read it is a fatal error.
+ * Because these files are essential for access to core system catalogs,
+ * failure to load either of them is a fatal error.
  *
  * Note that the local case requires DatabasePath to be set up.
  */
 static void
 

Re: support for MERGE

2022-03-15 Thread Amit Langote
On Mon, Mar 14, 2022 at 11:36 PM Amit Langote  wrote:
> On Sun, Mar 13, 2022 at 12:36 AM Alvaro Herrera  
> wrote:
> > FYI I intend to get the ModifyTable split patch (0001+0003) pushed
> > hopefully on Tuesday or Wednesday next week, unless something really
> > ugly is found on it.
>
> I looked at 0001 and found a few things that could perhaps be improved.
>
> +   /* Slot containing tuple obtained from subplan, for RETURNING */
> +   TupleTableSlot *planSlot;
>
> I think planSlot is also used by an FDW to look up any FDW-specific
> junk attributes that the FDW's scan method would have injected into
> the slot, so maybe no need to specifically mention only RETURNING here
> as what cares about it.
>
> +   /*
> +* The tuple produced by EvalPlanQual to retry from, if a cross-partition
> +* UPDATE requires it
> +*/
> +   TupleTableSlot *retrySlot;
>
> Maybe a bit long name, though how about calling this updateRetrySlot
> to make it apparent that what is to be retried with the slot is the
> whole UPDATE operation, not some sub-operation (like say the
> cross-partition portion)?

I think I meant to suggest cpUpdateRetrySlot, as in, the slot
populated by ExecCrossPartitionUpdate() with a tuple to retry the
UPDATE operation with, at least the portion of it that ExecUpdateAct()
is in charge of.

> +   /*
> +* The tuple projected by the INSERT's RETURNING clause, when doing a
> +* cross-partition UPDATE
> +*/
> +   TupleTableSlot *projectedFromInsert;
>
> I'd have gone with cpUpdateReturningSlot here, because that is what it
> looks to me to be.  The fact that it is ExecInsert() that computes the
> RETURNING targetlist projection seems like an implementation detail.
>
> I know you said you don't like having "Slot" in the name, but then we
> also have retrySlot. :)
>
> +/*
> + * Context struct containing output data specific to UPDATE operations.
> + */
> +typedef struct UpdateContext
> +{
> +   boolupdateIndexes;  /* index update required? */
> +   boolcrossPartUpdate;/* was it a cross-partition update? */
> +} UpdateContext;
>
> I wonder if it isn't more readable to just have these two be the
> output arguments of ExecUpdateAct()?
>
> +redo_act:
> +   /* XXX reinit ->crossPartUpdate here? */
>
> Why not just make ExecUpdateAct() always set the flag, not only when a
> cross-partition update does occur?
>
> Will look some more in the morning tomorrow.

Here are some more improvement suggestions:

+/*
+ * Context struct for a ModifyTable operation, containing basic execution state
+ * and some output data.
+ */

Does it make sense to expand "some output data", maybe as:

...and some output variables populated by the ExecUpdateAct() and
ExecDeleteAct() to report the result of their actions to ExecUpdate()
and ExecDelete() respectively.

+   /* output */
+   TM_FailureData tmfd;/* stock failure data */

Maybe we should be more specific here, say:

Information about the changes that were found to be made concurrently
to a tuple being updated or deleted

+   LockTupleMode lockmode; /* tuple lock mode */

Also here, say:

Lock mode to acquire on the latest tuple before performing EvalPlanQual() on it

+/*
+ * ExecDeleteAct -- subroutine for ExecDelete
+ *
+ * Actually delete the tuple, when operating on a plain or partitioned table.

+/*
+ * ExecUpdateAct -- subroutine for ExecUpdate
+ *
+ * Actually update the tuple, when operating on a plain or partitioned table.

Hmm, ExecDelete() and ExecUpdate() never see a partitioned table,
because both DELETE and UPDATE are always performed directly on leaf
partitions.   The (root) partitioned table only gets involved if the
UPDATE of a leaf partition tuple causes it to move to another
partition, and even then only for applying the tuple-routing algorithm
to find the destination leaf partition.

So the following suffices, for ExecDeleteAct():

Actually delete the tuple from a plain table.

and for ExecUpdateAct(), maybe it makes sense to mention the
possibility of a cross-partition update.

Actually update the tuple of a plain table.  If the table happens to
be a leaf partition and the update causes its partition constraint to
be violated, ExecCrossPartitionUpdate() is invoked to move the updated
tuple into the appropriate partition.
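
In SQL terms, the cross-partition case being talked about here is just
an UPDATE that moves a row into a different partition (table names are
illustrative):

CREATE TABLE p (id int, val text) PARTITION BY RANGE (id);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (100);
CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (100) TO (200);
INSERT INTO p VALUES (1, 'a');
UPDATE p SET id = 150 WHERE id = 1;  -- row is deleted from p1 and routed into p2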

+ * Closing steps of tuple deletion; this invokes AFTER FOR EACH ROW triggers,
+ * including the UPDATE triggers if there was a cross-partition tuple move.

How about:

...including the UPDATE triggers if the ExecDelete() is being done as
part of a cross-partition tuple move.

+   /* and project to create a current version of the new tuple */

How about:

and project the new tuple to retry the UPDATE with

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: Query about time zone patterns in to_char

2022-03-15 Thread Nitin Jadhav
> This patch is still in the current CommitFest, so I decided to review
> it. I see that DCH_keywords[] includes upper and lower-case entries
> for everything except the three cases corrected by this patch, where
> it includes upper-case entries but not the corresponding lower-case
> entries. It seems to make sense to make these three cases consistent
> with everything else.
>
> It took me a while to understand how DCH_keywords[] and DCH_index[]
> actually work, and I think it's a pretty confusing design, but what
> the patch does seems to be consistent with that, so it appears correct
> to me.
>
> Therefore, I have committed it.

Thank you so much.

Thanks & Regards,
Nitin Jadhav

On Tue, Mar 15, 2022 at 2:22 AM Robert Haas  wrote:
>
> On Fri, Jul 9, 2021 at 10:44 AM Tomas Vondra
>  wrote:
> > Yeah, does not seem to be worth it, as there seem to be no actual
> > reports of issues in the field.
> >
> > FWIW there seem to be quite a bit of other to_char differences compared
> > to Oracle (judging by docs and playing with sqlfiddle). But the patch
> > seems fine / simple enough and non-problematic, so perhaps let's just
> > get it committed?
>
> This patch is still in the current CommitFest, so I decided to review
> it. I see that DCH_keywords[] includes upper and lower-case entries
> for everything except the three cases corrected by this patch, where
> it includes upper-case entries but not the corresponding lower-case
> entries. It seems to make sense to make these three cases consistent
> with everything else.
>
> It took me a while to understand how DCH_keywords[] and DCH_index[]
> actually work, and I think it's a pretty confusing design, but what
> the patch does seems to be consistent with that, so it appears correct
> to me.
>
> Therefore, I have committed it.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com




Re: Assert in pageinspect with NULL pages

2022-03-15 Thread Michael Paquier
On Thu, Feb 17, 2022 at 09:00:20PM -0600, Justin Pryzby wrote:
> BRIN can also crash if passed a non-brin index.
> 
> I've been sitting on this one for awhile.  Feel free to include it in your
> patchset.

So, I have taken a closer look at this thread, and the problem you
are reporting here is worth fixing on its own, before we do something
for new pages as reported originally.  More than the brin AM not being
checked properly caught my attention, because a couple of code paths
are fuzzy about the checks on the page sizes.  My impression of the
whole thing is that we'd better generalize the use of
get_page_from_raw(), improving the code in the various places where
the user can provide a raw page as input (I'd also like to think that
it is better practice anyway, as any of those functions may end up
reading 8-byte fields, and the current coding would silently break in
alignment-picky environments):
- Some code paths (hash, btree) would trigger elog(ERROR) but we
cannot use that for errors that can be triggered by the user. 
- bt_page_items_bytea(), page_header() and fsm_page_contents() were
fuzzy on the page size check, so it was rather easy to generate
garbage with random data.
- page_checksum_internal() used a PageHeader, where I would have
expected a Page.
- More consistency in the error strings, meaning less contents to
translate in the long-term.

This first batch leads me to the attached, with tests to stress all
that for all the functions taking raw pages in input.
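
To give an idea, the kind of misuse the new tests (and error paths)
exercise looks roughly like this (a sketch only; table and index names
are illustrative):

CREATE TABLE t (a int);
CREATE INDEX t_btree ON t (a);
-- raw page from a btree index fed into a BRIN function
SELECT * FROM brin_page_items(get_raw_page('t_btree', 0), 't_btree');
-- inputs shorter than BLCKSZ
SELECT page_header('aaa'::bytea);
SELECT fsm_page_contents('aaa'::bytea);

With the patch, these should fail with a clean error rather than crash
or return garbage.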
--
Michael
From 588ffddf2bd2c0d1e6168a2e7093c2488caec94b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 15 Mar 2022 17:59:04 +0900
Subject: [PATCH] Fixes for pageinspect with page sizes

---
 contrib/pageinspect/brinfuncs.c| 36 ++
 contrib/pageinspect/btreefuncs.c   | 28 ++--
 contrib/pageinspect/expected/brin.out  |  4 +++
 contrib/pageinspect/expected/btree.out | 15 +++
 contrib/pageinspect/expected/gin.out   | 11 
 contrib/pageinspect/expected/gist.out  | 15 +++
 contrib/pageinspect/expected/hash.out  | 17 
 contrib/pageinspect/expected/page.out  | 11 
 contrib/pageinspect/fsmfuncs.c |  4 ++-
 contrib/pageinspect/gistfuncs.c|  9 +++
 contrib/pageinspect/hashfuncs.c|  6 +++--
 contrib/pageinspect/rawpage.c  | 29 +++--
 contrib/pageinspect/sql/brin.sql   |  4 +++
 contrib/pageinspect/sql/btree.sql  | 13 ++
 contrib/pageinspect/sql/gin.sql|  9 +++
 contrib/pageinspect/sql/gist.sql   | 13 ++
 contrib/pageinspect/sql/hash.sql   | 13 ++
 contrib/pageinspect/sql/page.sql   |  9 +++
 18 files changed, 179 insertions(+), 67 deletions(-)

diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c
index b7c8365218..bd0ea8b18c 100644
--- a/contrib/pageinspect/brinfuncs.c
+++ b/contrib/pageinspect/brinfuncs.c
@@ -16,6 +16,7 @@
 #include "access/brin_tuple.h"
 #include "access/htup_details.h"
 #include "catalog/index.h"
+#include "catalog/pg_am_d.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "lib/stringinfo.h"
@@ -31,6 +32,8 @@ PG_FUNCTION_INFO_V1(brin_page_items);
 PG_FUNCTION_INFO_V1(brin_metapage_info);
 PG_FUNCTION_INFO_V1(brin_revmap_data);
 
+#define IS_BRIN(r) ((r)->rd_rel->relam == BRIN_AM_OID)
+
 typedef struct brin_column_state
 {
 	int			nstored;
@@ -45,8 +48,7 @@ Datum
 brin_page_type(PG_FUNCTION_ARGS)
 {
 	bytea	   *raw_page = PG_GETARG_BYTEA_P(0);
-	Page		page = VARDATA(raw_page);
-	int			raw_page_size;
+	Page		page;
 	char	   *type;
 
 	if (!superuser())
@@ -54,14 +56,7 @@ brin_page_type(PG_FUNCTION_ARGS)
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("must be superuser to use raw page functions")));
 
-	raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
-
-	if (raw_page_size != BLCKSZ)
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("input page too small"),
- errdetail("Expected size %d, got %d",
-		   BLCKSZ, raw_page_size)));
+	page = get_page_from_raw(raw_page);
 
 	switch (BrinPageType(page))
 	{
@@ -89,19 +84,7 @@ brin_page_type(PG_FUNCTION_ARGS)
 static Page
 verify_brin_page(bytea *raw_page, uint16 type, const char *strtype)
 {
-	Page		page;
-	int			raw_page_size;
-
-	raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
-
-	if (raw_page_size != BLCKSZ)
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("input page too small"),
- errdetail("Expected size %d, got %d",
-		   BLCKSZ, raw_page_size)));
-
-	page = VARDATA(raw_page);
+	Page		page = get_page_from_raw(raw_page);
 
 	/* verify the special space says this page is what we want */
 	if (BrinPageType(page) != type)
@@ -143,6 +126,13 @@ brin_page_items(PG_FUNCTION_ARGS)
 	SetSingleFuncCall(fcinfo, 0);
 
 	indexRel = index_open(indexRelid, AccessShareLock);
+
+	if (!IS_BRIN(indexRel))
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),

Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-15 Thread Michael Paquier
On Tue, Mar 15, 2022 at 03:55:56PM +1300, Thomas Munro wrote:
> On Tue, Mar 15, 2022 at 2:50 PM Michael Paquier  wrote:
>> On Tue, Mar 15, 2022 at 02:33:17PM +1300, Thomas Munro wrote:
>> > As for the complaint about pg_tablespace_location() failing, would it
>> > be better to return an empty string?  That's what was passed in as
>> > LOCATION.  Something like the attached.
>>
>> Hmm, I don't think so.  The point of the function is to be able to
>> know the location of a tablespace at SQL level so as we don't have any
>> need to hardcode its location within any external tests (be it a
>> pg_regress test or a TAP test) based on how in-place tablespace paths
>> are built in the backend, so I think that we'd better report either a
>> relative path from data_directory or an absolute path, but not an
>> empty string.
>>
>> In any case, I'd suggest to add a regression test.  What I have sent
>> upthread would be portable enough.
> 
> Fair enough.  No objections here.

So, which one of a relative path or an absolute path do you think
would be better for the user?  My preference tends toward the relative
path, as we know that all those tablespaces stay in pg_tblspc/ so one
can make the difference with normal tablespaces more easily.  The
barrier is thin, though :p
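
For anyone not following the other thread, the case in question is
simply something like (names are illustrative):

SET allow_in_place_tablespaces = on;
CREATE TABLESPACE regress_tblspc LOCATION '';  -- in-place, lives under pg_tblspc/
SELECT pg_tablespace_location(oid) FROM pg_tablespace
  WHERE spcname = 'regress_tblspc';

and the question is whether that last call should report a path
relative to the data directory or an absolute one.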
--
Michael


signature.asc
Description: PGP signature


Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Michael Paquier
On Tue, Mar 15, 2022 at 05:23:40PM +0900, Kyotaro Horiguchi wrote:
> At Tue, 15 Mar 2022 12:19:47 +0530, Bharath Rupireddy 
>  wrote in 
>> On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi
>>  wrote:
>> 0001 - I don't think you need to do this as UpdateControlFile
>> (update_controlfile) will anyway update it, no?
>> + /* Update control file using current time */
>> + ControlFile->time = (pg_time_t) time(NULL);
> 
> Ugh.. Yes. It is a copy-pasto from older versions.  They may have the
> same copy-pasto..

This thread has shifted to an entirely different discussion,
presenting patches that touch code paths unrelated to what was first
stated.  Shouldn't you create a new thread with a proper $subject to
attract a more correct audience?  
--
Michael


signature.asc
Description: PGP signature


Re: Column Filtering in Logical Replication

2022-03-15 Thread Tomas Vondra



On 3/15/22 05:43, Amit Kapila wrote:
> On Mon, Mar 14, 2022 at 4:42 PM houzj.f...@fujitsu.com
>  wrote:
>>
>> On Monday, March 14, 2022 5:08 AM Tomas Vondra 
>>  wrote:
>>>
>>> On 3/12/22 05:30, Amit Kapila wrote:
> ...

 Okay, please find attached. I have done basic testing of this, if we
 agree with this approach then this will require some more testing.

>>>
>>> Thanks, the proposed changes seem like a clear improvement, so I've
>>> added them, with some minor tweaks (mostly to comments).
>>
>> Hi,
>>
>> Thanks for updating the patches !
>> And sorry for the row filter bug caused by my mistake.
>>
>> I looked at the two fixup patches. I am thinking would it be better if we
>> add one testcase for these two bugs? Maybe like the attachment.
>>
> 
> Your tests look good to me. We might want to add some comments for
> each test but I guess that can be done before committing. Tomas, it
> seems you are planning to push these bug fixes, do let me know if you
> want me to take care of these while you focus on the main patch? I
> think the first patch needs to be backpatched till 13 and the second
> one is for just HEAD.
> 

Yeah, I plan to push the fixes later today. I'll polish them a bit
first, and merge the tests (shared by Hou zj) into the patches etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 12:19:47 +0530, Bharath Rupireddy 
 wrote in 
> On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi
>  wrote:
> 0001 - I don't think you need to do this as UpdateControlFile
> (update_controlfile) will anyway update it, no?
> + /* Update control file using current time */
> + ControlFile->time = (pg_time_t) time(NULL);

Ugh.. Yes. It is a copy-pasto from older versions.  They may have the
same copy-pasto..



> > 0002: Add REDO/Checkpiont LSNs to checkpoinkt-end log message.
> >   (The main patch in this thread)
> 
> 0002 - If at all the intention is to say that no ControlFileLock is
> required while reading ControlFile->checkPoint and
> ControlFile->checkPointCopy.redo, let's say it, no? How about
> something like "No ControlFileLock is required while reading
> ControlFile->checkPoint and ControlFile->checkPointCopy.redo as there
> can't be any other process updating them concurrently."?
> 
> + /* we are the only updator of these variables */
> + LSN_FORMAT_ARGS(ControlFile->checkPoint),
> + LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;

I thought the comment explains that. But it would be better to be more
specific.  It is changed as follows.

> * ControlFileLock is not required as we are the only
> * updator of these variables.


> > 0003: Replace (WAL-)location to LSN in pg_controldata.
> >
> > 0004: Replace (WAL-)location to LSN in user-facing texts.
> >   (This doesn't reflect my recent comments.)
> 
> If you don't mind, can you please put the comments here?

Okay. It's the following message.

https://www.postgresql.org/message-id/20220209.115204.1794224638476710282.horikyota@gmail.com

> The old 0003 (attached 0004):
> 
> 
> 
> +++ b/src/backend/access/rmgrdesc/xlogdesc.c
> - appendStringInfo(buf, "redo %X/%X; "
> + appendStringInfo(buf, "redo lsn %X/%X; "
> 
> 
> 
> It is shown in the context of a checkpoint record, so I think it is
> not needed and rather lengthens the dump line uselessly. 
> 
> 
> 
> +++ b/src/backend/access/transam/xlog.c
> - (errmsg("request to flush past end of generated 
> WAL; request %X/%X, current position %X/%X",
> + (errmsg("request to flush past end of generated 
> WAL; request lsn %X/%X, current lsn %X/%X",
> 
> 
> 
> +++ b/src/backend/replication/walsender.c
> - (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush position of this server %X/%X",
> + (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush LSN of this server %X/%X",
> 
> 
> 
> "WAL" is upper-cased. So it seems rather strange that the "lsn" is
> lower-cased.  In the first place the message doesn't look like a
> user-facing error message and I feel we don't need position or lsn
> there..
> 
> 
> 
> +++ b/src/bin/pg_rewind/pg_rewind.c
> - pg_log_info("servers diverged at WAL location %X/%X on timeline 
> %u",
> + pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u",
> 
> 
> 
> I feel that we don't need "WAL" there.
> 
> 
> 
> +++ b/src/bin/pg_waldump/pg_waldump.c
> - printf(_("  -e, --end=RECPTR   stop reading at WAL location 
> RECPTR\n"));
> + printf(_("  -e, --end=RECPTR   stop reading at WAL LSN RECPTR\n"));
> 
> 
> 
> Mmm.. "WAL LSN RECPTR" looks strange to me.  In the first place I
> don't think "RECPTR" is a user-facing term. Doesn't something like the
> following work?
> 
> 
> 
> + printf(_("  -e, --end=WAL-LSN   stop reading at WAL-LSN\n"));
> 
> 
> 
> Some changes in this patch shorten the main message text of
> fprintf-ish functions.  That lets the succeeding parameters be
> inlined.


> > 0005: Unhyphenate the word archive-recovery and similars.
> 
> 0005 - How about replacing "crash-recovery" to "crash recovery" in
> postgres-ref.sgml too?

Oh, that's a left-over. Fixed. Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 37f7433dce1ea9e41aa54f2c18c0bf3d2aa3c814 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Mar 2022 13:18:30 +0900
Subject: [PATCH v12 1/5] Correctly update control file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup based on the checkpoint that
has just finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state by the time the function is about to
update the control file, the function refrains from updating the file
at all and then proceeds to WAL cleanup using the latest REDO LSN,
which is now inconsistent with the control file.  As a result, the
subsequent cleanup removes more WAL files than the control file
allows and leaves the database unrecoverable until the next
checkpoint finishes.

Along with that fix, we remove a dead code path for the case where
some other process ran a simultaneous checkpoint.  It seems like just
a preventive measure, but it's no longer useful because we are sure
that checkpoint is performed 

Re: Can we consider "24 Hours" for "next day" in INTERVAL datatype ?

2022-03-15 Thread Julien Rouhaud
Hi,

On Tue, Mar 15, 2022 at 12:54:58PM +0530, Prabhat Sahu wrote:
>
> Kindly check the below scenario with INTERVAL datatype.
> 
> postgres=# select interval '01 20:59:59' + interval '00 05:00:01' as
> interval;
> interval
> 
>  1 day 26:00:00
> (1 row)
> 
> Any operation with INTERVAL data, We are changing the interval values as
> "60 sec" as "next minute"
> "60 min" as "next hour"
> *Similarly can't we consider "24 Hours" for "next day" ?*
> Is there any specific purpose we are holding the hours as an increasing
> number beyond 24 hours also?

Yes, you can't blindly assume that adding 24 hours will always be the same as
adding a day.  You can just use justify_days() if you want to force that
behavior.
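
For instance, justify_hours() is the variant that folds whole 24-hour
chunks into days (and justify_days() similarly folds 30-day chunks into
months):

postgres=# select justify_hours(interval '1 day 26:00:00');
  justify_hours
-----------------
 2 days 02:00:00
(1 row)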




Re: ICU for global collation

2022-03-15 Thread Julien Rouhaud
On Mon, Mar 14, 2022 at 01:50:50PM +0100, Peter Eisentraut wrote:
> On 05.03.22 09:38, Julien Rouhaud wrote:
> > I say it works because I did manually check, as far as I can see there isn't
> > any test that ensures it.
> >
> > I'm using this naive scenario:
> >
> > DROP DATABASE IF EXISTS dbicu;
> > CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'en_US' ICU_LOCALE 
> > 'en-u-kf-upper' template 'template0';
> > \c dbicu
> > CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
> > CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE 
> > upperfirst);
> > INSERT INTO icu VALUES ('a', 'a', 'a'), ('b','b','b'), ('A','A','A'), 
> > ('B','B','B');
> > SELECT def AS def FROM icu ORDER BY def;
> > SELECT def AS en FROM icu ORDER BY en;
> > SELECT def AS upfirst FROM icu ORDER BY upfirst;
> > SELECT def AS upfirst_explicit FROM icu ORDER BY en COLLATE upperfirst;
> > SELECT def AS en_x_explicit FROM icu ORDER BY def COLLATE "en-x-icu";
> >
> > Maybe there should be some test along those lines included in the patch?
>
> I added something like this to a new test suite under src/test/icu/.
> (src/test/locale/ was already used for something else.)

Great, thanks!

> > > The locale object in ICU is an identifier that specifies a particular 
> > > locale
> > > and has fields for language, country, and an optional code to specify 
> > > further
> > > variants or subdivisions. These fields also can be represented as a string
> > > with the fields separated by an underscore.
>
> I think the Localization chapter needs to be reorganized a bit, but I'll
> leave that for a separate patch.

WFM.

> > I spent some time looking at the ICU api trying to figure out if using a
> > posix locale name (e.g. en_US) was actually compatible with an ICU locale 
> > name.
> > It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the
> > same locale, but I might be wrong.  I also didn't find a way to figure out 
> > how
> > to ask ICU if the locale identifier passed is complete garbage or not.  One
> > sure thing is that the system collation we import are of the form 'en-us', 
> > so
> > it seems weird to have this form in pg_collation and by default another 
> > form in
> > pg_database.
>
> Yeah it seems to be inconsistent about that.  The locale ID documentation
> appears to indicate that "en_US" is the canonical form, but when you ask it
> to list all the locales it knows about it returns "en-US".

Yeah I saw that too when checking is POSIX locale names were valid, and that's
not great.
>
> > In CREATE DATABASE manual:
> >
> > +Specifies the provider to use for the default collation in this
> > +database.  Possible values are:
> > +
> > icu,ICU
> > +libc.  libc is the default.  
> > The
> > +available choices depend on the operating system and build options.
> >
> > That's actually not true as pg_strcasecmp is used in createdb():
>
> I don't understand what the concern is here.

Ah sorry I missed the , I thought the possible values listed were
icu, ICU and libc.  Shouldn't the ICU  be before the icu 
rather than the libc, or at least before the comma?
>
> Yeah, I think it is not the job of the initdb man page to lecture people
> about the merits of their locale choice.  The same performance warning can
> also be found in the localization chapter; we don't need to repeat it
> everywhere a locale choice is mentioned.

Ah, I tried to look for another place where the same warning could be found but
missed it.  I'm fine with it then!

> > While on that topic, the doc should probably mention that default ICU
> > collations can only be deterministic.
>
> Well, there is no option to do otherwise, so I'm not sure where/how to
> mention that.  We usually don't document options that don't exist. ;-)

Sure, but I'm afraid that users may still be tempted to use ICU locales like
und-u-ks-level2 from the case_insensitive example in the doc and hope that it
will work accordingly.  Or maybe it's just me that still sees ICU as dark magic
and want to be extra cautious.

Unrelated, but I just realized that we have PGC_INTERNAL gucs for lc_ctype and
lc_collate.  Should we add one for icu_locale too?

> > Also unrelated, but how about raising a warning about possibly hiding
> > corruption if you use CREATE COLLATION ...  VERSION or CREATE DATABASE ...
> > COLLATION_VERSION if !IsBinaryUpgrade()?
>
> Hmm, there is a point to that.  But then we should just make that an error.
> Otherwise, we raise the warning but then there is no way to "fix" what was
> warned about.  Seems confusing.

I was afraid that an error was a bit too much, but if that's acceptable I agree
that it's better.

> > in AlterCollation(), pg_collation_actual_version(), 
> > AlterDatabaseRefreshColl()
> > and pg_database_collation_actual_version():
> >
> > -   datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, 
> > );
> > -   Assert(!isnull);
> > -   newversion = 

Re: Can we consider "24 Hours" for "next day" in INTERVAL datatype ?

2022-03-15 Thread Laurenz Albe
On Tue, 2022-03-15 at 12:54 +0530, Prabhat Sahu wrote:
> Kindly check the below scenario with INTERVAL datatype.
> 
> postgres=# select interval '01 20:59:59' + interval '00 05:00:01' as interval;
>     interval    
> 
>  1 day 26:00:00
> (1 row)
> 
> Any operation with INTERVAL data, We are changing the interval values as 
> "60 sec" as "next minute"
> "60 min" as "next hour"
> Similarly can't we consider "24 Hours" for "next day" ?
> Is there any specific purpose we are holding the hours as an increasing 
> number beyond 24 hours also?
> 
> But when we are dealing with TIMESTAMP with INTERVAL values it's considered 
> the "24 Hours" for "next day".
> 
> postgres=# select timestamp '01-MAR-22 20:59:59' + interval '00 05:00:01'  as 
> interval;
>       interval       
> -
>  2022-03-02 02:00:00
> (1 row)

The case is different with days:

test=> SELECT TIMESTAMPTZ '2022-03-26 20:00:00 Europe/Vienna' + INTERVAL '12 
hours' + INTERVAL '12 hours';
?column?

 2022-03-27 21:00:00+02
(1 row)

test=> SELECT TIMESTAMPTZ '2022-03-26 20:00:00 Europe/Vienna' + INTERVAL '1 
day';
?column?

 2022-03-27 20:00:00+02
(1 row)

Yours,
Laurenz Albe





Re: Allow async standbys wait for sync replication

2022-03-15 Thread Bharath Rupireddy
On Wed, Mar 9, 2022 at 7:31 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-03-06 12:27:52 +0530, Bharath Rupireddy wrote:
> > On Sun, Mar 6, 2022 at 1:57 AM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2022-03-05 14:14:54 +0530, Bharath Rupireddy wrote:
> > > > I understand. Even if we use the SyncRepWaitForLSN approach, the async
> > > > walsenders will have to do nothing in WalSndLoop() until the sync
> > > > walsender wakes them up via SyncRepWakeQueue.
> > >
> > > I still think we should flat out reject this approach. The proper way to
> > > implement this feature is to change the protocol so that WAL can be sent 
> > > to
> > > replicas with an additional LSN informing them up to where WAL can be
> > > flushed. That way WAL is already sent when the sync replicas have 
> > > acknowledged
> > > receipt and just an updated "flush/apply up to here" LSN has to be sent.
> >
> > I was having this thought back of my mind. Please help me understand these:
> > 1) How will the async standbys ignore the WAL received but
> > not-yet-flushed by them in case the sync standbys don't acknowledge
> > flush LSN back to the primary for whatever reasons?
>
> What do you mean with "ignore"? When replaying?

Let me illustrate with an example:

1) Say, primary at LSN 100, sync standby at LSN 90 (is about to
receive/receiving the WAL from LSN 91 - 100 from primary), async
standby at LSN 100 - today this is possible if the async standby is
closer to primary than sync standby for whatever reasons
2) With the approach that's originally proposed in this thread - async
standbys can never get ahead of LSN 90 (flush LSN reported back to the
primary by all sync standbys)
3) With the approach that's suggested i.e. "let async standbys receive
WAL at their own pace, but they should only be allowed to
apply/write/flush to the WAL file in pg_wal directory/disk until the
sync standbys latest flush LSN" - async standbys can receive the WAL
from LSN 91 - 100 but they aren't allowed to apply/write/flush. Where
will the async standbys hold the WAL from LSN 91 - 100 until the
latest flush LSN (100) is reported to them? If they "somehow" store
the WAL from LSN 91 - 100 and not apply/write/flush, how will they
ignore that WAL, say if the sync standbys don't report the latest
flush LSN back to the primary (for whatever reasons)? In such cases,
the primary has no idea of the latest sync standbys flush LSN (?) if
at all the sync standbys can't come up and reconnect and resync with
the primary? Should the async standby always assume that the WAL from
LSN 91 -100 is invalid for them as they haven't received the sync
flush LSN from primary? In such a case, aren't there "invalid holes"
in the WAL files on the async standbys?
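
For concreteness, the kind of setup described above is nothing special:
one synchronous and one asynchronous standby, e.g. on the primary (the
standby names are illustrative and correspond to application_name):

ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (sync_standby)';
SELECT pg_reload_conf();

async_standby is simply not listed there, so it streams asynchronously.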

> I think this'd require adding a new pg_control field saying up to which LSN
> WAL is "valid". If that field is set, replay would only replay up to that LSN
> unless some explicit operation is taken to replay further (e.g. for data
> recovery).

With the approach that's suggested i.e. "let async standbys receive
WAL at their own pace, but they should only be allowed to
apply/write/flush to the WAL file in pg_wal directory/disk until the
sync standbys latest flush LSN'' - there can be 2 parts to the WAL on
async standbys - most of it "valid and makes sense for async standbys"
and some of it "invalid and doesn't make sense for async standbys''?
Can't this require us to rework some parts like "redo/apply/recovery
logic on async standbys'', tools like pg_basebackup, pg_rewind,
pg_receivewal, pg_recvlogical, cascading replication etc. that depend
on WAL records and now should know whether the WAL records are valid
for them? I may be wrong here though.

> > 2) When we say the async standbys will receive the WAL, will they just
> > keep the received WAL in the shared memory but not apply or will they
> > just write but not apply the WAL and flush the WAL to the pg_wal
> > directory on the disk or will they write to some other temp wal
> > directory until they receive go-ahead LSN from the primary?
>
> I was thinking that for now it'd go to disk, but eventually would first go to
> wal_buffers and only to disk if wal_buffers needs to be flushed out (and only
> in that case the pg_control field would need to be set).

IIUC, the WAL buffers (XLogCtl->pages) aren't used on standbys as wal
receivers bypass them and flush the data directly to the disk. Hence,
the WAL buffers that are allocated(?, I haven't checked the code
though) but unused on standbys can be used to hold the WAL until the
new flush LSN is reported from the primary. At any point of time, the
WAL buffers will have the latest WAL that's waiting for a new flush
LSN from the primary. However, this can be a problem for larger
transactions that can eat up the entire WAL buffers and flush LSN is
far behind in which case we need to flush the WAL to the latest WAL
file in pg_wal/disk but let the other folks in the server know upto
which the WAL is valid.

> > 3) Won't the 

Re: BufferAlloc: don't take two simultaneous locks

2022-03-15 Thread Kyotaro Horiguchi
Thanks for the new version.

At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov  
wrote in 
> В Пн, 14/03/2022 в 14:57 +0300, Yura Sokolov пишет:
> > В Пн, 14/03/2022 в 17:12 +0900, Kyotaro Horiguchi пишет:
> > > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov 
> > >  wrote in 
> > > > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет:
> > > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > > 128kB shared buffers and I saw that get_hash_entry never takes the
> > > !element_alloc() path and always allocate a fresh entry, then
> > > saturates at 30 new elements allocated at the medium of a 100 seconds
> > > run.
> > > 
> > > Then, I tried the same with the patch, and I am surprized to see that
> > > the rise of the number of newly allocated elements didn't stop and
> > > went up to 511 elements after the 100 seconds run.  So I found that my
> > > concern was valid.  The change in dynahash actually
> > > continuously/repeatedly causes lack of free list entries.  I'm not
> > > sure how much the impact given on performance if we change
> > > get_hash_entry to prefer other freelists, though.
> > 
> > Well, it is quite strange SharedBufHash is not allocated as
> > HASH_FIXED_SIZE. Could you check what happens with this flag set?
> > I'll try as well.
> > 
> > Other way to reduce observed case is to remember freelist_idx for
> > reused entry. I didn't believe it matters much since entries migrated
> > netherless, but probably due to some hot buffers there are tention to
> > crowd particular freelist.
> 
> Well, I did both. Everything looks ok.

Hmm. v8 returns the stashed element with its original partition index
when the element is *not* reused.  But what I saw in the previous test
runs is the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version
looks like it behaves the same way as (or somehow even worse than) the
previous version.  get_hash_entry continuously suffers from a lack of
freelist entries. (FWIW, attached are the test-output diffs for both
master and patched)

master finally allocated 31 fresh elements for a 100s run.

> ALLOCED: 31;; freshly allocated

v8 finally borrowed 33620 times from another freelist and 0 freshly
allocated (ah, this version changes that..)
Finally v8 results in:

> RETURNED: 50806;; returned stashed elements
> BORROWED: 33620;; borrowed from another freelist
> REUSED: 1812664;; stashed
> ASSIGNED: 1762377  ;; reused
>(ALLOCED: 0);; freshly allocated

The numbers include a huge degradation from the frequent elog calls, so
they cannot be naively relied on, but they should show what is
happening well enough.

> I lost access to Xeon 8354H, so returned to old Xeon X5675.
...
> Strange thing: both master and patched version has higher
> peak tps at X5676 at medium connections (17 or 27 clients)
> than in first october version [1]. But lower tps at higher
> connections number (>= 191 clients).
> I'll try to bisect on master this unfortunate change.

The reversing of the preference order between freshly-allocation and
borrow-from-another-freelist might affect.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940fa..ac651b98e6 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -31,7 +31,7 @@ typedef struct
int id; /* Associated buffer ID */
 } BufferLookupEnt;
 
-static HTAB *SharedBufHash;
+HTAB *SharedBufHash;
 
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d70..294516ef01 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -195,6 +195,11 @@ struct HASHHDR
long ssize; /* segment size --- must be power of 2 */
int sshift; /* segment shift = log2(ssize) */
int nelem_alloc; /* number of entries to allocate at once */
+   int alloc;
+   int reuse;
+   int borrow;
+   int assign;
+   int ret;
 
 #ifdef HASH_STATISTICS
 
@@ -963,6 +968,7 @@ hash_search(HTAB *hashp,
   
foundPtr);
 }
 
+extern HTAB *SharedBufHash;
 void *
 hash_search_with_hash_value(HTAB *hashp,
const void *keyPtr,
@@ -1354,6 +1360,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
hctl->freeList[freelist_idx].nentries++;

SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
 
+   if (hashp == SharedBufHash)
+   elog(LOG, "BORROWED: %d", 
++hctl->borrow);
return newElement;
}
 
@@ -1363,6 +1371,8 @@ 

Can we consider "24 Hours" for "next day" in INTERVAL datatype ?

2022-03-15 Thread Prabhat Sahu
Hi All,
Kindly check the below scenario with INTERVAL datatype.

postgres=# select interval '01 20:59:59' + interval '00 05:00:01' as
interval;
interval

 1 day 26:00:00
(1 row)

Any operation with INTERVAL data, We are changing the interval values as
"60 sec" as "next minute"
"60 min" as "next hour"
*Similarly can't we consider "24 Hours" for "next day" ?*
Is there any specific purpose we are holding the hours as an increasing
number beyond 24 hours also?

But when we are dealing with TIMESTAMP with INTERVAL values it's considered
the "24 Hours" for "next day".

postgres=# select timestamp '01-MAR-22 20:59:59' + interval '00 05:00:01'
 as interval;
  interval
-
 2022-03-02 02:00:00
(1 row)

-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Bharath Rupireddy
On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi
 wrote:
>
> At Mon, 14 Feb 2022 14:52:15 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > In this version , 0001 gets one fix and two comment updates.
>
> While the disucssion on back-patching of 0001 proceeding on another
> branch, the main patch iself gets looks like as if rotten on
> CF-App. So I rebased v10 on the current master.  0001 is replaced by
> an adjusted patch based on the latest "control file update fix" patch.
>
> 0001: Fix possible incorrect controlfile update that leads to
>   unrecoverable database.

0001 - I don't think you need to do this as UpdateControlFile
(update_controlfile) will anyway update it, no?
+ /* Update control file using current time */
+ ControlFile->time = (pg_time_t) time(NULL);

> 0002: Add REDO/Checkpiont LSNs to checkpoinkt-end log message.
>   (The main patch in this thread)

0002 - If at all the intention is to say that no ControlFileLock is
required while reading ControlFile->checkPoint and
ControlFile->checkPointCopy.redo, let's say it, no? How about
something like "No ControlFileLock is required while reading
ControlFile->checkPoint and ControlFile->checkPointCopy.redo as there
can't be any other process updating them concurrently."?

+ /* we are the only updator of these variables */
+ LSN_FORMAT_ARGS(ControlFile->checkPoint),
+ LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;

> 0003: Replace (WAL-)location to LSN in pg_controldata.
>
> 0004: Replace (WAL-)location to LSN in user-facing texts.
>   (This doesn't reflect my recent comments.)

If you don't mind, can you please put the comments here?

> 0005: Unhyphenate the word archive-recovery and similars.

0005 - How about replacing "crash-recovery" to "crash recovery" in
postgres-ref.sgml too?

Regards,
Bharath Rupireddy.




Re: Support logical replication of DDLs

2022-03-15 Thread rajesh singarapu
+

On Sun, Mar 13, 2022 at 5:05 PM Dilip Kumar  wrote:
>
> On Mon, Feb 21, 2022 at 9:43 PM Zheng Li  wrote:
> >
> > Hello,
> >
> > One of the most frequently requested improvements from our customers
> > is to reduce downtime associated with software updates (both major and
> > minor versions). To do this, we have reviewed potential contributions to
> > improving logical replication.
> >
> > I’m working on a patch to support logical replication of data
> > definition language statements (DDLs). This is a useful feature when a
> > database in logical replication has lots of tables, functions and
> > other objects that change over time, such as in online cross major
> > version upgrade.
>
> +1
+1


> > I put together a prototype that replicates DDLs using the generic
> > messages for logical decoding. The idea is to log the candidate DDL
> > string in ProcessUtilitySlow() using LogLogicalMessge() with a new
> > flag in WAL record type xl_logical_message indicating it’s a DDL
> > message. The xl_logical_message record is decoded and sent to the
> > subscriber via pgoutput. The logical replication worker process is
> > dispatched for this new DDL message type and executes the command
> > accordingly.
>
> If you don't mind, would you like to share the POC or the branch for this 
> work?
I would love to try this patch out. Would you like to share a branch or POC?




Re: generic plans and "initial" pruning

2022-03-15 Thread Amit Langote
On Tue, Mar 15, 2022 at 5:06 AM Robert Haas  wrote:
> On Mon, Mar 14, 2022 at 3:38 PM Tom Lane  wrote:
> > What I am skeptical about is that this work actually accomplishes
> > anything under real-world conditions.  That's because if pruning would
> > save enough to make skipping the lock-acquisition phase worth the
> > trouble, the plan cache is almost certainly going to decide it should
> > be using a custom plan not a generic plan.  Now if we had a better
> > cost model (or, indeed, any model at all) for run-time pruning effects
> > then maybe that situation could be improved.  I think we'd be better
> > served to worry about that end of it before we spend more time making
> > the executor even less predictable.
>
> I don't agree with that analysis, because setting plan_cache_mode is
> not uncommon. Even if that GUC didn't exist, I'm pretty sure there are
> cases where the planner naturally falls into a generic plan anyway,
> even though pruning is happening. But as it is, the GUC does exist,
> and people use it. Consequently, while I'd love to see something done
> about the costing side of things, I do not accept that all other
> improvements should wait for that to happen.

I agree that making generic plans execute faster has merit even before
we make the costing changes to allow plancache.c to prefer generic plans
over custom ones in these cases.  As the numbers in my previous email
show, simply executing a generic plan with the proposed improvements
applied is significantly cheaper than having the planner do the
pruning on every execution:

 nparts  auto/custom  generic
 ======  ===========  =======
     32        13359    28204
     64        15760    26795
    128        15825    26387
    256        15017    25601
    512        13479    19911
   1024        13200    20158
   2048        12884    16180
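
For anyone wanting to reproduce this sort of comparison, a setup along
these lines works (an illustrative sketch, not necessarily the exact
script behind the numbers above):

CREATE TABLE ht (a int) PARTITION BY HASH (a);
SELECT format('CREATE TABLE ht_%s PARTITION OF ht
               FOR VALUES WITH (MODULUS 32, REMAINDER %s)', i, i)
FROM generate_series(0, 31) AS i \gexec

SET plan_cache_mode = force_generic_plan;  -- vs. auto / force_custom_plan
PREPARE q(int) AS SELECT count(*) FROM ht WHERE a = $1;
EXECUTE q(1);  -- run-time pruning leaves a single partition to touch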

> > Also, while I've not spent much time at all reading this patch,
> > it seems rather desperately undercommented, and a lot of the
> > new names are unintelligible.  In particular, I suspect that the
> > patch is significantly redesigning when/where run-time pruning
> > happens (unless it's just letting that be run twice); but I don't
> > see any documentation or name changes suggesting where that
> > responsibility is now.
>
> I am sympathetic to that concern. I spent a while staring at a
> baffling comment in 0001 only to discover it had just been moved from
> elsewhere. I really don't feel that things in this are as clear as
> they could be -- although I hasten to add that I respect the people
> who have done work in this area previously and am grateful for what
> they did. It's been a huge benefit to the project in spite of the
> bumps in the road. Moreover, this isn't the only code in PostgreSQL
> that needs improvement, or the worst. That said, I do think there are
> problems. I don't yet have a position on whether this patch is making
> that better or worse.

Okay, I'd like to post a new version with the comments edited to make
them a bit more intelligible.  I understand that the comments around
the new invocation mode(s) of runtime pruning are not as clear as they
should be, especially as the changes that this patch wants to make to
how things work are not very localized.

> That said, I believe that the core idea of the patch is to optionally
> perform pruning before we acquire locks or spin up the main executor
> and then remember the decisions we made. If once the main executor is
> spun up we already made those decisions, then we must stick with what
> we decided. If not, we make those pruning decisions at the same point
> we do currently

Right.  The "initial" pruning, that this patch wants to make occur at
an earlier point (plancache.c), is currently performed in
ExecInit[Merge]Append().

If it does occur early due to the plan being a cached one,
ExecInit[Merge]Append() simply refers to its result that would be made
available via a new data structure that plancache.c has been made to
pass down to the executor alongside the plan tree.

If it does not, ExecInit[Merge]Append() does the pruning in the same
way it does now.  Such cases include initial pruning that uses only
STABLE expressions (which the planner doesn't bother to compute by
itself, lest the resulting plan be cached) but no EXTERN parameters.
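
A hypothetical example of that kind of case (assuming an orders table
range-partitioned on order_date), with the partition key compared
against a STABLE expression rather than a parameter:

PREPARE q AS SELECT count(*) FROM orders WHERE order_date = current_date;
EXECUTE q;  -- current_date is STABLE, so the pruning happens at
            -- executor startup rather than at plan time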

--
Amit Langote
EDB: http://www.enterprisedb.com



