Re: New "blob" re-introduced?

2023-02-23 Thread Daniel Gustafsson
> On 24 Feb 2023, at 08:40, Kyotaro Horiguchi  wrote:

> Mmm. The following changes of e9960732a9 seem like reverting the
> previous commit 35ce24c333...

Fixed in 94851e4b90, thanks for the report!

--
Daniel Gustafsson





Re: New "blob" re-introduced?

2023-02-23 Thread Kyotaro Horiguchi
At Fri, 24 Feb 2023 16:31:27 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I noticed that the commit e9960732a9 introduced the following message.
> 
> + if (EndCompressFileHandle(ctx->dataFH) != 0)
> + pg_fatal("could not close blob data file: %m");
> 
> It seems that we have removed the terminology "blob(s)" from
> user-facing messages by the commit 35ce24c333 (discussion is [1]).
> Shouldn't we use "large object" instead of "blob" in the message?
> 
> 
> [1] 
> https://www.postgresql.org/message-id/868a381f-4650-9460-1726-1ffd39a270b4%40enterprisedb.com

Mmm. The following changes of e9960732a9 seem like reverting the
previous commit 35ce24c333...

e9960732a9 @ 2023/2/23:
-   if (cfclose(ctx->dataFH) != 0)
-   pg_fatal("could not close LO data file: %m");
+   /* Close the BLOB data file itself */
+   if (EndCompressFileHandle(ctx->dataFH) != 0)
+   pg_fatal("could not close blob data file: %m");
-   if (cfwrite(buf, len, ctx->LOsTocFH) != len)
-   pg_fatal("could not write to LOs TOC file");
+   if (CFH->write_func(buf, len, CFH) != len)
+   pg_fatal("could not write to blobs TOC file");
..
-   if (cfclose(ctx->LOsTocFH) != 0)
-   pg_fatal("could not close LOs TOC file: %m");
+   if (EndCompressFileHandle(ctx->LOsTocFH) != 0)
+   pg_fatal("could not close blobs TOC file: %m");

35ce24c333 @ 2022/12/5:
-   pg_fatal("could not close blob data file: %m");
+   pg_fatal("could not close LO data file: %m");
...
-   if (cfwrite(buf, len, ctx->blobsTocFH) != len)
-   pg_fatal("could not write to blobs TOC file");
+   if (cfwrite(buf, len, ctx->LOsTocFH) != len)
+   pg_fatal("could not write to LOs TOC file");
...
-   pg_fatal("could not close blob data file: %m");
+   pg_fatal("could not close LO data file: %m");

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: New "blob" re-introduced?

2023-02-23 Thread Daniel Gustafsson
> On 24 Feb 2023, at 08:31, Kyotaro Horiguchi  wrote:

> Shouldn't we use "large object" instead of "blob" in the message?

Nice catch, it should be "large object" as per the linked discussion.  There
are also a few more like:

-   if (cfclose(ctx->LOsTocFH) != 0)
-   pg_fatal("could not close LOs TOC file: %m");
+   if (EndCompressFileHandle(ctx->LOsTocFH) != 0)
+   pg_fatal("could not close blobs TOC file: %m");

I'll go ahead and fix them, thanks for the report!

--
Daniel Gustafsson





New "blob" re-introduced?

2023-02-23 Thread Kyotaro Horiguchi
I noticed that the commit e9960732a9 introduced the following message.

+   if (EndCompressFileHandle(ctx->dataFH) != 0)
+   pg_fatal("could not close blob data file: %m");

It seems that we have removed the terminology "blob(s)" from
user-facing messages by the commit 35ce24c333 (discussion is [1]).
Shouldn't we use "large object" instead of "blob" in the message?


[1] 
https://www.postgresql.org/message-id/868a381f-4650-9460-1726-1ffd39a270b4%40enterprisedb.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Proposal: :SQL_EXEC_TIME (like :ROW_COUNT) Variable (psql)

2023-02-23 Thread Gurjeet Singh
On Thu, Feb 23, 2023 at 8:42 PM Kirk Wolak  wrote:
>   I love that my proposal for %T in the prompt, triggered some great 
> conversations.
>
>   This is not instead of that.  That lets me run a query and come back HOURS 
> later, and know it finished before 7PM like it was supposed to!

Neat! I have this info embedded in my Bash prompt [1], but many a
times this is not sufficient to reconstruct the time it took to run
the shell command.

>   This feature is simple.  We forget to set \timing on...
> We run a query, and we WONDER... how long did  that take.

And so I empathize with this need. I have set my Bash prompt to show
me this info [2]. This info is very helpful in situations where you
fire a command, get tired of waiting for it, and walk away for a few
minutes. Upon return it's very useful to see exactly how long the
command took to finish.

>   I am not sure the name is right, but I would like to report it in the same 
> (ms) units as \timing, since there is an implicit relationship in what they 
> are doing.
>
>   I think like ROW_COUNT, it should not change because of internal commands.

+1

> So, you guys +1 this thing, give additional comments.  When the feedback 
> settles, I commit to making it happen.

This is definitely a useful feature. I agree with everything in the
proposed UI (reporting in milliseconds, don't track internal commands'
timing).

I think 'duration' or 'elapsed' would be better words in this
context. So perhaps the name could be one of :sql_exec_duration (the
sql prefix feels superfluous), :exec_duration, :command_duration, or
:elapsed_time.
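
To make that concrete, here is roughly how I would expect it to be
used in a script (the name :elapsed_time is just one of the candidates
above; nothing is settled yet):

-- hypothetical usage; assumes the variable ends up being called
-- :elapsed_time and reports milliseconds, like \timing
SELECT count(*) FROM pg_class;
\echo last query took :elapsed_time ms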

By using \timing, the user is explicitly opting into any overhead
caused by time-keeping. With this feature, the timing info will be
collected all the time. So do consider evaluating the performance
impact this can cause on people's workloads. They may not care for the
impact in interactive mode, but in automated scripts, even a moderate
performance overhead would be a deal-breaker.

[1]: 
https://github.com/gurjeet/home/blob/08f1051fb854f4fc8fbc4f1326f393ed507a55ce/.bashrc#L278
[2]: 
https://github.com/gurjeet/home/blob/08f1051fb854f4fc8fbc4f1326f393ed507a55ce/.bashrc#L262

Best regards,
Gurjeet
http://Gurje.et




Re: Add index scan progress to pg_stat_progress_vacuum

2023-02-23 Thread Masahiko Sawada
On Tue, Feb 21, 2023 at 1:48 AM Imseih (AWS), Sami  wrote:
>
> Thanks!
>
> >  I think PROGRESS_VACUUM_INDEXES_TOTAL and
> >   PROGRESS_VACUUM_INDEXES_PROCESSED are better for consistency. The rest
> >   looks good to me.
>
> Took care of that in v25.

Thanks! It looks good to me so I've marked it as Ready for Committer.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Doc updates for MERGE

2023-02-23 Thread Dean Rasheed
Attached is a patch fixing a few doc omissions for MERGE.

I don't think that it's necessary to update every place that could
possibly apply to MERGE, but there are a few places where we give a
list of commands that may be used in a particular context, and I think
those should mention MERGE, if it applies.

Also, there were a couple of other places where it seemed worth giving
slightly more detail about what happens for MERGE.

Regards,
Dean
diff --git a/doc/src/sgml/arch-dev.sgml b/doc/src/sgml/arch-dev.sgml
new file mode 100644
index 0c7a53c..976db1e
--- a/doc/src/sgml/arch-dev.sgml
+++ b/doc/src/sgml/arch-dev.sgml
@@ -530,13 +530,15 @@

 

-The executor mechanism is used to evaluate all four basic SQL query
+The executor mechanism is used to evaluate all five basic SQL query
 types: SELECT, INSERT,
-UPDATE, and DELETE.
+UPDATE, DELETE, and
+MERGE.
 For SELECT, the top-level executor code
 only needs to send each row returned by the query plan tree
 off to the client.  INSERT ... SELECT,
-UPDATE, and DELETE
+UPDATE, DELETE, and
+MERGE
 are effectively SELECTs under a special
 top-level plan node called ModifyTable.

@@ -552,7 +554,13 @@
 mark the old row deleted.  For DELETE, the only
 column that is actually returned by the plan is the TID, and the
 ModifyTable node simply uses the TID to visit each
-target row and mark it deleted.
+target row and mark it deleted.  For MERGE, the
+planner joins the source and target relations, and includes all
+column values required by any of the WHEN clauses,
+plus the TID of the target row; this data is fed up to the
+ModifyTable node, which uses the information to
+work out which WHEN clause to execute, and then
+inserts, updates or deletes the target row, as required.

 

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
new file mode 100644
index 8dc8d7a..5179125
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1804,7 +1804,8 @@ REVOKE ALL ON accounts FROM PUBLIC;
view, or other table-like object.
Also allows use of COPY TO.
This privilege is also needed to reference existing column values in
-   UPDATE or DELETE.
+   UPDATE, DELETE,
+   or MERGE.
For sequences, this privilege also allows use of the
currval function.
For large objects, this privilege allows the object to be read.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
new file mode 100644
index f180607..9d0deae
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1611,7 +1611,8 @@ synchronous_standby_names = 'ANY 2 (s1,
  
   
Data Manipulation Language (DML): INSERT,
-   UPDATE, DELETE, COPY FROM,
+   UPDATE, DELETE,
+   MERGE, COPY FROM,
TRUNCATE.
Note that there are no allowed actions that result in a trigger
being executed during recovery.  This restriction applies even to
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
new file mode 100644
index 54f329c..90822b3
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -788,9 +788,10 @@ ROLLBACK;
 

 As seen in this example, when the query is an INSERT,
-UPDATE, or DELETE command, the actual work of
+UPDATE, DELETE, or
+MERGE command, the actual work of
 applying the table changes is done by a top-level Insert, Update,
-or Delete plan node.  The plan nodes underneath this node perform
+Delete, or Merge plan node.  The plan nodes underneath this node perform
 the work of locating the old rows and/or computing the new data.
 So above, we see the same sort of bitmap table scan we've seen already,
 and its output is fed to an Update node that stores the updated rows.
@@ -803,7 +804,8 @@ ROLLBACK;

 

-When an UPDATE or DELETE command affects an
+When an UPDATE, DELETE, or
+MERGE command affects an
 inheritance hierarchy, the output might look like this:
 
 
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 8897a54..7c8a49f
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -1044,7 +1044,8 @@ INSERT INTO mytable VALUES (1,'one'), (2
  PL/pgSQL variable values can be
  automatically inserted into optimizable SQL commands, which
  are SELECT, INSERT,
- UPDATE, DELETE, and certain
+ UPDATE, DELETE,
+ MERGE, and certain
  utility commands that incorporate one of these, such
  as EXPLAIN and CREATE TABLE ... AS
  SELECT.  In these commands,
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
new file mode 100644
index 93fc716..73b7f44
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -4116,6 +4116,13 @@ psql "dbname=postgres replication=databa

 

+For a MERGE command, the tag is
+

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-02-23 Thread Masahiko Sawada
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
 wrote:
>
>
> On Wed, Feb 22, 2023 at 3:29 PM Masahiko Sawada  wrote:
> >
> > On Wed, Feb 22, 2023 at 4:35 PM John Naylor
> >  wrote:
> > >
> > >  I don't think any vacuum calls in regression tests would stress any of 
> > > this code very much, so it's not worth carrying the old way forward. I 
> > > was thinking of only doing this as a short-time sanity check for testing 
> > > a real-world workload.
> >
> > I guess that It would also be helpful at least until the GA release.
> > People will be able to test them easily on their workloads or their
> > custom test scenarios.
>
> That doesn't seem useful to me. If we've done enough testing to reassure us 
> the new way always gives the same answer, the old way is not needed at commit 
> time. If there is any doubt it will always give the same answer, then the 
> whole patchset won't be committed.

True. Even if we've done enough testing, we cannot claim there are no
bugs. My idea was to make bug investigation easier, but on reflection
it doesn't seem like the best approach for that purpose. Instead, it
seems better to add the necessary assertions. What do you think about
the attached patch? Please note that it also includes the changes for
the minimum memory requirement.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/access/common/tidstore.c 
b/src/backend/access/common/tidstore.c
index 9360520482..fc20e58a95 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -75,6 +75,14 @@ typedef uint64 offsetbm;
 #define LOWER_OFFSET_NBITS 6   /* log(sizeof(offsetbm), 2) */
 #define LOWER_OFFSET_MASK  ((1 << LOWER_OFFSET_NBITS) - 1)
 
+/*
+ * The minimum amount of memory required by TidStore is 2MB, the current 
minimum
+ * valid value for the maintenance_work_mem GUC. This is required to allocate 
the
+ * DSA initial segment, 1MB, and some meta data. This number is applied also to
+ * the local TidStore cases for simplicity.
+ */
+#define TIDSTORE_MIN_MEMORY(2 * 1024 * 1024L)  /* 2MB */
+
 /* A magic value used to identify our TidStore. */
 #define TIDSTORE_MAGIC 0x826f6a10
 
@@ -101,7 +109,7 @@ typedef struct TidStoreControl
 
/* These values are never changed after creation */
size_t  max_bytes;  /* the maximum bytes a TidStore can use 
*/
-   int max_off;/* the maximum offset number */
+   OffsetNumbermax_off;/* the maximum offset number */
int max_off_nbits;  /* the number of bits required for 
offset
 * numbers */
int upper_off_nbits;/* the number of bits of offset 
numbers
@@ -174,10 +182,17 @@ static inline tidkey encode_tid(TidStore *ts, ItemPointer 
tid, offsetbm *off_bit
  * The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
  */
 TidStore *
-TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
+TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *area)
 {
TidStore*ts;
 
+   Assert(max_off <= MaxOffsetNumber);
+
+   /* Sanity check for the max_bytes */
+   if (max_bytes < TIDSTORE_MIN_MEMORY)
+   elog(ERROR, "memory for TidStore must be at least %ld, but %zu 
provided",
+TIDSTORE_MIN_MEMORY, max_bytes);
+
ts = palloc0(sizeof(TidStore));
 
/*
@@ -192,8 +207,8 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area 
*area)
 * In local TidStore cases, the radix tree uses slab allocators for 
each kind
 * of node class. The most memory consuming case while adding Tids 
associated
 * with one page (i.e. during TidStoreSetBlockOffsets()) is that we 
allocate a new
-* slab block for a new radix tree node, which is approximately 70kB. 
Therefore,
-* we deduct 70kB from the max_bytes.
+* slab block for a new radix tree node, which is approximately 70kB at 
most.
+* Therefore, we deduct 70kB from the max_bytes.
 *
 * In shared cases, DSA allocates the memory segments big enough to 
follow
 * a geometric series that approximately doubles the total DSA size (see
@@ -378,6 +393,7 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, 
OffsetNumber *offsets,
const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
 
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+   Assert(BlockNumberIsValid(blkno));
 
bitmaps = palloc(sizeof(offsetbm) * nkeys);
key = prev_key = key_base;
@@ -386,6 +402,8 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, 
OffsetNumber *offsets,
{
offsetbmoff_bit;
 
+   Assert(offsets[i] <= ts->control->max_off);
+
/* encode the tid to a key and partial offset */
  

Re: MERGE ... RETURNING

2023-02-23 Thread Dean Rasheed
On Tue, 7 Feb 2023 at 10:56, Dean Rasheed  wrote:
>
> Attached is a more complete patch
>

Rebased version attached.

Regards,
Dean
diff --git a/doc/src/sgml/dml.sgml b/doc/src/sgml/dml.sgml
new file mode 100644
index cbbc5e2..ff2a827
--- a/doc/src/sgml/dml.sgml
+++ b/doc/src/sgml/dml.sgml
@@ -283,10 +283,15 @@ DELETE FROM products;
RETURNING
   
 
+  
+   MERGE
+   RETURNING
+  
+
   
Sometimes it is useful to obtain data from modified rows while they are
being manipulated.  The INSERT, UPDATE,
-   and DELETE commands all have an
+   DELETE, and MERGE commands all have an
optional RETURNING clause that supports this.  Use
of RETURNING avoids performing an extra database query to
collect the data, and is especially valuable when it would otherwise be
@@ -339,6 +344,24 @@ DELETE FROM products
 
   
 
+  
+   In a MERGE, the data available to RETURNING is
+   the content of the source row plus the content of the inserted, updated, or
+   deleted target row.  In addition, the merge support functions
+and 
+   may be used to return information about which merge action was executed for
+   each row.  Since it is quite common for the source and target to have many
+   of the same columns, specifying RETURNING * can lead to a
+   lot of duplicated columns, so it is often more useful to just return the
+   target row.  For example:
+
+MERGE INTO products p USING new_products n ON p.product_no = n.product_no
+  WHEN NOT MATCHED THEN INSERT VALUES (n.product_no, n.name, n.price)
+  WHEN MATCHED THEN UPDATE SET name = n.name, price = n.price
+  RETURNING pg_merge_action(), p.*;
+
+  
+
   
If there are triggers () on the target table,
the data available to RETURNING is the row as modified by
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 0cbdf63..224e9b8
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21216,6 +21216,99 @@ SELECT count(*) FROM sometable;
 
  
 
+ 
+  Merge Support Functions
+
+  
+   MERGE
+   RETURNING
+  
+
+  
+   The merge support functions shown in
+may be used in the
+   RETURNING list of a  command
+   to return additional information about the action taken for each row.
+  
+
+  
+Merge Support Functions
+
+
+ 
+  
+   
+Function
+   
+   
+Description
+   
+  
+ 
+
+ 
+  
+   
+
+ pg_merge_action
+
+pg_merge_action ( )
+text
+   
+   
+Returns the action performed on the current row ('INSERT',
+'UPDATE', or 'DELETE').
+   
+  
+
+  
+   
+
+ pg_merge_when_clause
+
+pg_merge_when_clause ( )
+integer
+   
+   
+Returns the 1-based index of the WHEN clause executed
+for the current row.
+   
+  
+ 
+
+  
+
+  
+   Example:
+
+  
+
+  
+   Note that it is an error to use these functions anywhere other than the
+   RETURNING list of a MERGE command.
+  
+
+ 
+
  
   Subquery Expressions
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index 7c01a54..4317cfc
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1323,9 +1323,9 @@
  to a client upon the
  completion of an SQL command, usually a
  SELECT but it can be an
- INSERT, UPDATE, or
- DELETE command if the RETURNING
- clause is specified.
+ INSERT, UPDATE,
+ DELETE, or MERGE command if the
+ RETURNING clause is specified.
 
 
  The fact that a result set is a relation means that a query can be used
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 8897a54..3249e9c
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -1020,8 +1020,8 @@ INSERT INTO mytable VALUES (1,'one'), (2
 
 
 
- If the command does return rows (for example SELECT,
- or INSERT/UPDATE/DELETE
+ If the command does return rows (for example SELECT, or
+ INSERT/UPDATE/DELETE/MERGE
  with RETURNING), there are two ways to proceed.
  When the command will return at most one row, or you only care about
  the first row of output, write the command as usual but add
@@ -1044,7 +1044,8 @@ INSERT INTO mytable VALUES (1,'one'), (2
  PL/pgSQL variable values can be
  automatically inserted into optimizable SQL commands, which
  are SELECT, INSERT,
- UPDATE, DELETE, and certain
+ UPDATE, DELETE,
+ MERGE and certain
  utility commands that incorporate one of these, such
  as EXPLAIN and CREATE TABLE ... AS
  SELECT.  In these commands,
@@ -1148,6 +1149,7 @@ SELECT select_expressionsexpressions INTO STRICT target;
 UPDATE ... RETURNING expressions INTO STRICT target;
 DELETE ... RETURNING expressions INTO STRICT target;
+MERGE ... RETURNING expressions INTO STRICT target;
 
 
  where target can be a record variable, a row
@@ -1158,8 

Re: Switching XLog source from archive to streaming when primary available

2023-02-23 Thread Bharath Rupireddy
On Wed, Nov 16, 2022 at 11:39 AM Bharath Rupireddy
 wrote:
>
> I'm attaching the v10 patch for further review.

Needed a rebase. I'm attaching the v11 patch for further review.
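
(For reference, with the attached patch an operator would enable the
behaviour with something like the following. Note that
streaming_replication_retry_interval is the GUC name proposed in the
patch, not an existing parameter; it could equally be set directly in
postgresql.conf.)

ALTER SYSTEM SET streaming_replication_retry_interval = '1min';
SELECT pg_reload_conf();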

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 6cf6c9bc0a6d6c4325459cddc22fd6fcdc970a03 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 24 Feb 2023 04:38:13 +
Subject: [PATCH v11] Allow standby to switch WAL source from archive to
 streaming replication

A standby typically switches to streaming replication (getting WAL
from the primary) only when reading from the WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from the
archive may not always be efficient and cheap: network latency and
disk I/O cost can differ on the archive compared to the primary, and
the archive often sits far from the standby, hurting recovery
performance on the standby. Hence, reading WAL from the primary
instead of the archive, by setting this parameter, lets the standby
catch up with the primary sooner, reducing replication lag and
avoiding WAL file accumulation on the primary.

This feature adds a new GUC that specifies the amount of time after
which the standby attempts to switch the WAL source from the WAL
archive to streaming replication. However, it exhausts all the WAL
present in pg_wal before switching. If the standby fails to switch to
stream mode, it falls back to archive mode.

Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  46 +
 doc/src/sgml/high-availability.sgml   |   7 +
 src/backend/access/transam/xlogrecovery.c | 158 --
 src/backend/utils/misc/guc_tables.c   |  12 ++
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/include/access/xlogrecovery.h |   1 +
 src/test/recovery/meson.build |   1 +
 src/test/recovery/t/035_wal_source_switch.pl  | 126 ++
 8 files changed, 338 insertions(+), 17 deletions(-)
 create mode 100644 src/test/recovery/t/035_wal_source_switch.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e5c41cc6c6..fd400ac662 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4884,6 +4884,52 @@ ANY num_sync ( 
+  streaming_replication_retry_interval (integer)
+  
+   streaming_replication_retry_interval configuration parameter
+  
+  
+  
+   
+Specifies amount of time after which standby attempts to switch WAL
+source from WAL archive to streaming replication (get WAL from
+primary). However, exhaust all the WAL present in pg_wal before
+switching. If the standby fails to switch to stream mode, it falls
+back to archive mode.
+If this value is specified without units, it is taken as milliseconds.
+The default is five minutes (5min).
+With a lower setting of this parameter, the standby makes frequent
+WAL source switch attempts when the primary is lost for quite longer.
+To avoid this, set a reasonable value.
+A setting of 0 disables the feature. When disabled,
+the standby typically switches to stream mode, only when receive from
+WAL archive finishes (no more WAL left there) or fails for any reason.
+This parameter can only be set in
+the postgresql.conf file or on the server
+command line.
+   
+   
+Reading WAL from archive may not always be efficient and cheaper
+because network latencies, disk IO cost might differ on the archive as
+compared to primary and often the archive may sit far from standby
+impacting recovery performance on the standby. Hence reading WAL
+from the primary, by setting this parameter, as opposed to the archive
+enables the standby to catch up with the primary sooner thus reducing
+replication lag and avoiding WAL files accumulation on the primary.
+  
+   
+Note that the standby may not always attempt to switch source from
+WAL archive to streaming replication at exact
+streaming_replication_retry_interval intervals.
+For example, if the parameter is set to 1min and
+fetching from WAL archive takes 5min, then the
+source switch attempt happens for the next WAL after current WAL is
+fetched from WAL archive and applied.
+  
+  
+ 
+
  
   recovery_min_apply_delay (integer)
   
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f180607528..0323f4cc43 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -636,6 +636,13 @@ 

Re: Add LZ4 compression in pg_dump

2023-02-23 Thread Michael Paquier
On Thu, Feb 23, 2023 at 07:51:16PM -0600, Justin Pryzby wrote:
> On Thu, Feb 23, 2023 at 09:24:46PM +0100, Tomas Vondra wrote:
>> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
>> and marked the CF entry as committed. Thanks for the patch!
> 
> A big thanks from me to everyone involved.

Wow, nice!  The APIs are clear to follow.

> I'll send a patch soon.  I first submitted patches for that 2 years ago
> (before PGDG was ready to add zstd).
> https://commitfest.postgresql.org/31/2888/

Thanks.  It should be straight-forward to see that in 16, I guess.
--
Michael




Re: stopgap fix for signal handling during restore_command

2023-02-23 Thread Nathan Bossart
On Fri, Feb 24, 2023 at 01:25:01PM +1300, Thomas Munro wrote:
> I think you should have a trailing \n when writing to stderr.

Oops.  I added that in v7.

> Here's that reproducer I speculated about (sorry I confused SIGQUIT
> and SIGTERM in my earlier email, ENOCOFFEE).  Seems to do the job, and
> I tested on a Linux box for good measure.  If you comment out the
> kill(), "check PROVE_TESTS=t/002_archiving.pl" works fine
> (demonstrating that that definition of system() works fine).  With the
> kill(), it reliably reaches 'TRAP: failed Assert("latch->owner_pid ==
> MyProcPid")' without your patch, and with your patch it avoids it.  (I
> believe glibc's system() could reach it too with the right timing, but
> I didn't try, my point being that the use of the OpenBSD system() here
> is only  because it's easier to grok and to wrangle.)

Thanks for providing the reproducer!  I am seeing the behavior that you
described on my Linux machine.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 6bdafd9980854fc9d81f37d450b86e5f5d4a9d84 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 23 Feb 2023 14:27:48 -0800
Subject: [PATCH v7 1/2] Move extra code out of the Pre/PostRestoreCommand()
 block.

If SIGTERM is received within this block, the startup process will
immediately proc_exit() in the signal handler, so it is inadvisable
to include any more code than is required in this section.  This
change moves the code recently added to this block (see 1b06d7b and
7fed801) to outside of the block.  This ensures that only system()
is called while proc_exit() might be called in the SIGTERM handler,
which is how this code worked from v8.4 to v14.
---
 src/backend/access/transam/xlogarchive.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index fcc87ff44f..41684418b6 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -159,20 +159,27 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 			 xlogRestoreCmd)));
 
+	fflush(NULL);
+	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
+
 	/*
-	 * Check signals before restore command and reset afterwards.
+	 * PreRestoreCommand() informs the SIGTERM handler for the startup process
+	 * that it should proc_exit() right away.  This is done for the duration of
+	 * the system() call because there isn't a good way to break out while it
+	 * is executing.  Since we might call proc_exit() in a signal handler, it
+	 * is best to put any additional logic before or after the
+	 * PreRestoreCommand()/PostRestoreCommand() section.
 	 */
 	PreRestoreCommand();
 
 	/*
 	 * Copy xlog from archival storage to XLOGDIR
 	 */
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
 	rc = system(xlogRestoreCmd);
-	pgstat_report_wait_end();
 
 	PostRestoreCommand();
+
+	pgstat_report_wait_end();
 	pfree(xlogRestoreCmd);
 
 	if (rc == 0)
-- 
2.25.1

>From 52b0ad1a8b8a94db1f540b7cbf916bc9b1a785b6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 14 Feb 2023 09:44:53 -0800
Subject: [PATCH v7 2/2] Don't proc_exit() in startup's SIGTERM handler if
 forked by system().

Instead, emit a message to STDERR and _exit() in this case.  This
change also adds assertions to proc_exit(), ProcKill(), and
AuxiliaryProcKill() to verify that these functions are not called
by a process forked by system(), etc.
---
 src/backend/postmaster/startup.c | 20 +++-
 src/backend/storage/ipc/ipc.c|  3 +++
 src/backend/storage/lmgr/proc.c  |  2 ++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index efc2580536..bace915881 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -19,6 +19,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -121,7 +123,23 @@ StartupProcShutdownHandler(SIGNAL_ARGS)
 	int			save_errno = errno;
 
 	if (in_restore_command)
-		proc_exit(1);
+	{
+		/*
+		 * If we are in a child process (e.g., forked by system() in
+		 * RestoreArchivedFile()), we don't want to call any exit callbacks.
+		 * The parent will take care of that.
+		 */
+		if (MyProcPid == (int) getpid())
+			proc_exit(1);
+		else
+		{
+			const char	msg[] = "StartupProcShutdownHandler() called in child process\n";
+			int			rc pg_attribute_unused();
+
+			rc = write(STDERR_FILENO, msg, sizeof(msg));
+			_exit(1);
+		}
+	}
 	else
 		shutdown_requested = true;
 	WakeupRecovery();
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 1904d21795..d5097dc008 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -103,6 +103,9 @@ static int	on_proc_exit_index,
 void
 proc_exit(int code)
 

Re: Allow logical replication to copy tables in binary format

2023-02-23 Thread Peter Smith
Here is my summary of this thread so far, plus some other thoughts.

(I wrote this mostly for my own internal notes/understanding, but
maybe it is a helpful summary for others so I am posting it here too).

~~

1. Initial Goal
--

Currently, the CREATE SUBSCRIPTION ... WITH (binary = true) mode does
data replication in binary mode, but the tablesync COPY phase is still
done only in text mode. IIUC Melih just wanted to unify those phases so
that binary=true would also mean doing the COPY phase in binary [1].

Actually, this was my very first expectation too.

2. Objections to unification
---

Bharath suggested that tying the COPY format to the replication format
is not a good idea [2]. But if these are not to be unified, then a new
subscription option needs to be introduced; currently the thread
refers to this new option as copy_format=binary.

3. A problem: binary replication is not always binary!
--

Shi-san reported [3] that under special circumstances (e.g. if the
datatype has no binary output function) the current HEAD binary=true
replication mode can fall back to text replication. Since binary COPY
doesn't do this, binary COPY could fail in some cases where binary=true
replication would be OK.

AFAIK, this means that even if we *wanted* to unify everything under
just 'binary=true', it can't be done like that.

4. New option is needed
-

OK, so let's assume these options must be separated (because of the
problem of #3).

~

4a. New string option called copy_format?

Currently, the patch/thread describes a new 'copy_format' string
option with values 'text' (default) and 'binary'.

Why? If there are only 2 values then IMO it would be better to have a
*boolean* option called something like binary_copy=true/false.
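
To make the two spellings concrete (neither is committed syntax;
copy_format = 'binary' is what the patch currently proposes, and
binary_copy is only my suggestion):

CREATE SUBSCRIPTION mysub
  CONNECTION 'host=pub dbname=postgres'
  PUBLICATION mypub
  WITH (binary = true, copy_format = 'binary');

-- versus the boolean spelling:
-- ... WITH (binary = true, binary_copy = true);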

~

4b. Or, we could extend copy_data

Jelte suggested [4] we could extend copy_data = 'text' (aka on/true)
OR 'binary' OR 'none' (aka off/false).

That was interesting, although
- my first impression was to worry about backward compatibility issues
for existing application code. I don't know if those are easily
solved.
- AFAIK such "hybrid" boolean/enum options are kind of frowned upon so
I am not sure if a committer would be in favour of introducing another
one.


5. Combining options
--

Because options are separated, it means combinations become possible...

~~

5a. Combining option: "copy_format = binary" and "copy_data = false"

Kuroda-san [5] wrote that such a combination should not be allowed.

I kind of disagree with that. IMO everything should be as flexible as
possible. The patch might end up accidentally stopping something that
has a valid use case. Anyway, such restrictions are easy to add later.

~~

5b. Combining options: binary=true/copy_format=binary, and
binary=false/copy_format=binary become possible.

AFAIK currently the patch disallows some combinations:

+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("%s and %s are mutually exclusive options",
+ "binary = false", "copy_format = binary")));


I kind of disagree with introducing such rules/restrictions. IMO all
this patch needs to do is clearly document all necessary precautions
etc. But if the user still wants to do something, we should just let
them do it. If they manage to shoot themselves in the foot, well it
was their choice after reading the docs, and it's their foot.


6. pub/sub version checking


There were some discussions about doing some server checking to reject
some PG combinations from being allowed to use the copy_format=binary.

Jelte suggested [5] that it is the "responsibility of the operator to
evaluate the risk".

I agree. Yes, the patch certainly needs to *document* all precautions,
but having too many restrictions might end up accidentally stopping
something useful. Anyway, this seems like #5a. I prefer the KISS
principle. More restrictions can be added later if found necessary.


7. More doubts & a thought bubble
-

7a. What is user action for this scenario?

I am unsure about this scenario - imagine that everything is working
properly and the copy_format=binary/copy_data=true is all working
nicely for weeks for a certain pub/sub...

But if the publication was defined as FOR ALL TABLES, or as ALL TABLES
IN SCHEMA, then I think the tablesync can still crash if a user
creates a new table that suffers the same COPY binary trouble Shi-san
described earlier [3].

What is the user supposed to do when that tablesync fails?

They had no way to predict it could happen, and it will be too painful
to have to DROP and re-create the entire SUBSCRIPTION now that it has
happened.

~~

7b. A thought bubble

I wondered (perhaps this is naive) whether it would be possible to
enhance the COPY command's "binary" mode so that it can be made to
fall back to text mode if it 

Re: Add LZ4 compression in pg_dump

2023-02-23 Thread Justin Pryzby
On Thu, Feb 23, 2023 at 09:24:46PM +0100, Tomas Vondra wrote:
> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
> and marked the CF entry as committed. Thanks for the patch!

A big thanks from me to everyone involved.

> I wonder how difficult would it be to add the zstd compression, so that
> we don't have the annoying "unsupported" cases.

I'll send a patch soon.  I first submitted patches for that 2 years ago
(before PGDG was ready to add zstd).
https://commitfest.postgresql.org/31/2888/

-- 
Justin




Re: stopgap fix for signal handling during restore_command

2023-02-23 Thread Thomas Munro
On Fri, Feb 24, 2023 at 1:25 PM Thomas Munro  wrote:
> ENOCOFFEE

Erm, I realised after sending that I'd accidentally sent a version
that uses fork() anyway, and now if I change it back to vfork() it
doesn't fail the way I wanted to demonstrate, at least on Linux.  I
don't have time or desire to dig into how Linux vfork() really works
so I'll leave it at that... but the patch as posted does seem to be a
useful tool for understanding this failure... please just ignore the
confused comments about fork() vs vfork() therein.




Re: pgsql: Refactor to add pg_strcoll(), pg_strxfrm(), and variants.

2023-02-23 Thread Justin Pryzby
On Fri, Feb 24, 2023 at 01:56:05PM +1300, Thomas Munro wrote:
> On Fri, Feb 24, 2023 at 1:20 PM Justin Pryzby  wrote:
> > These patches cause warnings under MSVC.
> >
> > Of course my patch to improve CI by warning about compiler warnings is
> > the only one to notice.
> >
> > https://cirrus-ci.com/task/6199582053367808
> 
> It's a shame that it fails the whole Windows task, whereas for the
> Unixen we don't do -Werror so you can still see if everything else is
> OK, but then we check for errors in a separate task.  I don't have any
> ideas on how to achieve that, though.

My patch isn't very pretty, but you can see that it runs all the tests
before grepping for warnings, rather than failing during compilation as
you said.

IMO the compiler warnings task is separate not only "to avoid failing
the whole task during compilation", but also because it's compiled with
optimization, which is 1) needed for some warnings to be emitted at
all, and 2) harmful to enable during the "check-world" tests, since it
makes backtraces less accurate.

-- 
Justin




Re: pgsql: Refactor to add pg_strcoll(), pg_strxfrm(), and variants.

2023-02-23 Thread Thomas Munro
On Fri, Feb 24, 2023 at 1:20 PM Justin Pryzby  wrote:
> These patches cause warnings under MSVC.
>
> Of course my patch to improve CI by warning about compiler warnings is
> the only one to notice.
>
> https://cirrus-ci.com/task/6199582053367808

It's a shame that it fails the whole Windows task, whereas for the
Unixen we don't do -Werror so you can still see if everything else is
OK, but then we check for errors in a separate task.  I don't have any
ideas on how to achieve that, though.

FWIW my CI log scanner also noticed this problem
http://cfbot.cputube.org/highlights/compiler.html.  Been wondering how
to bring that to the right people's attention.  Perhaps by adding a
clickable ⚠ to the main page next to the item if any of these
"highlights" were detected; perhaps it should take you to a
per-submission history page with the highlights from each version.




Re: stopgap fix for signal handling during restore_command

2023-02-23 Thread Thomas Munro
On Fri, Feb 24, 2023 at 12:15 PM Nathan Bossart
 wrote:
> Thoughts?

I think you should have a trailing \n when writing to stderr.

Here's that reproducer I speculated about (sorry I confused SIGQUIT
and SIGTERM in my earlier email, ENOCOFFEE).  Seems to do the job, and
I tested on a Linux box for good measure.  If you comment out the
kill(), "check PROVE_TESTS=t/002_archiving.pl" works fine
(demonstrating that that definition of system() works fine).  With the
kill(), it reliably reaches 'TRAP: failed Assert("latch->owner_pid ==
MyProcPid")' without your patch, and with your patch it avoids it.  (I
believe glibc's system() could reach it too with the right timing, but
I didn't try, my point being that the use of the OpenBSD system() here
is only  because it's easier to grok and to wrangle.)


0001-XXX-inject-SIGTERM-into-system-at-an-inconvenient-mo.xatch
Description: Binary data


Re: pgsql: Refactor to add pg_strcoll(), pg_strxfrm(), and variants.

2023-02-23 Thread Justin Pryzby
These patches cause warnings under MSVC.

Of course my patch to improve CI by warning about compiler warnings is
the only one to notice.

https://cirrus-ci.com/task/6199582053367808

-- 
Justin




Re: Rework of collation code, extensibility

2023-02-23 Thread Jeff Davis
On Wed, 2023-02-22 at 20:49 +0100, Peter Eisentraut wrote:
> > On 14.02.23 00:45, Jeff Davis wrote:
> > > > Now the patches are:
> > > > 
> > > >  0001: pg_strcoll/pg_strxfrm
> > > >  0002: pg_locale_deterministic()
> > > >  0003: cleanup a USE_ICU special case
> > > >  0004: GUCs (only for testing, not for commit)
> > 
> > I haven't read the whole thing again, but this arrangement looks
> > good > to 
> > me.  I don't have an opinion on whether 0004 is actually useful.

Committed with a few revisions after I took a fresh look over the
patch.

The most significant one was that I found out that we are also hashing
the NUL byte at the end of the string when the collation is
non-deterministic. The refactoring patch doesn't change that, of
course, but the API of pg_strnxfrm() makes it clearer, and I added
comments.

Also, ICU uses int32_t for string lengths rather than size_t (I'm not
sure that's a great idea, but that's what ICU does). I clarified the
boundary by changing the argument types of the ICU-specific static
functions to int32_t, while leaving the API entry points as size_t.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Weird failure with latches in curculio on v15

2023-02-23 Thread Nathan Bossart
On Wed, Feb 22, 2023 at 09:48:10PM +1300, Thomas Munro wrote:
> On Tue, Feb 21, 2023 at 5:50 PM Nathan Bossart  
> wrote:
>> On Tue, Feb 21, 2023 at 09:03:27AM +0900, Michael Paquier wrote:
>> > Perhaps beginning a new thread with a patch and a summary would be
>> > better at this stage?  Another thing I am wondering is if it could be
>> > possible to test that rather reliably.  I have been playing with a few
>> > scenarios like holding the system() call for a bit with hardcoded
>> > sleep()s, without much success.  I'll try harder on that part..  It's
>> > been mentioned as well that we could just move away from system() in
>> > the long-term.
>>
>> I'm happy to create a new thread if needed, but I can't tell if there is
>> any interest in this stopgap/back-branch fix.  Perhaps we should just jump
>> straight to the long-term fix that Thomas is looking into.
> 
> Unfortunately the latch-friendly subprocess module proposal I was
> talking about would be for 17.  I may post a thread fairly soon with
> design ideas + list of problems and decision points as I see them, and
> hopefully some sketch code, but it won't be a proposal for [/me checks
> calendar] next week's commitfest and probably wouldn't be appropriate
> in a final commitfest anyway, and I also have some other existing
> stuff to clear first.  So please do continue with the stopgap ideas.

I've created a new thread for the stopgap fix [0].

[0] https://postgr.es/m/20230223231503.GA743455%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




stopgap fix for signal handling during restore_command

2023-02-23 Thread Nathan Bossart
On Wed, Feb 22, 2023 at 09:48:10PM +1300, Thomas Munro wrote:
> On Tue, Feb 21, 2023 at 5:50 PM Nathan Bossart  
> wrote:
>> I'm happy to create a new thread if needed, but I can't tell if there is
>> any interest in this stopgap/back-branch fix.  Perhaps we should just jump
>> straight to the long-term fix that Thomas is looking into.
> 
> Unfortunately the latch-friendly subprocess module proposal I was
> talking about would be for 17.  I may post a thread fairly soon with
> design ideas + list of problems and decision points as I see them, and
> hopefully some sketch code, but it won't be a proposal for [/me checks
> calendar] next week's commitfest and probably wouldn't be appropriate
> in a final commitfest anyway, and I also have some other existing
> stuff to clear first.  So please do continue with the stopgap ideas.

Okay, here is a new thread...

Since v8.4, the startup process will proc_exit() immediately within its
SIGTERM handler while the restore_command executes via system().   Some
recent changes added unsafe code to the section where this behavior is
enabled [0].  The long-term fix likely includes moving away from system()
completely, but we may want to have a stopgap/back-branch fix while that is
under development.

I've attached a patch set for a proposed stopgap fix.  0001 simply moves
the extra code outside of the Pre/PostRestoreCommand() block so that only
system() is executed while the SIGTERM handler might proc_exit().  This
restores the behavior that was in place from v8.4 to v14, so I don't expect
it to be too controversial.  0002 adds code to startup's SIGTERM handler to
call _exit() instead of proc_exit() if we are in a forked process from
system(), etc.  It also adds assertions to ensure proc_exit(), ProcKill(),
and AuxiliaryProcKill() are not called within such forked processes.

Thoughts?

[0] https://postgr.es/m/20230201105514.rsjl4bnhb65giyvo%40alap3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From aee0df5e44cfd631fb601e18ab50469a662ced19 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 23 Feb 2023 14:27:48 -0800
Subject: [PATCH v6 1/2] Move extra code out of the Pre/PostRestoreCommand()
 block.

If SIGTERM is received within this block, the startup process will
immediately proc_exit() in the signal handler, so it is inadvisable
to include any more code than is required in this section.  This
change moves the code recently added to this block (see 1b06d7b and
7fed801) to outside of the block.  This ensures that only system()
is called while proc_exit() might be called in the SIGTERM handler,
which is how this code worked from v8.4 to v14.
---
 src/backend/access/transam/xlogarchive.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index fcc87ff44f..41684418b6 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -159,20 +159,27 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 			 xlogRestoreCmd)));
 
+	fflush(NULL);
+	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
+
 	/*
-	 * Check signals before restore command and reset afterwards.
+	 * PreRestoreCommand() informs the SIGTERM handler for the startup process
+	 * that it should proc_exit() right away.  This is done for the duration of
+	 * the system() call because there isn't a good way to break out while it
+	 * is executing.  Since we might call proc_exit() in a signal handler, it
+	 * is best to put any additional logic before or after the
+	 * PreRestoreCommand()/PostRestoreCommand() section.
 	 */
 	PreRestoreCommand();
 
 	/*
 	 * Copy xlog from archival storage to XLOGDIR
 	 */
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
 	rc = system(xlogRestoreCmd);
-	pgstat_report_wait_end();
 
 	PostRestoreCommand();
+
+	pgstat_report_wait_end();
 	pfree(xlogRestoreCmd);
 
 	if (rc == 0)
-- 
2.25.1

>From 76e4f00c1c08513893780bc50709b5434dc3470f Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 14 Feb 2023 09:44:53 -0800
Subject: [PATCH v6 2/2] Don't proc_exit() in startup's SIGTERM handler if
 forked by system().

Instead, emit a message to STDERR and _exit() in this case.  This
change also adds assertions to proc_exit(), ProcKill(), and
AuxiliaryProcKill() to verify that these functions are not called
by a process forked by system(), etc.
---
 src/backend/postmaster/startup.c | 20 +++-
 src/backend/storage/ipc/ipc.c|  3 +++
 src/backend/storage/lmgr/proc.c  |  2 ++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index efc2580536..de2b56c2fa 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -19,6 +19,8 @@
  */
 #include "postgres.h"
 
+#include 
+
 #include 

Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-02-23 Thread Andrey Chudnovsky
> This really doesn't feel like a great area to try and do hooks or
> similar in, not the least because that approach has been tried and tried
> again (PAM, GSSAPI, SASL would all be examples..) and frankly none of
> them has turned out great (which is why we can't just tell people "well,
> install the pam_oauth2 and watch everything work!") and this strikes me
> as trying to do that yet again but worse as it's not even a dedicated
> project trying to solve the problem but more like a side project.

In this case it's not intended to be an open-ended hook, but rather an
implementation of a specific RFC (RFC 7628), which defines the
client-server communication for the authentication flow.
The RFC itself leaves a lot of flexibility on specific parts of the
implementation, which do require hooks:
(1) A server-side hook to validate the token, which is specific to the
OAUTH provider.
(2) A client-side hook to request that the client obtain the token.

On (1), we would need a hook for the OAUTH provider extension to do
the validation. We can, though, do some basic checks that the
credential is indeed a JWT token signed by the requested issuer.
Specifically, (2) is where we can provide a layer in libpq to simplify
the integration, i.e. implement some OAUTH flows.
Though we would need some flexibility for clients to bring their own
token: for example, there are cases where the credential to obtain the
token is stored in a separate secure location, and the token is
returned by a separate service or pushed from a more secure
environment.

> another new "generic" set of hooks/APIs that will just cause DBAs and
> our users headaches trying to make work.
As I mentioned above, it's an RFC implementation rather than our
invention.
When it comes to DBAs and users: built-in libpq implementations which
allow psql and pgAdmin to connect seamlessly should cover those needs,
while extensibility would keep the ecosystem open for OAUTH providers,
SaaS developers, PaaS providers, and other institutional players.

Thanks!
Andrey.

On Thu, Feb 23, 2023 at 10:47 AM Stephen Frost  wrote:
>
> Greetings,
>
> * Jacob Champion (jchamp...@timescale.com) wrote:
> > On Mon, Feb 20, 2023 at 2:35 PM Stephen Frost  wrote:
> > > Having skimmed back through this thread again, I still feel that the
> > > direction that was originally being taken (actually support something in
> > > libpq and the backend, be it with libiddawc or something else or even
> > > our own code, and not just throw hooks in various places) makes a lot
> > > more sense and is a lot closer to how Kerberos and client-side certs and
> > > even LDAP auth work today.
> >
> > Cool, that helps focus the effort. Thanks!
>
> Great, glad to hear that.
>
> > > That also seems like a much better answer
> > > for our users when it comes to new authentication methods than having
> > > extensions and making libpq developers have to write their own custom
> > > code, not to mention that we'd still need to implement something in psql
> > > to provide such a hook if we are to have psql actually usefully exercise
> > > this, no?
> >
> > I don't mind letting clients implement their own flows... as long as
> > it's optional. So even if we did use a hook in the end, I agree that
> > we've got to exercise it ourselves.
>
> This really doesn't feel like a great area to try and do hooks or
> similar in, not the least because that approach has been tried and tried
> again (PAM, GSSAPI, SASL would all be examples..) and frankly none of
> them has turned out great (which is why we can't just tell people "well,
> install the pam_oauth2 and watch everything work!") and this strikes me
> as trying to do that yet again but worse as it's not even a dedicated
> project trying to solve the problem but more like a side project.  SCRAM
> was good, we've come a long way thanks to that, this feels like it
> should be more in line with that rather than trying to invent yet
> another new "generic" set of hooks/APIs that will just cause DBAs and
> our users headaches trying to make work.
>
> > > In the Kerberos test suite we have today, we actually bring up a proper
> > > Kerberos server, set things up, and then test end-to-end installing a
> > > keytab for the server, getting a TGT, getting a service ticket, testing
> > > authentication and encryption, etc.  Looking around, it seems like the
> > > equivilant would perhaps be to use Glewlwyd and libiddawc or libcurl and
> > > our own code to really be able to test this and show that it works and
> > > that we're doing it correctly, and to let us know if we break something.
> >
> > The original patchset includes a test server in Python -- a major
> > advantage being that you can test the client and server independently
> > of each other, since the implementation is so asymmetric. Additionally
> > testing against something like Glewlwyd would be a great way to stack
> > coverage. (If we *only* test against a packaged server, though, it'll
> > be harder to 

Re: [PATCH] Add pretty-printed XML output option

2023-02-23 Thread Peter Smith
The patch v19 LGTM.

- v19 applies cleanly for me
- Full clean build OK
- HTML docs build and render OK
- The 'make check' tests all pass for me
- Also cfbot reports latest patch has no errors -- http://cfbot.cputube.org/

So, I marked it a "Ready for Committer" in the CF --
https://commitfest.postgresql.org/42/4162/

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [BUG] Autovacuum not dynamically decreasing cost_limit and cost_delay

2023-02-23 Thread Melanie Plageman
On Mon, Feb 8, 2021 at 9:49 AM Mead, Scott  wrote:
>   Initially, my goal was to determine feasibility for making this dynamic.  I 
> added debug code to vacuum.c:vacuum_delay_point(void) and found that changes 
> to cost_delay and cost_limit are already processed by a running vacuum.  
> There was a bug preventing the cost_delay or cost_limit from being configured 
> to allow higher throughput however.
>
> I believe this is a bug because currently, autovacuum will dynamically detect 
> and increase the cost_limit or cost_delay, but it can never decrease those 
> values beyond their setting when the vacuum began.  The current behavior is 
> for vacuum to limit the maximum throughput of currently running vacuum 
> processes to the cost_limit that was set when the vacuum process began.
>
> I changed this (see attached) to allow the cost_limit to be re-calculated up 
> to the maximum allowable (currently 10,000).  This has the effect of allowing 
> users to reload a configuration change and an in-progress vacuum can be 
> ‘sped-up’ by setting either the cost_limit or cost_delay.
>
> The problematic piece is:
>
> diff --git a/src/backend/postmaster/autovacuum.c 
> b/src/backend/postmaster/autovacuum.c
> index c6ec657a93..d3c6b0d805 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1834,7 +1834,7 @@ autovac_balance_cost(void)
>  * cost_limit to more than the base value.
>  */
> worker->wi_cost_limit = Max(Min(limit,
> -   worker->wi_cost_limit_base),
> +   MAXVACUUMCOSTLIMIT),
> 1);
> }
>
> We limit the worker to the max cost_limit that was set at the beginning of 
> the vacuum.

So, in do_autovacuum(), in the loop through all relations we will be
vacuuming (around line 2308; the comment says "perform operations on
collected tables"), we reload the config file before operating on each
table [1]. Any changes you have made to autovacuum_vacuum_cost_limit or
other GUCs will be read and take effect here.

Later in this same loop, table_recheck_autovac() will set
tab->at_vacuum_cost_limit from vac_cost_limit which is set from the
autovacuum_vacuum_cost_limit or vacuum_cost_limit and will pick up your
refreshed value.

Then a bit further down, (before autovac_balance_cost()),
MyWorkerInfo->wi_cost_limit_base is set from tab->at_vacuum_cost_limit.

In autovac_balance_cost(), when we loop through the running workers to
calculate the worker->wi_cost_limit, workers who have reloaded the
config file in the do_autovacuum() loop prior to our taking the
AutovacuumLock will have the new version of autovacuum_vacuum_cost_limit
in their wi_cost_limit_base.

If you saw an old value in the DEBUG log output, that could be
because it was for a worker who has not yet reloaded the config file.
(the launcher also calls autovac_balance_cost(), but I will
assume we are just talking about the workers here).

Note that this will only pick up changes between tables being
autovacuumed. If you want to see updates to the value in the middle of
autovacuum vacuuming a table, then we would need to reload the
configuration file more often than just between tables.

I have started a discussion about doing this in [2]. I made it a
separate thread because my proposed changes would have effects outside
of autovacuum. Processing the config file reload in vacuum_delay_point()
would affect vacuum and analyze (i.e. not just autovacuum). Explicit
vacuum and analyze rely on the per statement config reload in
PostgresMain().

> Interestingly, autovac_balance_cost(void) is only updating the cost_limit, 
> even if the cost_delay is modified.  This is done correctly, it was just a 
> surprise to see the behavior.

If this was during vacuuming of a single table, this is expected for the
same reason described above.

- Melanie

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/postmaster/autovacuum.c#L2324
[2] 
https://www.postgresql.org/message-id/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com




Should vacuum process config file reload more often

2023-02-23 Thread Melanie Plageman
Hi,

Users may wish to speed up long-running vacuum of a large table by
decreasing autovacuum_vacuum_cost_delay/vacuum_cost_delay, however the
config file is only reloaded between tables (for autovacuum) or after
the statement (for explicit vacuum). This has been brought up for
autovacuum in [1].

Andres suggested that it might be possible to check ConfigReloadPending
in vacuum_delay_point(), so I thought I would draft a rough patch and
start a discussion.

Since vacuum_delay_point() is also called by analyze and we do not want
to reload the configuration file if we are in a user transaction, I
widened the scope of the in_outer_xact variable in vacuum() and allowed
analyze in a user transaction to default to the current configuration
file reload cadence in PostgresMain().

I don't think I can set and leave vac_in_outer_xact the way I am doing
it in this patch, since I use vac_in_outer_xact in vacuum_delay_point(),
which I believe is reachable from codepaths that would not have called
vacuum(). It seems that if a backend sets it, the outer transaction
commits, and then the backend ends up calling vacuum_delay_point() in a
different way later, it wouldn't be quite right.

Apart from this, one higher level question I have is if there are other
gucs whose modification would make reloading the configuration file
during vacuum/analyze unsafe.

- Melanie

[1] 
https://www.postgresql.org/message-id/flat/22CA91B4-D341-4075-BD3C-4BAB52AF1E80%40amazon.com#37f05e33d2ce43680f96332fa1c0f3d4
From aea6fbfd93ab12e4e27869b755367ab8454e3eef Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 23 Feb 2023 15:54:55 -0500
Subject: [PATCH v1] reload config file vac

---
 src/backend/commands/vacuum.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index aa79d9de4d..979d19222d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -48,6 +48,7 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
+#include "postmaster/interrupt.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -75,6 +76,7 @@ int			vacuum_multixact_failsafe_age;
 /* A few variables that don't seem worth passing around as parameters */
 static MemoryContext vac_context = NULL;
 static BufferAccessStrategy vac_strategy;
+static bool vac_in_outer_xact = false;
 
 
 /*
@@ -309,8 +311,7 @@ vacuum(List *relations, VacuumParams *params,
 	static bool in_vacuum = false;
 
 	const char *stmttype;
-	volatile bool in_outer_xact,
-use_own_xacts;
+	volatile bool use_own_xacts;
 
 	Assert(params != NULL);
 
@@ -327,10 +328,10 @@ vacuum(List *relations, VacuumParams *params,
 	if (params->options & VACOPT_VACUUM)
 	{
 		PreventInTransactionBlock(isTopLevel, stmttype);
-		in_outer_xact = false;
+		vac_in_outer_xact = false;
 	}
 	else
-		in_outer_xact = IsInTransactionBlock(isTopLevel);
+		vac_in_outer_xact = IsInTransactionBlock(isTopLevel);
 
 	/*
 	 * Due to static variables vac_context, anl_context and vac_strategy,
@@ -451,7 +452,7 @@ vacuum(List *relations, VacuumParams *params,
 		Assert(params->options & VACOPT_ANALYZE);
 		if (IsAutoVacuumWorkerProcess())
 			use_own_xacts = true;
-		else if (in_outer_xact)
+		else if (vac_in_outer_xact)
 			use_own_xacts = false;
 		else if (list_length(relations) > 1)
 			use_own_xacts = true;
@@ -469,7 +470,7 @@ vacuum(List *relations, VacuumParams *params,
 	 */
 	if (use_own_xacts)
 	{
-		Assert(!in_outer_xact);
+		Assert(!vac_in_outer_xact);
 
 		/* ActiveSnapshot is not set by autovacuum */
 		if (ActiveSnapshotSet())
@@ -521,7 +522,7 @@ vacuum(List *relations, VacuumParams *params,
 }
 
 analyze_rel(vrel->oid, vrel->relation, params,
-			vrel->va_cols, in_outer_xact, vac_strategy);
+			vrel->va_cols, vac_in_outer_xact, vac_strategy);
 
 if (use_own_xacts)
 {
@@ -2214,6 +2215,12 @@ vacuum_delay_point(void)
 		 WAIT_EVENT_VACUUM_DELAY);
 		ResetLatch(MyLatch);
 
+		if (ConfigReloadPending && !vac_in_outer_xact)
+		{
+			ConfigReloadPending = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
 		VacuumCostBalance = 0;
 
 		/* update balance values for workers */
-- 
2.37.2



Re: verbose mode for pg_input_error_message?

2023-02-23 Thread Nathan Bossart
On Thu, Feb 23, 2023 at 11:30:38AM -0800, Nathan Bossart wrote:
> Will post a new version shortly.

As promised...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From b284ba24f771b6ccbf599839bdc813af718b91a1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 23 Feb 2023 10:31:24 -0800
Subject: [PATCH v3 1/1] add details to pg_input_error_message

---
 contrib/cube/expected/cube.out|  6 +-
 contrib/hstore/expected/hstore.out| 12 +--
 contrib/intarray/expected/_int.out| 10 +--
 contrib/isn/expected/isn.out  | 10 +--
 contrib/ltree/expected/ltree.out  | 20 ++---
 contrib/seg/expected/seg.out  | 16 ++--
 doc/src/sgml/func.sgml| 29 ---
 src/backend/utils/adt/misc.c  | 38 +++--
 src/include/catalog/pg_proc.dat   | 10 ++-
 src/test/regress/expected/arrays.out  |  6 +-
 src/test/regress/expected/bit.out | 30 +++
 src/test/regress/expected/boolean.out |  6 +-
 src/test/regress/expected/box.out | 12 +--
 src/test/regress/expected/char.out|  6 +-
 src/test/regress/expected/char_1.out  |  6 +-
 src/test/regress/expected/char_2.out  |  6 +-
 src/test/regress/expected/date.out| 12 +--
 src/test/regress/expected/domain.out  | 24 +++---
 src/test/regress/expected/enum.out| 12 +--
 src/test/regress/expected/float4.out  |  6 +-
 src/test/regress/expected/float8.out  |  6 +-
 src/test/regress/expected/geometry.out| 12 +--
 src/test/regress/expected/inet.out| 18 ++---
 src/test/regress/expected/int2.out| 18 ++---
 src/test/regress/expected/int4.out|  6 +-
 src/test/regress/expected/int8.out|  6 +-
 src/test/regress/expected/interval.out| 12 +--
 src/test/regress/expected/json.out|  6 +-
 src/test/regress/expected/json_encoding.out   |  2 +-
 src/test/regress/expected/json_encoding_1.out |  6 +-
 src/test/regress/expected/jsonb.out   | 12 +--
 src/test/regress/expected/jsonpath.out| 14 ++--
 src/test/regress/expected/line.out| 30 +++
 src/test/regress/expected/lseg.out|  6 +-
 src/test/regress/expected/macaddr.out | 12 +--
 src/test/regress/expected/macaddr8.out| 12 +--
 src/test/regress/expected/money.out   | 12 +--
 src/test/regress/expected/multirangetypes.out | 12 +--
 src/test/regress/expected/numeric.out | 18 ++---
 src/test/regress/expected/oid.out | 24 +++---
 src/test/regress/expected/path.out| 12 +--
 src/test/regress/expected/pg_lsn.out  |  6 +-
 src/test/regress/expected/point.out   |  6 +-
 src/test/regress/expected/polygon.out | 12 +--
 src/test/regress/expected/privileges.out  | 18 ++---
 src/test/regress/expected/rangetypes.out  | 30 +++
 src/test/regress/expected/regproc.out | 78 +--
 src/test/regress/expected/rowtypes.out| 12 +--
 src/test/regress/expected/strings.out | 18 ++---
 src/test/regress/expected/tid.out | 12 +--
 src/test/regress/expected/time.out| 12 +--
 src/test/regress/expected/timestamp.out   | 12 +--
 src/test/regress/expected/timestamptz.out | 12 +--
 src/test/regress/expected/timetz.out  | 12 +--
 src/test/regress/expected/tstypes.out | 18 ++---
 src/test/regress/expected/uuid.out|  6 +-
 src/test/regress/expected/varchar.out |  6 +-
 src/test/regress/expected/varchar_1.out   |  6 +-
 src/test/regress/expected/varchar_2.out   |  6 +-
 src/test/regress/expected/xid.out | 24 +++---
 src/test/regress/expected/xml.out | 10 +--
 src/test/regress/expected/xml_1.out   |  4 +-
 src/test/regress/expected/xml_2.out   |  8 +-
 src/test/regress/sql/xml.sql  |  4 +-
 64 files changed, 452 insertions(+), 413 deletions(-)

diff --git a/contrib/cube/expected/cube.out b/contrib/cube/expected/cube.out
index dc23e5ccc0..3bb42b063b 100644
--- a/contrib/cube/expected/cube.out
+++ b/contrib/cube/expected/cube.out
@@ -345,9 +345,9 @@ SELECT pg_input_is_valid('-1e-700', 'cube');
 (1 row)
 
 SELECT pg_input_error_message('-1e-700', 'cube');
-   pg_input_error_message
--
- "-1e-700" is out of range for type double precision
+  pg_input_error_message   
+---
+ ("""-1e-700"" is out of range for type double precision",,,22003)
 (1 row)
 
 --
diff --git a/contrib/hstore/expected/hstore.out b/contrib/hstore/expected/hstore.out
index d6faa91867..d58fee585e 100644
--- a/contrib/hstore/expected/hstore.out
+++ b/contrib/hstore/expected/hstore.out
@@ -266,15 +266,15 @@ select 

Re: buildfarm + meson

2023-02-23 Thread Andrew Dunstan


On 2023-02-23 Th 10:58, Andres Freund wrote:
>> On a Windows instance, fairly similar to what's running drongo, I can get a
>> successful build with meson+VS2019, but I'm getting an error in the
>> regression tests, which don't like setting lc_time to 'de_DE'. Not sure
>> what's going on there.
>
> Huh, that's odd.


See my reply to Michael for details

> I suspect the issue might be related to this:
>
> +   local %ENV = (PATH => $ENV{PATH}, PGUSER => $ENV{PGUSER});
> +   @makeout=run_log("meson test --logbase checklog --print-errorlogs
> --no-rebuild -C $pgsql --suite setup --suite regress");


I commented out the 'local %ENV' line and still got the error. I also
got the same error running by hand.



cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: "out of relcache_callback_list slots" after multiple calls to pg_logical_slot_get_binary_changes

2023-02-23 Thread Tom Lane
Peter Smith  writes:
> Should the 'relation_callback_registered' variable name be plural?

Yeah, plural seems better to me too.  I fixed that and did a little
comment-editing and pushed it.

regards, tom lane




Re: Add LZ4 compression in pg_dump

2023-02-23 Thread Tomas Vondra
On 2/23/23 16:26, Tomas Vondra wrote:
> Thanks for v30 with the updated commit messages. I've pushed 0001 after
> fixing a comment typo and removing (I think) an unnecessary change in an
> error message.
> 
> I'll give the buildfarm a bit of time before pushing 0002 and 0003.
> 

I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
and marked the CF entry as committed. Thanks for the patch!

I wonder how difficult would it be to add the zstd compression, so that
we don't have the annoying "unsupported" cases.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Proposal: :SQL_EXEC_TIME (like :ROW_COUNT) Variable (psql)

2023-02-23 Thread Kirk Wolak
Everyone,
  I love that my proposal for %T in the prompt, triggered some great
conversations.

  This is not instead of that.  That lets me run a query and come back
HOURS later, and know it finished before 7PM like it was supposed to!

  This feature is simple.  We forget to set \timing on...
We run a query, and we WONDER... how long did  that take.

  This, too, should be a trivial problem (the code will tell).

  I am proposing this to get feedback (I don't have a final design in mind,
but I will start by reviewing when and how ROW_COUNT gets set, and what
\timing reports).

  Next up, as I learn (and make mistakes), this toughens me up...

  I am not sure the name is right, but I would like to report it in the
same (ms) units as \timing, since there is an implicit relationship in what
they are doing.

  I think like ROW_COUNT, it should not change because of internal commands.
So, you guys +1 this thing, give additional comments.  When the feedback
settles, I commit to making it happen.
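
  For illustration only, here is a minimal sketch of the mechanics, assuming
psql's existing SetVariable()/pset.vars machinery and the instr_time.h macros
(the helper name and the SQL_EXEC_TIME variable name are placeholders, not a
final design):

/*
 * Sketch only: capture wall-clock time around query execution and expose it
 * as a psql variable, the same way ROW_COUNT is exposed.
 */
#include "portability/instr_time.h"
#include "settings.h"			/* pset */
#include "variables.h"			/* SetVariable() */

static void
SetExecTimeVariable(instr_time before, instr_time after)
{
	char		buf[32];

	INSTR_TIME_SUBTRACT(after, before);
	/* report milliseconds, the same unit \timing uses */
	snprintf(buf, sizeof(buf), "%.3f", INSTR_TIME_GET_MILLISEC(after));
	SetVariable(pset.vars, "SQL_EXEC_TIME", buf);
}

/*
 * The caller would take the two timestamps right around query execution,
 * in the same spot \timing wants them:
 *
 *	INSTR_TIME_SET_CURRENT(before);
 *	... send query, collect result ...
 *	INSTR_TIME_SET_CURRENT(after);
 *	SetExecTimeVariable(before, after);
 */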

Thanks, Kirk


Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2023-02-23 Thread Andres Freund
Hi,

On 2023-02-23 13:56:56 -0500, Tom Lane wrote:
> Corey Huinker  writes:
> > My not-ready-for-16 work on CAST( ... ON DEFAULT ... ) involved making
> > FuncExpr/IoCoerceExpr/ArrayCoerceExpr have a safe_mode flag, and that
> > necessitates adding a reserror boolean to ExprEvalStep for subsequent steps
> > to test if the error happened.
> 
> Why do you want it in ExprEvalStep ... couldn't it be in ExprState?
> I can't see why you'd need more than one at a time during evaluation.

I don't know exactly what CAST( ... ON DEFAULT ... ) is aiming for - I guess
it wants to assign a different value when the cast fails?  Is the default
expression a constant, or does it need to be runtime evaluated?  If a const,
then the cast steps just could assign the new value. If runtime evaluation is
needed I'd expect the various coerce steps to jump to the value implementing
the default expression in case of a failure.

So I'm not sure we even need a reserror field in ExprState.

Greetings,

Andres Freund




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2023-02-23 Thread Andres Freund
Hi,

On 2023-02-23 13:39:14 -0500, Corey Huinker wrote:
> My not-ready-for-16 work on CAST( ... ON DEFAULT ... ) involved making
> FuncExpr/IoCoerceExpr/ArrayCoerceExpr have a safe_mode flag, and that
> necessitates adding a reserror boolean to ExprEvalStep for subsequent steps
> to test if the error happened.

I think if that requires adding a new variable to each ExprEvalStep, it's
DOA. The overhead would be too high. But I don't see why it would need to be
added all ExprEvalSteps instead of individual steps, or perhaps ExprEvalState?


> Will that change be throwing some architectures over the 64 byte count?

It would.

I find the 'pahole' tool very useful for looking at struct layout.


struct ExprEvalStep {
intptr_t   opcode;   /* 0 8 */
Datum *resvalue; /* 8 8 */
_Bool *resnull;  /*16 8 */
union {
struct {
intlast_var; /*24 4 */
_Bool  fixed;/*28 1 */

/* XXX 3 bytes hole, try to pack */

TupleDesc  known_desc;   /*32 8 */
const TupleTableSlotOps  * kind; /*40 8 */
} fetch; /*2424 */
...
struct {
Datum *values;   /*24 8 */
_Bool *nulls;/*32 8 */
intnelems;   /*40 4 */
MinMaxOp   op;   /*44 4 */
FmgrInfo * finfo;/*48 8 */
FunctionCallInfo fcinfo_data;/*56 8 */
} minmax;/*2440 */
...

} d; /*2440 */

/* size: 64, cachelines: 1, members: 4 */
};


We don't have memory to spare in the "general" portion of ExprEvalStep
(currently 24 bytes), as several of the type-specific portions are already 40
bytes large.

Greetings,

Andres Freund




Re: verbose mode for pg_input_error_message?

2023-02-23 Thread Nathan Bossart
On Thu, Feb 23, 2023 at 10:40:48AM -0800, Nathan Bossart wrote:
> On Tue, Jan 10, 2023 at 03:41:12PM -0800, Nathan Bossart wrote:
>> My vote would be to redefine the existing pg_input_error_message() function
>> to return a record, but I recognize that this would inflate the patch quite
>> a bit due to all the existing uses in the tests.  If this is the only
>> argument against this approach, I'm happy to help with the patch.
> 
> Here's an attempt at this.

This seems to have made cfbot angry.  Will post a new version shortly.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-23 Thread Kirk Wolak
On Thu, Feb 23, 2023 at 9:52 AM Tom Lane  wrote:

> Heikki Linnakangas  writes:
> > On 23/02/2023 13:20, Peter Eisentraut wrote:
> >> If you don't have \timing turned on before the query starts, psql won't
> >> record what the time was before the query, so you can't compute the run
> >> time afterwards.  This kind of feature would only work if you always
> >> take the start time, even if \timing is turned off.
>
> > Correct. That seems acceptable though? gettimeofday() can be slow on
> > some platforms, but I doubt it's *that* slow, that we couldn't call it
> > two times per query.
>
> Yeah, you'd need to capture both the start and stop times even if
> \timing isn't on, in case you get asked later.  But the backend is
> going to call gettimeofday at least once per query, likely more
> depending on what features you use.  And there are inherently
> multiple kernel calls involved in sending a query and receiving
> a response.  I tend to agree with Heikki that this overhead would
> be unnoticeable.  (Of course, some investigation proving that
> wouldn't be unwarranted.)
>
> regards, tom lane
>

Note, for the feature above: I was thinking we already have a ROW_COUNT
variable that I inspect with \set.
The simplest way to add this is probably another variable, e.g. EXEC_TIME,
set at the same time ROW_COUNT gets set.
+1

==
Now, since this opened a lively discussion, I am officially submitting my
first patch.
This includes the small change to prompt.c and the documentation.  I had
help from Andrey Borodin,
and Pavel Stehule, who have supported me in how to propose, and use gitlab,
etc.

We are programmers... It's literally our job to sharpen our tools.  And
PSQL is one of my most used.
A small frustration, felt regularly was the motive.

Regards, Kirk
PS: If I am supposed to edit the subject to say there is a patch here, I
did not know
PPS: I appreciate ANY and ALL feedback... This is how we learn!
From eaf6d05028052a3ccaa8d980953ac4fd75255250 Mon Sep 17 00:00:00 2001
From: Kirk Wolak 
Date: Thu, 23 Feb 2023 17:52:09 +
Subject: [PATCH] Time option added to psql prompt

This adds a useful time option to the prompt: %T. Which does not
require a wasteful backquoted shell command which is also not
compatible between operating systems.
The format is simply HH24:MI:SS no other options available by design!

Author: Kirk Wolak 
Reviewed-By: Andrey Borodin 
Reviewed-By: Nikolay Samokhvalov 
Thread: 
https://postgr.es/m/CACLU5mSRwHr_8z%3DenMj-nXF1tmC7%2BJn5heZQNiKuLyxYUtL2fg%40mail.gmail.com
---
 doc/src/sgml/ref/psql-ref.sgml |  9 +
 src/bin/psql/prompt.c  | 10 +-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index dc6528dc11d..04ab9eeb8c0 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -4575,6 +4575,15 @@ testdb= INSERT INTO my_table VALUES 
(:'content');
 
   
 
+  
+%T
+
+ 
+  The current time on the client in HH24:MI:SS format.
+ 
+
+  
+
   
 %x
 
diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c
index 969cd9908e5..a590c27389b 100644
--- a/src/bin/psql/prompt.c
+++ b/src/bin/psql/prompt.c
@@ -41,6 +41,7 @@
  * or a ! if session is not connected to a database;
  * in prompt2 -, *, ', or ";
  * in prompt3 nothing
+ * %T - time in HH24:MI:SS format
  * %x - transaction status: empty, *, !, ? (unknown or no connection)
  * %l - The line number inside the current statement, starting from 1.
  * %? - the error code of the last query (not yet implemented)
@@ -223,7 +224,14 @@ get_prompt(promptStatus_t status, ConditionalStack cstack)
break;
}
break;
-
+   /* output HH24:MI:SS */
+   case 'T':
+   {
+   time_t current_time = time(NULL);
+   struct tm *tm_info = localtime(&current_time);
+   sprintf(buf, "%02d:%02d:%02d", tm_info->tm_hour, tm_info->tm_min, tm_info->tm_sec);
+   }   
+   break;
case 'x':
if (!pset.db)
buf[0] = '?';
-- 
GitLab

Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2023-02-23 Thread Tom Lane
Corey Huinker  writes:
> My not-ready-for-16 work on CAST( ... ON DEFAULT ... ) involved making
> FuncExpr/IoCoerceExpr/ArrayCoerceExpr have a safe_mode flag, and that
> necessitates adding a reserror boolean to ExprEvalStep for subsequent steps
> to test if the error happened.

Why do you want it in ExprEvalStep ... couldn't it be in ExprState?
I can't see why you'd need more than one at a time during evaluation.
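
To illustrate the point with a self-contained toy (not PostgreSQL code, just
the shape of the idea): because the steps of a single expression run strictly
one after another, one flag in the shared state is enough for a "safe" cast
step to hand its failure to a later ON DEFAULT step.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct ToyExprState
{
	bool	reserror;		/* set by a "safe" step, read by a later step */
	long	resvalue;
} ToyExprState;

/* a "safe cast" step: records failure instead of throwing */
static void
step_safe_cast(ToyExprState *state, const char *input)
{
	char   *end;
	long	val = strtol(input, &end, 10);

	if (*end != '\0')
		state->reserror = true;
	else
	{
		state->reserror = false;
		state->resvalue = val;
	}
}

/* a later step: substitutes the DEFAULT value if the cast failed */
static void
step_apply_default(ToyExprState *state, long default_value)
{
	if (state->reserror)
	{
		state->resvalue = default_value;
		state->reserror = false;
	}
}

int
main(void)
{
	ToyExprState state = {0};

	step_safe_cast(&state, "123x");		/* not a valid integer */
	step_apply_default(&state, -1);
	printf("%ld\n", state.resvalue);	/* prints -1 */
	return 0;
}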

regards, tom lane




Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-23 Thread Kirk Wolak
On Thu, Feb 23, 2023 at 1:16 PM Maciek Sakrejda 
wrote:

> On Thu, Feb 23, 2023, 09:55 Nikolay Samokhvalov 
> wrote:
>
>> On Thu, Feb 23, 2023 at 9:05 AM Maciek Sakrejda 
>> wrote:
>>
>>> I think Heikki's solution is probably more practical since (1) ..
>>
>>
>> Note that these ideas target two *different* problems:
>> - what was the duration of the last query
>> - when was the last query executed
>>
>> So, having both solved would be ideal.
>>
>
> Fair point, but since the duration solution needs to capture two
> timestamps anyway, it could print start time as well as duration.
>
> The prompt timestamp could still be handy for more intricate session
> forensics, but I don't know if that's a common-enough use case.
>
> Thanks,
> Maciek
>

It's really common during migrations, and forensics.  I often do a bunch of
stuff in 2 systems.  Then check the overlap.
Andrey brought up the value of 2 people separately working on things, being
able to go back and review when a given setting was changed, which has
happened to many of us in support sessions...

Thanks!


Re: PATCH: Using BRIN indexes for sorted output

2023-02-23 Thread Tomas Vondra
On 2/23/23 17:44, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
>  wrote:
>>
>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>> Comments on 0001, mostly comments and patch design:
> 
> One more comment:
> 
 + range->min_value = bval->bv_values[0];
 + range->max_value = bval->bv_values[1];
> 
> This produces dangling pointers for minmax indexes on by-ref types
> such as text and bytea, due to the memory context of the decoding
> tuple being reset every iteration. :-(
> 

Yeah, that sounds like a bug. Also a sign the tests should have some
by-ref data types (presumably there are none, as that would certainly
trip some asserts etc.).
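
One plausible shape of a fix, sketched under the assumption that the stats
code has a longer-lived memory context at hand (the helper and argument names
below are illustrative, not the patch's): copy by-ref datums out of the
per-tuple context with datumCopy() before that context gets reset.

#include "postgres.h"
#include "utils/datum.h"		/* datumCopy() */
#include "utils/memutils.h"

static void
range_copy_minmax(MemoryContext stats_cxt, bool typbyval, int16 typlen,
				  Datum minval, Datum maxval,
				  Datum *min_out, Datum *max_out)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(stats_cxt);

	/* for by-value types this is a plain copy; for by-ref it palloc's */
	*min_out = datumCopy(minval, typbyval, typlen);
	*max_out = datumCopy(maxval, typbyval, typlen);

	MemoryContextSwitchTo(oldcxt);
}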

 +range_values_cmp(const void *a, const void *b, void *arg)
>>>
>>> Can the arguments of these functions be modified into the types they
>>> are expected to receive? If not, could you add a comment on why that's
>>> not possible?
>>
>> The reason is that that's what qsort() expects. If you change that to
>> actual data types, you'll get compile-time warnings. I agree this may
>> need better comments, though.
> 
> Thanks in advance.
> 
 + * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
>>>
>>> Could you please expand on this? We do have GIST support for ranges, too.
>>>
>>
>> Expand in what way? This is meant to be AM-specific, so if GiST wants to
>> collect some additional stats, it's free to do so - perhaps some of the
>> ideas from the stats collected for BRIN would be applicable, but it's
>> also bound to the index structure.
> 
> I don't quite understand the flow of the comment, as I don't clearly
> see what the "BRIN for ranges" tries to refer to. In my view, that
> makes it a bad example which needs further explanation or rewriting,
> aka "expanding on".
> 

Ah, right. Yeah, the "BRIN for ranges" wording is a bit misleading. It
should really say only BRIN, but I was focused on the minmax use case,
so I mentioned the ranges.

 + * brin_minmax_stats
 + *Calculate custom statistics for a BRIN minmax index.
 + *
 + * At the moment this calculates:
 + *
 + *  - number of summarized/not-summarized and all/has nulls ranges
>>>
>>> I think statistics gathering of an index should be done at the AM
>>> level, not attribute level. The docs currently suggest that the user
>>> builds one BRIN index with 16 columns instead of 16 BRIN indexes with
>>> one column each, which would make the statistics gathering use 16x
>>> more IO if the scanned data cannot be reused.
>>>
>>
>> Why? The row sample is collected only once and used for building all the
>> index AM stats - it doesn't really matter if we analyze 16 single-column
>> indexes or 1 index with 16 columns. Yes, we'll need to scan more
>> indexes, but the with 16 columns the summaries will be larger so the
>> total amount of I/O will be almost the same I think.
>>
>> Or maybe I don't understand what I/O you're talking about?
> 
> With the proposed patch, we do O(ncols_statsenabled) scans over the
> BRIN index. Each scan reads all ncol columns of all block ranges from
> disk, so in effect the data scan does on the order of
> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> all columns have statistics enabled.
> 

I don't think that's the number of I/O operations we'll do, because we
always read the whole BRIN tuple at once. So I believe it should rather
be something like

  O(ncols_statsenabled * nranges)

assuming nranges is the number of page ranges. But even that's likely a
significant overestimate because most of the tuples will be served from
shared buffers.

Considering how tiny BRIN indexes are, this is likely orders of
magnitude less I/O than we expend on sampling rows from the table. I
mean, with the default statistics target we read ~3 pages (~240MB)
or more just to sample the rows. Randomly, while the BRIN index is
likely scanned mostly sequentially.

Maybe there are cases where this would be an issue, but I haven't seen
one when working on this patch (and I did a lot of experiments). I'd
like to see one before we start optimizing it ...

This also reminds me that the issues I actually saw (e.g. memory
consumption) would be made worse by processing all columns at once,
because then you need to keep more columns in memory.


>>> It is possible to build BRIN indexes on more than one column with more
>>> than one opclass family like `USING brin (id int8_minmax_ops, id
>>> int8_bloom_ops)`. This would mean various duplicate statistics fields,
>>> no?
>>> It seems to me that it's more useful to do the null- and n_summarized
>>> on the index level instead of duplicating that inside the opclass.
>>
>> I don't think it's worth it. The amount of data this would save is tiny,
>> and it'd only apply to cases where the index includes the same attribute
>> multiple times, and that's pretty rare I think. I don't think it's worth
>> the extra complexity.
> 
> Not necessarily, it was just an example of where we'd 

Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-02-23 Thread Stephen Frost
Greetings,

* Jacob Champion (jchamp...@timescale.com) wrote:
> On Mon, Feb 20, 2023 at 2:35 PM Stephen Frost  wrote:
> > Having skimmed back through this thread again, I still feel that the
> > direction that was originally being taken (actually support something in
> > libpq and the backend, be it with libiddawc or something else or even
> > our own code, and not just throw hooks in various places) makes a lot
> > more sense and is a lot closer to how Kerberos and client-side certs and
> > even LDAP auth work today.
> 
> Cool, that helps focus the effort. Thanks!

Great, glad to hear that.

> > That also seems like a much better answer
> > for our users when it comes to new authentication methods than having
> > extensions and making libpq developers have to write their own custom
> > code, not to mention that we'd still need to implement something in psql
> > to provide such a hook if we are to have psql actually usefully exercise
> > this, no?
> 
> I don't mind letting clients implement their own flows... as long as
> it's optional. So even if we did use a hook in the end, I agree that
> we've got to exercise it ourselves.

This really doesn't feel like a great area to try and do hooks or
similar in, not the least because that approach has been tried and tried
again (PAM, GSSAPI, SASL would all be examples..) and frankly none of
them has turned out great (which is why we can't just tell people "well,
install the pam_oauth2 and watch everything work!") and this strikes me
as trying to do that yet again but worse as it's not even a dedicated
project trying to solve the problem but more like a side project.  SCRAM
was good, we've come a long way thanks to that, this feels like it
should be more in line with that rather than trying to invent yet
another new "generic" set of hooks/APIs that will just cause DBAs and
our users headaches trying to make work.

> > In the Kerberos test suite we have today, we actually bring up a proper
> > Kerberos server, set things up, and then test end-to-end installing a
> > keytab for the server, getting a TGT, getting a service ticket, testing
> > authentication and encryption, etc.  Looking around, it seems like the
> > equivilant would perhaps be to use Glewlwyd and libiddawc or libcurl and
> > our own code to really be able to test this and show that it works and
> > that we're doing it correctly, and to let us know if we break something.
> 
> The original patchset includes a test server in Python -- a major
> advantage being that you can test the client and server independently
> of each other, since the implementation is so asymmetric. Additionally
> testing against something like Glewlwyd would be a great way to stack
> coverage. (If we *only* test against a packaged server, though, it'll
> be harder to test our stuff in the presence of malfunctions and other
> corner cases.)

Oh, that's even better- I agree entirely that having test code that can
be instructed to return specific errors so that we can test that our
code responds properly is great (and is why pgbackrest has things like
a stub'd out libpq, fake s3, GCS, and Azure servers, and more) and would
certainly want to keep that, even if we also build out a test that uses
a real server to provide integration testing with not-our-code too.

Thanks!

Stephen




Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-23 Thread Pavel Stehule
čt 23. 2. 2023 v 19:16 odesílatel Maciek Sakrejda 
napsal:

> On Thu, Feb 23, 2023, 09:55 Nikolay Samokhvalov 
> wrote:
>
>> On Thu, Feb 23, 2023 at 9:05 AM Maciek Sakrejda 
>> wrote:
>>
>>> I think Heikki's solution is probably more practical since (1) ..
>>
>>
>> Note that these ideas target two *different* problems:
>> - what was the duration of the last query
>> - when was the last query executed
>>
>> So, having both solved would be ideal.
>>
>
> Fair point, but since the duration solution needs to capture two
> timestamps anyway, it could print start time as well as duration.
>
> The prompt timestamp could still be handy for more intricate session
> forensics, but I don't know if that's a common-enough use case.
>


It is hard to say what is a common enough case, but I cannot imagine more
things than this.

small notice - bash has special support for this

Regards

Pavel


> Thanks,
> Maciek
>
>>


Re: verbose mode for pg_input_error_message?

2023-02-23 Thread Nathan Bossart
On Tue, Jan 10, 2023 at 03:41:12PM -0800, Nathan Bossart wrote:
> My vote would be to redefine the existing pg_input_error_message() function
> to return a record, but I recognize that this would inflate the patch quite
> a bit due to all the existing uses in the tests.  If this is the only
> argument against this approach, I'm happy to help with the patch.

Here's an attempt at this.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 1b92850e7b8811a1e1114a18359d01d33089c0bf Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 23 Feb 2023 10:31:24 -0800
Subject: [PATCH v2 1/1] add details to pg_input_error_message

---
 contrib/cube/expected/cube.out|  6 +-
 contrib/hstore/expected/hstore.out| 12 +--
 contrib/intarray/expected/_int.out| 10 +--
 contrib/isn/expected/isn.out  | 10 +--
 contrib/ltree/expected/ltree.out  | 20 ++---
 contrib/seg/expected/seg.out  | 16 ++--
 doc/src/sgml/func.sgml| 29 ---
 src/backend/utils/adt/misc.c  | 38 +++--
 src/include/catalog/pg_proc.dat   | 10 ++-
 src/test/regress/expected/arrays.out  |  6 +-
 src/test/regress/expected/bit.out | 30 +++
 src/test/regress/expected/boolean.out |  6 +-
 src/test/regress/expected/box.out | 12 +--
 src/test/regress/expected/char_1.out  |  6 +-
 src/test/regress/expected/date.out| 12 +--
 src/test/regress/expected/domain.out  | 24 +++---
 src/test/regress/expected/enum.out| 12 +--
 src/test/regress/expected/float4.out  |  6 +-
 src/test/regress/expected/float8.out  |  6 +-
 src/test/regress/expected/geometry.out| 12 +--
 src/test/regress/expected/inet.out| 18 ++---
 src/test/regress/expected/int2.out| 18 ++---
 src/test/regress/expected/int4.out|  6 +-
 src/test/regress/expected/int8.out|  6 +-
 src/test/regress/expected/interval.out| 12 +--
 src/test/regress/expected/json.out|  6 +-
 src/test/regress/expected/json_encoding.out   |  2 +-
 src/test/regress/expected/jsonb.out   | 12 +--
 src/test/regress/expected/jsonpath.out| 14 ++--
 src/test/regress/expected/line.out| 30 +++
 src/test/regress/expected/lseg.out|  6 +-
 src/test/regress/expected/macaddr.out | 12 +--
 src/test/regress/expected/macaddr8.out| 12 +--
 src/test/regress/expected/money.out   | 12 +--
 src/test/regress/expected/multirangetypes.out | 12 +--
 src/test/regress/expected/numeric.out | 18 ++---
 src/test/regress/expected/oid.out | 24 +++---
 src/test/regress/expected/path.out| 12 +--
 src/test/regress/expected/pg_lsn.out  |  6 +-
 src/test/regress/expected/point.out   |  6 +-
 src/test/regress/expected/polygon.out | 12 +--
 src/test/regress/expected/privileges.out  | 18 ++---
 src/test/regress/expected/rangetypes.out  | 30 +++
 src/test/regress/expected/regproc.out | 78 +--
 src/test/regress/expected/rowtypes.out| 12 +--
 src/test/regress/expected/strings.out | 18 ++---
 src/test/regress/expected/tid.out | 12 +--
 src/test/regress/expected/time.out| 12 +--
 src/test/regress/expected/timestamp.out   | 12 +--
 src/test/regress/expected/timestamptz.out | 12 +--
 src/test/regress/expected/timetz.out  | 12 +--
 src/test/regress/expected/tstypes.out | 18 ++---
 src/test/regress/expected/uuid.out|  6 +-
 src/test/regress/expected/varchar_1.out   |  6 +-
 src/test/regress/expected/xid.out | 24 +++---
 src/test/regress/expected/xml.out | 20 +++--
 56 files changed, 438 insertions(+), 391 deletions(-)

diff --git a/contrib/cube/expected/cube.out b/contrib/cube/expected/cube.out
index dc23e5ccc0..3bb42b063b 100644
--- a/contrib/cube/expected/cube.out
+++ b/contrib/cube/expected/cube.out
@@ -345,9 +345,9 @@ SELECT pg_input_is_valid('-1e-700', 'cube');
 (1 row)
 
 SELECT pg_input_error_message('-1e-700', 'cube');
-   pg_input_error_message
--
- "-1e-700" is out of range for type double precision
+  pg_input_error_message   
+---
+ ("""-1e-700"" is out of range for type double precision",,,22003)
 (1 row)
 
 --
diff --git a/contrib/hstore/expected/hstore.out b/contrib/hstore/expected/hstore.out
index d6faa91867..d58fee585e 100644
--- a/contrib/hstore/expected/hstore.out
+++ b/contrib/hstore/expected/hstore.out
@@ -266,15 +266,15 @@ select pg_input_is_valid('a=b', 'hstore');
 (1 row)
 
 select pg_input_error_message('a=b', 'hstore');
- pg_input_error_message 

Re: Ability to reference other extensions by schema in extension scripts

2023-02-23 Thread Sandro Santilli
On Mon, Feb 06, 2023 at 05:19:39AM -0500, Regina Obe wrote:
> 
> Attached is a revised version of the original patch. It is revised to
> prevent 
> 
> ALTER EXTENSION .. SET SCHEMA  if there is a dependent extension that 
> references the extension in their scripts using @extschema:extensionname@
> It also adds additional tests to verify that new feature.
> 
> In going thru the code base, I was tempted to add a new dependency type
> instead of using the existing DEPENDENCY_AUTO.  I think this would be
> cleaner, but I felt that was overstepping the area a bit, since it requires
> making changes to dependency.h and dependency.c
> 
> My main concern with using DEPENDENCY_AUTO is because it was designed for
> cases where an object can be dropped without need for CASCADE.  In this
> case, we don't want a dependent extension to be dropped if it's required is
> dropped.  However since there will already exist 
> a DEPENDENCY_NORMAL between the 2 extensions, I figure we are protected
> against that issue already.

I was thinking: how about using the "refobjsubid" to encode the
"level" of dependency on an extension ? Right now "refobjsubid" is
always 0 when the referenced object is an extension.
Could we consider subid=1 to mean the dependency is not only
on the extension but ALSO on it's schema location ?

Also: should we really allow extensions to rely on other extension
w/out fully-qualifying calls to their functions ? Or should it be
discouraged and thus forbidden ? If we wanted to forbid it we then
would not need to encode any additional dependency but rather always
forbid `ALTER EXTENSION .. SET SCHEMA` whenever the extension is
a dependency of any other extension.

On the code in the patch itself, I tried with this simple use case:

  - ext1, relocatable, exposes an ext1log(text) function

  - ext2, relocatable, exposes an ext2log(text) function
calling @extschema:ext1@.ext1log()

What is not good:

- Drop of ext1 automatically cascades to drop of ext2 without even a 
notice:

test=# create extension ext2 cascade;
NOTICE:  installing required extension "ext1"
CREATE EXTENSION
test=# drop extension ext1;
DROP EXTENSION -- no WARNING, no NOTICE, ext2 is gone

What is good:

- ext1 cannot be relocated while ext2 is loaded:

test=# create extension ext2 cascade;
NOTICE:  installing required extension "ext1"
CREATE EXTENSION
test=# alter extension ext1 set schema n1;
ERROR:  Extension can not be relocated because dependent 
extension references it's location
test=# drop extension ext2;
DROP EXTENSION
test=# alter extension ext1 set schema n1;
ALTER EXTENSION

--strk;

  Libre GIS consultant/developer
  https://strk.kbt.io/services.html




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2023-02-23 Thread Corey Huinker
On Wed, Feb 22, 2023 at 5:47 PM Andres Freund  wrote:

> Hi,
>
> On 2023-02-22 16:34:44 -0500, Tom Lane wrote:
> > I wrote:
> > > Andres Freund  writes:
> > >> Maybe it's worth sticking a StaticAssert() for the struct size
> > >> somewhere.
> >
> > > Indeed.  I thought we had one already.
> >
> > >> I'm a bit wary about that being too noisy, there are some machines
> with
> > >> odd alignment requirements. Perhaps worth restricting the assertion to
> > >> x86-64 + armv8 or such?
> >
> > > I'd put it in first and only reconsider if it shows unfixable problems.
> >
> > Now that we've got the sizeof(ExprEvalStep) under control, shouldn't
> > we do the attached?
>
> Indeed. Pushed.
>
> Let's hope there's no rarely used architecture with odd alignment rules.
>
> Greetings,
>
> Andres Freund
>
>
>
I have a question about this that may affect some of my future work.

My not-ready-for-16 work on CAST( ... ON DEFAULT ... ) involved making
FuncExpr/IoCoerceExpr/ArrayCoerceExpr have a safe_mode flag, and that
necessitates adding a reserror boolean to ExprEvalStep for subsequent steps
to test if the error happened.

Will that change be throwing some architectures over the 64 byte count?


Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-23 Thread Maciek Sakrejda
On Thu, Feb 23, 2023, 09:55 Nikolay Samokhvalov 
wrote:

> On Thu, Feb 23, 2023 at 9:05 AM Maciek Sakrejda 
> wrote:
>
>> I think Heikki's solution is probably more practical since (1) ..
>
>
> Note that these ideas target two *different* problems:
> - what was the duration of the last query
> - when was the last query executed
>
> So, having both solved would be ideal.
>

Fair point, but since the duration solution needs to capture two timestamps
anyway, it could print start time as well as duration.

The prompt timestamp could still be handy for more intricate session
forensics, but I don't know if that's a common-enough use case.

Thanks,
Maciek

>


Re: meson: Optionally disable installation of test modules

2023-02-23 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review.

On Mon, 20 Feb 2023 at 21:44, Peter Eisentraut
 wrote:
>
> I tested this a bit.  It works fine.  The approach makes sense to me.
>
> The install_additional_files script could be simplified a bit.  You
> could use os.makedirs(dest, exist_ok=True) and avoid the error checking.
>I don't think any callers try to copy a directory source, so the
> shutil.copytree() stuff isn't necessary.  Run pycodestyle over the
> script.  And let's put the script into src/tools/ like the other support
> scripts.
>

I updated the patch in line with your comments.

Regards,
Nazir Bilal Yavuz
Microsoft
From 7804aa928de7595cb89bfd7668952e1e1fe48038 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 23 Feb 2023 20:35:22 +0300
Subject: [PATCH v3] meson: prevent installation of test files during main
 install

---
 meson.build   | 32 ++-
 src/backend/meson.build   |  7 
 src/test/modules/delay_execution/meson.build  |  6 ++--
 src/test/modules/dummy_index_am/meson.build   |  9 ++
 src/test/modules/dummy_seclabel/meson.build   |  9 ++
 src/test/modules/plsample/meson.build | 10 ++
 src/test/modules/spgist_name_ops/meson.build  | 10 ++
 .../ssl_passphrase_callback/meson.build   |  6 ++--
 src/test/modules/test_bloomfilter/meson.build |  9 ++
 .../modules/test_copy_callbacks/meson.build   |  9 ++
 .../modules/test_custom_rmgrs/meson.build |  9 ++
 src/test/modules/test_ddl_deparse/meson.build |  9 ++
 src/test/modules/test_extensions/meson.build  |  4 +--
 .../modules/test_ginpostinglist/meson.build   |  9 ++
 src/test/modules/test_integerset/meson.build  |  9 ++
 src/test/modules/test_lfind/meson.build   |  9 ++
 src/test/modules/test_oat_hooks/meson.build   |  6 ++--
 src/test/modules/test_parser/meson.build  |  9 ++
 .../test_pg_db_role_setting/meson.build   |  9 ++
 src/test/modules/test_pg_dump/meson.build |  4 +--
 src/test/modules/test_predtest/meson.build|  9 ++
 src/test/modules/test_rbtree/meson.build  |  9 ++
 src/test/modules/test_regex/meson.build   |  9 ++
 src/test/modules/test_rls_hooks/meson.build   |  6 ++--
 src/test/modules/test_shm_mq/meson.build  |  9 ++
 src/test/modules/test_slru/meson.build|  9 ++
 src/test/modules/worker_spi/meson.build   |  9 ++
 src/test/regress/meson.build  | 18 ---
 src/tools/install_additional_files| 28 
 29 files changed, 139 insertions(+), 151 deletions(-)
 create mode 100644 src/tools/install_additional_files

diff --git a/meson.build b/meson.build
index f5347044526..1e4b3d5445a 100644
--- a/meson.build
+++ b/meson.build
@@ -2791,6 +2791,10 @@ backend_code = declare_dependency(
   dependencies: os_deps + backend_both_deps + backend_deps,
 )
 
+# install these files only during test, not main install
+test_install_files = []
+test_install_libs = []
+
 # src/backend/meson.build defines backend_mod_code used for extension
 # libraries.
 
@@ -2811,6 +2815,10 @@ subdir('doc/src/sgml')
 
 generated_sources_ac += {'': ['GNUmakefile']}
 
+# After processing src/test, add test_install_libs to the testprep_targets
+# to build them
+testprep_targets += test_install_libs
+
 
 # If there are any files in the source directory that we also generate in the
 # build directory, they might get preferred over the newly generated files,
@@ -2893,14 +2901,36 @@ meson_install_args = meson_args + ['install'] + {
 'muon': []
 }[meson_impl]
 
+# setup tests should  be run first,
+# so define priority for these
+setup_tests_priority = 100
 test('tmp_install',
 meson_bin, args: meson_install_args ,
 env: {'DESTDIR':test_install_destdir},
-priority: 100,
+priority: setup_tests_priority,
 timeout: 300,
 is_parallel: false,
 suite: ['setup'])
 
+# get full paths of test_install_libs to copy them
+test_install_libs_fp = []
+foreach lib: test_install_libs
+  test_install_libs_fp += lib.full_path()
+endforeach
+
+install_additional_files = files('src/tools/install_additional_files')
+test('install_additional_files',
+python, args: [
+  install_additional_files,
+  '--sharedir', test_install_location / contrib_data_args['install_dir'],
+  '--libdir', test_install_location / dir_lib_pkg,
+  '--install_files', test_install_files,
+  '--install_libs', test_install_libs_fp,
+],
+priority: setup_tests_priority,
+is_parallel: false,
+suite: ['setup'])
+
 test_result_dir = meson.build_root() / 'testrun'
 
 
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 4fdd209b826..ccfc382fcfd 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -180,12 +180,19 @@ backend_mod_code = declare_dependency(
   dependencies: backend_mod_deps,
 )
 
+# normal extension modules
 pg_mod_args = default_mod_args + {
   'dependencies': 

Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-23 Thread Nikolay Samokhvalov
On Thu, Feb 23, 2023 at 9:05 AM Maciek Sakrejda 
wrote:

> I think Heikki's solution is probably more practical since (1) ..


Note that these ideas target two *different* problems:
- what was the duration of the last query
- when was the last query executed

So, having both solved would be ideal.


Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-23 Thread Maciek Sakrejda
+1 on solving the general problem of "I forgot to set \timing--how
long did this run?". I could have used that more than once in the
past, and I'm sure it will come up again.

I think Heikki's solution is probably more practical since (1) even if
we add the prompt parameter originally proposed, I don't see it being
included in the default, so it would require users to change their
prompt before they can benefit from it and (2) even if we commit to
never allowing tweaks to the format, I foresee a slow but endless
trickle of requests and patches to do so.

Thanks,
Maciek




Re: PATCH: Using BRIN indexes for sorted output

2023-02-23 Thread Matthias van de Meent
On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
 wrote:
>
> On 2/23/23 15:19, Matthias van de Meent wrote:
>> Comments on 0001, mostly comments and patch design:

One more comment:

>>> + range->min_value = bval->bv_values[0];
>>> + range->max_value = bval->bv_values[1];

This produces dangling pointers for minmax indexes on by-ref types
such as text and bytea, due to the memory context of the decoding
tuple being reset every iteration. :-(

> >> +range_values_cmp(const void *a, const void *b, void *arg)
> >
> > Can the arguments of these functions be modified into the types they
> > are expected to receive? If not, could you add a comment on why that's
> > not possible?
>
> The reason is that that's what qsort() expects. If you change that to
> actual data types, you'll get compile-time warnings. I agree this may
> need better comments, though.

Thanks in advance.

> >> + * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
> >
> > Could you please expand on this? We do have GIST support for ranges, too.
> >
>
> Expand in what way? This is meant to be AM-specific, so if GiST wants to
> collect some additional stats, it's free to do so - perhaps some of the
> ideas from the stats collected for BRIN would be applicable, but it's
> also bound to the index structure.

I don't quite understand the flow of the comment, as I don't clearly
see what the "BRIN for ranges" tries to refer to. In my view, that
makes it a bad example which needs further explanation or rewriting,
aka "expanding on".

> >> + * brin_minmax_stats
> >> + *Calculate custom statistics for a BRIN minmax index.
> >> + *
> >> + * At the moment this calculates:
> >> + *
> >> + *  - number of summarized/not-summarized and all/has nulls ranges
> >
> > I think statistics gathering of an index should be done at the AM
> > level, not attribute level. The docs currently suggest that the user
> > builds one BRIN index with 16 columns instead of 16 BRIN indexes with
> > one column each, which would make the statistics gathering use 16x
> > more IO if the scanned data cannot be reused.
> >
>
> Why? The row sample is collected only once and used for building all the
> index AM stats - it doesn't really matter if we analyze 16 single-column
> indexes or 1 index with 16 columns. Yes, we'll need to scan more
> indexes, but the with 16 columns the summaries will be larger so the
> total amount of I/O will be almost the same I think.
>
> Or maybe I don't understand what I/O you're talking about?

With the proposed patch, we do O(ncols_statsenabled) scans over the
BRIN index. Each scan reads all ncol columns of all block ranges from
disk, so in effect the data scan does on the order of
O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
all columns have statistics enabled.

> > It is possible to build BRIN indexes on more than one column with more
> > than one opclass family like `USING brin (id int8_minmax_ops, id
> > int8_bloom_ops)`. This would mean various duplicate statistics fields,
> > no?
> > It seems to me that it's more useful to do the null- and n_summarized
> > on the index level instead of duplicating that inside the opclass.
>
> I don't think it's worth it. The amount of data this would save is tiny,
> and it'd only apply to cases where the index includes the same attribute
> multiple times, and that's pretty rare I think. I don't think it's worth
> the extra complexity.

Not necessarily, it was just an example of where we'd save IO.
Note that the current gathering method already retrieves all tuple
attribute data, so from a basic processing perspective we'd save some
time decoding as well.

> >
> > I'm planning on reviewing the other patches, and noticed that a lot of
> > the patches are marked WIP. Could you share a status on those, because
> > currently that status is unknown: Are these patches you don't plan on
> > including, or are these patches only (or mostly) included for
> > debugging?
> >
>
> I think the WIP label is a bit misleading, I used it mostly to mark
> patches that are not meant to be committed on their own. A quick overview:
>
> [...]

Thanks for the explanation, that's quite helpful. I'll see to further
reviewing 0004 and 0005 when I have additional time.

Kind regards,

Matthias van de Meent.




Re: Wrong query results caused by loss of join quals

2023-02-23 Thread Tom Lane
Richard Guo  writes:
> On Thu, Feb 23, 2023 at 4:50 AM Tom Lane  wrote:
>> Less-hasty v2 patch attached.

> I think the patch is in good shape now.

Pushed, thanks for reviewing!

regards, tom lane




Re: buildfarm + meson

2023-02-23 Thread Andres Freund
On 2023-02-23 06:27:23 -0500, Andrew Dunstan wrote:
> 
> On 2023-02-22 We 20:20, Andres Freund wrote:
> > 
> > > There is work to do to make sure we pick up the right log files, and maybe
> > > adjust a module or two. I have adopted a design where instead of trying to
> > > know a lot about the testing regime the client needs to know a lot less.
> > > Instead, it gets meson to tell it the set of tests. I will probably work 
> > > on
> > > enabling some sort of filter, but I think this makes things more
> > > future-proof. I have stuck with the design of making testing fairly
> > > fine-grained, so each suite runs separately.
> > I don't understand why you'd want to run each suite separately. Serially
> > executing the test takes way longer than doing so in parallel. Why would we
> > want to enforce that?
> > 
> > Particularly because with meson the tests log files and the failed tests can
> > directly be correlated? And it should be easy to figure out which log files
> > need to be kept, you can just skip the directories in testrun/ that contain
> > test.success.
> > 
> 
> We can revisit that later. For now I'm more concerned with getting a working
> setup.

My fear is that this ends up being entrenched in the design and hard to change
later.


> The requirements of the buildfarm are a bit different from those of a
> developer, though. Running things in parallel can make things faster, but
> that can also increase the compute load.

Sure, I'm not advocating to using a [high] concurrency by default.


> Also, running things serially makes it easier to report a failure stage that
> pinpoints the test that encountered an issue.

You're relying on running tests in a specific order. Instead you can also just
run tests in parallel and check test status in order and report the first
failed test in that order.


> But like I say we can come
> back to this.

> 
> > > On a Windows instance, fairly similar to what's running drongo, I can get 
> > > a
> > > successful build with meson+VS2019, but I'm getting an error in the
> > > regression tests, which don't like setting lc_time to 'de_DE'. Not sure
> > > what's going on there.
> > Huh, that's odd.
> 
> 
> See my reply to Michael for details

I suspect the issue might be related to this:

+   local %ENV = (PATH => $ENV{PATH}, PGUSER => $ENV{PGUSER});
+   @makeout=run_log("meson test --logbase checklog 
--print-errorlogs --no-rebuild -C $pgsql --suite setup --suite regress");

Greetings,

Andres Freund




Re: Add LZ4 compression in pg_dump

2023-02-23 Thread Tomas Vondra
Thanks for v30 with the updated commit messages. I've pushed 0001 after
fixing a comment typo and removing (I think) an unnecessary change in an
error message.

I'll give the buildfarm a bit of time before pushing 0002 and 0003.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2023-02-23 Thread Tomas Vondra
On 2/23/23 15:19, Matthias van de Meent wrote:
> On Sat, 18 Feb 2023 at 16:55, Tomas Vondra
>  wrote:
>>
>> cfbot complained there's one more place triggering a compiler warning on
>> 32-bit systems, so here's a version fixing that.
>>
>> I've also added a copy of the regression tests but using the indexam
>> stats added in 0001. This is just a copy of the already existing
>> regression tests, just with enable_indexam_stats=true - this should
>> catch some of the issues that went mostly undetected in the earlier
>> patch versions.
> 
> 
> Comments on 0001, mostly comments and patch design:
> 
>> +range_minval_cmp(const void *a, const void *b, void *arg)
>> [...]
>> +range_maxval_cmp(const void *a, const void *b, void *arg)
>> [...]
>> +range_values_cmp(const void *a, const void *b, void *arg)
> 
> Can the arguments of these functions be modified into the types they
> are expected to receive? If not, could you add a comment on why that's
> not possible?
> I don't think it's good practise to "just" use void* for arguments to
> cast them to more concrete types in the next lines without an
> immediate explanation.
> 

The reason is that that's what qsort() expects. If you change that to
actual data types, you'll get compile-time warnings. I agree this may
need better comments, though.
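
For anyone unfamiliar with the idiom, a small standalone example of why the
comparators have to take void pointers (struct and field names here are
illustrative, not the patch's; the three-argument variants in the patch match
qsort_arg(), which adds a void *arg parameter, but the idea is the same):

#include <stdio.h>
#include <stdlib.h>

typedef struct Range
{
	double		min_value;
	double		max_value;
} Range;

/* qsort() requires exactly int (*)(const void *, const void *) */
static int
range_minval_cmp(const void *a, const void *b)
{
	const Range *ra = (const Range *) a;
	const Range *rb = (const Range *) b;

	if (ra->min_value < rb->min_value)
		return -1;
	if (ra->min_value > rb->min_value)
		return 1;
	return 0;
}

int
main(void)
{
	Range		ranges[] = {{5, 9}, {1, 3}, {2, 8}};

	qsort(ranges, 3, sizeof(Range), range_minval_cmp);
	printf("%g %g %g\n", ranges[0].min_value,
		   ranges[1].min_value, ranges[2].min_value);
	return 0;
}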

>> + * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
> 
> Could you please expand on this? We do have GIST support for ranges, too.
> 

Expand in what way? This is meant to be AM-specific, so if GiST wants to
collect some additional stats, it's free to do so - perhaps some of the
ideas from the stats collected for BRIN would be applicable, but it's
also bound to the index structure.

>> + * brin_minmax_stats
>> + *Calculate custom statistics for a BRIN minmax index.
>> + *
>> + * At the moment this calculates:
>> + *
>> + *  - number of summarized/not-summarized and all/has nulls ranges
> 
> I think statistics gathering of an index should be done at the AM
> level, not attribute level. The docs currently suggest that the user
> builds one BRIN index with 16 columns instead of 16 BRIN indexes with
> one column each, which would make the statistics gathering use 16x
> more IO if the scanned data cannot be reused.
> 

Why? The row sample is collected only once and used for building all the
index AM stats - it doesn't really matter if we analyze 16 single-column
indexes or 1 index with 16 columns. Yes, we'll need to scan more
indexes, but the with 16 columns the summaries will be larger so the
total amount of I/O will be almost the same I think.

Or maybe I don't understand what I/O you're talking about?

> It is possible to build BRIN indexes on more than one column with more
> than one opclass family like `USING brin (id int8_minmax_ops, id
> int8_bloom_ops)`. This would mean various duplicate statistics fields,
> no?
> It seems to me that it's more useful to do the null- and n_summarized
> on the index level instead of duplicating that inside the opclass.

I don't think it's worth it. The amount of data this would save is tiny,
and it'd only apply to cases where the index includes the same attribute
multiple times, and that's pretty rare I think. I don't think it's worth
the extra complexity.

> 
>> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
> 
> I am not familiar with the frequency of max-sized relations, but this
> would go into an infinite loop for pagesPerRange values >1 for
> max-sized relations due to BlockNumber wraparound. I think there
> should be some additional overflow checks here.
> 

Good point, but that's a pre-existing issue. We do this same loop in a
number of places.
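
For the record, one way a loop like that could be made wraparound-safe (just a
sketch of the idea, not something from the patch) is to do the iteration
arithmetic in 64 bits and only narrow when using the value:

    /* uint64 cannot wrap for any valid BlockNumber range */
    for (uint64 blkno = 0; blkno < nblocks; blkno += pagesPerRange)
    {
        BlockNumber heapBlk = (BlockNumber) blkno;

        /* ... process the page range starting at heapBlk ... */
    }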

>> +/*
>> + * get_attstaindexam
>> + *
>> + *  Given the table and attribute number of a column, get the index AM
>> + *  statistics.  Return NULL if no data available.
>> + *
> 
> Shouldn't this be "given the index and attribute number" instead of
> "the table and attribute number"?
> I think we need to be compatible with indexes on expression here, so
> that we don't fail to create (or use) statistics for an index `USING
> brin ( (int8range(my_min_column, my_max_column, '[]'))
> range_inclusion_ops)` when we implement stats for range_inclusion_ops.
> 
>> + * Alternative to brin_minmax_match_tuples_to_ranges2, leveraging ordering
> 
> Does this function still exist?
> 

Yes, but only in 0003 - it's a "brute-force" algorithm used to
cross-check the result of the optimized algorithm in 0001. You're right
it should not be referenced in the comment.

> 
> I'm planning on reviewing the other patches, and noticed that a lot of
> the patches are marked WIP. Could you share a status on those, because
> currently that status is unknown: Are these patches you don't plan on
> including, or are these patches only (or mostly) included for
> debugging?
> 

I think the WIP label is a bit misleading; I used it mostly to mark
patches that are not meant to be committed on their own. A quick overview:


Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-23 Thread Tom Lane
Heikki Linnakangas  writes:
> On 23/02/2023 13:20, Peter Eisentraut wrote:
>> If you don't have \timing turned on before the query starts, psql won't
>> record what the time was before the query, so you can't compute the run
>> time afterwards.  This kind of feature would only work if you always
>> take the start time, even if \timing is turned off.

> Correct. That seems acceptable though? gettimeofday() can be slow on 
> some platforms, but I doubt it's *that* slow, that we couldn't call it 
> two times per query.

Yeah, you'd need to capture both the start and stop times even if
\timing isn't on, in case you get asked later.  But the backend is
going to call gettimeofday at least once per query, likely more
depending on what features you use.  And there are inherently
multiple kernel calls involved in sending a query and receiving
a response.  I tend to agree with Heikki that this overhead would
be unnoticeable.  (Of course, some investigation proving that
wouldn't be unwarranted.)

regards, tom lane




Re: PATCH: Using BRIN indexes for sorted output

2023-02-23 Thread Matthias van de Meent
On Sat, 18 Feb 2023 at 16:55, Tomas Vondra
 wrote:
>
> cfbot complained there's one more place triggering a compiler warning on
> 32-bit systems, so here's a version fixing that.
>
> I've also added a copy of the regression tests but using the indexam
> stats added in 0001. This is just a copy of the already existing
> regression tests, just with enable_indexam_stats=true - this should
> catch some of the issues that went mostly undetected in the earlier
> patch versions.


Comments on 0001, mostly comments and patch design:

> +range_minval_cmp(const void *a, const void *b, void *arg)
> [...]
> +range_maxval_cmp(const void *a, const void *b, void *arg)
> [...]
> +range_values_cmp(const void *a, const void *b, void *arg)

Can the arguments of these functions be modified into the types they
are expected to receive? If not, could you add a comment on why that's
not possible?
I don't think it's good practise to "just" use void* for arguments to
cast them to more concrete types in the next lines without an
immediate explanation.

> + * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).

Could you please expand on this? We do have GIST support for ranges, too.

> + * brin_minmax_stats
> + *Calculate custom statistics for a BRIN minmax index.
> + *
> + * At the moment this calculates:
> + *
> + *  - number of summarized/not-summarized and all/has nulls ranges

I think statistics gathering of an index should be done at the AM
level, not attribute level. The docs currently suggest that the user
builds one BRIN index with 16 columns instead of 16 BRIN indexes with
one column each, which would make the statistics gathering use 16x
more IO if the scanned data cannot be reused.

It is possible to build BRIN indexes on more than one column with more
than one opclass family like `USING brin (id int8_minmax_ops, id
int8_bloom_ops)`. This would mean various duplicate statistics fields,
no?
It seems to me that it's more useful to do the null- and n_summarized
on the index level instead of duplicating that inside the opclass.

> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)

I am not familiar with the frequency of max-sized relations, but this
would go into an infinite loop for pagesPerRange values >1 for
max-sized relations due to BlockNumber wraparound. I think there
should be some additional overflow checks here.

> +/*
> + * get_attstaindexam
> + *
> + *  Given the table and attribute number of a column, get the index AM
> + *  statistics.  Return NULL if no data available.
> + *

Shouldn't this be "given the index and attribute number" instead of
"the table and attribute number"?
I think we need to be compatible with indexes on expression here, so
that we don't fail to create (or use) statistics for an index `USING
brin ( (int8range(my_min_column, my_max_column, '[]'))
range_inclusion_ops)` when we implement stats for range_inclusion_ops.

> + * Alternative to brin_minmax_match_tuples_to_ranges2, leveraging ordering

Does this function still exist?


I'm planning on reviewing the other patches, and noticed that a lot of
the patches are marked WIP. Could you share a status on those, because
currently that status is unknown: Are these patches you don't plan on
including, or are these patches only (or mostly) included for
debugging?


Kind regards,

Matthias van de Meent




Re: Reducing connection overhead in pg_upgrade compat check phase

2023-02-23 Thread Daniel Gustafsson
> On 22 Feb 2023, at 20:20, Nathan Bossart  wrote:

> One thing I noticed is that the
> "failed check" log is only printed once, even if multiple data type checks
> failed.  I believe this is because this message uses PG_STATUS.  If I
> change it to PG_REPORT, all of the "failed check" messages appear.  TBH I'm
> not sure we need this message at all since a more detailed explanation will
> be printed afterwards.  If we do keep it around, I think it should be
> indented so that it looks more like this:
> 
>   Checking for data type usagechecking 
> all databases  
>   failed check: incompatible aclitem data type in user tables
>   failed check: reg* data types in user tables

That's a good point; that's better.  I think it makes sense to keep it around.

>> One change this brings is that check.c contains version specific checks in 
>> the
>> struct.  Previously these were mostly contained in version.c (some, like the
>> 9.4 jsonb check was in check.c) which maintained some level of separation.
>> Splitting the array init is of course one option but it also seems a tad 
>> messy.
>> Not sure what's best, but for now I've documented it in the array comment at
>> least.
> 
> Hm.  We could move check_for_aclitem_data_type_usage() and
> check_for_jsonb_9_4_usage() to version.c since those are only used for
> determining whether the check applies now.  Otherwise, IMO things are in
> roughly the right place.  I don't think it's necessary to split the array.

Will do, thanks.

--
Daniel Gustafsson





Re: Raising the SCRAM iteration count

2023-02-23 Thread Daniel Gustafsson
> On 22 Feb 2023, at 18:21, Jonathan S. Katz  wrote:
> On 2/22/23 8:39 AM, Daniel Gustafsson wrote:

>> The attached is a rebase on top of master with no other additional hacking 
>> done
>> on top of the above review comments.
> 
> Generally LGTM. I read through earlier comments (sorry I missed replying) and 
> have nothing to add or object to.

Thanks for reviewing!

In fixing the CFBot test error in the previous version I realized, through
off-list discussion, that the GUC name was badly chosen.  Incorporating the
value of another GUC in the name is a bad idea, so the attached version reverts
to "scram_iterations=<number>".  Should there ever be another SCRAM method
standardized (which seems unlikely to happen before the v17 freeze) we can
make a backwards-compatible change to "<method>:<number> | <number>"
where the latter is a default for all methods.  Internally the variable name
still contains sha_256; that part I think is fine for readability.
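
For reference, with the patch as posted the usage would look roughly like this
(the value is just an example, not a recommendation):

    -- raise the work factor for passwords hashed in this session
    SET scram_iterations = 10000;
    \password alice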

--
Daniel Gustafsson



v5-0001-Make-SCRAM-iteration-count-configurable.patch
Description: Binary data


Re: pg_rewind race condition just after promotion

2023-02-23 Thread Heikki Linnakangas

On 22/02/2023 16:00, Heikki Linnakangas wrote:

On 11/12/2022 02:01, Ian Lawrence Barwick wrote:

2021年11月9日(火) 20:31 Daniel Gustafsson :



On 14 Jul 2021, at 14:03, Aleksander Alekseev  wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
coverage for this scenario. But I don't have a strong opinion on this one.

The new status of this patch is: Ready for Committer


Heikki, do you have plans to address this patch during this CF?


Friendly reminder ping one year on; I haven't looked at this patch in
detail but going by the thread contents it seems it should be marked
"Ready for Committer"? Moved to the next CF anyway.


Here's an updated version of the patch.

I renamed the arguments to findCommonAncestorTimeline() so that the
'targetHistory' argument doesn't shadow the global 'targetHistory'
variable. No other changes, and this still looks good to me, so I'll
wait for the cfbot to run on this and commit in the next few days.


Pushed. I decided not to backpatch this, after all. We haven't really 
been treating this as a bug so far, and the patch didn't apply cleanly 
to v13 and before.


- Heikki





Re: Improving inferred query column names

2023-02-23 Thread Joe Conway

On 2/22/23 23:03, Tom Lane wrote:

Andres Freund  writes:

We could just do something like printing __. So
something like avg(reltuples / relpages) would end up as
avg_reltuples_float48div_relpages.
Whether that's worth it, or whether column name lengths would be too painful,
IDK.


I think you'd soon be hitting NAMEDATALEN limits ...




Probably an unpalatable idea, but if we did something like 
md5('avg(reltuples / relpages)') for the column name, it would be 
(reasonably) unique and deterministic. Not pretty, but possibly useful 
in some cases.
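
Purely as illustration (names made up), something along these lines:

    SELECT 'col_' || left(md5('avg(reltuples / relpages)'), 12) AS derived_name;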




--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: logical decoding and replication of sequences, take 2

2023-02-23 Thread Tomas Vondra
On 2/22/23 18:04, Jonathan S. Katz wrote:
> On 2/22/23 5:02 AM, Tomas Vondra wrote:
>>
>> On 2/22/23 03:28, Jonathan S. Katz wrote:
> 
>>> Thanks for continuing to work on this patch! I tested the latest version
>>> and have some feedback/clarifications.
>>>
>>
>> Thanks!
> 
> Also I should mention I've been testing with both async/sync logical
> replication. I didn't have any specific comments on either as it seemed
> to just work and behaviors aligned with existing expectations.
> 
> Generally it's been a good experience and it seems to be working. :) At
> this point I'm trying to understand the limitations and tripwires so we
> can guide users appropriately.
> 

Good to hear.

>> Yes, this is due to how we WAL-log sequences. We don't log individual
>> increments, but every 32nd increment and we log the "future" sequence
>> state so that after a crash/recovery we don't generate duplicates.
>>
>> So you do nextval() and it returns 1. But into WAL we record 32. And
>> there will be no WAL records until nextval reaches 32 and needs to
>> generate another batch.
>>
>> And because logical replication relies on these WAL records, it inherits
>> this batching behavior with a "jump" on recovery/failover. IMHO it's OK,
>> it works for the "logical failover" use case and if you need gapless
>> sequences then regular sequences are not an issue anyway.
>>
>> It's possible to reduce the jump a bit by reducing the batch size (from
>> 32 to 0) so that every increment is logged. But it doesn't eliminate it
>> because of rollbacks.
> 
> I generally agree. I think it's mainly something we should capture in
> the user docs that they can be a jump on the subscriber side, so people
> are not surprised.
> 
> Interestingly, in systems that tend to have higher rates of failover
> (I'm thinking of a few distributed systems), this may cause int4
> sequences to exhaust numbers slightly (marginally?) more quickly. Likely
> not too big of an issue, but something to keep in mind.
> 

IMHO the number of systems that would work fine with int4 sequences, but
where this change results in the sequences being "exhausted" too quickly,
is indistinguishable from 0. I don't think this is an issue.
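
To illustrate the batching described above (a rough sketch; the exact numbers
depend on SEQ_LOG_VALS and any sequence caching):

    CREATE SEQUENCE s;
    SELECT nextval('s');                 -- returns 1
    SELECT last_value, log_cnt FROM s;   -- log_cnt: how many values are still pre-logged

    -- After a crash/failover, or on a subscriber fed from that WAL record, the
    -- sequence resumes from the pre-logged position, so the next value "jumps"
    -- (e.g. to 34 rather than 2).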

>>> 2. Using with origin=none with nonconflicting sequences.
>>>
>>> I modified the example in [1] to set up two schemas with non-conflicting
>>> sequences[2], e.g. on instance 1:
>>>
>>> CREATE TABLE public.room (
>>>  id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1)
>>> PRIMARY KEY,
>>>  name text NOT NULL
>>> );
>>>
>>> and instance 2:
>>>
>>> CREATE TABLE public.room (
>>>  id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2)
>>> PRIMARY KEY,
>>>  name text NOT NULL
>>> );
>>>
>>
>> Well, yeah. We don't support active-active logical replication (at least
>> not with the built-in). You can easily get into similar issues without
>> sequences.
> 
> The "origin=none" feature lets you replicate tables bidirectionally.
> While it's not full "active-active", this is a starting point and a
> feature for v16. We'll definitely have users replicating data
> bidirectionally with this.
> 

Well, then the users need to use some other way to generate IDs, not
local sequences. Either some sort of distributed/global sequence, UUIDs
or something like that.
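
For instance, something like this on both nodes avoids the problem entirely:

    CREATE TABLE public.room (
        id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        name text NOT NULL
    );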

>> Replicating a sequence overwrites the state of the sequence on the other
>> side, which may result in it generating duplicate values with the other
>> node, etc.
> 
> I understand that we don't currently support global sequences, but I am
> concerned there may be a tripwire here in the origin=none case given
> it's fairly common to use serial/GENERATED BY to set primary keys. And
> it's fairly trivial to set them to be nonconflicting, or at least give
> the user the appearance that they are nonconflicting.
> 
> From my high level understand of how sequences work, this sounds like it
> would be a lift to support the example in [1]. Or maybe the answer is
> that you can bidirectionally replicate the changes in the tables, but
> not sequences?
> 

Yes, I think local sequences don't and can't work in such setups.

> In any case, we should update the restrictions in [2] to state: while
> sequences can be replicated, there is additional work required if you
> are bidirectionally replicating tables that use sequences, esp. if used
> in a PK or a constraint. We can provide alternatives to how a user could
> set that up, i.e. not replicates the sequences or do something like in [3].
> 

I agree. I see this as mostly a documentation issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Allow logical replication to copy tables in binary format

2023-02-23 Thread Jelte Fennema
> This is because copy_data is not something stored in pg_subscription
> or another catalog. But this is not an issue for copy_fornat since its
> value will be stored in the catalog. This can allow users to set the
> format even if copy_data=false and no initial sync is needed at that
> moment.

One other approach that might make sense is to expand the values that
copy_data accepts to include the value "binary" (and probably "text"
for clarity). That way people could easily choose for each sync if
they want to use binary copy, text copy or no copy. Based on your
message, this would mean that copy_format=binary would not be stored
in catalogs anymore. Does that have any bad side-effects for the
implementation?
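
With that approach the user-facing syntax might look roughly like this
(hypothetical, nothing that exists today):

    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (copy_data = 'binary');    -- or 'text' / 'off'

    ALTER SUBSCRIPTION mysub REFRESH PUBLICATION WITH (copy_data = 'binary');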




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-23 Thread Hayato Kuroda (Fujitsu)
Dear Shi,

Thank you for reviewing! PSA new version.

> + elog(DEBUG2, "time-delayed replication for txid %u, delay_time
> = %d ms, remaining wait time: %ld ms",
> +  ctx->write_xid, (int) ctx->min_send_delay,
> +  remaining_wait_time_ms);
> 
> I tried and saw that the xid here looks wrong, what it got is the xid of the
> previous transaction. It seems `ctx->write_xid` has not been updated and we
> can't use it.
>

Good catch. There are several approaches to fix that; I chose the simplest one:
the TransactionId was added as an argument to the functions.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v7-0001-Time-delayed-logical-replication-on-publisher-sid.patch
Description:  v7-0001-Time-delayed-logical-replication-on-publisher-sid.patch


Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-23 Thread Heikki Linnakangas

On 23/02/2023 13:20, Peter Eisentraut wrote:

On 22.02.23 19:14, Heikki Linnakangas wrote:

How about a new backslash command or psql variable to show how long the
previous statement took? Something like:


If you don't have \timing turned on before the query starts, psql won't
record what the time was before the query, so you can't compute the run
time afterwards.  This kind of feature would only work if you always
take the start time, even if \timing is turned off.


Correct. That seems acceptable though? gettimeofday() can be slow on 
some platforms, but I doubt it's *that* slow, that we couldn't call it 
two times per query.
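
A minimal sketch of the idea (not psql's actual code; the names are made up):

    #include "libpq-fe.h"
    #include "portability/instr_time.h"

    static double last_query_ms;    /* hypothetical store for a \timing-independent report */

    static void
    run_query_timed(PGconn *conn, const char *query)
    {
        instr_time  before, after;
        PGresult   *res;

        INSTR_TIME_SET_CURRENT(before);     /* always captured, even with \timing off */
        res = PQexec(conn, query);
        INSTR_TIME_SET_CURRENT(after);

        INSTR_TIME_SUBTRACT(after, before);
        last_query_ms = INSTR_TIME_GET_MILLISEC(after);

        PQclear(res);
    }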


- Heikki





Re: Allow logical replication to copy tables in binary format

2023-02-23 Thread Melih Mutlu
shiy.f...@fujitsu.com , 23 Şub 2023 Per, 12:29
tarihinde şunu yazdı:

> On Thu, Feb 23, 2023 12:40 PM Kuroda, Hayato/黒田 隼人 <
> kuroda.hay...@fujitsu.com> wrote:
> >
> > > > I'm not sure the combination of "copy_format = binary" and
> "copy_data =
> > false"
> > > > should be accepted or not. How do you think?
> > >
> > > It seems quite useless indeed to specify the format of a copy that
> won't
> > happen.
> >
> > I understood that the conbination of "copy_format = binary" and
> "copy_data =
> > false"
> > should be rejected in parse_subscription_options() and
> AlterSubscription(). Is it
> > right?
> > I'm expecting that is done in next version.
> >
>
> The copy_data option only takes effect once in CREATE SUBSCIPTION or ALTER
> SUBSCIPTION REFRESH PUBLICATION command, but the copy_format option can
> take
> affect multiple times if the subscription is refreshed multiple times.
> Even if
> the subscription is created with copy_date=false, copy_format can take
> affect
> when executing ALTER SUBSCIPTION REFRESH PUBLICATION. So, I am not sure we
> want
> to reject this usage.
>

I believe the places where copy_data and copy_format are needed are pretty much
the same. I couldn't think of a case where copy_format is needed but
copy_data isn't. Please let me know if I'm missing something.
CREATE SUBSCRIPTION, ALTER SUBSCRIPTION SET/ADD/REFRESH PUBLICATION are all
the places where initial sync can happen. For all these commands, copy_data
needs to be given as a parameter or it will be set to the default value
which is true. Even if copy_data=false when the sub was created, REFRESH
PUBLICATION (without an explicit copy_data parameter) will copy some tables
if needed regardless of what copy_data was in CREATE SUBSCRIPTION. This is
because copy_data is not something stored in pg_subscription or another
catalog. But this is not an issue for copy_format since its value will be
stored in the catalog. This allows users to set the format even if
copy_data=false and no initial sync is needed at that moment, so that
future initial syncs, which can be triggered by ALTER SUBSCRIPTION, will be
performed in the correct format.
So, I also think we should allow setting copy_format even if
copy_data=false.

Another way to deal with this issue could be expecting the user to specify
format every time copy_format is needed, similar to the case for copy_data,
and moving on with text (default) format if it's not specified for the
current CREATE/ALTER SUBSCRIPTION execution. But I don't think this would
make things easier.

Best,
-- 
Melih Mutlu
Microsoft


Re: pgindent vs. git whitespace check

2023-02-23 Thread Andrew Dunstan


On 2023-02-22 We 23:48, Tom Lane wrote:

For my own taste, I really don't have any objection to // in isolation --
the problem with it is just that we've got megabytes of code in the other
style.  I fear it'd look really ugly to have an intermixture of // and /*
comment styles.



Maybe, I've seen some mixing elsewhere and it didn't make me shudder. I 
agree that you probably wouldn't want to mix both styles for end of line 
comments in a single function, although a rule like that would be hard 
to enforce mechanically.




Mass conversion of /* to // style would answer that,
but would also create an impossible back-patching problem.





Yeah, I agree that's a complete non-starter.


cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: [PATCH] Add function to_oct

2023-02-23 Thread Peter Eisentraut

On 20.12.22 23:08, Eric Radman wrote:

This patch is a new function based on the implementation of to_hex(int).

Since support for octal integer literals was added, to_oct(int) allows
octal values to be easily stored and returned in query results.

   to_oct(0o755) = '755'

This is probably most useful for storing file system permissions.


Note this subsequent discussion about the to_hex function: 
https://www.postgresql.org/message-id/flat/CAEZATCVbkL1ynqpsKiTDpch34%3DSCr5nnau%3DnfNmiy2nM3SJHtw%40mail.gmail.com


Also, I think there is no "to binary" function, so perhaps if we're 
going down this road one way or the other, we should probably complete 
the set.






Re: buildfarm + meson

2023-02-23 Thread Andrew Dunstan


On 2023-02-22 We 20:20, Andres Freund wrote:



There is work to do to make sure we pick up the right log files, and maybe
adjust a module or two. I have adopted a design where instead of trying to
know a lot about the testing regime the client needs to know a lot less.
Instead, it gets meson to tell it the set of tests. I will probably work on
enabling some sort of filter, but I think this makes things more
future-proof. I have stuck with the design of making testing fairly
fine-grained, so each suite runs separately.

I don't understand why you'd want to run each suite separately. Serially
executing the test takes way longer than doing so in parallel. Why would we
want to enforce that?

Particularly because with meson the tests log files and the failed tests can
directly be correlated? And it should be easy to figure out which log files
need to be kept, you can just skip the directories in testrun/ that contain
test.success.



We can revisit that later. For now I'm more concerned with getting a 
working setup. The requirements of the buildfarm are a bit different 
from those of a developer, though. Running things in parallel can make 
things faster, but that can also increase the compute load. Also, 
running things serially makes it easier to report a failure stage that 
pinpoints the test that encountered an issue. But like I say we can come 
back to this.




On a Windows instance, fairly similar to what's running drongo, I can get a
successful build with meson+VS2019, but I'm getting an error in the
regression tests, which don't like setting lc_time to 'de_DE'. Not sure
what's going on there.

Huh, that's odd.



See my reply to Michael for details






meson apparently wants touch and cp installed, although I can't see why at
first glance. For Windows I just copied them into the path from an msys2
installation.

Those should probably be fixed.



Yeah. For touch I think we can probably just get rid of this line in the 
root meson.build:


touch = find_program('touch', native: true)

For cp there doesn't seem to be a formal requirement, but there is a
recipe in src/common/unicode/meson.build that uses it; maybe that's what
caused the failure. On Windows/msvc we could just use copy instead, I think.


I haven't experimented with any of this.


cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)

2023-02-23 Thread Peter Eisentraut

On 22.02.23 19:14, Heikki Linnakangas wrote:
How about a new backslash command or psql variable to show how long the 
previous statement took? Something like:


If you don't have \timing turned on before the query starts, psql won't 
record what the time was before the query, so you can't compute the run 
time afterwards.  This kind of feature would only work if you always 
take the start time, even if \timing is turned off.






Re: some namespace.c refactoring

2023-02-23 Thread Peter Eisentraut

On 20.02.23 15:03, Peter Eisentraut wrote:

On 15.02.23 19:04, Alvaro Herrera wrote:

That said, I think most of this code is invoked for DDL, where
performance is not so critical; probably just fixing
get_object_property_data to not be so naïve would suffice.


Ok, I'll look into that.


I did a variety of performance testing on this now.

I wrote a C function that calls the "is visible" functions in a tight loop:

Datum
pg_test_visible(PG_FUNCTION_ARGS)
{
    int32   count = PG_GETARG_INT32(0);
    Oid     relid = PG_GETARG_OID(1);
    Oid     typid = PG_GETARG_OID(2);

    for (int i = 0; i < count; i++)
    {
        RelationIsVisible(relid);
        TypeIsVisible(typid);
        //ObjectIsVisible(RelationRelationId, relid);
        //ObjectIsVisible(TypeRelationId, typid);
    }

    PG_RETURN_VOID();
}

(It's calling two different ones to defeat the caching in 
get_object_property_data().)


Here are some run times:

unpatched:

select pg_test_visible(100_000_000, 'pg_class', 'int4');
Time: 4536.747 ms (00:04.537)

select pg_test_visible(100_000_000, 'tenk1', 'widget');
Time: 10828.802 ms (00:10.829)

(Note that the "is visible" functions special case system catalogs.)

patched:

select pg_test_visible(100_000_000, 'pg_class', 'int4');
Time: 11409.948 ms (00:11.410)

select pg_test_visible(100_000_000, 'tenk1', 'widget');
Time: 18649.496 ms (00:18.649)

So, it's slower, but it's not clear whether it matters in practice, 
considering this test.


I also wondered if this is visible through a normal external function 
call, so I tried


do $$ begin perform pg_get_statisticsobjdef(28999) from 
generate_series(1, 1_000_000); end $$;


(where that is the OID of the first object from select * from 
pg_statistic_ext; in the regression database).


unpatched:

Time: 6952.259 ms (00:06.952)

patched (first patch only):

Time: 6993.655 ms (00:06.994)

patched (both patches):

Time: 7114.290 ms (00:07.114)

So there is some visible impact, but again, the test isn't realistic.

Then I tried a few ways to make get_object_property_data() faster.  I 
tried building a class_id+index cache that is qsort'ed (once) and then 
bsearch'ed, that helped only minimally, not enough to make up the 
difference.  I also tried just searching the class_id+index cache 
linearly, hoping maybe that if the cache is smaller it would be more 
efficient to access, but that actually made things (minimally) worse, 
probably because of the indirection.  So it might be hard to get much 
more out of this.  I also thought about PerfectHash, but I didn't code 
that up yet.


Another way would be to not use get_object_property_data() at all but 
write a "common" function that we pass in all it needs hardcodedly, like


bool
RelationIsVisible(Oid relid)
{
    return IsVisible_common(RELOID,
                            Anum_pg_class_relname,
                            Anum_pg_class_relnamespace);
}

This would still save a lot of duplicate code.

But again, I don't think the micro-performance really matters here.





Disable vacuuming to provide data history

2023-02-23 Thread marekmosiewicz
Hey,

It depends on the scenario, but there are many use cases where a malicious
data change by somebody with admin privileges could be a disaster.
That is the place where data history could come to help.  A basic
solution would be a trigger which writes the previous version of a record
to some other table. A trigger, however, can be disabled or removed (a crazy
solution would be to provide permanent triggers and tables which can only
ever be inserted into).
And then there is also the possibility of modifying the tablespace directly on disk.
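
A minimal sketch of that trigger-based approach (table and function names are
just examples):

    CREATE TABLE accounts (
        id      int PRIMARY KEY,
        balance numeric NOT NULL
    );

    CREATE TABLE accounts_history (
        changed_at timestamptz NOT NULL,
        changed_by text NOT NULL,
        LIKE accounts
    );

    CREATE FUNCTION accounts_keep_history() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO accounts_history SELECT now(), current_user, OLD.*;
        RETURN NULL;    -- return value is ignored for AFTER row triggers
    END;
    $$;

    CREATE TRIGGER accounts_history_trg
        AFTER UPDATE OR DELETE ON accounts
        FOR EACH ROW EXECUTE FUNCTION accounts_keep_history();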

But Postgres has the ability to not overwrite records when two concurrent
transactions modify data, in order to provide MVCC.

So what about permanently non-vacuumable tables? Adding some xid log
tables with a hash of each record chained to the hash of the previous one.
I think that would be a serious additional advantage for the best open source
relational database.

Best regards,
   Marek Mosiewicz





RE: Rework LogicalOutputPluginWriterUpdateProgress

2023-02-23 Thread Hayato Kuroda (Fujitsu)
Dear Wang,

Thank you for making the patch. IIUC, your patch basically achieves that the
output plugin does not have to call UpdateProgress.

I think the basic approach is as follows; is that right?

1. In *_cb_wrapper, set ctx->did_write to false
2. In OutputPluginWrite() set ctx->did_write to true.
   This means that changes are really written, not skipped.
3. At the end of the transaction, call update_progress_and_keepalive().
   Even if we are not at the end, check skipped count and call the function if 
needed.
   The counter will be reset if ctx->did_write is true or we exceed the 
threshold.

Followings are my comments. I apologize if I missed some previous discussions.

01. logical.c

```
+static void update_progress_and_keepalive(struct LogicalDecodingContext *ctx,
+   
  bool finished_xact);
+
+static bool is_skip_threshold_change(struct LogicalDecodingContext *ctx);
```

"struct" may be not needed.

02. UpdateDecodingProgressAndKeepalive

I think the name should be UpdateDecodingProgressAndSendKeepalive(), since "keepalive"
is not a verb.
(But it's OK to ignore this if you prefer the shorter name.)
The same can be said for the names of the datatype and callback.

03. UpdateDecodingProgressAndKeepalive

```
+   /* set output state */
+   ctx->accept_writes = false;
+   ctx->write_xid = xid;
+   ctx->write_location = lsn;
+   ctx->did_write = false;
```

Do we have to modify accept_writes, write_xid, and write_location here?
These values are not used in WalSndUpdateProgressAndKeepalive().

04. stream_abort_cb_wrapper

```
+   update_progress_and_keepalive(ctx, true)
```

I'm not sure, but is it correct to call update_progress_and_keepalive() with
finished_xact = true? Isn't there a possibility that a streamed sub-transaction
is aborted?


05. is_skip_threshold_change

At the end of the transaction, update_progress_and_keepalive() is called 
directly.
Don't we have to reset change_count here?

06. ReorderBufferAbort

Assume that the top transaction is aborted. At that time update_progress_and_keepalive()
is called in stream_abort_cb_wrapper(), and then WalSndUpdateProgressAndKeepalive()
is called at the end of ReorderBufferAbort(). Do we have to do it in both places?
I think stream_abort_cb_wrapper() may not be needed.

07. WalSndUpdateProgress

You renamed ProcessPendingWrites() to WalSndSendPending(), but it may still be
strange because it will be called even if there are no pending writes.

Isn't it sufficient to call ProcessRepliesIfAny(), WalSndCheckTimeOut() and
(at least) WalSndKeepaliveIfNecessary() in that case? Or a better name may be
needed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: buildfarm + meson

2023-02-23 Thread Andrew Dunstan


On 2023-02-22 We 19:23, Michael Paquier wrote:



On a Windows instance, fairly similar to what's running drongo, I can get a
successful build with meson+VS2019, but I'm getting an error in the
regression tests, which don't like setting lc_time to 'de_DE'. Not sure
what's going on there.

What's the regression issue?  Some text-field ordering that ought to
be enforced with a C collation?



Here's the diff


diff -w -U3 
C:/prog/bf/buildroot/HEAD/pgsql/src/test/regress/expected/collate.windows.win1252.out
 
C:/prog/bf/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/collate.windows.win1252.out
--- 
C:/prog/bf/buildroot/HEAD/pgsql/src/test/regress/expected/collate.windows.win1252.out
    2023-02-22 16:32:03.762370300 +
+++ 
C:/prog/bf/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/collate.windows.win1252.out
    2023-02-22 22:54:59.281395200 +
@@ -363,16 +363,17 @@
 
 -- to_char

 SET lc_time TO 'de_DE';
+ERROR:  invalid value for parameter "lc_time": "de_DE"
 SELECT to_char(date '2010-03-01', 'DD TMMON ');
    to_char
 -
- 01 MRZ 2010
+ 01 MAR 2010
 (1 row)
 
 SELECT to_char(date '2010-03-01', 'DD TMMON ' COLLATE "de_DE");

    to_char
 -
- 01 MRZ 2010
+ 01 MAR 2010
 (1 row)
 
 -- to_date


cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-02-23 Thread John Naylor
I ran a couple "in situ" tests on server hardware using UUID columns, since
they are common in the real world and have bad correlation to heap
order, so are a challenge for index vacuum.

=== test 1, delete everything from a small table, with very small
maintenance_work_mem:

alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;

-- unrealistically low
alter system set maintenance_work_mem = '32MB';

create table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from
generate_series(1,50*1000*1000);
create index on test (x);

delete from test;
vacuum (verbose, truncate off) test;
--

master:
INFO:  finished vacuuming "john.naylor.public.test": index scans: 9
system usage: CPU: user: 70.04 s, system: 19.85 s, elapsed: 802.06 s

v29 patch:
INFO:  finished vacuuming "john.naylor.public.test": index scans: 1
system usage: CPU: user: 9.80 s, system: 2.62 s, elapsed: 36.68 s

This is a bit artificial, but it's easy to construct cases where the array
leads to multiple index scans but the new tid store can fit everythin
without breaking a sweat. I didn't save the progress reporting, but v29 was
using about 11MB for tid storage.


=== test 2: try to stress tid lookup with production maintenance_work_mem:
1. use unlogged table to reduce noise
2. vacuum freeze first to reduce heap scan time
3. delete some records at the beginning and end of heap to defeat binary
search's pre-check

alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;

alter system set maintenance_work_mem = '1GB';

create unlogged table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from
generate_series(1,1000*1000*1000);
vacuum freeze test;

select pg_size_pretty(pg_table_size('test'));
 pg_size_pretty

 41 GB

create index on test (x);

select pg_size_pretty(pg_total_relation_size('test'));
 pg_size_pretty

 71 GB

select max(ctid) from test;
 max
--
 (5405405,75)

delete from test where ctid < '(100000,0)'::tid;
delete from test where ctid > '(5300000,0)'::tid;

vacuum (verbose, truncate off) test;

both:
INFO:  vacuuming "john.naylor.public.test"
INFO:  finished vacuuming "john.naylor.public.test": index scans: 1
index scan needed: 205406 pages from table (3.80% of total) had 38000000
dead item identifiers removed

--
master:
system usage: CPU: user: 134.32 s, system: 19.24 s, elapsed: 286.14 s

v29 patch:
system usage: CPU: user:  97.71 s, system: 45.78 s, elapsed: 573.94 s

The entire vacuum took 25% less wall clock time. Reminder that this is
without wal logging, and also unscientific because only one run.

--
I took 10 seconds of perf data while index vacuuming was going on (showing
calls > 2%):

master:
  40.59%  postgres  postgres[.] vac_cmp_itemptr
  24.97%  postgres  libc-2.17.so[.] bsearch
   6.67%  postgres  postgres[.] btvacuumpage
   4.61%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
   3.48%  postgres  postgres[.] PageIndexMultiDelete
   2.67%  postgres  postgres[.] vac_tid_reaped
   2.03%  postgres  postgres[.] compactify_tuples
   2.01%  postgres  libc-2.17.so[.] __memcpy_ssse3_back

v29 patch:

  29.22%  postgres  postgres[.] TidStoreIsMember
   9.30%  postgres  postgres[.] btvacuumpage
   7.76%  postgres  postgres[.] PageIndexMultiDelete
   6.31%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
   5.60%  postgres  postgres[.] compactify_tuples
   4.26%  postgres  libc-2.17.so[.] __memcpy_ssse3_back
   4.12%  postgres  postgres[.] hash_search_with_hash_value

--
master:
psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuples,
num_dead_tuples from pg_stat_progress_vacuum"
       phase       | heap_blks_total | heap_blks_scanned | max_dead_tuples | num_dead_tuples
-------------------+-----------------+-------------------+-----------------+-----------------
 vacuuming indexes |         5405406 |           5405406 |       178956969 |        38000000

v29 patch:
psql  -c "select phase, heap_blks_total, heap_blks_scanned,
max_dead_tuple_bytes, dead_tuple_bytes from pg_stat_progress_vacuum"
       phase       | heap_blks_total | heap_blks_scanned | max_dead_tuple_bytes | dead_tuple_bytes
-------------------+-----------------+-------------------+----------------------+------------------
 vacuuming indexes |         5405406 |           5405406 |           1073670144 |          8678064

Here, the old array pessimistically needs 1GB allocated (as for any table >
~5GB), but only fills 228MB for tid lookup. The patch reports 8.7MB. Tables
that only fit, say, 30-50 tuples per page will have less extreme

Re: Wrong query results caused by loss of join quals

2023-02-23 Thread Richard Guo
On Thu, Feb 23, 2023 at 4:50 AM Tom Lane  wrote:

> I thought about this and decided that it's not really a problem.
> have_relevant_joinclause is just a heuristic, and I don't think we
> need to prioritize forming a join if the only relevant clauses look
> like this.  We won't be able to use such clauses for merge or hash,
> so we're going to end up with an unconstrained nestloop, which isn't
> something to be eager to form.  The join ordering rules will take
> care of forcing us to make the join when necessary.


Agreed.  And as I tried, in lots of cases joins with such clauses would
be accepted by have_join_order_restriction(), which always appears with
have_relevant_joinclause().


> The only easy improvement I can see to make here is to apply the old
> rules at inner joins.  Maybe it's worth complicating the data structures
> to be smarter at outer joins, but I rather doubt it: we could easily
> expend more overhead than we'll save here by examining irrelevant ECs.
> In any case, if there is a useful optimization here, it can be pursued
> later.


This makes sense.


> I changed it anyway after noting that (a) passing in the ojrelid is
> needful to be able to distinguish inner and outer joins, and
> (b) the existing comment about the join_relids input is now wrong.
> Even if it happens to not be borked for current callers, that seems
> like a mighty fragile assumption.


Agreed. This is reasonable.


> Less-hasty v2 patch attached.


I think the patch is in good shape now.

Thanks
Richard


RE: Allow logical replication to copy tables in binary format

2023-02-23 Thread shiy.f...@fujitsu.com
On Thu, Feb 23, 2023 12:40 PM Kuroda, Hayato/黒田 隼人  
wrote:
> 
> > > I'm not sure the combination of "copy_format = binary" and "copy_data =
> false"
> > > should be accepted or not. How do you think?
> >
> > It seems quite useless indeed to specify the format of a copy that won't
> happen.
> 
> I understood that the conbination of "copy_format = binary" and "copy_data =
> false"
> should be rejected in parse_subscription_options() and AlterSubscription(). 
> Is it
> right?
> I'm expecting that is done in next version.
> 

The copy_data option only takes effect once, in the CREATE SUBSCRIPTION or ALTER
SUBSCRIPTION REFRESH PUBLICATION command, but the copy_format option can take
effect multiple times if the subscription is refreshed multiple times. Even if
the subscription is created with copy_data=false, copy_format can take effect
when executing ALTER SUBSCRIPTION REFRESH PUBLICATION. So, I am not sure we want
to reject this usage.

Besides, here are my comments on the v9 patch.
1.
src/bin/pg_dump/pg_dump.c
if (fout->remoteVersion >= 16)
-   appendPQExpBufferStr(query, " s.suborigin\n");
+   {
+   appendPQExpBufferStr(query, " s.suborigin,\n");
+   appendPQExpBufferStr(query, " s.subcopyformat\n");
+   }
else
-   appendPQExpBuffer(query, " '%s' AS suborigin\n", 
LOGICALREP_ORIGIN_ANY);
+   {
+   appendPQExpBuffer(query, " '%s' AS suborigin,\n", 
LOGICALREP_ORIGIN_ANY);
+   appendPQExpBuffer(query, " '%c' AS subcopyformat\n", 
LOGICALREP_COPY_AS_TEXT);
+   }

src/bin/psql/describe.c
if (pset.sversion >= 16)
+   {
appendPQExpBuffer(,
  ", suborigin AS 
\"%s\"\n",
  
gettext_noop("Origin"));
+   /* Copy format is only supported in v16 and higher */
+   appendPQExpBuffer(,
+ ", subcopyformat AS 
\"%s\"\n",
+ gettext_noop("Copy 
Format"));
+   }

I think we can call only once appendPQExpBuffer() for the two options which are 
supported in v16.
For example,
if (pset.sversion >= 16)
{
appendPQExpBuffer(,
  ", suborigin AS 
\"%s\"\n"
  ", subcopyformat AS 
\"%s\"\n",
  
gettext_noop("Origin"),
  gettext_noop("Copy 
Format"));
}

2.
src/bin/psql/tab-complete.c
@@ -1926,7 +1926,7 @@ psql_completion(const char *text, int start, int end)
/* ALTER SUBSCRIPTION  SET ( */
else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && 
TailMatches("SET", "("))
COMPLETE_WITH("binary", "disable_on_error", "origin", 
"slot_name",
- "streaming", "synchronous_commit");
+ "streaming", "synchronous_commit", 
"copy_format");
/* ALTER SUBSCRIPTION  SKIP ( */
else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && 
TailMatches("SKIP", "("))
COMPLETE_WITH("lsn");
@@ -3269,7 +3269,8 @@ psql_completion(const char *text, int start, int end)
else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", 
"("))
COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
  "disable_on_error", "enabled", 
"origin", "slot_name",
- "streaming", "synchronous_commit", 
"two_phase");
+ "streaming", "synchronous_commit", 
"two_phase",
+ "copy_format");


The options should be listed in alphabetical order. See commit d547f7cf5ef.

Regards,
Shi Yu


RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-23 Thread shiy.f...@fujitsu.com
On Wed, Feb 22, 2023 9:48 PM Kuroda, Hayato/黒田 隼人  
wrote:
> 
> Thank you for reviewing! PSA new version.
> 

Thanks for your patch. Here is a comment.

+   elog(DEBUG2, "time-delayed replication for txid %u, delay_time 
= %d ms, remaining wait time: %ld ms",
+ctx->write_xid, (int) ctx->min_send_delay,
+remaining_wait_time_ms);

I tried and saw that the xid here looks wrong, what it got is the xid of the
previous transaction. It seems `ctx->write_xid` has not been updated and we
can't use it.

Regards,
Shi Yu


Re: pg_regress: Treat child process failure as test failure

2023-02-23 Thread Daniel Gustafsson
> On 22 Feb 2023, at 21:55, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>>> On 22 Feb 2023, at 21:33, Andres Freund  wrote:
>>> On 2023-02-22 15:10:11 +0100, Daniel Gustafsson wrote:
 Rebased patch to handle breakage of v2 due to bd8d453e9b.
> 
>>> I think we probably should just apply this? The current behaviour doesn't 
>>> seem
>>> right, and I don't see a downside of the new behaviour?
> 
>> Agreed, I can't think of a regression test where we wouldn't want this.  My
>> only concern was if any of the ECPG tests were doing something odd that would
>> break from this but I can't see anything.
> 
> +1.  I was a bit surprised to realize that we might not count such
> a case as a failure.

Done that way, thanks!

--
Daniel Gustafsson





Re: pg_stat_statements and "IN" conditions

2023-02-23 Thread David Geier

Hi,


Seems like supporting only constants is a good starting
point. The only thing that is likely confusing for users is that NUMERICs
(and potentially constants of other types) are unsupported. Wouldn't it be
fairly simple to support them via something like the following?

     is_const(element) || (is_coercion(element) && is_const(element->child))

It definitely makes sense to implement that, although I don't think it's
going to be acceptable to do that via directly listing conditions an
element has to satisfy. It probably has to be more flexible, since we
would like to extend it in the future. My plan is to address this in a
follow-up patch, when the main mechanism is approved. Would you agree
with this approach?


I still think it's counterintuitive, and I'm pretty sure people would
even report this as a bug because, not knowing about the difference in
internal representation, you would expect NUMERICs to work the same way
other constants do. If anything, we would have to document it.


Can't we do something pragmatic and have something like 
IsMergableInElement() which for now only supports constants and later 
can be extended? Or what exactly do you mean by "more flexible"?
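
To make that concrete, a rough sketch of such a helper (the node-type checks
here are my assumptions, not what the patch actually does):

    static bool
    IsMergeableInElement(Node *element)
    {
        /* a plain constant */
        if (IsA(element, Const))
            return true;

        /* a constant under a binary-compatible cast */
        if (IsA(element, RelabelType))
            return IsA(((RelabelType *) element)->arg, Const);

        /* e.g. a NUMERIC literal represented as a coercion over a constant */
        if (IsA(element, FuncExpr))
        {
            FuncExpr   *fe = (FuncExpr *) element;

            return (fe->funcformat == COERCE_IMPLICIT_CAST ||
                    fe->funcformat == COERCE_EXPLICIT_CAST) &&
                   list_length(fe->args) == 1 &&
                   IsA(linitial(fe->args), Const);
        }

        return false;
    }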


--
David Geier
(ServiceNow)





Re: pgindent vs. git whitespace check

2023-02-23 Thread Daniel Gustafsson
> On 23 Feb 2023, at 05:48, Tom Lane  wrote:

> For my own taste, I really don't have any objection to // in isolation --
> the problem with it is just that we've got megabytes of code in the other
> style.  I fear it'd look really ugly to have an intermixture of // and /*
> comment styles.  

We could use the "use the style of surrounding code (comments)" approach - when
changing an existing commented function use the style already present; when
adding a net new function a choice can be made (unless we mandate a style).  It
will still look ugly, but it will be less bad than mixing within the same
block.

> Mass conversion of /* to // style would answer that,
> but would also create an impossible back-patching problem.

Yeah, that sounds incredibly invasive.

--
Daniel Gustafsson





Re: [PATCH] Add pretty-printed XML output option

2023-02-23 Thread Jim Jones

On 23.02.23 08:51, Peter Eisentraut wrote:

In kwlist.h you have

    PG_KEYWORD("indent", INDENT, UNRESERVED_KEYWORD, AS_LABEL)

but you can actually make it BARE_LABEL, which is preferable.

More importantly, you need to add the new keyword to the 
bare_label_keyword production in gram.y.  I thought we had some 
tooling in the build system to catch this kind of omission, but it's 
apparently not working right now.

Entry in kwlist.h changed to BARE_LABEL.


Elsewhere, let's rename the xmlformat() C function to xmlserialize() 
(or maybe something like xmlserialize_indent()), so the association is 
clearer.


xmlserialize_indent sounds much better and makes the association indeed 
clearer. Changed in v19.
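
For anyone following along, the feature is exercised like this (per the syntax
in the patch):

    SELECT xmlserialize(DOCUMENT '<foo><bar><val x="y">42</val></bar></foo>'::xml
                        AS text INDENT);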


v19 attached.

Thanks for the review!

Best, Jim
From ed1e4a9fc94a6b65a9be6b125ae5fa8af1aa9d68 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Mon, 20 Feb 2023 23:35:22 +0100
Subject: [PATCH v19] Add pretty-printed XML output option

This patch implements the XML/SQL:2011 feature 'X069, XMLSERIALIZE: INDENT.'
It adds the options INDENT and NO INDENT (default) to the existing
xmlserialize function. It uses the indentation feature of xmlDocDumpFormatMemory
from libxml2 to format XML strings. Although the INDENT feature is designed
to work with xml strings of type DOCUMENT, this implementation also allows
the usage of CONTENT type strings as long as it contains a well-formed xml -
note the XMLOPTION_DOCUMENT in the xml_parse call.

This patch also includes documentation, regression tests and their three
possible output files xml.out, xml_1.out and xml_2.out.
---
 doc/src/sgml/datatype.sgml|   8 +-
 src/backend/catalog/sql_features.txt  |   2 +-
 src/backend/executor/execExprInterp.c |  11 ++-
 src/backend/parser/gram.y |  13 +++-
 src/backend/parser/parse_expr.c   |   1 +
 src/backend/utils/adt/xml.c   |  41 ++
 src/include/nodes/parsenodes.h|   1 +
 src/include/nodes/primnodes.h |   1 +
 src/include/parser/kwlist.h   |   1 +
 src/include/utils/xml.h   |   1 +
 src/test/regress/expected/xml.out | 106 ++
 src/test/regress/expected/xml_1.out   |  74 ++
 src/test/regress/expected/xml_2.out   | 106 ++
 src/test/regress/sql/xml.sql  |  26 +++
 14 files changed, 385 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 467b49b199..53d59662b9 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4460,14 +4460,18 @@ xml 'bar'
 xml, uses the function
 xmlserialize:xmlserialize
 
-XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type )
+XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type [ [NO] INDENT ] )
 
 type can be
 character, character varying, or
 text (or an alias for one of those).  Again, according
 to the SQL standard, this is the only way to convert between type
 xml and character types, but PostgreSQL also allows
-you to simply cast the value.
+you to simply cast the value. The option INDENT allows to
+indent the serialized xml output - the default is NO INDENT.
+It is designed to indent XML strings of type DOCUMENT, but it can also
+   be used with CONTENT as long as value
+   contains a well-formed XML.

 

diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index 3766762ae3..2e196faeeb 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -619,7 +619,7 @@ X061	XMLParse: character string input and DOCUMENT option			YES
 X065	XMLParse: binary string input and CONTENT option			NO	
 X066	XMLParse: binary string input and DOCUMENT option			NO	
 X068	XMLSerialize: BOM			NO	
-X069	XMLSerialize: INDENT			NO	
+X069	XMLSerialize: INDENT			YES	
 X070	XMLSerialize: character string serialization and CONTENT option			YES	
 X071	XMLSerialize: character string serialization and DOCUMENT option			YES	
 X072	XMLSerialize: character string serialization			YES	
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 19351fe34b..0339862267 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -3829,6 +3829,8 @@ ExecEvalXmlExpr(ExprState *state, ExprEvalStep *op)
 			{
 Datum	   *argvalue = op->d.xmlexpr.argvalue;
 bool	   *argnull = op->d.xmlexpr.argnull;
+bool	indent = op->d.xmlexpr.xexpr->indent;
+text	   *data;
 
 /* argument type is known to be xml */
 Assert(list_length(xexpr->args) == 1);
@@ -3837,9 +3839,14 @@ ExecEvalXmlExpr(ExprState *state, ExprEvalStep *op)
 	return;
 value = argvalue[0];
 
-*op->resvalue = PointerGetDatum(xmltotext_with_xmloption(DatumGetXmlP(value),
-		 xexpr->xmloption));
 *op->resnull = false;
+
+data = xmltotext_with_xmloption(DatumGetXmlP(value),
+xexpr->xmloption);
+